mirror of https://github.com/trycua/computer.git (synced 2026-01-04 12:30:08 -06:00)

Commit: Add oss blogposts

New file: blog/app-use.md (239 lines)

# App-Use: Control Individual Applications with Cua Agents

*Published on May 31, 2025 by The Cua Team*

Today, we are excited to introduce a new experimental feature landing in the [Cua GitHub repository](https://github.com/trycua/cua): **App-Use**. App-Use allows you to create lightweight virtual desktops that limit agent access to specific applications, improving the precision of your agent's trajectory. It is ideal for parallel workflows and focused task execution.

> **Note:** App-Use is currently experimental. To use it, enable it by passing the `experiments=["app-use"]` feature flag when creating your `Computer` instance.

Check out an example of a Cua Agent automating the Cua team's Taco Bell order through the iPhone Mirroring app:

<video width="100%" controls>
  <source src="/demo_app_use.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## What is App-Use?

App-Use lets you create virtual desktop sessions scoped to specific applications. Instead of giving an agent access to your entire screen, you can say "only work with Safari and Notes" or "just control the iPhone Mirroring app."

```python
# Create a macOS VM with the App-Use experimental feature enabled
computer = Computer(experiments=["app-use"])

# Create a desktop limited to specific apps
desktop = computer.create_desktop_from_apps(["Safari", "Notes"])

# Your agent can now only see and interact with these apps
agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",
    tools=[desktop]
)
```

## Key Benefits

### 1. Lightweight and Fast

App-Use creates visual filters, not new processes. Your apps continue running normally - we just control what the agent can see and click on. The virtual desktops are composited views that require no additional compute resources beyond the existing window manager operations.

### 2. Run Multiple Agents in Parallel

Deploy a team of specialized agents, each focused on their own apps:

```python
# Create a Computer with App-Use enabled
computer = Computer(experiments=["app-use"])

# Research agent focuses on the browser
research_desktop = computer.create_desktop_from_apps(["Safari"])
research_agent = ComputerAgent(tools=[research_desktop], ...)

# Writing agent focuses on documents
writing_desktop = computer.create_desktop_from_apps(["Pages", "Notes"])
writing_agent = ComputerAgent(tools=[writing_desktop], ...)

async def run_agent(agent, task):
    async for result in agent.run(task):
        print(result.get('text', ''))

# Run both simultaneously
await asyncio.gather(
    run_agent(research_agent, "Research AI trends for 2025"),
    run_agent(writing_agent, "Draft blog post outline")
)
```

## How To: Getting Started with App-Use

### Requirements

To get started with App-Use, you'll need:

- Python 3.11+
- macOS Sequoia (15.0) or later

### Getting Started

```bash
# Install packages and launch the UI
pip install -U "cua-computer[all]" "cua-agent[all]"
python -m agent.ui.gradio.app
```

```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def main():
    # Enable the App-Use experiment so desktops can be scoped to apps
    computer = Computer(experiments=["app-use"])
    await computer.run()

    # Create an app-specific desktop session
    desktop = computer.create_desktop_from_apps(["Notes"])

    # Initialize an agent
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[desktop]
    )

    # Take a screenshot (returns bytes by default)
    screenshot = await desktop.interface.screenshot()
    with open("app_screenshot.png", "wb") as f:
        f.write(screenshot)

    # Run an agent task
    async for result in agent.run("Create a new note titled 'Meeting Notes' and add today's agenda items"):
        print(f"Agent: {result.get('text', '')}")

if __name__ == "__main__":
    asyncio.run(main())
```

## Use Case: Automating Your iPhone with Cua

### ⚠️ Important Warning

Computer-use agents are powerful tools that can interact with your devices. This guide involves using your own Mac and iPhone instead of a VM. **Proceed at your own risk.** Always:

- Review agent actions before running
- Start with non-critical tasks
- Monitor agent behavior closely

Remember that with Cua it is still advisable to use a VM for better isolation of your agents.

### Setting Up iPhone Automation

### Step 1: Start the cua-computer-server

First, you'll need to start the cua-computer-server locally to enable access to iPhone Mirroring via the Computer interface:

```bash
# Install the server
pip install cua-computer-server

# Start the server
python -m computer_server
```

### Step 2: Connect iPhone Mirroring

Then, open the "iPhone Mirroring" app on your Mac and connect it to your iPhone.

### Step 3: Create an iPhone Automation Session

Finally, you can create an iPhone automation session:

```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def automate_iphone():
    # Connect to your local computer server
    my_mac = Computer(use_host_computer_server=True, os_type="macos", experiments=["app-use"])
    await my_mac.run()

    # Create a desktop focused on iPhone Mirroring
    my_iphone = my_mac.create_desktop_from_apps(["iPhone Mirroring"])

    # Initialize an agent for iPhone automation
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[my_iphone]
    )

    # Example: Send a message
    async for result in agent.run("Open Messages and send 'Hello from Cua!' to John"):
        print(f"Agent: {result.get('text', '')}")

    # Example: Set a reminder
    async for result in agent.run("Create a reminder to call mom at 5 PM today"):
        print(f"Agent: {result.get('text', '')}")

if __name__ == "__main__":
    asyncio.run(automate_iphone())
```

### iPhone Automation Use Cases

With Cua's iPhone automation, you can:

- **Automate messaging**: Send texts, respond to messages, manage conversations
- **Control apps**: Navigate any iPhone app using natural language
- **Manage settings**: Adjust iPhone settings programmatically
- **Extract data**: Read information from apps that don't have APIs
- **Test iOS apps**: Automate testing workflows for iPhone applications

## Important Notes

- **Visual isolation only**: Apps share the same files, OS resources, and user session
- **Dynamic resolution**: Desktops automatically scale to fit app windows and menu bars
- **macOS only**: Currently requires macOS due to compositing engine dependencies
- **Not a security boundary**: This is for agent focus, not security isolation

## When to Use What: App-Use vs Multiple Cua Containers

### Use App-Use within the same macOS Cua Container:

- ✅ You need lightweight, fast agent focusing (macOS only)
- ✅ You want to run multiple agents on one desktop
- ✅ You're automating personal devices like iPhones
- ✅ Window layout isolation is sufficient
- ✅ You want low computational overhead

### Use Multiple Cua Containers:

- ✅ You need maximum isolation between agents
- ✅ You require cross-platform support (Mac/Linux/Windows)
- ✅ You need guaranteed resource allocation
- ✅ Security and complete isolation are critical
- ⚠️ Note: Most computationally expensive option (see the sketch after this list)
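
For contrast, here is a hypothetical sketch of the multi-container approach. It assumes that each `Computer()` call provisions its own macOS VM and that `ComputerAgent` accepts a full computer in `tools` just as it accepts a scoped desktop; adjust to your actual setup.

```python
import asyncio

from computer import Computer
from agent import ComputerAgent

async def run_in_own_container(task: str):
    vm = Computer()  # each instance gets its own VM rather than a shared desktop
    await vm.run()
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[vm]
    )
    async for result in agent.run(task):
        print(result.get("text", ""))

async def main():
    # Two fully isolated agents, each in its own container
    await asyncio.gather(
        run_in_own_container("Research AI trends for 2025"),
        run_in_own_container("Draft a blog post outline")
    )

if __name__ == "__main__":
    asyncio.run(main())
```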

## Pro Tips

1. **Start Small**: Test with one app before creating complex multi-app desktops
2. **Screenshot First**: Take a screenshot to verify your desktop shows the right apps
3. **Name Your Apps Correctly**: Use exact app names as they appear in the system
4. **Consider Performance**: While lightweight, too many parallel agents can still impact system performance
5. **Plan Your Workflows**: Design agent tasks to minimize app switching for best results

## How It Works

When you create a desktop session with `create_desktop_from_apps()`, App-Use:

- Filters the visual output to show only specified application windows
- Routes input events only to those applications
- Maintains window layout isolation between different sessions
- Shares the underlying file system and OS resources
- **Dynamically adjusts resolution** to fit the window layout and menu bar items

The resolution of these virtual desktops is dynamic, automatically scaling to accommodate the applications' window sizes and menu bar requirements. This ensures that agents always have a clear view of the entire interface they need to interact with, regardless of the specific app combination.

Currently, App-Use is limited to macOS due to its reliance on Quartz, Apple's powerful compositing engine, for creating these virtual desktops. Quartz provides the low-level window management and rendering capabilities that make it possible to composite multiple application windows into isolated visual environments.
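
To see the dynamic resolution in action, here is a minimal sketch that compares the composited size of two differently scoped desktops. It assumes Pillow is installed for reading the PNG dimensions; the App-Use calls are the same ones shown earlier in this post.

```python
import asyncio
from io import BytesIO

from PIL import Image  # assumption: Pillow is available for decoding the PNG bytes

from computer import Computer

async def inspect_desktop_resolutions():
    computer = Computer(experiments=["app-use"])
    await computer.run()

    # Two scoped desktops composited from the same running VM
    desktops = {
        "Safari": computer.create_desktop_from_apps(["Safari"]),
        "Notes + Pages": computer.create_desktop_from_apps(["Notes", "Pages"]),
    }

    for name, desktop in desktops.items():
        png_bytes = await desktop.interface.screenshot()
        width, height = Image.open(BytesIO(png_bytes)).size
        print(f"{name} desktop is composited at {width}x{height}")

if __name__ == "__main__":
    asyncio.run(inspect_desktop_resolutions())
```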

## Conclusion

App-Use brings a new dimension to computer automation - lightweight, focused, and parallel. Whether you're building a personal iPhone assistant or orchestrating a team of specialized agents, App-Use provides the perfect balance of functionality and efficiency.

Ready to try it? Update to the latest Cua version and start focusing your agents today!

```bash
pip install -U "cua-computer[all]" "cua-agent[all]"
```

Happy automating! 🎯🤖

Binary assets added under blog/assets/: composite-agents.png (3.2 MiB), demo_app_use.mp4, demo_wsb.mp4, docker-ubuntu-support.png (2.0 MiB), hack-the-north.png (2.3 MiB), hud-agent-evals.png (2.1 MiB), human-in-the-loop.mp4, launch-video-cua-cloud.mp4, playground_web_ui_sdk_sample.mp4, trajectory-viewer.jpeg (1021 KiB).

New file: blog/bringing-computer-use-to-the-web.md (353 lines)

# Bringing Computer-Use to the Web

*Published on August 5, 2025 by Morgan Dean*

In one of our original posts, we explored building Computer-Use Operators on macOS - first with a [manual implementation](build-your-own-operator-on-macos-1.md) using OpenAI's `computer-use-preview` model, then with our [cua-agent framework](build-your-own-operator-on-macos-2.md) for Python developers. While these tutorials have been incredibly popular, we've received consistent feedback from our community: **"Can we use C/ua with JavaScript and TypeScript?"**

Today, we're excited to announce the release of the **`@trycua/computer` Web SDK** - a new library that allows you to control your C/ua cloud containers from any JavaScript or TypeScript project. With this library, you can click, type, and grab screenshots from your cloud containers - no extra servers required.

With this new SDK, you can easily develop CUA experiences like the one below, which we will release soon as open source.

<video width="100%" controls>
  <source src="/playground_web_ui_sdk_sample.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

Let's see how it works.

## What You'll Learn

By the end of this tutorial, you'll be able to:

- Set up the `@trycua/computer` npm library in any JavaScript/TypeScript project
- Connect OpenAI's computer-use model to C/ua cloud containers from web applications
- Build computer-use agents that work in Node.js, React, Vue, or any web framework
- Handle different types of computer actions (clicking, typing, scrolling) from web code
- Implement the complete computer-use loop in JavaScript/TypeScript
- Integrate AI automation into existing web applications and workflows

**Prerequisites:**

- Node.js 16+ and npm/yarn/pnpm
- Basic JavaScript or TypeScript knowledge
- OpenAI API access (Tier 3+ for computer-use-preview)
- C/ua cloud container credits ([get started here](https://trycua.com/pricing))

**Estimated Time:** 45-60 minutes

## Access Requirements

### OpenAI Model Availability

At the time of writing, the **computer-use-preview** model has limited availability:

- Only accessible to OpenAI tier 3+ users
- Additional application process may be required even for eligible users
- Cannot be used in the OpenAI Playground
- Outside of ChatGPT Operator, usage is restricted to the new **Responses API**

Luckily, the `@trycua/computer` library can be used in conjunction with other models, like [Anthropic's Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool) or [UI-TARS](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B). You'll just have to write your own handler to parse the model output for interfacing with the container.

### C/ua Cloud Containers

To follow this guide, you'll need access to a C/ua cloud container.

Getting access is simple: purchase credits from our [pricing page](https://trycua.com/pricing), then create and provision a new container instance from the [dashboard](https://trycua.com/dashboard/containers). With your container running, you'll be ready to leverage the web SDK and bring automation to your JavaScript or TypeScript applications.

## Understanding the Flow

### OpenAI API Overview

Let's start with the basics. In our case, we'll use OpenAI's API to communicate with their computer-use model.

Think of it like this:

1. We send the model a screenshot of our container and tell it what we want it to do
2. The model looks at the screenshot and decides what actions to take
3. It sends back instructions (like "click here" or "type this")
4. We execute those instructions in our container

### Model Setup

Here's how we set up the computer-use model for web development:

```javascript
const res = await openai.responses.create({
  model: 'computer-use-preview',
  tools: [
    {
      type: 'computer_use_preview',
      display_width: 1024,
      display_height: 768,
      environment: 'linux', // we're using a linux container
    },
  ],
  input: [
    {
      role: 'user',
      content: [
        // what we want the ai to do
        { type: 'input_text', text: 'Open firefox and go to trycua.com' },
        // first screenshot of the vm
        {
          type: 'input_image',
          image_url: `data:image/png;base64,${screenshotBase64}`,
          detail: 'auto',
        },
      ],
    },
  ],
  truncation: 'auto',
});
```

### Understanding the Response

When we send a request, the API sends back a response that looks like this:

```json
"output": [
  {
    "type": "reasoning", // The AI explains what it's thinking
    "id": "rs_67cc...",
    "summary": [
      {
        "type": "summary_text",
        "text": "Clicking on the browser address bar."
      }
    ]
  },
  {
    "type": "computer_call", // The actual action to perform
    "id": "cu_67cc...",
    "call_id": "call_zw3...", // Used to track previous calls
    "action": {
      "type": "click", // What kind of action (click, type, etc.)
      "button": "left", // Which mouse button to use
      "x": 156, // Where to click (coordinates)
      "y": 50
    },
    "pending_safety_checks": [], // Any safety warnings to consider
    "status": "completed" // Whether the action was successful
  }
]
```

Each response contains:

1. **Reasoning**: The AI's explanation of what it's doing
2. **Action**: The specific computer action to perform
3. **Safety Checks**: Any potential risks to review
4. **Status**: Whether everything worked as planned

## Implementation Guide

### Provision a C/ua Cloud Container

1. Visit [trycua.com](https://trycua.com), sign up, purchase [credits](https://trycua.com/pricing), and create a new container instance from the [dashboard](https://trycua.com/dashboard).
2. Create an API key from the dashboard — be sure to save it in a secure location before continuing.
3. Start the cloud container from the dashboard.

### Environment Setup

1. Install the required packages with your preferred package manager:

```bash
npm install --save @trycua/computer # or yarn, pnpm, bun
npm install --save openai # or yarn, pnpm, bun
```

This works with any JavaScript/TypeScript project setup - whether you're using Create React App, Next.js, Vue, Angular, or plain JavaScript.

2. Save your OpenAI API key, C/ua API key, and container name to a `.env` file:

```bash
OPENAI_API_KEY=openai-api-key
CUA_API_KEY=cua-api-key
CUA_CONTAINER_NAME=cua-cloud-container-name
```

These environment variables work the same whether you're using vanilla JavaScript, TypeScript, or any web framework.

## Building the Agent

### Mapping API Actions to `@trycua/computer` Interface Methods

This helper function handles a `computer_call` action from the OpenAI API — converting the action into an equivalent action from the `@trycua/computer` interface. These actions execute on the initialized `Computer` instance. For example, `await computer.interface.leftClick()` sends a left mouse click to the current cursor position.

Whether you're using JavaScript or TypeScript, the interface remains the same:

```typescript
export async function executeAction(
  computer: Computer,
  action: OpenAI.Responses.ResponseComputerToolCall['action']
) {
  switch (action.type) {
    case 'click':
      const { x, y, button } = action;
      console.log(`Executing click at (${x}, ${y}) with button '${button}'.`);
      await computer.interface.moveCursor(x, y);
      if (button === 'right') await computer.interface.rightClick();
      else await computer.interface.leftClick();
      break;
    case 'type':
      const { text } = action;
      console.log(`Typing text: ${text}`);
      await computer.interface.typeText(text);
      break;
    case 'scroll':
      const { x: locX, y: locY, scroll_x, scroll_y } = action;
      console.log(
        `Scrolling at (${locX}, ${locY}) with offsets (scroll_x=${scroll_x}, scroll_y=${scroll_y}).`
      );
      await computer.interface.moveCursor(locX, locY);
      await computer.interface.scroll(scroll_x, scroll_y);
      break;
    case 'keypress':
      const { keys } = action;
      for (const key of keys) {
        console.log(`Pressing key: ${key}.`);
        // Map common key names to CUA equivalents
        if (key.toLowerCase() === 'enter') {
          await computer.interface.pressKey('return');
        } else if (key.toLowerCase() === 'space') {
          await computer.interface.pressKey('space');
        } else {
          await computer.interface.pressKey(key);
        }
      }
      break;
    case 'wait':
      console.log(`Waiting for 3 seconds.`);
      await new Promise((resolve) => setTimeout(resolve, 3 * 1000));
      break;
    case 'screenshot':
      console.log('Taking screenshot.');
      // This is handled automatically in the main loop, but we can take an extra one if requested
      const screenshot = await computer.interface.screenshot();
      return screenshot;
    default:
      console.log(`Unrecognized action: ${action.type}`);
      break;
  }
}
```

### Implementing the Computer-Use Loop

This section defines a loop that:

1. Initializes the `Computer` instance (connecting to a Linux cloud container).
2. Captures a screenshot of the current state.
3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` model.
4. Processes the returned `computer_call` action and executes it using our helper function.
5. Captures an updated screenshot after the action.
6. Sends the updated screenshot back and loops until no more actions are returned.

```typescript
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Initialize the Computer connection
const computer = new Computer({
  apiKey: process.env.CUA_API_KEY!,
  name: process.env.CUA_CONTAINER_NAME!,
  osType: OSType.LINUX,
});

await computer.run();
// Take the initial screenshot
const screenshot = await computer.interface.screenshot();
const screenshotBase64 = screenshot.toString('base64');

// Set up the OpenAI config for computer use
const computerUseConfig: OpenAI.Responses.ResponseCreateParamsNonStreaming = {
  model: 'computer-use-preview',
  tools: [
    {
      type: 'computer_use_preview',
      display_width: 1024,
      display_height: 768,
      environment: 'linux', // we're using a linux vm
    },
  ],
  truncation: 'auto',
};

// Send the initial screenshot to the OpenAI computer-use model
let res = await openai.responses.create({
  ...computerUseConfig,
  input: [
    {
      role: 'user',
      content: [
        // what we want the ai to do
        { type: 'input_text', text: 'open firefox and go to trycua.com' },
        // current screenshot of the vm
        {
          type: 'input_image',
          image_url: `data:image/png;base64,${screenshotBase64}`,
          detail: 'auto',
        },
      ],
    },
  ],
});

// Loop until there are no more computer use actions.
while (true) {
  const computerCalls = res.output.filter((o) => o.type === 'computer_call');
  if (computerCalls.length < 1) {
    console.log('No more computer calls. Loop complete.');
    break;
  }
  // Get the first call
  const call = computerCalls[0];
  const action = call.action;
  console.log('Received action from OpenAI Responses API:', action);
  let ackChecks: OpenAI.Responses.ResponseComputerToolCall.PendingSafetyCheck[] = [];
  if (call.pending_safety_checks.length > 0) {
    console.log('Safety checks pending:', call.pending_safety_checks);
    // In a real implementation, you would want to get user confirmation here.
    ackChecks = call.pending_safety_checks;
  }

  // Execute the action in the container
  await executeAction(computer, action);
  // Wait for changes to process within the container (1 sec)
  await new Promise((resolve) => setTimeout(resolve, 1000));

  // Capture a new screenshot
  const newScreenshot = await computer.interface.screenshot();
  const newScreenshotBase64 = newScreenshot.toString('base64');

  // Send the screenshot back as computer_call_output
  res = await openai.responses.create({
    ...computerUseConfig,
    previous_response_id: res.id,
    input: [
      {
        type: 'computer_call_output',
        call_id: call.call_id,
        acknowledged_safety_checks: ackChecks,
        output: {
          type: 'computer_screenshot',
          image_url: `data:image/png;base64,${newScreenshotBase64}`,
        },
      },
    ],
  });
}
```

You can find the full example on [GitHub](https://github.com/trycua/cua/tree/main/examples/computer-example-ts).

## What's Next?

The `@trycua/computer` Web SDK opens up some interesting possibilities. You could build browser-based testing tools, create interactive demos for your products, or automate repetitive workflows directly from your web apps.

We're working on more examples and better documentation - if you build something cool with this SDK, we'd love to see it. Drop by our [Discord](https://discord.gg/cua-ai) and share what you're working on.

Happy automating on the web!

New file: blog/build-your-own-operator-on-macos-1.md (547 lines)
# Build Your Own Operator on macOS - Part 1

*Published on March 31, 2025 by Francesco Bonacci*

In this first blogpost, we'll learn how to build our own Computer-Use Operator using OpenAI's `computer-use-preview` model. But first, let's understand what some common terms mean:

- A **Virtual Machine (VM)** is like a computer within your computer - a safe, isolated environment where the AI can work without affecting your main system.
- **computer-use-preview** is OpenAI's specialized language model trained to understand and interact with computer interfaces through screenshots.
- A **Computer-Use Agent** is an AI agent that can control a computer just like a human would - clicking buttons, typing text, and interacting with applications.

Our Operator will run in an isolated macOS VM, making use of our [cua-computer](https://github.com/trycua/cua/tree/main/libs/computer) package and the [lume virtualization CLI](https://github.com/trycua/cua/tree/main/libs/lume).

Check out what it looks like to use your own Operator from a Gradio app:

<video width="100%" controls>
  <source src="/demo_gradio.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## What You'll Learn

By the end of this tutorial, you'll be able to:

- Set up a macOS virtual machine for AI automation
- Connect OpenAI's computer-use model to your VM
- Create a basic loop for the AI to interact with your VM
- Handle different types of computer actions (clicking, typing, etc.)
- Implement safety checks and error handling

**Prerequisites:**

- macOS Sonoma (14.0) or later
- 8GB RAM minimum (16GB recommended)
- OpenAI API access (Tier 3+)
- Basic Python knowledge
- Familiarity with terminal commands

**Estimated Time:** 45-60 minutes

## Introduction to Computer-Use Agents

Last March, OpenAI released a fine-tuned version of GPT-4o, namely [CUA](https://openai.com/index/computer-using-agent/), introducing pixel-level vision capabilities with advanced reasoning through reinforcement learning. This fine-tuning enables the computer-use model to interpret screenshots and interact with graphical user interface elements at the pixel level - buttons, menus, and text fields - mimicking human interactions on a computer screen. It scores a remarkable 38.1% success rate on [OSWorld](https://os-world.github.io) - a benchmark for Computer-Use agents on Linux and Windows. This is the second available model, after Anthropic's [Claude 3.5 Sonnet](https://www.anthropic.com/news/3-5-models-and-computer-use), to support computer-use capabilities natively with no external models (e.g. accessory [SoM (Set-of-Mark)](https://arxiv.org/abs/2310.11441) and OCR runs).

Professor Ethan Mollick provides an excellent explanation of computer-use agents in this article: [When you give a Claude a mouse](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse).

### ChatGPT Operator

OpenAI's computer-use model powers [ChatGPT Operator](https://openai.com/index/introducing-operator), a Chromium-based interface exclusively available to ChatGPT Pro subscribers. Users leverage this functionality to automate web-based tasks such as online shopping, expense report submission, and booking reservations by interacting with websites in a human-like manner.

## Benefits of Custom Operators

### Why Build Your Own?

While OpenAI's Operator uses a controlled Chromium VM instance, there are scenarios where you may want to use your own VM with full desktop capabilities. Here are some examples:

- Automating native macOS apps like Finder and Xcode
- Managing files, changing settings, and running terminal commands
- Testing desktop software and applications
- Creating workflows that combine web and desktop tasks
- Automating media editing in apps like Final Cut Pro and Blender

This gives you more control and flexibility to automate tasks beyond just web browsing, with full access to interact with native applications and system-level operations. Additionally, running your own VM locally provides better privacy for sensitive user files and delivers superior performance by leveraging your own hardware instead of renting expensive cloud VMs.

## Access Requirements

### Model Availability

At the time of writing, the **computer-use-preview** model has limited availability:

- Only accessible to OpenAI tier 3+ users
- Additional application process may be required even for eligible users
- Cannot be used in the OpenAI Playground
- Outside of ChatGPT Operator, usage is restricted to the new **Responses API**

## Understanding the OpenAI API

### Responses API Overview

Let's start with the basics. In our case, we'll use OpenAI's Responses API to communicate with their computer-use model.

Think of it like this:

1. We send the model a screenshot of our VM and tell it what we want it to do
2. The model looks at the screenshot and decides what actions to take
3. It sends back instructions (like "click here" or "type this")
4. We execute those instructions in our VM

The [Responses API](https://platform.openai.com/docs/guides/responses) is OpenAI's newest way to interact with their AI models. It comes with several built-in tools:

- **Web search**: Let the AI search the internet
- **File search**: Help the AI find documents
- **Computer use**: Allow the AI to control a computer (what we'll be using)

At the time of writing, the computer-use model is only available through the Responses API.

### Responses API Examples

Let's look at some simple examples. We'll start with the traditional way of using OpenAI's API with Chat Completions, then show the new Responses API primitive.

Chat Completions:

```python
# The old way required managing conversation history manually
messages = [{"role": "user", "content": "Hello"}]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages  # We had to track all messages ourselves
)
messages.append(response.choices[0].message)  # Manual message tracking
```

Responses API:

```python
# Example 1: Simple web search
# The API handles all the complexity for us
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "What's the latest news about AI?"
    }],
    tools=[{
        "type": "web_search",  # Tell the API to use web search
        "search_query": "latest AI news"
    }]
)

# Example 2: File search
# Looking for specific documents becomes easy
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "Find documents about project X"
    }],
    tools=[{
        "type": "file_search",
        "query": "project X",
        "file_types": ["pdf", "docx"]  # Specify which file types to look for
    }]
)
```

### Computer-Use Model Setup

For our operator, we'll use the computer-use model. Here's how we set it up:

```python
# Set up the computer-use model to control our VM
response = client.responses.create(
    model="computer-use-preview",  # Special model for computer control
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,  # Size of our VM screen
        "display_height": 768,
        "environment": "mac"  # Tell it we're using macOS
    }],
    input=[
        {
            "role": "user",
            "content": [
                # What we want the AI to do
                {"type": "input_text", "text": "Open Safari and go to google.com"},
                # Current screenshot of our VM
                {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
            ]
        }
    ],
    truncation="auto"  # Let OpenAI handle message length
)
```

### Understanding the Response

When we send a request, the API sends back a response that looks like this:

```json
"output": [
    {
        "type": "reasoning",  # The AI explains what it's thinking
        "id": "rs_67cc...",
        "summary": [
            {
                "type": "summary_text",
                "text": "Clicking on the browser address bar."
            }
        ]
    },
    {
        "type": "computer_call",  # The actual action to perform
        "id": "cu_67cc...",
        "call_id": "call_zw3...",
        "action": {
            "type": "click",  # What kind of action (click, type, etc.)
            "button": "left",  # Which mouse button to use
            "x": 156,  # Where to click (coordinates)
            "y": 50
        },
        "pending_safety_checks": [],  # Any safety warnings to consider
        "status": "completed"  # Whether the action was successful
    }
]
```

Each response contains:

1. **Reasoning**: The AI's explanation of what it's doing
2. **Action**: The specific computer action to perform
3. **Safety Checks**: Any potential risks to review
4. **Status**: Whether everything worked as planned
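
As a quick illustration, here's a minimal sketch of pulling those pieces out of the response in Python. It assumes a `response` object returned by `client.responses.create(...)` as configured above; attribute names mirror the JSON fields shown.

```python
# Separate the reasoning summaries from the concrete computer_call actions
reasoning_items = [item for item in response.output if item.type == "reasoning"]
computer_calls = [item for item in response.output if item.type == "computer_call"]

for item in reasoning_items:
    for summary in item.summary:
        print("Model reasoning:", summary.text)

if computer_calls:
    call = computer_calls[0]
    print("Next action:", call.action.type)
    print("Pending safety checks:", call.pending_safety_checks)
    print("Status:", call.status)
```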

## CUA-Computer Interface

### Architecture Overview

Let's break down the main components of our system and how they work together:

1. **The Virtual Machine (VM)**
   - Think of this as a safe playground for our AI
   - It's a complete macOS system running inside your computer
   - Anything the AI does stays inside this VM, keeping your main system safe
   - We use `lume` to create and manage this VM

2. **The Computer Interface (CUI)**
   - This is how we control the VM
   - It can move the mouse, type text, and take screenshots
   - Works like a remote control for the VM
   - Built using our `cua-computer` package

3. **The OpenAI Model**
   - This is the brain of our operator
   - It looks at screenshots of the VM
   - Decides what actions to take
   - Sends back instructions like "click here" or "type this"

Here's how they all work together:

```mermaid
sequenceDiagram
    participant User as You
    participant CUI as Computer Interface
    participant VM as Virtual Machine
    participant AI as OpenAI API

    Note over User,AI: The Main Loop
    User->>CUI: Start the operator
    CUI->>VM: Create macOS sandbox
    activate VM
    VM-->>CUI: VM is ready

    loop Action Loop
        Note over CUI,AI: Each iteration
        CUI->>VM: Take a screenshot
        VM-->>CUI: Return current screen
        CUI->>AI: Send screenshot + instructions
        AI-->>CUI: Return next action

        Note over CUI,VM: Execute the action
        alt Mouse Click
            CUI->>VM: Move and click mouse
        else Type Text
            CUI->>VM: Type characters
        else Scroll Screen
            CUI->>VM: Scroll window
        else Press Keys
            CUI->>VM: Press keyboard keys
        else Wait
            CUI->>VM: Pause for a moment
        end
    end

    VM-->>CUI: Task finished
    deactivate VM
    CUI-->>User: All done!
```

The diagram above shows how information flows through our system:

1. You start the operator
2. The Computer Interface creates a virtual macOS
3. Then it enters a loop:
   - Take a picture of the VM screen
   - Send it to OpenAI with instructions
   - Get back an action to perform
   - Execute that action in the VM
   - Repeat until the task is done

This design keeps everything organized and safe. The AI can only interact with the VM through our controlled interface, and the VM keeps the AI's actions isolated from your main system.

---

## Implementation Guide

### Prerequisites

1. **Lume CLI Setup**

To install the standalone lume binary, run the following command from a terminal, or download the [latest pkg](https://github.com/trycua/cua/releases/latest/download/lume.pkg.tar.gz):

```bash
sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
```

**Important Storage Notes:**

- Initial download requires 80GB of free space
- After the first run, space usage reduces to ~30GB due to macOS's sparse file system
- VMs are stored in `~/.lume`
- Cached images are stored in `~/.lume/cache`

You can check your downloaded VM images anytime:

```bash
lume ls
```

Example output:

| name | os | cpu | memory | disk | display | status | ip | vnc |
|------|----|-----|--------|------|---------|--------|----|-----|
| macos-sequoia-cua:latest | macOS | 12 | 16.00G | 64.5GB/80.0GB | 1024x768 | running | 192.168.64.78 | vnc://:kind-forest-zulu-island@127.0.0.1:56085 |

After checking your available images, you can run the VM to ensure everything is working correctly:

```bash
lume run macos-sequoia-cua:latest
```

2. **Python Environment Setup**

**Note**: The `cua-computer` package requires Python 3.10 or later. We recommend creating a dedicated Python environment:

**Using venv:**

```bash
python -m venv cua-env
source cua-env/bin/activate
```

**Using conda:**

```bash
conda create -n cua-env python=3.10
conda activate cua-env
```

Then install the required packages:

```bash
pip install openai
pip install cua-computer
```

Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).

### Building the Operator

#### Importing Required Modules

With the prerequisites installed and configured, we're ready to build our first operator.
The following example uses asynchronous Python (async/await). You can run it either in a VS Code Notebook or as a standalone Python script.

```python
import asyncio
import base64
import openai

from computer import Computer
```

#### Mapping API Actions to CUA Methods

The following helper function converts a `computer_call` action from the OpenAI Responses API into corresponding commands on the CUI interface. For example, if the API instructs a `click` action, we move the cursor and perform a left click on the lume VM sandbox. We will use the computer interface to execute the actions.

```python
async def execute_action(computer, action):
    action_type = action.type

    if action_type == "click":
        x = action.x
        y = action.y
        button = action.button
        print(f"Executing click at ({x}, {y}) with button '{button}'")
        await computer.interface.move_cursor(x, y)
        if button == "right":
            await computer.interface.right_click()
        else:
            await computer.interface.left_click()

    elif action_type == "type":
        text = action.text
        print(f"Typing text: {text}")
        await computer.interface.type_text(text)

    elif action_type == "scroll":
        x = action.x
        y = action.y
        scroll_x = action.scroll_x
        scroll_y = action.scroll_y
        print(f"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})")
        await computer.interface.move_cursor(x, y)
        await computer.interface.scroll(scroll_y)  # Using vertical scroll only

    elif action_type == "keypress":
        keys = action.keys
        for key in keys:
            print(f"Pressing key: {key}")
            # Map common key names to CUA equivalents
            if key.lower() == "enter":
                await computer.interface.press_key("return")
            elif key.lower() == "space":
                await computer.interface.press_key("space")
            else:
                await computer.interface.press_key(key)

    elif action_type == "wait":
        wait_time = action.time
        print(f"Waiting for {wait_time} seconds")
        await asyncio.sleep(wait_time)

    elif action_type == "screenshot":
        print("Taking screenshot")
        # This is handled automatically in the main loop, but we can take an extra one if requested
        screenshot = await computer.interface.screenshot()
        return screenshot

    else:
        print(f"Unrecognized action: {action_type}")
```

#### Implementing the Computer-Use Loop

This section defines a loop that:

1. Initializes the cua-computer instance (connecting to a macOS sandbox).
2. Captures a screenshot of the current state.
3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` model.
4. Processes the returned `computer_call` action and executes it using our helper function.
5. Captures an updated screenshot after the action.

The loop repeats these steps until no further actions are returned.

```python
async def cua_openai_loop():
    # Initialize the lume computer instance (macOS sandbox)
    async with Computer(
        display="1024x768",
        memory="4GB",
        cpu="2",
        os_type="macos"
    ) as computer:
        await computer.run()  # Start the lume VM

        # Capture the initial screenshot
        screenshot = await computer.interface.screenshot()
        screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')

        # Initial request to start the loop
        response = openai.responses.create(
            model="computer-use-preview",
            tools=[{
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "mac"
            }],
            input=[
                {
                    "role": "user",
                    "content": [
                        {"type": "input_text", "text": "Open Safari, download and install Cursor."},
                        {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
                    ]
                }
            ],
            truncation="auto"
        )

        # Continue the loop until no more computer_call actions
        while True:
            # Check for computer_call actions
            computer_calls = [item for item in response.output if item and item.type == "computer_call"]
            if not computer_calls:
                print("No more computer calls. Loop complete.")
                break

            # Get the first computer call
            call = computer_calls[0]
            last_call_id = call.call_id
            action = call.action
            print("Received action from OpenAI Responses API:", action)

            # Handle any pending safety checks
            if call.pending_safety_checks:
                print("Safety checks pending:", call.pending_safety_checks)
                # In a real implementation, you would want to get user confirmation here
                acknowledged_checks = call.pending_safety_checks
            else:
                acknowledged_checks = []

            # Execute the action
            await execute_action(computer, action)
            await asyncio.sleep(1)  # Allow time for changes to take effect

            # Capture a new screenshot after the action
            new_screenshot = await computer.interface.screenshot()
            new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')

            # Send the screenshot back as computer_call_output
            response = openai.responses.create(
                model="computer-use-preview",
                tools=[{
                    "type": "computer_use_preview",
                    "display_width": 1024,
                    "display_height": 768,
                    "environment": "mac"
                }],
                input=[{
                    "type": "computer_call_output",
                    "call_id": last_call_id,
                    "acknowledged_safety_checks": acknowledged_checks,
                    "output": {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{new_screenshot_base64}"
                    }
                }],
                truncation="auto"
            )

        # End the session
        await computer.stop()

# Run the loop
if __name__ == "__main__":
    asyncio.run(cua_openai_loop())
```

You can find the full code in our [notebook](https://github.com/trycua/cua/blob/main/notebooks/blog/build-your-own-operator-on-macos-1.ipynb).

#### Request Handling Differences

The first request to the OpenAI Responses API is special in that it includes the initial screenshot and prompt. Subsequent requests are handled differently, using the `computer_call_output` type to provide feedback on the executed action.

##### Initial Request Format

- We use `role: "user"` with `content` that contains both `input_text` (the prompt) and `input_image` (the screenshot)

##### Subsequent Request Format

- We use `type: "computer_call_output"` instead of the user role
- We include the `call_id` to link the output to the specific previous action that was executed
- We provide any `acknowledged_safety_checks` that were approved
- We include the new screenshot in the `output` field

This structured approach allows the API to maintain context and continuity throughout the interaction session.

**Note**: For multi-turn conversations, you should include the `previous_response_id` in your initial requests when starting a new conversation with prior context. However, when using `computer_call_output` for action feedback, you don't need to explicitly manage the conversation history - OpenAI's API automatically tracks the context using the `call_id`. The `previous_response_id` is primarily important when the user provides additional instructions or when starting a new request that should continue from a previous session.
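
For illustration, here's a minimal sketch of continuing a session with `previous_response_id`; it reuses the tool configuration from the loop above and assumes `response` is the last response received.

```python
# Continue from the previous turn so the model keeps its prior context
followup = openai.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,  # ties this request to the earlier session
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "mac"
    }],
    input=[{
        "role": "user",
        "content": [{"type": "input_text", "text": "Now quit Safari."}]
    }],
    truncation="auto"
)
```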

## Conclusion

### Summary

This blogpost demonstrates an OpenAI computer-use loop in which:

- A macOS sandbox is controlled using the CUA interface.
- A screenshot and prompt are sent to the OpenAI Responses API.
- The returned action (e.g. a click or type command) is executed via the CUI interface.

In a production setting, you would extend this action-response loop with more robust error handling and explicit user confirmation of pending safety checks.

### Next Steps

In the next blogpost, we'll introduce our Agent framework, which abstracts away all these tedious implementation steps. This framework provides a higher-level API that handles the interaction loop between OpenAI's computer-use model and the macOS sandbox, allowing you to focus on building sophisticated applications rather than managing the low-level details we've explored here. Can't wait? Check out the [cua-agent](https://github.com/trycua/cua/tree/main/libs/agent) package!

### Resources

- [OpenAI Computer-Use docs](https://platform.openai.com/docs/guides/tools-computer-use)
- [cua-computer](https://github.com/trycua/cua/tree/main/libs/computer)
- [lume](https://github.com/trycua/cua/tree/main/libs/lume)

New file: blog/build-your-own-operator-on-macos-2.md (655 lines)
# Build Your Own Operator on macOS - Part 2
|
||||
|
||||
*Published on April 27, 2025 by Francesco Bonacci*
|
||||
|
||||
In our [previous post](build-your-own-operator-on-macos-1.md), we built a basic Computer-Use Operator from scratch using OpenAI's `computer-use-preview` model and our [cua-computer](https://pypi.org/project/cua-computer) package. While educational, implementing the control loop manually can be tedious and error-prone.
|
||||
|
||||
In this follow-up, we'll explore our [cua-agent](https://pypi.org/project/cua-agent) framework - a high-level abstraction that handles all the complexity of VM interaction, screenshot processing, model communication, and action execution automatically.
|
||||
|
||||
<video width="100%" controls>
|
||||
<source src="/demo.mp4" type="video/mp4">
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
|
||||
## What You'll Learn
|
||||
|
||||
By the end of this tutorial, you'll be able to:
|
||||
- Set up the `cua-agent` framework with various agent loop types and model providers
|
||||
- Understand the different agent loop types and their capabilities
|
||||
- Work with local models for cost-effective workflows
|
||||
- Use a simple UI for your operator
|
||||
|
||||
**Prerequisites:**
|
||||
- Completed setup from Part 1 ([lume CLI installed](https://github.com/trycua/cua?tab=readme-ov-file#option-2-full-computer-use-agent-capabilities), macOS CUA image already pulled)
|
||||
- Python 3.10+. We recommend using Conda (or Anaconda) to create an ad hoc Python environment.
|
||||
- API keys for OpenAI and/or Anthropic (optional for local models)
|
||||
|
||||
**Estimated Time:** 30-45 minutes
|
||||
|
||||
## Introduction to cua-agent
|
||||
|
||||
The `cua-agent` framework is designed to simplify building Computer-Use Agents. It abstracts away the complex interaction loop we built manually in Part 1, letting you focus on defining tasks rather than implementing the machinery. Among other features, it includes:
|
||||
|
||||
- **Multiple Provider Support**: Works with OpenAI, Anthropic, UI-Tars, local models (via Ollama), or any OpenAI-compatible model (e.g. LM Studio, vLLM, LocalAI, OpenRouter, Groq, etc.)
|
||||
- **Flexible Loop Types**: Different implementations optimized for various models (e.g. OpenAI vs. Anthropic)
|
||||
- **Structured Responses**: Clean, consistent output following the OpenAI Agent SDK specification we touched on in Part 1
|
||||
- **Local Model Support**: Run cost-effectively with locally hosted models (Ollama, LM Studio, vLLM, LocalAI, etc.)
|
||||
- **Gradio UI**: Optional visual interface for interacting with your agent
|
||||
|
||||
## Installation
|
||||
|
||||
Let's start by installing the `cua-agent` package. You can install it with all features or selectively install only what you need.
|
||||
|
||||
From your python 3.10+ environment, run:
|
||||
|
||||
```bash
|
||||
# For all features
|
||||
pip install "cua-agent[all]"
|
||||
|
||||
# Or selectively install only what you need
|
||||
pip install "cua-agent[openai]" # OpenAI support
|
||||
pip install "cua-agent[anthropic]" # Anthropic support
|
||||
pip install "cua-agent[uitars]" # UI-Tars support
|
||||
pip install "cua-agent[omni]" # OmniParser + VLMs support
|
||||
pip install "cua-agent[ui]" # Gradio UI
|
||||
```
|
||||
|
||||
## Setting Up Your Environment
|
||||
|
||||
Before running any code examples, let's set up a proper environment:
|
||||
|
||||
1. **Create a new directory** for your project:
|
||||
```bash
|
||||
mkdir cua-agent-tutorial
|
||||
cd cua-agent-tutorial
|
||||
```
|
||||
|
||||
2. **Set up a Python environment** using one of these methods:
|
||||
|
||||
**Option A: Using conda command line**
|
||||
```bash
|
||||
# Using conda
|
||||
conda create -n cua-agent python=3.10
|
||||
conda activate cua-agent
|
||||
```
|
||||
|
||||
**Option B: Using Anaconda Navigator UI**
|
||||
- Open Anaconda Navigator
|
||||
- Click on "Environments" in the left sidebar
|
||||
- Click the "Create" button at the bottom
|
||||
- Name your environment "cua-agent"
|
||||
- Select Python 3.10
|
||||
- Click "Create"
|
||||
- Once created, select the environment and click "Open Terminal" to activate it
|
||||
|
||||
**Option C: Using venv**
|
||||
```bash
|
||||
python -m venv cua-env
|
||||
source cua-env/bin/activate # On macOS/Linux
|
||||
```
|
||||
|
||||
3. **Install the cua-agent package**:
|
||||
```bash
|
||||
pip install "cua-agent[all]"
|
||||
```
|
||||
|
||||
4. **Set up your API keys as environment variables**:
|
||||
```bash
|
||||
# For OpenAI models
|
||||
export OPENAI_API_KEY=your_openai_key_here
|
||||
|
||||
# For Anthropic models (if needed)
|
||||
export ANTHROPIC_API_KEY=your_anthropic_key_here
|
||||
```
|
||||
|
||||
5. **Create a Python file or notebook**:
|
||||
|
||||
**Option A: Create a Python script**
|
||||
```bash
|
||||
# For a Python script
|
||||
touch cua_agent_example.py
|
||||
```
|
||||
|
||||
**Option B: Use VS Code notebooks**
|
||||
- Open VS Code
|
||||
- Install the Python extension if you haven't already
|
||||
- Create a new file with a `.ipynb` extension (e.g., `cua_agent_tutorial.ipynb`)
|
||||
- Select your Python environment when prompted
|
||||
- You can now create and run code cells in the notebook interface
|
||||
|
||||
Now you're ready to run the code examples!
|
||||
|
||||
## Understanding Agent Loops
|
||||
|
||||
If you recall from Part 1, we had to implement a custom interaction loop to interact with the compute-use-preview model.
|
||||
|
||||
In the `cua-agent` framework, an **Agent Loop** is the core abstraction that implements the continuous interaction cycle between an AI model and the computer environment. It manages the flow of:
|
||||
1. Capturing screenshots of the computer's state
|
||||
2. Processing these screenshots (with or without UI element detection)
|
||||
3. Sending this visual context to an AI model along with the task instructions
|
||||
4. Receiving the model's decisions on what actions to take
|
||||
5. Safely executing these actions in the environment
|
||||
6. Repeating this cycle until the task is complete
|
||||
|
||||
The loop handles all the complex error handling, retries, context management, and model-specific interaction patterns so you don't have to implement them yourself.
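Conceptually, every loop boils down to the same cycle. Here's a simplified sketch in Python; it's illustrative pseudocode rather than the framework's actual internals, and the helper functions are placeholders:

```python
from typing import Any

async def take_screenshot(computer: Any) -> bytes:
    # Placeholder: a real loop grabs a screenshot from the computer interface
    return b""

async def predict_action(history: list, screenshot: bytes) -> dict:
    # Placeholder: a real loop calls the provider-specific model API here
    return {"type": "done"}

async def execute_action(computer: Any, action: dict) -> None:
    # Placeholder: click, type, or scroll inside the VM
    pass

async def agent_loop(computer: Any, task: str, max_steps: int = 30) -> list:
    """The cycle every agent loop implements: observe, decide, act, repeat."""
    history: list = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        screenshot = await take_screenshot(computer)        # 1-2. capture and preprocess state
        action = await predict_action(history, screenshot)  # 3-4. ask the model for the next step
        if action.get("type") == "done":                    # 6. stop when the task is complete
            break
        await execute_action(computer, action)              # 5. act in the environment
        history.append({"role": "assistant", "content": action})
    return history
```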
|
||||
|
||||
While the core concept remains the same across all agent loops, different AI models require specialized handling for optimal performance. To address this, the framework provides 4 different agent loop implementations, each designed for different computer-use modalities.
|
||||
| Agent Loop | Supported Models | Description | Set-Of-Marks |
|
||||
|:-----------|:-----------------|:------------|:-------------|
|
||||
| `AgentLoop.OPENAI` | • `computer_use_preview` | Use OpenAI Operator CUA Preview model | Not Required |
|
||||
| `AgentLoop.ANTHROPIC` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219` | Use Anthropic Computer-Use Beta Tools | Not Required |
|
||||
| `AgentLoop.UITARS` | • `ByteDance-Seed/UI-TARS-1.5-7B` | Uses ByteDance's UI-TARS 1.5 model | Not Required |
|
||||
| `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama or OpenAI-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |
|
||||
|
||||
Each loop handles the same basic pattern we implemented manually in Part 1:
|
||||
1. Take a screenshot of the VM
|
||||
2. Send the screenshot and task to the AI model
|
||||
3. Receive an action to perform
|
||||
4. Execute the action
|
||||
5. Repeat until the task is complete
|
||||
|
||||
### Why Different Agent Loops?
|
||||
|
||||
The `cua-agent` framework provides multiple agent loop implementations to abstract away the complexity of interacting with different CUA models. Each provider has unique API structures, response formats, conventions and capabilities that require specialized handling:
|
||||
|
||||
- **OpenAI Loop**: Uses the Responses API with a specific `computer_call_output` format for sending screenshots after actions. Requires handling safety checks and maintains a chain of requests using `previous_response_id`.
|
||||
|
||||
- **Anthropic Loop**: Implements a [multi-agent loop pattern](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use#understanding-the-multi-agent-loop) with a sophisticated message handling system, supporting various API providers (Anthropic, Bedrock, Vertex) with token management and prompt caching capabilities.
|
||||
|
||||
- **UI-TARS Loop**: Requires custom message formatting and specialized parsing to extract actions from text responses using a "box token" system for UI element identification.
|
||||
|
||||
- **OMNI Loop**: Uses [Microsoft's OmniParser](https://github.com/microsoft/OmniParser) to create a [Set-of-Marks (SoM)](https://arxiv.org/abs/2310.11441) representation of the UI, enabling any vision-language model to interact with interfaces without specialized UI training.
|
||||
|
||||
These abstractions allow you to easily switch between providers without changing your application code. All loop implementations are available in the [cua-agent GitHub repository](https://github.com/trycua/cua/tree/main/libs/agent/agent/providers).
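For example, switching between providers is just a change to the model string; the surrounding code stays identical. A minimal sketch, reusing model strings that appear elsewhere in this post:

```python
import asyncio
from computer import Computer
from agent import ComputerAgent

# The same code runs against any supported loop; only the model string changes.
MODEL_STRINGS = [
    "openai/computer-use-preview",           # OpenAI loop
    "anthropic/claude-3-5-sonnet-20241022",  # Anthropic loop
    "omniparser+ollama_chat/gemma3",         # OMNI loop with a local model
]

async def run_with(model_string: str, task: str):
    async with Computer() as macos_computer:
        agent = ComputerAgent(model=model_string, tools=[macos_computer])
        async for result in agent.run(task):
            print(f"[{model_string}] {result.get('text')}")

if __name__ == "__main__":
    asyncio.run(run_with(MODEL_STRINGS[0], "Open Safari and search for 'Python tutorials'"))
```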
|
||||
|
||||
Choosing the right agent loop depends not only on your API access and technical requirements but also on the specific tasks you need to accomplish. To make an informed decision, it's helpful to understand how these underlying models perform across different computing environments – from desktop operating systems to web browsers and mobile interfaces.
|
||||
|
||||
## Computer-Use Model Capabilities
|
||||
|
||||
The performance of different Computer-Use models varies significantly across tasks. These benchmark evaluations measure an agent's ability to follow instructions and complete real-world tasks in different computing environments.
|
||||
|
||||
| Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA | Human |
|
||||
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------|-------------|-------------|----------------------|-------------|
|
||||
| **Computer Use** | [OSworld](https://arxiv.org/abs/2404.07972) (100 steps) | **42.5** | 36.4 | 28 | 38.1 (200 step) | 72.4 |
|
||||
| | [Windows Agent Arena](https://arxiv.org/abs/2409.08264) (50 steps) | **42.1** | - | - | 29.8 | - |
|
||||
| **Browser Use** | [WebVoyager](https://arxiv.org/abs/2401.13919) | 84.8 | **87** | 84.1 | 87 | - |
|
||||
| | [Online-Mind2web](https://arxiv.org/abs/2504.01382) | **75.8** | 71 | 62.9 | 71 | - |
|
||||
| **Phone Use** | [Android World](https://arxiv.org/abs/2405.14573) | **64.2** | - | - | 59.5 | - |
|
||||
|
||||
### When to Use Each Loop
|
||||
|
||||
- **AgentLoop.OPENAI**: Choose when you have OpenAI Tier 3 access and need the most capable computer-use agent for web-based tasks. Uses the same [OpenAI Computer-Use Loop](https://platform.openai.com/docs/guides/tools-computer-use) as Part 1, delivering strong performance on browser-based benchmarks.
|
||||
|
||||
- **AgentLoop.ANTHROPIC**: Ideal for users with Anthropic API access who need strong reasoning capabilities with computer-use abilities. Works with `claude-3-5-sonnet-20240620` and `claude-3-7-sonnet-20250219` models following [Anthropic's Computer-Use tools](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use#understanding-the-multi-agent-loop).
|
||||
|
||||
- **AgentLoop.UITARS**: Best for OS- and desktop-heavy automation and latency-sensitive scenarios, as UI-TARS-1.5 leads the OS-capability benchmarks. Requires running the model locally or accessing it through compatible endpoints (e.g. on Hugging Face).
|
||||
|
||||
- **AgentLoop.OMNI**: The most flexible option that works with virtually any vision-language model including local and open-source ones. Perfect for cost-effective development or when you need to use models without native computer-use capabilities.
|
||||
|
||||
Now that we understand the capabilities and strengths of different models, let's see how easy it is to implement a Computer-Use Agent using the `cua-agent` framework. Let's look at the implementation details.
|
||||
|
||||
## Creating Your First Computer-Use Agent
|
||||
|
||||
With the `cua-agent` framework, creating a Computer-Use Agent becomes remarkably straightforward. The framework handles all the complexities of model interaction, screenshot processing, and action execution behind the scenes. Let's look at a simple example of how to build your first agent:
|
||||
|
||||
**How to run this example:**
|
||||
|
||||
1. Create a new file named `simple_task.py` in your text editor or IDE (like VS Code, PyCharm, or Cursor)
|
||||
2. Copy and paste the following code:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def run_simple_task():
|
||||
async with Computer() as macos_computer:
|
||||
# Create agent with OpenAI loop
|
||||
agent = ComputerAgent(
|
||||
model="openai/computer-use-preview",
|
||||
tools=[macos_computer]
|
||||
)
|
||||
|
||||
# Define a simple task
|
||||
task = "Open Safari and search for 'Python tutorials'"
|
||||
|
||||
# Run the task and process responses
|
||||
async for result in agent.run(task):
|
||||
print(f"Action: {result.get('text')}")
|
||||
|
||||
# Run the example
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(run_simple_task())
|
||||
```
|
||||
|
||||
3. Save the file
|
||||
4. Open a terminal, navigate to your project directory, and run:
|
||||
```bash
|
||||
python simple_task.py
|
||||
```
|
||||
|
||||
5. The code will initialize the macOS virtual machine, create an agent, and execute the task of opening Safari and searching for Python tutorials.
|
||||
|
||||
You can also run this in a VS Code notebook:
|
||||
1. Create a new notebook in VS Code (.ipynb file)
|
||||
2. Copy the code into a cell (without the `if __name__ == "__main__":` part)
|
||||
3. Run the cell to execute the code
|
||||
|
||||
You can find the full code in our [notebook](https://github.com/trycua/cua/blob/main/notebooks/blog/build-your-own-operator-on-macos-2.ipynb).
|
||||
|
||||
Compare this to the manual implementation from Part 1 - we've reduced dozens of lines of code to just a few. The cua-agent framework handles all the complex logic internally, letting you focus on the overarching agentic system.
|
||||
|
||||
## Working with Multiple Tasks
|
||||
|
||||
Another advantage of the cua-agent framework is how easily you can chain multiple tasks. Instead of managing complex state between tasks, you simply provide a sequence of instructions to be executed in order:
|
||||
|
||||
**How to run this example:**
|
||||
|
||||
1. Create a new file named `multi_task.py` with the following code:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def run_multi_task_workflow():
|
||||
async with Computer() as macos_computer:
|
||||
agent = ComputerAgent(
|
||||
model="anthropic/claude-3-5-sonnet-20241022",
|
||||
tools=[macos_computer]
|
||||
)
|
||||
|
||||
tasks = [
|
||||
"Open Safari and go to github.com",
|
||||
"Search for 'trycua/cua'",
|
||||
"Open the repository page",
|
||||
"Click on the 'Issues' tab",
|
||||
"Read the first open issue"
|
||||
]
|
||||
|
||||
for i, task in enumerate(tasks):
|
||||
print(f"\nTask {i+1}/{len(tasks)}: {task}")
|
||||
async for result in agent.run(task):
|
||||
# Print just the action description for brevity
|
||||
if result.get("text"):
|
||||
print(f" → {result.get('text')}")
|
||||
print(f"✅ Task {i+1} completed")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(run_multi_task_workflow())
|
||||
```
|
||||
|
||||
2. Save the file
|
||||
3. Make sure you have set your Anthropic API key:
|
||||
```bash
|
||||
export ANTHROPIC_API_KEY=your_anthropic_key_here
|
||||
```
|
||||
4. Run the script:
|
||||
```bash
|
||||
python multi_task.py
|
||||
```
|
||||
|
||||
This pattern is particularly useful for creating workflows that navigate through multiple steps of an application or process. The agent maintains visual context between tasks, making it more likely to successfully complete complex sequences of actions.
|
||||
|
||||
## Understanding the Response Format
|
||||
|
||||
Each action taken by the agent returns a structured response following the OpenAI Agent SDK specification. This standardized format makes it easy to extract detailed information about what the agent is doing and why:
|
||||
|
||||
```python
|
||||
async for result in agent.run(task):
|
||||
# Basic information
|
||||
print(f"Response ID: {result.get('id')}")
|
||||
print(f"Response Text: {result.get('text')}")
|
||||
|
||||
# Detailed token usage statistics
|
||||
usage = result.get('usage')
|
||||
if usage:
|
||||
print(f"Input Tokens: {usage.get('input_tokens')}")
|
||||
print(f"Output Tokens: {usage.get('output_tokens')}")
|
||||
|
||||
# Reasoning and actions
|
||||
for output in result.get('output', []):
|
||||
if output.get('type') == 'reasoning':
|
||||
print(f"Reasoning: {output.get('summary', [{}])[0].get('text')}")
|
||||
elif output.get('type') == 'computer_call':
|
||||
action = output.get('action', {})
|
||||
print(f"Action: {action.get('type')} at ({action.get('x')}, {action.get('y')})")
|
||||
```
|
||||
|
||||
This structured format allows you to:
|
||||
- Log detailed information about agent actions
|
||||
- Provide real-time feedback to users
|
||||
- Track token usage for cost monitoring (see the sketch after this list)
|
||||
- Access the reasoning behind decisions for debugging or user explanation
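For instance, a rough cost monitor only needs to accumulate the `usage` fields across a run. A sketch, assuming `agent` and `task` are defined as in the earlier examples; the per-token prices are placeholders you would replace with your provider's current rates:

```python
# Placeholder prices, not real rates
INPUT_PRICE_PER_1K = 0.0025
OUTPUT_PRICE_PER_1K = 0.01

total_input, total_output = 0, 0

async for result in agent.run(task):
    usage = result.get("usage") or {}
    total_input += usage.get("input_tokens", 0)
    total_output += usage.get("output_tokens", 0)

estimated_cost = (total_input / 1000) * INPUT_PRICE_PER_1K + (total_output / 1000) * OUTPUT_PRICE_PER_1K
print(f"Tokens: {total_input} in / {total_output} out, ~${estimated_cost:.4f}")
```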
|
||||
|
||||
## Using Local Models with OMNI
|
||||
|
||||
One of the most powerful features of the framework is the ability to use local models via the OMNI loop. This approach dramatically reduces costs while maintaining acceptable reliability for many agentic workflows:
|
||||
|
||||
**How to run this example:**
|
||||
|
||||
1. First, you'll need to install Ollama for running local models:
|
||||
- Visit [ollama.com](https://ollama.com) and download the installer for your OS
|
||||
- Follow the installation instructions
|
||||
- Pull the Gemma 3 model:
|
||||
```bash
|
||||
ollama pull gemma3:4b-it-q4_K_M
|
||||
```
|
||||
|
||||
2. Create a file named `local_model.py` with this code:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def run_with_local_model():
|
||||
async with Computer() as macos_computer:
|
||||
agent = ComputerAgent(
|
||||
model="omniparser+ollama_chat/gemma3",
|
||||
tools=[macos_computer]
|
||||
)
|
||||
|
||||
task = "Open the Calculator app and perform a simple calculation"
|
||||
|
||||
async for result in agent.run(task):
|
||||
print(f"Action: {result.get('text')}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(run_with_local_model())
|
||||
```
|
||||
|
||||
3. Run the script:
|
||||
```bash
|
||||
python local_model.py
|
||||
```
|
||||
|
||||
You can also use other local model servers with the OAICOMPAT provider, which enables compatibility with any API endpoint following the OpenAI API structure:
|
||||
|
||||
```python
|
||||
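# Note: this assumes LLM and LLMProvider are imported from the agent package
# (e.g. `from agent import LLM, LLMProvider`); the exact import may vary by version.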
agent = ComputerAgent(
|
||||
model=LLM(
|
||||
provider=LLMProvider.OAICOMPAT,
|
||||
name="gemma-3-12b-it",
|
||||
provider_base_url="http://localhost:1234/v1" # LM Studio endpoint
|
||||
),
|
||||
tools=[macos_computer]
|
||||
)
|
||||
```
|
||||
|
||||
Common local endpoints include:
|
||||
- LM Studio: `http://localhost:1234/v1`
|
||||
- vLLM: `http://localhost:8000/v1`
|
||||
- LocalAI: `http://localhost:8080/v1`
|
||||
- Ollama with OpenAI compat: `http://localhost:11434/v1`
|
||||
|
||||
This approach is perfect for:
|
||||
- Development and testing without incurring API costs
|
||||
- Offline or air-gapped environments where API access isn't possible
|
||||
- Privacy-sensitive applications where data can't leave your network
|
||||
- Experimenting with different models to find the best fit for your use case
|
||||
|
||||
## Deploying and Using UI-TARS
|
||||
|
||||
UI-TARS is ByteDance's Computer-Use model designed for navigating OS-level interfaces. It shows excellent performance on desktop OS tasks. To use UI-TARS, you'll first need to deploy the model.
|
||||
|
||||
### Deployment Options
|
||||
|
||||
1. **Local Deployment**: Follow the [UI-TARS deployment guide](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md) to run the model locally.
|
||||
|
||||
2. **Hugging Face Endpoint**: Deploy UI-TARS on Hugging Face Inference Endpoints, which will give you a URL like:
|
||||
`https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1`
|
||||
|
||||
3. **Using with cua-agent**: Once deployed, you can use UI-TARS with the cua-agent framework:
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
model=LLM(
|
||||
provider=LLMProvider.OAICOMPAT,
|
||||
name="tgi",
|
||||
provider_base_url="https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1"
|
||||
),
|
||||
tools=[macos_computer]
|
||||
)
|
||||
```
|
||||
|
||||
UI-TARS is particularly useful for desktop automation tasks, as it shows the highest performance on OS-level benchmarks like OSworld and Windows Agent Arena.
|
||||
|
||||
## Understanding Agent Responses in Detail
|
||||
|
||||
The `run()` method of your agent yields structured responses that follow the OpenAI Agent SDK specification. This provides a rich set of information beyond just the basic action text:
|
||||
|
||||
```python
|
||||
async for result in agent.run(task):
|
||||
# Basic ID and text
|
||||
print("Response ID:", result.get("id"))
|
||||
print("Response Text:", result.get("text"))
|
||||
|
||||
# Token usage statistics
|
||||
usage = result.get("usage")
|
||||
if usage:
|
||||
print("\nUsage Details:")
|
||||
print(f" Input Tokens: {usage.get('input_tokens')}")
|
||||
if "input_tokens_details" in usage:
|
||||
print(f" Input Tokens Details: {usage.get('input_tokens_details')}")
|
||||
print(f" Output Tokens: {usage.get('output_tokens')}")
|
||||
if "output_tokens_details" in usage:
|
||||
print(f" Output Tokens Details: {usage.get('output_tokens_details')}")
|
||||
print(f" Total Tokens: {usage.get('total_tokens')}")
|
||||
|
||||
# Detailed reasoning and actions
|
||||
outputs = result.get("output", [])
|
||||
for output in outputs:
|
||||
output_type = output.get("type")
|
||||
if output_type == "reasoning":
|
||||
print("\nReasoning:")
|
||||
for summary in output.get("summary", []):
|
||||
print(f" {summary.get('text')}")
|
||||
elif output_type == "computer_call":
|
||||
action = output.get("action", {})
|
||||
print("\nComputer Action:")
|
||||
print(f" Type: {action.get('type')}")
|
||||
print(f" Position: ({action.get('x')}, {action.get('y')})")
|
||||
if action.get("text"):
|
||||
print(f" Text: {action.get('text')}")
|
||||
```
|
||||
|
||||
This detailed information is invaluable for debugging, logging, and understanding the agent's decision-making process in an agentic system. More details can be found in the [OpenAI Agent SDK Specification](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
|
||||
|
||||
## Building a Gradio UI
|
||||
|
||||
For a visual interface to your agent, the package also includes a Gradio UI:
|
||||
|
||||
**How to run the Gradio UI:**
|
||||
|
||||
1. Create a file named `launch_ui.py` with the following code:
|
||||
|
||||
```python
|
||||
from agent.ui.gradio.app import create_gradio_ui
|
||||
|
||||
# Create and launch the UI
|
||||
if __name__ == "__main__":
|
||||
app = create_gradio_ui()
|
||||
app.launch(share=False) # Set share=False for local access only
|
||||
```
|
||||
|
||||
2. Install the UI dependencies if you haven't already:
|
||||
```bash
|
||||
pip install "cua-agent[ui]"
|
||||
```
|
||||
|
||||
3. Run the script:
|
||||
```bash
|
||||
python launch_ui.py
|
||||
```
|
||||
|
||||
4. Open your browser to the displayed URL (usually http://127.0.0.1:7860)
|
||||
|
||||
**Creating a Shareable Link (Optional):**
|
||||
|
||||
You can also create a temporary public URL to access your Gradio UI from anywhere:
|
||||
|
||||
```python
|
||||
# In launch_ui.py
|
||||
if __name__ == "__main__":
|
||||
app = create_gradio_ui()
|
||||
app.launch(share=True) # Creates a public link
|
||||
```
|
||||
|
||||
When you run this, Gradio will display both a local URL and a public URL like:
|
||||
```
|
||||
Running on local URL: http://127.0.0.1:7860
|
||||
Running on public URL: https://abcd1234.gradio.live
|
||||
```
|
||||
|
||||
**Security Note:** Be cautious when sharing your Gradio UI publicly:
|
||||
- The public URL gives anyone with the link full access to your agent
|
||||
- Consider using basic authentication for additional protection:
|
||||
```python
|
||||
app.launch(share=True, auth=("username", "password"))
|
||||
```
|
||||
- Only use this feature for personal or team use, not for production environments
|
||||
- The temporary link expires when you stop the Gradio application
|
||||
|
||||
The Gradio UI provides:
|
||||
- Model provider selection
|
||||
- Agent loop selection
|
||||
- Task input field
|
||||
- Real-time display of VM screenshots
|
||||
- Action history
|
||||
|
||||
### Setting API Keys for the UI
|
||||
|
||||
To use the UI with different providers, set your API keys as environment variables:
|
||||
|
||||
```bash
|
||||
# For OpenAI models
|
||||
export OPENAI_API_KEY=your_openai_key_here
|
||||
|
||||
# For Anthropic models
|
||||
export ANTHROPIC_API_KEY=your_anthropic_key_here
|
||||
|
||||
# Launch with both keys set
|
||||
OPENAI_API_KEY=your_key ANTHROPIC_API_KEY=your_key python launch_ui.py
|
||||
```
|
||||
|
||||
### UI Settings Persistence
|
||||
|
||||
The Gradio UI automatically saves your configuration to maintain your preferences between sessions:
|
||||
|
||||
- Settings like Agent Loop, Model Choice, Custom Base URL, and configuration options are saved to `.gradio_settings.json` in the project's root directory
|
||||
- These settings are loaded automatically when you restart the UI
|
||||
- API keys entered in the custom provider field are **not** saved for security reasons
|
||||
- It's recommended to add `.gradio_settings.json` to your `.gitignore` file
|
||||
|
||||
## Advanced Example: GitHub Repository Workflow
|
||||
|
||||
Let's look at a more complex example that automates a GitHub workflow:
|
||||
|
||||
**How to run this advanced example:**
|
||||
|
||||
1. Create a file named `github_workflow.py` with the following code:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import logging
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def github_workflow():
|
||||
async with Computer(verbosity=logging.INFO) as macos_computer:
|
||||
agent = ComputerAgent(
|
||||
model="openai/computer-use-preview",
|
||||
save_trajectory=True, # Save screenshots for debugging
|
||||
only_n_most_recent_images=3, # Only keep last 3 images in context
|
||||
verbosity=logging.INFO,
|
||||
tools=[macos_computer]
|
||||
)
|
||||
|
||||
tasks = [
|
||||
"Look for a repository named trycua/cua on GitHub.",
|
||||
"Check the open issues, open the most recent one and read it.",
|
||||
"Clone the repository in users/lume/projects if it doesn't exist yet.",
|
||||
"Open the repository with Cursor (on the dock, black background and white cube icon).",
|
||||
"From Cursor, open Composer if not already open.",
|
||||
"Focus on the Composer text area, then write and submit a task to help resolve the GitHub issue.",
|
||||
]
|
||||
|
||||
for i, task in enumerate(tasks):
|
||||
print(f"\nExecuting task {i+1}/{len(tasks)}: {task}")
|
||||
async for result in agent.run(task):
|
||||
print(f"Action: {result.get('text')}")
|
||||
print(f"✅ Task {i+1}/{len(tasks)} completed")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(github_workflow())
|
||||
```
|
||||
|
||||
2. Make sure your OpenAI API key is set:
|
||||
```bash
|
||||
export OPENAI_API_KEY=your_openai_key_here
|
||||
```
|
||||
|
||||
3. Run the script:
|
||||
```bash
|
||||
python github_workflow.py
|
||||
```
|
||||
|
||||
4. Watch as the agent completes the entire workflow:
|
||||
- The agent will navigate to GitHub
|
||||
- Find and investigate issues in the repository
|
||||
- Clone the repository to the local machine
|
||||
- Open it in Cursor
|
||||
- Use Cursor's AI features to work on a solution
|
||||
|
||||
This example:
|
||||
1. Searches GitHub for a repository
|
||||
2. Reads an issue
|
||||
3. Clones the repository
|
||||
4. Opens it in an IDE
|
||||
5. Uses AI to write a solution
|
||||
|
||||
## Comparing Implementation Approaches
|
||||
|
||||
Let's compare our manual implementation from Part 1 with the framework approach:
|
||||
|
||||
### Manual Implementation (Part 1)
|
||||
- Required writing custom code for the interaction loop
|
||||
- Needed explicit handling of different action types
|
||||
- Required direct management of the OpenAI API calls
|
||||
- Around 50-100 lines of code for basic functionality
|
||||
- Limited to OpenAI's computer-use model
|
||||
|
||||
### Framework Implementation (Part 2)
|
||||
- Abstracts the interaction loop
|
||||
- Handles all action types automatically
|
||||
- Manages API calls internally
|
||||
- Only 10-15 lines of code for the same functionality
|
||||
- Works with multiple model providers
|
||||
- Includes UI capabilities
|
||||
|
||||
## Conclusion
|
||||
|
||||
The `cua-agent` framework transforms what was a complex implementation task into a simple, high-level interface for building Computer-Use Agents. By abstracting away the technical details, it lets you focus on defining the tasks rather than the machinery.
|
||||
|
||||
### When to Use Each Approach
|
||||
- **Manual Implementation (Part 1)**: When you need complete control over the interaction loop or are implementing a custom solution
|
||||
- **Framework (Part 2)**: For most applications where you want to quickly build and deploy Computer-Use Agents
|
||||
|
||||
### Next Steps
|
||||
With the basics covered, you might want to explore:
|
||||
- Customizing the agent's behavior with additional parameters
|
||||
- Building more complex workflows spanning multiple applications
|
||||
- Integrating your agent into other applications
|
||||
- Contributing to the open-source project on GitHub
|
||||
|
||||
### Resources
|
||||
- [cua-agent GitHub repository](https://github.com/trycua/cua/tree/main/libs/agent)
|
||||
- [Agent Notebook Examples](https://github.com/trycua/cua/blob/main/notebooks/agent_nb.ipynb)
|
||||
- [OpenAI Agent SDK Specification](https://platform.openai.com/docs/api-reference/responses)
|
||||
- [Anthropic API Documentation](https://docs.anthropic.com/en/api/getting-started)
|
||||
- [UI-TARS GitHub](https://github.com/ByteDance/UI-TARS)
|
||||
- [OmniParser GitHub](https://github.com/microsoft/OmniParser)
|
||||
74
blog/composite-agents.md
Normal file
@@ -0,0 +1,74 @@
|
||||
# Announcing Cua Agent framework 0.4 and Composite Agents
|
||||
|
||||
*Published on August 26, 2025 by Dillon DuPont*
|
||||
|
||||
<img src="/composite-agents.png" alt="Composite Agents">
|
||||
|
||||
So you want to build an agent that can use a computer. Great! You've probably discovered that there are now dozens of different AI models that claim they can click GUI buttons and fill out forms. Less great: actually getting them to work together is like trying to coordinate a group project where everyone speaks a different language and has invented seventeen different ways to say "click here".
|
||||
|
||||
Here's the thing about new GUI models: they're all special snowflakes. One model wants you to feed it images and expects coordinates back as percentages from 0 to 1. Another wants absolute pixel coordinates. A third model has invented its own numeral system with `<|loc095|><|loc821|>` tokens inside tool calls. Some models output Python code that calls `pyautogui.click(x, y)`. Others will start hallucinating coordinates if you forget to format all previous messages within a very specific GUI system prompt.
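As a purely illustrative example of the glue code this forces on you, translating one model's normalized 0-1 coordinates into the absolute pixel coordinates another tool expects looks roughly like this (the framework now does this kind of conversion for you):

```python
def normalized_to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    """Convert 0-1 normalized coordinates into absolute pixel coordinates."""
    return round(x_norm * width), round(y_norm * height)

# A model that speaks in percentages says "click at (0.42, 0.87)";
# a 1920x1080 screen needs pixel coordinates instead:
print(normalized_to_pixels(0.42, 0.87, 1920, 1080))  # (806, 940)
```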
|
||||
|
||||
This is the kind of problem that makes you wonder if we're building the future of computing or just recreating the Tower of Babel with more GPUs.
|
||||
|
||||
## What we fixed
|
||||
|
||||
Agent framework 0.4 solves this by doing something radical: making all these different models speak the same language.
|
||||
|
||||
Instead of writing separate code for each model's peculiarities, you now just pick a model with a string like `"anthropic/claude-3-5-sonnet-20241022"` or `"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"`, and everything else Just Works™. Behind the scenes, we handle all the coordinate normalization, token parsing, and image preprocessing so you don't have to.
|
||||
|
||||
```python
|
||||
# This works the same whether you're using Anthropic, OpenAI, or that new model you found on Hugging Face
|
||||
agent = ComputerAgent(
|
||||
model="anthropic/claude-3-5-sonnet-20241022", # or any other supported model
|
||||
tools=[computer]
|
||||
)
|
||||
```
|
||||
|
||||
The output format is consistent across all providers (OpenAI, Anthropic, Vertex, Hugging Face, OpenRouter, etc.). No more writing different parsers for each model's creative interpretation of how to represent a mouse click.
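Concretely, you consume the same response stream no matter which provider produced it: every result is a dict whose `output` items carry either reasoning or a `computer_call`. A sketch, reusing the `agent` from the snippet above and the response schema described in our earlier posts:

```python
# The same parsing code works regardless of which model produced the response.
async for result in agent.run("Star the trycua/cua repository on GitHub"):
    for item in result.get("output", []):
        if item.get("type") == "reasoning":
            for summary in item.get("summary", []):
                print("thought:", summary.get("text"))
        elif item.get("type") == "computer_call":
            action = item.get("action", {})
            print("action:", action.get("type"), action.get("x"), action.get("y"))
```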
|
||||
|
||||
## Composite Agents: Two Brains Are Better Than One
|
||||
|
||||
Here's where it gets interesting. We realized that you don't actually need one model to be good at everything. Some models are excellent at understanding what's on the screen—they can reliably identify buttons and text fields and figure out where to click. Other models are great at planning and reasoning but might be a bit fuzzy on the exact pixel coordinates.
|
||||
|
||||
So we let you combine them with a `+` sign:
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
# specify the grounding model first, then the planning model
|
||||
model="huggingface-local/HelloKKMe/GTA1-7B+huggingface-local/OpenGVLab/InternVL3_5-8B",
|
||||
tools=[computer]
|
||||
)
|
||||
```
|
||||
|
||||
This creates a composite agent where one model (the "grounding" model) handles the visual understanding and precise UI interactions, while the other (the "planning" model) handles the high-level reasoning and task orchestration. It's like having a pilot and a navigator, except they're both AI models and they're trying to help you star a GitHub repository.
|
||||
|
||||
You can even take a model that was never designed for computer use—like GPT-4o—and give it GUI capabilities by pairing it with a specialized vision model:
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
|
||||
tools=[computer]
|
||||
)
|
||||
```
|
||||
|
||||
## Example notebook
|
||||
|
||||
For a full, ready-to-run demo (install deps, local computer using Docker, and a composed agent example), see the notebook:
|
||||
|
||||
- https://github.com/trycua/cua/blob/models/opencua/notebooks/composite_agents_docker_nb.ipynb
|
||||
|
||||
## What's next
|
||||
|
||||
We're building integration with HUD evals, allowing us to curate and benchmark model combinations. This will help us identify which composite agent pairs work best for different types of tasks, and provide you with tested recommendations rather than just throwing model names at the wall to see what sticks.
|
||||
|
||||
If you try out version 0.4.x, we'd love to hear how it goes. Join us on Discord to share your results and let us know what model combinations work best for your projects.
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
* **Composite Agent Docs:** [https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents)
|
||||
* **Discord:** [https://discord.gg/cua-ai](https://discord.gg/cua-ai)
|
||||
|
||||
Questions or weird edge cases? Ping us on Discord—we’re curious to see what you build.
|
||||
79
blog/cua-hackathon.md
Normal file
@@ -0,0 +1,79 @@
|
||||
# Computer-Use Agents SOTA Challenge: Hack the North + Global Online
|
||||
|
||||
*Published on August 25, 2025 by Francesco Bonacci*
|
||||
|
||||
We’re bringing something new to [Hack the North](https://hackthenorth.com), Canada’s largest hackathon, this year: a head-to-head competition for **Computer-Use Agents** - on-site at Waterloo and a **Global online challenge**. From September 12–14, 2025, teams build on the **Cua Agent Framework** and are scored in **HUD’s OSWorld-Verified** environment to push past today’s SOTA on [OS-World](https://os-world.github.io).
|
||||
|
||||
<img src="/hack-the-north.png">
|
||||
|
||||
## Track A: On-site @ Hack the North
|
||||
|
||||
There’s one global leaderboard: **Cua - Best State-of-the-Art Computer-Use Agent**. Use any model setup you like (cloud or local). After projects are submitted, [HUD](https://www.hud.so) runs the official benchmark; the top team earns a **guaranteed YC partner interview (W26 batch)**. We’ll also feature winners on our blog and socials and kit the team out with swag.
|
||||
|
||||
## Track B: Cua Global Online Hackathon
|
||||
|
||||
**Cua** and [**Ollama**](https://ollama.com) are organizing a global hackathon to find the **most creative uses of local and hybrid computer-use agents**. There are no geographic restrictions on who can join — this is a worldwide competition focused on **originality, impact, and inventive applications** that showcase what's possible with local and hybrid inference.
|
||||
|
||||
**Prizes:**
|
||||
- 1st **MacBook Air M4 (or equivalent value)** + features in Cua & Ollama channels
|
||||
- 2nd **$500 CAD + swag**
|
||||
- 3rd **swag + public feature**
|
||||
|
||||
---
|
||||
|
||||
## How it works
|
||||
|
||||
Two different tracks, two different processes:
|
||||
|
||||
### On-site (Track A)
|
||||
Build during the weekend and submit a repo with a one-line start command. **HUD** executes your command in a clean environment and runs **OSWorld-Verified**. Scores come from official benchmark results; ties break by median, then wall-clock time, then earliest submission. Any model setup is allowed (cloud or local).
|
||||
|
||||
**HUD** runs official evaluations immediately after submission. Winners are announced at the **closing ceremony**.
|
||||
|
||||
### Rules
|
||||
- Fork and star the [Cua repo](https://github.com/trycua/cua).
|
||||
- Add your agent and instructions in `samples/community/hack-the-north/<YOUR_TEAM_NAME>`.
|
||||
- Include a README with details on the approach and any required notes.
|
||||
- Submit a PR.
|
||||
|
||||
**Deadline: Sept 15, 8:00 AM EDT**
|
||||
|
||||
### Global Online (Track B)
|
||||
Open to anyone, anywhere. Build on your own timeline and submit through the **Cua Discord form** by the deadline.
|
||||
|
||||
**Project Requirements:**
|
||||
- Your agent must integrate **Cua and Ollama** in some way
|
||||
- Your agent must be **easily runnable by judges**
|
||||
|
||||
Judged by **Cua** and **Ollama** teams on:
|
||||
- **Creativity (30%)** – originality, usefulness, surprise factor
|
||||
- **Technical Depth (30%)** – quality of engineering and agent design
|
||||
- **Use of Ollama (30%)** – effective integration of local/hybrid inference
|
||||
- **Polish (10%)** – presentation, clarity, demo readiness
|
||||
|
||||
### Submission Process
|
||||
Submissions will be collected via a **form link provided in the Cua Discord**. Your submission must contain:
|
||||
|
||||
- **GitHub repo** containing the agent source code and a clear README with instructions on how to use the agent
|
||||
- **Explanation** of the models and tools used, and what's local or hybrid about your design
|
||||
- **Short demo video** (up to two minutes)
|
||||
|
||||
A **commit freeze** will be used to ensure that no changes are made after the deadline. Winners will be announced after judging is complete.
|
||||
|
||||
**Deadline: Sept 28, 11:59 PM UTC (extended due to popular demand!)**
|
||||
|
||||
---
|
||||
|
||||
## Join us
|
||||
|
||||
Bring a team, pick a model stack, and push what agents can do on real computers. We can’t wait to see what you build at **Hack the North 2025**.
|
||||
|
||||
**Discord channels**
|
||||
- Join the Discord first: https://discord.gg/cua-ai
|
||||
- **#hack-the-north (on-site):** https://discord.com/channels/1328377437301641247/1409508526774157342
|
||||
- **#global-online (Ollama × Cua):** https://discord.com/channels/1328377437301641247/1409518100491145226
|
||||
|
||||
**Contact**
|
||||
Questions on Hack the North? Email **hackthenorth@trycua.com**.
|
||||
|
||||
*P.S. If you’re planning ahead, start with the Cua Agent Framework and OSWorld-Verified docs at docs.trycua.com; we’ll share office-hour times in both Discord channels.*
|
||||
93
blog/hud-agent-evals.md
Normal file
@@ -0,0 +1,93 @@
|
||||
# Cua × HUD - Evaluate Any Computer-Use Agent
|
||||
|
||||
*Published on August 27, 2025 by Dillon DuPont*
|
||||
|
||||
You can now benchmark any GUI-capable agent on real computer-use tasks through our new integration with [HUD](https://hud.so), the evaluation platform for computer-use agents.
|
||||
|
||||
If [yesterday's 0.4 release](composite-agents.md) made it easy to compose planning and grounding models, today's update makes it easy to measure them. Configure your model, run evaluations at scale, and watch traces live in HUD.
|
||||
|
||||
<img src="/hud-agent-evals.png" alt="Cua × HUD">
|
||||
|
||||
## What you get
|
||||
|
||||
- One-line evals on OSWorld (and more) for OpenAI, Anthropic, Hugging Face, and composed GUI models.
|
||||
- Live traces at [app.hud.so](https://app.hud.so) to see every click, type, and screenshot.
|
||||
- Zero glue code needed - we wrapped the interface for you.
|
||||
- With Cua's Agent SDK, you can benchmark any configurations of models, by just changing the `model` string.
|
||||
|
||||
## What is OSWorld?
|
||||
|
||||
[OSWorld](https://os-world.github.io) is a comprehensive evaluation benchmark developed by XLang Labs, comprising 369 real-world computer-use tasks spanning diverse desktop environments (Chrome, LibreOffice, GIMP, VS Code, etc.). It has emerged as the de facto standard for evaluating multimodal agents in realistic computing environments, adopted by leading AI research teams at OpenAI, Anthropic, and other major institutions for systematic agent assessment. The benchmark was recently enhanced to [OSWorld-Verified](https://xlang.ai/blog/osworld-verified), incorporating rigorous validation improvements that address over 300 community-identified issues to ensure evaluation reliability and reproducibility.
|
||||
|
||||
## Environment Setup
|
||||
|
||||
First, set up your environment variables:
|
||||
|
||||
```bash
|
||||
export HUD_API_KEY="your_hud_api_key" # Required for HUD access
|
||||
export ANTHROPIC_API_KEY="your_anthropic_key" # For Claude models
|
||||
export OPENAI_API_KEY="your_openai_key" # For OpenAI models
|
||||
```
|
||||
|
||||
## Try it
|
||||
|
||||
### Quick Start - Single Task
|
||||
|
||||
```python
|
||||
from agent.integrations.hud import run_single_task
|
||||
|
||||
await run_single_task(
|
||||
dataset="hud-evals/OSWorld-Verified-XLang",
|
||||
model="openai/computer-use-preview+openai/gpt-5-nano", # or any supported model string
|
||||
task_id=155 # open last tab task (easy)
|
||||
)
|
||||
```
|
||||
|
||||
### Run a dataset (parallel execution)
|
||||
|
||||
```python
|
||||
from agent.integrations.hud import run_full_dataset
|
||||
|
||||
# Test on OSWorld (367 computer-use tasks)
|
||||
await run_full_dataset(
|
||||
dataset="hud-evals/OSWorld-Verified-XLang",
|
||||
model="openai/computer-use-preview+openai/gpt-5-nano", # any supported model string
|
||||
split="train[:3]" # try a few tasks to start
|
||||
)
|
||||
|
||||
# Or test on SheetBench (50 spreadsheet tasks)
|
||||
await run_full_dataset(
|
||||
dataset="hud-evals/SheetBench-V2",
|
||||
model="anthropic/claude-3-5-sonnet-20241022",
|
||||
split="train[:2]"
|
||||
)
|
||||
```
|
||||
|
||||
### Live Environment Streaming
|
||||
|
||||
Watch your agent work in real-time. Example output:
|
||||
|
||||
```md
|
||||
Starting full dataset run...
|
||||
╔═════════════════════════════════════════════════════════════════╗
|
||||
║ 🚀 See your agent live at: ║
|
||||
╟─────────────────────────────────────────────────────────────────╢
|
||||
║ https://app.hud.so/jobs/fe05805d-4da9-4fc6-84b5-5c518528fd3c ║
|
||||
╚═════════════════════════════════════════════════════════════════╝
|
||||
```
|
||||
|
||||
## Configuration Options
|
||||
|
||||
Customize your evaluation with these options (a combined sketch follows the list):
|
||||
|
||||
- **Environment types**: `environment="linux"` (OSWorld) or `environment="browser"` (SheetBench)
|
||||
- **Model composition**: Mix planning and grounding models with `+` (e.g., `"gpt-4+gpt-5-nano"`)
|
||||
- **Parallel scaling**: Set `max_concurrent_tasks` for throughput
|
||||
- **Local trajectories**: Save with `trajectory_dir` for offline analysis
|
||||
- **Live monitoring**: Every run gets a unique trace URL at app.hud.so
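Putting a few of these together, a sketch that follows the option names above (exact parameters may differ slightly between versions):

```python
from agent.integrations.hud import run_full_dataset

# Combine environment, a composed model, parallelism, and local trajectory saving.
await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",
    model="openai/computer-use-preview+openai/gpt-5-nano",
    split="train[:10]",
    environment="linux",           # OSWorld runs in a Linux environment
    max_concurrent_tasks=5,        # parallel scaling
    trajectory_dir="trajectories"  # save runs locally for offline analysis
)
```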
|
||||
|
||||
## Learn more
|
||||
|
||||
- Notebook with end‑to‑end examples: https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb
|
||||
- Docs: https://docs.trycua.com/docs/agent-sdk/integrations/hud
|
||||
- Live traces: https://app.hud.so
|
||||
211
blog/human-in-the-loop.md
Normal file
@@ -0,0 +1,211 @@
|
||||
# When Agents Need Human Wisdom - Introducing Human-In-The-Loop Support
|
||||
|
||||
*Published on August 29, 2025 by Francesco Bonacci*
|
||||
|
||||
Sometimes the best AI agent is a human. Whether you're creating training demonstrations, evaluating complex scenarios, or need to intervene when automation hits a wall, our new Human-In-The-Loop integration puts you directly in control.
|
||||
|
||||
With yesterday's [HUD evaluation integration](hud-agent-evals.md), you could benchmark any agent at scale. Today's update lets you *become* the agent when it matters most—seamlessly switching between automated intelligence and human judgment.
|
||||
|
||||
<video width="100%" controls>
|
||||
<source src="/human-in-the-loop.mp4" type="video/mp4">
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
## What you get
|
||||
|
||||
- **One-line human takeover** for any agent configuration with `human/human` or `model+human/human`
|
||||
- **Interactive web UI** to see what your agent sees and control what it does
|
||||
- **Zero context switching** - step in exactly where automation left off
|
||||
- **Training data generation** - create perfect demonstrations by doing tasks yourself
|
||||
- **Ground truth evaluation** - validate agent performance with human expertise
|
||||
|
||||
## Why Human-In-The-Loop?
|
||||
|
||||
Even the most sophisticated agents encounter edge cases, ambiguous interfaces, or tasks requiring human judgment. Rather than failing gracefully, they can now fail *intelligently*—by asking for human help.
|
||||
|
||||
This approach bridges the gap between fully automated systems and pure manual control, letting you:
|
||||
- **Demonstrate complex workflows** that agents can learn from
|
||||
- **Evaluate tricky scenarios** where ground truth requires human assessment
|
||||
- **Intervene selectively** when automated agents need guidance
|
||||
- **Test and debug** your tools and environments manually
|
||||
|
||||
## Getting Started
|
||||
|
||||
Launch the human agent interface:
|
||||
|
||||
```bash
|
||||
python -m agent.human_tool
|
||||
```
|
||||
|
||||
The web UI will show pending completions. Click any completion to take control of the agent and see exactly what it sees.
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Direct Human Control
|
||||
|
||||
Perfect for creating demonstrations or when you want full manual control:
|
||||
|
||||
```python
|
||||
from agent import ComputerAgent
|
||||
from agent.computer import computer
|
||||
|
||||
agent = ComputerAgent(
|
||||
"human/human",
|
||||
tools=[computer]
|
||||
)
|
||||
|
||||
# You'll get full control through the web UI
|
||||
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
|
||||
pass
|
||||
```
|
||||
|
||||
### Hybrid: AI Planning + Human Execution
|
||||
|
||||
Combine model intelligence with human precision—let AI plan, then execute manually:
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
"huggingface-local/HelloKKMe/GTA1-7B+human/human",
|
||||
tools=[computer]
|
||||
)
|
||||
|
||||
# AI creates the plan, human executes each step
|
||||
async for _ in agent.run("Navigate to the settings page and enable dark mode"):
|
||||
pass
|
||||
```
|
||||
|
||||
### Fallback Pattern
|
||||
|
||||
Start automated, escalate to human when needed:
|
||||
|
||||
```python
|
||||
# Primary automated agent
|
||||
primary_agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
|
||||
|
||||
# Human fallback agent
|
||||
fallback_agent = ComputerAgent("human/human", tools=[computer])
|
||||
|
||||
try:
|
||||
async for result in primary_agent.run(task):
|
||||
        # "confidence" is illustrative here; results are dicts, so read it defensively
        if result.get("confidence", 1.0) < 0.7:  # Low confidence threshold
|
||||
# Seamlessly hand off to human
|
||||
async for _ in fallback_agent.run(f"Continue this task: {task}"):
|
||||
pass
|
||||
except Exception:
|
||||
# Agent failed, human takes over
|
||||
async for _ in fallback_agent.run(f"Handle this failed task: {task}"):
|
||||
pass
|
||||
```
|
||||
|
||||
## Interactive Features
|
||||
|
||||
The human-in-the-loop interface provides a rich, responsive experience:
|
||||
|
||||
### **Visual Environment**
|
||||
- **Screenshot display** with live updates as you work
|
||||
- **Click handlers** for direct interaction with UI elements
|
||||
- **Zoom and pan** to see details clearly
|
||||
|
||||
### **Action Controls**
|
||||
- **Click actions** - precise cursor positioning and clicking
|
||||
- **Keyboard input** - type text naturally or send specific key combinations
|
||||
- **Action history** - see the sequence of actions taken
|
||||
- **Undo support** - step back when needed
|
||||
|
||||
### **Tool Integration**
|
||||
- **Full OpenAI compatibility** - standard tool call format
|
||||
- **Custom tools** - integrate your own tools seamlessly
|
||||
- **Real-time feedback** - see tool responses immediately
|
||||
|
||||
### **Smart Polling**
|
||||
- **Responsive updates** - UI refreshes when new completions arrive
|
||||
- **Background processing** - continue working while waiting for tasks
|
||||
- **Session persistence** - resume interrupted sessions
|
||||
|
||||
## Real-World Use Cases
|
||||
|
||||
### **Training Data Generation**
|
||||
Create perfect demonstrations for fine-tuning:
|
||||
|
||||
```python
|
||||
# Generate training examples for spreadsheet tasks
|
||||
demo_agent = ComputerAgent("human/human", tools=[computer])
|
||||
|
||||
tasks = [
|
||||
"Create a budget spreadsheet with income and expense categories",
|
||||
"Apply conditional formatting to highlight overbudget items",
|
||||
"Generate a pie chart showing expense distribution"
|
||||
]
|
||||
|
||||
for task in tasks:
|
||||
# Human demonstrates each task perfectly
|
||||
async for _ in demo_agent.run(task):
|
||||
pass # Recorded actions become training data
|
||||
```
|
||||
|
||||
### **Evaluation and Ground Truth**
|
||||
Validate agent performance on complex scenarios:
|
||||
|
||||
```python
|
||||
# Human evaluates agent performance
|
||||
evaluator = ComputerAgent("human/human", tools=[computer])
|
||||
|
||||
async for _ in evaluator.run("Review this completed form and rate accuracy (1-10)"):
|
||||
pass # Human provides authoritative quality assessment
|
||||
```
|
||||
|
||||
### **Interactive Debugging**
|
||||
Step through agent behavior manually:
|
||||
|
||||
```python
|
||||
# Test a workflow step by step
|
||||
debug_agent = ComputerAgent("human/human", tools=[computer])
|
||||
|
||||
async for _ in debug_agent.run("Reproduce the agent's failed login sequence"):
|
||||
pass # Human identifies exactly where automation breaks
|
||||
```
|
||||
|
||||
### **Edge Case Handling**
|
||||
Handle scenarios that break automated agents:
|
||||
|
||||
```python
|
||||
# Complex UI interaction requiring human judgment
|
||||
edge_case_agent = ComputerAgent("human/human", tools=[computer])
|
||||
|
||||
async for _ in edge_case_agent.run("Navigate this CAPTCHA-protected form"):
|
||||
pass # Human handles what automation cannot
|
||||
```
|
||||
|
||||
## Configuration Options
|
||||
|
||||
Customize the human agent experience:
|
||||
|
||||
- **UI refresh rate**: Adjust polling frequency for your workflow
|
||||
- **Image quality**: Balance detail vs. performance for screenshots
|
||||
- **Action logging**: Save detailed traces for analysis and training
|
||||
- **Session timeout**: Configure idle timeouts for security
|
||||
- **Tool permissions**: Restrict which tools humans can access
|
||||
|
||||
## When to Use Human-In-The-Loop
|
||||
|
||||
| **Scenario** | **Why Human Control** |
|
||||
|--------------|----------------------|
|
||||
| **Creating training data** | Perfect demonstrations for model fine-tuning |
|
||||
| **Evaluating complex tasks** | Human judgment for subjective or nuanced assessment |
|
||||
| **Handling edge cases** | CAPTCHAs, unusual UIs, context-dependent decisions |
|
||||
| **Debugging workflows** | Step through failures to identify breaking points |
|
||||
| **High-stakes operations** | Critical tasks requiring human oversight and approval |
|
||||
| **Testing new environments** | Validate tools and environments work as expected |
|
||||
|
||||
## Learn More
|
||||
|
||||
- **Interactive examples**: Try human-in-the-loop control with sample tasks
|
||||
- **Training data pipelines**: Learn how to convert human demonstrations into model training data
|
||||
- **Evaluation frameworks**: Build human-validated test suites for your agents
|
||||
- **API documentation**: Full reference for human agent configuration
|
||||
|
||||
Ready to put humans back in the loop? The most sophisticated AI system knows when to ask for help.
|
||||
|
||||
---
|
||||
|
||||
*Questions about human-in-the-loop agents? Join the conversation in our [Discord community](https://discord.gg/cua-ai) or check out our [documentation](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop).*
|
||||
232
blog/introducing-cua-cloud-containers.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# Introducing Cua Cloud Containers: Computer-Use Agents in the Cloud
|
||||
|
||||
*Published on May 28, 2025 by Francesco Bonacci*
|
||||
|
||||
Welcome to the next chapter in our Computer-Use Agent journey! In [Part 1](./build-your-own-operator-on-macos-1), we showed you how to build your own Operator on macOS. In [Part 2](./build-your-own-operator-on-macos-2), we explored the cua-agent framework. Today, we're excited to introduce **Cua Cloud Containers** – the easiest way to deploy Computer-Use Agents at scale.
|
||||
|
||||
<video width="100%" controls>
|
||||
<source src="/launch-video-cua-cloud.mp4" type="video/mp4">
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
## What is Cua Cloud?
|
||||
|
||||
Think of Cua Cloud as **Docker for Computer-Use Agents**. Instead of managing VMs, installing dependencies, and configuring environments, you can launch pre-configured cloud containers with a single command. Each container comes with a **full desktop environment** accessible via browser (via noVNC), all CUA-related dependencies pre-configured (with a PyAutoGUI-compatible server), and **pay-per-use pricing** that scales with your needs.
|
||||
|
||||
## Why Cua Cloud Containers?
|
||||
|
||||
Four months ago, we launched [**Lume**](https://github.com/trycua/cua/tree/main/libs/lume) and [**Cua**](https://github.com/trycua/cua) with the goal of bringing sandboxed VMs and Computer-Use Agents to Apple Silicon. The developer community's response was incredible 🎉
|
||||
|
||||
Going from prototype to production revealed a problem, though: **local macOS VMs don't scale**, nor are they easily portable.
|
||||
|
||||
Our Discord community, YC peers, and early pilot customers kept hitting the same issues. Storage constraints meant **20-40GB per VM** filled laptops fast. Different hardware architectures (Apple Silicon ARM vs Intel x86) prevented portability of local workflows. Every new user lost a day to setup and configuration.
|
||||
|
||||
**Cua Cloud** eliminates these constraints while preserving everything developers are familiar with about our Computer and Agent SDK.
|
||||
|
||||
### What We Built
|
||||
|
||||
Over the past month, we've been iterating over Cua Cloud with partners and beta users to address these challenges. You use the exact same `Computer` and `ComputerAgent` classes you already know, but with **zero local setup** or storage requirements. VNC access comes with **built-in encryption**, you pay only for compute time (not idle resources), and can bring your own API keys for any LLM provider.
|
||||
|
||||
The result? **Instant deployment** in seconds instead of hours, with no infrastructure to manage. Scale elastically from **1 to 100 agents** in parallel, with consistent behavior across all deployments. Share agent trajectories with your team for better collaboration and debugging.
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Step 1: Get Your API Key
|
||||
|
||||
Sign up at [**trycua.com**](https://trycua.com) to get your API key.
|
||||
|
||||
```bash
|
||||
# Set your API key in environment variables
|
||||
export CUA_API_KEY=your_api_key_here
|
||||
export CUA_CONTAINER_NAME=my-agent-container
|
||||
```
|
||||
|
||||
### Step 2: Launch Your First Container
|
||||
|
||||
```python
|
||||
import asyncio
import logging
import os
|
||||
from computer import Computer, VMProviderType
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def run_cloud_agent():
|
||||
# Create a remote Linux computer with Cua Cloud
|
||||
computer = Computer(
|
||||
os_type="linux",
|
||||
api_key=os.getenv("CUA_API_KEY"),
|
||||
name=os.getenv("CUA_CONTAINER_NAME"),
|
||||
provider_type=VMProviderType.CLOUD,
|
||||
)
|
||||
|
||||
# Create an agent with your preferred loop
|
||||
agent = ComputerAgent(
|
||||
model="openai/gpt-4o",
|
||||
save_trajectory=True,
|
||||
verbosity=logging.INFO,
|
||||
tools=[computer]
|
||||
)
|
||||
|
||||
# Run a task
|
||||
async for result in agent.run("Open Chrome and search for AI news"):
|
||||
print(f"Response: {result.get('text')}")
|
||||
|
||||
# Run the agent
|
||||
asyncio.run(run_cloud_agent())
|
||||
```
|
||||
|
||||
### Available Tiers
|
||||
|
||||
We're launching with **three compute tiers** to match your workload needs:
|
||||
|
||||
- **Small** (1 vCPU, 4GB RAM) - Perfect for simple automation tasks and testing
|
||||
- **Medium** (2 vCPU, 8GB RAM) - Ideal for most production workloads
|
||||
- **Large** (8 vCPU, 32GB RAM) - Built for complex, resource-intensive operations
|
||||
|
||||
Each tier includes a **full Linux desktop environment (Xfce)** with a pre-configured browser, **secure VNC access** over SSL, persistent storage during your session, and automatic cleanup on termination.
|
||||
|
||||
## How some customers are using Cua Cloud today
|
||||
|
||||
### Example 1: Automated GitHub Workflow
|
||||
|
||||
Let's automate a complete GitHub workflow:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import logging
import os
|
||||
from computer import Computer, VMProviderType
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def github_automation():
|
||||
"""Automate GitHub repository management tasks."""
|
||||
computer = Computer(
|
||||
os_type="linux",
|
||||
api_key=os.getenv("CUA_API_KEY"),
|
||||
name="github-automation",
|
||||
provider_type=VMProviderType.CLOUD,
|
||||
)
|
||||
|
||||
agent = ComputerAgent(
|
||||
model="openai/gpt-4o",
|
||||
save_trajectory=True,
|
||||
verbosity=logging.INFO,
|
||||
tools=[computer]
|
||||
)
|
||||
|
||||
tasks = [
|
||||
"Look for a repository named trycua/cua on GitHub.",
|
||||
"Check the open issues, open the most recent one and read it.",
|
||||
"Clone the repository if it doesn't exist yet.",
|
||||
"Create a new branch for the issue.",
|
||||
"Make necessary changes to resolve the issue.",
|
||||
"Commit the changes with a descriptive message.",
|
||||
"Create a pull request."
|
||||
]
|
||||
|
||||
for i, task in enumerate(tasks):
|
||||
print(f"\nExecuting task {i+1}/{len(tasks)}: {task}")
|
||||
async for result in agent.run(task):
|
||||
print(f"Response: {result.get('text')}")
|
||||
|
||||
# Check if any tools were used
|
||||
tools = result.get('tools')
|
||||
if tools:
|
||||
print(f"Tools used: {tools}")
|
||||
|
||||
print(f"Task {i+1} completed")
|
||||
|
||||
# Run the automation
|
||||
asyncio.run(github_automation())
|
||||
```
|
||||
|
||||
### Example 2: Parallel Web Scraping
|
||||
|
||||
Run multiple agents in parallel to scrape different websites:
|
||||
|
||||
```python
|
||||
import asyncio
import os
|
||||
from computer import Computer, VMProviderType
|
||||
from agent import ComputerAgent
|
||||
|
||||
async def scrape_website(site_name, url):
|
||||
"""Scrape a website using a cloud agent."""
|
||||
computer = Computer(
|
||||
os_type="linux",
|
||||
api_key=os.getenv("CUA_API_KEY"),
|
||||
name=f"scraper-{site_name}",
|
||||
provider_type=VMProviderType.CLOUD,
|
||||
)
|
||||
|
||||
agent = ComputerAgent(
|
||||
model="openai/gpt-4o",
|
||||
save_trajectory=True,
|
||||
tools=[computer]
|
||||
)
|
||||
|
||||
results = []
|
||||
tasks = [
|
||||
f"Navigate to {url}",
|
||||
"Extract the main headlines or article titles",
|
||||
"Take a screenshot of the page",
|
||||
"Save the extracted data to a file"
|
||||
]
|
||||
|
||||
for task in tasks:
|
||||
async for result in agent.run(task):
|
||||
results.append({
|
||||
'site': site_name,
|
||||
'task': task,
|
||||
'response': result.get('text')
|
||||
})
|
||||
|
||||
return results
|
||||
|
||||
async def parallel_scraping():
|
||||
"""Scrape multiple websites in parallel."""
|
||||
sites = [
|
||||
("ArXiv", "https://arxiv.org"),
|
||||
("HackerNews", "https://news.ycombinator.com"),
|
||||
("TechCrunch", "https://techcrunch.com")
|
||||
]
|
||||
|
||||
# Run all scraping tasks in parallel
|
||||
tasks = [scrape_website(name, url) for name, url in sites]
|
||||
results = await asyncio.gather(*tasks)
|
||||
|
||||
# Process results
|
||||
for site_results in results:
|
||||
print(f"\nResults from {site_results[0]['site']}:")
|
||||
for result in site_results:
|
||||
print(f" - {result['task']}: {result['response'][:100]}...")
|
||||
|
||||
# Run parallel scraping
|
||||
asyncio.run(parallel_scraping())
|
||||
```
|
||||
|
||||
## Cost Optimization Tips
|
||||
|
||||
To optimize your costs, use appropriate container sizes for your workload and implement timeouts to prevent runaway tasks. Batch related operations together to minimize container spin-up time, and always remember to terminate containers when your work is complete.
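For example, a simple guard against runaway tasks plus an explicit shutdown might look like this (a sketch that reuses the `agent` and `computer` objects from the examples above; `computer.stop()` is the same call used elsewhere in these posts to release a machine):

```python
import asyncio

async def run_with_timeout(agent, task, timeout_s=300):
    """Iterate the agent's stream, but give up once the time budget is spent."""
    async def _consume():
        async for result in agent.run(task):
            print(f"Response: {result.get('text')}")
    try:
        await asyncio.wait_for(_consume(), timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"Task timed out after {timeout_s}s")

try:
    await run_with_timeout(agent, "Summarize the open browser tabs", timeout_s=120)
finally:
    # Release the cloud container as soon as the work is done
    await computer.stop()
```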
|
||||
|
||||
## Security Considerations
|
||||
|
||||
Cua Cloud runs all containers in isolated environments with encrypted VNC connections. Your API keys are never exposed in trajectories.
|
||||
|
||||
## What's Next for Cua Cloud
|
||||
|
||||
We're just getting started! Here's what's coming in the next few months:
|
||||
|
||||
### Elastic Autoscaled Container Pools
|
||||
|
||||
Soon you'll be able to create elastic container pools that automatically scale based on demand. Define minimum and maximum container counts, and let Cua Cloud handle the rest. Perfect for batch processing, scheduled automations, and handling traffic spikes without manual intervention.
|
||||
|
||||
### Windows and macOS Cloud Support
|
||||
|
||||
While we're launching with Linux containers, Windows and macOS cloud machines are coming soon. Run Windows-specific automations, test cross-platform workflows, or leverage macOS-exclusive applications – all in the cloud with the same simple API.
|
||||
|
||||
Stay tuned for updates and join our [**Discord**](https://discord.gg/cua-ai) to vote on which features you'd like to see first!
|
||||
|
||||
## Get Started Today
|
||||
|
||||
Ready to deploy your Computer-Use Agents in the cloud?
|
||||
|
||||
Visit [**trycua.com**](https://trycua.com) to sign up and get your API key. Join our [**Discord community**](https://discord.gg/cua-ai) for support and explore more examples on [**GitHub**](https://github.com/trycua/cua).
|
||||
|
||||
Happy RPA 2.0! 🚀
|
||||
176
blog/lume-to-containerization.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# From Lume to Containerization: Our Journey Meets Apple's Vision
|
||||
|
||||
*Published on June 10, 2025 by Francesco Bonacci*
|
||||
|
||||
Yesterday, Apple announced their new [Containerization framework](https://github.com/apple/containerization) at WWDC. Since then, our Discord and X users have been asking what this means for Cua's virtualization capabilities on Apple Silicon. We've been working in this space for months - from [Lume](https://github.com/trycua/cua/tree/main/libs/lume) to [Lumier](https://github.com/trycua/cua/tree/main/libs/lumier) to [Cua Cloud Containers](./introducing-cua-cloud-containers). Here's our take on Apple's announcement.
|
||||
|
||||
## Our Story
|
||||
|
||||
When we started Cua, we wanted to solve a simple problem: make it easy to run VMs on Apple Silicon, with a focus on testing and deploying computer-use agents without dealing with complicated setups.
|
||||
|
||||
We decided to build on Apple's Virtualization framework because it was fast and well-designed. This became Lume, which we launched on [Hacker News](https://news.ycombinator.com/item?id=42908061).
|
||||
|
||||
Four months later, we're happy with our choice. Users are running VMs with great performance and low memory usage. Now Apple's new [Containerization](https://github.com/apple/containerization) framework builds on the same foundation - showing we were on the right track.
|
||||
|
||||
## What Apple Announced
|
||||
|
||||
Apple's Containerization framework changes how containers work on macOS. Here's what's different:
|
||||
|
||||
### How It Works
|
||||
|
||||
Instead of running all containers in one shared VM (like Docker or Colima), Apple runs each container in its own tiny VM:
|
||||
|
||||
```bash
|
||||
How Docker Works:
|
||||
┌─────────────────────────────────┐
|
||||
│ Your Mac │
|
||||
├─────────────────────────────────┤
|
||||
│ One Big Linux VM │
|
||||
├─────────────────────────────────┤
|
||||
│ Container 1 │ Container 2 │ ... │
|
||||
└─────────────────────────────────┘
|
||||
|
||||
How Apple's Framework Works:
|
||||
┌─────────────────────────────────┐
|
||||
│ Your Mac │
|
||||
├─────────────────────────────────┤
|
||||
│ Mini VM 1 │ Mini VM 2 │ Mini VM 3│
|
||||
│Container 1│Container 2│Container 3│
|
||||
└─────────────────────────────────┘
|
||||
```
|
||||
|
||||
Why is this better?
|
||||
- **Better security**: Each container is completely separate
|
||||
- **Better performance**: Each container gets its own resources
|
||||
- **Real isolation**: If one container has problems, others aren't affected
|
||||
|
||||
> **Note**: You'll need macOS Tahoe 26 Preview or later to use all features. The new [VZVMNetNetworkDeviceAttachment](https://developer.apple.com/documentation/virtualization/vzvmnetnetworkdeviceattachment) API required to fully implement the above architecture is only available there.
|
||||
|
||||
### The Technical Details
|
||||
|
||||
Here's what makes it work:
|
||||
|
||||
- **vminitd**: A tiny program that starts up each container VM super fast
|
||||
- **Fast boot**: These mini VMs start in less than a second
|
||||
- **Simple storage**: Containers are stored as ready-to-use disk images
|
||||
|
||||
Instead of using big, slow startup systems, Apple created something minimal. Each container VM boots with just what it needs - nothing more.
|
||||
|
||||
The `vminitd` part is really clever. It's the first thing that runs in each mini VM and lets the container talk to the outside world. It handles everything the container needs to work properly.
|
||||
|
||||
### What About GPU Passthrough?
|
||||
|
||||
Some developers found hints in macOS Tahoe that GPU support might be coming, through a symbol called `_VZPCIDeviceConfiguration` in the new version of the Virtualization framework. This could mean we'll be able to use GPUs inside containers and VMs soon. Imagine running local models using Ollama or LM Studio! We're not far from having fully local and isolated computer-use agents.
|
||||
|
||||
## What We've Built on top of Apple's Virtualization Framework
|
||||
|
||||
While Apple's new framework focuses on containers, we've been building VM management tools on top of the same Apple Virtualization framework. Here's what we've released:
|
||||
|
||||
### Lume: Simple VM Management
|
||||
|
||||
[Lume](https://github.com/trycua/cua/tree/main/libs/lume) is our command-line tool for creating and managing VMs on Apple Silicon. We built it because setting up VMs on macOS was too complicated.
|
||||
|
||||
What Lume does:
|
||||
- **Direct control**: Works directly with Apple's Virtualization framework
|
||||
- **Ready-to-use images**: Start a macOS or Linux VM with one command
|
||||
- **API server**: Control VMs from other programs (runs on port 7777)
|
||||
- **Smart storage**: Uses disk space efficiently
|
||||
- **Easy install**: One command to get started
|
||||
- **Share images**: Push your VM images to registries like Docker images
|
||||
|
||||
```bash
|
||||
# Install Lume
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
|
||||
|
||||
# Start a macOS VM
|
||||
lume run macos-sequoia-vanilla:latest
|
||||
```
|
||||
|
||||
### Lumier: Docker-Style VM Management
|
||||
|
||||
[Lumier](https://github.com/trycua/lumier) works differently. It lets you use Docker commands to manage VMs. But here's the key: **Docker is just for packaging, not for isolation**.
|
||||
|
||||
What makes Lumier useful:
|
||||
- **Familiar commands**: If you know Docker, you know Lumier
|
||||
- **Web access**: Connect to your VM through a browser
|
||||
- **Save your work**: VMs remember their state
|
||||
- **Share files**: Easy to move files between your Mac and the VM
|
||||
- **Automation**: Script your VM setup
|
||||
|
||||
```bash
|
||||
# Run a macOS VM with Lumier
|
||||
docker run -it --rm \
|
||||
--name macos-vm \
|
||||
-p 8006:8006 \
|
||||
-e VM_NAME=macos-vm \
|
||||
-e VERSION=ghcr.io/trycua/macos-sequoia-cua:latest \
|
||||
trycua/lumier:latest
|
||||
```
|
||||
|
||||
## Comparing the Options
|
||||
|
||||
Let's see how these three approaches stack up:
|
||||
|
||||
### How They're Built
|
||||
|
||||
```bash
|
||||
Apple Containerization:
|
||||
Your App → Container → Mini VM → Mac Hardware
|
||||
|
||||
Lume:
|
||||
Your App → Full VM → Mac Hardware
|
||||
|
||||
Lumier:
|
||||
Docker → Lume → Full VM → Mac Hardware
|
||||
```
|
||||
|
||||
### When to Use What
|
||||
|
||||
**Apple's Containerization**
|
||||
- ✅ Perfect for: Running containers with maximum security
|
||||
- ✅ Starts in under a second
|
||||
- ✅ Uses less memory and CPU
|
||||
- ❌ Needs macOS Tahoe 26 Preview
|
||||
- ❌ Only for containers, not full VMs
|
||||
|
||||
**Lume**
|
||||
- ✅ Perfect for: Development and testing
|
||||
- ✅ Full control over macOS/Linux VMs
|
||||
- ✅ Works on current macOS versions
|
||||
- ✅ Direct access to everything
|
||||
- ❌ Uses more resources than containers
|
||||
|
||||
**Lumier**
|
||||
- ✅ Perfect for: Teams already using Docker
|
||||
- ✅ Easy to share and deploy
|
||||
- ✅ Access through your browser
|
||||
- ✅ Great for automated workflows
|
||||
- ❌ Adds an extra layer of complexity
|
||||
|
||||
### Using Them Together
|
||||
|
||||
Here's the cool part - you can combine these tools:
|
||||
|
||||
1. **Create a VM**: Use Lume to set up a macOS VM
|
||||
2. **Run containers**: Use Apple's framework inside that VM (works on M3+ Macs with nested virtualization)
|
||||
|
||||
You get the best of both worlds: full VM control plus secure containers.
|
||||
|
||||
## What's Next for Cua?
|
||||
|
||||
Apple's announcement confirms we're on the right path. Here's what we're looking forward to:
|
||||
|
||||
1. **Faster VMs**: Learning from Apple's super-fast container startup and exploring whether those lessons can be applied to macOS VMs
|
||||
2. **GPU support**: Getting ready for GPU passthrough when `_VZPCIDeviceConfiguration` is made available, realistically in a stable release of macOS Tahoe 26
|
||||
|
||||
## Learn More
|
||||
|
||||
- [Apple Containerization Framework](https://github.com/apple/containerization)
|
||||
- [Lume - Direct VM Management](https://github.com/trycua/cua/tree/main/libs/lume)
|
||||
- [Lumier - Docker Interface for VMs](https://github.com/trycua/cua/tree/main/libs/lumier)
|
||||
- [Cua Cloud Containers](https://trycua.com)
|
||||
- [Join our Discord](https://discord.gg/cua-ai)
|
||||
|
||||
---
|
||||
|
||||
*Questions about virtualization on Apple Silicon? Come chat with us on Discord!*
|
||||
372
blog/sandboxed-python-execution.md
Normal file
@@ -0,0 +1,372 @@
|
||||
# Sandboxed Python Execution: Run Code Safely in Cua Containers
|
||||
|
||||
*Published on June 23, 2025 by Dillon DuPont*
|
||||
|
||||
We touched on Cua's computer-use capabilities in [Building your own Operator on macOS - Part 2](build-your-own-operator-on-macos-2.md) – your AI agents can click, scroll, type, and interact with any desktop application. But what if your agent needs to do more than just UI automation? What if it needs to process data, make API calls, analyze images, or run complex logic alongside those UI interactions, within the same virtual environment?
|
||||
|
||||
That's where Cua's `@sandboxed` decorator comes in. While Cua handles the clicking and typing, sandboxed execution lets you run full Python code inside the same virtual environment. It's like giving your AI agents a programming brain to complement their clicking fingers.
|
||||
|
||||
Think of it as the perfect marriage: Cua handles the "what you see" (UI interactions), while sandboxed Python handles the "what you compute" (data processing, logic, API calls) – all happening in the same isolated environment.
|
||||
|
||||
## So, what exactly is sandboxed execution?
|
||||
|
||||
Cua excels at automating user interfaces – clicking buttons, filling forms, navigating applications. But modern AI agents need to do more than just UI automation. They need to process the data they collect, make intelligent decisions, call external APIs, and run sophisticated algorithms.
|
||||
|
||||
Sandboxed execution bridges this gap. You write a Python function, decorate it with `@sandboxed`, and it runs inside your Cua container alongside your UI automation. Your agent can now click a button, extract some data, process it with Python, and then use those results to decide what to click next.
|
||||
|
||||
Here's what makes this combination powerful for AI agent development:
|
||||
|
||||
- **Unified environment**: Your UI automation and code execution happen in the same container
|
||||
- **Rich capabilities**: Combine Cua's clicking with Python's data processing, API calls, and libraries
|
||||
- **Seamless integration**: Pass data between UI interactions and Python functions effortlessly
|
||||
- **Cross-platform consistency**: Your Python code runs the same way across different Cua environments
|
||||
- **Complete workflows**: Build agents that can both interact with apps AND process the data they collect
|
||||
|
||||
## The architecture behind @sandboxed
|
||||
|
||||
Let's jump right into an example that'll make this crystal clear:
|
||||
|
||||
```python
|
||||
from computer.helpers import sandboxed
|
||||
|
||||
@sandboxed("demo_venv")
|
||||
def greet_and_print(name):
|
||||
"""This function runs inside the container"""
|
||||
import PyXA # macOS-specific library
|
||||
safari = PyXA.Application("Safari")
|
||||
html = safari.current_document.source()
|
||||
print(f"Hello from inside the container, {name}!")
|
||||
return {"greeted": name, "safari_html": html}
|
||||
|
||||
# When called, this executes in the container
|
||||
result = await greet_and_print("Cua")
|
||||
```
|
||||
|
||||
What's happening here? When you call `greet_and_print()`, Cua extracts the function's source code, transmits it to the container, and executes it there. The result returns to you seamlessly, while the actual execution remains completely isolated.
|
||||
|
||||
## How does sandboxed execution work?
|
||||
|
||||
Cua's sandboxed execution system employs several key architectural components:
|
||||
|
||||
### 1. Source Code Extraction
|
||||
Cua uses Python's `inspect.getsource()` to extract your function's source code and reconstruct the function definition in the remote environment.
|
||||
|
||||
### 2. Virtual Environment Isolation
|
||||
Each sandboxed function runs in a named virtual environment within the container. This provides complete dependency isolation between different functions and their respective environments.
|
||||
|
||||
### 3. Data Serialization and Transport
|
||||
Arguments and return values are serialized as JSON and transported between the host and container. This ensures compatibility across different Python versions and execution environments.
|
||||
|
||||
### 4. Comprehensive Error Handling
|
||||
The system captures both successful results and exceptions, preserving stack traces and error information for debugging purposes.
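To make that flow concrete, here's a deliberately simplified, local-only sketch of the pattern (source extraction, JSON serialization, remote execution, JSON results). It is illustrative only and not Cua's actual implementation; `run_in_container` stands in for the real transport into the container.

```python
import inspect
import json
import textwrap

def run_in_container(source: str, func_name: str, payload: str) -> str:
    """Stand-in for the container side: define the shipped function and run it."""
    namespace = {}
    exec(source, namespace)                    # 2. reconstruct the function remotely
    call = json.loads(payload)
    result = namespace[func_name](*call["args"], **call["kwargs"])
    return json.dumps({"result": result})      # 3. results travel back as JSON

def sandboxed_sketch(venv_name):
    """Toy decorator mirroring @sandboxed (venv_name is unused here; the real one selects the container venv)."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            source = textwrap.dedent(inspect.getsource(fn))
            source = source.split("\n", 1)[1]  # 1. extract the source, dropping the decorator line
            payload = json.dumps({"args": list(args), "kwargs": kwargs})
            reply = run_in_container(source, fn.__name__, payload)
            return json.loads(reply)["result"]
        return wrapper
    return decorator

@sandboxed_sketch("demo_venv")
def add(a, b):
    return a + b

print(add(2, 3))  # 5 (in Cua, the execution happens inside the container's virtual environment)
```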
|
||||
|
||||
## Getting your sandbox ready
|
||||
|
||||
Setting up sandboxed execution is simple:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from computer.computer import Computer
|
||||
from computer.helpers import sandboxed, set_default_computer
|
||||
|
||||
async def main():
|
||||
# Fire up the computer
|
||||
computer = Computer()
|
||||
await computer.run()
|
||||
|
||||
# Make it the default for all sandboxed functions
|
||||
set_default_computer(computer)
|
||||
|
||||
# Install some packages in a virtual environment
|
||||
await computer.venv_install("demo_venv", ["requests", "beautifulsoup4"])
|
||||
```
|
||||
|
||||
If you want to get fancy, you can specify which computer instance to use:
|
||||
|
||||
```python
|
||||
@sandboxed("my_venv", computer=my_specific_computer)
|
||||
def my_function():
|
||||
# This runs on your specified computer instance
|
||||
pass
|
||||
```
|
||||
|
||||
## Real-world examples that actually work
|
||||
|
||||
### Browser automation without the headaches
|
||||
|
||||
Ever tried to automate a browser and had it crash your entire system? Yeah, us too. Here's how to do it safely:
|
||||
|
||||
```python
|
||||
@sandboxed("browser_env")
|
||||
def automate_browser_with_playwright():
|
||||
"""Automate browser interactions using Playwright"""
|
||||
from playwright.sync_api import sync_playwright
|
||||
import time
|
||||
import base64
|
||||
from datetime import datetime
|
||||
|
||||
try:
|
||||
with sync_playwright() as p:
|
||||
# Launch browser (visible, because why not?)
|
||||
browser = p.chromium.launch(
|
||||
headless=False,
|
||||
args=['--no-sandbox', '--disable-dev-shm-usage']
|
||||
)
|
||||
|
||||
page = browser.new_page()
|
||||
page.set_viewport_size({"width": 1280, "height": 720})
|
||||
|
||||
actions = []
|
||||
screenshots = {}
|
||||
|
||||
# Let's visit example.com and poke around
|
||||
page.goto("https://example.com")
|
||||
actions.append("Navigated to example.com")
|
||||
|
||||
# Grab a screenshot because screenshots are cool
|
||||
screenshot_bytes = page.screenshot(full_page=True)
|
||||
screenshots["initial"] = base64.b64encode(screenshot_bytes).decode()
|
||||
|
||||
# Get some basic info
|
||||
title = page.title()
|
||||
actions.append(f"Page title: {title}")
|
||||
|
||||
# Find links and headings
|
||||
try:
|
||||
links = page.locator("a").all()
|
||||
link_texts = [link.text_content() for link in links[:5]]
|
||||
actions.append(f"Found {len(links)} links: {link_texts}")
|
||||
|
||||
headings = page.locator("h1, h2, h3").all()
|
||||
heading_texts = [h.text_content() for h in headings[:3]]
|
||||
actions.append(f"Found headings: {heading_texts}")
|
||||
|
||||
except Exception as e:
|
||||
actions.append(f"Element interaction error: {str(e)}")
|
||||
|
||||
# Let's try a form for good measure
|
||||
try:
|
||||
page.goto("https://httpbin.org/forms/post")
|
||||
actions.append("Navigated to form page")
|
||||
|
||||
# Fill out the form
|
||||
page.fill('input[name="custname"]', "Test User from Sandboxed Environment")
|
||||
page.fill('input[name="custtel"]', "555-0123")
|
||||
page.fill('input[name="custemail"]', "test@example.com")
|
||||
page.select_option('select[name="size"]', "large")
|
||||
|
||||
actions.append("Filled out form fields")
|
||||
|
||||
# Submit and see what happens
|
||||
page.click('input[type="submit"]')
|
||||
page.wait_for_load_state("networkidle")
|
||||
|
||||
actions.append("Submitted form")
|
||||
|
||||
except Exception as e:
|
||||
actions.append(f"Form interaction error: {str(e)}")
|
||||
|
||||
browser.close()
|
||||
|
||||
return {
|
||||
"actions_performed": actions,
|
||||
"screenshots": screenshots,
|
||||
"success": True
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {"error": f"Browser automation failed: {str(e)}"}
|
||||
|
||||
# Install Playwright and its browsers
|
||||
await computer.venv_install("browser_env", ["playwright"])
|
||||
await computer.venv_cmd("browser_env", "playwright install chromium")
|
||||
|
||||
# Run the automation
|
||||
result = await automate_browser_with_playwright()
|
||||
print(f"Performed {len(result.get('actions_performed', []))} actions")
|
||||
```
|
||||
|
||||
### Building code analysis agents
|
||||
|
||||
Want to build agents that can analyze code safely? Here's a security audit tool that won't accidentally `eval()` your system into oblivion:
|
||||
|
||||
```python
|
||||
@sandboxed("analysis_env")
|
||||
def security_audit_tool(code_snippet):
|
||||
"""Analyze code for potential security issues"""
|
||||
import ast
|
||||
import re
|
||||
|
||||
issues = []
|
||||
|
||||
# Check for the usual suspects
|
||||
dangerous_patterns = [
|
||||
(r'eval\s*\(', "Use of eval() function"),
|
||||
(r'exec\s*\(', "Use of exec() function"),
|
||||
(r'__import__\s*\(', "Dynamic import usage"),
|
||||
(r'subprocess\.', "Subprocess usage"),
|
||||
(r'os\.system\s*\(', "OS system call"),
|
||||
]
|
||||
|
||||
for pattern, description in dangerous_patterns:
|
||||
if re.search(pattern, code_snippet):
|
||||
issues.append(description)
|
||||
|
||||
# Get fancy with AST analysis
|
||||
try:
|
||||
tree = ast.parse(code_snippet)
|
||||
for node in ast.walk(tree):
|
||||
if isinstance(node, ast.Call):
|
||||
if hasattr(node.func, 'id'):
|
||||
if node.func.id in ['eval', 'exec', 'compile']:
|
||||
issues.append(f"Dangerous function call: {node.func.id}")
|
||||
except SyntaxError:
|
||||
issues.append("Syntax error in code")
|
||||
|
||||
return {
|
||||
"security_issues": issues,
|
||||
"risk_level": "HIGH" if len(issues) > 2 else "MEDIUM" if issues else "LOW"
|
||||
}
|
||||
|
||||
# Test it on some sketchy code
|
||||
audit_result = await security_audit_tool("eval(user_input)")
|
||||
print(f"Security audit: {audit_result}")
|
||||
```
|
||||
|
||||
### Desktop automation in the cloud
|
||||
|
||||
Here's where things get really interesting. Cua cloud containers come with full desktop environments, so you can automate GUIs:
|
||||
|
||||
```python
|
||||
@sandboxed("desktop_env")
|
||||
def take_screenshot_and_analyze():
|
||||
"""Take a screenshot and analyze the desktop"""
|
||||
import io
|
||||
import base64
|
||||
from PIL import ImageGrab
|
||||
from datetime import datetime
|
||||
|
||||
try:
|
||||
# Grab the screen
|
||||
screenshot = ImageGrab.grab()
|
||||
|
||||
# Convert to base64 for easy transport
|
||||
buffer = io.BytesIO()
|
||||
screenshot.save(buffer, format='PNG')
|
||||
screenshot_data = base64.b64encode(buffer.getvalue()).decode()
|
||||
|
||||
# Get some basic info
|
||||
screen_info = {
|
||||
"size": screenshot.size,
|
||||
"mode": screenshot.mode,
|
||||
"timestamp": datetime.now().isoformat()
|
||||
}
|
||||
|
||||
# Analyze the colors (because why not?)
|
||||
colors = screenshot.getcolors(maxcolors=256*256*256)
|
||||
dominant_color = max(colors, key=lambda x: x[0])[1] if colors else None
|
||||
|
||||
return {
|
||||
"screenshot_base64": screenshot_data,
|
||||
"screen_info": screen_info,
|
||||
"dominant_color": dominant_color,
|
||||
"unique_colors": len(colors) if colors else 0
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {"error": f"Screenshot failed: {str(e)}"}
|
||||
|
||||
# Install the dependencies
|
||||
await computer.venv_install("desktop_env", ["Pillow"])
|
||||
|
||||
# Take and analyze a screenshot
|
||||
result = await take_screenshot_and_analyze()
|
||||
print("Desktop analysis complete!")
|
||||
```
|
||||
|
||||
## Pro tips for sandboxed success
|
||||
|
||||
### Keep it self-contained
|
||||
Always put your imports inside the function. Trust us on this one:
|
||||
|
||||
```python
|
||||
@sandboxed("good_env")
|
||||
def good_function():
|
||||
import os # Import inside the function
|
||||
import json
|
||||
|
||||
# Your code here
|
||||
return {"result": "success"}
|
||||
```
|
||||
|
||||
### Install dependencies first
|
||||
Don't forget to install packages before using them:
|
||||
|
||||
```python
|
||||
# Install first
|
||||
await computer.venv_install("my_env", ["pandas", "numpy", "matplotlib"])
|
||||
|
||||
@sandboxed("my_env")
|
||||
def data_analysis():
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
# Now you can use them
|
||||
```
|
||||
|
||||
### Use descriptive environment names
|
||||
Future you will thank you:
|
||||
|
||||
```python
|
||||
@sandboxed("data_processing_env")
|
||||
def process_data(): pass
|
||||
|
||||
@sandboxed("web_scraping_env")
|
||||
def scrape_site(): pass
|
||||
|
||||
@sandboxed("ml_training_env")
|
||||
def train_model(): pass
|
||||
```
|
||||
|
||||
### Always handle errors gracefully
|
||||
Things break. Plan for it:
|
||||
|
||||
```python
|
||||
@sandboxed("robust_env")
|
||||
def robust_function(data):
|
||||
try:
|
||||
result = process_data(data)
|
||||
return {"success": True, "result": result}
|
||||
except Exception as e:
|
||||
return {"success": False, "error": str(e)}
|
||||
```
|
||||
|
||||
## What about performance?
|
||||
|
||||
Let's be honest – there's some overhead here. Code needs to be serialized, sent over the network, and executed remotely. But for most use cases, the benefits far outweigh the costs.
|
||||
|
||||
If you're building something performance-critical, consider:
|
||||
- Batching multiple operations into a single sandboxed function (see the sketch after this list)
|
||||
- Minimizing data transfer between host and container
|
||||
- Using persistent virtual environments
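For example, instead of three separate sandboxed calls (fetch, parse, summarize), you can batch the whole thing into one round-trip. A sketch, reusing the `demo_venv` environment set up earlier (it already has `requests` and `beautifulsoup4` installed):

```python
@sandboxed("demo_venv")
def fetch_and_summarize(url):
    """One round-trip: download, parse, and shrink the data before it leaves the container."""
    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    titles = [h.get_text(strip=True) for h in soup.select("h1, h2")]
    # Return only the small, already-processed result to minimize host/container transfer
    return {"url": url, "title_count": len(titles), "titles": titles[:10]}

result = await fetch_and_summarize("https://example.com")
print(result["title_count"])
```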
|
||||
|
||||
## The security angle
|
||||
|
||||
This is where sandboxed execution really shines:
|
||||
|
||||
1. **Complete process isolation** – code runs in a separate container
|
||||
2. **File system protection** – limited access to your host files
|
||||
3. **Network isolation** – controlled network access
|
||||
4. **Clean environments** – no package conflicts or pollution
|
||||
5. **Resource limits** – container-level constraints keep things in check
|
||||
|
||||
## Ready to get started?
|
||||
|
||||
The `@sandboxed` decorator is one of those features that sounds simple but opens up a world of possibilities. Whether you're testing sketchy code, building AI agents, or just want to keep your development environment pristine, it's got you covered.
|
||||
|
||||
Give it a try in your next Cua project and see how liberating it feels to run code without fear!
|
||||
|
||||
Happy coding (safely)!
|
||||
|
||||
---
|
||||
|
||||
*Want to dive deeper? Check out our [sandboxed functions examples](https://github.com/trycua/cua/blob/main/examples/sandboxed_functions_examples.py) and [virtual environment tests](https://github.com/trycua/cua/blob/main/tests/venv.py) on GitHub. Questions? Come chat with us on Discord!*
|
||||
302
blog/training-computer-use-models-trajectories-1.md
Normal file
@@ -0,0 +1,302 @@
|
||||
# Training Computer-Use Models: Creating Human Trajectories with Cua
|
||||
|
||||
*Published on May 1, 2025 by Dillon DuPont*
|
||||
|
||||
In our previous posts, we covered [building your own Computer-Use Operator](build-your-own-operator-on-macos-1) and [using the Agent framework](build-your-own-operator-on-macos-2) to simplify development. Today, we'll focus on a critical aspect of improving computer-use agents and models: gathering high-quality demonstration data using Cua's Computer-Use Interface (CUI) and its Gradio UI to create and share human-generated trajectories.
|
||||
|
||||
Why is this important? The underlying models used by computer-use agents need examples of how humans interact with computers to learn effectively. By creating a dataset of diverse, well-executed tasks, we can help train better models that understand how to navigate user interfaces and accomplish real tasks.
|
||||
|
||||
<video src="https://github.com/user-attachments/assets/c586d460-3877-4b5f-a736-3248886d2134" controls width="600"></video>
|
||||
|
||||
|
||||
## What You'll Learn
|
||||
|
||||
By the end of this tutorial, you'll be able to:
|
||||
- Set up the Computer-Use Interface (CUI) with Gradio UI support
|
||||
- Record your own computer interaction trajectories
|
||||
- Organize and tag your demonstrations
|
||||
- Upload your datasets to Hugging Face for community sharing
|
||||
- Contribute to improving computer-use AI for everyone
|
||||
|
||||
**Prerequisites:**
|
||||
- macOS Sonoma (14.0) or later
|
||||
- Python 3.10+
|
||||
- Basic familiarity with Python and terminal commands
|
||||
- A Hugging Face account (for uploading datasets)
|
||||
|
||||
**Estimated Time:** 20-30 minutes
|
||||
|
||||
## Understanding Human Trajectories
|
||||
|
||||
### What are Human Trajectories?
|
||||
|
||||
Human trajectories, in the context of Computer-use AI Agents, are recordings of how humans interact with computer interfaces to complete tasks. These interactions include:
|
||||
|
||||
- Mouse movements, clicks, and scrolls
|
||||
- Keyboard input
|
||||
- Changes in the UI state
|
||||
- Time spent on different elements
|
||||
|
||||
These trajectories serve as examples for AI models to learn from, helping them understand the relationship between:
|
||||
1. The visual state of the screen
|
||||
2. The user's goal or task
|
||||
3. The most appropriate action to take
|
||||
|
||||
### Why Human Demonstrations Matter
|
||||
|
||||
Unlike synthetic data or rule-based automation, human demonstrations capture the nuanced decision-making that happens during computer interaction:
|
||||
|
||||
- **Natural Pacing**: Humans pause to think, accelerate through familiar patterns, and adjust to unexpected UI changes
|
||||
- **Error Recovery**: Humans demonstrate how to recover from mistakes or handle unexpected states
|
||||
- **Context-Sensitive Actions**: The same UI element might be used differently depending on the task context
|
||||
|
||||
By contributing high-quality demonstrations, you're helping to create more capable, human-like computer-use AI systems.
|
||||
|
||||
## Setting Up Your Environment
|
||||
|
||||
### Installing the CUI with Gradio Support
|
||||
|
||||
The Computer-Use Interface includes an optional Gradio UI specifically designed to make recording and sharing demonstrations easy. Let's set it up:
|
||||
|
||||
1. **Create a Python environment** (optional but recommended):
|
||||
```bash
|
||||
# Using conda
|
||||
conda create -n cua-trajectories python=3.10
|
||||
conda activate cua-trajectories
|
||||
|
||||
# Using venv
|
||||
python -m venv cua-trajectories
|
||||
source cua-trajectories/bin/activate # On macOS/Linux
|
||||
```
|
||||
|
||||
2. **Install the CUI package with UI support**:
|
||||
```bash
|
||||
pip install "cua-computer[ui]"
|
||||
```
|
||||
|
||||
3. **Set up your Hugging Face access token**:
|
||||
Create a `.env` file in your project directory and add your Hugging Face token:
|
||||
```bash
|
||||
echo "HF_TOKEN=your_huggingface_token" > .env
|
||||
```
|
||||
You can get your token from your [Hugging Face account settings](https://huggingface.co/settings/tokens).
|
||||
|
||||
### Understanding the Gradio UI
|
||||
|
||||
The Computer-Use Interface Gradio UI provides three main components:
|
||||
|
||||
1. **Recording Panel**: Captures your screen, mouse, and keyboard activity during demonstrations
|
||||
2. **Review Panel**: Allows you to review, tag, and organize your demonstration recordings
|
||||
3. **Upload Panel**: Lets you share your demonstrations with the community via Hugging Face
|
||||
|
||||
The UI is designed to make the entire process seamless, from recording to sharing, without requiring deep technical knowledge of the underlying systems.
|
||||
|
||||
## Creating Your First Trajectory Dataset
|
||||
|
||||
### Launching the UI
|
||||
|
||||
To get started, create a simple Python script to launch the Gradio UI:
|
||||
|
||||
```python
|
||||
# launch_trajectory_ui.py
|
||||
from computer.ui.gradio.app import create_gradio_ui
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# Load your Hugging Face token from .env
|
||||
load_dotenv('.env')
|
||||
|
||||
# Create and launch the UI
|
||||
app = create_gradio_ui()
|
||||
app.launch(share=False)
|
||||
```
|
||||
|
||||
Run this script to start the UI:
|
||||
|
||||
```bash
|
||||
python launch_trajectory_ui.py
|
||||
```
|
||||
|
||||
### Recording a Demonstration
|
||||
|
||||
Let's walk through the process of recording your first demonstration:
|
||||
|
||||
1. **Start the VM**: Click the "Initialize Computer" button in the UI to initialize a fresh macOS sandbox. This ensures your demonstrations are clean and reproducible.
|
||||
2. **Perform a Task**: Complete a simple task like creating a document, organizing files, or searching for information. Natural, everyday tasks make the best demonstrations.
|
||||
3. **Review Recording**: Click the "Conversation Logs" or "Function Logs" tabs to review your captured interactions, making sure there is no personal information that you wouldn't want to share.
|
||||
4. **Add Metadata**: In the "Save/Share Demonstrations" tab, give your recording a descriptive name (e.g., "Creating a Calendar Event") and add relevant tags (e.g., "productivity", "time-management").
|
||||
5. **Save Your Demonstration**: Click "Save" to store your recording locally.
|
||||
|
||||
<video src="https://github.com/user-attachments/assets/de3c3477-62fe-413c-998d-4063e48de176" controls width="600"></video>
|
||||
|
||||
### Key Tips for Quality Demonstrations
|
||||
|
||||
To create the most valuable demonstrations:
|
||||
|
||||
- **Start and end at logical points**: Begin with a clear starting state and end when the task is visibly complete
|
||||
- **Narrate your thought process**: Use the message input to describe what you're trying to do and why
|
||||
- **Move at a natural pace**: Don't rush or perform actions artificially slowly
|
||||
- **Include error recovery**: If you make a mistake, keep going and show how to correct it
|
||||
- **Demonstrate variations**: Record multiple ways to complete the same task
|
||||
|
||||
## Organizing and Tagging Demonstrations
|
||||
|
||||
Effective tagging and organization make your demonstrations more valuable to researchers and model developers. Consider these tagging strategies:
|
||||
|
||||
### Task-Based Tags
|
||||
|
||||
Describe what the demonstration accomplishes:
|
||||
- `web-browsing`
|
||||
- `document-editing`
|
||||
- `file-management`
|
||||
- `email`
|
||||
- `scheduling`
|
||||
|
||||
### Application Tags
|
||||
|
||||
Identify the applications used:
|
||||
- `finder`
|
||||
- `safari`
|
||||
- `notes`
|
||||
- `terminal`
|
||||
- `calendar`
|
||||
|
||||
### Complexity Tags
|
||||
|
||||
Indicate the difficulty level:
|
||||
- `beginner`
|
||||
- `intermediate`
|
||||
- `advanced`
|
||||
- `multi-application`
|
||||
|
||||
### UI Element Tags
|
||||
|
||||
Highlight specific UI interactions:
|
||||
- `drag-and-drop`
|
||||
- `menu-navigation`
|
||||
- `form-filling`
|
||||
- `search`
|
||||
|
||||
The Computer-Use Interface UI allows you to apply and manage these tags across all your saved demonstrations, making it easy to create cohesive, well-organized datasets.
|
||||
|
||||
<video src="https://github.com/user-attachments/assets/5ad1df37-026a-457f-8b49-922ae805faef" controls width="600"></video>
|
||||
|
||||
## Uploading to Hugging Face
|
||||
|
||||
Sharing your demonstrations helps advance research in computer-use AI. The Gradio UI makes uploading to Hugging Face simple:
|
||||
|
||||
### Preparing for Upload
|
||||
|
||||
1. **Review Your Demonstrations**: Use the review panel to ensure all demonstrations are complete and correctly tagged.
|
||||
|
||||
2. **Select Demonstrations to Upload**: You can upload all demonstrations or filter by specific tags.
|
||||
|
||||
3. **Configure Dataset Information**:
|
||||
- **Repository Name**: Format as `{your_username}/{dataset_name}`, e.g., `johndoe/productivity-tasks`
|
||||
- **Visibility**: Choose `public` to contribute to the community or `private` for personal use
|
||||
- **License**: Standard licenses like CC-BY or MIT are recommended for public datasets
|
||||
|
||||
### The Upload Process
|
||||
|
||||
1. **Click "Upload to Hugging Face"**: This initiates the upload preparation.
|
||||
|
||||
2. **Review Dataset Summary**: Confirm the number of demonstrations and total size.
|
||||
|
||||
3. **Confirm Upload**: The UI will show progress as files are transferred.
|
||||
|
||||
4. **Receive Confirmation**: Once complete, you'll see a link to your new dataset on Hugging Face.
|
||||
|
||||
<video src="https://github.com/user-attachments/assets/c586d460-3877-4b5f-a736-3248886d2134" controls width="600"></video>
|
||||
|
||||
Your uploaded dataset will have a standardized format with the following structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2025-05-01T09:20:40.594878",
|
||||
"session_id": "1fe9f0fe-9331-4078-aacd-ec7ffb483b86",
|
||||
"name": "penguin lemon forest",
|
||||
"tool_calls": [...], // Detailed interaction records
|
||||
"messages": [...], // User/assistant messages
|
||||
"tags": ["highquality", "tasks"],
|
||||
"images": [...] // Screenshots of each state
|
||||
}
|
||||
```
|
||||
|
||||
This structured format makes it easy for researchers to analyze patterns across different demonstrations and build better computer-use models. For reference, the same kinds of interactions captured in these recordings can also be driven programmatically through the Computer interface:
|
||||
|
||||
```python
|
||||
from computer import Computer
|
||||
|
||||
computer = Computer(os_type="macos", display="1024x768", memory="8GB", cpu="4")
|
||||
try:
|
||||
await computer.run()
|
||||
|
||||
screenshot = await computer.interface.screenshot()
|
||||
with open("screenshot.png", "wb") as f:
|
||||
f.write(screenshot)
|
||||
|
||||
await computer.interface.move_cursor(100, 100)
|
||||
await computer.interface.left_click()
|
||||
await computer.interface.right_click(300, 300)
|
||||
await computer.interface.double_click(400, 400)
|
||||
|
||||
await computer.interface.type("Hello, World!")
|
||||
await computer.interface.press_key("enter")
|
||||
|
||||
await computer.interface.set_clipboard("Test clipboard")
|
||||
content = await computer.interface.copy_to_clipboard()
|
||||
print(f"Clipboard content: {content}")
|
||||
finally:
|
||||
await computer.stop()
|
||||
```
|
||||
|
||||
## Example: Shopping List Demonstration
|
||||
|
||||
Let's walk through a concrete example of creating a valuable demonstration:
|
||||
|
||||
### Task: Adding Shopping List Items to a Doordash Cart
|
||||
|
||||
1. **Start Recording**: Begin with a clean desktop and a text file containing a shopping list.
|
||||
|
||||
2. **Task Execution**: Open the file, read the list, open Safari, navigate to Doordash, and add each item to the cart.
|
||||
|
||||
3. **Narration**: Add messages like "Reading the shopping list" and "Searching for rice on Doordash" to provide context.
|
||||
|
||||
4. **Completion**: Verify all items are in the cart and end the recording.
|
||||
|
||||
5. **Tagging**: Add tags like `shopping`, `web-browsing`, `task-completion`, and `multi-step`.
|
||||
|
||||
This type of demonstration is particularly valuable because it showcases real-world task completion requiring multiple applications and context switching.
|
||||
|
||||
### Exploring Community Datasets
|
||||
|
||||
You can also learn from existing trajectory datasets contributed by the community:
|
||||
|
||||
1. Visit [Hugging Face Datasets tagged with 'cua'](https://huggingface.co/datasets?other=cua)
|
||||
2. Explore different approaches to similar tasks
|
||||
3. Download and analyze high-quality demonstrations (see the sketch below for one way to do this)
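For a quick local look at one of these datasets, the Hugging Face `datasets` library works well. A sketch: the split and field names below assume the JSON structure shown earlier and the example dataset from the Resources section.

```python
from datasets import load_dataset

# Pull a community trajectory dataset for local analysis
ds = load_dataset("ddupont/test-dataset", split="train")
print(ds)             # expected features: tool_calls, messages, tags, images, ...
print(ds[0]["tags"])  # tags attached to the first demonstration
```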
|
||||
|
||||
## Conclusion
|
||||
|
||||
### Summary
|
||||
|
||||
In this guide, we've covered how to:
|
||||
- Set up the Computer-Use Interface with Gradio UI
|
||||
- Record high-quality human demonstrations
|
||||
- Organize and tag your trajectories
|
||||
- Share your datasets with the community
|
||||
|
||||
By contributing your own demonstrations, you're helping to build more capable, human-like AI systems that can understand and execute complex computer tasks.
|
||||
|
||||
### Next Steps
|
||||
|
||||
Now that you know how to create and share trajectories, consider these advanced techniques:
|
||||
|
||||
- Create themed collections around specific productivity workflows
|
||||
- Collaborate with others to build comprehensive datasets
|
||||
- Use your datasets to fine-tune your own computer-use models
|
||||
|
||||
### Resources
|
||||
|
||||
- [Computer-Use Interface GitHub](https://github.com/trycua/cua/tree/main/libs/computer)
|
||||
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
|
||||
- [Example Dataset: ddupont/test-dataset](https://huggingface.co/datasets/ddupont/test-dataset)
|
||||
89
blog/trajectory-viewer.md
Normal file
@@ -0,0 +1,89 @@
|
||||
# Trajectory Viewer for Cua
|
||||
|
||||
*Published on May 13, 2025 by Dillon DuPont*
|
||||
|
||||
Don’t forget to check out [Part 1: Building your own Computer-Use Operator](build-your-own-operator-on-macos-1) and [Part 2: Using the Agent framework](build-your-own-operator-on-macos-2) for setting up your Cua environment and basic tips and tricks!
|
||||
|
||||
## Introduction
|
||||
|
||||
Okay, so you’ve gotten your environment up and running and tested a few agent runs. You’ll likely have encountered cases where your agent completed some tasks successfully, but also places where it got stuck or outright failed.
|
||||
Now what?
|
||||
If you’ve ever wondered exactly what your computer agent is doing and why it sometimes doesn’t do what you expected, then the Trajectory Viewer for Cua is here to help! Whether you’re a seasoned developer or someone who just wants to dive in and see results, this tool makes it easy to explore every step your agent takes on your screen.
|
||||
Plus, if you want to start thinking about generating data to train your own agentic model (we’ll cover training in an upcoming blog, so look forward to it), then our Trajectory Viewer might be for you.
|
||||
|
||||
## So, what’s a “trajectory”?
|
||||
|
||||
Think of a trajectory as a detailed video recording of your agent’s journey:
|
||||
|
||||
- **Observations**: What did the agent see (the exact screen content) at each point in time?
|
||||
- **Actions**: What clicks, keystrokes, or commands did it perform in response?
|
||||
- **Decisions**: Which options did it choose, and why?
|
||||
Especially for longer and more complex tasks, your agent will take many steps, perform many actions, and make many observations. By examining this record, you can pinpoint where things go right and, more importantly, where they go wrong.
|
||||
|
||||
## So, what’s Cua’s Trajectory Viewer and why use it?
|
||||
|
||||
The Trajectory Viewer for Cua is a GUI tool that helps you explore saved trajectories generated from your Cua computer agent runs. This tool provides a powerful way to:
|
||||
|
||||
- **Debug your agents**: See exactly what your agent saw to reproduce bugs
|
||||
- **Analyze failure cases**: Identify the moment when your agent went off-script
|
||||
- **Collect training data**: Export your trajectories for your own processing, training, and more!
|
||||
|
||||
The viewer allows you to see exactly what your agent observed and how it interacted with the computer all through your browser.
|
||||
|
||||
## Opening Trajectory Viewer in 3 Simple Steps
|
||||
|
||||
1. **Visit**: Open your browser and go to [https://www.trycua.com/trajectory-viewer](https://www.trycua.com/trajectory-viewer).
|
||||
2. **Upload**: Drag and drop a trajectories folder or click Select Folder.
|
||||
3. **Explore**: View your agent’s trajectories! All data stays in your browser unless you give permission otherwise.
|
||||
|
||||

|
||||
|
||||
## Recording a Trajectory
|
||||
|
||||
### Using the Gradio UI
|
||||
|
||||
The simplest way to create agent trajectories is through the [Cua Agent Gradio UI](https://www.trycua.com/docs/quickstart-ui) by checking the "Save Trajectory" option.
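If you haven’t launched the UI yet, it ships with the agent package (assuming `cua-agent[all]` is installed):

```bash
# Launch the Cua Agent UI in your browser, then check "Save Trajectory"
python -m agent.ui
```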
|
||||
|
||||
### Using the ComputerAgent API
|
||||
|
||||
Trajectories are saved by default when using the ComputerAgent API:
|
||||
|
||||
```python
|
||||
agent.run("book a flight for me")
|
||||
```
|
||||
|
||||
You can explicitly control trajectory saving with the `save_trajectory` parameter:
|
||||
|
||||
```python
|
||||
from agent import ComputerAgent
|
||||
|
||||
agent = ComputerAgent(save_trajectory=True)
|
||||
agent.run("search for hotels in Boston")
|
||||
```
|
||||
|
||||
Each trajectory folder is saved in a `trajectories` directory with a timestamp format, for example: `trajectories/20250501_222749`
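A quick way to find your most recent runs before loading them into the viewer (a minimal sketch, assuming the default `trajectories` directory):

```python
from pathlib import Path

# List saved runs, newest first (folder names are timestamps like 20250501_222749)
runs = sorted(Path("trajectories").iterdir(), reverse=True)
for run in runs[:5]:
    print(run.name)
```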
|
||||
|
||||
## Exploring and Analyzing Trajectories
|
||||
|
||||
Our Trajectory Viewer is designed to allow for thorough analysis and debugging in a friendly way. Once loaded, the viewer presents:
|
||||
|
||||
- **Timeline Slider**: Jump to any step in the session
|
||||
- **Screen Preview**: See exactly what the agent saw
|
||||
- **Action Details**: Review clicks, keypresses, and API calls
|
||||
- **Logs & Metadata**: Inspect debug logs or performance stats
|
||||
|
||||
Use these features to:
|
||||
|
||||
- Step through each action and observation; understand your agent’s decision-making
|
||||
- Understand why and where your agent failed
|
||||
- Collect insights for improving your instructions, prompts, tasks, agent, etc.
|
||||
|
||||
The trajectory viewer provides a visual interface for stepping through each action your agent took, making it easy to see what your agent “sees”.
|
||||
|
||||
## Getting Started
|
||||
|
||||
Ready to see your agent in action? Head over to the Trajectory Viewer and load up your first session. Debug smarter, train faster, and stay in control (all within your browser).
|
||||
|
||||
Happy tinkering and Cua on!
|
||||
|
||||
Have questions or want to share feedback? Join our community on Discord or open an issue on GitHub.
|
||||
183
blog/ubuntu-docker-support.md
Normal file
@@ -0,0 +1,183 @@
|
||||
# Ubuntu Docker Support in Cua with Kasm
|
||||
|
||||
*Published Aug 26, 2025 by Francesco Bonacci*
|
||||
|
||||
Today we’re shipping **Ubuntu Docker support** in Cua. You get a full Linux desktop inside a Docker container, viewable right in your browser—no VM spin-up, no extra clients. It behaves the same on macOS, Windows, and Linux.
|
||||
|
||||
<img src="/docker-ubuntu-support.png" alt="Cua + KasmVNC Ubuntu container desktop">
|
||||
|
||||
## Why we did this
|
||||
|
||||
If you build automation or RL workflows with Cua, you’ve probably run into the usual platform walls: macOS VMs (via Lume) are Apple-Silicon only; Windows Sandbox needs Pro/Enterprise; giving agents your host desktop is… exciting, but risky; and little OS quirks make “build once, run anywhere” harder than it should be.
|
||||
|
||||
We wanted something lightweight, isolated, and identical across machines. So we put a desktop in a container.
|
||||
|
||||
## Why we didn’t use QEMU/KVM
|
||||
|
||||
Short answer: **portability, startup time, and ops friction.**
|
||||
|
||||
* **Runs everywhere, no hypervisor drama.** KVM needs Linux; Hyper-V/Virtualization.Framework setups vary by host and policy. Docker is ubiquitous across macOS/Windows/Linux and allowed in most CI runners—so your GUI env actually runs where your team works.
|
||||
* **Faster boot & smaller footprints.** Containers cold-start in seconds and images are GB-scale; VMs tend to be minutes and tens of GB. That matters for parallel agents, CI, and local iteration.
|
||||
* **Lower ops overhead.** No nested virt, kernel modules, or privileged host tweaks that many orgs (and cloud runners) block. Pull → run → browser.
|
||||
* **Same image, everywhere.** One Docker image gives you an identical desktop on every dev laptop and in CI.
|
||||
* **Web-first access out of the box.** KasmVNC serves the desktop over HTTP—no extra VNC/RDP clients or SPICE config.
|
||||
|
||||
**When we *do* reach for QEMU/KVM:**
|
||||
|
||||
* You need **true OS isolation** or to run **non-Linux** guests.
|
||||
* You want **kernel-level features** or **device/GPU passthrough** (VFIO).
|
||||
* You’re optimizing for **hardware realism** over startup speed and density.
|
||||
|
||||
For this release, the goal was a **cross-platform Linux desktop that feels instant and identical** across local dev and CI. Containers + KasmVNC hit that sweet spot.
|
||||
|
||||
## What we built
|
||||
|
||||
Under the hood it’s **KasmVNC + Ubuntu 22.04 (Xfce) in Docker**, pre-configured for computer-use automation. You get a proper GUI desktop served over HTTP (no VNC/RDP client), accessible from any modern browser. Cua’s Computer server boots automatically so your agents can connect immediately.
|
||||
|
||||
### How it works (at a glance)
|
||||
|
||||
```
|
||||
Your System
|
||||
└─ Docker Container
|
||||
└─ Xfce Desktop + KasmVNC → open in your browser
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick start
|
||||
|
||||
1. **Install Docker** — Docker Desktop (macOS/Windows) or Docker Engine (Linux).
|
||||
|
||||
2. **Pull or build the image**
|
||||
|
||||
```bash
|
||||
# Pull (recommended)
|
||||
docker pull --platform=linux/amd64 trycua/cua-ubuntu:latest
|
||||
|
||||
# Or build locally
|
||||
cd libs/kasm
|
||||
docker build -t cua-ubuntu:latest .
|
||||
```
|
||||
|
||||
3. **Run with Cua’s Computer SDK**
|
||||
|
||||
```python
|
||||
from computer import Computer
|
||||
|
||||
computer = Computer(
|
||||
os_type="linux",
|
||||
provider_type="docker",
|
||||
image="trycua/cua-ubuntu:latest",
|
||||
name="my-automation-container"
|
||||
)
|
||||
|
||||
await computer.run()
|
||||
```
|
||||
|
||||
### Make an agent that drives this desktop
|
||||
|
||||
```python
|
||||
from agent import ComputerAgent
|
||||
|
||||
# assumes `computer` is the instance created above
|
||||
agent = ComputerAgent("openrouter/z-ai/glm-4.5v", tools=[computer])
|
||||
|
||||
async for _ in agent.run("Click on the search bar and type 'hello world'"):
|
||||
pass
|
||||
```
|
||||
|
||||
> Use any VLM with tool use; just make sure your OpenRouter creds are set.
|
||||
|
||||
By default you land on **Ubuntu 22.04 + Xfce** with a browser and desktop basics, the **Computer server** is running, the **web viewer** is available at `http://localhost:8006`, and common automation tools are preinstalled.
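Once it’s running, you can sanity-check the desktop from Python with the same interface calls used on other Cua computers (assuming the `computer` instance from the quick start above):

```python
# Quick sanity check: grab a screenshot of the container desktop
screenshot = await computer.interface.screenshot()
with open("desktop.png", "wb") as f:
    f.write(screenshot)
```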
|
||||
|
||||
---
|
||||
|
||||
## What’s inside (in plain English)
|
||||
|
||||
A tidy Linux desktop with web access through **KasmVNC**, Python 3.11 and dev tools, plus utilities you’ll actually use for automation—`wmctrl` for windows, `xclip` for clipboard, `ffmpeg` for media, screenshot helpers, and so on. It starts as a **non-root `kasm-user`**, lives in an **isolated filesystem** (unless you mount volumes), and ships with **SSL off for local dev** so you terminate TLS upstream when you deploy.
|
||||
|
||||
---
|
||||
|
||||
## How it compares
|
||||
|
||||
| Feature | KasmVNC Docker | Lume (macOS VM) | Windows Sandbox |
|
||||
| ---------------- | --------------------- | --------------------- | ---------------------- |
|
||||
| Platform support | macOS, Windows, Linux | macOS (Apple Silicon) | Windows Pro/Enterprise |
|
||||
| Resource usage | Low (container) | Medium (full VM) | Medium (full VM) |
|
||||
| Setup time | \~30s | 2–5 min | 1–2 min |
|
||||
| GUI desktop | Linux | macOS | Windows |
|
||||
| Web access | Browser (no client) | Typically VNC client | Typically RDP client |
|
||||
| Consistency | Same everywhere | Hardware-dependent | OS-dependent |
|
||||
|
||||
**Use KasmVNC Docker when…** you want the **same GUI env across devs/CI/platforms**, you’re doing **RL or end-to-end GUI tests**, or you need **many isolated desktops on one machine**.
|
||||
**Use alternatives when…** you need native **macOS** (→ Lume) or native **Windows** (→ Windows Sandbox).
|
||||
|
||||
---
|
||||
|
||||
## Using the Agent Framework (parallel example)
|
||||
|
||||
A compact pattern for running multiple desktops and agents side-by-side:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent
|
||||
|
||||
# Create multiple computer instances (each gets its own desktop)
|
||||
computers = []
|
||||
for i in range(3):
|
||||
c = Computer(
|
||||
os_type="linux",
|
||||
provider_type="docker",
|
||||
image="trycua/cua-ubuntu:latest",
|
||||
name=f"parallel-desktop-{i}"
|
||||
)
|
||||
computers.append(c)
|
||||
await c.run()
|
||||
|
||||
# Pair each desktop with a task
|
||||
tasks = [
|
||||
"open github and search for 'trycua/cua'",
|
||||
"open a text editor and write 'hello world'",
|
||||
"open the browser and go to google.com",
|
||||
]
|
||||
|
||||
agents = [
|
||||
ComputerAgent(model="openrouter/z-ai/glm-4.5v", tools=[c])
|
||||
for c in computers
|
||||
]
|
||||
|
||||
async def run_agent(agent, task):
|
||||
async for _ in agent.run(task):
|
||||
pass
|
||||
|
||||
await asyncio.gather(*[run_agent(a, t) for a, t in zip(agents, tasks)])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## What’s next
|
||||
|
||||
We’re polishing a **CLI to push/scale these containers on Cua Cloud**, exploring **GPU acceleration** for in-container inference, and publishing **prebuilt images** for Playwright, Selenium, and friends.
|
||||
|
||||
---
|
||||
|
||||
## Try it
|
||||
|
||||
```python
|
||||
from computer import Computer
|
||||
computer = Computer(os_type="linux", provider_type="docker", image="trycua/cua-ubuntu:latest")
|
||||
await computer.run()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
* **Docker Provider Docs:** [https://docs.trycua.com/computers/docker](https://docs.trycua.com/computers/docker)
|
||||
* **KasmVNC:** [https://github.com/kasmtech/KasmVNC](https://github.com/kasmtech/KasmVNC)
|
||||
* **Container Source:** [https://github.com/trycua/cua/tree/main/libs/kasm](https://github.com/trycua/cua/tree/main/libs/kasm)
|
||||
* **Computer SDK:** [https://docs.trycua.com/docs/computer-sdk/computers](https://docs.trycua.com/docs/computer-sdk/computers)
|
||||
* **Discord:** [https://discord.gg/cua-ai](https://discord.gg/cua-ai)
|
||||
|
||||
Questions or weird edge cases? Ping us on Discord—we’re curious to see what you build.
|
||||
238
blog/windows-sandbox.md
Normal file
@@ -0,0 +1,238 @@
|
||||
# Your Windows PC is Already the Perfect Development Environment for Computer-Use Agents
|
||||
|
||||
*Published on June 18, 2025 by Dillon DuPont*
|
||||
|
||||
Over the last few months, our enterprise users kept asking the same type of question: *"When are you adding support for AutoCAD?"* *"What about SAP integration?"* *"Can you automate our MES system?"* - each request was for different enterprise applications we'd never heard of.
|
||||
|
||||
At first, we deflected. We've been building Cua to work across different environments - from [Lume for macOS VMs](./lume-to-containerization) to cloud containers. But these requests kept piling up. AutoCAD automation. SAP integration. Specialized manufacturing systems.
|
||||
|
||||
Then it hit us: **they all ran exclusively on Windows**.
|
||||
|
||||
Most of us develop on macOS, so we hadn't considered Windows as a primary target for agent automation. But we were missing out on helping customers automate the software that actually runs their businesses.
|
||||
|
||||
So last month, we started working on Windows support for [RPA (Robotic Process Automation)](https://en.wikipedia.org/wiki/Robotic_process_automation). Here's the twist: **the perfect development environment was already sitting on every Windows machine** - we just had to unlock it.
|
||||
|
||||
<video width="100%" controls>
|
||||
<source src="/demo_wsb.mp4" type="video/mp4">
|
||||
Your browser does not support the video tag.
|
||||
</video>
|
||||
|
||||
## Our Journey to Windows CUA Support

When we started Cua, we focused on making computer-use agents work everywhere - we built [Lume for macOS](https://github.com/trycua/cua/tree/main/libs/lume), created cloud infrastructure, and worked on Linux support. But no matter what we built, Windows kept coming up in every enterprise conversation.

The pattern became clear during customer calls: **the software that actually runs businesses lives on Windows**. Engineering teams wanted agents to automate AutoCAD workflows. Manufacturing companies needed automation for their MES systems. Finance teams were asking about Windows-only trading platforms and legacy enterprise software.

We could have gone straight to expensive Windows cloud infrastructure, but then we discovered Microsoft had already solved the development problem: [Windows Sandbox](https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/). Lightweight, free, and sitting on every Windows machine waiting to be used.

Windows Sandbox support is our first step - **Windows cloud instances are coming later this month** for production workloads.

## What is Windows Sandbox?

Windows Sandbox is Microsoft's built-in lightweight virtualization technology. Despite the name, it's actually closer to a disposable virtual machine than a traditional "sandbox" - it creates a completely separate, lightweight Windows environment rather than just containerizing applications.

Here's how it compares to other approaches:

```
Traditional VM Testing:
┌─────────────────────────────────┐
│         Your Windows PC         │
├─────────────────────────────────┤
│      VMware/VirtualBox VM       │
│  (Heavy, Persistent, Complex)   │
├─────────────────────────────────┤
│          Agent Testing          │
└─────────────────────────────────┘

Windows Sandbox:
┌─────────────────────────────────┐
│         Your Windows PC         │
├─────────────────────────────────┤
│         Windows Sandbox         │
│  (Built-in, Fast, Disposable)   │
├─────────────────────────────────┤
│    Separate Windows Instance    │
└─────────────────────────────────┘
```

> ⚠️ **Important Note**: Windows Sandbox supports **one virtual machine at a time**. For production workloads or running multiple agents simultaneously, you'll want our upcoming cloud infrastructure - but for learning and testing, this local setup is perfect to get started.

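If you have several tasks, the simplest way to respect that one-sandbox limit is to run them back to back against the same environment. Here's a rough sketch using the same `Computer`/`ComputerAgent` API you'll see in Option B below - the model name and task strings are placeholders, and reusing a single agent instance across runs is an assumption rather than a documented guarantee:

```python
import asyncio

from computer import Computer, VMProviderType
from agent import ComputerAgent

async def run_tasks_sequentially(tasks):
    # One sandbox at a time: boot it once and reuse it for every task.
    computer = Computer(
        provider_type=VMProviderType.WINSANDBOX,
        os_type="windows",
        memory="4GB",
    )
    await computer.run()
    try:
        agent = ComputerAgent(
            model="openai/computer-use-preview",  # example model choice
            tools=[computer],
        )
        for task in tasks:
            async for result in agent.run(task):
                print(f"Agent action: {result}")
    finally:
        # Closing the sandbox discards all of its state.
        await computer.stop()

asyncio.run(run_tasks_sequentially([
    "Open Calculator and compute 15% tip on $47.50",
    "Open Notepad and draft a short status update",
]))
```
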
## Why Windows Sandbox is Perfect for Local Computer-Use Agent Testing

First, it's incredibly lightweight. We're talking seconds to boot up a fresh Windows environment, not the minutes you'd wait for a traditional VM. And since it's built into Windows 10 and 11, there's literally no setup cost - it's just sitting there waiting for you to enable it.

But the real magic is how disposable it is. Every time you start Windows Sandbox, you get a completely clean slate. Your agent messed something up? Crashed an application? No problem - just close the sandbox and start fresh. It's like having an unlimited supply of pristine Windows machines for testing.

## Getting Started: Three Ways to Test Agents

We've made Windows Sandbox agent testing as simple as possible. Here are your options:

### Option A: Quick Start with Agent UI (Recommended)

**Perfect for**: First-time users who want to see agents in action immediately

```bash
# One-time setup
pip install -U git+https://github.com/karkason/pywinsandbox.git
pip install -U "cua-computer[all]" "cua-agent[all]"

# Launch the Agent UI
python -m agent.ui
```

**What you get**:
- Visual interface in your browser
- Real-time agent action viewing
- Natural language task instructions
- No coding required

### Option B: Python API Integration

**Perfect for**: Developers building agent workflows

```python
import asyncio
from computer import Computer, VMProviderType
from agent import ComputerAgent

async def test_windows_agent():
    # Create Windows Sandbox computer
    computer = Computer(
        provider_type=VMProviderType.WINSANDBOX,
        os_type="windows",
        memory="4GB",
    )

    # Start the VM (~35s)
    await computer.run()

    # Create agent with your preferred model
    agent = ComputerAgent(
        model="openai/computer-use-preview",
        save_trajectory=True,
        tools=[computer]
    )

    # Give it a task
    async for result in agent.run("Open Calculator and compute 15% tip on $47.50"):
        print(f"Agent action: {result}")

    # Shutdown the VM
    await computer.stop()

asyncio.run(test_windows_agent())
```

**What you get**:
- Full programmatic control
- Custom agent workflows
- Integration with your existing code
- Detailed action logging

### Option C: Manual Configuration

**Perfect for**: Advanced users who want full control

1. Enable Windows Sandbox in Windows Features
2. Create custom .wsb configuration files (a scripted sketch of steps 1 and 2 follows below)
3. Integrate with your existing automation tools

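If you'd rather script the first two steps than click through Windows Features, here's a rough Python sketch. It assumes an elevated (administrator) session; the folder paths and the `agent-test.wsb` filename are placeholders, and you should double-check the `.wsb` element names and the feature name against Microsoft's Windows Sandbox documentation before relying on them.

```python
import os
import subprocess
from pathlib import Path

# Step 1: enable the Windows Sandbox optional feature (requires an elevated
# prompt; a reboot is usually needed before the sandbox is available).
subprocess.run(
    [
        "powershell", "-Command",
        "Enable-WindowsOptionalFeature -Online "
        "-FeatureName 'Containers-DisposableClientVM' -All -NoRestart",
    ],
    check=True,
)

# Step 2: write a minimal .wsb configuration that caps memory at 4 GB and
# maps a host folder into the sandbox read-only. Paths are placeholders.
wsb_config = r"""<Configuration>
  <MemoryInMB>4096</MemoryInMB>
  <MappedFolders>
    <MappedFolder>
      <HostFolder>C:\agent-workspace</HostFolder>
      <SandboxFolder>C:\workspace</SandboxFolder>
      <ReadOnly>true</ReadOnly>
    </MappedFolder>
  </MappedFolders>
</Configuration>
"""
config_path = Path("agent-test.wsb")
config_path.write_text(wsb_config)

# Launching the .wsb file starts a fresh sandbox with that configuration.
os.startfile(str(config_path))  # Windows-only
```
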
## Comparing Your Options

Let's see how different testing approaches stack up:

### Windows Sandbox + Cua
- **Perfect for**: Quick testing and development
- **Cost**: Free (built into Windows)
- **Setup time**: Under 5 minutes
- **Safety**: Complete isolation from host system
- **Limitation**: One sandbox at a time
- **Requires**: Windows 10/11 with 4GB+ RAM

### Traditional VMs
- **Perfect for**: Complex testing scenarios
- **Full customization**: Any Windows version
- **Heavy resource usage**: Slow to start/stop
- **Complex setup**: License management required
- **Cost**: VM software + Windows licenses

## Real-World Windows RPA Examples

Here's what our enterprise users are building with Windows Sandbox:

### CAD and Engineering Automation
```python
# Example: AutoCAD drawing automation
task = """
1. Open AutoCAD and create a new drawing
2. Draw a basic floor plan with rooms and dimensions
3. Add electrical symbols and circuit layouts
4. Generate a bill of materials from the drawing
5. Export the drawing as both DWG and PDF formats
"""
```

### Manufacturing and ERP Integration
```python
# Example: SAP workflow automation
task = """
1. Open SAP GUI and log into the production system
2. Navigate to Material Management module
3. Create purchase orders for stock items below minimum levels
4. Generate vendor comparison reports
5. Export the reports to Excel and email to procurement team
"""
```

### Financial Software Automation
```python
# Example: Trading platform automation
task = """
1. Open Bloomberg Terminal or similar trading software
2. Monitor specific stock tickers and market indicators
3. Execute trades based on predefined criteria
4. Generate daily portfolio performance reports
5. Update risk management spreadsheets
"""
```

### Legacy Windows Application Integration
```python
# Example: Custom Windows application automation
task = """
1. Open legacy manufacturing execution system (MES)
2. Input production data from CSV files
3. Generate quality control reports
4. Update inventory levels across multiple systems
5. Create maintenance scheduling reports
"""
```

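Each snippet above only defines a task string. To actually execute one, hand it to a `ComputerAgent` running against a Windows Sandbox `Computer`, exactly as in Option B. A minimal sketch - the model name and the 8GB memory setting are just example choices:

```python
import asyncio

from computer import Computer, VMProviderType
from agent import ComputerAgent

task = "..."  # paste any of the task strings above here

async def run_rpa_task(task: str):
    # Give heavyweight desktop apps (CAD, SAP GUI, trading terminals) extra memory
    computer = Computer(
        provider_type=VMProviderType.WINSANDBOX,
        os_type="windows",
        memory="8GB",
    )
    await computer.run()
    try:
        agent = ComputerAgent(
            model="openai/computer-use-preview",
            save_trajectory=True,
            tools=[computer],
        )
        async for result in agent.run(task):
            print(f"Agent action: {result}")
    finally:
        await computer.stop()

asyncio.run(run_rpa_task(task))
```
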
## System Requirements and Performance

### What You Need
- **Windows 10/11**: Pro, Enterprise, or Education edition (Windows Sandbox isn't available on Home)
- **Memory**: 4GB minimum (8GB recommended for CAD/professional software)
- **CPU**: Virtualization support (enabled by default on modern systems)
- **Storage**: A few GB of free space

### Performance Tips
- **Close unnecessary applications** before starting Windows Sandbox
- **Allocate appropriate memory** based on your RPA workflow complexity
- **Use SSD storage** for faster sandbox startup
- **Consider dedicated hardware** for resource-intensive applications like CAD software

**Stay tuned** - we'll be announcing Windows Cloud Instances later this month.

But for development, prototyping, and learning Windows RPA workflows, **Windows Sandbox gives you everything you need to get started right now**.

## Learn More

- [Windows Sandbox Documentation](https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/)
- [Cua GitHub Repository](https://github.com/trycua/cua)
- [Agent UI Documentation](https://github.com/trycua/cua/tree/main/libs/agent)
- [Join our Discord Community](https://discord.gg/cua-ai)

---

*Ready to see AI agents control your Windows applications? Come share your testing experiences on Discord!*