Add oss blogposts

This commit is contained in:
f-trycua
2025-09-17 17:07:58 +02:00
parent 88ee0ecaee
commit 6faaf0dea8
25 changed files with 3843 additions and 0 deletions

blog/app-use.md Normal file

@@ -0,0 +1,239 @@
# App-Use: Control Individual Applications with Cua Agents
*Published on May 31, 2025 by The Cua Team*
Today, we are excited to introduce a new experimental feature landing in the [Cua GitHub repository](https://github.com/trycua/cua): **App-Use**. App-Use allows you to create lightweight virtual desktops that limit agent access to specific applications, improving the precision of your agent's trajectory. It's perfect for parallel workflows and focused task execution.
> **Note:** App-Use is currently experimental. To use it, you need to enable it by passing the `experiments=["app-use"]` feature flag when creating your Computer instance.
Check out an example of a Cua Agent automating the Cua team's Taco Bell order through the iPhone Mirroring app:
<video width="100%" controls>
<source src="/demo_app_use.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## What is App-Use?
App-Use lets you create virtual desktop sessions scoped to specific applications. Instead of giving an agent access to your entire screen, you can say "only work with Safari and Notes" or "just control the iPhone Mirroring app."
```python
from computer import Computer
from agent import ComputerAgent

# Create a macOS VM with the App-Use experimental feature enabled
computer = Computer(experiments=["app-use"])

# Create a desktop limited to specific apps
desktop = computer.create_desktop_from_apps(["Safari", "Notes"])

# Your agent can now only see and interact with these apps
agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",
    tools=[desktop]
)
```
## Key Benefits
### 1. Lightweight and Fast
App-Use creates visual filters, not new processes. Your apps continue running normally - we just control what the agent can see and click on. The virtual desktops are composited views that require no additional compute resources beyond the existing window manager operations.
### 2. Run Multiple Agents in Parallel
Deploy a team of specialized agents, each focused on their own apps:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

# Create a Computer with App-Use enabled
computer = Computer(experiments=["app-use"])

# Research agent focuses on the browser
research_desktop = computer.create_desktop_from_apps(["Safari"])
research_agent = ComputerAgent(tools=[research_desktop], ...)

# Writing agent focuses on documents
writing_desktop = computer.create_desktop_from_apps(["Pages", "Notes"])
writing_agent = ComputerAgent(tools=[writing_desktop], ...)

async def run_agent(agent, task):
    async for result in agent.run(task):
        print(result.get('text', ''))

# Run both simultaneously
await asyncio.gather(
    run_agent(research_agent, "Research AI trends for 2025"),
    run_agent(writing_agent, "Draft blog post outline")
)
```
## How To: Getting Started with App-Use
### Requirements
To get started with App-Use, you'll need:
- Python 3.11+
- macOS Sequoia (15.0) or later
### Getting Started
```bash
# Install packages and launch UI
pip install -U "cua-computer[all]" "cua-agent[all]"
python -m agent.ui.gradio.app
```
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def main():
    # Enable the App-Use experimental feature
    computer = Computer(experiments=["app-use"])
    await computer.run()

    # Create an app-specific desktop session
    desktop = computer.create_desktop_from_apps(["Notes"])

    # Initialize an agent
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[desktop]
    )

    # Take a screenshot (returns bytes by default)
    screenshot = await desktop.interface.screenshot()
    with open("app_screenshot.png", "wb") as f:
        f.write(screenshot)

    # Run an agent task
    async for result in agent.run("Create a new note titled 'Meeting Notes' and add today's agenda items"):
        print(f"Agent: {result.get('text', '')}")

if __name__ == "__main__":
    asyncio.run(main())
```
## Use Case: Automating Your iPhone with Cua
### ⚠️ Important Warning
Computer-use agents are powerful tools that can interact with your devices. This guide involves using your own macOS and iPhone instead of a VM. **Proceed at your own risk.** Always:
- Review agent actions before running
- Start with non-critical tasks
- Monitor agent behavior closely
Remember that even with Cua it is still advisable to run your agents in a VM for a better level of isolation.
### Setting Up iPhone Automation
### Step 1: Start the cua-computer-server
First, you'll need to start the cua-computer-server locally to enable access to iPhone Mirroring via the Computer interface:
```bash
# Install the server
pip install cua-computer-server
# Start the server
python -m computer_server
```
### Step 2: Connect iPhone Mirroring
Then, you'll need to open the "iPhone Mirroring" app on your Mac and connect it to your iPhone.
### Step 3: Create an iPhone Automation Session
Finally, you can create an iPhone automation session:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def automate_iphone():
    # Connect to your local computer server
    my_mac = Computer(use_host_computer_server=True, os_type="macos", experiments=["app-use"])
    await my_mac.run()

    # Create a desktop focused on iPhone Mirroring
    my_iphone = my_mac.create_desktop_from_apps(["iPhone Mirroring"])

    # Initialize an agent for iPhone automation
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[my_iphone]
    )

    # Example: Send a message
    async for result in agent.run("Open Messages and send 'Hello from Cua!' to John"):
        print(f"Agent: {result.get('text', '')}")

    # Example: Set a reminder
    async for result in agent.run("Create a reminder to call mom at 5 PM today"):
        print(f"Agent: {result.get('text', '')}")

if __name__ == "__main__":
    asyncio.run(automate_iphone())
```
### iPhone Automation Use Cases
With Cua's iPhone automation, you can:
- **Automate messaging**: Send texts, respond to messages, manage conversations
- **Control apps**: Navigate any iPhone app using natural language
- **Manage settings**: Adjust iPhone settings programmatically
- **Extract data**: Read information from apps that don't have APIs
- **Test iOS apps**: Automate testing workflows for iPhone applications
## Important Notes
- **Visual isolation only**: Apps share the same files, OS resources, and user session
- **Dynamic resolution**: Desktops automatically scale to fit app windows and menu bars
- **macOS only**: Currently requires macOS due to compositing engine dependencies
- **Not a security boundary**: This is for agent focus, not security isolation
## When to Use What: App-Use vs Multiple Cua Containers
### Use App-Use within the same macOS Cua Container:
- ✅ You need lightweight, fast agent focusing (macOS only)
- ✅ You want to run multiple agents on one desktop
- ✅ You're automating personal devices like iPhones
- ✅ Window layout isolation is sufficient
- ✅ You want low computational overhead
### Use Multiple Cua Containers:
- ✅ You need maximum isolation between agents
- ✅ You require cross-platform support (Mac/Linux/Windows)
- ✅ You need guaranteed resource allocation
- ✅ Security and complete isolation are critical
- ⚠️ Note: Most computationally expensive option (see the sketch below for both patterns)
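To make the trade-off concrete, here is a minimal sketch of both patterns side by side, assuming the `Computer` and `ComputerAgent` APIs used throughout this series (the model string is just the Anthropic example from above):
```python
from computer import Computer
from agent import ComputerAgent

MODEL = "anthropic/claude-3-5-sonnet-20241022"

async def app_use_pattern():
    # One macOS VM, several scoped desktops: lightweight, macOS only
    computer = Computer(experiments=["app-use"])
    await computer.run()
    browser_agent = ComputerAgent(model=MODEL, tools=[computer.create_desktop_from_apps(["Safari"])])
    notes_agent = ComputerAgent(model=MODEL, tools=[computer.create_desktop_from_apps(["Notes"])])
    # ...run tasks with agent.run(...) / asyncio.gather(...) as shown earlier

async def containers_pattern():
    # Two separate Cua containers: maximum isolation, highest resource cost
    async with Computer() as vm_a, Computer() as vm_b:
        agent_a = ComputerAgent(model=MODEL, tools=[vm_a])
        agent_b = ComputerAgent(model=MODEL, tools=[vm_b])
        # ...run tasks with agent.run(...) / asyncio.gather(...) as shown earlier
```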
## Pro Tips
1. **Start Small**: Test with one app before creating complex multi-app desktops
2. **Screenshot First**: Take a screenshot to verify your desktop shows the right apps (see the sketch after this list)
3. **Name Your Apps Correctly**: Use exact app names as they appear in the system
4. **Consider Performance**: While lightweight, too many parallel agents can still impact system performance
5. **Plan Your Workflows**: Design agent tasks to minimize app switching for best results
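For tips 2 and 3, a quick sanity check before handing the desktop to an agent can look like this. A minimal sketch that reuses the patterns above (the app names and output path are illustrative):
```python
import asyncio
from computer import Computer

async def verify_desktop():
    # Enable the experimental App-Use feature
    computer = Computer(experiments=["app-use"])
    await computer.run()

    # Use the exact app names as they appear in the system
    desktop = computer.create_desktop_from_apps(["Safari", "Notes"])

    # Screenshot the scoped desktop and inspect it before running an agent
    screenshot = await desktop.interface.screenshot()
    with open("scoped_desktop_check.png", "wb") as f:
        f.write(screenshot)

asyncio.run(verify_desktop())
```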
### How It Works
When you create a desktop session with `create_desktop_from_apps()`, App Use:
- Filters the visual output to show only specified application windows
- Routes input events only to those applications
- Maintains window layout isolation between different sessions
- Shares the underlying file system and OS resources
- **Dynamically adjusts resolution** to fit the window layout and menu bar items
The resolution of these virtual desktops is dynamic, automatically scaling to accommodate the applications' window sizes and menu bar requirements. This ensures that agents always have a clear view of the entire interface they need to interact with, regardless of the specific app combination.
Currently, App Use is limited to macOS only due to its reliance on Quartz, Apple's powerful compositing engine, for creating these virtual desktops. Quartz provides the low-level window management and rendering capabilities that make it possible to composite multiple application windows into isolated visual environments.
## Conclusion
App Use brings a new dimension to computer automation - lightweight, focused, and parallel. Whether you're building a personal iPhone assistant or orchestrating a team of specialized agents, App Use provides the perfect balance of functionality and efficiency.
Ready to try it? Update to the latest Cua version and start focusing your agents today!
```bash
pip install -U "cua-computer[all]" "cua-agent[all]"
```
Happy automating! 🎯🤖

BIN blog/assets/demo_wsb.mp4 Normal file
(additional binary image and video assets added under blog/assets/; not shown)

@@ -0,0 +1,353 @@
# Bringing Computer-Use to the Web
*Published on August 5, 2025 by Morgan Dean*
In one of our original posts, we explored building Computer-Use Operators on macOS - first with a [manual implementation](build-your-own-operator-on-macos-1.md) using OpenAI's `computer-use-preview` model, then with our [cua-agent framework](build-your-own-operator-on-macos-2.md) for Python developers. While these tutorials have been incredibly popular, we've received consistent feedback from our community: **"Can we use C/ua with JavaScript and TypeScript?"**
Today, we're excited to announce the release of the **`@trycua/computer` Web SDK** - a new library that allows you to control your C/ua cloud containers from any JavaScript or TypeScript project. With this library, you can click, type, and grab screenshots from your cloud containers - no extra servers required.
With this new SDK, you can easily develop CUA experiences like the one below, which we will release soon as open source.
<video width="100%" controls>
<source src="/playground_web_ui_sdk_sample.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
Let's see how it works.
## What You'll Learn
By the end of this tutorial, you'll be able to:
- Set up the `@trycua/computer` npm library in any JavaScript/TypeScript project
- Connect OpenAI's computer-use model to C/ua cloud containers from web applications
- Build computer-use agents that work in Node.js, React, Vue, or any web framework
- Handle different types of computer actions (clicking, typing, scrolling) from web code
- Implement the complete computer-use loop in JavaScript/TypeScript
- Integrate AI automation into existing web applications and workflows
**Prerequisites:**
- Node.js 16+ and npm/yarn/pnpm
- Basic JavaScript or TypeScript knowledge
- OpenAI API access (Tier 3+ for computer-use-preview)
- C/ua cloud container credits ([get started here](https://trycua.com/pricing))
**Estimated Time:** 45-60 minutes
## Access Requirements
### OpenAI Model Availability
At the time of writing, the **computer-use-preview** model has limited availability:
- Only accessible to OpenAI tier 3+ users
- Additional application process may be required even for eligible users
- Cannot be used in the OpenAI Playground
- Outside of ChatGPT Operator, usage is restricted to the new **Responses API**
Luckily, the `@trycua/computer` library can be used in conjunction with other models, like [Anthropic's Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool) or [UI-TARS](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B). You'll just have to write your own handler to parse the model output for interfacing with the container.
### C/ua Cloud Containers
To follow this guide, you'll need access to a C/ua cloud container.
Getting access is simple: purchase credits from our [pricing page](https://trycua.com/pricing), then create and provision a new container instance from the [dashboard](https://trycua.com/dashboard/containers). With your container running, you'll be ready to leverage the web SDK and bring automation to your JavaScript or TypeScript applications.
## Understanding the Flow
### OpenAI API Overview
Let's start with the basics. In our case, we'll use OpenAI's API to communicate with their computer-use model.
Think of it like this:
1. We send the model a screenshot of our container and tell it what we want it to do
2. The model looks at the screenshot and decides what actions to take
3. It sends back instructions (like "click here" or "type this")
4. We execute those instructions in our container.
### Model Setup
Here's how we set up the computer-use model for web development:
```javascript
const res = await openai.responses.create({
  model: 'computer-use-preview',
  tools: [
    {
      type: 'computer_use_preview',
      display_width: 1024,
      display_height: 768,
      environment: 'linux', // we're using a linux container
    },
  ],
  input: [
    {
      role: 'user',
      content: [
        // what we want the ai to do
        { type: 'input_text', text: 'Open firefox and go to trycua.com' },
        // first screenshot of the vm
        {
          type: 'input_image',
          image_url: `data:image/png;base64,${screenshotBase64}`,
          detail: 'auto',
        },
      ],
    },
  ],
  truncation: 'auto',
});
```
### Understanding the Response
When we send a request, the API sends back a response that looks like this:
```json
"output": [
  {
    "type": "reasoning",          // The AI explains what it's thinking
    "id": "rs_67cc...",
    "summary": [
      {
        "type": "summary_text",
        "text": "Clicking on the browser address bar."
      }
    ]
  },
  {
    "type": "computer_call",      // The actual action to perform
    "id": "cu_67cc...",
    "call_id": "call_zw3...",     // Used to track previous calls
    "action": {
      "type": "click",            // What kind of action (click, type, etc.)
      "button": "left",           // Which mouse button to use
      "x": 156,                   // Where to click (coordinates)
      "y": 50
    },
    "pending_safety_checks": [],  // Any safety warnings to consider
    "status": "completed"         // Whether the action was successful
  }
]
```
Each response contains:
1. **Reasoning**: The AI's explanation of what it's doing
2. **Action**: The specific computer action to perform
3. **Safety Checks**: Any potential risks to review
4. **Status**: Whether everything worked as planned
## Implementation Guide
### Provision a C/ua Cloud Container
1. Visit [trycua.com](https://trycua.com), sign up, purchase [credits](https://trycua.com/pricing), and create a new container instance from the [dashboard](https://trycua.com/dashboard).
2. Create an API key from the dashboard — be sure to save it in a secure location before continuing.
3. Start the cloud container from the dashboard.
### Environment Setup
1. Install required packages with your preferred package manager:
```bash
npm install --save @trycua/computer # or yarn, pnpm, bun
npm install --save openai # or yarn, pnpm, bun
```
Works with any JavaScript/TypeScript project setup - whether you're using Create React App, Next.js, Vue, Angular, or plain JavaScript.
2. Save your OpenAI API key, C/ua API key, and container name to a `.env` file:
```bash
OPENAI_API_KEY=openai-api-key
CUA_API_KEY=cua-api-key
CUA_CONTAINER_NAME=cua-cloud-container-name
```
These environment variables work the same whether you're using vanilla JavaScript, TypeScript, or any web framework.
## Building the Agent
### Mapping API Actions to `@trycua/computer` Interface Methods
This helper function handles a `computer_call` action from the OpenAI API — converting the action into an equivalent action from the `@trycua/computer` interface. These actions will execute on the initialized `Computer` instance. For example, `await computer.interface.leftClick()` sends a mouse left click to the current cursor position.
Whether you're using JavaScript or TypeScript, the interface remains the same:
```javascript
export async function executeAction(
  computer: Computer,
  action: OpenAI.Responses.ResponseComputerToolCall['action']
) {
  switch (action.type) {
    case 'click':
      const { x, y, button } = action;
      console.log(`Executing click at (${x}, ${y}) with button '${button}'.`);
      await computer.interface.moveCursor(x, y);
      if (button === 'right') await computer.interface.rightClick();
      else await computer.interface.leftClick();
      break;
    case 'type':
      const { text } = action;
      console.log(`Typing text: ${text}`);
      await computer.interface.typeText(text);
      break;
    case 'scroll':
      const { x: locX, y: locY, scroll_x, scroll_y } = action;
      console.log(
        `Scrolling at (${locX}, ${locY}) with offsets (scroll_x=${scroll_x}, scroll_y=${scroll_y}).`
      );
      await computer.interface.moveCursor(locX, locY);
      await computer.interface.scroll(scroll_x, scroll_y);
      break;
    case 'keypress':
      const { keys } = action;
      for (const key of keys) {
        console.log(`Pressing key: ${key}.`);
        // Map common key names to CUA equivalents
        if (key.toLowerCase() === 'enter') {
          await computer.interface.pressKey('return');
        } else if (key.toLowerCase() === 'space') {
          await computer.interface.pressKey('space');
        } else {
          await computer.interface.pressKey(key);
        }
      }
      break;
    case 'wait':
      console.log(`Waiting for 3 seconds.`);
      await new Promise((resolve) => setTimeout(resolve, 3 * 1000));
      break;
    case 'screenshot':
      console.log('Taking screenshot.');
      // This is handled automatically in the main loop, but we can take an extra one if requested
      const screenshot = await computer.interface.screenshot();
      return screenshot;
    default:
      console.log(`Unrecognized action: ${action.type}`);
      break;
  }
}
```
### Implementing the Computer-Use Loop
This section defines a loop that:
1. Initializes the `Computer` instance (connecting to a Linux cloud container).
2. Captures a screenshot of the current state.
3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` model.
4. Processes the returned `computer_call` action and executes it using our helper function.
5. Captures an updated screenshot after the action.
6. Sends the updated screenshot back and loops until no more actions are returned.
```javascript
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Initialize the Computer Connection
const computer = new Computer({
  apiKey: process.env.CUA_API_KEY!,
  name: process.env.CUA_CONTAINER_NAME!,
  osType: OSType.LINUX,
});
await computer.run();

// Take the initial screenshot
const screenshot = await computer.interface.screenshot();
const screenshotBase64 = screenshot.toString('base64');

// Setup openai config for computer use
const computerUseConfig: OpenAI.Responses.ResponseCreateParamsNonStreaming = {
  model: 'computer-use-preview',
  tools: [
    {
      type: 'computer_use_preview',
      display_width: 1024,
      display_height: 768,
      environment: 'linux', // we're using a linux vm
    },
  ],
  truncation: 'auto',
};

// Send initial screenshot to the openai computer use model
let res = await openai.responses.create({
  ...computerUseConfig,
  input: [
    {
      role: 'user',
      content: [
        // what we want the ai to do
        { type: 'input_text', text: 'open firefox and go to trycua.com' },
        // current screenshot of the vm
        {
          type: 'input_image',
          image_url: `data:image/png;base64,${screenshotBase64}`,
          detail: 'auto',
        },
      ],
    },
  ],
});

// Loop until there are no more computer use actions.
while (true) {
  const computerCalls = res.output.filter((o) => o.type === 'computer_call');
  if (computerCalls.length < 1) {
    console.log('No more computer calls. Loop complete.');
    break;
  }

  // Get the first call
  const call = computerCalls[0];
  const action = call.action;
  console.log('Received action from OpenAI Responses API:', action);

  let ackChecks: OpenAI.Responses.ResponseComputerToolCall.PendingSafetyCheck[] = [];
  if (call.pending_safety_checks.length > 0) {
    console.log('Safety checks pending:', call.pending_safety_checks);
    // In a real implementation, you would want to get user confirmation here.
    ackChecks = call.pending_safety_checks;
  }

  // Execute the action in the container
  await executeAction(computer, action);

  // Wait for changes to process within the container (1sec)
  await new Promise((resolve) => setTimeout(resolve, 1000));

  // Capture new screenshot
  const newScreenshot = await computer.interface.screenshot();
  const newScreenshotBase64 = newScreenshot.toString('base64');

  // Screenshot back as computer_call_output
  res = await openai.responses.create({
    ...computerUseConfig,
    previous_response_id: res.id,
    input: [
      {
        type: 'computer_call_output',
        call_id: call.call_id,
        acknowledged_safety_checks: ackChecks,
        output: {
          type: 'computer_screenshot',
          image_url: `data:image/png;base64,${newScreenshotBase64}`,
        },
      },
    ],
  });
}
```
You can find the full example on [GitHub](https://github.com/trycua/cua/tree/main/examples/computer-example-ts).
## What's Next?
The `@trycua/computer` Web SDK opens up some interesting possibilities. You could build browser-based testing tools, create interactive demos for your products, or automate repetitive workflows directly from your web apps.
We're working on more examples and better documentation - if you build something cool with this SDK, we'd love to see it. Drop by our [Discord](https://discord.gg/cua-ai) and share what you're working on.
Happy automating on the web!


@@ -0,0 +1,547 @@
# Build Your Own Operator on macOS - Part 1
*Published on March 31, 2025 by Francesco Bonacci*
In this first blogpost, we'll learn how to build our own Computer-Use Operator using OpenAI's `computer-use-preview` model. But first, let's understand what some common terms mean:
- A **Virtual Machine (VM)** is like a computer within your computer - a safe, isolated environment where the AI can work without affecting your main system.
- **computer-use-preview** is OpenAI's specialized language model trained to understand and interact with computer interfaces through screenshots.
- A **Computer-Use Agent** is an AI agent that can control a computer just like a human would - clicking buttons, typing text, and interacting with applications.
Our Operator will run in an isolated macOS VM, by making use of our [cua-computer](https://github.com/trycua/cua/tree/main/libs/computer) package and [lume virtualization CLI](https://github.com/trycua/cua/tree/main/libs/lume).
Check out what it looks like to use your own Operator from a Gradio app:
<video width="100%" controls>
<source src="/demo_gradio.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## What You'll Learn
By the end of this tutorial, you'll be able to:
- Set up a macOS virtual machine for AI automation
- Connect OpenAI's computer-use model to your VM
- Create a basic loop for the AI to interact with your VM
- Handle different types of computer actions (clicking, typing, etc.)
- Implement safety checks and error handling
**Prerequisites:**
- macOS Sonoma (14.0) or later
- 8GB RAM minimum (16GB recommended)
- OpenAI API access (Tier 3+)
- Basic Python knowledge
- Familiarity with terminal commands
**Estimated Time:** 45-60 minutes
## Introduction to Computer-Use Agents
Last March OpenAI released a fine-tuned version of GPT-4o, namely [CUA](https://openai.com/index/computer-using-agent/), introducing pixel-level vision capabilities with advanced reasoning through reinforcement learning. This fine-tuning enables the computer-use model to interpret screenshots and interact with graphical user interfaces on a pixel-level such as buttons, menus, and text fields - mimicking human interactions on a computer screen. It scores a remarkable 38.1% success rate on [OSWorld](https://os-world.github.io) - a benchmark for Computer-Use agents on Linux and Windows. This is the 2nd available model after Anthropic's [Claude 3.5 Sonnet](https://www.anthropic.com/news/3-5-models-and-computer-use) to support computer-use capabilities natively with no external models (e.g. accessory [SoM (Set-of-Mark)](https://arxiv.org/abs/2310.11441) and OCR runs).
Professor Ethan Mollick provides an excellent explanation of computer-use agents in this article: [When you give a Claude a mouse](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse).
### ChatGPT Operator
OpenAI's computer-use model powers [ChatGPT Operator](https://openai.com/index/introducing-operator), a Chromium-based interface exclusively available to ChatGPT Pro subscribers. Users leverage this functionality to automate web-based tasks such as online shopping, expense report submission, and booking reservations by interacting with websites in a human-like manner.
## Benefits of Custom Operators
### Why Build Your Own?
While OpenAI's Operator uses a controlled Chromium VM instance, there are scenarios where you may want to use your own VM with full desktop capabilities. Here are some examples:
- Automating native macOS apps like Finder, Xcode
- Managing files, changing settings, and running terminal commands
- Testing desktop software and applications
- Creating workflows that combine web and desktop tasks
- Automating media editing in apps like Final Cut Pro and Blender
This gives you more control and flexibility to automate tasks beyond just web browsing, with full access to interact with native applications and system-level operations. Additionally, running your own VM locally provides better privacy for sensitive user files and delivers superior performance by leveraging your own hardware instead of renting expensive Cloud VMs.
## Access Requirements
### Model Availability
At the time of writing, the **computer-use-preview** model has limited availability:
- Only accessible to OpenAI tier 3+ users
- Additional application process may be required even for eligible users
- Cannot be used in the OpenAI Playground
- Outside of ChatGPT Operator, usage is restricted to the new **Responses API**
## Understanding the OpenAI API
### Responses API Overview
Let's start with the basics. In our case, we'll use OpenAI's Responses API to communicate with their computer-use model.
Think of it like this:
1. We send the model a screenshot of our VM and tell it what we want it to do
2. The model looks at the screenshot and decides what actions to take
3. It sends back instructions (like "click here" or "type this")
4. We execute those instructions in our VM
The [Responses API](https://platform.openai.com/docs/guides/responses) is OpenAI's newest way to interact with their AI models. It comes with several built-in tools:
- **Web search**: Let the AI search the internet
- **File search**: Help the AI find documents
- **Computer use**: Allow the AI to control a computer (what we'll be using)
At the time of writing, the computer-use model is only available through the Responses API.
### Responses API Examples
Let's look at some simple examples. We'll start with the traditional way of using OpenAI's API with Chat Completions, then show the new Responses API primitive.
Chat Completions:
```python
# The old way required managing conversation history manually
messages = [{"role": "user", "content": "Hello"}]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages  # We had to track all messages ourselves
)
messages.append(response.choices[0].message)  # Manual message tracking
```
Responses API:
```python
# Example 1: Simple web search
# The API handles all the complexity for us
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "What's the latest news about AI?"
    }],
    tools=[{
        "type": "web_search",  # Tell the API to use web search
        "search_query": "latest AI news"
    }]
)

# Example 2: File search
# Looking for specific documents becomes easy
response = client.responses.create(
    model="gpt-4",
    input=[{
        "role": "user",
        "content": "Find documents about project X"
    }],
    tools=[{
        "type": "file_search",
        "query": "project X",
        "file_types": ["pdf", "docx"]  # Specify which file types to look for
    }]
)
```
### Computer-Use Model Setup
For our operator, we'll use the computer-use model. Here's how we set it up:
```python
# Set up the computer-use model to control our VM
response = client.responses.create(
    model="computer-use-preview",  # Special model for computer control
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,   # Size of our VM screen
        "display_height": 768,
        "environment": "mac"     # Tell it we're using macOS
    }],
    input=[
        {
            "role": "user",
            "content": [
                # What we want the AI to do
                {"type": "input_text", "text": "Open Safari and go to google.com"},
                # Current screenshot of our VM
                {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
            ]
        }
    ],
    truncation="auto"  # Let OpenAI handle message length
)
```
### Understanding the Response
When we send a request, the API sends back a response that looks like this:
```json
"output": [
  {
    "type": "reasoning",          # The AI explains what it's thinking
    "id": "rs_67cc...",
    "summary": [
      {
        "type": "summary_text",
        "text": "Clicking on the browser address bar."
      }
    ]
  },
  {
    "type": "computer_call",      # The actual action to perform
    "id": "cu_67cc...",
    "call_id": "call_zw3...",
    "action": {
      "type": "click",            # What kind of action (click, type, etc.)
      "button": "left",           # Which mouse button to use
      "x": 156,                   # Where to click (coordinates)
      "y": 50
    },
    "pending_safety_checks": [],  # Any safety warnings to consider
    "status": "completed"         # Whether the action was successful
  }
]
```
Each response contains the following (see the parsing sketch after this list):
1. **Reasoning**: The AI's explanation of what it's doing
2. **Action**: The specific computer action to perform
3. **Safety Checks**: Any potential risks to review
4. **Status**: Whether everything worked as planned
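In Python, pulling these pieces out of the response object looks like the following. This is a minimal sketch assuming `response` is the result of the `responses.create(...)` call shown above:
```python
# Inspect the reasoning, action, and safety checks returned by the API
for item in response.output:
    if item.type == "reasoning":
        # The model's explanation of what it is about to do
        for summary in item.summary:
            print("Reasoning:", summary.text)
    elif item.type == "computer_call":
        action = item.action
        print("Action:", action.type, action)
        if item.pending_safety_checks:
            # Surface these to the user before executing the action
            print("Safety checks:", item.pending_safety_checks)
        print("Status:", item.status)
```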
## CUA-Computer Interface
### Architecture Overview
Let's break down the main components of our system and how they work together:
1. **The Virtual Machine (VM)**
- Think of this as a safe playground for our AI
- It's a complete macOS system running inside your computer
- Anything the AI does stays inside this VM, keeping your main system safe
- We use `lume` to create and manage this VM
2. **The Computer Interface (CUI)**
- This is how we control the VM
- It can move the mouse, type text, and take screenshots
- Works like a remote control for the VM
- Built using our `cua-computer` package
3. **The OpenAI Model**
- This is the brain of our operator
- It looks at screenshots of the VM
- Decides what actions to take
- Sends back instructions like "click here" or "type this"
Here's how they all work together:
```mermaid
sequenceDiagram
participant User as You
participant CUI as Computer Interface
participant VM as Virtual Machine
participant AI as OpenAI API
Note over User,AI: The Main Loop
User->>CUI: Start the operator
CUI->>VM: Create macOS sandbox
activate VM
VM-->>CUI: VM is ready
loop Action Loop
Note over CUI,AI: Each iteration
CUI->>VM: Take a screenshot
VM-->>CUI: Return current screen
CUI->>AI: Send screenshot + instructions
AI-->>CUI: Return next action
Note over CUI,VM: Execute the action
alt Mouse Click
CUI->>VM: Move and click mouse
else Type Text
CUI->>VM: Type characters
else Scroll Screen
CUI->>VM: Scroll window
else Press Keys
CUI->>VM: Press keyboard keys
else Wait
CUI->>VM: Pause for a moment
end
end
VM-->>CUI: Task finished
deactivate VM
CUI-->>User: All done!
```
The diagram above shows how information flows through our system:
1. You start the operator
2. The Computer Interface creates a virtual macOS
3. Then it enters a loop:
- Take a picture of the VM screen
- Send it to OpenAI with instructions
- Get back an action to perform
- Execute that action in the VM
- Repeat until the task is done
This design keeps everything organized and safe. The AI can only interact with the VM through our controlled interface, and the VM keeps the AI's actions isolated from your main system.
---
## Implementation Guide
### Prerequisites
1. **Lume CLI Setup**
For installing the standalone lume binary, run the following command from a terminal, or download the [latest pkg](https://github.com/trycua/cua/releases/latest/download/lume.pkg.tar.gz).
```bash
sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
```
**Important Storage Notes:**
- Initial download requires 80GB of free space
- After first run, space usage reduces to ~30GB due to macOS's sparse file system
- VMs are stored in `~/.lume`
- Cached images are stored in `~/.lume/cache`
You can check your downloaded VM images anytime:
```bash
lume ls
```
Example output:
| name | os | cpu | memory | disk | display | status | ip | vnc |
|--------------------------|---------|-------|---------|----------------|-----------|-----------|----------------|---------------------------------------------------|
| macos-sequoia-cua:latest | macOS | 12 | 16.00G | 64.5GB/80.0GB | 1024x768 | running | 192.168.64.78 | vnc://:kind-forest-zulu-island@127.0.0.1:56085 |
After checking your available images, you can run the VM to ensure everything is working correctly:
```bash
lume run macos-sequoia-cua:latest
```
2. **Python Environment Setup**
**Note**: The `cua-computer` package requires Python 3.10 or later. We recommend creating a dedicated Python environment:
**Using venv:**
```bash
python -m venv cua-env
source cua-env/bin/activate
```
**Using conda:**
```bash
conda create -n cua-env python=3.10
conda activate cua-env
```
Then install the required packages:
```bash
pip install openai
pip install cua-computer
```
Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).
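For example, you can export the key in your shell (`export OPENAI_API_KEY=...`) and let the `openai` package pick it up automatically, or set it explicitly in code. A minimal sketch:
```python
import os
import openai

# The openai package reads OPENAI_API_KEY from the environment by default;
# setting it explicitly also works.
openai.api_key = os.environ["OPENAI_API_KEY"]
```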
### Building the Operator
#### Importing Required Modules
With the prerequisites installed and configured, we're ready to build our first operator.
The following example uses asynchronous Python (async/await). You can run it either in a VS Code Notebook or as a standalone Python script.
```python
import asyncio
import base64
import openai
from computer import Computer
```
#### Mapping API Actions to CUA Methods
The following helper function converts a `computer_call` action from the OpenAI Responses API into corresponding commands on the CUI interface. For example, if the API instructs a `click` action, we move the cursor and perform a left click on the lume VM Sandbox. We will use the computer interface to execute the actions.
```python
async def execute_action(computer, action):
    action_type = action.type

    if action_type == "click":
        x = action.x
        y = action.y
        button = action.button
        print(f"Executing click at ({x}, {y}) with button '{button}'")
        await computer.interface.move_cursor(x, y)
        if button == "right":
            await computer.interface.right_click()
        else:
            await computer.interface.left_click()
    elif action_type == "type":
        text = action.text
        print(f"Typing text: {text}")
        await computer.interface.type_text(text)
    elif action_type == "scroll":
        x = action.x
        y = action.y
        scroll_x = action.scroll_x
        scroll_y = action.scroll_y
        print(f"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})")
        await computer.interface.move_cursor(x, y)
        await computer.interface.scroll(scroll_y)  # Using vertical scroll only
    elif action_type == "keypress":
        keys = action.keys
        for key in keys:
            print(f"Pressing key: {key}")
            # Map common key names to CUA equivalents
            if key.lower() == "enter":
                await computer.interface.press_key("return")
            elif key.lower() == "space":
                await computer.interface.press_key("space")
            else:
                await computer.interface.press_key(key)
    elif action_type == "wait":
        wait_time = action.time
        print(f"Waiting for {wait_time} seconds")
        await asyncio.sleep(wait_time)
    elif action_type == "screenshot":
        print("Taking screenshot")
        # This is handled automatically in the main loop, but we can take an extra one if requested
        screenshot = await computer.interface.screenshot()
        return screenshot
    else:
        print(f"Unrecognized action: {action_type}")
```
#### Implementing the Computer-Use Loop
This section defines a loop that:
1. Initializes the cua-computer instance (connecting to a macOS sandbox).
2. Captures a screenshot of the current state.
3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` model.
4. Processes the returned `computer_call` action and executes it using our helper function.
5. Captures an updated screenshot after the action and sends it back as a `computer_call_output`.
These steps repeat until no further `computer_call` actions are returned.
```python
async def cua_openai_loop():
    # Initialize the lume computer instance (macOS sandbox)
    async with Computer(
        display="1024x768",
        memory="4GB",
        cpu="2",
        os_type="macos"
    ) as computer:
        await computer.run()  # Start the lume VM

        # Capture the initial screenshot
        screenshot = await computer.interface.screenshot()
        screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')

        # Initial request to start the loop
        response = openai.responses.create(
            model="computer-use-preview",
            tools=[{
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "mac"
            }],
            input=[
                {
                    "role": "user",
                    "content": [
                        {"type": "input_text", "text": "Open Safari, download and install Cursor."},
                        {"type": "input_image", "image_url": f"data:image/png;base64,{screenshot_base64}"}
                    ]
                }
            ],
            truncation="auto"
        )

        # Continue the loop until no more computer_call actions
        while True:
            # Check for computer_call actions
            computer_calls = [item for item in response.output if item and item.type == "computer_call"]
            if not computer_calls:
                print("No more computer calls. Loop complete.")
                break

            # Get the first computer call
            call = computer_calls[0]
            last_call_id = call.call_id
            action = call.action
            print("Received action from OpenAI Responses API:", action)

            # Handle any pending safety checks
            if call.pending_safety_checks:
                print("Safety checks pending:", call.pending_safety_checks)
                # In a real implementation, you would want to get user confirmation here
                acknowledged_checks = call.pending_safety_checks
            else:
                acknowledged_checks = []

            # Execute the action
            await execute_action(computer, action)
            await asyncio.sleep(1)  # Allow time for changes to take effect

            # Capture new screenshot after action
            new_screenshot = await computer.interface.screenshot()
            new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')

            # Send the screenshot back as computer_call_output
            response = openai.responses.create(
                model="computer-use-preview",
                tools=[{
                    "type": "computer_use_preview",
                    "display_width": 1024,
                    "display_height": 768,
                    "environment": "mac"
                }],
                input=[{
                    "type": "computer_call_output",
                    "call_id": last_call_id,
                    "acknowledged_safety_checks": acknowledged_checks,
                    "output": {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{new_screenshot_base64}"
                    }
                }],
                truncation="auto"
            )

        # End the session
        await computer.stop()

# Run the loop
if __name__ == "__main__":
    asyncio.run(cua_openai_loop())
```
You can find the full code in our [notebook](https://github.com/trycua/cua/blob/main/notebooks/blog/build-your-own-operator-on-macos-1.ipynb).
#### Request Handling Differences
The first request to the OpenAI Responses API is special in that it includes the initial screenshot and prompt. Subsequent requests are handled differently, using the `computer_call_output` type to provide feedback on the executed action.
##### Initial Request Format
- We use `role: "user"` with `content` that contains both `input_text` (the prompt) and `input_image` (the screenshot)
##### Subsequent Request Format
- We use `type: "computer_call_output"` instead of the user role
- We include the `call_id` to link the output to the specific previous action that was executed
- We provide any `acknowledged_safety_checks` that were approved
- We include the new screenshot in the `output` field
This structured approach allows the API to maintain context and continuity throughout the interaction session.
**Note**: For multi-turn conversations, you should include the `previous_response_id` in your initial requests when starting a new conversation with prior context. However, when using `computer_call_output` for action feedback, you don't need to explicitly manage the conversation history - OpenAI's API automatically tracks the context using the `call_id`. The `previous_response_id` is primarily important when the user provides additional instructions or when starting a new request that should continue from a previous session.
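For example, when the user gives a follow-up instruction after a completed run, you can chain the new turn to the previous exchange with `previous_response_id`. A minimal sketch, assuming `response` holds the last response from the loop above (the follow-up task text is illustrative):
```python
# Start a new user turn that continues from the previous response
# instead of resending the full conversation history.
follow_up = openai.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,  # link this turn to the prior exchange
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "mac"
    }],
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Now quit Safari."}
        ]
    }],
    truncation="auto"
)
```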
## Conclusion
### Summary
This blogpost demonstrates an OpenAI Computer-Use loop in which:
- A macOS sandbox is controlled through the cua-computer interface.
- A screenshot and prompt are sent to the OpenAI Responses API.
- The returned action (e.g. a click or type command) is executed via the same interface.
In a production setting, you would add more robust error handling and handle safety checks (e.g. requiring user confirmation) as needed.
### Next Steps
In the next blogpost, we'll introduce our Agent framework which abstracts away all these tedious implementation steps. This framework provides a higher-level API that handles the interaction loop between OpenAI's computer-use model and the macOS sandbox, allowing you to focus on building sophisticated applications rather than managing the low-level details we've explored here. Can't wait? Check out the [cua-agent](https://github.com/trycua/cua/tree/main/libs/agent) package!
### Resources
- [OpenAI Computer-Use docs](https://platform.openai.com/docs/guides/tools-computer-use)
- [cua-computer](https://github.com/trycua/cua/tree/main/libs/computer)
- [lume](https://github.com/trycua/cua/tree/main/libs/lume)


@@ -0,0 +1,655 @@
# Build Your Own Operator on macOS - Part 2
*Published on April 27, 2025 by Francesco Bonacci*
In our [previous post](build-your-own-operator-on-macos-1.md), we built a basic Computer-Use Operator from scratch using OpenAI's `computer-use-preview` model and our [cua-computer](https://pypi.org/project/cua-computer) package. While educational, implementing the control loop manually can be tedious and error-prone.
In this follow-up, we'll explore our [cua-agent](https://pypi.org/project/cua-agent) framework - a high-level abstraction that handles all the complexity of VM interaction, screenshot processing, model communication, and action execution automatically.
<video width="100%" controls>
<source src="/demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## What You'll Learn
By the end of this tutorial, you'll be able to:
- Set up the `cua-agent` framework with various agent loop types and model providers
- Understand the different agent loop types and their capabilities
- Work with local models for cost-effective workflows
- Use a simple UI for your operator
**Prerequisites:**
- Completed setup from Part 1 ([lume CLI installed](https://github.com/trycua/cua?tab=readme-ov-file#option-2-full-computer-use-agent-capabilities), macOS CUA image already pulled)
- Python 3.10+. We recommend using Conda (or Anaconda) to create an ad hoc Python environment.
- API keys for OpenAI and/or Anthropic (optional for local models)
**Estimated Time:** 30-45 minutes
## Introduction to cua-agent
The `cua-agent` framework is designed to simplify building Computer-Use Agents. It abstracts away the complex interaction loop we built manually in Part 1, letting you focus on defining tasks rather than implementing the machinery. Among other features, it includes:
- **Multiple Provider Support**: Works with OpenAI, Anthropic, UI-Tars, local models (via Ollama), or any OpenAI-compatible model (e.g. LM Studio, vLLM, LocalAI, OpenRouter, Groq, etc.)
- **Flexible Loop Types**: Different implementations optimized for various models (e.g. OpenAI vs. Anthropic)
- **Structured Responses**: Clean, consistent output following the OpenAI Agent SDK specification we touched on in Part 1
- **Local Model Support**: Run cost-effectively with locally hosted models (Ollama, LM Studio, vLLM, LocalAI, etc.)
- **Gradio UI**: Optional visual interface for interacting with your agent
## Installation
Let's start by installing the `cua-agent` package. You can install it with all features or selectively install only what you need.
From your python 3.10+ environment, run:
```bash
# For all features
pip install "cua-agent[all]"
# Or selectively install only what you need
pip install "cua-agent[openai]" # OpenAI support
pip install "cua-agent[anthropic]" # Anthropic support
pip install "cua-agent[uitars]" # UI-Tars support
pip install "cua-agent[omni]" # OmniParser + VLMs support
pip install "cua-agent[ui]" # Gradio UI
```
## Setting Up Your Environment
Before running any code examples, let's set up a proper environment:
1. **Create a new directory** for your project:
```bash
mkdir cua-agent-tutorial
cd cua-agent-tutorial
```
2. **Set up a Python environment** using one of these methods:
**Option A: Using conda command line**
```bash
# Using conda
conda create -n cua-agent python=3.10
conda activate cua-agent
```
**Option B: Using Anaconda Navigator UI**
- Open Anaconda Navigator
- Click on "Environments" in the left sidebar
- Click the "Create" button at the bottom
- Name your environment "cua-agent"
- Select Python 3.10
- Click "Create"
- Once created, select the environment and click "Open Terminal" to activate it
**Option C: Using venv**
```bash
python -m venv cua-env
source cua-env/bin/activate # On macOS/Linux
```
3. **Install the cua-agent package**:
```bash
pip install "cua-agent[all]"
```
4. **Set up your API keys as environment variables**:
```bash
# For OpenAI models
export OPENAI_API_KEY=your_openai_key_here
# For Anthropic models (if needed)
export ANTHROPIC_API_KEY=your_anthropic_key_here
```
5. **Create a Python file or notebook**:
**Option A: Create a Python script**
```bash
# For a Python script
touch cua_agent_example.py
```
**Option B: Use VS Code notebooks**
- Open VS Code
- Install the Python extension if you haven't already
- Create a new file with a `.ipynb` extension (e.g., `cua_agent_tutorial.ipynb`)
- Select your Python environment when prompted
- You can now create and run code cells in the notebook interface
Now you're ready to run the code examples!
## Understanding Agent Loops
If you recall from Part 1, we had to implement a custom interaction loop to interact with the computer-use-preview model.
In the `cua-agent` framework, an **Agent Loop** is the core abstraction that implements the continuous interaction cycle between an AI model and the computer environment. It manages the flow of:
1. Capturing screenshots of the computer's state
2. Processing these screenshots (with or without UI element detection)
3. Sending this visual context to an AI model along with the task instructions
4. Receiving the model's decisions on what actions to take
5. Safely executing these actions in the environment
6. Repeating this cycle until the task is complete
The loop handles all the complex error handling, retries, context management, and model-specific interaction patterns so you don't have to implement them yourself.
While the core concept remains the same across all agent loops, different AI models require specialized handling for optimal performance. To address this, the framework provides four agent loop implementations, each designed for a different computer-use modality.
| Agent Loop | Supported Models | Description | Set-Of-Marks |
|:-----------|:-----------------|:------------|:-------------|
| `AgentLoop.OPENAI` | • `computer_use_preview` | Use OpenAI Operator CUA Preview model | Not Required |
| `AgentLoop.ANTHROPIC` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219` | Use Anthropic Computer-Use Beta Tools | Not Required |
| `AgentLoop.UITARS` | • `ByteDance-Seed/UI-TARS-1.5-7B` | Uses ByteDance's UI-TARS 1.5 model | Not Required |
| `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama or OpenAI-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |
Each loop handles the same basic pattern we implemented manually in Part 1:
1. Take a screenshot of the VM
2. Send the screenshot and task to the AI model
3. Receive an action to perform
4. Execute the action
5. Repeat until the task is complete
### Why Different Agent Loops?
The `cua-agent` framework provides multiple agent loop implementations to abstract away the complexity of interacting with different CUA models. Each provider has unique API structures, response formats, conventions and capabilities that require specialized handling:
- **OpenAI Loop**: Uses the Responses API with a specific `computer_call_output` format for sending screenshots after actions. Requires handling safety checks and maintains a chain of requests using `previous_response_id`.
- **Anthropic Loop**: Implements a [multi-agent loop pattern](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use#understanding-the-multi-agent-loop) with a sophisticated message handling system, supporting various API providers (Anthropic, Bedrock, Vertex) with token management and prompt caching capabilities.
- **UI-TARS Loop**: Requires custom message formatting and specialized parsing to extract actions from text responses using a "box token" system for UI element identification.
- **OMNI Loop**: Uses [Microsoft's OmniParser](https://github.com/microsoft/OmniParser) to create a [Set-of-Marks (SoM)](https://arxiv.org/abs/2310.11441) representation of the UI, enabling any vision-language model to interact with interfaces without specialized UI training.
These abstractions allow you to easily switch between providers without changing your application code. All loop implementations are available in the [cua-agent GitHub repository](https://github.com/trycua/cua/tree/main/libs/agent/agent/providers).
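In practice, switching providers is just a matter of changing the model string passed to `ComputerAgent`. A minimal sketch using the model strings that appear in this post (run whichever provider you have access to):
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

# Model strings used elsewhere in this post; each is served by a different agent loop.
MODELS = [
    "openai/computer-use-preview",           # OpenAI loop
    "anthropic/claude-3-5-sonnet-20241022",  # Anthropic loop
    "omniparser+ollama_chat/gemma3",         # OMNI loop with a local model
]

async def run_with(model: str, task: str):
    async with Computer() as macos_computer:
        agent = ComputerAgent(model=model, tools=[macos_computer])
        async for result in agent.run(task):
            print(f"[{model}] {result.get('text')}")

if __name__ == "__main__":
    asyncio.run(run_with(MODELS[0], "Open Safari and go to trycua.com"))
```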
Choosing the right agent loop depends not only on your API access and technical requirements but also on the specific tasks you need to accomplish. To make an informed decision, it's helpful to understand how these underlying models perform across different computing environments from desktop operating systems to web browsers and mobile interfaces.
## Computer-Use Model Capabilities
The performance of different Computer-Use models varies significantly across tasks. These benchmark evaluations measure an agent's ability to follow instructions and complete real-world tasks in different computing environments.
| Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA | Human |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------|-------------|-------------|----------------------|-------------|
| **Computer Use** | [OSworld](https://arxiv.org/abs/2404.07972) (100 steps) | **42.5** | 36.4 | 28 | 38.1 (200 step) | 72.4 |
| | [Windows Agent Arena](https://arxiv.org/abs/2409.08264) (50 steps) | **42.1** | - | - | 29.8 | - |
| **Browser Use** | [WebVoyager](https://arxiv.org/abs/2401.13919) | 84.8 | **87** | 84.1 | 87 | - |
| | [Online-Mind2web](https://arxiv.org/abs/2504.01382) | **75.8** | 71 | 62.9 | 71 | - |
| **Phone Use** | [Android World](https://arxiv.org/abs/2405.14573) | **64.2** | - | - | 59.5 | - |
### When to Use Each Loop
- **AgentLoop.OPENAI**: Choose when you have OpenAI Tier 3 access and need the most capable computer-use agent for web-based tasks. Uses the same [OpenAI Computer-Use Loop](https://platform.openai.com/docs/guides/tools-computer-use) as Part 1, delivering strong performance on browser-based benchmarks.
- **AgentLoop.ANTHROPIC**: Ideal for users with Anthropic API access who need strong reasoning capabilities with computer-use abilities. Works with `claude-3-5-sonnet-20240620` and `claude-3-7-sonnet-20250219` models following [Anthropic's Computer-Use tools](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use#understanding-the-multi-agent-loop).
- **AgentLoop.UITARS**: Best for scenarios requiring more powerful OS/desktop, and latency-sensitive automation, as UI-TARS-1.5 leads in OS capabilities benchmarks. Requires running the model locally or accessing it through compatible endpoints (e.g. on Hugging Face).
- **AgentLoop.OMNI**: The most flexible option that works with virtually any vision-language model including local and open-source ones. Perfect for cost-effective development or when you need to use models without native computer-use capabilities.
Now that we understand the capabilities and strengths of different models, let's see how easy it is to implement a Computer-Use Agent using the `cua-agent` framework. Let's look at the implementation details.
## Creating Your First Computer-Use Agent
With the `cua-agent` framework, creating a Computer-Use Agent becomes remarkably straightforward. The framework handles all the complexities of model interaction, screenshot processing, and action execution behind the scenes. Let's look at a simple example of how to build your first agent:
**How to run this example:**
1. Create a new file named `simple_task.py` in your text editor or IDE (like VS Code, PyCharm, or Cursor)
2. Copy and paste the following code:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def run_simple_task():
    async with Computer() as macos_computer:
        # Create agent with OpenAI loop
        agent = ComputerAgent(
            model="openai/computer-use-preview",
            tools=[macos_computer]
        )

        # Define a simple task
        task = "Open Safari and search for 'Python tutorials'"

        # Run the task and process responses
        async for result in agent.run(task):
            print(f"Action: {result.get('text')}")

# Run the example
if __name__ == "__main__":
    asyncio.run(run_simple_task())
```
3. Save the file
4. Open a terminal, navigate to your project directory, and run:
```bash
python simple_task.py
```
5. The code will initialize the macOS virtual machine, create an agent, and execute the task of opening Safari and searching for Python tutorials.
You can also run this in a VS Code notebook:
1. Create a new notebook in VS Code (.ipynb file)
2. Copy the code into a cell (without the `if __name__ == "__main__":` part)
3. Run the cell to execute the code
You can find the full code in our [notebook](https://github.com/trycua/cua/blob/main/notebooks/blog/build-your-own-operator-on-macos-2.ipynb).
Compare this to the manual implementation from Part 1 - we've reduced dozens of lines of code to just a few. The cua-agent framework handles all the complex logic internally, letting you focus on the overarching agentic system.
## Working with Multiple Tasks
Another advantage of the cua-agent framework is easily chaining multiple tasks. Instead of managing complex state between tasks, you can simply provide a sequence of instructions to be executed in order:
**How to run this example:**
1. Create a new file named `multi_task.py` with the following code:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def run_multi_task_workflow():
    async with Computer() as macos_computer:
        agent = ComputerAgent(
            model="anthropic/claude-3-5-sonnet-20241022",
            tools=[macos_computer]
        )

        tasks = [
            "Open Safari and go to github.com",
            "Search for 'trycua/cua'",
            "Open the repository page",
            "Click on the 'Issues' tab",
            "Read the first open issue"
        ]

        for i, task in enumerate(tasks):
            print(f"\nTask {i+1}/{len(tasks)}: {task}")
            async for result in agent.run(task):
                # Print just the action description for brevity
                if result.get("text"):
                    print(f"  → {result.get('text')}")
            print(f"✅ Task {i+1} completed")

if __name__ == "__main__":
    asyncio.run(run_multi_task_workflow())
```
2. Save the file
3. Make sure you have set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY=your_anthropic_key_here
```
4. Run the script:
```bash
python multi_task.py
```
This pattern is particularly useful for creating workflows that navigate through multiple steps of an application or process. The agent maintains visual context between tasks, making it more likely to successfully complete complex sequences of actions.
## Understanding the Response Format
Each action taken by the agent returns a structured response following the OpenAI Agent SDK specification. This standardized format makes it easy to extract detailed information about what the agent is doing and why:
```python
async for result in agent.run(task):
# Basic information
print(f"Response ID: {result.get('id')}")
print(f"Response Text: {result.get('text')}")
# Detailed token usage statistics
usage = result.get('usage')
if usage:
print(f"Input Tokens: {usage.get('input_tokens')}")
print(f"Output Tokens: {usage.get('output_tokens')}")
# Reasoning and actions
for output in result.get('output', []):
if output.get('type') == 'reasoning':
print(f"Reasoning: {output.get('summary', [{}])[0].get('text')}")
elif output.get('type') == 'computer_call':
action = output.get('action', {})
print(f"Action: {action.get('type')} at ({action.get('x')}, {action.get('y')})")
```
This structured format allows you to:
- Log detailed information about agent actions
- Provide real-time feedback to users
- Track token usage for cost monitoring (see the sketch after this list)
- Access the reasoning behind decisions for debugging or user explanation
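For example, a minimal sketch of the cost-monitoring idea is to accumulate the usage fields across a run (field names follow the response format shown above):
```python
# Sketch: accumulate token usage across a run for cost monitoring
total_input_tokens = 0
total_output_tokens = 0

async for result in agent.run(task):
    usage = result.get("usage") or {}
    total_input_tokens += usage.get("input_tokens") or 0
    total_output_tokens += usage.get("output_tokens") or 0

print(f"Totals - input: {total_input_tokens}, output: {total_output_tokens}")
```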
## Using Local Models with OMNI
One of the most powerful features of the framework is the ability to use local models via the OMNI loop. This approach dramatically reduces costs while maintaining acceptable reliability for many agentic workflows:
**How to run this example:**
1. First, you'll need to install Ollama for running local models:
- Visit [ollama.com](https://ollama.com) and download the installer for your OS
- Follow the installation instructions
- Pull the Gemma 3 model:
```bash
ollama pull gemma3:4b-it-q4_K_M
```
2. Create a file named `local_model.py` with this code:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent
async def run_with_local_model():
async with Computer() as macos_computer:
agent = ComputerAgent(
model="omniparser+ollama_chat/gemma3",
tools=[macos_computer]
)
task = "Open the Calculator app and perform a simple calculation"
async for result in agent.run(task):
print(f"Action: {result.get('text')}")
if __name__ == "__main__":
asyncio.run(run_with_local_model())
```
3. Run the script:
```bash
python local_model.py
```
You can also use other local model servers with the OAICOMPAT provider, which enables compatibility with any API endpoint following the OpenAI API structure:
```python
# LLM and LLMProvider come from the agent SDK; the exact import path may vary by version
agent = ComputerAgent(
    model=LLM(
        provider=LLMProvider.OAICOMPAT,
        name="gemma-3-12b-it",
        provider_base_url="http://localhost:1234/v1"  # LM Studio endpoint
    ),
    tools=[macos_computer]
)
```
Common local endpoints include:
- LM Studio: `http://localhost:1234/v1`
- vLLM: `http://localhost:8000/v1`
- LocalAI: `http://localhost:8080/v1`
- Ollama with OpenAI compat: `http://localhost:11434/v1`
This approach is perfect for:
- Development and testing without incurring API costs
- Offline or air-gapped environments where API access isn't possible
- Privacy-sensitive applications where data can't leave your network
- Experimenting with different models to find the best fit for your use case
## Deploying and Using UI-TARS
UI-TARS is ByteDance's Computer-Use model designed for navigating OS-level interfaces. It shows excellent performance on desktop OS tasks. To use UI-TARS, you'll first need to deploy the model.
### Deployment Options
1. **Local Deployment**: Follow the [UI-TARS deployment guide](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md) to run the model locally.
2. **Hugging Face Endpoint**: Deploy UI-TARS on Hugging Face Inference Endpoints, which will give you a URL like:
`https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1`
3. **Using with cua-agent**: Once deployed, you can use UI-TARS with the cua-agent framework:
```python
agent = ComputerAgent(
model=LLM(
provider=LLMProvider.OAICOMPAT,
name="tgi",
provider_base_url="https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1"
),
tools=[macos_computer]
)
```
UI-TARS is particularly useful for desktop automation tasks, as it shows the highest performance on OS-level benchmarks like OSWorld and Windows Agent Arena.
## Understanding Agent Responses in Detail
The `run()` method of your agent yields structured responses that follow the OpenAI Agent SDK specification. This provides a rich set of information beyond just the basic action text:
```python
async for result in agent.run(task):
# Basic ID and text
print("Response ID:", result.get("id"))
print("Response Text:", result.get("text"))
# Token usage statistics
usage = result.get("usage")
if usage:
print("\nUsage Details:")
print(f" Input Tokens: {usage.get('input_tokens')}")
if "input_tokens_details" in usage:
print(f" Input Tokens Details: {usage.get('input_tokens_details')}")
print(f" Output Tokens: {usage.get('output_tokens')}")
if "output_tokens_details" in usage:
print(f" Output Tokens Details: {usage.get('output_tokens_details')}")
print(f" Total Tokens: {usage.get('total_tokens')}")
# Detailed reasoning and actions
outputs = result.get("output", [])
for output in outputs:
output_type = output.get("type")
if output_type == "reasoning":
print("\nReasoning:")
for summary in output.get("summary", []):
print(f" {summary.get('text')}")
elif output_type == "computer_call":
action = output.get("action", {})
print("\nComputer Action:")
print(f" Type: {action.get('type')}")
print(f" Position: ({action.get('x')}, {action.get('y')})")
if action.get("text"):
print(f" Text: {action.get('text')}")
```
This detailed information is invaluable for debugging, logging, and understanding the agent's decision-making process in an agentic system. More details can be found in the [OpenAI Agent SDK Specification](https://platform.openai.com/docs/guides/responses-vs-chat-completions).
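As a small example, since the responses are plain dictionaries in this OpenAI-style format, you can append them to a JSONL file for later analysis (the file name here is arbitrary, and `default=str` is a guard in case any value isn't directly JSON-serializable):
```python
# Sketch: persist each response as one JSON line for offline inspection
import json

async for result in agent.run(task):
    with open("agent_trace.jsonl", "a") as f:
        f.write(json.dumps(result, default=str) + "\n")
```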
## Building a Gradio UI
For a visual interface to your agent, the package also includes a Gradio UI:
**How to run the Gradio UI:**
1. Create a file named `launch_ui.py` with the following code:
```python
from agent.ui.gradio.app import create_gradio_ui
# Create and launch the UI
if __name__ == "__main__":
app = create_gradio_ui()
app.launch(share=False) # Set share=False for local access only
```
2. Install the UI dependencies if you haven't already:
```bash
pip install "cua-agent[ui]"
```
3. Run the script:
```bash
python launch_ui.py
```
4. Open your browser to the displayed URL (usually http://127.0.0.1:7860)
**Creating a Shareable Link (Optional):**
You can also create a temporary public URL to access your Gradio UI from anywhere:
```python
# In launch_ui.py
if __name__ == "__main__":
app = create_gradio_ui()
app.launch(share=True) # Creates a public link
```
When you run this, Gradio will display both a local URL and a public URL like:
```
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://abcd1234.gradio.live
```
**Security Note:** Be cautious when sharing your Gradio UI publicly:
- The public URL gives anyone with the link full access to your agent
- Consider using basic authentication for additional protection:
```python
app.launch(share=True, auth=("username", "password"))
```
- Only use this feature for personal or team use, not for production environments
- The temporary link expires when you stop the Gradio application
Once running, the Gradio UI provides:
- Model provider selection
- Agent loop selection
- Task input field
- Real-time display of VM screenshots
- Action history
### Setting API Keys for the UI
To use the UI with different providers, set your API keys as environment variables:
```bash
# For OpenAI models
export OPENAI_API_KEY=your_openai_key_here
# For Anthropic models
export ANTHROPIC_API_KEY=your_anthropic_key_here
# Launch with both keys set
OPENAI_API_KEY=your_key ANTHROPIC_API_KEY=your_key python launch_ui.py
```
### UI Settings Persistence
The Gradio UI automatically saves your configuration to maintain your preferences between sessions:
- Settings like Agent Loop, Model Choice, Custom Base URL, and configuration options are saved to `.gradio_settings.json` in the project's root directory
- These settings are loaded automatically when you restart the UI
- API keys entered in the custom provider field are **not** saved for security reasons
- It's recommended to add `.gradio_settings.json` to your `.gitignore` file
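One quick way to do that:
```bash
# Keep local UI preferences out of version control
echo ".gradio_settings.json" >> .gitignore
```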
## Advanced Example: GitHub Repository Workflow
Let's look at a more complex example that automates a GitHub workflow:
**How to run this advanced example:**
1. Create a file named `github_workflow.py` with the following code:
```python
import asyncio
import logging
from computer import Computer
from agent import ComputerAgent
async def github_workflow():
async with Computer(verbosity=logging.INFO) as macos_computer:
agent = ComputerAgent(
model="openai/computer-use-preview",
save_trajectory=True, # Save screenshots for debugging
only_n_most_recent_images=3, # Only keep last 3 images in context
verbosity=logging.INFO,
tools=[macos_computer]
)
tasks = [
"Look for a repository named trycua/cua on GitHub.",
"Check the open issues, open the most recent one and read it.",
"Clone the repository in users/lume/projects if it doesn't exist yet.",
"Open the repository with Cursor (on the dock, black background and white cube icon).",
"From Cursor, open Composer if not already open.",
"Focus on the Composer text area, then write and submit a task to help resolve the GitHub issue.",
]
for i, task in enumerate(tasks):
print(f"\nExecuting task {i+1}/{len(tasks)}: {task}")
async for result in agent.run(task):
print(f"Action: {result.get('text')}")
print(f"✅ Task {i+1}/{len(tasks)} completed")
if __name__ == "__main__":
asyncio.run(github_workflow())
```
2. Make sure your OpenAI API key is set:
```bash
export OPENAI_API_KEY=your_openai_key_here
```
3. Run the script:
```bash
python github_workflow.py
```
4. Watch as the agent completes the entire workflow:
- The agent will navigate to GitHub
- Find and investigate issues in the repository
- Clone the repository to the local machine
- Open it in Cursor
- Use Cursor's AI features to work on a solution
This example:
1. Searches GitHub for a repository
2. Reads an issue
3. Clones the repository
4. Opens it in an IDE
5. Uses AI to write a solution
## Comparing Implementation Approaches
Let's compare our manual implementation from Part 1 with the framework approach:
### Manual Implementation (Part 1)
- Required writing custom code for the interaction loop
- Needed explicit handling of different action types
- Required direct management of the OpenAI API calls
- Around 50-100 lines of code for basic functionality
- Limited to OpenAI's computer-use model
### Framework Implementation (Part 2)
- Abstracts the interaction loop
- Handles all action types automatically
- Manages API calls internally
- Only 10-15 lines of code for the same functionality
- Works with multiple model providers
- Includes UI capabilities
## Conclusion
The `cua-agent` framework transforms what was a complex implementation task into a simple, high-level interface for building Computer-Use Agents. By abstracting away the technical details, it lets you focus on defining the tasks rather than the machinery.
### When to Use Each Approach
- **Manual Implementation (Part 1)**: When you need complete control over the interaction loop or are implementing a custom solution
- **Framework (Part 2)**: For most applications where you want to quickly build and deploy Computer-Use Agents
### Next Steps
With the basics covered, you might want to explore:
- Customizing the agent's behavior with additional parameters
- Building more complex workflows spanning multiple applications
- Integrating your agent into other applications
- Contributing to the open-source project on GitHub
### Resources
- [cua-agent GitHub repository](https://github.com/trycua/cua/tree/main/libs/agent)
- [Agent Notebook Examples](https://github.com/trycua/cua/blob/main/notebooks/agent_nb.ipynb)
- [OpenAI Agent SDK Specification](https://platform.openai.com/docs/api-reference/responses)
- [Anthropic API Documentation](https://docs.anthropic.com/en/api/getting-started)
- [UI-TARS GitHub](https://github.com/ByteDance/UI-TARS)
- [OmniParser GitHub](https://github.com/microsoft/OmniParser)

74
blog/composite-agents.md Normal file
View File

@@ -0,0 +1,74 @@
# Announcing Cua Agent framework 0.4 and Composite Agents
*Published on August 26, 2025 by Dillon DuPont*
<img src="/composite-agents.png" alt="Composite Agents">
So you want to build an agent that can use a computer. Great! You've probably discovered that there are now dozens of different AI models that claim they can click GUI buttons and fill out forms. Less great: actually getting them to work together is like trying to coordinate a group project where everyone speaks a different language and has invented seventeen different ways to say "click here".
Here's the thing about new GUI models: they're all special snowflakes. One model wants you to feed it images and expects coordinates back as percentages from 0 to 1. Another wants absolute pixel coordinates. A third model has invented its own numeral system with `<|loc095|><|loc821|>` tokens inside tool calls. Some models output Python code that calls `pyautogui.click(x, y)`. Others will start hallucinating coordinates if you forget to format all previous messages within a very specific GUI system prompt.
This is the kind of problem that makes you wonder if we're building the future of computing or just recreating the Tower of Babel with more GPUs.
## What we fixed
Agent framework 0.4 solves this by doing something radical: making all these different models speak the same language.
Instead of writing separate code for each model's peculiarities, you now just pick a model with a string like `"anthropic/claude-3-5-sonnet-20241022"` or `"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"`, and everything else Just Works™. Behind the scenes, we handle all the coordinate normalization, token parsing, and image preprocessing so you don't have to.
```python
# This works the same whether you're using Anthropic, OpenAI, or that new model you found on Hugging Face
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022", # or any other supported model
tools=[computer]
)
```
The output format is consistent across all providers (OpenAI, Anthropic, Vertex, Hugging Face, OpenRouter, etc.). No more writing different parsers for each model's creative interpretation of how to represent a mouse click.
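To make the coordinate problem concrete, here's a purely illustrative sketch of the kind of normalization the framework performs for you internally - this is not Cua's actual API, and the real implementation also handles token parsing and image preprocessing:
```python
# Illustration only: map a normalized (0-1) click position from a model
# onto absolute pixel coordinates for a given screenshot size.
def to_pixels(norm_x: float, norm_y: float, width: int, height: int) -> tuple[int, int]:
    return round(norm_x * width), round(norm_y * height)

# e.g. a model emitting <|loc095|><|loc821|>-style tokens might decode to (0.095, 0.821)
print(to_pixels(0.095, 0.821, 1280, 800))  # -> (122, 657)
```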
## Composite Agents: Two Brains Are Better Than One
Here's where it gets interesting. We realized that you don't actually need one model to be good at everything. Some models are excellent at understanding what's on the screen—they can reliably identify buttons and text fields and figure out where to click. Other models are great at planning and reasoning but might be a bit fuzzy on the exact pixel coordinates.
So we let you combine them with a `+` sign:
```python
agent = ComputerAgent(
# specify the grounding model first, then the planning model
model="huggingface-local/HelloKKMe/GTA1-7B+huggingface-local/OpenGVLab/InternVL3_5-8B",
tools=[computer]
)
```
This creates a composite agent where one model (the "grounding" model) handles the visual understanding and precise UI interactions, while the other (the "planning" model) handles the high-level reasoning and task orchestration. It's like having a pilot and a navigator, except they're both AI models and they're trying to help you star a GitHub repository.
You can even take a model that was never designed for computer use—like GPT-4o—and give it GUI capabilities by pairing it with a specialized vision model:
```python
agent = ComputerAgent(
model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
tools=[computer]
)
```
## Example notebook
For a full, ready-to-run demo (install deps, local computer using Docker, and a composed agent example), see the notebook:
- https://github.com/trycua/cua/blob/models/opencua/notebooks/composite_agents_docker_nb.ipynb
## What's next
We're building integration with HUD evals, allowing us to curate and benchmark model combinations. This will help us identify which composite agent pairs work best for different types of tasks, and provide you with tested recommendations rather than just throwing model names at the wall to see what sticks.
If you try out version 0.4.x, we'd love to hear how it goes. Join us on Discord to share your results and let us know what model combinations work best for your projects.
---
## Links
* **Composite Agent Docs:** [https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents)
* **Discord:** [https://discord.gg/cua-ai](https://discord.gg/cua-ai)
Questions or weird edge cases? Ping us on Discord — we're curious to see what you build.

79
blog/cua-hackathon.md Normal file
View File

@@ -0,0 +1,79 @@
# Computer-Use Agents SOTA Challenge: Hack the North + Global Online
*Published on August 25, 2025 by Francesco Bonacci*
We're bringing something new to [Hack the North](https://hackthenorth.com), Canada's largest hackathon, this year: a head-to-head competition for **Computer-Use Agents** - on-site at Waterloo and a **Global online challenge**. From September 12-14, 2025, teams build on the **Cua Agent Framework** and are scored in **HUD's OSWorld-Verified** environment to push past today's SOTA on [OS-World](https://os-world.github.io).
<img src="/hack-the-north.png">
## Track A: On-site @ Hack the North
There's one global leaderboard: **Cua - Best State-of-the-Art Computer-Use Agent**. Use any model setup you like (cloud or local). After projects are submitted, [HUD](https://www.hud.so) runs the official benchmark; the top team earns a **guaranteed YC partner interview (W26 batch)**. We'll also feature winners on our blog and socials and kit the team out with swag.
## Track B: Cua Global Online Hackathon
**Cua** and [**Ollama**](https://ollama.com) organize a global hackathon to find the **most creative uses of local and hybrid computer-use agents**. There are no geographic restrictions on who can join — this is a worldwide competition focused on **originality, impact, and inventive applications** that showcase what's possible with local and hybrid inference.
**Prizes:**
- 1st: **MacBook Air M4 (or equivalent value)** + features in Cua & Ollama channels
- 2nd: **$500 CAD + swag**
- 3rd: **swag + public feature**
---
## How it works
Two different tracks, two different processes:
### On-site (Track A)
Build during the weekend and submit a repo with a one-line start command. **HUD** executes your command in a clean environment and runs **OSWorld-Verified**. Scores come from official benchmark results; ties break by median, then wall-clock time, then earliest submission. Any model setup is allowed (cloud or local).
**HUD** runs official evaluations immediately after submission. Winners are announced at the **closing ceremony**.
### Rules
- Fork and star the [Cua repo](https://github.com/trycua/cua).
- Add your agent and instructions in `samples/community/hack-the-north/<YOUR_TEAM_NAME>`.
- Include a README with details on the approach and any required notes.
- Submit a PR.
**Deadline: Sept 15, 8:00 AM EDT**
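For reference, a minimal sketch of what a submission might look like - the fork URL and team name are placeholders, not real values:
```bash
# Placeholder fork URL and team name - adjust to your own
git clone https://github.com/<your-username>/cua.git
cd cua
mkdir -p samples/community/hack-the-north/<YOUR_TEAM_NAME>
# add your agent code and a README describing your approach, then:
git add samples/community/hack-the-north/<YOUR_TEAM_NAME>
git commit -m "Add <YOUR_TEAM_NAME> Hack the North submission"
# push and open a PR against trycua/cua
```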
### Global Online (Track B)
Open to anyone, anywhere. Build on your own timeline and submit through the **Cua Discord form** by the deadline.
**Project Requirements:**
- Your agent must integrate **Cua and Ollama** in some way
- Your agent must be **easily runnable by judges**
Judged by **Cua** and **Ollama** teams on:
- **Creativity (30%):** originality, usefulness, surprise factor
- **Technical Depth (30%):** quality of engineering and agent design
- **Use of Ollama (30%):** effective integration of local/hybrid inference
- **Polish (10%):** presentation, clarity, demo readiness
### Submission Process
Submissions will be collected via a **form link provided in the Cua Discord**. Your submission must contain:
- **GitHub repo** containing the agent source code and a clear README with instructions on how to use the agent
- **Explanation** of the models and tools used, and what's local or hybrid about your design
- **Short demo video** (up to two minutes)
A **commit freeze** will be used to ensure that no changes are made after the deadline. Winners will be announced after judging is complete.
**Deadline: Sept 28, 11:59 PM UTC (extended due to popular demand!)**
---
## Join us
Bring a team, pick a model stack, and push what agents can do on real computers. We can't wait to see what you build at **Hack the North 2025**.
**Discord channels**
- Join the Discord first: https://discord.gg/cua-ai
- **#hack-the-north (on-site):** https://discord.com/channels/1328377437301641247/1409508526774157342
- **#global-online (Ollama × Cua):** https://discord.com/channels/1328377437301641247/1409518100491145226
**Contact**
Questions on Hack the North? Email **hackthenorth@trycua.com**.
*P.S. If you're planning ahead, start with the Cua Agent Framework and OSWorld-Verified docs at docs.trycua.com; we'll share office-hour times in both Discord channels.*

93
blog/hud-agent-evals.md Normal file
View File

@@ -0,0 +1,93 @@
# Cua × HUD - Evaluate Any Computer-Use Agent
*Published on August 27, 2025 by Dillon DuPont*
You can now benchmark any GUI-capable agent on real computer-use tasks through our new integration with [HUD](https://hud.so), the evaluation platform for computer-use agents.
If [yesterday's 0.4 release](composite-agents.md) made it easy to compose planning and grounding models, today's update makes it easy to measure them. Configure your model, run evaluations at scale, and watch traces live in HUD.
<img src="/hud-agent-evals.png" alt="Cua × HUD">
## What you get
- One-line evals on OSWorld (and more) for OpenAI, Anthropic, Hugging Face, and composed GUI models.
- Live traces at [app.hud.so](https://app.hud.so) to see every click, type, and screenshot.
- Zero glue code needed - we wrapped the interface for you.
- With Cua's Agent SDK, you can benchmark any configuration of models just by changing the `model` string.
## What is OSWorld?
[OSWorld](https://os-world.github.io) is a comprehensive evaluation benchmark comprising 369 real-world computer-use tasks spanning diverse desktop environments (Chrome, LibreOffice, GIMP, VS Code, etc.) developed by XLang Labs. This benchmark has emerged as the de facto standard for evaluating multimodal agents in realistic computing environments, with adoption by leading AI research teams at OpenAI, Anthropic, and other major institutions for systematic agent assessment. The benchmark was recently enhanced to [OSWorld-Verified](https://xlang.ai/blog/osworld-verified), incorporating rigorous validation improvements that address over 300 community-identified issues to ensure evaluation reliability and reproducibility.
## Environment Setup
First, set up your environment variables:
```bash
export HUD_API_KEY="your_hud_api_key" # Required for HUD access
export ANTHROPIC_API_KEY="your_anthropic_key" # For Claude models
export OPENAI_API_KEY="your_openai_key" # For OpenAI models
```
## Try it
### Quick Start - Single Task
```python
from agent.integrations.hud import run_single_task
await run_single_task(
dataset="hud-evals/OSWorld-Verified-XLang",
model="openai/computer-use-preview+openai/gpt-5-nano", # or any supported model string
task_id=155 # open last tab task (easy)
)
```
### Run a dataset (parallel execution)
```python
from agent.integrations.hud import run_full_dataset
# Test on OSWorld (367 computer-use tasks)
await run_full_dataset(
dataset="hud-evals/OSWorld-Verified-XLang",
model="openai/computer-use-preview+openai/gpt-5-nano", # any supported model string
split="train[:3]" # try a few tasks to start
)
# Or test on SheetBench (50 spreadsheet tasks)
await run_full_dataset(
dataset="hud-evals/SheetBench-V2",
model="anthropic/claude-3-5-sonnet-20241022",
split="train[:2]"
)
```
### Live Environment Streaming
Watch your agent work in real-time. Example output:
```md
Starting full dataset run...
╔═════════════════════════════════════════════════════════════════╗
║ 🚀 See your agent live at: ║
╟─────────────────────────────────────────────────────────────────╢
║ https://app.hud.so/jobs/fe05805d-4da9-4fc6-84b5-5c518528fd3c ║
╚═════════════════════════════════════════════════════════════════╝
```
## Configuration Options
Customize your evaluation with these options (a combined sketch follows the list):
- **Environment types**: `environment="linux"` (OSWorld) or `environment="browser"` (SheetBench)
- **Model composition**: Mix planning and grounding models with `+` (e.g., `"gpt-4+gpt-5-nano"`)
- **Parallel scaling**: Set `max_concurrent_tasks` for throughput
- **Local trajectories**: Save with `trajectory_dir` for offline analysis
- **Live monitoring**: Every run gets a unique trace URL at app.hud.so
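Putting those options together, a run might look like the sketch below - parameter names follow the list above, so check the integration docs for the exact signature in your installed version:
```python
from agent.integrations.hud import run_full_dataset

# Sketch: combine environment, model composition, parallelism, and local trajectories
await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",
    model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",  # grounding+planning
    environment="linux",              # OSWorld tasks run in a Linux desktop
    max_concurrent_tasks=5,           # scale throughput with parallel rollouts
    trajectory_dir="trajectories",    # keep local copies for offline analysis
    split="train[:10]",               # start small before a full run
)
```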
## Learn more
- Notebook with end-to-end examples: https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb
- Docs: https://docs.trycua.com/docs/agent-sdk/integrations/hud
- Live traces: https://app.hud.so

211
blog/human-in-the-loop.md Normal file
View File

@@ -0,0 +1,211 @@
# When Agents Need Human Wisdom - Introducing Human-In-The-Loop Support
*Published on August 29, 2025 by Francesco Bonacci*
Sometimes the best AI agent is a human. Whether you're creating training demonstrations, evaluating complex scenarios, or need to intervene when automation hits a wall, our new Human-In-The-Loop integration puts you directly in control.
With yesterday's [HUD evaluation integration](hud-agent-evals.md), you could benchmark any agent at scale. Today's update lets you *become* the agent when it matters most—seamlessly switching between automated intelligence and human judgment.
<video width="100%" controls>
<source src="/human-in-the-loop.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## What you get
- **One-line human takeover** for any agent configuration with `human/human` or `model+human/human`
- **Interactive web UI** to see what your agent sees and control what it does
- **Zero context switching** - step in exactly where automation left off
- **Training data generation** - create perfect demonstrations by doing tasks yourself
- **Ground truth evaluation** - validate agent performance with human expertise
## Why Human-In-The-Loop?
Even the most sophisticated agents encounter edge cases, ambiguous interfaces, or tasks requiring human judgment. Rather than failing gracefully, they can now fail *intelligently*—by asking for human help.
This approach bridges the gap between fully automated systems and pure manual control, letting you:
- **Demonstrate complex workflows** that agents can learn from
- **Evaluate tricky scenarios** where ground truth requires human assessment
- **Intervene selectively** when automated agents need guidance
- **Test and debug** your tools and environments manually
## Getting Started
Launch the human agent interface:
```bash
python -m agent.human_tool
```
The web UI will show pending completions. Click any completion to take control of the agent and see exactly what it sees.
## Usage Examples
### Direct Human Control
Perfect for creating demonstrations or when you want full manual control:
```python
from agent import ComputerAgent
from agent.computer import computer
agent = ComputerAgent(
"human/human",
tools=[computer]
)
# You'll get full control through the web UI
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### Hybrid: AI Planning + Human Execution
Combine model intelligence with human precision—let AI plan, then execute manually:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+human/human",
tools=[computer]
)
# AI creates the plan, human executes each step
async for _ in agent.run("Navigate to the settings page and enable dark mode"):
pass
```
### Fallback Pattern
Start automated, escalate to human when needed:
```python
# Primary automated agent
primary_agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
# Human fallback agent
fallback_agent = ComputerAgent("human/human", tools=[computer])
try:
    async for result in primary_agent.run(task):
        # "confidence" is illustrative; responses are dicts, so use .get()
        # and substitute whatever confidence signal your setup provides
        if result.get("confidence", 1.0) < 0.7:  # Low confidence threshold
            # Seamlessly hand off to human
            async for _ in fallback_agent.run(f"Continue this task: {task}"):
                pass
            break
except Exception:
    # Agent failed, human takes over
    async for _ in fallback_agent.run(f"Handle this failed task: {task}"):
        pass
```
## Interactive Features
The human-in-the-loop interface provides a rich, responsive experience:
### **Visual Environment**
- **Screenshot display** with live updates as you work
- **Click handlers** for direct interaction with UI elements
- **Zoom and pan** to see details clearly
### **Action Controls**
- **Click actions** - precise cursor positioning and clicking
- **Keyboard input** - type text naturally or send specific key combinations
- **Action history** - see the sequence of actions taken
- **Undo support** - step back when needed
### **Tool Integration**
- **Full OpenAI compatibility** - standard tool call format
- **Custom tools** - integrate your own tools seamlessly
- **Real-time feedback** - see tool responses immediately
### **Smart Polling**
- **Responsive updates** - UI refreshes when new completions arrive
- **Background processing** - continue working while waiting for tasks
- **Session persistence** - resume interrupted sessions
## Real-World Use Cases
### **Training Data Generation**
Create perfect demonstrations for fine-tuning:
```python
# Generate training examples for spreadsheet tasks
demo_agent = ComputerAgent("human/human", tools=[computer])
tasks = [
"Create a budget spreadsheet with income and expense categories",
"Apply conditional formatting to highlight overbudget items",
"Generate a pie chart showing expense distribution"
]
for task in tasks:
# Human demonstrates each task perfectly
async for _ in demo_agent.run(task):
pass # Recorded actions become training data
```
### **Evaluation and Ground Truth**
Validate agent performance on complex scenarios:
```python
# Human evaluates agent performance
evaluator = ComputerAgent("human/human", tools=[computer])
async for _ in evaluator.run("Review this completed form and rate accuracy (1-10)"):
pass # Human provides authoritative quality assessment
```
### **Interactive Debugging**
Step through agent behavior manually:
```python
# Test a workflow step by step
debug_agent = ComputerAgent("human/human", tools=[computer])
async for _ in debug_agent.run("Reproduce the agent's failed login sequence"):
pass # Human identifies exactly where automation breaks
```
### **Edge Case Handling**
Handle scenarios that break automated agents:
```python
# Complex UI interaction requiring human judgment
edge_case_agent = ComputerAgent("human/human", tools=[computer])
async for _ in edge_case_agent.run("Navigate this CAPTCHA-protected form"):
pass # Human handles what automation cannot
```
## Configuration Options
Customize the human agent experience:
- **UI refresh rate**: Adjust polling frequency for your workflow
- **Image quality**: Balance detail vs. performance for screenshots
- **Action logging**: Save detailed traces for analysis and training
- **Session timeout**: Configure idle timeouts for security
- **Tool permissions**: Restrict which tools humans can access
## When to Use Human-In-The-Loop
| **Scenario** | **Why Human Control** |
|--------------|----------------------|
| **Creating training data** | Perfect demonstrations for model fine-tuning |
| **Evaluating complex tasks** | Human judgment for subjective or nuanced assessment |
| **Handling edge cases** | CAPTCHAs, unusual UIs, context-dependent decisions |
| **Debugging workflows** | Step through failures to identify breaking points |
| **High-stakes operations** | Critical tasks requiring human oversight and approval |
| **Testing new environments** | Validate tools and environments work as expected |
## Learn More
- **Interactive examples**: Try human-in-the-loop control with sample tasks
- **Training data pipelines**: Learn how to convert human demonstrations into model training data
- **Evaluation frameworks**: Build human-validated test suites for your agents
- **API documentation**: Full reference for human agent configuration
Ready to put humans back in the loop? The most sophisticated AI system knows when to ask for help.
---
*Questions about human-in-the-loop agents? Join the conversation in our [Discord community](https://discord.gg/cua-ai) or check out our [documentation](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop).*

View File

@@ -0,0 +1,232 @@
# Introducing Cua Cloud Containers: Computer-Use Agents in the Cloud
*Published on May 28, 2025 by Francesco Bonacci*
Welcome to the next chapter in our Computer-Use Agent journey! In [Part 1](./build-your-own-operator-on-macos-1), we showed you how to build your own Operator on macOS. In [Part 2](./build-your-own-operator-on-macos-2), we explored the cua-agent framework. Today, we're excited to introduce **Cua Cloud Containers**, the easiest way to deploy Computer-Use Agents at scale.
<video width="100%" controls>
<source src="/launch-video-cua-cloud.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## What is Cua Cloud?
Think of Cua Cloud as **Docker for Computer-Use Agents**. Instead of managing VMs, installing dependencies, and configuring environments, you can launch pre-configured cloud containers with a single command. Each container comes with a **full desktop environment** accessible via browser (via noVNC), all CUA-related dependencies pre-configured (with a PyAutoGUI-compatible server), and **pay-per-use pricing** that scales with your needs.
## Why Cua Cloud Containers?
Four months ago, we launched [**Lume**](https://github.com/trycua/cua/tree/main/libs/lume) and [**Cua**](https://github.com/trycua/cua) with the goal of bringing sandboxed VMs and Computer-Use Agents to Apple Silicon. The developer community's response was incredible 🎉
Going from prototype to production revealed a problem, though: **local macOS VMs don't scale**, nor are they easily portable.
Our Discord community, YC peers, and early pilot customers kept hitting the same issues. Storage constraints meant **20-40GB per VM** filled laptops fast. Different hardware architectures (Apple Silicon ARM vs Intel x86) prevented portability of local workflows. Every new user lost a day to setup and configuration.
**Cua Cloud** eliminates these constraints while preserving everything developers are familiar with about our Computer and Agent SDK.
### What We Built
Over the past month, we've been iterating over Cua Cloud with partners and beta users to address these challenges. You use the exact same `Computer` and `ComputerAgent` classes you already know, but with **zero local setup** or storage requirements. VNC access comes with **built-in encryption**, you pay only for compute time (not idle resources), and can bring your own API keys for any LLM provider.
The result? **Instant deployment** in seconds instead of hours, with no infrastructure to manage. Scale elastically from **1 to 100 agents** in parallel, with consistent behavior across all deployments. Share agent trajectories with your team for better collaboration and debugging.
## Getting Started
### Step 1: Get Your API Key
Sign up at [**trycua.com**](https://trycua.com) to get your API key.
```bash
# Set your API key in environment variables
export CUA_API_KEY=your_api_key_here
export CUA_CONTAINER_NAME=my-agent-container
```
### Step 2: Launch Your First Container
```python
import asyncio
import logging
import os
from computer import Computer, VMProviderType
from agent import ComputerAgent
async def run_cloud_agent():
# Create a remote Linux computer with Cua Cloud
computer = Computer(
os_type="linux",
api_key=os.getenv("CUA_API_KEY"),
name=os.getenv("CUA_CONTAINER_NAME"),
provider_type=VMProviderType.CLOUD,
)
# Create an agent with your preferred loop
agent = ComputerAgent(
model="openai/gpt-4o",
save_trajectory=True,
verbosity=logging.INFO,
tools=[computer]
)
# Run a task
async for result in agent.run("Open Chrome and search for AI news"):
print(f"Response: {result.get('text')}")
# Run the agent
asyncio.run(run_cloud_agent())
```
### Available Tiers
We're launching with **three compute tiers** to match your workload needs:
- **Small** (1 vCPU, 4GB RAM) - Perfect for simple automation tasks and testing
- **Medium** (2 vCPU, 8GB RAM) - Ideal for most production workloads
- **Large** (8 vCPU, 32GB RAM) - Built for complex, resource-intensive operations
Each tier includes a **full Linux with Xfce desktop environment** with pre-configured browser, **secure VNC access** with SSL, persistent storage during your session, and automatic cleanup on termination.
## How some customers are using Cua Cloud today
### Example 1: Automated GitHub Workflow
Let's automate a complete GitHub workflow:
```python
import asyncio
import logging
import os
from computer import Computer, VMProviderType
from agent import ComputerAgent
async def github_automation():
"""Automate GitHub repository management tasks."""
computer = Computer(
os_type="linux",
api_key=os.getenv("CUA_API_KEY"),
name="github-automation",
provider_type=VMProviderType.CLOUD,
)
agent = ComputerAgent(
model="openai/gpt-4o",
save_trajectory=True,
verbosity=logging.INFO,
tools=[computer]
)
tasks = [
"Look for a repository named trycua/cua on GitHub.",
"Check the open issues, open the most recent one and read it.",
"Clone the repository if it doesn't exist yet.",
"Create a new branch for the issue.",
"Make necessary changes to resolve the issue.",
"Commit the changes with a descriptive message.",
"Create a pull request."
]
for i, task in enumerate(tasks):
print(f"\nExecuting task {i+1}/{len(tasks)}: {task}")
async for result in agent.run(task):
print(f"Response: {result.get('text')}")
# Check if any tools were used
tools = result.get('tools')
if tools:
print(f"Tools used: {tools}")
print(f"Task {i+1} completed")
# Run the automation
asyncio.run(github_automation())
```
### Example 2: Parallel Web Scraping
Run multiple agents in parallel to scrape different websites:
```python
import asyncio
import os
from computer import Computer, VMProviderType
from agent import ComputerAgent
async def scrape_website(site_name, url):
"""Scrape a website using a cloud agent."""
computer = Computer(
os_type="linux",
api_key=os.getenv("CUA_API_KEY"),
name=f"scraper-{site_name}",
provider_type=VMProviderType.CLOUD,
)
agent = ComputerAgent(
model="openai/gpt-4o",
save_trajectory=True,
tools=[computer]
)
results = []
tasks = [
f"Navigate to {url}",
"Extract the main headlines or article titles",
"Take a screenshot of the page",
"Save the extracted data to a file"
]
for task in tasks:
async for result in agent.run(task):
results.append({
'site': site_name,
'task': task,
'response': result.get('text')
})
return results
async def parallel_scraping():
"""Scrape multiple websites in parallel."""
sites = [
("ArXiv", "https://arxiv.org"),
("HackerNews", "https://news.ycombinator.com"),
("TechCrunch", "https://techcrunch.com")
]
# Run all scraping tasks in parallel
tasks = [scrape_website(name, url) for name, url in sites]
results = await asyncio.gather(*tasks)
# Process results
for site_results in results:
print(f"\nResults from {site_results[0]['site']}:")
for result in site_results:
print(f" - {result['task']}: {result['response'][:100]}...")
# Run parallel scraping
asyncio.run(parallel_scraping())
```
## Cost Optimization Tips
To optimize your costs, use appropriate container sizes for your workload and implement timeouts to prevent runaway tasks. Batch related operations together to minimize container spin-up time, and always remember to terminate containers when your work is complete.
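As a sketch of the timeout advice, you can wrap the run loop with standard `asyncio` - the helper name and the 600-second value below are arbitrary, and the agent setup is the same as in the examples above:
```python
import asyncio

async def run_task_with_timeout(agent, task, timeout_s=600):
    """Cancel a runaway task after timeout_s seconds."""
    async def _run():
        async for result in agent.run(task):
            print(f"Response: {result.get('text')}")
    try:
        await asyncio.wait_for(_run(), timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"Task exceeded {timeout_s}s and was cancelled")
```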
## Security Considerations
Cua Cloud runs all containers in isolated environments with encrypted VNC connections. Your API keys are never exposed in trajectories.
## What's Next for Cua Cloud
We're just getting started! Here's what's coming in the next few months:
### Elastic Autoscaled Container Pools
Soon you'll be able to create elastic container pools that automatically scale based on demand. Define minimum and maximum container counts, and let Cua Cloud handle the rest. Perfect for batch processing, scheduled automations, and handling traffic spikes without manual intervention.
### Windows and macOS Cloud Support
While we're launching with Linux containers, Windows and macOS cloud machines are coming soon. Run Windows-specific automations, test cross-platform workflows, or leverage macOS-exclusive applications, all in the cloud with the same simple API.
Stay tuned for updates and join our [**Discord**](https://discord.gg/cua-ai) to vote on which features you'd like to see first!
## Get Started Today
Ready to deploy your Computer-Use Agents in the cloud?
Visit [**trycua.com**](https://trycua.com) to sign up and get your API key. Join our [**Discord community**](https://discord.gg/cua-ai) for support and explore more examples on [**GitHub**](https://github.com/trycua/cua).
Happy RPA 2.0! 🚀

View File

@@ -0,0 +1,176 @@
# From Lume to Containerization: Our Journey Meets Apple's Vision
*Published on June 10, 2025 by Francesco Bonacci*
Yesterday, Apple announced their new [Containerization framework](https://github.com/apple/containerization) at WWDC. Since then, our Discord and X users have been asking what this means for Cua's virtualization capabilities on Apple Silicon. We've been working in this space for months - from [Lume](https://github.com/trycua/cua/tree/main/libs/lume) to [Lumier](https://github.com/trycua/cua/tree/main/libs/lumier) to [Cua Cloud Containers](./introducing-cua-cloud-containers). Here's our take on Apple's announcement.
## Our Story
When we started Cua, we wanted to solve a simple problem: make it easy to run VMs on Apple Silicon, with a focus on testing and deploying computer-use agents without dealing with complicated setups.
We decided to build on Apple's Virtualization framework because it was fast and well-designed. This became Lume, which we launched on [Hacker News](https://news.ycombinator.com/item?id=42908061).
Four months later, we're happy with our choice. Users are running VMs with great performance and low memory usage. Now Apple's new [Containerization](https://github.com/apple/containerization) framework builds on the same foundation - showing we were on the right track.
## What Apple Announced
Apple's Containerization framework changes how containers work on macOS. Here's what's different:
### How It Works
Instead of running all containers in one shared VM (like Docker or Colima), Apple runs each container in its own tiny VM:
```bash
How Docker Works:
┌─────────────────────────────────┐
│ Your Mac │
├─────────────────────────────────┤
│ One Big Linux VM │
├─────────────────────────────────┤
│ Container 1 │ Container 2 │ ... │
└─────────────────────────────────┘
How Apple's Framework Works:
┌─────────────────────────────────┐
│ Your Mac │
├─────────────────────────────────┤
│ Mini VM 1 │ Mini VM 2 │ Mini VM 3│
│Container 1│Container 2│Container 3│
└─────────────────────────────────┘
```
Why is this better?
- **Better security**: Each container is completely separate
- **Better performance**: Each container gets its own resources
- **Real isolation**: If one container has problems, others aren't affected
> **Note**: You'll need macOS Tahoe 26 Preview or later to use all features. The new [VZVMNetNetworkDeviceAttachment](https://developer.apple.com/documentation/virtualization/vzvmnetnetworkdeviceattachment) API required to fully implement the above architecture is only available there.
### The Technical Details
Here's what makes it work:
- **vminitd**: A tiny program that starts up each container VM super fast
- **Fast boot**: These mini VMs start in less than a second
- **Simple storage**: Containers are stored as ready-to-use disk images
Instead of using big, slow startup systems, Apple created something minimal. Each container VM boots with just what it needs - nothing more.
The `vminitd` part is really clever. It's the first thing that runs in each mini VM and lets the container talk to the outside world. It handles everything the container needs to work properly.
### What About GPU Passthrough?
Some developers found hints in macOS Tahoe that GPU support might be coming, through a symbol called `_VZPCIDeviceConfiguration` in the new version of the Virtualization framework. This could mean we'll be able to use GPUs inside containers and VMs soon. Imagine running local models using Ollama or LM Studio! We're not far from having fully local and isolated computer-use agents.
## What We've Built on top of Apple's Virtualization Framework
While Apple's new framework focuses on containers, we've been building VM management tools on top of the same Apple Virtualization framework. Here's what we've released:
### Lume: Simple VM Management
[Lume](https://github.com/trycua/cua/tree/main/libs/lume) is our command-line tool for creating and managing VMs on Apple Silicon. We built it because setting up VMs on macOS was too complicated.
What Lume does:
- **Direct control**: Works directly with Apple's Virtualization framework
- **Ready-to-use images**: Start a macOS or Linux VM with one command
- **API server**: Control VMs from other programs (runs on port 7777)
- **Smart storage**: Uses disk space efficiently
- **Easy install**: One command to get started
- **Share images**: Push your VM images to registries like Docker images
```bash
# Install Lume
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
# Start a macOS VM
lume run macos-sequoia-vanilla:latest
```
### Lumier: Docker-Style VM Management
[Lumier](https://github.com/trycua/lumier) works differently. It lets you use Docker commands to manage VMs. But here's the key: **Docker is just for packaging, not for isolation**.
What makes Lumier useful:
- **Familiar commands**: If you know Docker, you know Lumier
- **Web access**: Connect to your VM through a browser
- **Save your work**: VMs remember their state
- **Share files**: Easy to move files between your Mac and the VM
- **Automation**: Script your VM setup
```bash
# Run a macOS VM with Lumier
docker run -it --rm \
--name macos-vm \
-p 8006:8006 \
-e VM_NAME=macos-vm \
-e VERSION=ghcr.io/trycua/macos-sequoia-cua:latest \
trycua/lumier:latest
```
## Comparing the Options
Let's see how these three approaches stack up:
### How They're Built
```bash
Apple Containerization:
Your App → Container → Mini VM → Mac Hardware
Lume:
Your App → Full VM → Mac Hardware
Lumier:
Docker → Lume → Full VM → Mac Hardware
```
### When to Use What
**Apple's Containerization**
- ✅ Perfect for: Running containers with maximum security
- ✅ Starts in under a second
- ✅ Uses less memory and CPU
- ❌ Needs macOS Tahoe 26 Preview
- ❌ Only for containers, not full VMs
**Lume**
- ✅ Perfect for: Development and testing
- ✅ Full control over macOS/Linux VMs
- ✅ Works on current macOS versions
- ✅ Direct access to everything
- ❌ Uses more resources than containers
**Lumier**
- ✅ Perfect for: Teams already using Docker
- ✅ Easy to share and deploy
- ✅ Access through your browser
- ✅ Great for automated workflows
- ❌ Adds an extra layer of complexity
### Using Them Together
Here's the cool part - you can combine these tools:
1. **Create a VM**: Use Lume to set up a macOS VM
2. **Run containers**: Use Apple's framework inside that VM (works on M3+ Macs with nested virtualization)
You get the best of both worlds: full VM control plus secure containers.
## What's Next for Cua?
Apple's announcement confirms we're on the right path. Here's what we're looking forward to:
1. **Faster VMs**: Learning from Apple's super-fast container startup and exploring whether those techniques can be applied to macOS VMs
2. **GPU support**: Getting ready for GPU passthrough when `_VZPCIDeviceConfiguration` is made available, realistically in a stable release of macOS Tahoe 26
## Learn More
- [Apple Containerization Framework](https://github.com/apple/containerization)
- [Lume - Direct VM Management](https://github.com/trycua/cua/tree/main/libs/lume)
- [Lumier - Docker Interface for VMs](https://github.com/trycua/cua/tree/main/libs/lumier)
- [Cua Cloud Containers](https://trycua.com)
- [Join our Discord](https://discord.gg/cua-ai)
---
*Questions about virtualization on Apple Silicon? Come chat with us on Discord!*

View File

@@ -0,0 +1,372 @@
# Sandboxed Python Execution: Run Code Safely in Cua Containers
*Published on June 23, 2025 by Dillon DuPont*
We touched on Cua's computer-use capabilities in [Building your own Operator on macOS - Part 2](build-your-own-operator-on-macos-2.md): your AI agents can click, scroll, type, and interact with any desktop application. But what if your agent needs to do more than just UI automation? What if it needs to process data, make API calls, analyze images, or run complex logic alongside those UI interactions, within the same virtual environment?
That's where Cua's `@sandboxed` decorator comes in. While Cua handles the clicking and typing, sandboxed execution lets you run full Python code inside the same virtual environment. It's like giving your AI agents a programming brain to complement their clicking fingers.
Think of it as the perfect marriage: Cua handles the "what you see" (UI interactions), while sandboxed Python handles the "what you compute" (data processing, logic, API calls), all happening in the same isolated environment.
## So, what exactly is sandboxed execution?
Cua excels at automating user interfaces: clicking buttons, filling forms, navigating applications. But modern AI agents need to do more than just UI automation. They need to process the data they collect, make intelligent decisions, call external APIs, and run sophisticated algorithms.
Sandboxed execution bridges this gap. You write a Python function, decorate it with `@sandboxed`, and it runs inside your Cua container alongside your UI automation. Your agent can now click a button, extract some data, process it with Python, and then use those results to decide what to click next.
Here's what makes this combination powerful for AI agent development:
- **Unified environment**: Your UI automation and code execution happen in the same container
- **Rich capabilities**: Combine Cua's clicking with Python's data processing, API calls, and libraries
- **Seamless integration**: Pass data between UI interactions and Python functions effortlessly
- **Cross-platform consistency**: Your Python code runs the same way across different Cua environments
- **Complete workflows**: Build agents that can both interact with apps AND process the data they collect
## The architecture behind @sandboxed
Let's jump right into an example that'll make this crystal clear:
```python
from computer.helpers import sandboxed
@sandboxed("demo_venv")
def greet_and_print(name):
"""This function runs inside the container"""
import PyXA # macOS-specific library
safari = PyXA.Application("Safari")
html = safari.current_document.source()
print(f"Hello from inside the container, {name}!")
return {"greeted": name, "safari_html": html}
# When called, this executes in the container
result = await greet_and_print("Cua")
```
What's happening here? When you call `greet_and_print()`, Cua extracts the function's source code, transmits it to the container, and executes it there. The result returns to you seamlessly, while the actual execution remains completely isolated.
## How does sandboxed execution work?
Cua's sandboxed execution system employs several key architectural components:
### 1. Source Code Extraction
Cua uses Python's `inspect.getsource()` to extract your function's source code and reconstruct the function definition in the remote environment.
### 2. Virtual Environment Isolation
Each sandboxed function runs in a named virtual environment within the container. This provides complete dependency isolation between different functions and their respective environments.
### 3. Data Serialization and Transport
Arguments and return values are serialized as JSON and transported between the host and container. This ensures compatibility across different Python versions and execution environments.
### 4. Comprehensive Error Handling
The system captures both successful results and exceptions, preserving stack traces and error information for debugging purposes.
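To make the flow concrete, here's a tiny illustration of the host-side extraction and serialization steps using only the standard library - this is not Cua's actual implementation, just the shape of the idea:
```python
# Illustration: extract a function's source and JSON-encode its arguments.
# Cua then ships these to the container, runs the function in the named
# virtual environment, and returns a JSON-encoded result (or error).
import inspect
import json

def example(x):
    return {"doubled": x * 2}

source = inspect.getsource(example)
payload = json.dumps({"args": [21], "kwargs": {}})

print(source)
print(payload)  # {"args": [21], "kwargs": {}}
```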
## Getting your sandbox ready
Setting up sandboxed execution is simple:
```python
import asyncio
from computer.computer import Computer
from computer.helpers import sandboxed, set_default_computer
async def main():
# Fire up the computer
computer = Computer()
await computer.run()
# Make it the default for all sandboxed functions
set_default_computer(computer)
# Install some packages in a virtual environment
await computer.venv_install("demo_venv", ["requests", "beautifulsoup4"])
```
If you want to get fancy, you can specify which computer instance to use:
```python
@sandboxed("my_venv", computer=my_specific_computer)
def my_function():
# This runs on your specified computer instance
pass
```
## Real-world examples that actually work
### Browser automation without the headaches
Ever tried to automate a browser and had it crash your entire system? Yeah, us too. Here's how to do it safely:
```python
@sandboxed("browser_env")
def automate_browser_with_playwright():
"""Automate browser interactions using Playwright"""
from playwright.sync_api import sync_playwright
import time
import base64
from datetime import datetime
try:
with sync_playwright() as p:
# Launch browser (visible, because why not?)
browser = p.chromium.launch(
headless=False,
args=['--no-sandbox', '--disable-dev-shm-usage']
)
page = browser.new_page()
page.set_viewport_size({"width": 1280, "height": 720})
actions = []
screenshots = {}
# Let's visit example.com and poke around
page.goto("https://example.com")
actions.append("Navigated to example.com")
# Grab a screenshot because screenshots are cool
screenshot_bytes = page.screenshot(full_page=True)
screenshots["initial"] = base64.b64encode(screenshot_bytes).decode()
# Get some basic info
title = page.title()
actions.append(f"Page title: {title}")
# Find links and headings
try:
links = page.locator("a").all()
link_texts = [link.text_content() for link in links[:5]]
actions.append(f"Found {len(links)} links: {link_texts}")
headings = page.locator("h1, h2, h3").all()
heading_texts = [h.text_content() for h in headings[:3]]
actions.append(f"Found headings: {heading_texts}")
except Exception as e:
actions.append(f"Element interaction error: {str(e)}")
# Let's try a form for good measure
try:
page.goto("https://httpbin.org/forms/post")
actions.append("Navigated to form page")
# Fill out the form
page.fill('input[name="custname"]', "Test User from Sandboxed Environment")
page.fill('input[name="custtel"]', "555-0123")
page.fill('input[name="custemail"]', "test@example.com")
page.select_option('select[name="size"]', "large")
actions.append("Filled out form fields")
# Submit and see what happens
page.click('input[type="submit"]')
page.wait_for_load_state("networkidle")
actions.append("Submitted form")
except Exception as e:
actions.append(f"Form interaction error: {str(e)}")
browser.close()
return {
"actions_performed": actions,
"screenshots": screenshots,
"success": True
}
except Exception as e:
return {"error": f"Browser automation failed: {str(e)}"}
# Install Playwright and its browsers
await computer.venv_install("browser_env", ["playwright"])
await computer.venv_cmd("browser_env", "playwright install chromium")
# Run the automation
result = await automate_browser_with_playwright()
print(f"Performed {len(result.get('actions_performed', []))} actions")
```
### Building code analysis agents
Want to build agents that can analyze code safely? Here's a security audit tool that won't accidentally `eval()` your system into oblivion:
```python
@sandboxed("analysis_env")
def security_audit_tool(code_snippet):
"""Analyze code for potential security issues"""
import ast
import re
issues = []
# Check for the usual suspects
dangerous_patterns = [
(r'eval\s*\(', "Use of eval() function"),
(r'exec\s*\(', "Use of exec() function"),
(r'__import__\s*\(', "Dynamic import usage"),
(r'subprocess\.', "Subprocess usage"),
(r'os\.system\s*\(', "OS system call"),
]
for pattern, description in dangerous_patterns:
if re.search(pattern, code_snippet):
issues.append(description)
# Get fancy with AST analysis
try:
tree = ast.parse(code_snippet)
for node in ast.walk(tree):
if isinstance(node, ast.Call):
if hasattr(node.func, 'id'):
if node.func.id in ['eval', 'exec', 'compile']:
issues.append(f"Dangerous function call: {node.func.id}")
except SyntaxError:
issues.append("Syntax error in code")
return {
"security_issues": issues,
"risk_level": "HIGH" if len(issues) > 2 else "MEDIUM" if issues else "LOW"
}
# Test it on some sketchy code
audit_result = await security_audit_tool("eval(user_input)")
print(f"Security audit: {audit_result}")
```
### Desktop automation in the cloud
Here's where things get really interesting. Cua cloud containers come with full desktop environments, so you can automate GUIs:
```python
@sandboxed("desktop_env")
def take_screenshot_and_analyze():
"""Take a screenshot and analyze the desktop"""
import io
import base64
from PIL import ImageGrab
from datetime import datetime
try:
# Grab the screen
screenshot = ImageGrab.grab()
# Convert to base64 for easy transport
buffer = io.BytesIO()
screenshot.save(buffer, format='PNG')
screenshot_data = base64.b64encode(buffer.getvalue()).decode()
# Get some basic info
screen_info = {
"size": screenshot.size,
"mode": screenshot.mode,
"timestamp": datetime.now().isoformat()
}
# Analyze the colors (because why not?)
colors = screenshot.getcolors(maxcolors=256*256*256)
dominant_color = max(colors, key=lambda x: x[0])[1] if colors else None
return {
"screenshot_base64": screenshot_data,
"screen_info": screen_info,
"dominant_color": dominant_color,
"unique_colors": len(colors) if colors else 0
}
except Exception as e:
return {"error": f"Screenshot failed: {str(e)}"}
# Install the dependencies
await computer.venv_install("desktop_env", ["Pillow"])
# Take and analyze a screenshot
result = await take_screenshot_and_analyze()
print("Desktop analysis complete!")
```
## Pro tips for sandboxed success
### Keep it self-contained
Always put your imports inside the function. Trust us on this one:
```python
@sandboxed("good_env")
def good_function():
import os # Import inside the function
import json
# Your code here
return {"result": "success"}
```
### Install dependencies first
Don't forget to install packages before using them:
```python
# Install first
await computer.venv_install("my_env", ["pandas", "numpy", "matplotlib"])
@sandboxed("my_env")
def data_analysis():
import pandas as pd
import numpy as np
# Now you can use them
```
### Use descriptive environment names
Future you will thank you:
```python
@sandboxed("data_processing_env")
def process_data(): pass
@sandboxed("web_scraping_env")
def scrape_site(): pass
@sandboxed("ml_training_env")
def train_model(): pass
```
### Always handle errors gracefully
Things break. Plan for it:
```python
@sandboxed("robust_env")
def robust_function(data):
try:
result = process_data(data)
return {"success": True, "result": result}
except Exception as e:
return {"success": False, "error": str(e)}
```
## What about performance?
Let's be honest: there's some overhead here. Code needs to be serialized, sent over the network, and executed remotely. But for most use cases, the benefits far outweigh the costs.
If you're building something performance-critical, consider:
- Batching multiple operations into a single sandboxed function (see the sketch below)
- Minimizing data transfer between host and container
- Using persistent virtual environments
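For example, here's a minimal sketch of the batching idea: rather than paying the serialization round trip once per page, a single sandboxed function does all the work remotely and sends back one small result. The environment name and URLs are just illustrative:
```python
@sandboxed("batch_env")
def batched_fetch_report(urls):
    """Fetch several pages and summarize them in one remote call."""
    import urllib.request

    report = []
    for url in urls:
        # All fetching happens inside the container; only the small
        # summary below travels back over the network.
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
        report.append({"url": url, "bytes": len(body)})
    return {"pages": report, "count": len(report)}

# One round trip for the whole batch instead of one per page
summary = await batched_fetch_report(["https://example.com", "https://httpbin.org/html"])
print(summary)
```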
## The security angle
This is where sandboxed execution really shines:
1. **Complete process isolation**: code runs in a separate container
2. **File system protection**: limited access to your host files
3. **Network isolation**: controlled network access
4. **Clean environments**: no package conflicts or pollution
5. **Resource limits**: container-level constraints keep things in check
## Ready to get started?
The `@sandboxed` decorator is one of those features that sounds simple but opens up a world of possibilities. Whether you're testing sketchy code, building AI agents, or just want to keep your development environment pristine, it's got you covered.
Give it a try in your next Cua project and see how liberating it feels to run code without fear!
Happy coding (safely)!
---
*Want to dive deeper? Check out our [sandboxed functions examples](https://github.com/trycua/cua/blob/main/examples/sandboxed_functions_examples.py) and [virtual environment tests](https://github.com/trycua/cua/blob/main/tests/venv.py) on GitHub. Questions? Come chat with us on Discord!*
@@ -0,0 +1,302 @@
# Training Computer-Use Models: Creating Human Trajectories with Cua
*Published on May 1, 2025 by Dillon DuPont*
In our previous posts, we covered [building your own Computer-Use Operator](build-your-own-operator-on-macos-1) and [using the Agent framework](build-your-own-operator-on-macos-2) to simplify development. Today, we'll focus on a critical aspect of improving computer-use agents and models: gathering high-quality demonstration data using Cua's Computer-Use Interface (CUI) and its Gradio UI to create and share human-generated trajectories.
Why is this important? The models underlying computer-use agents need examples of how humans interact with computers in order to learn effectively. By creating a dataset of diverse, well-executed tasks, we can help train better models that understand how to navigate user interfaces and accomplish real tasks.
<video src="https://github.com/user-attachments/assets/c586d460-3877-4b5f-a736-3248886d2134" controls width="600"></video>
## What You'll Learn
By the end of this tutorial, you'll be able to:
- Set up the Computer-Use Interface (CUI) with Gradio UI support
- Record your own computer interaction trajectories
- Organize and tag your demonstrations
- Upload your datasets to Hugging Face for community sharing
- Contribute to improving computer-use AI for everyone
**Prerequisites:**
- macOS Sonoma (14.0) or later
- Python 3.10+
- Basic familiarity with Python and terminal commands
- A Hugging Face account (for uploading datasets)
**Estimated Time:** 20-30 minutes
## Understanding Human Trajectories
### What are Human Trajectories?
Human trajectories, in the context of computer-use AI agents, are recordings of how humans interact with computer interfaces to complete tasks. These interactions include:
- Mouse movements, clicks, and scrolls
- Keyboard input
- Changes in the UI state
- Time spent on different elements
These trajectories serve as examples for AI models to learn from, helping them understand the relationship between:
1. The visual state of the screen
2. The user's goal or task
3. The most appropriate action to take
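As a rough mental model (not the exact on-disk format; the real dataset schema appears later in this post), you can picture each step of a trajectory as a record tying these three pieces together:
```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    screenshot_png: bytes  # the visual state of the screen at this step
    goal: str              # the user's task, e.g. "Create a calendar event"
    action: dict           # the action taken in response

step = TrajectoryStep(
    screenshot_png=b"...",  # raw PNG bytes captured before the action
    goal="Create a calendar event for Friday at 3pm",
    action={"type": "click", "x": 412, "y": 230},
)
```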
### Why Human Demonstrations Matter
Unlike synthetic data or rule-based automation, human demonstrations capture the nuanced decision-making that happens during computer interaction:
- **Natural Pacing**: Humans pause to think, accelerate through familiar patterns, and adjust to unexpected UI changes
- **Error Recovery**: Humans demonstrate how to recover from mistakes or handle unexpected states
- **Context-Sensitive Actions**: The same UI element might be used differently depending on the task context
By contributing high-quality demonstrations, you're helping to create more capable, human-like computer-use AI systems.
## Setting Up Your Environment
### Installing the CUI with Gradio Support
The Computer-Use Interface includes an optional Gradio UI specifically designed to make recording and sharing demonstrations easy. Let's set it up:
1. **Create a Python environment** (optional but recommended):
```bash
# Using conda
conda create -n cua-trajectories python=3.10
conda activate cua-trajectories
# Using venv
python -m venv cua-trajectories
source cua-trajectories/bin/activate # On macOS/Linux
```
2. **Install the CUI package with UI support**:
```bash
pip install "cua-computer[ui]"
```
3. **Set up your Hugging Face access token**:
Create a `.env` file in your project directory and add your Hugging Face token:
```bash
echo "HF_TOKEN=your_huggingface_token" > .env
```
You can get your token from your [Hugging Face account settings](https://huggingface.co/settings/tokens).
### Understanding the Gradio UI
The Computer-Use Interface Gradio UI provides three main components:
1. **Recording Panel**: Captures your screen, mouse, and keyboard activity during demonstrations
2. **Review Panel**: Allows you to review, tag, and organize your demonstration recordings
3. **Upload Panel**: Lets you share your demonstrations with the community via Hugging Face
The UI is designed to make the entire process seamless, from recording to sharing, without requiring deep technical knowledge of the underlying systems.
## Creating Your First Trajectory Dataset
### Launching the UI
To get started, create a simple Python script to launch the Gradio UI:
```python
# launch_trajectory_ui.py
from computer.ui.gradio.app import create_gradio_ui
from dotenv import load_dotenv
# Load your Hugging Face token from .env
load_dotenv('.env')
# Create and launch the UI
app = create_gradio_ui()
app.launch(share=False)
```
Run this script to start the UI:
```bash
python launch_trajectory_ui.py
```
### Recording a Demonstration
Let's walk through the process of recording your first demonstration:
1. **Start the VM**: Click the "Initialize Computer" button in the UI to initialize a fresh macOS sandbox. This ensures your demonstrations are clean and reproducible.
2. **Perform a Task**: Complete a simple task like creating a document, organizing files, or searching for information. Natural, everyday tasks make the best demonstrations.
3. **Review Recording**: Click the "Conversation Logs" or "Function Logs" tabs to review your captured interactions, making sure there is no personal information that you wouldn't want to share.
4. **Add Metadata**: In the "Save/Share Demonstrations" tab, give your recording a descriptive name (e.g., "Creating a Calendar Event") and add relevant tags (e.g., "productivity", "time-management").
5. **Save Your Demonstration**: Click "Save" to store your recording locally.
<video src="https://github.com/user-attachments/assets/de3c3477-62fe-413c-998d-4063e48de176" controls width="600"></video>
### Key Tips for Quality Demonstrations
To create the most valuable demonstrations:
- **Start and end at logical points**: Begin with a clear starting state and end when the task is visibly complete
- **Narrate your thought process**: Use the message input to describe what you're trying to do and why
- **Move at a natural pace**: Don't rush or perform actions artificially slowly
- **Include error recovery**: If you make a mistake, keep going and show how to correct it
- **Demonstrate variations**: Record multiple ways to complete the same task
## Organizing and Tagging Demonstrations
Effective tagging and organization make your demonstrations more valuable to researchers and model developers. Consider these tagging strategies:
### Task-Based Tags
Describe what the demonstration accomplishes:
- `web-browsing`
- `document-editing`
- `file-management`
- `email`
- `scheduling`
### Application Tags
Identify the applications used:
- `finder`
- `safari`
- `notes`
- `terminal`
- `calendar`
### Complexity Tags
Indicate the difficulty level:
- `beginner`
- `intermediate`
- `advanced`
- `multi-application`
### UI Element Tags
Highlight specific UI interactions:
- `drag-and-drop`
- `menu-navigation`
- `form-filling`
- `search`
The Computer-Use Interface UI allows you to apply and manage these tags across all your saved demonstrations, making it easy to create cohesive, well-organized datasets.
<video src="https://github.com/user-attachments/assets/5ad1df37-026a-457f-8b49-922ae805faef" controls width="600"></video>
## Uploading to Hugging Face
Sharing your demonstrations helps advance research in computer-use AI. The Gradio UI makes uploading to Hugging Face simple:
### Preparing for Upload
1. **Review Your Demonstrations**: Use the review panel to ensure all demonstrations are complete and correctly tagged.
2. **Select Demonstrations to Upload**: You can upload all demonstrations or filter by specific tags.
3. **Configure Dataset Information**:
- **Repository Name**: Format as `{your_username}/{dataset_name}`, e.g., `johndoe/productivity-tasks`
- **Visibility**: Choose `public` to contribute to the community or `private` for personal use
- **License**: Standard licenses like CC-BY or MIT are recommended for public datasets
### The Upload Process
1. **Click "Upload to Hugging Face"**: This initiates the upload preparation.
2. **Review Dataset Summary**: Confirm the number of demonstrations and total size.
3. **Confirm Upload**: The UI will show progress as files are transferred.
4. **Receive Confirmation**: Once complete, you'll see a link to your new dataset on Hugging Face.
<video src="https://github.com/user-attachments/assets/c586d460-3877-4b5f-a736-3248886d2134" controls width="600"></video>
Your uploaded dataset will have a standardized format with the following structure:
```json
{
"timestamp": "2025-05-01T09:20:40.594878",
"session_id": "1fe9f0fe-9331-4078-aacd-ec7ffb483b86",
"name": "penguin lemon forest",
"tool_calls": [...], // Detailed interaction records
"messages": [...], // User/assistant messages
"tags": ["highquality", "tasks"],
"images": [...] // Screenshots of each state
}
```
This structured format makes it easy for researchers to analyze patterns across different demonstrations and build better computer-use models.
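Under the hood, the Gradio UI drives the same Computer interface that you can also script directly. For reference, here is what controlling that interface from Python looks like: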
```python
from computer import Computer
computer = Computer(os_type="macos", display="1024x768", memory="8GB", cpu="4")
try:
await computer.run()
screenshot = await computer.interface.screenshot()
with open("screenshot.png", "wb") as f:
f.write(screenshot)
await computer.interface.move_cursor(100, 100)
await computer.interface.left_click()
await computer.interface.right_click(300, 300)
await computer.interface.double_click(400, 400)
await computer.interface.type("Hello, World!")
await computer.interface.press_key("enter")
await computer.interface.set_clipboard("Test clipboard")
content = await computer.interface.copy_to_clipboard()
print(f"Clipboard content: {content}")
finally:
await computer.stop()
```
## Example: Shopping List Demonstration
Let's walk through a concrete example of creating a valuable demonstration:
### Task: Adding Shopping List Items to a Doordash Cart
1. **Start Recording**: Begin with a clean desktop and a text file containing a shopping list.
2. **Task Execution**: Open the file, read the list, open Safari, navigate to Doordash, and add each item to the cart.
3. **Narration**: Add messages like "Reading the shopping list" and "Searching for rice on Doordash" to provide context.
4. **Completion**: Verify all items are in the cart and end the recording.
5. **Tagging**: Add tags like `shopping`, `web-browsing`, `task-completion`, and `multi-step`.
This type of demonstration is particularly valuable because it showcases real-world task completion requiring multiple applications and context switching.
### Exploring Community Datasets
You can also learn from existing trajectory datasets contributed by the community:
1. Visit [Hugging Face Datasets tagged with 'cua'](https://huggingface.co/datasets?other=cua)
2. Explore different approaches to similar tasks
3. Download and analyze high-quality demonstrations (see the sketch below)
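As a minimal sketch using the Hugging Face `datasets` library (the repository name is the example dataset from the Resources section below; the field names follow the JSON structure shown above, so adjust them to the dataset you actually load):
```python
from datasets import load_dataset

# Pull a community trajectory dataset from Hugging Face
ds = load_dataset("ddupont/test-dataset", split="train")

# Peek at one demonstration's metadata
example = ds[0]
print(example["name"], example["tags"])
print(f"{len(example['tool_calls'])} tool calls recorded")
```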
## Conclusion
### Summary
In this guide, we've covered how to:
- Set up the Computer-Use Interface with Gradio UI
- Record high-quality human demonstrations
- Organize and tag your trajectories
- Share your datasets with the community
By contributing your own demonstrations, you're helping to build more capable, human-like AI systems that can understand and execute complex computer tasks.
### Next Steps
Now that you know how to create and share trajectories, consider these advanced techniques:
- Create themed collections around specific productivity workflows
- Collaborate with others to build comprehensive datasets
- Use your datasets to fine-tune your own computer-use models
### Resources
- [Computer-Use Interface GitHub](https://github.com/trycua/cua/tree/main/libs/computer)
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
- [Example Dataset: ddupont/test-dataset](https://huggingface.co/datasets/ddupont/test-dataset)
89
blog/trajectory-viewer.md Normal file
@@ -0,0 +1,89 @@
# Trajectory Viewer for Cua
*Published on May 13, 2025 by Dillon DuPont*
Don't forget to check out [Part 1: Building your own Computer-Use Operator](build-your-own-operator-on-macos-1) and [Part 2: Using the Agent framework](build-your-own-operator-on-macos-2) for setting up your Cua environment and basic tips and tricks!
## Introduction
Okay, so you've gotten your environment up and tested a few agent runs. You'll likely have encountered cases where your agent succeeded at some tasks, but also places where it got stuck or outright failed.
Now what?
If you've ever wondered exactly what your computer agent is doing and why it sometimes doesn't do what you expected, then the Trajectory Viewer for Cua is here to help! Whether you're a seasoned developer or someone who just wants to dive in and see results, this tool makes it easy to explore every step your agent takes on your screen.
Plus, if you want to start thinking about generating data to train your own agentic model (we'll cover training in an upcoming blog, so look forward to it), then our Trajectory Viewer might be for you.
## So, what's a “trajectory”?
Think of a trajectory as a detailed video recording of your agent's journey:
- **Observations**: What did the agent see (the exact screen content) at each point in time?
- **Actions**: What clicks, keystrokes, or commands did it perform in response?
- **Decisions**: Which options did it choose, and why?
Especially for longer and more complex tasks, your agent will take multiple steps, perform multiple actions, and make multiple observations. By examining this record, you can pinpoint where things go right, and more importantly, where they go wrong.
## So, what's Cua's Trajectory Viewer and why use it?
The Trajectory Viewer for Cua is a GUI tool that helps you explore saved trajectories generated from your Cua computer agent runs. This tool provides a powerful way to:
- **Debug your agents**: See exactly what your agent saw to reproduce bugs
- **Analyze failure cases**: Identify the moment when your agent went off-script
- **Collect training data**: Export your trajectories for your own processing, training, and more!
The viewer allows you to see exactly what your agent observed and how it interacted with the computer, all from within your browser.
## Opening Trajectory Viewer in 3 Simple Steps
1. **Visit**: Open your browser and go to [https://www.trycua.com/trajectory-viewer](https://www.trycua.com/trajectory-viewer).
2. **Upload**: Drag and drop a trajectories folder or click Select Folder.
3. **Explore**: View your agent's trajectories! All data stays in your browser unless you give permission otherwise.
![Trajectory Viewer Screenshot](/trajectory-viewer.jpeg)
## Recording a Trajectory
### Using the Gradio UI
The simplest way to create agent trajectories is through the [Cua Agent Gradio UI](https://www.trycua.com/docs/quickstart-ui) by checking the "Save Trajectory" option.
### Using the ComputerAgent API
Trajectories are saved by default when using the ComputerAgent API:
```python
agent.run("book a flight for me")
```
You can explicitly control trajectory saving with the `save_trajectory` parameter:
```python
from agent import ComputerAgent
agent = ComputerAgent(save_trajectory=True)
agent.run("search for hotels in Boston")
```
Each trajectory folder is saved in a `trajectories` directory with a timestamp format, for example: `trajectories/20250501_222749`
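Because the folders are named by timestamp, sorting them lexicographically also sorts them chronologically. Here's a small sketch for listing your most recent runs, assuming the default `trajectories/` location:
```python
from pathlib import Path

# Each run gets its own folder, e.g. trajectories/20250501_222749
runs = sorted(p for p in Path("trajectories").iterdir() if p.is_dir())
for run in runs[-5:]:
    print(run.name)
```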
## Exploring and Analyzing Trajectories
Our Trajectory Viewer is designed to allow for thorough analysis and debugging in a friendly way. Once loaded, the viewer presents:
- **Timeline Slider**: Jump to any step in the session
- **Screen Preview**: See exactly what the agent saw
- **Action Details**: Review clicks, keypresses, and API calls
- **Logs & Metadata**: Inspect debug logs or performance stats
Use these features to:
- Step through each action and observation; understand your agent's decision-making
- Understand why and where your agent failed
- Collect insights for improving your instructions, prompts, tasks, agent, etc.
The trajectory viewer provides a visual interface for stepping through each action your agent took, making it easy to see what your agent “sees”.
## Getting Started
Ready to see your agent in action? Head over to the Trajectory Viewer and load up your first session. Debug smarter, train faster, and stay in control (all within your browser).
Happy tinkering and Cua on!
Have questions or want to share feedback? Join our community on Discord or open an issue on GitHub.
@@ -0,0 +1,183 @@
# Ubuntu Docker Support in Cua with Kasm
*Published Aug 26, 2025 by Francesco Bonacci*
Today we're shipping **Ubuntu Docker support** in Cua. You get a full Linux desktop inside a Docker container, viewable right in your browser—no VM spin-up, no extra clients. It behaves the same on macOS, Windows, and Linux.
<img src="/docker-ubuntu-support.png" alt="Cua + KasmVNC Ubuntu container desktop">
## Why we did this
If you build automation or RL workflows with Cua, you've probably run into the usual platform walls: macOS VMs (via Lume) are Apple-Silicon only; Windows Sandbox needs Pro/Enterprise; giving agents your host desktop is… exciting, but risky; and little OS quirks make “build once, run anywhere” harder than it should be.
We wanted something lightweight, isolated, and identical across machines. So we put a desktop in a container.
## Why we didn't use QEMU/KVM
Short answer: **portability, startup time, and ops friction.**
* **Runs everywhere, no hypervisor drama.** KVM needs Linux; Hyper-V/Virtualization.Framework setups vary by host and policy. Docker is ubiquitous across macOS/Windows/Linux and allowed in most CI runners—so your GUI env actually runs where your team works.
* **Faster boot & smaller footprints.** Containers cold-start in seconds and images are GB-scale; VMs tend to be minutes and tens of GB. That matters for parallel agents, CI, and local iteration.
* **Lower ops overhead.** No nested virt, kernel modules, or privileged host tweaks that many orgs (and cloud runners) block. Pull → run → browser.
* **Same image, everywhere.** One Docker image gives you an identical desktop on every dev laptop and in CI.
* **Web-first access out of the box.** KasmVNC serves the desktop over HTTP—no extra VNC/RDP clients or SPICE config.
**When we *do* reach for QEMU/KVM:**
* You need **true OS isolation** or to run **non-Linux** guests.
* You want **kernel-level features** or **device/GPU passthrough** (VFIO).
* You're optimizing for **hardware realism** over startup speed and density.
For this release, the goal was a **cross-platform Linux desktop that feels instant and identical** across local dev and CI. Containers + KasmVNC hit that sweet spot.
## What we built
Under the hood it's **KasmVNC + Ubuntu 22.04 (Xfce) in Docker**, pre-configured for computer-use automation. You get a proper GUI desktop served over HTTP (no VNC/RDP client), accessible from any modern browser. Cua's Computer server boots automatically so your agents can connect immediately.
### How it works (at a glance)
```
Your System
└─ Docker Container
└─ Xfce Desktop + KasmVNC → open in your browser
```
---
## Quick start
1. **Install Docker** — Docker Desktop (macOS/Windows) or Docker Engine (Linux).
2. **Pull or build the image**
```bash
# Pull (recommended)
docker pull --platform=linux/amd64 trycua/cua-ubuntu:latest
# Or build locally
cd libs/kasm
docker build -t cua-ubuntu:latest .
```
3. **Run with Cua's Computer SDK**
```python
from computer import Computer
computer = Computer(
os_type="linux",
provider_type="docker",
image="trycua/cua-ubuntu:latest",
name="my-automation-container"
)
await computer.run()
```
### Make an agent that drives this desktop
```python
from agent import ComputerAgent
# assumes `computer` is the instance created above
agent = ComputerAgent("openrouter/z-ai/glm-4.5v", tools=[computer])
async for _ in agent.run("Click on the search bar and type 'hello world'"):
pass
```
> Use any VLM with tool use; just make sure your OpenRouter creds are set.
By default you land on **Ubuntu 22.04 + Xfce** with a browser and desktop basics, the **Computer server** is running, the **web viewer** is available at `http://localhost:8006`, and common automation tools are preinstalled.
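Once the container is up, you don't have to go through an agent at all: the Computer interface works against this desktop the same way it does with our other providers. A minimal sketch (the same `computer.interface` methods shown in our other posts; coordinates are arbitrary):
```python
# assumes `computer` was created and started as in the quick start above
screenshot = await computer.interface.screenshot()
with open("ubuntu_desktop.png", "wb") as f:
    f.write(screenshot)

await computer.interface.move_cursor(200, 150)
await computer.interface.left_click()
await computer.interface.type("hello from the container")
```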
---
## Whats inside (in plain English)
A tidy Linux desktop with web access through **KasmVNC**, Python 3.11 and dev tools, plus utilities you'll actually use for automation—`wmctrl` for windows, `xclip` for clipboard, `ffmpeg` for media, screenshot helpers, and so on. It starts as a **non-root `kasm-user`**, lives in an **isolated filesystem** (unless you mount volumes), and ships with **SSL off for local dev** so you terminate TLS upstream when you deploy.
---
## How it compares
| Feature | KasmVNC Docker | Lume (macOS VM) | Windows Sandbox |
| ---------------- | --------------------- | --------------------- | ---------------------- |
| Platform support | macOS, Windows, Linux | macOS (Apple Silicon) | Windows Pro/Enterprise |
| Resource usage | Low (container) | Medium (full VM) | Medium (full VM) |
| Setup time | \~30s | 25 min | 12 min |
| GUI desktop | Linux | macOS | Windows |
| Web access | Browser (no client) | Typically VNC client | Typically RDP client |
| Consistency | Same everywhere | Hardware-dependent | OS-dependent |
**Use KasmVNC Docker when…** you want the **same GUI env across devs/CI/platforms**, you're doing **RL or end-to-end GUI tests**, or you need **many isolated desktops on one machine**.
**Use alternatives when…** you need native **macOS** (→ Lume) or native **Windows** (→ Windows Sandbox).
---
## Using the Agent Framework (parallel example)
A compact pattern for running multiple desktops and agents side-by-side:
```python
import asyncio
from computer import Computer
from agent import ComputerAgent
# Create multiple computer instances (each gets its own desktop)
computers = []
for i in range(3):
c = Computer(
os_type="linux",
provider_type="docker",
image="trycua/cua-ubuntu:latest",
name=f"parallel-desktop-{i}"
)
computers.append(c)
await c.run()
# Pair each desktop with a task
tasks = [
"open github and search for 'trycua/cua'",
"open a text editor and write 'hello world'",
"open the browser and go to google.com",
]
agents = [
ComputerAgent(model="openrouter/z-ai/glm-4.5v", tools=[c])
for c in computers
]
async def run_agent(agent, task):
async for _ in agent.run(task):
pass
await asyncio.gather(*[run_agent(a, t) for a, t in zip(agents, tasks)])
```
---
## What's next
We're polishing a **CLI to push/scale these containers on Cua Cloud**, exploring **GPU acceleration** for in-container inference, and publishing **prebuilt images** for Playwright, Selenium, and friends.
---
## Try it
```python
from computer import Computer
computer = Computer(os_type="linux", provider_type="docker", image="trycua/cua-ubuntu:latest")
await computer.run()
```
---
## Links
* **Docker Provider Docs:** [https://docs.trycua.com/computers/docker](https://docs.trycua.com/computers/docker)
* **KasmVNC:** [https://github.com/kasmtech/KasmVNC](https://github.com/kasmtech/KasmVNC)
* **Container Source:** [https://github.com/trycua/cua/tree/main/libs/kasm](https://github.com/trycua/cua/tree/main/libs/kasm)
* **Computer SDK:** [https://docs.trycua.com/docs/computer-sdk/computers](https://docs.trycua.com/docs/computer-sdk/computers)
* **Discord:** [https://discord.gg/cua-ai](https://discord.gg/cua-ai)
Questions or weird edge cases? Ping us on Discord—we're curious to see what you build.
238
blog/windows-sandbox.md Normal file
@@ -0,0 +1,238 @@
# Your Windows PC is Already the Perfect Development Environment for Computer-Use Agents
*Published on June 18, 2025 by Dillon DuPont*
Over the last few months, our enterprise users kept asking the same type of question: *"When are you adding support for AutoCAD?"* *"What about SAP integration?"* *"Can you automate our MES system?"* - each request was for different enterprise applications we'd never heard of.
At first, we deflected. We've been building Cua to work across different environments - from [Lume for macOS VMs](./lume-to-containerization) to cloud containers. But these requests kept piling up. AutoCAD automation. SAP integration. Specialized manufacturing systems.
Then it hit us: **they all ran exclusively on Windows**.
Most of us develop on macOS, so we hadn't considered Windows as a primary target for agent automation. But we were missing out on helping customers automate the software that actually runs their businesses.
So last month, we started working on Windows support for [RPA (Robotic Process Automation)](https://en.wikipedia.org/wiki/Robotic_process_automation). Here's the twist: **the perfect development environment was already sitting on every Windows machine** - we just had to unlock it.
<video width="100%" controls>
<source src="/demo_wsb.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
## Our Journey to Windows CUA Support
When we started Cua, we focused on making computer-use agents work everywhere - we built [Lume for macOS](https://github.com/trycua/cua/tree/main/libs/lume), created cloud infrastructure, and worked on Linux support. But no matter what we built, Windows kept coming up in every enterprise conversation.
The pattern became clear during customer calls: **the software that actually runs businesses lives on Windows**. Engineering teams wanted agents to automate AutoCAD workflows. Manufacturing companies needed automation for their MES systems. Finance teams were asking about Windows-only trading platforms and legacy enterprise software.
We could have gone straight to expensive Windows cloud infrastructure, but then we discovered Microsoft had already solved the development problem: [Windows Sandbox](https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/). Lightweight, free, and sitting on every Windows machine waiting to be used.
Windows Sandbox support is our first step - **Windows cloud instances are coming later this month** for production workloads.
## What is Windows Sandbox?
Windows Sandbox is Microsoft's built-in lightweight virtualization technology. Despite the name, it's actually closer to a disposable virtual machine than a traditional "sandbox" - it creates a completely separate, lightweight Windows environment rather than just containerizing applications.
Here's how it compares to other approaches:
```bash
Traditional VM Testing:
┌─────────────────────────────────┐
│ Your Windows PC │
├─────────────────────────────────┤
│ VMware/VirtualBox VM │
│  (Heavy, Persistent, Complex)   │
├─────────────────────────────────┤
│ Agent Testing │
└─────────────────────────────────┘
Windows Sandbox:
┌─────────────────────────────────┐
│ Your Windows PC │
├─────────────────────────────────┤
│ Windows Sandbox │
│  (Built-in, Fast, Disposable)   │
├─────────────────────────────────┤
│ Separate Windows Instance │
└─────────────────────────────────┘
```
> ⚠️ **Important Note**: Windows Sandbox supports **one virtual machine at a time**. For production workloads or running multiple agents simultaneously, you'll want our upcoming cloud infrastructure - but for learning and testing, this local setup is perfect to get started.
## Why Windows Sandbox is Perfect for Local Computer-Use Agent Testing
First, it's incredibly lightweight. We're talking seconds to boot up a fresh Windows environment, not the minutes you'd wait for a traditional VM. And since it's built into Windows 10 and 11, there's literally no setup cost - it's just sitting there waiting for you to enable it.
But the real magic is how disposable it is. Every time you start Windows Sandbox, you get a completely clean slate. Your agent messed something up? Crashed an application? No problem - just close the sandbox and start fresh. It's like having an unlimited supply of pristine Windows machines for testing.
## Getting Started: Three Ways to Test Agents
We've made Windows Sandbox agent testing as simple as possible. Here are your options:
### Option A: Quick Start with Agent UI (Recommended)
**Perfect for**: First-time users who want to see agents in action immediately
```bash
# One-time setup
pip install -U git+https://github.com/karkason/pywinsandbox.git
pip install -U "cua-computer[all]" "cua-agent[all]"
# Launch the Agent UI
python -m agent.ui
```
**What you get**:
- Visual interface in your browser
- Real-time agent action viewing
- Natural language task instructions
- No coding required
### Option B: Python API Integration
**Perfect for**: Developers building agent workflows
```python
import asyncio
from computer import Computer, VMProviderType
from agent import ComputerAgent
async def test_windows_agent():
# Create Windows Sandbox computer
computer = Computer(
provider_type=VMProviderType.WINSANDBOX,
os_type="windows",
memory="4GB",
)
# Start the VM (~35s)
await computer.run()
# Create agent with your preferred model
agent = ComputerAgent(
model="openai/computer-use-preview",
save_trajectory=True,
tools=[computer]
)
# Give it a task
async for result in agent.run("Open Calculator and compute 15% tip on $47.50"):
print(f"Agent action: {result}")
# Shutdown the VM
await computer.stop()
asyncio.run(test_windows_agent())
```
**What you get**:
- Full programmatic control
- Custom agent workflows
- Integration with your existing code
- Detailed action logging
### Option C: Manual Configuration
**Perfect for**: Advanced users who want full control
1. Enable Windows Sandbox in Windows Features
2. Create custom .wsb configuration files (see the sketch below)
3. Integrate with your existing automation tools
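As a sketch of step 2, here's a minimal `.wsb` file written and launched from Python. The element names follow Microsoft's documented Windows Sandbox configuration schema; the mapped folder path and memory value are placeholders:
```python
import os
from pathlib import Path

# Minimal Windows Sandbox configuration: map one host folder read-only
# and give the sandbox 4 GB of RAM. Adjust the path for your machine.
wsb_config = """<Configuration>
  <MemoryInMB>4096</MemoryInMB>
  <MappedFolders>
    <MappedFolder>
      <HostFolder>C:\\agent-workspace</HostFolder>
      <ReadOnly>true</ReadOnly>
    </MappedFolder>
  </MappedFolders>
</Configuration>
"""

config_path = Path("agent-test.wsb")
config_path.write_text(wsb_config)

# .wsb files are associated with Windows Sandbox, so opening one
# launches a sandbox with this configuration (Windows only).
os.startfile(str(config_path))
```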
## Comparing Your Options
Let's see how different testing approaches stack up:
### Windows Sandbox + Cua
- **Perfect for**: Quick testing and development
- **Cost**: Free (built into Windows)
- **Setup time**: Under 5 minutes
- **Safety**: Complete isolation from host system
- **Limitation**: One sandbox at a time
- **Requires**: Windows 10/11 with 4GB+ RAM
### Traditional VMs
- **Perfect for**: Complex testing scenarios
- **Full customization**: Any Windows version
- **Heavy resource usage**: Slow to start/stop
- **Complex setup**: License management required
- **Cost**: VM software + Windows licenses
## Real-World Windows RPA Examples
Here's what our enterprise users are building with Windows Sandbox:
### CAD and Engineering Automation
```python
# Example: AutoCAD drawing automation
task = """
1. Open AutoCAD and create a new drawing
2. Draw a basic floor plan with rooms and dimensions
3. Add electrical symbols and circuit layouts
4. Generate a bill of materials from the drawing
5. Export the drawing as both DWG and PDF formats
"""
```
### Manufacturing and ERP Integration
```python
# Example: SAP workflow automation
task = """
1. Open SAP GUI and log into the production system
2. Navigate to Material Management module
3. Create purchase orders for stock items below minimum levels
4. Generate vendor comparison reports
5. Export the reports to Excel and email to procurement team
"""
```
### Financial Software Automation
```python
# Example: Trading platform automation
task = """
1. Open Bloomberg Terminal or similar trading software
2. Monitor specific stock tickers and market indicators
3. Execute trades based on predefined criteria
4. Generate daily portfolio performance reports
5. Update risk management spreadsheets
"""
```
### Legacy Windows Application Integration
```python
# Example: Custom Windows application automation
task = """
1. Open legacy manufacturing execution system (MES)
2. Input production data from CSV files
3. Generate quality control reports
4. Update inventory levels across multiple systems
5. Create maintenance scheduling reports
"""
```
## System Requirements and Performance
### What You Need
- **Windows 10/11**: Any edition that supports Windows Sandbox
- **Memory**: 4GB minimum (8GB recommended for CAD/professional software)
- **CPU**: Virtualization support (enabled by default on modern systems)
- **Storage**: A few GB free space
### Performance Tips
- **Close unnecessary applications** before starting Windows Sandbox
- **Allocate appropriate memory** based on your RPA workflow complexity
- **Use SSD storage** for faster sandbox startup
- **Consider dedicated hardware** for resource-intensive applications like CAD software
**Stay tuned** - we'll be announcing Windows Cloud Instances later this month.
But for development, prototyping, and learning Windows RPA workflows, **Windows Sandbox gives you everything you need to get started right now**.
## Learn More
- [Windows Sandbox Documentation](https://learn.microsoft.com/en-us/windows/security/application-security/application-isolation/windows-sandbox/)
- [Cua GitHub Repository](https://github.com/trycua/cua)
- [Agent UI Documentation](https://github.com/trycua/cua/tree/main/libs/agent)
- [Join our Discord Community](https://discord.gg/cua-ai)
---
*Ready to see AI agents control your Windows applications? Come share your testing experiences on Discord!*