Merge branch 'main' into feature/new-logo

This commit is contained in:
Morgan Dean
2025-08-13 17:17:32 +02:00
79 changed files with 118625 additions and 1731 deletions

414
README.md
View File

@@ -16,223 +16,149 @@
**cua** ("koo-ah") is Docker for [Computer-Use Agents](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse) - it enables AI agents to control full operating systems in virtual containers and deploy them locally or to the cloud.
<div align="center">
<video src="https://github.com/user-attachments/assets/c619b4ea-bb8e-4382-860e-f3757e36af20" width="800" controls></video>
<video src="https://github.com/user-attachments/assets/c619b4ea-bb8e-4382-860e-f3757e36af20" width="600" controls></video>
</div>
<details>
<summary><b>Check out more demos of the Computer-Use Agent in action
</b></summary>
<details open>
<summary><b>MCP Server: Work with Claude Desktop and Tableau</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/9f573547-5149-493e-9a72-396f3cff29df" width="800" controls></video>
</div>
</details>
With the Computer SDK, you can:
- automate Windows, Linux, and macOS VMs with a consistent, [pyautogui-like API](https://docs.trycua.com/docs/libraries/computer#interface-actions)
- create & manage VMs [locally](https://docs.trycua.com/docs/computer-sdk/computers#cua-local-containers) or using [cua cloud](https://www.trycua.com/)
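For example, a minimal sketch of the interface, assuming a local macOS VM set up as described in the Usage Guide below:
```python
import asyncio
from computer import Computer

async def main():
    # Sketch: start a local macOS VM (requires the Lume setup described later in this README)
    computer = Computer(os_type="macos")
    await computer.run()
    await computer.interface.left_click(100, 200)        # pyautogui-like mouse action
    await computer.interface.type_text("Hello, world!")  # keyboard action
    screenshot = await computer.interface.screenshot()   # returns image bytes

asyncio.run(main())
```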
<details>
<summary><b>AI-Gradio: Multi-app workflow with browser, VS Code and terminal</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/723a115d-1a07-4c8e-b517-88fbdf53ed0f" width="800" controls></video>
</div>
</details>
With the Agent SDK, you can:
- run computer-use models with a [consistent output](https://docs.trycua.com/docs/agent-sdk/chat-history#message-array-structure)
- run composed agents using UI grounding models and any LLM
- use any liteLLM provider (`openai/`, `openrouter/`, etc.) or our included local providers (`huggingface-local/`, `mlx/`)
- quickly evaluate new UI agent models and UI grounding models
- `anthropic/claude-opus-4-1-20250805` (using [Computer-Use Models](https://docs.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents))
- `openai/computer-use-preview`
- `openrouter/z-ai/glm-4.5v`
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `omniparser+{any LLM}` (using [Composed Agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents))
- `huggingface-local/HelloKKMe/GTA1-7B+{any LLM}`
- `huggingface/HelloKKMe/GTA1-32B+{any LLM}`
- `vllm_hosted/HelloKKMe/GTA1-72B+{any LLM}`
- `human/human` (using [Human-in-the-Loop](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop))
- benchmark on OSWorld-Verified, SheetBench-V2, and more [with a single line of code using HUD](https://docs.trycua.com/docs/agent-sdk/integrations/hud) ([Notebook](https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb))
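For example, a minimal sketch of swapping between these model strings, assuming `computer` is a connected Computer instance from the Computer SDK:
```python
from agent import ComputerAgent

# A hosted computer-use model ...
agent = ComputerAgent(model="anthropic/claude-opus-4-1-20250805", tools=[computer])

# ... or a composed agent: a local grounding model plus any liteLLM planning model
agent = ComputerAgent(model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o", tools=[computer])

async for result in agent.run("Open the settings and enable dark mode"):
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"])
```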
<details>
<summary><b>Notebook: Fix GitHub issue in Cursor</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/f67f0107-a1e1-46dc-aa9f-0146eb077077" width="800" controls></video>
</div>
</details>
</details><br/>
# 🚀 Quick Start with a Computer-Use Agent UI
**Need to automate desktop tasks? Launch the Computer-Use Agent UI with a single command.**
### Option 1: Fully-managed install with Docker (recommended)
*Docker-based guided install for quick use*
**macOS/Linux/Windows (via WSL):**
```bash
# Requires Docker
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/scripts/playground-docker.sh)"
```
This script will guide you through setup using Docker containers and launch the Computer-Use Agent UI.
---
### Option 2: [Dev Container](./.devcontainer/README.md)
*Best for contributors and development*
This repository includes a [Dev Container](./.devcontainer/README.md) configuration that simplifies setup to a few steps:
1. **Install the Dev Containers extension ([VS Code](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) or [WindSurf](https://docs.windsurf.com/windsurf/advanced#dev-containers-beta))**
2. **Open the repository in the Dev Container:**
- Press `Ctrl+Shift+P` (or `⌘+Shift+P` on macOS)
- Select `Dev Containers: Clone Repository in Container Volume...` and paste the repository URL `https://github.com/trycua/cua.git` (if you have not cloned the repository yet), or select `Dev Containers: Open Folder in Container...` (if you already have a local clone).
> **Note**: On WindSurf, the post install hook might not run automatically. If so, run `/bin/bash .devcontainer/post-install.sh` manually.
3. **Open the VS Code workspace:** Once `post-install.sh` has finished running, open the `.vscode/py.code-workspace` workspace and press ![Open Workspace](https://github.com/user-attachments/assets/923bdd43-8c8f-4060-8d78-75bfa302b48c).
4. **Run the Agent UI example:** Click ![Run Agent UI](https://github.com/user-attachments/assets/7a61ef34-4b22-4dab-9864-f86bf83e290b)
to start the Gradio UI. If prompted to install **debugpy (Python Debugger)** to enable remote debugging, select 'Yes' to proceed.
5. **Access the Gradio UI:** The Gradio UI will be available at `http://localhost:7860` and will automatically forward to your host machine.
---
### Option 3: PyPI
*Direct Python package installation*
```bash
# conda create -yn cua python==3.12
pip install -U "cua-computer[all]" "cua-agent[all]"
python -m agent.ui # Start the agent UI
```
Or check out the [Usage Guide](#-usage-guide) to learn how to use our Python SDK in your own code.
---
## Supported [Agent Loops](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops)
- [UITARS-1.5](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Run locally on Apple Silicon with MLX, or use cloud providers
- [OpenAI CUA](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Use OpenAI's Computer-Use Preview model
- [Anthropic CUA](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Use Anthropic's Computer-Use capabilities
- [OmniParser-v2.0](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Control UI with [Set-of-Marks prompting](https://som-gpt4v.github.io/) using any vision model
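A minimal sketch of selecting these loops, assuming `computer` is a connected Computer instance; the loop is chosen automatically from the model string:
```python
from agent import ComputerAgent

ComputerAgent(model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])  # UI-TARS-1.5
ComputerAgent(model="openai/computer-use-preview", tools=[computer])                      # OpenAI CUA
ComputerAgent(model="anthropic/claude-3-5-sonnet-20240620", tools=[computer])             # Anthropic CUA
ComputerAgent(model="omniparser+openai/gpt-4o", tools=[computer])                         # OmniParser + any vision LLM
```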
## 🖥️ Compatibility
For detailed compatibility information including host OS support, VM emulation capabilities, and model provider compatibility, see the [Compatibility Matrix](./COMPATIBILITY.md).
Missing a model? [Raise a feature request](https://github.com/trycua/cua/issues/new?assignees=&labels=enhancement&projects=&title=%5BAgent%5D%3A+Add+model+support+for+) or [contribute](https://github.com/trycua/cua/blob/main/CONTRIBUTING.md)!
<br/>
# Quick Start
- [Get started with a Computer-Use Agent UI](https://docs.trycua.com/docs/quickstart-ui)
- [Get started with the Computer-Use Agent CLI](https://docs.trycua.com/docs/quickstart-cli)
- [Get Started with the Python SDKs](https://docs.trycua.com/docs/quickstart-devs)
<br/>
# 🐍 Usage Guide
Follow these steps to use Cua in your own Python code. See [Developer Guide](./docs/Developer-Guide.md) for building from source.
### Step 1: Install Lume CLI
# Usage ([Docs](https://docs.trycua.com/docs))
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
pip install cua-agent[all]
```
```python
from agent import ComputerAgent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022",
tools=[computer],
max_trajectory_budget=5.0
)
messages = [{"role": "user", "content": "Take a screenshot and tell me what you see"}]
async for result in agent.run(messages):
for item in result["output"]:
if item["type"] == "message":
print(item["content"][0]["text"])
```
Lume CLI manages high-performance macOS/Linux VMs with near-native speed on Apple Silicon.
### Output format (OpenAI Agent Responses Format):
```json
{
"output": [
# user input
{
"role": "user",
"content": "go to trycua on gh"
},
# first agent turn adds the model output to the history
{
"summary": [
{
"text": "Searching Firefox for Trycua GitHub",
"type": "summary_text"
}
],
"type": "reasoning"
},
{
"action": {
"text": "Trycua GitHub",
"type": "type"
},
"call_id": "call_QI6OsYkXxl6Ww1KvyJc4LKKq",
"status": "completed",
"type": "computer_call"
},
# second agent turn adds the computer output to the history
{
"type": "computer_call_output",
"call_id": "call_QI6OsYkXxl6Ww1KvyJc4LKKq",
"output": {
"type": "input_image",
"image_url": "data:image/png;base64,..."
}
},
# final agent turn adds the agent output text to the history
{
"type": "message",
"role": "assistant",
"content": [
{
"text": "Success! The Trycua GitHub page has been opened.",
"type": "output_text"
}
]
}
],
"usage": {
"prompt_tokens": 150,
"completion_tokens": 75,
"total_tokens": 225,
"response_cost": 0.01,
}
}
```
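As a sketch, the `output` array and `usage` block above can be consumed like this (continuing the `agent.run(messages)` loop from the snippet above):
```python
async for result in agent.run(messages):
    for item in result["output"]:
        if item["type"] == "reasoning":
            print("summary:", item["summary"][0]["text"])
        elif item["type"] == "computer_call":
            print("action:", item["action"]["type"])
        elif item["type"] == "message":
            print("assistant:", item["content"][0]["text"])
    print("cost so far ($):", result["usage"]["response_cost"])
```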
### Step 2: Pull the macOS CUA Image
# Computer ([Docs](https://docs.trycua.com/docs/computer-sdk/computers))
```bash
lume pull macos-sequoia-cua:latest
pip install cua-computer[all]
```
The macOS CUA image contains the default Mac apps and the Computer Server for easy automation.
### Step 3: Install Python SDK
```bash
pip install "cua-computer[all]" "cua-agent[all]"
```
### Step 4: Use in Your Code
```python
import asyncio
from computer import Computer
from agent import ComputerAgent, LLM
async def main():
# Start a local macOS VM
computer = Computer(os_type="macos")
await computer.run()
async with Computer(
os_type="linux",
provider_type="cloud",
name="your-container-name",
api_key="your-api-key"
) as computer:
# Take screenshot
screenshot = await computer.interface.screenshot()
# Or with Cua Cloud Container
computer = Computer(
os_type="linux",
api_key="your_cua_api_key_here",
name="your_container_name_here"
)
# Example: Direct control of a macOS VM with Computer
computer.interface.delay = 0.1 # Wait 0.1 seconds between keyboard/mouse actions
await computer.interface.left_click(100, 200)
await computer.interface.type_text("Hello, world!")
screenshot_bytes = await computer.interface.screenshot()
# Example: Create and run an agent locally using mlx-community/UI-TARS-1.5-7B-6bit
agent = ComputerAgent(
model="mlx/mlx-community/UI-TARS-1.5-7B-6bit",
tools=[computer],
)
async for result in agent.run("Find the trycua/cua repository on GitHub and follow the quick start guide"):
print(result)
if __name__ == "__main__":
asyncio.run(main())
# Click and type
await computer.interface.left_click(100, 100)
await computer.interface.type("Hello!")
```
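Since the block above shows the local-VM and Cua Cloud variants side by side, here is one minimal end-to-end sketch (container name, API key, and model string are placeholders):
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def main():
    # Cua Cloud container (placeholders for name/API key); a local VM works the same way
    async with Computer(
        os_type="linux",
        provider_type="cloud",
        name="your-container-name",
        api_key="your-api-key",
    ) as computer:
        agent = ComputerAgent(
            model="anthropic/claude-3-5-sonnet-20241022",
            tools=[computer],
        )
        async for result in agent.run("Find the trycua/cua repository on GitHub"):
            for item in result["output"]:
                if item["type"] == "message":
                    print(item["content"][0]["text"])

if __name__ == "__main__":
    asyncio.run(main())
```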
For ready-to-use examples, check out our [Notebooks](./notebooks/) collection.
### Lume CLI Reference
```bash
# Install Lume CLI and background service
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash
# List all VMs
lume ls
# Pull a VM image
lume pull macos-sequoia-cua:latest
# Create a new VM
lume create my-vm --os macos --cpu 4 --memory 8GB --disk-size 50GB
# Run a VM (creates and starts if it doesn't exist)
lume run macos-sequoia-cua:latest
# Stop a VM
lume stop macos-sequoia-cua_latest
# Delete a VM
lume delete macos-sequoia-cua_latest
```
### Lumier CLI Reference
For advanced container-like virtualization, check out [Lumier](./libs/lumier/README.md) - a Docker interface for macOS and Linux VMs.
```bash
# Install Lume CLI and background service
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash
# Run macOS in a Docker container
docker run -it --rm \
--name lumier-vm \
-p 8006:8006 \
-v $(pwd)/storage:/storage \
-v $(pwd)/shared:/shared \
-e VM_NAME=lumier-vm \
-e VERSION=ghcr.io/trycua/macos-sequoia-cua:latest \
-e CPU_CORES=4 \
-e RAM_SIZE=8192 \
-e HOST_STORAGE_PATH=$(pwd)/storage \
-e HOST_SHARED_PATH=$(pwd)/shared \
trycua/lumier:latest
```
## Resources
# Resources
- [How to use the MCP Server with Claude Desktop or other MCP clients](./libs/python/mcp-server/README.md) - One of the easiest ways to get started with Cua
- [How to use OpenAI Computer-Use, Anthropic, OmniParser, or UI-TARS for your Computer-Use Agent](./libs/python/agent/README.md)
- [How to use Lume CLI for managing desktops](./libs/lume/README.md)
- [Training Computer-Use Models: Collecting Human Trajectories with Cua (Part 1)](https://www.trycua.com/blog/training-computer-use-models-trajectories-1)
- [Build Your Own Operator on macOS (Part 1)](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)
## Modules
@@ -249,112 +175,6 @@ docker run -it --rm \
| [**Core (Python)**](./libs/python/core/README.md) | Python Core utilities | `pip install cua-core` |
| [**Core (Typescript)**](./libs/typescript/core/README.md) | Typescript Core utilities | `npm install @trycua/core` |
## Computer Interface Reference
For complete examples, see [computer_examples.py](./examples/computer_examples.py) or [computer_nb.ipynb](./notebooks/computer_nb.ipynb)
```python
# Shell Actions
result = await computer.interface.run_command(cmd) # Run shell command
# result.stdout, result.stderr, result.returncode
# Mouse Actions
await computer.interface.left_click(x, y) # Left click at coordinates
await computer.interface.right_click(x, y) # Right click at coordinates
await computer.interface.double_click(x, y) # Double click at coordinates
await computer.interface.move_cursor(x, y) # Move cursor to coordinates
await computer.interface.drag_to(x, y, duration) # Drag to coordinates
await computer.interface.get_cursor_position() # Get current cursor position
await computer.interface.mouse_down(x, y, button="left") # Press and hold a mouse button
await computer.interface.mouse_up(x, y, button="left") # Release a mouse button
# Keyboard Actions
await computer.interface.type_text("Hello") # Type text
await computer.interface.press_key("enter") # Press a single key
await computer.interface.hotkey("command", "c") # Press key combination
await computer.interface.key_down("command") # Press and hold a key
await computer.interface.key_up("command") # Release a key
# Scrolling Actions
await computer.interface.scroll(x, y) # Scroll the mouse wheel
await computer.interface.scroll_down(clicks) # Scroll down
await computer.interface.scroll_up(clicks) # Scroll up
# Screen Actions
await computer.interface.screenshot() # Take a screenshot
await computer.interface.get_screen_size() # Get screen dimensions
# Clipboard Actions
await computer.interface.set_clipboard(text) # Set clipboard content
await computer.interface.copy_to_clipboard() # Get clipboard content
# File System Operations
await computer.interface.file_exists(path) # Check if file exists
await computer.interface.directory_exists(path) # Check if directory exists
await computer.interface.read_text(path, encoding="utf-8") # Read file content
await computer.interface.write_text(path, content, encoding="utf-8") # Write file content
await computer.interface.read_bytes(path) # Read file content as bytes
await computer.interface.write_bytes(path, content) # Write file content as bytes
await computer.interface.delete_file(path) # Delete file
await computer.interface.create_dir(path) # Create directory
await computer.interface.delete_dir(path) # Delete directory
await computer.interface.list_dir(path) # List directory contents
# Accessibility
await computer.interface.get_accessibility_tree() # Get accessibility tree
# Delay Configuration
# Set default delay between all actions (in seconds)
computer.interface.delay = 0.5 # 500ms delay between actions
# Or specify delay for individual actions
await computer.interface.left_click(x, y, delay=1.0) # 1 second delay after click
await computer.interface.type_text("Hello", delay=0.2) # 200ms delay after typing
await computer.interface.press_key("enter", delay=0.5) # 500ms delay after key press
# Python Virtual Environment Operations
await computer.venv_install("demo_venv", ["requests", "macos-pyxa"]) # Install packages in a virtual environment
await computer.venv_cmd("demo_venv", "python -c 'import requests; print(requests.get(\"https://httpbin.org/ip\").json())'") # Run a shell command in a virtual environment
await computer.venv_exec("demo_venv", python_function_or_code, *args, **kwargs) # Run a Python function in a virtual environment and return the result / raise an exception
# Example: Use sandboxed functions to execute code in a Cua Container
from computer.helpers import sandboxed
@sandboxed("demo_venv")
def greet_and_print(name):
"""Get the HTML of the current Safari tab"""
import PyXA
safari = PyXA.Application("Safari")
html = safari.current_document.source()
print(f"Hello from inside the container, {name}!")
return {"greeted": name, "safari_html": html}
# When a @sandboxed function is called, it will execute in the container
result = await greet_and_print("Cua")
# Result: {"greeted": "Cua", "safari_html": "<html>...</html>"}
# stdout and stderr are also captured and printed / raised
print("Result from sandboxed function:", result)
```
## ComputerAgent Reference
For complete examples, see [agent_examples.py](./examples/agent_examples.py) or [agent_nb.ipynb](./notebooks/agent_nb.ipynb)
```python
# Import necessary components
from agent import ComputerAgent
# UI-TARS-1.5 agent for local execution with MLX
ComputerAgent(model="mlx/mlx-community/UI-TARS-1.5-7B-6bit")
# OpenAI Computer-Use agent using OPENAI_API_KEY
ComputerAgent(model="computer-use-preview")
# Anthropic Claude agent using ANTHROPIC_API_KEY
ComputerAgent(model="anthropic/claude-3-5-sonnet-20240620")
# OmniParser loop for UI control using Set-of-Marks (SOM) prompting and any vision LLM
ComputerAgent(model="omniparser+ollama_chat/gemma3:12b-it-q4_K_M")
```
## Community
Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas, get assistance, or share your demos!
@@ -409,4 +229,4 @@ Thank you to all our supporters!
<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- ALL-CONTRIBUTORS-LIST:END -->

View File

@@ -29,11 +29,4 @@ async for result in agent.run(prompt):
print("Agent:", result["output"][-1]["content"][0]["text"])
```
We currently support 4 computer-using agent loops:
- Anthropic CUAs
- OpenAI CUA Preview
- UI-TARS 1.5
- Omniparser + LLMs
For a full list of supported models and configurations, see the [Supported Agents](./supported-agents) page.
For a list of supported models and configurations, see the [Supported Agents](./supported-agents/computer-use-agents) page.

View File

@@ -0,0 +1,28 @@
---
title: Benchmarks
description: Computer Agent SDK benchmarks for agentic GUI tasks
---
The benchmark system evaluates models on GUI grounding tasks, specifically agent loop success rate and click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)
## Available Benchmarks
- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
- **[Interactive Testing](./interactive)** - Real-time testing and visualization
## Quick Start
```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd libs/python/agent/benchmarks
# Install dependencies
pip install "cua-agent[all]"
# Run a benchmark
python ss-v2.py
```

View File

@@ -0,0 +1,21 @@
---
title: Interactive Tool
description: Real-time testing and visualization tool for GUI grounding models
---
This tool allows you to test multiple models interactively by providing natural language instructions. It automatically captures screenshots and tests all configured models sequentially, providing immediate feedback and visual results.
## Usage
```bash
# Start the interactive tool
cd libs/python/agent/benchmarks
python interactive.py
```
## Commands
- **Type instruction**: Screenshot + test all models
- **`screenshot`**: Take screenshot without prediction
- **`models`**: List available models
- **`quit`/`exit`**: Exit the tool

View File

@@ -0,0 +1,57 @@
---
title: Introduction
description: Overview of benchmarking in the c/ua agent framework
---
The c/ua agent framework uses benchmarks to test the performance of supported models and providers on various agentic tasks.
## Benchmark Types
Computer-Agent benchmarks evaluate two key capabilities:
- **Plan Generation**: Breaking down complex tasks into a sequence of actions
- **Coordinate Generation**: Predicting precise click locations on GUI elements
## Using State-of-the-Art Models
Let's see how to use the SOTA vision-language models in the c/ua agent framework.
### Plan Generation + Coordinate Generation
**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents
This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.
```python
# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM
# This makes it suitable for agentic loops for computer-use
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```
### Coordinate Generation Only
**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy
This leaderboard tests models that specialize in finding exactly where to click on screen elements, but they need to be told what specific action to take.
```python
# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates; it cannot plan:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings") # (27, 450)
# This will raise an error:
# agent.run("Open Firefox and go to github.com")
```
### Composed Agent
The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any liteLLM model can be used as the plan generation model.
```python
# It can be paired with any LLM to form a composed agent:
# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```

View File

@@ -0,0 +1,9 @@
{
"pages": [
"introduction",
"screenspot-v2",
"screenspot-pro",
"interactive",
"osworld-verified"
]
}

View File

@@ -0,0 +1,89 @@
---
title: OSWorld-Verified
description: Benchmark ComputerAgent on OSWorld tasks using HUD
---
OSWorld-Verified is a curated subset of OSWorld tasks that can be run using the HUD framework. Use ComputerAgent with HUD to benchmark on these tasks.
## Setup
```bash
pip install hud-python==0.2.10
```
Set environment variables:
```bash
export HUD_API_KEY="your_hud_key"
export ANTHROPIC_API_KEY="your_anthropic_key" # For Claude
export OPENAI_API_KEY="your_openai_key" # For OpenAI
```
## Quick Start
```python
import asyncio
from hud import gym, load_taskset
from agent.integrations.hud import ComputerAgent
async def run_osworld():
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
test = taskset[144] # Example task
# Create environment (~2.5 min startup)
env = await gym.make(test)
# Create agent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022", # any ComputerAgent model string
environment="linux"
)
# Run benchmark
obs, _ = await env.reset()
for i in range(100):
action, done = await agent.predict(obs)
obs, reward, terminated, info = await env.step(action)
if done or terminated:
break
# Evaluate results
result = await env.evaluate()
await env.close()
return result
# Run benchmark
result = asyncio.run(run_osworld())
print(f"Success: {result.get('success', False)}")
```
## Parallel Execution
Run all tasks in parallel using `run_job`:
```python
from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370
# Run benchmark job
job = await run_job(
model="openai/computer-use-preview",
task_or_taskset=taskset,
job_name="test-computeragent-job",
max_concurrent_tasks=5,
# add any extra ComputerAgent kwargs:
verbosity=logging.INFO, # Enable logging
# trajectory_dir=".." # Save trajectories locally
)
# Get results OR view them at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")
```

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-Pro
description: High-resolution GUI grounding benchmark
---
ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-pro.py
# Run with custom sample limit
python ss-pro.py --samples 50
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-v2
description: Standard resolution GUI grounding benchmark
---
ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-v2.py
# Run with custom sample limit
python ss-v2.py --samples 100
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -0,0 +1,130 @@
---
title: Custom Computers
slug: custom-computer-handlers
---
The Agent SDK supports defining custom computer handlers using a simple dictionary interface. This enables integration with custom automation backends, testing frameworks, or specialized computer control systems.
## Example: Defining a Custom Computer Handler
```python
import asyncio
from PIL import Image
# Define your custom computer functions
async def take_screenshot():
"""Your custom screenshot implementation"""
# Return PIL Image, bytes, or base64 string
return Image.new('RGB', (1920, 1080), color='white')
# Create dict-based computer handler - only 'screenshot' is required
custom_computer = {
'screenshot': take_screenshot, # required
# everything below is optional
'environment': 'linux', # linux, mac, windows, browser
'dimensions': (1920, 1080), # (width, height)
'click': lambda x, y, button: print(f"Clicking at ({x}, {y}) with {button} button"),
}
```
You can then use this as a tool for your agent:
```python
from agent import ComputerAgent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20240620",
tools=[custom_computer],
)
# Agent will automatically convert dict to agent.computers.CustomComputerHandler
await agent.run("Take a screenshot and click at coordinates 100, 200")
```
## Class-Based Implementation
For more complex implementations, you can create a custom class by inheriting from `AsyncComputerHandler`:
```python
from agent.computers import AsyncComputerHandler
from PIL import Image
from typing import Literal, List, Dict, Union, Optional
class MyCustomComputer(AsyncComputerHandler):
"""Custom computer handler implementation."""
def __init__(self):
# Initialize your custom computer interface here
pass
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...
# ==== Anthropic Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
...
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
...
# Use with agent
custom_computer = MyCustomComputer()
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20240620",
tools=[custom_computer],
)
await agent.run("Take a screenshot and click at coordinates 100, 200")
```

View File

@@ -0,0 +1,49 @@
---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
The HUD integration allows you to use ComputerAgent with the [HUD benchmarking framework](https://www.hud.so/), providing the same interface as existing HUD agents while leveraging ComputerAgent's capabilities.
## Installation
```bash
pip install "cua-agent[hud]"
## or install hud-python directly
# pip install hud-python==0.2.10
```
## Usage
```python
from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370
# Run benchmark job
job = await run_job(
model="openai/computer-use-preview",
# model="anthropic/claude-3-5-sonnet-20241022",
# model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
task_or_taskset=taskset,
job_name="test-computeragent-job",
max_concurrent_tasks=5,
# add any extra ComputerAgent kwargs:
verbosity=logging.INFO, # Enable logging
# trajectory_dir=".." # Save trajectories locally
)
# Get results OR view them at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")
```
**Available Benchmarks:**
1. [OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified) - Benchmark on OSWorld tasks
See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

View File

@@ -0,0 +1,4 @@
{
"title": "Integrations",
"pages": ["hud"]
}

View File

@@ -3,13 +3,16 @@
"description": "Build computer-using agents with the Agent SDK",
"pages": [
"agent-loops",
"supported-agents",
"supported-agents",
"chat-history",
"callbacks",
"sandboxed-tools",
"custom-computer-handlers",
"local-models",
"prompt-caching",
"usage-tracking",
"migration-guide"
"benchmarks",
"migration-guide",
"integrations"
]
}

View File

@@ -1,34 +0,0 @@
---
title: Supported Agents
---
This page lists all supported agent loops and their compatible models/configurations in cua.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.
## Anthropic CUAs
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
## OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
## UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Omniparser + LLMs
- `omniparser+vertex_ai/gemini-pro`
- `omniparser+openai/gpt-4o`
- Any LiteLLM-compatible model combined with Omniparser
---
For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).

View File

@@ -0,0 +1,106 @@
---
title: Composed Agents
description: Combine grounding models with any LLM for computer-use capabilities
---
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
## How Composed Agents Work
1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates
## Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component:
- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
## Supported Thinking Models
Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-5`, `openai/o3`, `openai/gpt-4o`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
## Usage Examples
### GTA1 + GPT-5
Use OpenAI's GPT-5 for planning with specialized grounding:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
tools=[computer]
)
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### GTA1 + Claude 3.5 Sonnet
Combine state-of-the-art grounding with powerful reasoning:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
tools=[computer]
)
async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```
### UI-TARS + GPT-4o
Combine two different vision models for enhanced capabilities:
```python
agent = ComputerAgent(
"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
tools=[computer]
)
async for _ in agent.run("Help me fill out this form with my personal information"):
pass
```
## Benefits of Composed Agents
- **Specialized Grounding**: Use models optimized for click prediction accuracy
- **Flexible Planning**: Choose any LLM for task reasoning and planning
- **Cost Optimization**: Use smaller grounding models with larger planning models only when needed
- **Performance**: Leverage the strengths of different model architectures
## Capabilities
Composed agents support both capabilities:
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")
# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
pass
# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")
```
---
For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).

View File

@@ -0,0 +1,67 @@
---
title: Computer-Use Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---
These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.
## Anthropic CUAs
Claude models with computer-use capabilities:
- Claude 4.1: `claude-opus-4-1-20250805`
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
```python
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
pass
```
## OpenAI CUA Preview
OpenAI's computer-use preview model:
- Computer-use-preview: `computer-use-preview`
```python
agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
async for _ in agent.run("Take a screenshot and describe what you see"):
pass
```
## UI-TARS 1.5
Unified vision-language model for computer-use:
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
pass
```
## GLM-4.5V
Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
- `openrouter/z-ai/glm-4.5v`
- `huggingface-local/zai-org/GLM-4.5V`
```python
agent = ComputerAgent("openrouter/z-ai/glm-4.5v", tools=[computer])
async for _ in agent.run("Click on the search bar and type 'hello world'"):
pass
```
---
For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

View File

@@ -0,0 +1,89 @@
---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---
These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
## All Computer-Use Agents
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
### Anthropic CUAs
- Claude 4.1: `claude-opus-4-1-20250805`
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
### UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Specialized Grounding Models
These models are optimized specifically for click prediction and UI element grounding:
### OmniParser
OCR-focused set-of-marks model that requires an LLM for click prediction:
- `omniparser` (requires combination with any LiteLLM vision model)
### GTA1-7B
State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
- `huggingface-local/HelloKKMe/GTA1-7B`
## Usage Examples
```python
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")
print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```
```python
# OmniParser is just for OCR, so it requires an LLM for predict_click
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
# Predict click coordinates using composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: Cannot use omniparser alone for click prediction
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button") # Error!
```
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```
---
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).

View File

@@ -0,0 +1,66 @@
---
title: Human-In-The-Loop
description: Use humans as agents for evaluation, demonstrations, and interactive control
---
The Agent SDK ships with a human tool, providing native support for a human-in-the-loop that you can use to evaluate your environment and tools, or to create demonstrations. Use it by specifying `grounding_model+human/human`, or `human/human` directly.
## Getting Started
To start the human agent tool, simply run:
```bash
python -m agent.human_tool
```
The UI will show you pending completions. Select a completion to take control of the agent.
## Usage Examples
### Direct Human Agent
```python
from agent import ComputerAgent
from agent.computer import computer
agent = ComputerAgent(
"human/human",
tools=[computer]
)
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### Composed with Grounding Model
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+human/human",
tools=[computer]
)
async for _ in agent.run("Navigate to the settings page and enable dark mode"):
pass
```
## Features
The human-in-the-loop interface provides:
- **Interactive UI**: Web-based interface for reviewing and responding to agent requests
- **Image Display**: Screenshots with click handlers for direct interaction
- **Action Accordions**: Support for various computer actions (click, type, keypress, etc.)
- **Tool Calls**: Full OpenAI-compatible tool call support
- **Real-time Updates**: Smart polling for responsive UI updates
## Use Cases
- **Evaluation**: Have humans evaluate agent performance and provide ground truth responses
- **Demonstrations**: Create training data by having humans demonstrate tasks
- **Interactive Control**: Take manual control when automated agents need human guidance
- **Testing**: Validate agent, tool, and environment behavior manually
---
For more details on the human tool implementation, see the [Human Tool Documentation](../../tools/human-tool).

View File

@@ -0,0 +1,10 @@
{
"title": "Supported Agents",
"description": "Models and configurations supported by the Agent SDK",
"pages": [
"computer-use-agents",
"grounding-models",
"composed-agents",
"human-in-the-loop"
]
}

View File

@@ -169,18 +169,20 @@ python -m agent.cli openai/computer-use-preview
<Tab value="uv">
```bash
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-3-5-sonnet-20241022
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-opus-4-20250514
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-opus-4-1-20250805
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-sonnet-4-20250514
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-3-5-sonnet-20241022
```
</Tab>
<Tab value="conda/pip">
```bash
python -m agent.cli anthropic/claude-3-5-sonnet-20241022
python -m agent.cli anthropic/claude-opus-4-1-20250805
python -m agent.cli anthropic/claude-opus-4-20250514
python -m agent.cli anthropic/claude-sonnet-4-20250514
python -m agent.cli anthropic/claude-3-5-sonnet-20241022
```
</Tab>

View File

@@ -13,7 +13,7 @@ from utils import load_dotenv_files
load_dotenv_files()
# Import the create_gradio_ui function
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
if __name__ == "__main__":
print("Launching Computer-Use Agent Gradio UI with advanced features...")

View File

@@ -37,6 +37,7 @@ pip install "cua-agent[omni]" # Omniparser + any LLM support
pip install "cua-agent[uitars]" # UI-TARS
pip install "cua-agent[uitars-mlx]" # UI-TARS + MLX support
pip install "cua-agent[uitars-hf]" # UI-TARS + Huggingface support
pip install "cua-agent[glm45v-hf]" # GLM-4.5V + Huggingface support
pip install "cua-agent[ui]" # Gradio UI support
```

View File

@@ -5,7 +5,7 @@ agent - Decorator-based Computer Use Agent with liteLLM integration
import logging
import sys
from .decorators import agent_loop
from .decorators import register_agent
from .agent import ComputerAgent
from .types import Messages, AgentResponse
@@ -13,7 +13,7 @@ from .types import Messages, AgentResponse
from . import loops
__all__ = [
"agent_loop",
"register_agent",
"ComputerAgent",
"Messages",
"AgentResponse"

View File

@@ -3,7 +3,9 @@ Adapters package for agent - Custom LLM adapters for LiteLLM
"""
from .huggingfacelocal_adapter import HuggingFaceLocalAdapter
from .human_adapter import HumanAdapter
__all__ = [
"HuggingFaceLocalAdapter",
"HumanAdapter",
]

View File

@@ -1,5 +1,7 @@
import asyncio
import functools
import warnings
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, AsyncIterator, Dict, List, Any, Optional
from litellm.types.utils import GenericStreamingChunk, ModelResponse
from litellm.llms.custom_llm import CustomLLM
@@ -8,7 +10,7 @@ from litellm import completion, acompletion
# Try to import HuggingFace dependencies
try:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from transformers import AutoModelForImageTextToText, AutoProcessor
HF_AVAILABLE = True
except ImportError:
HF_AVAILABLE = False
@@ -28,6 +30,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
self.device = device
self.models = {} # Cache for loaded models
self.processors = {} # Cache for loaded processors
self._executor = ThreadPoolExecutor(max_workers=1) # Single thread pool
def _load_model_and_processor(self, model_name: str):
"""Load model and processor if not already cached.
@@ -40,7 +43,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
"""
if model_name not in self.models:
# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model = AutoModelForImageTextToText.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map=self.device,
@@ -48,7 +51,12 @@ class HuggingFaceLocalAdapter(CustomLLM):
)
# Load processor
processor = AutoProcessor.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(
model_name,
min_pixels=3136,
max_pixels=4096 * 2160,
device_map=self.device
)
# Cache them
self.models[model_name] = model
@@ -141,8 +149,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
)
# Move inputs to the same device as model
if torch.cuda.is_available() and self.device != "cpu":
inputs = inputs.to("cuda")
inputs = inputs.to(model.device)
# Generate response
with torch.no_grad():
@@ -182,7 +189,11 @@ class HuggingFaceLocalAdapter(CustomLLM):
ModelResponse with generated text
"""
# Run _generate in thread pool to avoid blocking
generated_text = await asyncio.to_thread(self._generate, **kwargs)
loop = asyncio.get_event_loop()
generated_text = await loop.run_in_executor(
self._executor,
functools.partial(self._generate, **kwargs)
)
return await acompletion(
model=f"huggingface-local/{kwargs['model']}",
@@ -215,7 +226,11 @@ class HuggingFaceLocalAdapter(CustomLLM):
AsyncIterator of GenericStreamingChunk
"""
# Run _generate in thread pool to avoid blocking
generated_text = await asyncio.to_thread(self._generate, **kwargs)
loop = asyncio.get_event_loop()
generated_text = await loop.run_in_executor(
self._executor,
functools.partial(self._generate, **kwargs)
)
generic_streaming_chunk: GenericStreamingChunk = {
"finish_reason": "stop",

View File

@@ -0,0 +1,348 @@
import os
import asyncio
import requests
from typing import List, Dict, Any, Iterator, AsyncIterator
from litellm.types.utils import GenericStreamingChunk, ModelResponse
from litellm.llms.custom_llm import CustomLLM
from litellm import completion, acompletion
class HumanAdapter(CustomLLM):
"""Human Adapter for human-in-the-loop completions.
This adapter sends completion requests to a human completion server
where humans can review and respond to AI requests.
"""
def __init__(self, base_url: str | None = None, timeout: float = 300.0, **kwargs):
"""Initialize the human adapter.
Args:
base_url: Base URL for the human completion server.
Defaults to HUMAN_BASE_URL environment variable or http://localhost:8002
timeout: Timeout in seconds for waiting for human response
**kwargs: Additional arguments
"""
super().__init__()
self.base_url = base_url or os.getenv('HUMAN_BASE_URL', 'http://localhost:8002')
self.timeout = timeout
# Ensure base_url doesn't end with slash
self.base_url = self.base_url.rstrip('/')
def _queue_completion(self, messages: List[Dict[str, Any]], model: str) -> str:
"""Queue a completion request and return the call ID.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Call ID for tracking the request
Raises:
Exception: If queueing fails
"""
try:
response = requests.post(
f"{self.base_url}/queue",
json={"messages": messages, "model": model},
timeout=10
)
response.raise_for_status()
return response.json()["id"]
except requests.RequestException as e:
raise Exception(f"Failed to queue completion request: {e}")
def _wait_for_completion(self, call_id: str) -> Dict[str, Any]:
"""Wait for human to complete the call.
Args:
call_id: ID of the queued completion call
Returns:
Dict containing response and/or tool_calls
Raises:
TimeoutError: If timeout is exceeded
Exception: If completion fails
"""
import time
start_time = time.time()
while True:
try:
# Check status
status_response = requests.get(f"{self.base_url}/status/{call_id}")
status_response.raise_for_status()
status_data = status_response.json()
if status_data["status"] == "completed":
result = {}
if "response" in status_data and status_data["response"]:
result["response"] = status_data["response"]
if "tool_calls" in status_data and status_data["tool_calls"]:
result["tool_calls"] = status_data["tool_calls"]
return result
elif status_data["status"] == "failed":
error_msg = status_data.get("error", "Unknown error")
raise Exception(f"Completion failed: {error_msg}")
# Check timeout
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response after {self.timeout} seconds")
# Wait before checking again
time.sleep(1.0)
except requests.RequestException as e:
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response: {e}")
# Continue trying if we haven't timed out
time.sleep(1.0)
async def _async_wait_for_completion(self, call_id: str) -> Dict[str, Any]:
"""Async version of wait_for_completion.
Args:
call_id: ID of the queued completion call
Returns:
Dict containing response and/or tool_calls
Raises:
TimeoutError: If timeout is exceeded
Exception: If completion fails
"""
import aiohttp
import time
start_time = time.time()
async with aiohttp.ClientSession() as session:
while True:
try:
# Check status
async with session.get(f"{self.base_url}/status/{call_id}") as response:
response.raise_for_status()
status_data = await response.json()
if status_data["status"] == "completed":
result = {}
if "response" in status_data and status_data["response"]:
result["response"] = status_data["response"]
if "tool_calls" in status_data and status_data["tool_calls"]:
result["tool_calls"] = status_data["tool_calls"]
return result
elif status_data["status"] == "failed":
error_msg = status_data.get("error", "Unknown error")
raise Exception(f"Completion failed: {error_msg}")
# Check timeout
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response after {self.timeout} seconds")
# Wait before checking again
await asyncio.sleep(1.0)
except Exception as e:
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response: {e}")
# Continue trying if we haven't timed out
await asyncio.sleep(1.0)
def _generate_response(self, messages: List[Dict[str, Any]], model: str) -> Dict[str, Any]:
"""Generate a human response for the given messages.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Dict containing response and/or tool_calls
"""
# Queue the completion request
call_id = self._queue_completion(messages, model)
# Wait for human response
response = self._wait_for_completion(call_id)
return response
async def _async_generate_response(self, messages: List[Dict[str, Any]], model: str) -> Dict[str, Any]:
"""Async version of _generate_response.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Dict containing response and/or tool_calls
"""
# Queue the completion request (sync operation)
call_id = self._queue_completion(messages, model)
# Wait for human response (async)
response = await self._async_wait_for_completion(call_id)
return response
def completion(self, *args, **kwargs) -> ModelResponse:
"""Synchronous completion method.
Returns:
ModelResponse with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = self._generate_response(messages, model)
# Create ModelResponse with proper structure
from litellm.types.utils import ModelResponse, Choices, Message
import uuid
import time
# Create message content based on response type
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Tool calls response
message = Message(
role="assistant",
content=human_response_data.get("response", ""),
tool_calls=human_response_data["tool_calls"]
)
else:
# Text response
message = Message(
role="assistant",
content=human_response_data.get("response", "")
)
choice = Choices(
finish_reason="stop",
index=0,
message=message
)
result = ModelResponse(
id=f"human-{uuid.uuid4()}",
choices=[choice],
created=int(time.time()),
model=f"human/{model}",
object="chat.completion"
)
return result
async def acompletion(self, *args, **kwargs) -> ModelResponse:
"""Asynchronous completion method.
Returns:
ModelResponse with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = await self._async_generate_response(messages, model)
# Create ModelResponse with proper structure
from litellm.types.utils import ModelResponse, Choices, Message
import uuid
import time
# Create message content based on response type
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Tool calls response
message = Message(
role="assistant",
content=human_response_data.get("response", ""),
tool_calls=human_response_data["tool_calls"]
)
else:
# Text response
message = Message(
role="assistant",
content=human_response_data.get("response", "")
)
choice = Choices(
finish_reason="stop",
index=0,
message=message
)
result = ModelResponse(
id=f"human-{uuid.uuid4()}",
choices=[choice],
created=int(time.time()),
model=f"human/{model}",
object="chat.completion"
)
return result
def streaming(self, *args, **kwargs) -> Iterator[GenericStreamingChunk]:
"""Synchronous streaming method.
Yields:
Streaming chunks with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = self._generate_response(messages, model)
import time
# Handle tool calls vs text response
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Stream tool calls as a single chunk
generic_chunk: GenericStreamingChunk = {
"finish_reason": "tool_calls",
"index": 0,
"is_finished": True,
"text": human_response_data.get("response", ""),
"tool_use": human_response_data["tool_calls"],
"usage": {"completion_tokens": 1, "prompt_tokens": 0, "total_tokens": 1},
}
yield generic_chunk
else:
# Stream text response
response_text = human_response_data.get("response", "")
generic_chunk: GenericStreamingChunk = {
"finish_reason": "stop",
"index": 0,
"is_finished": True,
"text": response_text,
"tool_use": None,
"usage": {"completion_tokens": len(response_text.split()), "prompt_tokens": 0, "total_tokens": len(response_text.split())},
}
yield generic_chunk
async def astreaming(self, *args, **kwargs) -> AsyncIterator[GenericStreamingChunk]:
"""Asynchronous streaming method.
Yields:
Streaming chunks with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response = await self._async_generate_response(messages, model)
# _async_generate_response returns a dict with "response" and optional "tool_calls"
response_text = human_response.get("response", "")
tool_calls = human_response.get("tool_calls")
# Return as a single streaming chunk; tool calls take precedence over plain text
generic_streaming_chunk: GenericStreamingChunk = {
"finish_reason": "tool_calls" if tool_calls else "stop",
"index": 0,
"is_finished": True,
"text": response_text,
"tool_use": tool_calls if tool_calls else None,
"usage": {"completion_tokens": len(response_text.split()), "prompt_tokens": 0, "total_tokens": len(response_text.split())},
}
yield generic_streaming_chunk
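For reference, a minimal sketch of exercising this adapter directly through litellm's custom provider mechanism, mirroring the registration that `ComputerAgent` performs later in this diff. Import paths are assumptions, and the call blocks until a human answers the request in the `human_tool` UI (also added in this diff).

```python
# Sketch only: assumes `python -m agent.human_tool` is running so a human can
# answer, and that HumanAdapter is importable from agent.adapters.
import litellm
from agent.adapters import HumanAdapter  # assumed import path

litellm.custom_provider_map = [
    {"provider": "human", "custom_handler": HumanAdapter()}
]

# Blocks until the request is completed via the human_tool UI / REST API.
resp = litellm.completion(
    model="human/human",
    messages=[{"role": "user", "content": "Click the blue 'Submit' button"}],
)
print(resp.choices[0].message)
```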

View File

@@ -3,18 +3,20 @@ ComputerAgent - Main agent class that selects and runs agent loops
"""
import asyncio
from typing import Dict, List, Any, Optional, AsyncGenerator, Union, cast, Callable, Set
from typing import Dict, List, Any, Optional, AsyncGenerator, Union, cast, Callable, Set, Tuple
from litellm.responses.utils import Usage
from .types import Messages, Computer
from .decorators import find_agent_loop
from .computer_handler import OpenAIComputerHandler, acknowledge_safety_check_callback, check_blocklisted_url
from .types import Messages, AgentCapability
from .decorators import find_agent_config
import json
import litellm
import litellm.utils
import inspect
from .adapters import HuggingFaceLocalAdapter
from .adapters import (
HuggingFaceLocalAdapter,
HumanAdapter,
)
from .callbacks import (
ImageRetentionCallback,
LoggingCallback,
@@ -22,9 +24,14 @@ from .callbacks import (
BudgetManagerCallback,
TelemetryCallback,
)
from .computers import (
AsyncComputerHandler,
is_agent_computer,
make_computer_handler
)
def get_json(obj: Any, max_depth: int = 10) -> Any:
def custom_serializer(o: Any, depth: int = 0, seen: Set[int] = None) -> Any:
def custom_serializer(o: Any, depth: int = 0, seen: Optional[Set[int]] = None) -> Any:
if seen is None:
seen = set()
@@ -117,6 +124,13 @@ def sanitize_message(msg: Any) -> Any:
return sanitized
return msg
def get_output_call_ids(messages: List[Dict[str, Any]]) -> List[str]:
call_ids = []
for message in messages:
if message.get("type") == "computer_call_output" or message.get("type") == "function_call_output":
call_ids.append(message.get("call_id"))
return call_ids
class ComputerAgent:
"""
Main agent class that automatically selects the appropriate agent loop
@@ -204,22 +218,26 @@ class ComputerAgent:
hf_adapter = HuggingFaceLocalAdapter(
device="auto"
)
human_adapter = HumanAdapter()
litellm.custom_provider_map = [
{"provider": "huggingface-local", "custom_handler": hf_adapter}
{"provider": "huggingface-local", "custom_handler": hf_adapter},
{"provider": "human", "custom_handler": human_adapter}
]
litellm.suppress_debug_info = True
# == Initialize computer agent ==
# Find the appropriate agent loop
if custom_loop:
self.agent_loop = custom_loop
self.agent_loop_info = None
self.agent_config_info = None
else:
loop_info = find_agent_loop(model)
if not loop_info:
raise ValueError(f"No agent loop found for model: {model}")
self.agent_loop = loop_info.func
self.agent_loop_info = loop_info
config_info = find_agent_config(model)
if not config_info:
raise ValueError(f"No agent config found for model: {model}")
# Instantiate the agent config class
self.agent_loop = config_info.agent_class()
self.agent_config_info = config_info
self.tool_schemas = []
self.computer_handler = None
@@ -227,10 +245,6 @@ class ComputerAgent:
async def _initialize_computers(self):
"""Initialize computer objects"""
if not self.tool_schemas:
for tool in self.tools:
if hasattr(tool, '_initialized') and not tool._initialized:
await tool.run()
# Process tools and create tool schemas
self.tool_schemas = self._process_tools()
@@ -238,7 +252,7 @@ class ComputerAgent:
computer_handler = None
for schema in self.tool_schemas:
if schema["type"] == "computer":
computer_handler = OpenAIComputerHandler(schema["computer"].interface)
computer_handler = await make_computer_handler(schema["computer"])
break
self.computer_handler = computer_handler
@@ -254,7 +268,7 @@ class ComputerAgent:
for tool in self.tools:
# Check if it's a computer object (has interface attribute)
if hasattr(tool, 'interface'):
if is_agent_computer(tool):
# This is a computer tool - will be handled by agent loop
schemas.append({
"type": "computer",
@@ -389,8 +403,10 @@ class ComputerAgent:
# AGENT OUTPUT PROCESSING
# ============================================================================
async def _handle_item(self, item: Any, computer: Optional[Computer] = None) -> List[Dict[str, Any]]:
async def _handle_item(self, item: Any, computer: Optional[AsyncComputerHandler] = None, ignore_call_ids: Optional[List[str]] = None) -> List[Dict[str, Any]]:
"""Handle each item; may cause a computer action + screenshot."""
if ignore_call_ids and item.get("call_id") and item.get("call_id") in ignore_call_ids:
return []
item_type = item.get("type", None)
@@ -411,6 +427,9 @@ class ComputerAgent:
# Perform computer actions
action = item.get("action")
action_type = action.get("type")
if action_type is None:
print(f"Action type cannot be `None`: action={action}, action_type={action_type}")
return []
# Extract action arguments (all fields except 'type')
action_args = {k: v for k, v in action.items() if k != "type"}
@@ -436,10 +455,12 @@ class ComputerAgent:
acknowledged_checks = []
for check in pending_checks:
check_message = check.get("message", str(check))
if acknowledge_safety_check_callback(check_message):
acknowledged_checks.append(check)
else:
raise ValueError(f"Safety check failed: {check_message}")
acknowledged_checks.append(check)
# TODO: implement a callback for safety checks
# if acknowledge_safety_check_callback(check_message, allow_always=True):
# acknowledged_checks.append(check)
# else:
# raise ValueError(f"Safety check failed: {check_message}")
# Create call output
call_output = {
@@ -452,11 +473,12 @@ class ComputerAgent:
},
}
# Additional URL safety checks for browser environments
if await computer.get_environment() == "browser":
current_url = await computer.get_current_url()
call_output["output"]["current_url"] = current_url
check_blocklisted_url(current_url)
# # Additional URL safety checks for browser environments
# if await computer.get_environment() == "browser":
# current_url = await computer.get_current_url()
# call_output["output"]["current_url"] = current_url
# # TODO: implement a callback for URL safety checks
# # check_blocklisted_url(current_url)
result = [call_output]
await self._on_computer_call_end(item, result)
@@ -511,6 +533,12 @@ class ComputerAgent:
Returns:
AsyncGenerator that yields response chunks
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
capabilities = self.get_capabilities()
if "step" not in capabilities:
raise ValueError(f"Agent loop {self.agent_config_info.agent_class.__name__} does not support step predictions")
await self._initialize_computers()
@@ -525,7 +553,7 @@ class ComputerAgent:
"messages": messages,
"stream": stream,
"model": self.model,
"agent_loop": self.agent_loop.__name__,
"agent_loop": self.agent_config_info.agent_class.__name__,
**merged_kwargs
}
await self._on_run_start(run_kwargs, old_items)
@@ -555,7 +583,7 @@ class ComputerAgent:
}
# Run agent loop iteration
result = await self.agent_loop(
result = await self.agent_loop.predict_step(
**loop_kwargs,
_on_api_start=self._on_api_start,
_on_api_end=self._on_api_end,
@@ -576,9 +604,12 @@ class ComputerAgent:
# Add agent response to new_items
new_items += result.get("output")
# Get output call ids
output_call_ids = get_output_call_ids(result.get("output", []))
# Handle computer actions
for item in result.get("output"):
partial_items = await self._handle_item(item, self.computer_handler)
partial_items = await self._handle_item(item, self.computer_handler, ignore_call_ids=output_call_ids)
new_items += partial_items
# Yield partial response
@@ -591,4 +622,51 @@ class ComputerAgent:
)
}
await self._on_run_end(loop_kwargs, old_items, new_items)
await self._on_run_end(loop_kwargs, old_items, new_items)
async def predict_click(
self,
instruction: str,
image_b64: Optional[str] = None
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Args:
instruction: Instruction for where to click
image_b64: Base64 encoded image (optional, will take screenshot if not provided)
Returns:
None or tuple with (x, y) coordinates
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
capabilities = self.get_capabilities()
if "click" not in capabilities:
raise ValueError(f"Agent loop {self.agent_config_info.agent_class.__name__} does not support click predictions")
if hasattr(self.agent_loop, 'predict_click'):
if not image_b64:
if not self.computer_handler:
raise ValueError("Computer tool or image_b64 is required for predict_click")
image_b64 = await self.computer_handler.screenshot()
return await self.agent_loop.predict_click(
model=self.model,
image_b64=image_b64,
instruction=instruction
)
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by the current agent config.
Returns:
List of capability strings (e.g., ["step", "click"])
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
if hasattr(self.agent_loop, 'get_capabilities'):
return self.agent_loop.get_capabilities()
return ["step"] # Default capability

View File

@@ -9,10 +9,7 @@ import io
import logging
try:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
from presidio_image_redactor import ImageRedactorEngine
# TODO: Add Presidio dependencies
from PIL import Image
PRESIDIO_AVAILABLE = True
except ImportError:
@@ -32,11 +29,7 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
def __init__(
self,
anonymize_text: bool = True,
anonymize_images: bool = True,
entities_to_anonymize: Optional[List[str]] = None,
anonymization_operator: str = "replace",
image_redaction_color: Tuple[int, int, int] = (255, 192, 203) # Pink
# TODO: Any extra kwargs if needed
):
"""
Initialize the PII anonymization callback.
@@ -51,23 +44,10 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
if not PRESIDIO_AVAILABLE:
raise ImportError(
"Presidio is not available. Install with: "
"pip install presidio-analyzer presidio-anonymizer presidio-image-redactor"
"pip install cua-agent[pii-anonymization]"
)
self.anonymize_text = anonymize_text
self.anonymize_images = anonymize_images
self.entities_to_anonymize = entities_to_anonymize
self.anonymization_operator = anonymization_operator
self.image_redaction_color = image_redaction_color
# Initialize Presidio engines
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
self.deanonymizer = DeanonymizeEngine()
self.image_redactor = ImageRedactorEngine()
# Store anonymization mappings for deanonymization
self.anonymization_mappings: Dict[str, Any] = {}
# TODO: Implement __init__
async def on_llm_start(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""
@@ -79,9 +59,6 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
Returns:
List of messages with PII anonymized
"""
if not self.anonymize_text and not self.anonymize_images:
return messages
anonymized_messages = []
for msg in messages:
anonymized_msg = await self._anonymize_message(msg)
@@ -99,9 +76,6 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
Returns:
List of output with PII deanonymized for tool calls
"""
if not self.anonymize_text:
return output
deanonymized_output = []
for item in output:
# Only deanonymize tool calls and computer_call messages
@@ -114,146 +88,9 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
return deanonymized_output
async def _anonymize_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Anonymize PII in a single message."""
msg_copy = message.copy()
# Anonymize text content
if self.anonymize_text:
msg_copy = await self._anonymize_text_content(msg_copy)
# Redact images in computer_call_output
if self.anonymize_images and msg_copy.get("type") == "computer_call_output":
msg_copy = await self._redact_image_content(msg_copy)
return msg_copy
async def _anonymize_text_content(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Anonymize text content in a message."""
msg_copy = message.copy()
# Handle content array
content = msg_copy.get("content", [])
if isinstance(content, str):
anonymized_text, _ = await self._anonymize_text(content)
msg_copy["content"] = anonymized_text
elif isinstance(content, list):
anonymized_content = []
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
text = item.get("text", "")
anonymized_text, _ = await self._anonymize_text(text)
item_copy = item.copy()
item_copy["text"] = anonymized_text
anonymized_content.append(item_copy)
else:
anonymized_content.append(item)
msg_copy["content"] = anonymized_content
return msg_copy
async def _redact_image_content(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Redact PII from images in computer_call_output messages."""
msg_copy = message.copy()
output = msg_copy.get("output", {})
if isinstance(output, dict) and "image_url" in output:
try:
# Extract base64 image data
image_url = output["image_url"]
if image_url.startswith("data:image/"):
# Parse data URL
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
# Load image with PIL
image = Image.open(io.BytesIO(image_data))
# Redact PII from image
redacted_image = self.image_redactor.redact(image, self.image_redaction_color)
# Convert back to base64
buffer = io.BytesIO()
redacted_image.save(buffer, format="PNG")
redacted_data = base64.b64encode(buffer.getvalue()).decode()
# Update image URL
output_copy = output.copy()
output_copy["image_url"] = f"data:image/png;base64,{redacted_data}"
msg_copy["output"] = output_copy
except Exception as e:
logger.warning(f"Failed to redact image: {e}")
return msg_copy
# TODO: Implement _anonymize_message
return message
async def _deanonymize_item(self, item: Dict[str, Any]) -> Dict[str, Any]:
"""Deanonymize PII in tool calls and computer outputs."""
item_copy = item.copy()
# Handle computer_call arguments
if item.get("type") == "computer_call":
args = item_copy.get("args", {})
if isinstance(args, dict):
deanonymized_args = {}
for key, value in args.items():
if isinstance(value, str):
deanonymized_value, _ = await self._deanonymize_text(value)
deanonymized_args[key] = deanonymized_value
else:
deanonymized_args[key] = value
item_copy["args"] = deanonymized_args
return item_copy
async def _anonymize_text(self, text: str) -> Tuple[str, List[RecognizerResult]]:
"""Anonymize PII in text and return the anonymized text and results."""
if not text.strip():
return text, []
try:
# Analyze text for PII
analyzer_results = self.analyzer.analyze(
text=text,
entities=self.entities_to_anonymize,
language="en"
)
if not analyzer_results:
return text, []
# Anonymize the text
anonymized_result = self.anonymizer.anonymize(
text=text,
analyzer_results=analyzer_results,
operators={entity_type: OperatorConfig(self.anonymization_operator)
for entity_type in set(result.entity_type for result in analyzer_results)}
)
# Store mapping for deanonymization
mapping_key = str(hash(text))
self.anonymization_mappings[mapping_key] = {
"original": text,
"anonymized": anonymized_result.text,
"results": analyzer_results
}
return anonymized_result.text, analyzer_results
except Exception as e:
logger.warning(f"Failed to anonymize text: {e}")
return text, []
async def _deanonymize_text(self, text: str) -> Tuple[str, bool]:
"""Attempt to deanonymize text using stored mappings."""
try:
# Look for matching anonymized text in mappings
for mapping_key, mapping in self.anonymization_mappings.items():
if mapping["anonymized"] == text:
return mapping["original"], True
# If no mapping found, return original text
return text, False
except Exception as e:
logger.warning(f"Failed to deanonymize text: {e}")
return text, False
# TODO: Implement _deanonymize_item
return item

View File

@@ -51,12 +51,14 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
within the trajectory gets its own folder with screenshots and responses.
"""
def __init__(self, trajectory_dir: str):
def __init__(self, trajectory_dir: str, reset_on_run: bool = True):
"""
Initialize trajectory saver.
Args:
trajectory_dir: Base directory to save trajectories
reset_on_run: If True, reset trajectory_id/turn/artifact on each run.
If False, continue using existing trajectory_id if set.
"""
self.trajectory_dir = Path(trajectory_dir)
self.trajectory_id: Optional[str] = None
@@ -64,6 +66,7 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
self.current_artifact: int = 0
self.model: Optional[str] = None
self.total_usage: Dict[str, Any] = {}
self.reset_on_run = reset_on_run
# Ensure trajectory directory exists
self.trajectory_dir.mkdir(parents=True, exist_ok=True)
@@ -113,32 +116,38 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
async def on_run_start(self, kwargs: Dict[str, Any], old_items: List[Dict[str, Any]]) -> None:
"""Initialize trajectory tracking for a new run."""
model = kwargs.get("model", "unknown")
model_name_short = model.split("+")[-1].split("/")[-1].lower()[:16]
if "+" in model:
model_name_short = model.split("+")[0].lower()[:4] + "_" + model_name_short
# Only reset trajectory state if reset_on_run is True or no trajectory exists
if self.reset_on_run or not self.trajectory_id:
model_name_short = model.split("+")[-1].split("/")[-1].lower()[:16]
if "+" in model:
model_name_short = model.split("+")[0].lower()[:4] + "_" + model_name_short
# id format: yyyy-mm-dd_model_hhmmss_uuid[:4]
now = datetime.now()
self.trajectory_id = f"{now.strftime('%Y-%m-%d')}_{model_name_short}_{now.strftime('%H%M%S')}_{str(uuid.uuid4())[:4]}"
self.current_turn = 0
self.current_artifact = 0
self.model = model
self.total_usage = {}
# Create trajectory directory
trajectory_path = self.trajectory_dir / self.trajectory_id
trajectory_path.mkdir(parents=True, exist_ok=True)
# Save trajectory metadata
metadata = {
"trajectory_id": self.trajectory_id,
"created_at": str(uuid.uuid1().time),
"status": "running",
"kwargs": kwargs,
}
with open(trajectory_path / "metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
# id format: yyyy-mm-dd_model_hhmmss_uuid[:4]
now = datetime.now()
self.trajectory_id = f"{now.strftime('%Y-%m-%d')}_{model_name_short}_{now.strftime('%H%M%S')}_{str(uuid.uuid4())[:4]}"
self.current_turn = 0
self.current_artifact = 0
self.model = model
self.total_usage = {}
# Create trajectory directory
trajectory_path = self.trajectory_dir / self.trajectory_id
trajectory_path.mkdir(parents=True, exist_ok=True)
# Save trajectory metadata
metadata = {
"trajectory_id": self.trajectory_id,
"created_at": str(uuid.uuid1().time),
"status": "running",
"kwargs": kwargs,
}
with open(trajectory_path / "metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
else:
# Continue with existing trajectory - just update model if needed
self.model = model
@override
async def on_run_end(self, kwargs: Dict[str, Any], old_items: List[Dict[str, Any]], new_items: List[Dict[str, Any]]) -> None:
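The new `reset_on_run` flag makes it possible to append several runs to one trajectory folder. A hedged sketch (the import paths and the `callbacks` keyword are assumptions, not part of this diff):

```python
# Sketch only: assumes ComputerAgent accepts a `callbacks` kwarg and that these
# import paths exist.
from agent import ComputerAgent                      # assumed import path
from agent.callbacks import TrajectorySaverCallback  # assumed import path

saver = TrajectorySaverCallback("trajectories", reset_on_run=False)
agent = ComputerAgent(model="<model>", callbacks=[saver])
# With reset_on_run=False, successive agent runs keep writing turns into the
# same trajectory_id instead of creating a new folder per run.
```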

View File

@@ -94,14 +94,14 @@ def print_action(action_type: str, details: Dict[str, Any], total_cost: float):
# Format action details
args_str = ""
if action_type == "click" and "x" in details and "y" in details:
args_str = f"({details['x']}, {details['y']})"
args_str = f"_{details['button']}({details['x']}, {details['y']})"
elif action_type == "type" and "text" in details:
text = details["text"]
if len(text) > 50:
text = text[:47] + "..."
args_str = f'"{text}"'
elif action_type == "key" and "key" in details:
args_str = f"'{details['key']}'"
args_str = f'("{text}")'
elif action_type == "key" and "text" in details:
args_str = f"('{details['text']}')"
elif action_type == "scroll" and "x" in details and "y" in details:
args_str = f"({details['x']}, {details['y']})"
@@ -120,7 +120,7 @@ async def ainput(prompt: str = ""):
async def chat_loop(agent, model: str, container_name: str, initial_prompt: str = "", show_usage: bool = True):
"""Main chat loop with the agent."""
print_welcome(model, agent.agent_loop.__name__, container_name)
print_welcome(model, agent.agent_config_info.agent_class.__name__, container_name)
history = []
@@ -130,7 +130,7 @@ async def chat_loop(agent, model: str, container_name: str, initial_prompt: str
total_cost = 0
while True:
if history[-1].get("role") != "user":
if len(history) == 0 or history[-1].get("role") != "user":
# Get user input with prompt
print_colored("> ", end="")
user_input = await ainput()
@@ -260,7 +260,12 @@ Examples:
help="Show total cost of the agent runs"
)
parser.add_argument(
"-r", "--max-retries",
type=int,
default=3,
help="Maximum number of retries for the LLM API calls"
)
args = parser.parse_args()
@@ -327,6 +332,7 @@ Examples:
"model": args.model,
"tools": [computer],
"verbosity": 20 if args.verbose else 30, # DEBUG vs WARNING
"max_retries": args.max_retries
}
if args.images > 0:
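The new `-r/--max-retries` flag (default 3) is forwarded to the agent as `max_retries`, alongside the existing kwargs assembled above. A sketch of the equivalent programmatic configuration (import path assumed; `computer` stands for an already-configured cua Computer):

```python
# Sketch only: mirrors the kwargs the CLI builds above.
from agent import ComputerAgent  # assumed import path

agent_kwargs = {
    "model": "<model>",      # placeholder
    "tools": [computer],     # `computer`: an already-configured cua Computer
    "verbosity": 30,         # WARNING; use 20 for DEBUG (matches --verbose)
    "max_retries": 3,        # LLM API retry budget (new -r/--max-retries flag)
}
agent = ComputerAgent(**agent_kwargs)
```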

View File

@@ -0,0 +1,41 @@
"""
Computer handler factory and interface definitions.
This module provides a factory function to create computer handlers from different
computer interface types, supporting both the ComputerHandler protocol and the
Computer library interface.
"""
from .base import AsyncComputerHandler
from .cua import cuaComputerHandler
from .custom import CustomComputerHandler
from computer import Computer as cuaComputer
def is_agent_computer(computer):
"""Check if the given computer is a ComputerHandler or CUA Computer."""
return isinstance(computer, AsyncComputerHandler) or \
isinstance(computer, cuaComputer) or \
(isinstance(computer, dict)) #and "screenshot" in computer)
async def make_computer_handler(computer):
"""
Create a computer handler from a computer interface.
Args:
computer: Either a ComputerHandler instance, Computer instance, or dict of functions
Returns:
ComputerHandler: A computer handler instance
Raises:
ValueError: If the computer type is not supported
"""
if isinstance(computer, AsyncComputerHandler):
return computer
if isinstance(computer, cuaComputer):
computer_handler = cuaComputerHandler(computer)
await computer_handler._initialize()
return computer_handler
if isinstance(computer, dict):
return CustomComputerHandler(computer)
raise ValueError(f"Unsupported computer type: {type(computer)}")

View File

@@ -0,0 +1,70 @@
"""
Base computer interface protocol for agent interactions.
"""
from typing import Protocol, Literal, List, Dict, Any, Union, Optional, runtime_checkable
@runtime_checkable
class AsyncComputerHandler(Protocol):
"""Protocol defining the interface for computer interactions."""
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...
# ==== Anthropic Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
...
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
...
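Because the protocol is marked `@runtime_checkable`, `isinstance()` only verifies that the listed methods exist, which is what `is_agent_computer` relies on. A minimal illustrative implementer (import path assumed, class name hypothetical):

```python
# Sketch only: any object exposing these async methods satisfies the protocol.
from agent.computers.base import AsyncComputerHandler  # assumed import path

class DummyComputer:
    async def get_environment(self): return "linux"
    async def get_dimensions(self): return (1280, 800)
    async def screenshot(self): return ""  # base64 PNG in a real implementation
    async def click(self, x, y, button="left"): ...
    async def double_click(self, x, y): ...
    async def scroll(self, x, y, scroll_x, scroll_y): ...
    async def type(self, text): ...
    async def wait(self, ms=1000): ...
    async def move(self, x, y): ...
    async def keypress(self, keys): ...
    async def drag(self, path): ...
    async def get_current_url(self): return ""
    async def left_mouse_down(self, x=None, y=None): ...
    async def left_mouse_up(self, x=None, y=None): ...

print(isinstance(DummyComputer(), AsyncComputerHandler))  # True
```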

View File

@@ -3,34 +3,45 @@ Computer handler implementation for OpenAI computer-use-preview protocol.
"""
import base64
from typing import Dict, List, Any, Literal
from .types import Computer
from typing import Dict, List, Any, Literal, Union, Optional
from .base import AsyncComputerHandler
from computer import Computer
class OpenAIComputerHandler:
class cuaComputerHandler(AsyncComputerHandler):
"""Computer handler that implements the Computer protocol using the computer interface."""
def __init__(self, computer_interface):
def __init__(self, cua_computer: Computer):
"""Initialize with a computer interface (from tool schema)."""
self.interface = computer_interface
self.cua_computer = cua_computer
self.interface = None
async def _initialize(self):
if hasattr(self.cua_computer, '_initialized') and not self.cua_computer._initialized:
await self.cua_computer.run()
self.interface = self.cua_computer.interface
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
# For now, return a default - this could be enhanced to detect actual environment
return "windows"
# TODO: detect actual environment
return "linux"
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
assert self.interface is not None
screen_size = await self.interface.get_screen_size()
return screen_size["width"], screen_size["height"]
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
assert self.interface is not None
screenshot_bytes = await self.interface.screenshot()
return base64.b64encode(screenshot_bytes).decode('utf-8')
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
assert self.interface is not None
if button == "left":
await self.interface.left_click(x, y)
elif button == "right":
@@ -41,28 +52,36 @@ class OpenAIComputerHandler:
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
assert self.interface is not None
await self.interface.double_click(x, y)
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
assert self.interface is not None
await self.interface.move_cursor(x, y)
await self.interface.scroll(scroll_x, scroll_y)
async def type(self, text: str) -> None:
"""Type text."""
assert self.interface is not None
await self.interface.type_text(text)
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
assert self.interface is not None
import asyncio
await asyncio.sleep(ms / 1000.0)
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
assert self.interface is not None
await self.interface.move_cursor(x, y)
async def keypress(self, keys: List[str]) -> None:
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
assert self.interface is not None
if isinstance(keys, str):
keys = keys.replace("-", "+").split("+")
if len(keys) == 1:
await self.interface.press_key(keys[0])
else:
@@ -71,6 +90,7 @@ class OpenAIComputerHandler:
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
assert self.interface is not None
if not path:
return
@@ -92,16 +112,13 @@ class OpenAIComputerHandler:
# For now, return empty string
return ""
def acknowledge_safety_check_callback(message: str) -> bool:
"""Safety check callback for user acknowledgment."""
response = input(
f"Safety Check Warning: {message}\nDo you want to acknowledge and proceed? (y/n): "
).lower()
return response.strip() == "y"
def check_blocklisted_url(url: str) -> None:
"""Check if URL is blocklisted (placeholder implementation)."""
# This would contain actual URL checking logic
pass
# ==== Anthropic Computer Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
assert self.interface is not None
await self.interface.mouse_down(x, y, button="left")
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
assert self.interface is not None
await self.interface.mouse_up(x, y, button="left")
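Normally `make_computer_handler` constructs and initializes this handler, but a direct sketch of the wrapping step may help (assuming `computer` is an already-configured cua `Computer`; the import path is assumed):

```python
# Sketch only: `computer` is assumed to be a configured cua Computer instance.
from agent.computers.cua import cuaComputerHandler  # assumed import path

async def demo(computer):
    handler = cuaComputerHandler(computer)
    await handler._initialize()        # starts the VM if needed, binds .interface
    png_b64 = await handler.screenshot()
    await handler.keypress("ctrl-s")   # "-" and "+" separators are both accepted
    print(await handler.get_dimensions())
```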

View File

@@ -0,0 +1,209 @@
"""
Custom computer handler implementation that accepts a dictionary of functions.
"""
import base64
from typing import Dict, List, Any, Literal, Union, Optional, Callable
from PIL import Image
import io
from .base import AsyncComputerHandler
class CustomComputerHandler(AsyncComputerHandler):
"""Computer handler that implements the Computer protocol using a dictionary of custom functions."""
def __init__(self, functions: Dict[str, Callable]):
"""
Initialize with a dictionary of functions.
Args:
functions: Dictionary where keys are method names and values are callable functions.
Only 'screenshot' is required, all others are optional.
Raises:
ValueError: If required 'screenshot' function is not provided.
"""
if 'screenshot' not in functions:
raise ValueError("'screenshot' function is required in functions dictionary")
self.functions = functions
self._last_screenshot_size: Optional[tuple[int, int]] = None
async def _call_function(self, func, *args, **kwargs):
"""
Call a function, handling both async and sync functions.
Args:
func: The function to call
*args: Positional arguments to pass to the function
**kwargs: Keyword arguments to pass to the function
Returns:
The result of the function call
"""
import asyncio
import inspect
if callable(func):
if inspect.iscoroutinefunction(func):
return await func(*args, **kwargs)
else:
return func(*args, **kwargs)
else:
return func
async def _get_value(self, attribute: str):
"""
Get value for an attribute, checking both 'get_{attribute}' and '{attribute}' keys.
Args:
attribute: The attribute name to look for
Returns:
The value from the functions dict, called if callable, returned directly if not
"""
# Check for 'get_{attribute}' first
get_key = f"get_{attribute}"
if get_key in self.functions:
return await self._call_function(self.functions[get_key])
# Check for '{attribute}'
if attribute in self.functions:
return await self._call_function(self.functions[attribute])
return None
def _to_b64_str(self, img: Union[bytes, Image.Image, str]) -> str:
"""
Convert image to base64 string.
Args:
img: Image as bytes, PIL Image, or base64 string
Returns:
str: Base64 encoded image string
"""
if isinstance(img, str):
# Already a base64 string
return img
elif isinstance(img, bytes):
# Raw bytes
return base64.b64encode(img).decode('utf-8')
elif isinstance(img, Image.Image):
# PIL Image
buffer = io.BytesIO()
img.save(buffer, format='PNG')
return base64.b64encode(buffer.getvalue()).decode('utf-8')
else:
raise ValueError(f"Unsupported image type: {type(img)}")
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
result = await self._get_value('environment')
if result is None:
return "linux"
assert result in ["windows", "mac", "linux", "browser"]
return result # type: ignore
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
result = await self._get_value('dimensions')
if result is not None:
return result # type: ignore
# Fallback: use last screenshot size if available
if not self._last_screenshot_size:
await self.screenshot()
assert self._last_screenshot_size is not None, "Failed to get screenshot size"
return self._last_screenshot_size
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
result = await self._call_function(self.functions['screenshot'])
b64_str = self._to_b64_str(result) # type: ignore
# Try to extract dimensions for fallback use
try:
if isinstance(result, Image.Image):
self._last_screenshot_size = result.size
elif isinstance(result, bytes):
# Try to decode bytes to get dimensions
img = Image.open(io.BytesIO(result))
self._last_screenshot_size = img.size
except Exception:
# If we can't get dimensions, that's okay
pass
return b64_str
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
if 'click' in self.functions:
await self._call_function(self.functions['click'], x, y, button)
# No-op if not implemented
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
if 'double_click' in self.functions:
await self._call_function(self.functions['double_click'], x, y)
# No-op if not implemented
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
if 'scroll' in self.functions:
await self._call_function(self.functions['scroll'], x, y, scroll_x, scroll_y)
# No-op if not implemented
async def type(self, text: str) -> None:
"""Type text."""
if 'type' in self.functions:
await self._call_function(self.functions['type'], text)
# No-op if not implemented
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
if 'wait' in self.functions:
await self._call_function(self.functions['wait'], ms)
else:
# Default implementation
import asyncio
await asyncio.sleep(ms / 1000.0)
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
if 'move' in self.functions:
await self._call_function(self.functions['move'], x, y)
# No-op if not implemented
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
if 'keypress' in self.functions:
await self._call_function(self.functions['keypress'], keys)
# No-op if not implemented
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
if 'drag' in self.functions:
await self._call_function(self.functions['drag'], path)
# No-op if not implemented
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
if 'get_current_url' in self.functions:
return await self._get_value('current_url') # type: ignore
return "" # Default fallback
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
if 'left_mouse_down' in self.functions:
await self._call_function(self.functions['left_mouse_down'], x, y)
# No-op if not implemented
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
if 'left_mouse_up' in self.functions:
await self._call_function(self.functions['left_mouse_up'], x, y)
# No-op if not implemented
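A self-contained sketch of the dict-based handler: sync and async callables can be mixed, and when no `dimensions` entry is supplied, `get_dimensions()` falls back to the size of the last screenshot. The import path is an assumption.

```python
# Sketch only: minimal CustomComputerHandler with one sync and one async function.
import asyncio
from PIL import Image
from agent.computers.custom import CustomComputerHandler  # assumed import path

clicks = []

async def click(x, y, button="left"):
    clicks.append((x, y, button))

handler = CustomComputerHandler({
    "screenshot": lambda: Image.new("RGB", (1024, 768), "white"),  # sync callable
    "click": click,                                                # async callable
})

async def main():
    print(await handler.get_dimensions())  # (1024, 768), derived from the screenshot
    await handler.click(10, 20)
    await handler.keypress("ctrl+c")       # no-op: no 'keypress' function provided
    print(clicks)                          # [(10, 20, 'left')]

asyncio.run(main())
```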

View File

@@ -2,89 +2,51 @@
Decorators for agent - agent_loop decorator
"""
import asyncio
import inspect
from typing import Dict, List, Any, Callable, Optional
from functools import wraps
from .types import AgentLoopInfo
from typing import List, Optional
from .types import AgentConfigInfo
# Global registry
_agent_loops: List[AgentLoopInfo] = []
_agent_configs: List[AgentConfigInfo] = []
def agent_loop(models: str, priority: int = 0):
def register_agent(models: str, priority: int = 0):
"""
Decorator to register an agent loop function.
Decorator to register an AsyncAgentConfig class.
Args:
models: Regex pattern to match supported models
priority: Priority for loop selection (higher = more priority)
priority: Priority for agent selection (higher = more priority)
"""
def decorator(func: Callable):
# Validate function signature
sig = inspect.signature(func)
required_params = {'messages', 'model'}
func_params = set(sig.parameters.keys())
def decorator(agent_class: type):
# Validate that the class implements AsyncAgentConfig protocol
if not hasattr(agent_class, 'predict_step'):
raise ValueError(f"Agent class {agent_class.__name__} must implement predict_step method")
if not hasattr(agent_class, 'predict_click'):
raise ValueError(f"Agent class {agent_class.__name__} must implement predict_click method")
if not hasattr(agent_class, 'get_capabilities'):
raise ValueError(f"Agent class {agent_class.__name__} must implement get_capabilities method")
if not required_params.issubset(func_params):
missing = required_params - func_params
raise ValueError(f"Agent loop function must have parameters: {missing}")
# Register the loop
loop_info = AgentLoopInfo(
func=func,
# Register the agent config
config_info = AgentConfigInfo(
agent_class=agent_class,
models_regex=models,
priority=priority
)
_agent_loops.append(loop_info)
_agent_configs.append(config_info)
# Sort by priority (highest first)
_agent_loops.sort(key=lambda x: x.priority, reverse=True)
_agent_configs.sort(key=lambda x: x.priority, reverse=True)
@wraps(func)
async def wrapper(*args, **kwargs):
# Wrap the function in an asyncio.Queue for cancellation support
queue = asyncio.Queue()
task = None
try:
# Create a task that can be cancelled
async def run_loop():
try:
result = await func(*args, **kwargs)
await queue.put(('result', result))
except Exception as e:
await queue.put(('error', e))
task = asyncio.create_task(run_loop())
# Wait for result or cancellation
event_type, data = await queue.get()
if event_type == 'error':
raise data
return data
except asyncio.CancelledError:
if task:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass
raise
return wrapper
return agent_class
return decorator
def get_agent_loops() -> List[AgentLoopInfo]:
"""Get all registered agent loops"""
return _agent_loops.copy()
def get_agent_configs() -> List[AgentConfigInfo]:
"""Get all registered agent configs"""
return _agent_configs.copy()
def find_agent_loop(model: str) -> Optional[AgentLoopInfo]:
"""Find the best matching agent loop for a model"""
for loop_info in _agent_loops:
if loop_info.matches_model(model):
return loop_info
def find_agent_config(model: str) -> Optional[AgentConfigInfo]:
"""Find the best matching agent config for a model"""
for config_info in _agent_configs:
if config_info.matches_model(model):
return config_info
return None

View File

@@ -0,0 +1,29 @@
"""
Human-in-the-Loop Completion Tool
This package provides a human-in-the-loop completion system that allows
AI agents to request human assistance for complex decisions or responses.
Components:
- server.py: FastAPI server with completion queue management
- ui.py: Gradio UI for human interaction
- __main__.py: Combined server and UI application
Usage:
# Run the server and UI
python -m agent.human_tool
# Or run components separately
python -m agent.human_tool.server # API server only
python -m agent.human_tool.ui # UI only
"""
from .server import CompletionQueue, completion_queue
from .ui import HumanCompletionUI, create_ui
__all__ = [
"CompletionQueue",
"completion_queue",
"HumanCompletionUI",
"create_ui"
]

View File

@@ -0,0 +1,38 @@
#!/usr/bin/env python3
"""
Human-in-the-Loop Completion Server and UI
This module combines the FastAPI server for handling completion requests
with a Gradio UI for human interaction.
"""
import gradio as gr
from fastapi import FastAPI
from .server import app as fastapi_app
from .ui import create_ui
# Create the Gradio demo
gradio_demo = create_ui()
# Mount Gradio on FastAPI
CUSTOM_PATH = "/gradio"
app = gr.mount_gradio_app(fastapi_app, gradio_demo, path=CUSTOM_PATH)
# Add a redirect from root to Gradio UI
@fastapi_app.get("/")
async def redirect_to_ui():
"""Redirect root to Gradio UI."""
return {
"message": "Human Completion Server is running",
"ui_url": "/gradio",
"api_docs": "/docs"
}
if __name__ == "__main__":
import uvicorn
print("🚀 Starting Human-in-the-Loop Completion Server...")
print("📊 API Server: http://localhost:8002")
print("🎨 Gradio UI: http://localhost:8002/gradio")
print("📚 API Docs: http://localhost:8002/docs")
uvicorn.run(app, host="0.0.0.0", port=8002)

View File

@@ -0,0 +1,234 @@
import asyncio
import uuid
from datetime import datetime
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict
from enum import Enum
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
class CompletionStatus(str, Enum):
PENDING = "pending"
COMPLETED = "completed"
FAILED = "failed"
@dataclass
class CompletionCall:
id: str
messages: List[Dict[str, Any]]
model: str
status: CompletionStatus
created_at: datetime
completed_at: Optional[datetime] = None
response: Optional[str] = None
tool_calls: Optional[List[Dict[str, Any]]] = None
error: Optional[str] = None
class ToolCall(BaseModel):
id: str
type: str = "function"
function: Dict[str, Any]
class CompletionRequest(BaseModel):
messages: List[Dict[str, Any]]
model: str
class CompletionResponse(BaseModel):
response: Optional[str] = None
tool_calls: Optional[List[Dict[str, Any]]] = None
class CompletionQueue:
def __init__(self):
self._queue: Dict[str, CompletionCall] = {}
self._pending_order: List[str] = []
self._lock = asyncio.Lock()
async def add_completion(self, messages: List[Dict[str, Any]], model: str) -> str:
"""Add a completion call to the queue."""
async with self._lock:
call_id = str(uuid.uuid4())
completion_call = CompletionCall(
id=call_id,
messages=messages,
model=model,
status=CompletionStatus.PENDING,
created_at=datetime.now()
)
self._queue[call_id] = completion_call
self._pending_order.append(call_id)
return call_id
async def get_pending_calls(self) -> List[Dict[str, Any]]:
"""Get all pending completion calls."""
async with self._lock:
pending_calls = []
for call_id in self._pending_order:
if call_id in self._queue and self._queue[call_id].status == CompletionStatus.PENDING:
call = self._queue[call_id]
pending_calls.append({
"id": call.id,
"model": call.model,
"created_at": call.created_at.isoformat(),
"messages": call.messages
})
return pending_calls
async def get_call_status(self, call_id: str) -> Optional[Dict[str, Any]]:
"""Get the status of a specific completion call."""
async with self._lock:
if call_id not in self._queue:
return None
call = self._queue[call_id]
result = {
"id": call.id,
"status": call.status.value,
"created_at": call.created_at.isoformat(),
"model": call.model,
"messages": call.messages
}
if call.completed_at:
result["completed_at"] = call.completed_at.isoformat()
if call.response:
result["response"] = call.response
if call.tool_calls:
result["tool_calls"] = call.tool_calls
if call.error:
result["error"] = call.error
return result
async def complete_call(self, call_id: str, response: Optional[str] = None, tool_calls: Optional[List[Dict[str, Any]]] = None) -> bool:
"""Mark a completion call as completed with a response or tool calls."""
async with self._lock:
if call_id not in self._queue:
return False
call = self._queue[call_id]
if call.status != CompletionStatus.PENDING:
return False
call.status = CompletionStatus.COMPLETED
call.completed_at = datetime.now()
call.response = response
call.tool_calls = tool_calls
# Remove from pending order
if call_id in self._pending_order:
self._pending_order.remove(call_id)
return True
async def fail_call(self, call_id: str, error: str) -> bool:
"""Mark a completion call as failed with an error."""
async with self._lock:
if call_id not in self._queue:
return False
call = self._queue[call_id]
if call.status != CompletionStatus.PENDING:
return False
call.status = CompletionStatus.FAILED
call.completed_at = datetime.now()
call.error = error
# Remove from pending order
if call_id in self._pending_order:
self._pending_order.remove(call_id)
return True
async def wait_for_completion(self, call_id: str, timeout: float = 300.0) -> Optional[str]:
"""Wait for a completion call to be completed and return the response."""
start_time = asyncio.get_event_loop().time()
while True:
status = await self.get_call_status(call_id)
if not status:
return None
if status["status"] == CompletionStatus.COMPLETED.value:
return status.get("response")
elif status["status"] == CompletionStatus.FAILED.value:
raise Exception(f"Completion failed: {status.get('error', 'Unknown error')}")
# Check timeout
if asyncio.get_event_loop().time() - start_time > timeout:
await self.fail_call(call_id, "Timeout waiting for human response")
raise TimeoutError("Timeout waiting for human response")
# Wait a bit before checking again
await asyncio.sleep(0.5)
# Global queue instance
completion_queue = CompletionQueue()
# FastAPI app
app = FastAPI(title="Human Completion Server", version="1.0.0")
@app.post("/queue", response_model=Dict[str, str])
async def queue_completion(request: CompletionRequest):
"""Add a completion request to the queue."""
call_id = await completion_queue.add_completion(request.messages, request.model)
return {"id": call_id, "status": "queued"}
@app.get("/pending")
async def list_pending():
"""List all pending completion calls."""
pending_calls = await completion_queue.get_pending_calls()
return {"pending_calls": pending_calls}
@app.get("/status/{call_id}")
async def get_status(call_id: str):
"""Get the status of a specific completion call."""
status = await completion_queue.get_call_status(call_id)
if not status:
raise HTTPException(status_code=404, detail="Completion call not found")
return status
@app.post("/complete/{call_id}")
async def complete_call(call_id: str, response: CompletionResponse):
"""Complete a call with a human response."""
success = await completion_queue.complete_call(
call_id,
response=response.response,
tool_calls=response.tool_calls
)
if success:
return {"status": "success", "message": "Call completed"}
else:
raise HTTPException(status_code=404, detail="Call not found or already completed")
@app.post("/fail/{call_id}")
async def fail_call(call_id: str, error: Dict[str, str]):
"""Mark a call as failed."""
success = await completion_queue.fail_call(call_id, error.get("error", "Unknown error"))
if not success:
raise HTTPException(status_code=404, detail="Completion call not found or already completed")
return {"status": "failed"}
@app.get("/")
async def root():
"""Root endpoint."""
return {"message": "Human Completion Server is running"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8002)
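End to end, the queue is driven purely over HTTP. A sketch of the round trip against the endpoints defined above, assuming the server is running locally (`python -m agent.human_tool`):

```python
# Sketch only: queue a completion, answer it as the human, then read the status.
import requests

BASE = "http://localhost:8002"

# 1. An agent (or test script) queues a completion request.
call_id = requests.post(f"{BASE}/queue", json={
    "model": "human/human",
    "messages": [{"role": "user", "content": "Approve this action?"}],
}).json()["id"]

# 2. A human client lists pending calls and answers one.
pending = requests.get(f"{BASE}/pending").json()["pending_calls"]
print([c["id"] for c in pending])
requests.post(f"{BASE}/complete/{call_id}", json={"response": "Yes, go ahead."})

# 3. The original caller polls /status/{call_id} until it reports "completed".
print(requests.get(f"{BASE}/status/{call_id}").json()["status"])
```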

View File

@@ -0,0 +1,630 @@
import gradio as gr
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime
import requests
from .server import completion_queue
import base64
import io
from PIL import Image
class HumanCompletionUI:
def __init__(self, server_url: str = "http://localhost:8002"):
self.server_url = server_url
self.current_call_id: Optional[str] = None
self.refresh_interval = 2.0 # seconds
self.last_image = None # Store the last image for display
def format_messages_for_chatbot(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Format messages for display in gr.Chatbot with type='messages'."""
formatted = []
for msg in messages:
role = msg.get("role", "user")
content = msg.get("content", "")
tool_calls = msg.get("tool_calls", [])
# Handle different content formats
if isinstance(content, list):
# Multi-modal content - can include text and images
formatted_content = []
for item in content:
if item.get("type") == "text":
text = item.get("text", "")
if text.strip(): # Only add non-empty text
formatted_content.append(text)
elif item.get("type") == "image_url":
image_url = item.get("image_url", {}).get("url", "")
if image_url:
# Check if it's a base64 image or URL
if image_url.startswith("data:image"):
# For base64 images, decode and create gr.Image
try:
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
image = Image.open(io.BytesIO(image_data))
formatted_content.append(gr.Image(value=image))
except Exception as e:
print(f"Error loading image: {e}")
formatted_content.append(f"[Image loading error: {e}]")
else:
# For URL images, create gr.Image with URL
formatted_content.append(gr.Image(value=image_url))
# Determine final content format
if len(formatted_content) == 1:
content = formatted_content[0]
elif len(formatted_content) > 1:
content = formatted_content
else:
content = "[Empty content]"
# Ensure role is valid for Gradio Chatbot
if role not in ["user", "assistant"]:
role = "assistant" if role == "system" else "user"
# Invert roles for better display in human UI context
# (what the AI says becomes "user", what human should respond becomes "assistant")
if role == "user":
role = "assistant"
else:
role = "user"
# Add the main message if it has content
if content and str(content).strip():
formatted.append({"role": role, "content": content})
# Handle tool calls - create separate messages for each tool call
if tool_calls:
for tool_call in tool_calls:
function_name = tool_call.get("function", {}).get("name", "unknown")
arguments_str = tool_call.get("function", {}).get("arguments", "{}")
try:
# Parse arguments to format them nicely
arguments = json.loads(arguments_str)
formatted_args = json.dumps(arguments, indent=2)
except json.JSONDecodeError:
# If parsing fails, use the raw string
formatted_args = arguments_str
# Create a formatted message for the tool call
tool_call_content = f"```json\n{formatted_args}\n```"
formatted.append({
"role": role,
"content": tool_call_content,
"metadata": {"title": f"🛠️ Used {function_name}"}
})
return formatted
def get_pending_calls(self) -> List[Dict[str, Any]]:
"""Get pending calls from the server."""
try:
response = requests.get(f"{self.server_url}/pending", timeout=5)
if response.status_code == 200:
return response.json().get("pending_calls", [])
except Exception as e:
print(f"Error fetching pending calls: {e}")
return []
def complete_call_with_response(self, call_id: str, response: str) -> bool:
"""Complete a call with a text response."""
try:
response_data = {"response": response}
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def complete_call_with_tool_calls(self, call_id: str, tool_calls: List[Dict[str, Any]]) -> bool:
"""Complete a call with tool calls."""
try:
response_data = {"tool_calls": tool_calls}
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def complete_call(self, call_id: str, response: Optional[str] = None, tool_calls: Optional[List[Dict[str, Any]]] = None) -> bool:
"""Complete a call with either a response or tool calls."""
try:
response_data = {}
if response:
response_data["response"] = response
if tool_calls:
response_data["tool_calls"] = tool_calls
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def get_last_image_from_messages(self, messages: List[Dict[str, Any]]) -> Optional[Any]:
"""Extract the last image from the messages for display above conversation."""
last_image = None
for msg in reversed(messages): # Start from the last message
content = msg.get("content", "")
if isinstance(content, list):
for item in reversed(content): # Get the last image in the message
if item.get("type") == "image_url":
image_url = item.get("image_url", {}).get("url", "")
if image_url:
if image_url.startswith("data:image"):
# For base64 images, create a gr.Image component
try:
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
image = Image.open(io.BytesIO(image_data))
return image
except Exception as e:
print(f"Error loading image: {e}")
continue
else:
# For URL images, return the URL
return image_url
return last_image
def refresh_pending_calls(self):
"""Refresh the list of pending calls."""
pending_calls = self.get_pending_calls()
if not pending_calls:
return (
gr.update(choices=["latest"], value="latest"), # dropdown
gr.update(value=None), # image (no image)
gr.update(value=[]), # chatbot (empty messages)
gr.update(interactive=False) # submit button
)
# Sort pending calls by created_at to get oldest first
sorted_calls = sorted(pending_calls, key=lambda x: x.get("created_at", ""))
# Create choices for dropdown
choices = [("latest", "latest")] # Add "latest" option first
for call in sorted_calls:
call_id = call["id"]
model = call.get("model", "unknown")
created_at = call.get("created_at", "")
# Format timestamp
try:
dt = datetime.fromisoformat(created_at.replace('Z', '+00:00'))
time_str = dt.strftime("%H:%M:%S")
except Exception:
time_str = created_at
choice_label = f"{call_id[:8]}... ({model}) - {time_str}"
choices.append((choice_label, call_id))
# Default to "latest" which shows the oldest pending conversation
selected_call_id = "latest"
if selected_call_id == "latest" and sorted_calls:
# Use the oldest call (first in sorted list)
selected_call = sorted_calls[0]
conversation = self.format_messages_for_chatbot(selected_call.get("messages", []))
self.current_call_id = selected_call["id"]
# Get the last image from messages
self.last_image = self.get_last_image_from_messages(selected_call.get("messages", []))
else:
conversation = []
self.current_call_id = None
self.last_image = None
return (
gr.update(choices=choices, value="latest"),
gr.update(value=self.last_image),
gr.update(value=conversation),
gr.update(interactive=bool(choices))
)
def on_call_selected(self, selected_choice):
"""Handle when a call is selected from the dropdown."""
if not selected_choice:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
pending_calls = self.get_pending_calls()
if not pending_calls:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
# Handle "latest" option
if selected_choice == "latest":
# Sort calls by created_at to get oldest first
sorted_calls = sorted(pending_calls, key=lambda x: x.get("created_at", ""))
selected_call = sorted_calls[0] # Get the oldest call
call_id = selected_call["id"]
else:
# Extract call_id from the choice for specific calls
call_id = None
for call in pending_calls:
call_id_short = call["id"][:8]
if call_id_short in selected_choice:
call_id = call["id"]
break
if not call_id:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
# Find the selected call
selected_call = next((c for c in pending_calls if c["id"] == call_id), None)
if not selected_call:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
conversation = self.format_messages_for_chatbot(selected_call.get("messages", []))
self.current_call_id = call_id
# Get the last image from messages
self.last_image = self.get_last_image_from_messages(selected_call.get("messages", []))
return (
gr.update(value=self.last_image),
gr.update(value=conversation),
gr.update(interactive=True)
)
def submit_response(self, response_text: str):
"""Submit a text response to the current call."""
if not self.current_call_id:
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ No call selected") # status
)
if not response_text.strip():
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ Response cannot be empty") # status
)
success = self.complete_call_with_response(self.current_call_id, response_text)
if success:
status_msg = "✅ Response submitted successfully!"
return (
gr.update(value=""), # clear response text
gr.update(value=status_msg) # status
)
else:
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ Failed to submit response") # status
)
def submit_action(self, action_type: str, **kwargs) -> str:
"""Submit a computer action as a tool call."""
if not self.current_call_id:
return "❌ No call selected"
import uuid
# Create tool call structure
action_data = {"type": action_type, **kwargs}
tool_call = {
"id": f"call_{uuid.uuid4().hex[:24]}",
"type": "function",
"function": {
"name": "computer",
"arguments": json.dumps(action_data)
}
}
success = self.complete_call_with_tool_calls(self.current_call_id, [tool_call])
if success:
return f"{action_type.capitalize()} action submitted as tool call"
else:
return f"❌ Failed to submit {action_type} action"
def submit_click_action(self, x: int, y: int, action_type: str = "click", button: str = "left") -> str:
"""Submit a coordinate-based action."""
if action_type == "click":
return self.submit_action(action_type, x=x, y=y, button=button)
else:
return self.submit_action(action_type, x=x, y=y)
def submit_type_action(self, text: str) -> str:
"""Submit a type action."""
return self.submit_action("type", text=text)
def submit_hotkey_action(self, keys: str) -> str:
"""Submit a hotkey action."""
return self.submit_action("keypress", keys=keys)
def submit_description_click(self, description: str, action_type: str = "click", button: str = "left") -> str:
"""Submit a description-based action."""
if action_type == "click":
return self.submit_action(action_type, element_description=description, button=button)
else:
return self.submit_action(action_type, element_description=description)
def wait_for_pending_calls(self, max_seconds: float = 10.0, check_interval: float = 0.2):
"""Wait for pending calls to appear or until max_seconds elapsed.
This method loops and checks for pending calls at regular intervals,
returning as soon as a pending call is found or the maximum wait time is reached.
Args:
max_seconds: Maximum number of seconds to wait
check_interval: How often to check for pending calls (in seconds)
"""
import time
start_time = time.time()
while time.time() - start_time < max_seconds:
# Check if there are any pending calls
pending_calls = self.get_pending_calls()
if pending_calls:
# Found pending calls, return immediately
return self.refresh_pending_calls()
# Wait before checking again
time.sleep(check_interval)
# Max wait time reached, return current state
return self.refresh_pending_calls()
def create_ui():
"""Create the Gradio interface."""
ui_handler = HumanCompletionUI()
with gr.Blocks(title="Human-in-the-Loop Agent Tool") as demo:
gr.Markdown("# 🤖 Human-in-the-Loop Agent Tool")
gr.Markdown("Review AI conversation requests and provide human responses.")
with gr.Row():
with gr.Column(scale=2):
with gr.Group():
screenshot_image = gr.Image(
label="Screenshot",
interactive=False,
height=600
)
# Action type selection for image clicks
with gr.Row():
action_type_radio = gr.Radio(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click",
scale=2
)
action_button_radio = gr.Radio(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left",
visible=True,
scale=1
)
conversation_chatbot = gr.Chatbot(
label="Messages",
type="messages",
height=500,
show_copy_button=True
)
with gr.Column(scale=1):
with gr.Group():
call_dropdown = gr.Dropdown(
label="Select a pending call",
choices=["latest"],
interactive=True,
value="latest"
)
refresh_btn = gr.Button("🔄 Refresh", variant="secondary")
with gr.Group():
response_text = gr.Textbox(
label="Response",
lines=3,
placeholder="Enter your response here..."
)
submit_btn = gr.Button("📤 Submit Response", variant="primary", interactive=False)
# Action Accordions
with gr.Accordion("🖱️ Click Actions", open=False):
with gr.Group():
with gr.Row():
click_x = gr.Number(label="X", value=0, minimum=0)
click_y = gr.Number(label="Y", value=0, minimum=0)
with gr.Row():
click_action_type = gr.Dropdown(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click"
)
click_button = gr.Dropdown(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left"
)
click_submit_btn = gr.Button("Submit Action")
with gr.Accordion("📝 Type Action", open=False):
with gr.Group():
type_text = gr.Textbox(
label="Text to Type",
placeholder="Enter text to type..."
)
type_submit_btn = gr.Button("Submit Type")
with gr.Accordion("⌨️ Keypress Action", open=False):
with gr.Group():
keypress_text = gr.Textbox(
label="Keys",
placeholder="e.g., ctrl+c, alt+tab"
)
keypress_submit_btn = gr.Button("Submit Keypress")
with gr.Accordion("🎯 Description Action", open=False):
with gr.Group():
description_text = gr.Textbox(
label="Element Description",
placeholder="e.g., 'Privacy and security option in left sidebar'"
)
with gr.Row():
description_action_type = gr.Dropdown(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click"
)
description_button = gr.Radio(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left"
)
description_submit_btn = gr.Button("Submit Description Action")
status_display = gr.Textbox(
label="Status",
interactive=False,
value="Ready to receive calls..."
)
# Event handlers
refresh_btn.click(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
call_dropdown.change(
fn=ui_handler.on_call_selected,
inputs=[call_dropdown],
outputs=[screenshot_image, conversation_chatbot, submit_btn]
)
def handle_image_click(action_type, button, evt: gr.SelectData):
    # Receive the radio selections as event inputs; reading `component.value` inside a
    # handler only returns the initial value, not the user's current choice.
    if evt.index is not None:
        x, y = evt.index
        # The chained .then() below refreshes the pending-call list, so no blocking
        # wait is needed here.
        return ui_handler.submit_click_action(x, y, action_type or "click", button or "left")
    return "No coordinates selected"
screenshot_image.select(
    fn=handle_image_click,
    inputs=[action_type_radio, action_button_radio],
    outputs=[status_display]
).then(
    fn=ui_handler.wait_for_pending_calls,
    outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Response submission
submit_btn.click(
fn=ui_handler.submit_response,
inputs=[response_text],
outputs=[response_text, status_display]
).then(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Toggle button radio visibility based on action type
def toggle_button_visibility(action_type):
return gr.update(visible=(action_type == "click"))
action_type_radio.change(
fn=toggle_button_visibility,
inputs=[action_type_radio],
outputs=[action_button_radio]
)
# Action accordion handlers
click_submit_btn.click(
fn=ui_handler.submit_click_action,
inputs=[click_x, click_y, click_action_type, click_button],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
type_submit_btn.click(
fn=ui_handler.submit_type_action,
inputs=[type_text],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
keypress_submit_btn.click(
fn=ui_handler.submit_hotkey_action,
inputs=[keypress_text],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
def handle_description_submit(description, action_type, button):
    if description:
        # The chained .then() below refreshes the pending-call list, so no blocking
        # wait is needed here.
        return ui_handler.submit_description_click(description, action_type, button)
    return "Please enter a description"
description_submit_btn.click(
fn=handle_description_submit,
inputs=[description_text, description_action_type, description_button],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Load initial data
demo.load(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
return demo
if __name__ == "__main__":
demo = create_ui()
demo.queue()
demo.launch(server_name="0.0.0.0", server_port=7860)
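
For reference, here is a minimal sketch of driving this UI from an agent run via the `human/human` model string listed earlier; the omitted computer tool and the task text are illustrative assumptions, not part of this diff.

```python
# Minimal sketch (assumptions flagged inline), not part of this diff.
import asyncio
from agent import ComputerAgent  # ComputerAgent API as used elsewhere in this commit

async def main():
    # A computer handler (e.g. a cua Computer instance) would normally be passed via
    # `tools=[...]`; it is omitted here to keep the sketch self-contained.
    agent = ComputerAgent(model="human/human")
    messages = [{"role": "user", "content": "Open the browser and search for 'trycua'"}]
    async for chunk in agent.run(messages):
        messages += chunk["output"]  # append the Responses-style items produced each step

asyncio.run(main())
```

Each pending step then appears in the dropdown above, where a human can answer with text or submit a click/type/keypress tool call.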

View File

@@ -0,0 +1,77 @@
"""HUD integration for ComputerAgent."""
import logging
from typing import Any, Optional, Dict
from hud import run_job as hud_run_job
from .agent import ComputerAgent
from .adapter import ComputerAgentAdapter
from .computer_handler import HUDComputerHandler
async def run_job(
model: str,
task_or_taskset: Any,
job_name: str,
# Job kwargs
auto_reply_question: bool = False,
adapter_cls: Any = None,
adapter_kwargs: Optional[Dict[str, Any]] = None,
max_steps_per_task: int = 20,
run_parallel: bool = True,
job_metadata: Optional[Dict[str, Any]] = None,
show_progress: bool = True,
max_concurrent_env_creations: Optional[int] = 30, # Limits gym.make calls
max_concurrent_agent_predictions: Optional[int] = None, # No limit on LLM calls
max_concurrent_tasks: Optional[int] = 30, # Limits overall task concurrency
**agent_kwargs: Any
) -> Any:
"""
Run a job using ComputerAgent with the specified model.
Args:
model: Model string for ComputerAgent (e.g., "anthropic/claude-3-5-sonnet-20241022")
task_or_taskset: Task or TaskSet to run
job_name: Name for the job
auto_reply_question: Whether to auto-reply to questions
adapter_cls: Custom adapter class (defaults to ComputerAgentAdapter)
adapter_kwargs: Additional kwargs for the adapter
max_steps_per_task: Maximum steps per task
run_parallel: Whether to run tasks in parallel
job_metadata: Additional metadata for the job
show_progress: Whether to show progress
max_concurrent_env_creations: Max concurrent environment creations
max_concurrent_agent_predictions: Max concurrent agent predictions
max_concurrent_tasks: Max concurrent tasks
**agent_kwargs: Additional kwargs to pass to ComputerAgent
Returns:
Job instance from HUD
"""
# Combine the `verbose` and `verbosity` kwargs: `verbose=True` maps to INFO-level verbosity.
if agent_kwargs.pop("verbose", False):
    agent_kwargs["verbosity"] = logging.INFO
# Lower logging levels mean more verbose output, so report verbose when at or below INFO.
verbose = agent_kwargs.get("verbosity", logging.WARNING) <= logging.INFO
# run job
return await hud_run_job(
agent_cls=ComputerAgent,
agent_kwargs={"model": model, **agent_kwargs},
task_or_taskset=task_or_taskset,
job_name=job_name,
auto_reply_question=auto_reply_question,
adapter_cls=adapter_cls,
adapter_kwargs=adapter_kwargs,
max_steps_per_task=max_steps_per_task,
run_parallel=run_parallel,
job_metadata=job_metadata,
show_progress=show_progress,
verbose=verbose,
max_concurrent_env_creations=max_concurrent_env_creations,
max_concurrent_agent_predictions=max_concurrent_agent_predictions,
max_concurrent_tasks=max_concurrent_tasks
)
__all__ = ["ComputerAgent", "ComputerAgentAdapter", "HUDComputerHandler", "run_job"]

View File

@@ -0,0 +1,121 @@
"""HUD Adapter for ComputerAgent integration."""
from __future__ import annotations
from typing import Any, ClassVar
from hud.adapters.common import CLA, Adapter
from hud.adapters.common.types import (
CLAButton,
CLAKey,
ClickAction,
CustomAction,
DragAction,
MoveAction,
Point,
PressAction,
ResponseAction,
ScreenshotFetch,
ScrollAction,
TypeAction,
WaitAction,
)
class ComputerAgentAdapter(Adapter):
"""Adapter for ComputerAgent to work with HUD."""
KEY_MAP: ClassVar[dict[str, CLAKey]] = {
"return": "enter",
"arrowup": "up",
"arrowdown": "down",
"arrowleft": "left",
"arrowright": "right",
"cmd": "ctrl",
"super": "win",
"meta": "win",
}
BUTTON_MAP: ClassVar[dict[str, CLAButton]] = {
"wheel": "middle",
"middle": "middle",
}
def __init__(self) -> None:
super().__init__()
# ComputerAgent default dimensions (can be overridden)
self.agent_width = 1024
self.agent_height = 768
def _map_key(self, key: str) -> CLAKey:
"""Map a key to its standardized form."""
return self.KEY_MAP.get(key.lower(), key.lower()) # type: ignore
def convert(self, data: Any) -> CLA:
"""Convert a ComputerAgent action to a HUD action."""
try:
action_type = data.get("type")
if action_type == "click":
x, y = data.get("x", 0), data.get("y", 0)
button = data.get("button", "left")
button = self.BUTTON_MAP.get(button, button)
if button is None:
button = "left"
converted_action = ClickAction(point=Point(x=x, y=y), button=button)
elif action_type == "double_click":
x, y = data.get("x", 0), data.get("y", 0)
converted_action = ClickAction(point=Point(x=x, y=y), button="left", pattern=[100])
elif action_type == "scroll":
x, y = int(data.get("x", 0)), int(data.get("y", 0))
scroll_x = int(data.get("scroll_x", 0))
scroll_y = int(data.get("scroll_y", 0))
converted_action = ScrollAction(
point=Point(x=x, y=y), scroll=Point(x=scroll_x, y=scroll_y)
)
elif action_type == "type":
text = data.get("text", "")
converted_action = TypeAction(text=text, enter_after=False)
elif action_type == "wait":
ms = data.get("ms", 1000)
converted_action = WaitAction(time=ms)
elif action_type == "move":
x, y = data.get("x", 0), data.get("y", 0)
converted_action = MoveAction(point=Point(x=x, y=y))
elif action_type == "keypress":
keys = data.get("keys", [])
if isinstance(keys, str):
keys = [keys]
converted_action = PressAction(keys=[self._map_key(k) for k in keys])
elif action_type == "drag":
path = data.get("path", [])
points = [Point(x=p.get("x", 0), y=p.get("y", 0)) for p in path]
converted_action = DragAction(path=points)
elif action_type == "screenshot":
converted_action = ScreenshotFetch()
elif action_type == "response":
converted_action = ResponseAction(text=data.get("text", ""))
elif action_type == "custom":
converted_action = CustomAction(action=data.get("action", ""))
else:
raise ValueError(f"Unsupported action type: {action_type}")
# Add reasoning and logs if available
converted_action.reasoning = data.get("reasoning", "")
converted_action.logs = data.get("logs", "")
return converted_action
except Exception as e:
raise ValueError(f"Invalid action: {data}. Error: {e!s}") from e

View File

@@ -0,0 +1,373 @@
"""HUD ComputerAgent wrapper for OSWorld benchmarking."""
import logging
from typing import Any, Literal, Optional, Union, List, Dict
import asyncio
from agent import ComputerAgent as BaseComputerAgent
from agent.responses import make_failed_tool_call_items
from hud.adapters import Adapter
from hud.agent.base import Agent
from hud.utils.common import Observation
from hud.adapters.common.types import LogType
from hud.types import Gym
from .adapter import ComputerAgentAdapter
from .computer_handler import HUDComputerHandler
logger = logging.getLogger(__name__)
BASE_SYSTEM_PROMPT = """
You are an autonomous computer-using agent. Follow these guidelines:
1. Be decisive and complete tasks without asking for confirmation unless absolutely necessary.
2. Use the computer tools to complete the task and do not stop until the task is complete.
3. Do NOT ask questions like "Should I proceed?" or "Would you like me to continue?" - just proceed with the task.
4. When you find what you're looking for (e.g., a file to upload), proceed with the action directly.
5. Only stop when the task is fully complete or if you encounter an error that prevents completion.
6. Trust that the user wants you to complete the entire task they've requested.
7. You must say "Task completed" when the task is complete.
Remember: You have been given permission to complete the requested task autonomously.
""".strip()
class ComputerAgent(Agent[BaseComputerAgent, dict[str, Any]]):
"""
A ComputerAgent wrapper for HUD integration.
This agent wraps the base ComputerAgent to work with HUD environments,
providing the same interface as OperatorAgent but using ComputerAgent internally.
"""
transfer_gyms: dict[Gym, Gym] = {"qa": "hud-browser"}
def __init__(
self,
model: str = "anthropic/claude-3-5-sonnet-20241022",
environment: Literal["windows", "mac", "linux", "browser"] = "linux",
adapter: Optional[Adapter] = None,
name: Optional[str] = None,
**kwargs: Any,
):
"""
Initialize the ComputerAgent for HUD.
Args:
model: The model string for ComputerAgent (e.g., "anthropic/claude-3-5-sonnet-20241022")
environment: The environment type (windows, mac, linux, browser)
adapter: The adapter to use for preprocessing and postprocessing
name: The name of the agent
**kwargs: Additional arguments passed to ComputerAgent
"""
# Create adapter if not provided
adapter = adapter or ComputerAgentAdapter()
if name is None:
name = f"computeragent-{model.split('/')[-1]}"
# Initialize the base Agent class without client (we'll create it later)
super().__init__(client=None, adapter=adapter, name=name)
self.model = model
self.environment = environment
self.kwargs = kwargs
# Default dimensions
self.width = 1024
self.height = 768
# Update dimensions if adapter is provided
if self.adapter:
self.width = self.adapter.agent_width
self.height = self.adapter.agent_height
# Create HUD computer handler
self.hud_computer = HUDComputerHandler(
environment=environment,
dimensions=(self.width, self.height)
)
# Handle trajectory_dir by adding TrajectorySaverCallback
trajectory_dir = kwargs.pop("trajectory_dir", None)
callbacks = kwargs.get("callbacks", [])
if trajectory_dir:
from agent.callbacks.trajectory_saver import TrajectorySaverCallback
trajectory_callback = TrajectorySaverCallback(trajectory_dir, reset_on_run=False)
callbacks = callbacks + [trajectory_callback]
kwargs["callbacks"] = callbacks
# Initialize ComputerAgent with HUD computer handler
self.computer_agent = BaseComputerAgent(
model=model,
tools=[self.hud_computer],
**kwargs
)
# Set the client to the computer_agent for compatibility
self.client = self.computer_agent
# State tracking
self.conversation_history: List[Dict[str, Any]] = []
self.initial_prompt: Optional[str] = None
# System prompt for computer use tasks
self.base_system_prompt = BASE_SYSTEM_PROMPT
async def fetch_response(self, observation: Observation) -> tuple[list[dict[str, Any]], bool]:
"""
Fetch a response from ComputerAgent based on the observation.
Args:
observation: The preprocessed observation, attributes:
screenshot: Base64 encoded PNG string of the screen
text: Text observation, if available
Returns:
tuple[list[dict[str, Any]], bool]: The list of raw actions and a boolean
indicating whether the agent believes the task is complete.
"""
try:
# Update the computer handler with the current screenshot
if observation.screenshot:
self.hud_computer.update_screenshot(observation.screenshot)
# Set up action callback to capture actions
captured_actions = []
action_done = False
async def action_callback(action: Dict[str, Any]) -> None:
"""Callback to capture actions from ComputerAgent."""
nonlocal captured_actions, action_done
captured_actions.append(action)
# Set the action callback
self.hud_computer.set_action_callback(action_callback)
# Prepare the message for ComputerAgent
if not self.conversation_history:
# First interaction - use the observation text as initial prompt
if observation.text:
self.initial_prompt = observation.text
message = f"{self.base_system_prompt}\n\nTask: {observation.text}"
else:
message = f"{self.base_system_prompt}\n\nPlease analyze the current screen and determine what action to take."
input_content = [
{"type": "input_text", "text": message}
]
# Add screenshot if present
if observation.screenshot:
input_content.append(
{
"type": "input_image",
"image_url": f"data:image/png;base64,{observation.screenshot}",
}
)
self.conversation_history.append({"role": "user", "content": input_content})
else:
# Subsequent interactions - check if last action was computer_call
# If so, add computer_call_output with screenshot instead of user message
last_computer_calls = []
for msg in reversed(self.conversation_history):
if msg.get("type") == "computer_call":
call_id = msg.get("call_id")
if call_id:
# Check if this call_id already has a computer_call_output
has_output = any(
m.get("type") == "computer_call_output" and m.get("call_id") == call_id
for m in self.conversation_history
)
if not has_output:
last_computer_calls.append(call_id)
if last_computer_calls:
    # Reuse the screenshot from the observation when available; otherwise take a new
    # one so screenshot_b64 is always defined before it is used below.
    if observation.screenshot:
        screenshot_b64 = observation.screenshot
    else:
        logger.info("No screenshot in observation, taking a new screenshot")
        screenshot_b64 = await self.hud_computer.screenshot()
# Add computer_call_output for each unresponded computer_call
for call_id in reversed(last_computer_calls): # Maintain order
self.conversation_history.append({
"type": "computer_call_output",
"call_id": call_id,
"output": {
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshot_b64}"
}
})
else:
# No computer_call found, add regular user message
message = "Continue with the task based on the current screen state."
input_content = [
{"type": "input_text", "text": message}
]
# Add screenshot if present
if observation.screenshot:
input_content.append(
{
"type": "input_image",
"image_url": f"data:image/png;base64,{observation.screenshot}",
}
)
self.conversation_history.append({"role": "user", "content": input_content})
# If the last message is a reasoning message, change it to output_text
if (self.conversation_history and
self.conversation_history[-1].get("type") == "reasoning" and
self.conversation_history[-1].get("summary")):
reasoning_msg = self.conversation_history[-1]
summary_texts = []
# Extract all summary_text entries
for summary_item in reasoning_msg["summary"]:
if summary_item.get("type") == "summary_text":
summary_texts.append(summary_item.get("text", ""))
# Convert to message format with output_text
if summary_texts:
converted_message = {
"type": "message",
"role": "assistant",
"content": [
{
"text": " ".join(summary_texts),
"type": "output_text"
}
]
}
# Replace the reasoning message with the converted message
self.conversation_history[-1] = converted_message
# Run ComputerAgent
try:
new_items = []
# ComputerAgent.run returns an async generator
try:
async for result in self.computer_agent.run(self.conversation_history, stream=False):
# if the result has computer_call_output, immediately exit
if result.get("output", []) and result.get("output", [])[-1].get("type") == "computer_call_output":
break
# otherwise add agent output to conversation history
new_items += result["output"]
except Exception as e:
# if the last message is reasoning, change it to output_text
if new_items and new_items[-1].get("type") == "reasoning":
new_items[-1] = {
"type": "message",
"role": "assistant",
"content": [
{
"text": new_items[-1].get("summary", [{}])[0].get("text", ""),
"type": "output_text"
}
]
}
# Check if there are any computer_call items in new_items
computer_calls = [item for item in new_items if item.get("type") == "computer_call"]
if computer_calls:
# Remove computer_call items from new_items
new_items = [item for item in new_items if item.get("type") != "computer_call"]
# Add failed tool call items for each computer call
for computer_call in computer_calls:
tool_input = computer_call.get("action", {})
call_id = computer_call.get("call_id")
new_items.extend(make_failed_tool_call_items(
tool_name="computer",
tool_kwargs=tool_input,
error_message=repr(e),
call_id=call_id
))
else:
# add error message to conversation history (fallback for non-computer-call errors)
new_items.append({
"type": "user",
"content": [
{
"type": "input_text",
"text": f"Error during previous attempted action: {repr(e)}"
}
]
})
# Check if we captured any actions
if captured_actions:
# Extract reasoning from the conversation history
reasoning = ""
# Look for the latest reasoning message
for msg in reversed(new_items):
if msg.get("type") == "reasoning" and msg.get("summary"):
reasoning = " ".join([s.get("text", "") for s in msg["summary"] if s.get("type") == "summary_text"])
break
elif msg.get("type") == "message" and msg.get("role") == "assistant":
content = msg.get("content", [])
if isinstance(content, list):
reasoning = " ".join([c.get("text", "") for c in content if c.get("type") == "output_text"])
break
# update conversation history
self.conversation_history += new_items
# Add reasoning and logs to each action
for action in captured_actions:
action["reasoning"] = reasoning
action["logs"] = {"conversation_length": len(self.conversation_history)}
return captured_actions, False
# Check if the last message is "Task completed"
response_text = ""
for msg in reversed(new_items):
if msg.get("type") == "message" and msg.get("role") == "assistant":
content = msg.get("content", [])
for c in content:
if c.get("type") == "output_text":
response_text = c.get("text", response_text)
break
break
done = "task completed" in response_text.lower()
# update conversation history
self.conversation_history += new_items
response_action = {
"type": "response",
"text": response_text,
"reasoning": response_text,
"logs": {"conversation_length": len(self.conversation_history)}
}
# Check if this indicates task completion or failure
if "task is infeasible" in response_text.lower():
response_action = {"type": "custom", "action": "FAIL"}
done = True
return [response_action], done
except Exception as e:
logger.error(f"Error running ComputerAgent: {e}")
# Return an error response
error_action = {
"type": "response",
"text": f"Error occurred: {str(e)}",
"reasoning": f"ComputerAgent encountered an error: {str(e)}",
"logs": {"error": str(e)}
}
return [error_action], True
except Exception as e:
logger.error(f"Error in fetch_response: {e}")
error_action = {
"type": "response",
"text": f"Error in agent processing: {str(e)}",
"reasoning": f"Agent processing error: {str(e)}",
"logs": {"error": str(e)}
}
return [error_action], True
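
For completeness, a rough sketch of exercising the wrapper outside of a full HUD job; it assumes `Observation` accepts `text`/`screenshot` keyword arguments (as the docstring above implies), that the import path matches this diff's layout, and that a provider API key is configured for the chosen model.

```python
# Rough sketch; not part of this diff and not a substitute for hud.run_job.
import asyncio, base64
from io import BytesIO
from PIL import Image
from hud.utils.common import Observation
from agent.integrations.hud.agent import ComputerAgent  # import path assumed

async def main():
    agent = ComputerAgent(model="anthropic/claude-3-5-sonnet-20241022", environment="linux")
    # Fake a blank 1024x768 screenshot, mirroring the fallback in HUDComputerHandler.
    buf = BytesIO()
    Image.new("RGB", (1024, 768), "white").save(buf, format="PNG")
    obs = Observation(text="Open Firefox", screenshot=base64.b64encode(buf.getvalue()).decode())
    actions, done = await agent.fetch_response(obs)
    print(actions, done)

asyncio.run(main())
```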

View File

@@ -0,0 +1,187 @@
"""HUD Computer Handler for ComputerAgent integration."""
import base64
from io import BytesIO
from typing import Literal, Optional, Any, Dict, Callable
from PIL import Image
from agent.computers import AsyncComputerHandler
class HUDComputerHandler(AsyncComputerHandler):
"""Computer handler that interfaces with HUD environment."""
def __init__(
self,
environment: Literal["windows", "mac", "linux", "browser"] = "linux",
dimensions: tuple[int, int] = (1024, 768),
screenshot_callback: Optional[Callable] = None,
action_callback: Optional[Callable] = None,
):
"""
Initialize HUD computer handler.
Args:
environment: The environment type for HUD
dimensions: Screen dimensions as (width, height)
screenshot_callback: Optional callback to get screenshots from HUD environment
action_callback: Optional callback to execute actions in HUD environment
"""
super().__init__()
self._environment = environment
self._dimensions = dimensions
self._screenshot_callback = screenshot_callback
self._action_callback = action_callback
# Store the last screenshot for reuse
self._last_screenshot: Optional[str] = None
def set_screenshot_callback(self, callback: Callable) -> None:
"""Set the screenshot callback."""
self._screenshot_callback = callback
def set_action_callback(self, callback: Callable) -> None:
"""Set the action callback."""
self._action_callback = callback
def update_screenshot(self, screenshot: str) -> None:
"""Update the stored screenshot (base64 string)."""
self._last_screenshot = screenshot
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
return self._environment # type: ignore
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
return self._dimensions
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
if self._screenshot_callback:
screenshot = await self._screenshot_callback()
if isinstance(screenshot, str):
self._last_screenshot = screenshot
return screenshot
elif isinstance(screenshot, Image.Image):
# Convert PIL Image to base64
buffer = BytesIO()
screenshot.save(buffer, format="PNG")
screenshot_b64 = base64.b64encode(buffer.getvalue()).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
elif isinstance(screenshot, bytes):
screenshot_b64 = base64.b64encode(screenshot).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
# Return last screenshot if available, otherwise create a blank one
if self._last_screenshot:
return self._last_screenshot
# Create a blank screenshot as fallback
blank_image = Image.new('RGB', self._dimensions, color='white')
buffer = BytesIO()
blank_image.save(buffer, format="PNG")
screenshot_b64 = base64.b64encode(buffer.getvalue()).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
if self._action_callback:
await self._action_callback({
"type": "click",
"x": x,
"y": y,
"button": button
})
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "double_click",
"x": x,
"y": y
})
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
if self._action_callback:
await self._action_callback({
"type": "scroll",
"x": x,
"y": y,
"scroll_x": scroll_x,
"scroll_y": scroll_y
})
async def type(self, text: str) -> None:
"""Type text."""
if self._action_callback:
await self._action_callback({
"type": "type",
"text": text
})
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
if self._action_callback:
await self._action_callback({
"type": "wait",
"ms": ms
})
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
if self._action_callback:
await self._action_callback({
"type": "move",
"x": x,
"y": y
})
async def keypress(self, keys: list[str] | str) -> None:
"""Press key combination."""
if isinstance(keys, str):
keys = [keys]
if self._action_callback:
await self._action_callback({
"type": "keypress",
"keys": keys
})
async def drag(self, path: list[dict[str, int]]) -> None:
"""Drag along a path of points."""
if self._action_callback:
await self._action_callback({
"type": "drag",
"path": path
})
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "left_mouse_down",
"x": x,
"y": y
})
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "left_mouse_up",
"x": x,
"y": y
})
async def get_current_url(self) -> str:
"""Get the current URL."""
if self._action_callback:
return await self._action_callback({
"type": "get_current_url"
})
return ""

View File

@@ -7,5 +7,8 @@ from . import anthropic
from . import openai
from . import uitars
from . import omniparser
from . import gta1
from . import composed_grounded
from . import glm45v
__all__ = ["anthropic", "openai", "uitars", "omniparser"]
__all__ = ["anthropic", "openai", "uitars", "omniparser", "gta1", "composed_grounded", "glm45v"]

File diff suppressed because it is too large

View File

@@ -0,0 +1,76 @@
"""
Base protocol for async agent configurations
"""
from typing import Protocol, List, Dict, Any, Optional, Tuple, Union
from abc import abstractmethod
from ..types import AgentCapability
class AsyncAgentConfig(Protocol):
"""Protocol defining the interface for async agent configurations."""
@abstractmethod
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step based on input items.
Args:
messages: Input items following Responses format (message, function_call, computer_call)
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries for failed API calls
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
Returns:
    Dictionary with "output" (a list of output items) and "usage" (token and cost usage info)
"""
...
@abstractmethod
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
None or tuple with (x, y) coordinates
"""
...
@abstractmethod
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings (e.g., ["step", "click"])
"""
...
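
To make the protocol concrete, here is a toy config that satisfies it; the behavior is purely illustrative, and it is deliberately not registered with `@register_agent`.

```python
# Toy implementation of AsyncAgentConfig; illustrative only.
from typing import Any, Dict, List, Optional, Tuple

class EchoAgentConfig:
    """Never touches the computer: every step just reports the task as complete."""

    async def predict_step(
        self, messages, model, tools=None, max_retries=None, stream=False,
        computer_handler=None, _on_api_start=None, _on_api_end=None,
        _on_usage=None, _on_screenshot=None, **kwargs,
    ) -> Dict[str, Any]:
        output = [{
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Task completed"}],
        }]
        return {"output": output, "usage": {"prompt_tokens": 0, "completion_tokens": 0}}

    async def predict_click(self, model: str, image_b64: str, instruction: str) -> Optional[Tuple[int, int]]:
        return None  # this toy config cannot ground clicks

    def get_capabilities(self) -> List[str]:  # AgentCapability values are capability strings
        return ["step"]
```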

View File

@@ -0,0 +1,318 @@
"""
Composed-grounded agent loop implementation that combines grounding and thinking models.
Uses a two-stage approach: grounding model for element detection, thinking model for reasoning.
"""
import uuid
import asyncio
import json
import base64
from typing import Dict, List, Any, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
from ..responses import (
convert_computer_calls_xy2desc,
convert_responses_items_to_completion_messages,
convert_completion_messages_to_responses_items,
convert_computer_calls_desc2xy,
get_all_element_descriptions
)
from ..agent import find_agent_config
GROUNDED_COMPUTER_TOOL_SCHEMA = {
"type": "function",
"function": {
"name": "computer",
"description": "Control a computer by taking screenshots and interacting with UI elements. This tool uses element descriptions to locate and interact with UI elements on the screen (e.g., 'red submit button', 'search text field', 'hamburger menu icon', 'close button in top right corner').",
"parameters": {
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": [
"screenshot",
"click",
"double_click",
"drag",
"type",
"keypress",
"scroll",
"move",
"wait",
"get_current_url",
"get_dimensions",
"get_environment"
],
"description": "The action to perform"
},
"element_description": {
"type": "string",
"description": "Description of the element to interact with (required for click, double_click, move, scroll actions, and as start/end for drag)"
},
"start_element_description": {
"type": "string",
"description": "Description of the element to start dragging from (required for drag action)"
},
"end_element_description": {
"type": "string",
"description": "Description of the element to drag to (required for drag action)"
},
"text": {
"type": "string",
"description": "The text to type (required for type action)"
},
"keys": {
"type": "string",
"description": "Key combination to press (required for keypress action). Single key for individual key press, multiple keys for combinations (e.g., 'ctrl+c')"
},
"button": {
"type": "string",
"description": "The mouse button to use for click action (left, right, wheel, back, forward) Default: left",
},
"scroll_x": {
"type": "integer",
"description": "Horizontal scroll amount for scroll action (positive for right, negative for left)",
},
"scroll_y": {
"type": "integer",
"description": "Vertical scroll amount for scroll action (positive for down, negative for up)",
},
},
"required": [
"action"
]
}
}
}
def _prepare_tools_for_grounded(tool_schemas: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Prepare tools for grounded API format"""
grounded_tools = []
for schema in tool_schemas:
if schema["type"] == "computer":
grounded_tools.append(GROUNDED_COMPUTER_TOOL_SCHEMA)
else:
grounded_tools.append(schema)
return grounded_tools
def get_last_computer_call_image(messages: List[Dict[str, Any]]) -> Optional[str]:
"""Get the last computer call output image from messages."""
for message in reversed(messages):
if (isinstance(message, dict) and
message.get("type") == "computer_call_output" and
isinstance(message.get("output"), dict) and
message["output"].get("type") == "input_image"):
image_url = message["output"].get("image_url", "")
if image_url.startswith("data:image/png;base64,"):
return image_url.split(",", 1)[1]
return None
@register_agent(r".*\+.*", priority=1)
class ComposedGroundedConfig:
"""
Composed-grounded agent configuration that uses both grounding and thinking models.
The model parameter should be in format: "grounding_model+thinking_model"
e.g., "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro"
"""
def __init__(self):
self.desc2xy: Dict[str, Tuple[float, float]] = {}
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Composed-grounded predict step implementation.
Process:
0. Store last computer call image, if none then take a screenshot
1. Convert computer calls from xy to descriptions
2. Convert responses items to completion messages
3. Call thinking model with litellm.acompletion
4. Convert completion messages to responses items
5. Get all element descriptions and populate desc2xy mapping
6. Convert computer calls from descriptions back to xy coordinates
7. Return output and usage
"""
# Parse the composed model
if "+" not in model:
raise ValueError(f"Composed model must be in format 'grounding_model+thinking_model', got: {model}")
grounding_model, thinking_model = model.split("+", 1)
pre_output_items = []
# Step 0: Store last computer call image, if none then take a screenshot
last_image_b64 = get_last_computer_call_image(messages)
if last_image_b64 is None:
# Take a screenshot
screenshot_b64 = await computer_handler.screenshot() # type: ignore
if screenshot_b64:
call_id = uuid.uuid4().hex
pre_output_items += [
{
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "Taking a screenshot to see the current computer screen."
}
]
},
{
"action": {
"type": "screenshot"
},
"call_id": call_id,
"status": "completed",
"type": "computer_call"
},
{
"type": "computer_call_output",
"call_id": call_id,
"output": {
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshot_b64}"
}
},
]
last_image_b64 = screenshot_b64
# Call screenshot callback if provided
if _on_screenshot:
await _on_screenshot(screenshot_b64)
tool_schemas = _prepare_tools_for_grounded(tools) # type: ignore
# Step 1: Convert computer calls from xy to descriptions
input_messages = messages + pre_output_items
messages_with_descriptions = convert_computer_calls_xy2desc(input_messages, self.desc2xy)
# Step 2: Convert responses items to completion messages
completion_messages = convert_responses_items_to_completion_messages(
messages_with_descriptions,
allow_images_in_tool_results=False
)
# Step 3: Call thinking model with litellm.acompletion
api_kwargs = {
"model": thinking_model,
"messages": completion_messages,
"tools": tool_schemas,
"max_retries": max_retries,
"stream": stream,
**kwargs
}
if use_prompt_caching:
api_kwargs["use_prompt_caching"] = use_prompt_caching
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Make the completion call
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(), # type: ignore
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(usage)
# Step 4: Convert completion messages back to responses items format
response_dict = response.model_dump() # type: ignore
choice_messages = [choice["message"] for choice in response_dict["choices"]]
thinking_output_items = []
for choice_message in choice_messages:
thinking_output_items.extend(convert_completion_messages_to_responses_items([choice_message]))
# Step 5: Get all element descriptions and populate desc2xy mapping
element_descriptions = get_all_element_descriptions(thinking_output_items)
if element_descriptions and last_image_b64:
# Use grounding model to predict coordinates for each description
grounding_agent_conf = find_agent_config(grounding_model)
if grounding_agent_conf:
grounding_agent = grounding_agent_conf.agent_class()
for desc in element_descriptions:
coords = await grounding_agent.predict_click(
model=grounding_model,
image_b64=last_image_b64,
instruction=desc
)
if coords:
self.desc2xy[desc] = coords
# Step 6: Convert computer calls from descriptions back to xy coordinates
final_output_items = convert_computer_calls_desc2xy(thinking_output_items, self.desc2xy)
# Step 7: Return output and usage
return {
"output": pre_output_items + final_output_items,
"usage": usage
}
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates using the grounding model.
For composed models, uses only the grounding model part for click prediction.
"""
# Parse the composed model to get grounding model
if "+" not in model:
raise ValueError(f"Composed model must be in format 'grounding_model+thinking_model', got: {model}")
grounding_model, thinking_model = model.split("+", 1)
# Find and use the grounding agent
grounding_agent_conf = find_agent_config(grounding_model)
if grounding_agent_conf:
grounding_agent = grounding_agent_conf.agent_class()
return await grounding_agent.predict_click(
model=grounding_model,
image_b64=image_b64,
instruction=instruction,
**kwargs
)
return None
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["click", "step"]

View File

@@ -0,0 +1,902 @@
"""
GLM-4.5V agent loop implementation using liteLLM for GLM-4.5V model.
Supports vision-language models for computer control with bounding box parsing.
"""
import asyncio
import json
import base64
import re
from typing import Dict, List, Any, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
from litellm.types.utils import ModelResponse
from litellm.responses.litellm_completion_transformation.transformation import LiteLLMCompletionResponsesConfig
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
from ..responses import (
convert_responses_items_to_completion_messages,
convert_completion_messages_to_responses_items,
make_reasoning_item,
make_output_text_item,
make_click_item,
make_double_click_item,
make_drag_item,
make_keypress_item,
make_scroll_item,
make_type_item,
make_wait_item,
make_input_image_item
)
# GLM-4.5V specific constants
GLM_ACTION_SPACE = """
### {left,right,middle}_click
Call rule: `{left,right,middle}_click(start_box='[x,y]', element_info='')`
{
'name': ['left_click', 'right_click', 'middle_click'],
'description': 'Perform a left/right/middle mouse click at the specified coordinates on the screen.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to perform the click, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being clicked.'
}
},
'required': ['start_box']
}
}
### hover
Call rule: `hover(start_box='[x,y]', element_info='')`
{
'name': 'hover',
'description': 'Move the mouse pointer to the specified coordinates without performing any click action.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to move the mouse pointer, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being hovered over.'
}
},
'required': ['start_box']
}
}
### left_double_click
Call rule: `left_double_click(start_box='[x,y]', element_info='')`
{
'name': 'left_double_click',
'description': 'Perform a left mouse double-click at the specified coordinates on the screen.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to perform the double-click, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being double-clicked.'
}
},
'required': ['start_box']
}
}
### left_drag
Call rule: `left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')`
{
'name': 'left_drag',
'description': 'Drag the mouse from starting coordinates to ending coordinates while holding the left mouse button.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Starting coordinates [x1,y1] for the drag operation, normalized to 0-999 range.'
},
'end_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Ending coordinates [x2,y2] for the drag operation, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being dragged.'
}
},
'required': ['start_box', 'end_box']
}
}
### key
Call rule: `key(keys='')`
{
'name': 'key',
'description': 'Simulate pressing a single key or combination of keys on the keyboard.',
'parameters': {
'type': 'object',
'properties': {
'keys': {
'type': 'string',
'description': 'The key or key combination to press. Use '+' to separate keys in combinations (e.g., 'ctrl+c', 'alt+tab').'
}
},
'required': ['keys']
}
}
### type
Call rule: `type(content='')`
{
'name': 'type',
'description': 'Type text content into the currently focused text input field. This action only performs typing and does not handle field activation or clearing.',
'parameters': {
'type': 'object',
'properties': {
'content': {
'type': 'string',
'description': 'The text content to be typed into the active text field.'
}
},
'required': ['content']
}
}
### scroll
Call rule: `scroll(start_box='[x,y]', direction='', step=5, element_info='')`
{
'name': 'scroll',
'description': 'Scroll an element at the specified coordinates in the specified direction by a given number of wheel steps.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] of the element or area to scroll, normalized to 0-999 range.'
},
'direction': {
'type': 'string',
'enum': ['down', 'up'],
'description': 'The direction to scroll: 'down' or 'up'.'
},
'step': {
'type': 'integer',
'default': 5,
'description': 'Number of wheel steps to scroll, default is 5.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being scrolled.'
}
},
'required': ['start_box', 'direction']
}
}
### WAIT
Call rule: `WAIT()`
{
'name': 'WAIT',
'description': 'Wait for 5 seconds before proceeding to the next action.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}
### DONE
Call rule: `DONE()`
{
'name': 'DONE',
'description': 'Indicate that the current task has been completed successfully and no further actions are needed.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}
### FAIL
Call rule: `FAIL()`
{
'name': 'FAIL',
'description': 'Indicate that the current task cannot be completed or is impossible to accomplish.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}"""
def encode_image_to_base64(image_path: str) -> str:
"""Encode image file to base64 string with data URI."""
with open(image_path, "rb") as image_file:
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
return f"data:image/png;base64,{encoded_string}"
def parse_glm_response(response: str) -> Dict[str, Any]:
"""
Parse GLM-4.5V response to extract action and memory.
The special tokens <|begin_of_box|> and <|end_of_box|> mark bounding boxes.
Coordinates are normalized values between 0 and 1000.
"""
# Extract action from between special tokens
pattern = r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>"
match = re.search(pattern, response)
if match:
action = match.group(1).strip()
else:
# Fallback: look for function call patterns
action_pattern = r"[\w_]+\([^)]*\)"
matches = re.findall(action_pattern, response)
action = matches[0] if matches else None
# Extract memory section
memory_pattern = r"Memory:(.*?)$"
memory_match = re.search(memory_pattern, response, re.DOTALL)
memory = memory_match.group(1).strip() if memory_match else "[]"
# Extract action text (everything before Memory:)
action_text_pattern = r'^(.*?)Memory:'
action_text_match = re.search(action_text_pattern, response, re.DOTALL)
action_text = action_text_match.group(1).strip() if action_text_match else response
# Clean up action text by removing special tokens
if action_text:
action_text = action_text.replace("<|begin_of_box|>", "").replace("<|end_of_box|>", "")
return {
"action": action,
"action_text": action_text,
"memory": memory
}
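
A worked example of the parser on a synthetic reply (the string below is illustrative, not captured model output, and the import path is assumed):

```python
from agent.loops.glm45v import parse_glm_response  # import path assumed

sample = (
    "I will open the settings menu. "
    "<|begin_of_box|>left_click(start_box='[512,300]', element_info='Settings gear')<|end_of_box|>\n"
    "Memory:\n[]"
)
parsed = parse_glm_response(sample)
assert parsed["action"] == "left_click(start_box='[512,300]', element_info='Settings gear')"
assert parsed["memory"] == "[]"
assert parsed["action_text"].startswith("I will open the settings menu.")
```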
def get_last_image_from_messages(messages: Messages) -> Optional[str]:
"""Extract the last image from messages for processing."""
for message in reversed(messages):
if isinstance(message, dict):
if message.get("type") == "computer_call_output":
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_url = output.get("image_url", "")
if isinstance(image_url, str) and image_url.startswith("data:image/"):
# Extract base64 part
return image_url.split(",", 1)[1]
elif message.get("role") == "user":
content = message.get("content", [])
if isinstance(content, list):
for item in reversed(content):
if isinstance(item, dict) and item.get("type") == "image_url":
image_url_obj = item.get("image_url", {})
if isinstance(image_url_obj, dict):
image_url = image_url_obj.get("url", "")
if isinstance(image_url, str) and image_url.startswith("data:image/"):
return image_url.split(",", 1)[1]
return None
def convert_responses_items_to_glm45v_pc_prompt(messages: Messages, task: str, memory: str = "") -> List[Dict[str, Any]]:
"""Convert responses items to GLM-4.5V PC prompt format with historical actions.
Args:
messages: List of message items from the conversation
task: The task description
memory: Current memory state
Returns:
List of content items for the prompt (text and image_url items)
"""
action_space = GLM_ACTION_SPACE
# Template head
head_text = f"""You are a GUI Agent, and your primary task is to respond accurately to user requests or questions. In addition to directly answering the user's queries, you can also use tools or perform GUI operations directly until you fulfill the user's request or provide a correct answer. You should carefully read and understand the images and questions provided by the user, and engage in thinking and reflection when appropriate. The coordinates involved are all represented in thousandths (0-999).
# Task:
{task}
# Task Platform
Ubuntu
# Action Space
{action_space}
# Historical Actions and Current Memory
History:"""
# Template tail
tail_text = f"""
Memory:
{memory}
# Output Format
Plain text explanation with action(param='...')
Memory:
[{{"key": "value"}}, ...]
# Some Additional Notes
- I'll give you the most recent 4 history screenshots (shrunk to 50%*50%) along with the historical action steps.
- You should put the key information you *have to remember* in a separate memory part and I'll give it to you in the next round. The content in this part should be a dict list. If you no longer need some given information, you should remove it from the memory. Even if you don't need to remember anything, you should also output an empty list.
- My computer's password is "password", feel free to use it when you need sudo rights.
- For the thunderbird account "anonym-x2024@outlook.com", the password is "gTCI";=@y7|QJ0nDa_kN3Sb&>".
Current Screenshot:
"""
# Build history from messages
history = []
history_images = []
# Group messages into steps
current_step = []
step_num = 0
for message in messages:
msg_type = message.get("type")
if msg_type == "reasoning":
current_step.append(message)
elif msg_type == "message" and message.get("role") == "assistant":
current_step.append(message)
elif msg_type == "computer_call":
current_step.append(message)
elif msg_type == "computer_call_output":
current_step.append(message)
# End of step - process it
if current_step:
step_num += 1
# Extract bot thought from message content
bot_thought = ""
for item in current_step:
if item.get("type") == "message" and item.get("role") == "assistant":
content = item.get("content", [])
for content_item in content:
if content_item.get("type") == "output_text":
bot_thought = content_item.get("text", "")
break
break
# Extract action from computer_call
action_text = ""
for item in current_step:
if item.get("type") == "computer_call":
action = item.get("action", {})
action_type = action.get("type", "")
if action_type == "click":
x, y = action.get("x", 0), action.get("y", 0)
# Convert to 0-999 range (assuming screen dimensions)
# For now, use direct coordinates - this may need adjustment
action_text = f"left_click(start_box='[{x},{y}]')"
elif action_type == "double_click":
x, y = action.get("x", 0), action.get("y", 0)
action_text = f"left_double_click(start_box='[{x},{y}]')"
elif action_type == "right_click":
x, y = action.get("x", 0), action.get("y", 0)
action_text = f"right_click(start_box='[{x},{y}]')"
elif action_type == "drag":
# Handle drag with path
path = action.get("path", [])
if len(path) >= 2:
start = path[0]
end = path[-1]
action_text = f"left_drag(start_box='[{start.get('x', 0)},{start.get('y', 0)}]', end_box='[{end.get('x', 0)},{end.get('y', 0)}]')"
elif action_type == "keypress":
key = action.get("key", "")
action_text = f"key(keys='{key}')"
elif action_type == "type":
text = action.get("text", "")
action_text = f"type(content='{text}')"
elif action_type == "scroll":
x, y = action.get("x", 0), action.get("y", 0)
direction = action.get("direction", "down")
action_text = f"scroll(start_box='[{x},{y}]', direction='{direction}')"
elif action_type == "wait":
action_text = "WAIT()"
break
# Extract screenshot from computer_call_output
screenshot_url = None
for item in current_step:
if item.get("type") == "computer_call_output":
output = item.get("output", {})
if output.get("type") == "input_image":
screenshot_url = output.get("image_url", "")
break
# Store step info
step_info = {
"step_num": step_num,
"bot_thought": bot_thought,
"action_text": action_text,
"screenshot_url": screenshot_url
}
history.append(step_info)
# Store screenshot for last 4 steps
if screenshot_url:
history_images.append(screenshot_url)
current_step = []
# Build content array with head, history, and tail
content = []
current_text = head_text
total_history_steps = len(history)
history_image_count = min(4, len(history_images)) # Last 4 images
for step_idx, step_info in enumerate(history):
step_num = step_info["step_num"]
bot_thought = step_info["bot_thought"]
action_text = step_info["action_text"]
if step_idx < total_history_steps - history_image_count:
# For steps beyond the last 4, use text placeholder
current_text += f"\nstep {step_num}: Screenshot:(Omitted in context.) Thought: {bot_thought}\nAction: {action_text}"
else:
# For the last 4 steps, insert images
current_text += f"\nstep {step_num}: Screenshot:"
content.append({"type": "text", "text": current_text})
# Add image
img_idx = step_idx - (total_history_steps - history_image_count)
if img_idx < len(history_images):
content.append({"type": "image_url", "image_url": {"url": history_images[img_idx]}})
current_text = f" Thought: {bot_thought}\nAction: {action_text}"
# Add tail
current_text += tail_text
content.append({"type": "text", "text": current_text})
return content
def model_dump(obj) -> Dict[str, Any]:
if isinstance(obj, dict):
return {k: model_dump(v) for k, v in obj.items()}
elif hasattr(obj, "model_dump"):
return obj.model_dump()
else:
return obj
def convert_glm_completion_to_responses_items(response: ModelResponse, image_width: int, image_height: int) -> List[Dict[str, Any]]:
"""
Convert GLM-4.5V completion response to responses items format.
Args:
response: LiteLLM ModelResponse from GLM-4.5V
image_width: Original image width for coordinate scaling
image_height: Original image height for coordinate scaling
Returns:
List of response items in the proper format
"""
import uuid
response_items = []
if not response.choices or not response.choices[0].message:
return response_items
message = response.choices[0].message
content = message.content or ""
reasoning_content = getattr(message, 'reasoning_content', None)
# Add reasoning item if present
if reasoning_content:
reasoning_item = model_dump(make_reasoning_item(reasoning_content))
response_items.append(reasoning_item)
# Parse the content to extract action and text
parsed_response = parse_glm_response(content)
action = parsed_response.get("action", "")
action_text = parsed_response.get("action_text", "")
# Add message item with text content (excluding action and memory)
if action_text:
# Remove action from action_text if it's there
clean_text = action_text
if action and action in clean_text:
clean_text = clean_text.replace(action, "").strip()
# Remove memory section
memory_pattern = r"Memory:\s*\[.*?\]\s*$"
clean_text = re.sub(memory_pattern, "", clean_text, flags=re.DOTALL).strip()
if clean_text:
message_item = model_dump(make_output_text_item(clean_text))
response_items.append(message_item)
# Convert action to computer call if present
if action:
call_id = f"call_{uuid.uuid4().hex[:8]}"
# Parse different action types and create appropriate computer calls
if action.startswith("left_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
# Convert from 0-999 to actual pixel coordinates
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_click_item(actual_x, actual_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("right_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_click_item(actual_x, actual_y, button="right"))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("left_double_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_double_click_item(actual_x, actual_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("left_drag"):
start_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
end_match = re.search(r"end_box='?\[(\d+),\s*(\d+)\]'?", action)
if start_match and end_match:
x1, y1 = int(start_match.group(1)), int(start_match.group(2))
x2, y2 = int(end_match.group(1)), int(end_match.group(2))
actual_x1 = int((x1 / 999.0) * image_width)
actual_y1 = int((y1 / 999.0) * image_height)
actual_x2 = int((x2 / 999.0) * image_width)
actual_y2 = int((y2 / 999.0) * image_height)
# Create path for drag operation
drag_path = [{"x": actual_x1, "y": actual_y1}, {"x": actual_x2, "y": actual_y2}]
computer_call = model_dump(make_drag_item(drag_path))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("key"):
key_match = re.search(r"keys='([^']+)'", action)
if key_match:
keys = key_match.group(1)
# Split keys by '+' for key combinations, or use as single key
key_list = keys.split('+') if '+' in keys else [keys]
computer_call = model_dump(make_keypress_item(key_list))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("type"):
content_match = re.search(r"content='([^']*)'", action)
if content_match:
content = content_match.group(1)
computer_call = model_dump(make_type_item(content))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("scroll"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
direction_match = re.search(r"direction='([^']+)'", action)
if coord_match and direction_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
direction = direction_match.group(1)
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
# Convert direction to scroll amounts
scroll_x, scroll_y = 0, 0
if direction == "up":
scroll_y = -5
elif direction == "down":
scroll_y = 5
elif direction == "left":
scroll_x = -5
elif direction == "right":
scroll_x = 5
computer_call = model_dump(make_scroll_item(actual_x, actual_y, scroll_x, scroll_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action == "WAIT()":
computer_call = model_dump(make_wait_item())
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
return response_items
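# Illustrative conversion (values are made up): for a 1920x1080 screenshot, a GLM-4.5V
# action string such as left_click(start_box='[100, 200]') is rescaled from the model's
# 0-999 grid to pixels:
#   x = int((100 / 999.0) * 1920)  # -> 192
#   y = int((200 / 999.0) * 1080)  # -> 216
# and emitted roughly as (the exact fields come from make_click_item):
#   {"type": "computer_call", "call_id": "call_1a2b3c4d", "status": "completed",
#    "action": {"type": "click", "button": "left", "x": 192, "y": 216}}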
@register_agent(models=r"(?i).*GLM-4\.5V.*")
class Glm4vConfig(AsyncAgentConfig):
"""GLM-4.5V agent configuration using liteLLM."""
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step using GLM-4.5V model.
Args:
messages: Input messages following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries for API calls
stream: Whether to stream the response
computer_handler: Computer handler for taking screenshots
use_prompt_caching: Whether to use prompt caching
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
Returns:
Dict with "output" and "usage" keys
"""
# Get the user instruction from the last user message
user_instruction = ""
for message in reversed(messages):
if isinstance(message, dict) and message.get("role") == "user":
content = message.get("content", "")
if isinstance(content, str):
user_instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
user_instruction = item.get("text", "")
break
break
# Get the last image for processing
last_image_b64 = get_last_image_from_messages(messages)
if not last_image_b64 and computer_handler:
# Take a screenshot if no image available
screenshot_b64 = await computer_handler.screenshot()
if screenshot_b64:
last_image_b64 = screenshot_b64
if _on_screenshot:
await _on_screenshot(screenshot_b64)
if not last_image_b64:
raise ValueError("No image available for GLM-4.5V processing")
# Convert responses items to GLM-4.5V PC prompt format with historical actions
prompt_content = convert_responses_items_to_glm45v_pc_prompt(
messages=messages,
task=user_instruction,
memory="[]" # Initialize with empty memory for now
)
# Add the current screenshot to the end
prompt_content.append({
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{last_image_b64}"}
})
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful GUI agent assistant."
},
{
"role": "user",
"content": prompt_content
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
# "max_tokens": 2048,
# "temperature": 0.001,
# "extra_body": {
# "skip_special_tokens": False,
# }
}
# Add API callbacks
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM
response = await litellm.acompletion(**api_kwargs)
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Get image dimensions for coordinate scaling
image_width, image_height = 1920, 1080 # Default dimensions
# Try to get actual dimensions from the image
try:
image_data = base64.b64decode(last_image_b64)
image = Image.open(BytesIO(image_data))
image_width, image_height = image.size
except Exception:
pass # Use default dimensions
# Convert GLM completion response to responses items
response_items = convert_glm_completion_to_responses_items(response, image_width, image_height)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
return agent_response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates using GLM-4.5V model.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple with (x, y) coordinates or None
"""
try:
# Create a simple click instruction prompt
click_prompt = f"""You are a GUI agent. Look at the screenshot and identify where to click for: {instruction}
Respond with a single click action in this format:
left_click(start_box='[x,y]')
Where x,y are coordinates normalized to 0-999 range."""
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful GUI agent assistant."
},
{
"role": "user",
"content": [
{"type": "text", "text": click_prompt},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
]
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": 100,
"temperature": 0.001,
"extra_body": {
"skip_special_tokens": False,
}
}
# Call liteLLM
response = await litellm.acompletion(**api_kwargs)
# Extract response content
response_content = response.choices[0].message.content.strip()
# Parse response for click coordinates
# Look for coordinates in the response, handling special tokens
coord_pattern = r"<\|begin_of_box\|>.*?left_click\(start_box='?\[(\d+),(\d+)\]'?\).*?<\|end_of_box\|>"
match = re.search(coord_pattern, response_content)
if not match:
# Fallback: look for coordinates without special tokens
coord_pattern = r"left_click\(start_box='?\[(\d+),(\d+)\]'?\)"
match = re.search(coord_pattern, response_content)
if match:
x, y = int(match.group(1)), int(match.group(2))
# Get actual image dimensions for scaling
try:
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
image_width, image_height = image.size
except Exception:
# Use default dimensions
image_width, image_height = 1920, 1080
# Convert from 0-999 normalized coordinates to actual pixel coordinates
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
return (actual_x, actual_y)
return None
except Exception as e:
# Log error and return None
print(f"Error in predict_click: {e}")
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["step", "click"]

View File

@@ -0,0 +1,178 @@
"""
GTA1 agent loop implementation for click prediction using litellm.acompletion
Paper: https://arxiv.org/pdf/2507.05791
Code: https://github.com/Yan98/GTA1
"""
import asyncio
import json
import re
import base64
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
from io import BytesIO
import uuid
from PIL import Image
import litellm
import math
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''.strip()
def extract_coordinates(raw_string: str) -> Tuple[float, float]:
"""Extract coordinates from model output."""
try:
matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
return tuple(map(float, matches[0])) # type: ignore
except Exception:
return (0.0, 0.0)
def smart_resize(height: int, width: int, factor: int = 28, min_pixels: int = 3136, max_pixels: int = 8847360) -> Tuple[int, int]:
"""Smart resize function similar to qwen_vl_utils."""
# Calculate the total pixels
total_pixels = height * width
# If already within bounds, return original dimensions
if min_pixels <= total_pixels <= max_pixels:
# Round to nearest factor
new_height = (height // factor) * factor
new_width = (width // factor) * factor
return new_height, new_width
# Calculate scaling factor
if total_pixels > max_pixels:
scale = (max_pixels / total_pixels) ** 0.5
else:
scale = (min_pixels / total_pixels) ** 0.5
# Apply scaling
new_height = int(height * scale)
new_width = int(width * scale)
# Round to nearest factor
new_height = (new_height // factor) * factor
new_width = (new_width // factor) * factor
# Ensure minimum size
new_height = max(new_height, factor)
new_width = max(new_width, factor)
return new_height, new_width
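# Worked example (illustrative): a 1920x1080 screenshot has 2,073,600 pixels, which already
# lies within [min_pixels, max_pixels], so smart_resize only snaps each side down to a
# multiple of factor=28:
#   new_height = (1080 // 28) * 28  # -> 1064
#   new_width  = (1920 // 28) * 28  # -> 1904
# predict_click below then maps model outputs back with scale_x = width / resized_width and
# scale_y = height / resized_height.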
@register_agent(models=r".*GTA1.*")
class GTA1Config(AsyncAgentConfig):
"""GTA1 agent configuration implementing AsyncAgentConfig protocol for click prediction."""
def __init__(self):
self.current_model = None
self.last_screenshot_b64 = None
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
raise NotImplementedError()
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[float, float]]:
"""
Predict click coordinates using GTA1 model via litellm.acompletion.
Args:
model: The GTA1 model name
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# Decode base64 image
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
width, height = image.width, image.height
# Smart resize the image (similar to qwen_vl_utils)
resized_height, resized_width = smart_resize(
height, width,
factor=28, # Default factor for Qwen models
min_pixels=3136,
max_pixels=4096 * 2160
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height
# Convert resized image back to base64
buffered = BytesIO()
resized_image.save(buffered, format="PNG")
resized_image_b64 = base64.b64encode(buffered.getvalue()).decode()
# Prepare system and user messages
system_message = {
"role": "system",
"content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width)
}
user_message = {
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{resized_image_b64}"
}
},
{
"type": "text",
"text": instruction
}
]
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": [system_message, user_message],
"max_tokens": 32,
"temperature": 0.0,
**kwargs
}
# Use liteLLM acompletion
response = await litellm.acompletion(**api_kwargs)
# Extract response text
output_text = response.choices[0].message.content # type: ignore
# Extract and rescale coordinates
pred_x, pred_y = extract_coordinates(output_text) # type: ignore
pred_x *= scale_x
pred_y *= scale_y
return (math.floor(pred_x), math.floor(pred_y))
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["click"]

View File

@@ -0,0 +1,6 @@
model,predict_step,predict_point
anthropic,,
openai,,
uitars,,
omniparser,,
gta1,,

View File

@@ -1,5 +1,7 @@
"""
OmniParser agent loop implementation using liteLLM
Paper: https://arxiv.org/abs/2408.00203
Code: https://github.com/microsoft/OmniParser
"""
import asyncio
@@ -9,8 +11,9 @@ import litellm
import inspect
import base64
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
SOM_TOOL_SCHEMA = {
"type": "function",
@@ -246,94 +249,185 @@ async def replace_computer_call_with_function(item: Dict[str, Any], xy2id: Dict[
return [item]
@agent_loop(models=r"omniparser\+.*|omni\+.*", priority=10)
async def omniparser_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
@register_agent(models=r"omniparser\+.*|omni\+.*", priority=2)
class OmniparserConfig(AsyncAgentConfig):
"""Omniparser agent configuration implementing AsyncAgentConfig protocol."""
Supports OpenAI's computer use preview models.
"""
if not OMNIPARSER_AVAILABLE:
raise ValueError("omniparser loop requires som to be installed. Install it with `pip install cua-som`.")
tools = tools or []
llm_model = model.split('+')[-1]
# Prepare tools for OpenAI API
openai_tools, id2xy = _prepare_tools_for_omniparser(tools)
# Find last computer_call_output
last_computer_call_output = get_last_computer_call_output(messages)
if last_computer_call_output:
image_url = last_computer_call_output.get("output", {}).get("image_url", "")
image_data = image_url.split(",")[-1]
if image_data:
parser = get_parser()
result = parser.parse(image_data)
if _on_screenshot:
await _on_screenshot(result.annotated_image_base64, "annotated_image")
for element in result.elements:
id2xy[element.id] = ((element.bbox.x1 + element.bbox.x2) / 2, (element.bbox.y1 + element.bbox.y2) / 2)
# handle computer calls -> function calls
new_messages = []
for message in messages:
if not isinstance(message, dict):
message = message.__dict__
new_messages += await replace_computer_call_with_function(message, id2xy)
messages = new_messages
# Prepare API call kwargs
api_kwargs = {
"model": llm_model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
}
) -> Dict[str, Any]:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
Supports OpenAI's computer use preview models.
"""
if not OMNIPARSER_AVAILABLE:
raise ValueError("omniparser loop requires som to be installed. Install it with `pip install cua-som`.")
tools = tools or []
llm_model = model.split('+')[-1]
# Prepare tools for OpenAI API
openai_tools, id2xy = _prepare_tools_for_omniparser(tools)
# Find last computer_call_output
last_computer_call_output = get_last_computer_call_output(messages) # type: ignore
if last_computer_call_output:
image_url = last_computer_call_output.get("output", {}).get("image_url", "")
image_data = image_url.split(",")[-1]
if image_data:
parser = get_parser()
result = parser.parse(image_data)
if _on_screenshot:
await _on_screenshot(result.annotated_image_base64, "annotated_image")
for element in result.elements:
id2xy[element.id] = ((element.bbox.x1 + element.bbox.x2) / 2, (element.bbox.y1 + element.bbox.y2) / 2)
# handle computer calls -> function calls
new_messages = []
for message in messages:
if not isinstance(message, dict):
message = message.__dict__
new_messages += await replace_computer_call_with_function(message, id2xy) # type: ignore
messages = new_messages
# Prepare API call kwargs
api_kwargs = {
"model": llm_model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"truncation": "auto",
"num_retries": max_retries,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
print(str(api_kwargs)[:1000])
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(), # type: ignore
"response_cost": response._hidden_params.get("response_cost", 0.0), # type: ignore
}
if _on_usage:
await _on_usage(usage)
# handle som function calls -> xy computer calls
new_output = []
for i in range(len(response.output)): # type: ignore
new_output += await replace_function_with_computer_call(response.output[i].model_dump(), id2xy) # type: ignore
return {
"output": new_output,
"usage": usage
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[float, float]]:
"""
Predict click coordinates using OmniParser and LLM.
Uses OmniParser to annotate the image with element IDs, then uses LLM
to identify the correct element ID based on the instruction.
"""
if not OMNIPARSER_AVAILABLE:
return None
# Parse the image with OmniParser to get annotated image and elements
parser = get_parser()
result = parser.parse(image_b64)
# Extract the LLM model from composed model string
llm_model = model.split('+')[-1]
# Create system prompt for element ID prediction
SYSTEM_PROMPT = f'''
You are an expert UI element locator. Given a GUI image annotated with numerical IDs over each interactable element, along with a user's element description, provide the ID of the specified element.
The image shows UI elements with numbered overlays. Each number corresponds to a clickable/interactable element.
Output only the element ID as a single integer.
'''.strip()
# Prepare messages for LLM
messages = [
{
"role": "system",
"content": SYSTEM_PROMPT
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{result.annotated_image_base64}"
}
},
{
"type": "text",
"text": f"Find the element: {instruction}"
}
]
}
]
# Call LLM to predict element ID
response = await litellm.acompletion(
model=llm_model,
messages=messages,
max_tokens=10,
temperature=0.1
)
# Extract element ID from response
response_text = response.choices[0].message.content.strip() # type: ignore
# Try to parse the element ID
try:
element_id = int(response_text)
# Find the element with this ID and return its center coordinates
for element in result.elements:
if element.id == element_id:
center_x = (element.bbox.x1 + element.bbox.x2) / 2
center_y = (element.bbox.y1 + element.bbox.y2) / 2
return (center_x, center_y)
except ValueError:
# If we can't parse the ID, return None
pass
return None
print(str(api_kwargs)[:1000])
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
response.usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response.usage)
# handle som function calls -> xy computer calls
new_output = []
for i in range(len(response.output)):
new_output += await replace_function_with_computer_call(response.output[i].model_dump(), id2xy)
response.output = new_output
return response
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["step"]

View File

@@ -3,31 +3,49 @@ OpenAI computer-use-preview agent loop implementation using liteLLM
"""
import asyncio
import base64
import json
from typing import Dict, List, Any, AsyncGenerator, Union, Optional
from io import BytesIO
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
import litellm
from PIL import Image
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
def _map_computer_tool_to_openai(computer_tool: Any) -> Dict[str, Any]:
async def _map_computer_tool_to_openai(computer_handler: Any) -> Dict[str, Any]:
"""Map a computer tool to OpenAI's computer-use-preview tool schema"""
# Get dimensions from the computer handler
try:
width, height = await computer_handler.get_dimensions()
except Exception:
# Fallback to default dimensions if method fails
width, height = 1024, 768
# Get environment from the computer handler
try:
environment = await computer_handler.get_environment()
except Exception:
# Fallback to default environment if method fails
environment = "linux"
return {
"type": "computer_use_preview",
"display_width": getattr(computer_tool, 'display_width', 1024),
"display_height": getattr(computer_tool, 'display_height', 768),
"environment": getattr(computer_tool, 'environment', "linux") # mac, windows, linux, browser
"display_width": width,
"display_height": height,
"environment": environment # mac, windows, linux, browser
}
def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
async def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
"""Prepare tools for OpenAI API format"""
openai_tools = []
for schema in tool_schemas:
if schema["type"] == "computer":
# Map computer tool to OpenAI format
openai_tools.append(_map_computer_tool_to_openai(schema["computer"]))
computer_tool = await _map_computer_tool_to_openai(schema["computer"])
openai_tools.append(computer_tool)
elif schema["type"] == "function":
# Function tools use OpenAI-compatible schema directly (liteLLM expects this format)
# Schema should be: {type, name, description, parameters}
@@ -36,60 +54,182 @@ def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
return openai_tools
@agent_loop(models=r".*computer-use-preview.*", priority=10)
async def openai_computer_use_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
@register_agent(models=r".*computer-use-preview.*")
class OpenAIComputerUseConfig:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
OpenAI computer-use-preview agent configuration using liteLLM responses.
Supports OpenAI's computer use preview models.
"""
tools = tools or []
# Prepare tools for OpenAI API
openai_tools = _prepare_tools_for_openai(tools)
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
) -> Dict[str, Any]:
"""
Predict the next step based on input items.
Args:
messages: Input items following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
Returns:
Dictionary with "output" (output items) and "usage" array
"""
tools = tools or []
# Prepare tools for OpenAI API
openai_tools = await _prepare_tools_for_openai(tools)
# Extract usage information
response.usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response.usage)
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(usage)
# Return in the expected format
output_dict = response.model_dump()
output_dict["usage"] = usage
return output_dict
return response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Uses OpenAI computer-use-preview with manually constructed input items
and a prompt that instructs the agent to only output clicks.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# TODO: use computer tool to get dimensions + environment
# Manually construct input items with image and click instruction
input_items = [
{
"role": "user",
"content": f"You are a UI grounding expert. Look at the image and {instruction}. Output ONLY a click action on the target element. No explanations, confirmations, or additional text."
},
{
"role": "user",
"content": [
{
"type": "input_image",
"image_url": f"data:image/png;base64,{image_b64}"
}
]
}
]
# Get image dimensions from base64 data
try:
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
display_width, display_height = image.size
except Exception:
# Fallback to default dimensions if image parsing fails
display_width, display_height = 1024, 768
# Prepare computer tool for click actions
computer_tool = {
"type": "computer_use_preview",
"display_width": display_width,
"display_height": display_height,
"environment": "windows"
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": input_items,
"tools": [computer_tool],
"stream": False,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"max_tokens": 100 # Keep response short for click prediction
}
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Extract click coordinates from response output
output_dict = response.model_dump()
output_items = output_dict.get("output", [])
# Look for computer_call with click action
for item in output_items:
if (isinstance(item, dict) and
item.get("type") == "computer_call" and
isinstance(item.get("action"), dict)):
action = item["action"]
if action.get("type") == "click":
x = action.get("x")
y = action.get("y")
if x is not None and y is not None:
return (int(x), int(y))
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["click", "step"]

View File

@@ -1,5 +1,7 @@
"""
UITARS agent loop implementation using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B
Paper: https://arxiv.org/abs/2501.12326
Code: https://github.com/bytedance/UI-TARS
"""
import asyncio
@@ -9,7 +11,7 @@ import base64
import math
import re
import ast
from typing import Dict, List, Any, AsyncGenerator, Union, Optional
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
@@ -21,8 +23,8 @@ from openai.types.responses.response_input_param import ComputerCallOutput
from openai.types.responses.response_output_message_param import ResponseOutputMessageParam
from openai.types.responses.response_reasoning_item_param import ResponseReasoningItemParam, Summary
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..responses import (
make_reasoning_item,
make_output_text_item,
@@ -79,6 +81,18 @@ Action: ...
{instruction}
"""
GROUNDING_UITARS_PROMPT_TEMPLATE = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Action: ...
## Action Space
click(point='<|box_start|>(x1,y1)<|box_end|>')
## User Instruction
{instruction}"""
def round_by_factor(number: float, factor: int) -> int:
"""Returns the closest integer to 'number' that is divisible by 'factor'."""
@@ -501,188 +515,301 @@ def convert_uitars_messages_to_litellm(messages: Messages) -> List[Dict[str, Any
return litellm_messages
@agent_loop(models=r"(?i).*ui-?tars.*", priority=10)
async def uitars_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
@register_agent(models=r"(?i).*ui-?tars.*")
class UITARSConfig:
"""
UITARS agent loop using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B model.
UITARS agent configuration using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B model.
Supports UITARS vision-language models for computer control.
"""
tools = tools or []
# Create response items
response_items = []
# Find computer tool for screen dimensions
computer_tool = None
for tool_schema in tools:
if tool_schema["type"] == "computer":
computer_tool = tool_schema["computer"]
break
# Get screen dimensions
screen_width, screen_height = 1024, 768
if computer_tool:
try:
screen_width, screen_height = await computer_tool.get_dimensions()
except:
pass
# Process messages to extract instruction and image
instruction = ""
image_data = None
# Convert messages to list if string
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
# Extract instruction and latest screenshot
for message in reversed(messages):
if isinstance(message, dict):
content = message.get("content", "")
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step based on input messages.
Args:
messages: Input messages following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
# Handle different content formats
if isinstance(content, str):
if not instruction and message.get("role") == "user":
instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict):
if item.get("type") == "text" and not instruction:
instruction = item.get("text", "")
elif item.get("type") == "image_url" and not image_data:
image_url = item.get("image_url", {})
if isinstance(image_url, dict):
image_data = image_url.get("url", "")
else:
image_data = image_url
Returns:
Dictionary with "output" (output items) and "usage" array
"""
tools = tools or []
# Also check for computer_call_output with screenshots
if message.get("type") == "computer_call_output" and not image_data:
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_data = output.get("image_url", "")
# Create response items
response_items = []
if instruction and image_data:
break
if not instruction:
instruction = "Help me complete this task by analyzing the screen and taking appropriate actions."
# Create prompt
user_prompt = UITARS_PROMPT_TEMPLATE.format(
instruction=instruction,
action_space=UITARS_ACTION_SPACE,
language="English"
)
# Convert conversation history to LiteLLM format
history_messages = convert_uitars_messages_to_litellm(messages)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
}
]
# Add current user instruction with screenshot
current_user_message = {
"role": "user",
"content": [
{"type": "text", "text": user_prompt},
# Find computer tool for screen dimensions
computer_tool = None
for tool_schema in tools:
if tool_schema["type"] == "computer":
computer_tool = tool_schema["computer"]
break
# Get screen dimensions
screen_width, screen_height = 1024, 768
if computer_tool:
try:
screen_width, screen_height = await computer_tool.get_dimensions()
except Exception:
pass
# Process messages to extract instruction and image
instruction = ""
image_data = None
# Convert messages to list if string
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
# Extract instruction and latest screenshot
for message in reversed(messages):
if isinstance(message, dict):
content = message.get("content", "")
# Handle different content formats
if isinstance(content, str):
if not instruction and message.get("role") == "user":
instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict):
if item.get("type") == "text" and not instruction:
instruction = item.get("text", "")
elif item.get("type") == "image_url" and not image_data:
image_url = item.get("image_url", {})
if isinstance(image_url, dict):
image_data = image_url.get("url", "")
else:
image_data = image_url
# Also check for computer_call_output with screenshots
if message.get("type") == "computer_call_output" and not image_data:
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_data = output.get("image_url", "")
if instruction and image_data:
break
if not instruction:
instruction = "Help me complete this task by analyzing the screen and taking appropriate actions."
# Create prompt
user_prompt = UITARS_PROMPT_TEMPLATE.format(
instruction=instruction,
action_space=UITARS_ACTION_SPACE,
language="English"
)
# Convert conversation history to LiteLLM format
history_messages = convert_uitars_messages_to_litellm(messages)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
}
]
}
litellm_messages.append(current_user_message)
# Process image for UITARS
if not image_data:
# Take screenshot if none found in messages
if computer_handler:
image_data = await computer_handler.screenshot()
await _on_screenshot(image_data, "screenshot_before")
# Add screenshot to output items so it can be retained in history
response_items.append(make_input_image_item(image_data))
else:
raise ValueError("No screenshot found in messages and no computer_handler provided")
processed_image, original_width, original_height = process_image_for_uitars(image_data)
encoded_image = pil_to_base64(processed_image)
# Add conversation history
if history_messages:
litellm_messages.extend(history_messages)
else:
litellm_messages.append({
"role": "user",
# Add current user instruction with screenshot
current_user_message = {
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
{"type": "text", "text": user_prompt},
]
})
}
litellm_messages.append(current_user_message)
# Process image for UITARS
if not image_data:
# Take screenshot if none found in messages
if computer_handler:
image_data = await computer_handler.screenshot()
if _on_screenshot:
await _on_screenshot(image_data, "screenshot_before")
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": kwargs.get("max_tokens", 500),
"temperature": kwargs.get("temperature", 0.0),
"do_sample": kwargs.get("temperature", 0.0) > 0.0,
"num_retries": max_retries,
**{k: v for k, v in kwargs.items() if k not in ["max_tokens", "temperature"]}
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse UITARS response
parsed_responses = parse_uitars_response(response_content, original_width, original_height)
# Convert to computer actions
computer_actions = convert_to_computer_actions(parsed_responses, original_width, original_height)
# Add computer actions to response items
thought = parsed_responses[0].get("thought", "")
if thought:
response_items.append(make_reasoning_item(thought))
response_items.extend(computer_actions)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Add screenshot to output items so it can be retained in history
response_items.append(make_input_image_item(image_data))
else:
raise ValueError("No screenshot found in messages and no computer_handler provided")
processed_image, original_width, original_height = process_image_for_uitars(image_data)
encoded_image = pil_to_base64(processed_image)
# Add conversation history
if history_messages:
litellm_messages.extend(history_messages)
else:
litellm_messages.append({
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
]
})
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": kwargs.get("max_tokens", 500),
"temperature": kwargs.get("temperature", 0.0),
"do_sample": kwargs.get("temperature", 0.0) > 0.0,
"num_retries": max_retries,
**{k: v for k, v in kwargs.items() if k not in ["max_tokens", "temperature"]}
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse UITARS response
parsed_responses = parse_uitars_response(response_content, original_width, original_height)
# Convert to computer actions
computer_actions = convert_to_computer_actions(parsed_responses, original_width, original_height)
# Add computer actions to response items
thought = parsed_responses[0].get("thought", "")
if thought:
response_items.append(make_reasoning_item(thought))
response_items.extend(computer_actions)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
return agent_response
return agent_response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
UITARS supports click prediction through its action parsing.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple with (x, y) coordinates or None
"""
try:
# Create prompt using grounding template
user_prompt = GROUNDING_UITARS_PROMPT_TEMPLATE.format(
instruction=instruction
)
# Process image for UITARS
processed_image, original_width, original_height = process_image_for_uitars(image_b64)
encoded_image = pil_to_base64(processed_image)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{"type": "text", "text": user_prompt},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
]
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": 100,
"temperature": 0.0,
"do_sample": False
}
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse the response to extract click coordinates
# Look for click action with coordinates
click_pattern = r"click\(point='<\|box_start\|>\((\d+),(\d+)\)<\|box_end\|>'\)"
match = re.search(click_pattern, response_content)
if match:
x, y = int(match.group(1)), int(match.group(2))
# Scale coordinates back to original image dimensions
scale_x = original_width / processed_image.width
scale_y = original_height / processed_image.height
scaled_x = int(x * scale_x)
scaled_y = int(y * scale_y)
return (scaled_x, scaled_y)
return None
except Exception as e:
# Log error and return None
print(f"Error in predict_click: {e}")
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["step", "click"]

View File

@@ -40,7 +40,7 @@ def make_input_image_item(image_data: Union[str, bytes]) -> EasyInputMessagePara
ResponseInputImageParam(
type="input_image",
image_url=f"data:image/png;base64,{base64.b64encode(image_data).decode('utf-8') if isinstance(image_data, bytes) else image_data}"
)
) # type: ignore
],
role="user",
type="message"
@@ -205,3 +205,524 @@ def make_wait_item(call_id: Optional[str] = None) -> ResponseComputerToolCallPar
status="completed",
type="computer_call"
)
# Extra anthropic computer calls
def make_left_mouse_down_item(x: Optional[int] = None, y: Optional[int] = None, call_id: Optional[str] = None) -> Dict[str, Any]:
return {
"id": random_id(),
"call_id": call_id if call_id else random_id(),
"action": {
"type": "left_mouse_down",
"x": x,
"y": y
},
"pending_safety_checks": [],
"status": "completed",
"type": "computer_call"
}
def make_left_mouse_up_item(x: Optional[int] = None, y: Optional[int] = None, call_id: Optional[str] = None) -> Dict[str, Any]:
return {
"id": random_id(),
"call_id": call_id if call_id else random_id(),
"action": {
"type": "left_mouse_up",
"x": x,
"y": y
},
"pending_safety_checks": [],
"status": "completed",
"type": "computer_call"
}
def make_failed_tool_call_items(tool_name: str, tool_kwargs: Dict[str, Any], error_message: str, call_id: Optional[str] = None) -> List[Dict[str, Any]]:
call_id = call_id if call_id else random_id()
return [
{
"type": "function_call",
"id": random_id(),
"call_id": call_id,
"name": tool_name,
"arguments": json.dumps(tool_kwargs),
},
{
"type": "function_call_output",
"call_id": call_id,
"output": json.dumps({"error": error_message}),
}
]
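# Illustrative result of make_failed_tool_call_items("computer", {"action": "click", "x": 10,
# "y": 20}, "element not found") with made-up IDs: a function_call / function_call_output
# pair sharing one call_id, so the error stays attached to the attempted call:
#   [{"type": "function_call", "id": "...", "call_id": "c1", "name": "computer",
#     "arguments": "{\"action\": \"click\", \"x\": 10, \"y\": 20}"},
#    {"type": "function_call_output", "call_id": "c1",
#     "output": "{\"error\": \"element not found\"}"}]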
# Conversion functions between element descriptions and coordinates
def convert_computer_calls_desc2xy(responses_items: List[Dict[str, Any]], desc2xy: Dict[str, tuple]) -> List[Dict[str, Any]]:
"""
Convert computer calls from element descriptions to x,y coordinates.
Args:
responses_items: List of response items containing computer calls with element_description
desc2xy: Dictionary mapping element descriptions to (x, y) coordinate tuples
Returns:
List of response items with element_description replaced by x,y coordinates
"""
converted_items = []
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"].copy()
# Handle single element_description
if "element_description" in action:
desc = action["element_description"]
if desc in desc2xy:
x, y = desc2xy[desc]
action["x"] = x
action["y"] = y
del action["element_description"]
# Handle start_element_description and end_element_description for drag operations
elif "start_element_description" in action and "end_element_description" in action:
start_desc = action["start_element_description"]
end_desc = action["end_element_description"]
if start_desc in desc2xy and end_desc in desc2xy:
start_x, start_y = desc2xy[start_desc]
end_x, end_y = desc2xy[end_desc]
action["path"] = [{"x": start_x, "y": start_y}, {"x": end_x, "y": end_y}]
del action["start_element_description"]
del action["end_element_description"]
converted_item = item.copy()
converted_item["action"] = action
converted_items.append(converted_item)
else:
converted_items.append(item)
return converted_items
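# Illustrative round trip (data is made up): with
#   desc2xy = {"blue Submit button": (640, 410)}
# a description-based call
#   {"type": "computer_call", "action": {"type": "click", "element_description": "blue Submit button"}}
# becomes
#   {"type": "computer_call", "action": {"type": "click", "x": 640, "y": 410}}
# and convert_computer_calls_xy2desc below applies the inverse mapping.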
def convert_computer_calls_xy2desc(responses_items: List[Dict[str, Any]], desc2xy: Dict[str, tuple]) -> List[Dict[str, Any]]:
"""
Convert computer calls from x,y coordinates to element descriptions.
Args:
responses_items: List of response items containing computer calls with x,y coordinates
desc2xy: Dictionary mapping element descriptions to (x, y) coordinate tuples
Returns:
List of response items with x,y coordinates replaced by element_description
"""
# Create reverse mapping from coordinates to descriptions
xy2desc = {coords: desc for desc, coords in desc2xy.items()}
converted_items = []
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"].copy()
# Handle single x,y coordinates
if "x" in action and "y" in action:
coords = (action["x"], action["y"])
if coords in xy2desc:
action["element_description"] = xy2desc[coords]
del action["x"]
del action["y"]
# Handle path for drag operations
elif "path" in action and isinstance(action["path"], list) and len(action["path"]) == 2:
start_point = action["path"][0]
end_point = action["path"][1]
if ("x" in start_point and "y" in start_point and
"x" in end_point and "y" in end_point):
start_coords = (start_point["x"], start_point["y"])
end_coords = (end_point["x"], end_point["y"])
if start_coords in xy2desc and end_coords in xy2desc:
action["start_element_description"] = xy2desc[start_coords]
action["end_element_description"] = xy2desc[end_coords]
del action["path"]
converted_item = item.copy()
converted_item["action"] = action
converted_items.append(converted_item)
else:
converted_items.append(item)
return converted_items
def get_all_element_descriptions(responses_items: List[Dict[str, Any]]) -> List[str]:
"""
Extract all element descriptions from computer calls in responses items.
Args:
responses_items: List of response items containing computer calls
Returns:
List of unique element descriptions found in computer calls
"""
descriptions = set()
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"]
# Handle single element_description
if "element_description" in action:
descriptions.add(action["element_description"])
# Handle start_element_description and end_element_description for drag operations
if "start_element_description" in action:
descriptions.add(action["start_element_description"])
if "end_element_description" in action:
descriptions.add(action["end_element_description"])
return list(descriptions)
# Conversion functions between responses_items and completion messages formats
def convert_responses_items_to_completion_messages(messages: List[Dict[str, Any]], allow_images_in_tool_results: bool = True) -> List[Dict[str, Any]]:
"""Convert responses_items message format to liteLLM completion format.
Args:
messages: List of responses_items format messages
allow_images_in_tool_results: If True, include images in tool role messages.
If False, send tool message + separate user message with image.
"""
completion_messages = []
for message in messages:
msg_type = message.get("type")
role = message.get("role")
# Handle user messages (both with and without explicit type)
if role == "user" or msg_type == "user":
content = message.get("content", "")
if isinstance(content, list):
# Handle list content (images, text blocks)
completion_content = []
for item in content:
if item.get("type") == "input_image":
completion_content.append({
"type": "image_url",
"image_url": {
"url": item.get("image_url")
}
})
elif item.get("type") == "input_text":
completion_content.append({
"type": "text",
"text": item.get("text")
})
elif item.get("type") == "text":
completion_content.append({
"type": "text",
"text": item.get("text")
})
completion_messages.append({
"role": "user",
"content": completion_content
})
elif isinstance(content, str):
# Handle string content
completion_messages.append({
"role": "user",
"content": content
})
# Handle assistant messages
elif role == "assistant" or msg_type == "message":
content = message.get("content", [])
if isinstance(content, list):
text_parts = []
for item in content:
if item.get("type") == "output_text":
text_parts.append(item.get("text", ""))
elif item.get("type") == "text":
text_parts.append(item.get("text", ""))
if text_parts:
completion_messages.append({
"role": "assistant",
"content": "\n".join(text_parts)
})
# Handle reasoning items (convert to assistant message)
elif msg_type == "reasoning":
summary = message.get("summary", [])
text_parts = []
for item in summary:
if item.get("type") == "summary_text":
text_parts.append(item.get("text", ""))
if text_parts:
completion_messages.append({
"role": "assistant",
"content": "\n".join(text_parts)
})
# Handle function calls
elif msg_type == "function_call":
# Add tool call to last assistant message or create new one
if not completion_messages or completion_messages[-1]["role"] != "assistant":
completion_messages.append({
"role": "assistant",
"content": "",
"tool_calls": []
})
if "tool_calls" not in completion_messages[-1]:
completion_messages[-1]["tool_calls"] = []
completion_messages[-1]["tool_calls"].append({
"id": message.get("call_id"),
"type": "function",
"function": {
"name": message.get("name"),
"arguments": message.get("arguments")
}
})
# Handle computer calls
elif msg_type == "computer_call":
# Add tool call to last assistant message or create new one
if not completion_messages or completion_messages[-1]["role"] != "assistant":
completion_messages.append({
"role": "assistant",
"content": "",
"tool_calls": []
})
if "tool_calls" not in completion_messages[-1]:
completion_messages[-1]["tool_calls"] = []
action = message.get("action", {})
completion_messages[-1]["tool_calls"].append({
"id": message.get("call_id"),
"type": "function",
"function": {
"name": "computer",
"arguments": json.dumps(action)
}
})
# Handle function/computer call outputs
elif msg_type in ["function_call_output", "computer_call_output"]:
output = message.get("output")
call_id = message.get("call_id")
if isinstance(output, dict) and output.get("type") == "input_image":
if allow_images_in_tool_results:
# Handle image output as tool response (may not work with all APIs)
completion_messages.append({
"role": "tool",
"tool_call_id": call_id,
"content": [{
"type": "image_url",
"image_url": {
"url": output.get("image_url")
}
}]
})
else:
# Send tool message + separate user message with image (OpenAI compatible)
completion_messages += [{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]"
}, {
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
"url": output.get("image_url")
}
}]
}]
else:
# Handle text output as tool response
completion_messages.append({
"role": "tool",
"tool_call_id": call_id,
"content": str(output)
})
return completion_messages
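# Illustrative conversion (values are made up) for a screenshot result when
# allow_images_in_tool_results=False: a computer_call_output carrying an input_image becomes
# a tool message plus a follow-up user message with the image, the OpenAI-compatible pattern
# that convert_completion_messages_to_responses_items below reverses:
#   {"type": "computer_call_output", "call_id": "c1",
#    "output": {"type": "input_image", "image_url": "data:image/png;base64,..."}}
# ->
#   {"role": "tool", "tool_call_id": "c1",
#    "content": "[Execution completed. See screenshot below]"},
#   {"role": "user", "content": [{"type": "image_url",
#                                 "image_url": {"url": "data:image/png;base64,..."}}]}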
def convert_completion_messages_to_responses_items(completion_messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Convert completion messages format to responses_items message format."""
responses_items = []
skip_next = False
for i, message in enumerate(completion_messages):
if skip_next:
skip_next = False
continue
role = message.get("role")
content = message.get("content")
tool_calls = message.get("tool_calls", [])
# Handle assistant messages with text content
if role == "assistant" and content and isinstance(content, str):
responses_items.append({
"type": "message",
"role": "assistant",
"content": [{
"type": "output_text",
"text": content
}]
})
# Handle tool calls
if tool_calls:
for tool_call in tool_calls:
if tool_call.get("type") == "function":
function = tool_call.get("function", {})
function_name = function.get("name")
if function_name == "computer":
# Parse computer action
try:
action = json.loads(function.get("arguments", "{}"))
# Change key from "action" -> "type"
if action.get("action"):
action["type"] = action["action"]
del action["action"]
responses_items.append({
"type": "computer_call",
"call_id": tool_call.get("id"),
"action": action,
"status": "completed"
})
except json.JSONDecodeError:
# Fallback to function call format
responses_items.append({
"type": "function_call",
"call_id": tool_call.get("id"),
"name": function_name,
"arguments": function.get("arguments", "{}"),
"status": "completed"
})
else:
# Regular function call
responses_items.append({
"type": "function_call",
"call_id": tool_call.get("id"),
"name": function_name,
"arguments": function.get("arguments", "{}"),
"status": "completed"
})
# Handle tool messages (function/computer call outputs)
elif role == "tool" and content:
tool_call_id = message.get("tool_call_id")
if isinstance(content, str):
# Check if this is the "[Execution completed. See screenshot below]" pattern
if content == "[Execution completed. See screenshot below]":
# Look ahead for the next user message with image
next_idx = i + 1
if (next_idx < len(completion_messages) and
completion_messages[next_idx].get("role") == "user" and
isinstance(completion_messages[next_idx].get("content"), list)):
# Found the pattern - extract image from next message
next_content = completion_messages[next_idx]["content"]
for item in next_content:
if item.get("type") == "image_url":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": {
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
}
})
# Skip the next user message since we processed it
skip_next = True
break
else:
# No matching user message, treat as regular text
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": content
})
else:
# Determine if this is a computer call or function call output
try:
# Try to parse as structured output
parsed_content = json.loads(content)
if parsed_content.get("type") == "input_image":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": parsed_content
})
else:
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": content
})
except json.JSONDecodeError:
# Plain text output - could be function or computer call
responses_items.append({
"type": "function_call_output",
"call_id": tool_call_id,
"output": content
})
elif isinstance(content, list):
# Handle structured content (e.g., images)
for item in content:
if item.get("type") == "image_url":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": {
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
}
})
elif item.get("type") == "text":
responses_items.append({
"type": "function_call_output",
"call_id": tool_call_id,
"output": item.get("text")
})
# Handle actual user messages
elif role == "user" and content:
if isinstance(content, list):
# Handle structured user content (e.g., text + images)
user_content = []
for item in content:
if item.get("type") == "image_url":
user_content.append({
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
})
elif item.get("type") == "text":
user_content.append({
"type": "input_text",
"text": item.get("text")
})
if user_content:
responses_items.append({
"role": "user",
"type": "message",
"content": user_content
})
elif isinstance(content, str):
# Handle simple text user message
responses_items.append({
"role": "user",
"content": content
})
return responses_items
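As a quick sanity check of the conversion above, here is a minimal sketch that maps a completion-style history back into responses items. It assumes it runs alongside the converter defined above (adjust the import to wherever these helpers live); the screenshot URL is a placeholder.

```python
# Assumes convert_completion_messages_to_responses_items from above is in scope.
completion_messages = [
    {"role": "user", "content": "Open the settings menu"},
    {
        "role": "assistant",
        "content": "Clicking the gear icon.",
        "tool_calls": [{
            "type": "function",
            "id": "call_1",
            "function": {
                "name": "computer",
                "arguments": '{"action": "click", "x": 512, "y": 384}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1",
     "content": "[Execution completed. See screenshot below]"},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]},
]

items = convert_completion_messages_to_responses_items(completion_messages)
# Expected: a user message, an assistant message, a "computer_call" with
# type "click", and a "computer_call_output" carrying the screenshot.
```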

View File

@@ -9,71 +9,21 @@ from litellm import ResponseInputParam, ResponsesAPIResponse, ToolParam
from collections.abc import Iterable
# Agent input types
Messages = str | ResponseInputParam
Messages = str | ResponseInputParam | List[Dict[str, Any]]
Tools = Optional[Iterable[ToolParam]]
# Agent output types
AgentResponse = ResponsesAPIResponse
AgentCapability = Literal["step", "click"]
# Agent loop registration
class AgentLoopInfo(BaseModel):
"""Information about a registered agent loop"""
func: Callable
# Agent config registration
class AgentConfigInfo(BaseModel):
"""Information about a registered agent config"""
agent_class: type
models_regex: str
priority: int = 0
def matches_model(self, model: str) -> bool:
"""Check if this loop matches the given model"""
"""Check if this agent config matches the given model"""
return bool(re.match(self.models_regex, model))
# Computer tool interface
class Computer(Protocol):
"""Protocol defining the interface for computer interactions."""
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: List[str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...

View File

@@ -178,13 +178,20 @@ def create_computer_instance(
"""Create or get the global Computer instance."""
global global_computer
if global_computer is None:
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
provider_type=provider_type,
name=name if name else "",
api_key=api_key
)
if provider_type == "localhost":
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
use_host_computer_server=True
)
else:
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
provider_type=provider_type,
name=name if name else "",
api_key=api_key
)
return global_computer
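A usage sketch of the new localhost branch (keyword names are inferred from the body above; the "localhost"/"cloud" values mirror the provider list in the Gradio UI change below):

```python
# Connects to a computer server running on the host instead of provisioning a VM.
computer = create_computer_instance(os_type="linux", provider_type="localhost")

# Any other provider falls through to the regular constructor path.
cloud_computer = create_computer_instance(
    os_type="linux",
    provider_type="cloud",
    name="my-container",      # illustrative container name
    api_key="your-api-key",   # illustrative key
)
```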

View File

@@ -211,7 +211,7 @@ if __name__ == "__main__":
is_windows = platform.system().lower() == "windows"
is_mac = platform.system().lower() == "darwin"
providers = ["cloud"]
providers = ["cloud", "localhost"]
if is_mac:
providers += ["lume"]
if is_windows:
@@ -403,6 +403,23 @@ if __name__ == "__main__":
type="password",
)
# Provider visibility update function
def update_provider_visibility(provider):
"""Update visibility of container name and API key based on selected provider."""
is_localhost = provider == "localhost"
return [
gr.update(visible=not is_localhost), # container_name
gr.update(visible=not is_localhost and not has_cua_key) # cua_cloud_api_key
]
# Connect provider change event
computer_provider.change(
fn=update_provider_visibility,
inputs=[computer_provider],
outputs=[container_name, cua_cloud_api_key],
queue=False
)
# Connect UI update events
for dropdown in [agent_loop, omni_model_choice, uitars_model_choice, openai_model_choice, anthropic_model_choice]:
dropdown.change(

View File

@@ -0,0 +1,3 @@
output/
interactive_output/
*_results.md

View File

@@ -0,0 +1,68 @@
# Computer Agent Benchmarks
This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.
## Overview
The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)
## Available Benchmarks
### 1. ScreenSpot-v2 (`ss-v2.py`)
- **Dataset**: ScreenSpot-v2 (click-only GUI grounding)
- **Format**: Standard resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage
### 2. ScreenSpot-Pro (`ss-pro.py`)
- **Dataset**: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- **Format**: High-resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage
### 3. Interactive Testing (`interactive.py`)
- **Real-time testing**: Take screenshots and visualize model predictions
- **Commands**:
- Type instruction → test all models on last screenshot
- `screenshot` → take screenshot
- `models` → list available models
- `quit`/`exit` → exit tool
- **Output**: Visual predictions with crosshairs for each model
## Running Benchmarks
### 1. Configure Models
Edit `utils.py` to specify which models you want to test in `get_available_models()`.
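For example, a pared-down `get_available_models()` mixing a provider string with a reference implementation (the identifiers below are the ones used elsewhere in this directory):

```python
from typing import List, Union

from models.base import ModelProtocol
from models.gta1 import GTA1Model  # reference implementation


def get_available_models() -> List[Union[str, ModelProtocol]]:
    return [
        # Computer Agent SDK provider string
        "huggingface-local/HelloKKMe/GTA1-7B",
        # Reference agent implementation (ModelProtocol)
        GTA1Model("HelloKKMe/GTA1-7B"),
    ]
```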
### 2. Run Benchmark
```bash
# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50
# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50
# Interactive testing
python interactive.py
```
## Output
### Console Output
```
Model Results:
Accuracy: 85.50% (171/200)
Avg Time: 1.23s (0.89s - 2.45s)
VRAM Usage: 4.5GB (max) / 3.4GB (avg)
```
### Generated Files
- **Markdown Report**: `*_results.md` with detailed results tables
- **Visualizations**: `output/` directory with prediction visualizations
- **Interactive Output**: `interactive_output/` for interactive session results
## Contributing
To add a new reference model, follow the instructions in [contrib.md](contrib.md).

View File

@@ -0,0 +1,163 @@
# Contributing Reference Agent Implementations
This guide explains how to add your own reference agent implementations to the benchmark system.
## Adding Reference Agent Implementations
### 1. Implement the ModelProtocol
Create a new file in the `models/` directory implementing the `ModelProtocol`:
```python
from models.base import ModelProtocol
from typing import Optional, Tuple
from PIL import Image
class YourModelName(ModelProtocol):
def __init__(self, model_path: str):
self.model_path = model_path
self._model = None
@property
def model_name(self) -> str:
return self.model_path
async def load_model(self) -> None:
"""Load the model into memory."""
# Your model loading logic here
pass
async def unload_model(self) -> None:
"""Unload the model from memory."""
# Your model cleanup logic here
pass
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# Your prediction logic here
return (x, y) # Return predicted coordinates
```
### 2. Register Your Model
Add your model to the `get_available_models()` function in `utils.py`:
```python
def get_available_models() -> List[Union[str, ModelProtocol]]:
models = [
# Computer Agent SDK providers
"huggingface-local/HelloKKMe/GTA1-7B",
# Reference implementations
GTA1Model("HelloKKMe/GTA1-7B"),
YourModelName("path/to/your/model"), # Add your model here
]
return models
```
### 3. Test Your Implementation
Before submitting, test your model with the interactive tool:
```bash
python interactive.py
```
This will help you verify that your model loads correctly and produces reasonable predictions.
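You can also exercise the implementation directly with a few lines of asyncio, outside the interactive tool (the module name and screenshot path below are placeholders):

```python
import asyncio

from PIL import Image

from models.your_model import YourModelName  # adjust to your module name


async def smoke_test() -> None:
    model = YourModelName("path/to/your/model")
    await model.load_model()
    try:
        image = Image.open("screenshot.png")  # any test screenshot
        coords = await model.predict_click(image, "Click the Save button")
        print(f"{model.model_name} -> {coords}")
    finally:
        await model.unload_model()


asyncio.run(smoke_test())
```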
## Example: Adding a New Model
Here's a complete example of adding a hypothetical "MyVisionModel":
1. **Create `models/my_vision_model.py`:**
```python
import torch
from transformers import AutoModel, AutoProcessor
from models.base import ModelProtocol
from typing import Optional, Tuple
from PIL import Image
class MyVisionModel(ModelProtocol):
def __init__(self, model_path: str):
self.model_path = model_path
self.model = None
self.processor = None
@property
def model_name(self) -> str:
return f"MyVisionModel({self.model_path})"
async def load_model(self) -> None:
"""Load the model and processor."""
self.processor = AutoProcessor.from_pretrained(self.model_path)
self.model = AutoModel.from_pretrained(
self.model_path,
torch_dtype=torch.float16,
device_map="auto"
)
async def unload_model(self) -> None:
"""Clean up model resources."""
del self.model
del self.processor
self.model = None
self.processor = None
torch.cuda.empty_cache()
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""Predict click coordinates."""
try:
# Preprocess inputs
inputs = self.processor(
text=instruction,
images=image,
return_tensors="pt"
)
# Run inference
with torch.no_grad():
outputs = self.model(**inputs)
# Extract coordinates (model-specific logic)
x, y = self._extract_coordinates(outputs)
return (int(x), int(y))
except Exception as e:
print(f"Prediction failed: {e}")
return None
def _extract_coordinates(self, outputs):
"""Extract x, y coordinates from model outputs."""
# Your model-specific coordinate extraction logic
pass
```
2. **Update `models/__init__.py`:**
```python
from .gta1 import GTA1Model
from .my_vision_model import MyVisionModel
__all__ = ["GTA1Model", "MyVisionModel"]
```
3. **Update `utils.py`:**
```python
from models import GTA1Model, MyVisionModel
def get_available_models() -> List[Union[str, ModelProtocol]]:
models = [
"huggingface-local/HelloKKMe/GTA1-7B",
GTA1Model("HelloKKMe/GTA1-7B"),
MyVisionModel("my-org/my-vision-model"), # Add here
]
return models
```

View File

@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""
Interactive Click Prediction Tool
Takes screenshots and allows testing multiple models interactively.
Models are loaded/unloaded one at a time to avoid memory issues.
"""
import asyncio
import os
from datetime import datetime
from typing import List, Dict, Any
from utils import (
ModelWrapper,
take_screenshot,
save_prediction_visualization,
get_available_models
)
async def predict_with_all_models(image, instruction: str, models) -> List[Dict[str, Any]]:
"""
Predict click coordinates with all models sequentially.
Args:
image: PIL Image to analyze
instruction: Instruction text
models: List of model instances
Returns:
List of prediction results
"""
predictions = []
for model in models:
model_wrapper = ModelWrapper(model)
print(f"\n🔄 Loading {model_wrapper.model_name}...")
try:
# Load model
await model_wrapper.load_model()
# Predict
coords = await model_wrapper.predict_click(image, instruction)
predictions.append({
'model_name': model_wrapper.model_name,
'coords': coords,
'error': None
})
if coords:
print(f"{model_wrapper.model_name}: ({coords[0]}, {coords[1]})")
else:
print(f"{model_wrapper.model_name}: No prediction")
except Exception as e:
print(f"{model_wrapper.model_name}: ERROR - {str(e)}")
predictions.append({
'model_name': model_wrapper.model_name,
'coords': None,
'error': str(e)
})
finally:
# Always unload model to free memory
try:
await model_wrapper.unload_model()
print(f"🗑️ Unloaded {model_wrapper.model_name}")
except Exception as e:
print(f"⚠️ Error unloading {model_wrapper.model_name}: {e}")
return predictions
def print_header():
"""Print the interactive tool header."""
print("=" * 60)
print("🖱️ Interactive Click Prediction Tool")
print("=" * 60)
print("Commands:")
print(" • Type an instruction to test models on last screenshot")
print("'screenshot' - Take a new screenshot")
print("'models' - List available models")
print("'quit' or 'exit' - Exit the tool")
print("=" * 60)
print("💡 Tip: Take a screenshot first, then send instructions to test models!")
def print_models(models):
"""Print available models."""
print("\n📋 Available Models:")
for i, model in enumerate(models, 1):
if isinstance(model, str):
print(f" {i}. {model}")
else:
print(f" {i}. models.{model.__class__.__name__}")
async def main():
"""
Main interactive loop.
"""
print_header()
# Get available models
models = get_available_models()
print_models(models)
# Create output directory for visualizations
output_dir = "interactive_output"
os.makedirs(output_dir, exist_ok=True)
session_count = 0
last_screenshot = None
screenshot_timestamp = None
while True:
try:
# Get user input
print(f"\n{'='*40}")
user_input = input("🎯 Enter instruction (or command): ").strip()
if not user_input:
continue
# Handle commands
if user_input.lower() in ['quit', 'exit', 'q']:
print("👋 Goodbye!")
break
elif user_input.lower() == 'models':
print_models(models)
continue
elif user_input.lower() == 'screenshot':
print("📸 Taking screenshot...")
try:
last_screenshot = take_screenshot()
screenshot_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
screenshot_path = os.path.join(output_dir, f"screenshot_{screenshot_timestamp}.png")
last_screenshot.save(screenshot_path)
print(f"✅ Screenshot captured and saved to: {screenshot_path}")
print(f"📝 Ready for instructions! Screenshot size: {last_screenshot.size}")
except Exception as e:
print(f"❌ Error taking screenshot: {e}")
continue
# Handle instruction input
if last_screenshot is None:
print("⚠️ No screenshot available! Please take a screenshot first using 'screenshot' command.")
continue
session_count += 1
print(f"\n🎯 Session {session_count}: '{user_input}'")
print(f"📷 Using screenshot from: {screenshot_timestamp}")
# Predict with all models using last screenshot
print(f"\n🤖 Testing {len(models)} models on screenshot...")
predictions = await predict_with_all_models(last_screenshot, user_input, models)
# Display results summary
print(f"\n📊 Results Summary:")
print("-" * 50)
for pred in predictions:
if pred['coords']:
print(f"{pred['model_name']}: ({pred['coords'][0]}, {pred['coords'][1]})")
elif pred['error']:
print(f"{pred['model_name']}: ERROR - {pred['error']}")
else:
print(f"{pred['model_name']}: No prediction")
# Save visualization
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
vis_filename = f"session_{session_count:03d}_{timestamp}.png"
vis_path = os.path.join(output_dir, vis_filename)
try:
save_prediction_visualization(last_screenshot, user_input, predictions, vis_path)
print(f"\n💾 Visualization saved to: {vis_path}")
except Exception as e:
print(f"⚠️ Error saving visualization: {e}")
print(f"\n✨ Session {session_count} completed!")
except KeyboardInterrupt:
print("\n\n👋 Interrupted by user. Goodbye!")
break
except Exception as e:
print(f"\n❌ Unexpected error: {e}")
print("Continuing...")
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n👋 Goodbye!")
except Exception as e:
print(f"❌ Fatal error: {e}")

View File

@@ -0,0 +1,3 @@
from .base import ModelProtocol
__all__ = ["ModelProtocol"]

View File

@@ -0,0 +1,36 @@
"""
Base protocol for benchmark models.
"""
from typing import Protocol, Optional, Tuple
from PIL import Image
class ModelProtocol(Protocol):
"""Protocol for benchmark models that can predict click coordinates."""
@property
def model_name(self) -> str:
"""Return the name of the model."""
...
async def load_model(self) -> None:
"""Load the model into memory."""
...
async def unload_model(self) -> None:
"""Unload the model from memory."""
...
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
...

View File

@@ -0,0 +1,162 @@
"""
GTA1 model implementation for benchmarking.
"""
from typing import Optional, Tuple
from PIL import Image
import torch
import re
import gc
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from .base import ModelProtocol
class GTA1Model:
"""Ground truth GTA1 model implementation."""
def __init__(self, model_path: str = "HelloKKMe/GTA1-7B"):
self.model_path = model_path
self.model = None
self.processor = None
self.max_new_tokens = 32
self.system_prompt = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''.strip()
@property
def model_name(self) -> str:
"""Return the name of the model."""
return f"GTA1-{self.model_path.split('/')[-1]}"
async def load_model(self) -> None:
"""Load the model into memory."""
if self.model is None:
print(f"Loading GTA1 model: {self.model_path}")
self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
self.model_path,
torch_dtype=torch.bfloat16,
device_map="auto"
)
self.processor = AutoProcessor.from_pretrained(
self.model_path,
min_pixels=3136,
max_pixels=4096 * 2160
)
print("GTA1 model loaded successfully")
async def unload_model(self) -> None:
"""Unload the model from memory."""
if self.model is not None:
print("Unloading GTA1 model from GPU...")
del self.model
del self.processor
self.model = None
self.processor = None
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
print("GTA1 model unloaded")
def _extract_coordinates(self, raw_string: str) -> Tuple[int, int]:
"""Extract coordinates from model output."""
try:
matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
return tuple(map(int, map(float, matches[0]))) # type: ignore
except:
return (0, 0)
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
if self.model is None or self.processor is None:
await self.load_model()
assert self.processor is not None
assert self.model is not None
try:
width, height = image.width, image.height
# Resize image according to processor requirements
resized_height, resized_width = smart_resize(
image.height,
image.width,
factor=self.processor.image_processor.patch_size * self.processor.image_processor.merge_size,
min_pixels=self.processor.image_processor.min_pixels,
max_pixels=self.processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height
# Prepare messages
system_message = {
"role": "system",
"content": self.system_prompt.format(height=resized_height, width=resized_width)
}
user_message = {
"role": "user",
"content": [
{"type": "image", "image": resized_image},
{"type": "text", "text": instruction}
]
}
# Process inputs
image_inputs, video_inputs = process_vision_info([system_message, user_message]) # type: ignore
text = self.processor.apply_chat_template(
[system_message, user_message],
tokenize=False,
add_generation_prompt=True
)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt"
)
inputs = inputs.to(self.model.device)
# Generate prediction
output_ids = self.model.generate(
**inputs,
max_new_tokens=self.max_new_tokens,
do_sample=False,
temperature=1.0,
use_cache=True
)
generated_ids = [
output_ids[len(input_ids):]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = self.processor.batch_decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=True
)[0]
# Extract and rescale coordinates
pred_x, pred_y = self._extract_coordinates(output_text)
pred_x = int(pred_x * scale_x)
pred_y = int(pred_y * scale_y)
return (pred_x, pred_y)
except Exception as e:
print(f"Error in GTA1 prediction: {e}")
return None

View File

@@ -0,0 +1,186 @@
#!/usr/bin/env python3
"""
ScreenSpot-Pro Benchmark Script
Evaluates models on the ScreenSpot-Pro dataset for click prediction accuracy.
Supports both ComputerAgent model strings and custom model classes.
"""
import argparse
import asyncio
import random
import statistics
import time
from typing import Optional
from datasets import load_dataset
from tqdm import tqdm
from utils import (
ModelWrapper,
is_click_in_bbox,
save_results_to_markdown,
save_visualizations,
get_available_models,
get_gpu_memory
)
async def evaluate_model(model_wrapper: ModelWrapper, dataset, max_samples: Optional[int] = None) -> dict:
"""
Evaluate a model on the ScreenSpot-Pro dataset.
Args:
model_wrapper: ModelWrapper instance
dataset: ScreenSpot-Pro dataset (list of samples)
max_samples: Maximum number of samples to evaluate (None for all)
Returns:
Dictionary with evaluation results
"""
print(f"\nEvaluating model: {model_wrapper.model_name}")
# Load model
await model_wrapper.load_model()
total_samples = len(dataset)
if max_samples is not None:
total_samples = min(max_samples, total_samples)
correct_predictions = 0
error_predictions = 0
results = []
for i in tqdm(range(total_samples), desc=f"Evaluating {model_wrapper.model_name}"):
sample = dataset[i]
# Extract sample data
image = sample['image']
instruction = sample['instruction']
bbox = sample['bbox'] # [x1, y1, x2, y2]
sample_id = sample['img_filename']
# Predict click coordinates with timing
start_time = time.time()
click_coords = await model_wrapper.predict_click(image, instruction)
prediction_time = time.time() - start_time
# Check if prediction is correct
is_correct = is_click_in_bbox(click_coords, bbox)
if is_correct:
correct_predictions += 1
results.append({
'sample_idx': i,  # required by the shared reporting/visualization utils
'id': sample_id,
'instruction': instruction,
'bbox': bbox,
'predicted_coords': click_coords,
'is_correct': is_correct,
'failed': False,
'prediction_time': prediction_time
})
# Unload model
await model_wrapper.unload_model()
# Calculate metrics
accuracy = correct_predictions / total_samples if total_samples > 0 else 0.0
error_rate = error_predictions / total_samples if total_samples > 0 else 0.0
# Calculate timing statistics
successful_times = [r['prediction_time'] for r in results if not r['failed']]
avg_prediction_time = sum(successful_times) / len(successful_times) if successful_times else 0.0
median_prediction_time = statistics.median(successful_times) if successful_times else 0.0
min_prediction_time = min(successful_times) if successful_times else 0.0
max_prediction_time = max(successful_times) if successful_times else 0.0
# Get VRAM statistics
vram_stats = model_wrapper.get_vram_stats()
return {
'model_name': model_wrapper.model_name,
'total_samples': total_samples,
'correct_predictions': correct_predictions,
'failed_predictions': error_predictions,
'accuracy': accuracy,
'failure_rate': error_rate,
'avg_prediction_time': avg_prediction_time,
'median_prediction_time': median_prediction_time,
'min_prediction_time': min_prediction_time,
'max_prediction_time': max_prediction_time,
'vram_max_mb': vram_stats['max_mb'],
'vram_avg_mb': vram_stats['avg_mb'],
'results': results
}
async def main():
"""
Main function to run the benchmark.
"""
# Parse command line arguments
parser = argparse.ArgumentParser(description='ScreenSpot-Pro Benchmark Script')
parser.add_argument('--samples', type=int, default=300,
help='Number of samples to evaluate (default: 300)')
parser.add_argument('--seed', type=int, default=42,
help='Random seed for shuffling (default: 42)')
args = parser.parse_args()
# Set random seed
random.seed(args.seed)
# Load dataset
print("Loading ScreenSpot-Pro dataset...")
ds = load_dataset("lmms-lab/ScreenSpot-Pro")
dataset = ds['train'] # type: ignore
# Convert to list to support indexing
dataset_list = list(dataset)
print(f"Dataset loaded: {len(dataset_list)} samples")
# Shuffle dataset with seed
random.shuffle(dataset_list)
print(f"Dataset shuffled with seed {args.seed}")
# Get available models
models = get_available_models()
# Evaluation settings
max_samples = args.samples # Use command line argument
# Run evaluations
all_results = []
for model in models:
model_wrapper = ModelWrapper(model)
result = await evaluate_model(model_wrapper, dataset_list, max_samples)
all_results.append(result)
# Print summary
print(f"\n{result['model_name']} Results:")
print(f" Accuracy: {result['accuracy']*100:.2f}%")
print(f" Correct: {result['correct_predictions']}/{result['total_samples']}")
print(f" Errors: {result['failed_predictions']}")
print(f" Error Rate: {result['failure_rate']*100:.2f}%")
print(f" Avg Time: {result['avg_prediction_time']:.2f}s")
print(f" Median Time: {result['median_prediction_time']:.2f}s")
print(f" Time Range: {result['min_prediction_time']:.2f}s - {result['max_prediction_time']:.2f}s")
print(f" VRAM Max: {result['vram_max_mb']:.1f}MB")
print(f" VRAM Avg: {result['vram_avg_mb']:.1f}MB")
# Print GPU memory info
gpu_memory = get_gpu_memory()
if gpu_memory and gpu_memory[0] > 0:
print(f" GPU Free Memory: {gpu_memory[0]:.1f}MB")
# Save results
if all_results:
save_results_to_markdown(all_results)
save_visualizations(all_results, dataset_list)
print("\nBenchmark completed successfully!")
else:
print("\nNo successful evaluations completed.")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,206 @@
#!/usr/bin/env python3
"""
ScreenSpot-v2 Benchmark Script
Evaluates models on the ScreenSpot-v2 dataset for click prediction accuracy.
Supports both ComputerAgent model strings and custom model classes.
"""
import argparse
import asyncio
import random
import statistics
import time
from typing import Optional
from datasets import load_dataset
from tqdm import tqdm
from utils import (
ModelWrapper,
is_click_in_bbox,
save_results_to_markdown,
save_visualizations,
get_available_models,
get_gpu_memory
)
async def evaluate_model(model_wrapper: ModelWrapper, samples, max_samples: Optional[int] = None) -> dict:
"""
Evaluate a model on any iterable of samples.
Args:
model_wrapper: ModelWrapper instance
samples: Iterable of dicts with keys: image, bbox, instruction
max_samples: Maximum number of samples to evaluate (None for all)
Returns:
Dictionary with evaluation results
"""
print(f"\nEvaluating model: {model_wrapper.model_name}")
# Load model
await model_wrapper.load_model()
# Convert to list if needed and limit samples
if hasattr(samples, '__len__'):
total_samples = len(samples)
if max_samples is not None:
total_samples = min(max_samples, total_samples)
sample_list = list(samples)[:total_samples]
else:
# For iterators, take max_samples or all
sample_list = list(samples)
if max_samples is not None:
sample_list = sample_list[:max_samples]
total_samples = len(sample_list)
correct_predictions = 0
error_predictions = 0
results = []
for i, sample in enumerate(tqdm(sample_list, desc=f"Evaluating {model_wrapper.model_name}")):
# Extract required data (only these 3 keys matter)
image = sample['image']
instruction = sample['instruction']
bbox = sample['bbox'] # [x1, y1, x2, y2]
# Predict click coordinates with timing
start_time = time.time()
click_coords = await model_wrapper.predict_click(image, instruction)
prediction_time = time.time() - start_time
# Check if prediction is correct
is_correct = is_click_in_bbox(click_coords, bbox)
if is_correct:
correct_predictions += 1
results.append({
'sample_idx': i,
'instruction': instruction,
'bbox': bbox,
'predicted_coords': click_coords,
'is_correct': is_correct,
'failed': False,
'prediction_time': prediction_time
})
# Unload model
await model_wrapper.unload_model()
# Calculate metrics
accuracy = correct_predictions / total_samples if total_samples > 0 else 0.0
error_rate = error_predictions / total_samples if total_samples > 0 else 0.0
# Calculate timing statistics
successful_times = [r['prediction_time'] for r in results if not r['failed']]
avg_prediction_time = sum(successful_times) / len(successful_times) if successful_times else 0.0
median_prediction_time = statistics.median(successful_times) if successful_times else 0.0
min_prediction_time = min(successful_times) if successful_times else 0.0
max_prediction_time = max(successful_times) if successful_times else 0.0
# Get VRAM statistics
vram_stats = model_wrapper.get_vram_stats()
return {
'model_name': model_wrapper.model_name,
'total_samples': total_samples,
'correct_predictions': correct_predictions,
'failed_predictions': error_predictions,
'accuracy': accuracy,
'failure_rate': error_rate,
'avg_prediction_time': avg_prediction_time,
'median_prediction_time': median_prediction_time,
'min_prediction_time': min_prediction_time,
'max_prediction_time': max_prediction_time,
'vram_max_mb': vram_stats['max_mb'],
'vram_avg_mb': vram_stats['avg_mb'],
'results': results
}
async def main():
"""
Main function to run the benchmark.
"""
# Parse command line arguments
parser = argparse.ArgumentParser(description='ScreenSpot-v2 Benchmark Script')
parser.add_argument('--samples', type=int, default=500,
help='Number of samples to evaluate (default: 500)')
parser.add_argument('--seed', type=int, default=42,
help='Random seed for shuffling (default: 42)')
args = parser.parse_args()
# Set random seed
random.seed(args.seed)
# Load dataset
print("Loading ScreenSpot-v2 dataset...")
ds = load_dataset("lmms-lab/ScreenSpot-v2")
dataset = ds['train'] # type: ignore
# Convert to simple list of dicts with only required keys
samples = []
for item in dataset:
# Convert dataset item to dict if needed
item_dict = dict(item) if hasattr(item, 'keys') else item
# Convert ScreenSpot-v2 bbox format [x, y, w, h] to [x1, y1, x2, y2]
bbox_xywh = item_dict['bbox'] # type: ignore
x, y, w, h = bbox_xywh
bbox_xyxy = [x, y, x + w, y + h]
samples.append({
'image': item_dict['image'], # type: ignore
'instruction': item_dict['instruction'], # type: ignore
'bbox': bbox_xyxy
})
print(f"Dataset loaded: {len(samples)} samples")
# Shuffle samples with seed
random.shuffle(samples)
print(f"Samples shuffled with seed {args.seed}")
# Get available models
models = get_available_models()
# Evaluation settings
max_samples = args.samples # Use command line argument
# Run evaluations
all_results = []
for model in models:
model_wrapper = ModelWrapper(model)
result = await evaluate_model(model_wrapper, samples, max_samples)
all_results.append(result)
# Print summary
print(f"\n{result['model_name']} Results:")
print(f" Accuracy: {result['accuracy']*100:.2f}%")
print(f" Correct: {result['correct_predictions']}/{result['total_samples']}")
print(f" Errors: {result['failed_predictions']}")
print(f" Error Rate: {result['failure_rate']*100:.2f}%")
print(f" Avg Time: {result['avg_prediction_time']:.2f}s")
print(f" Median Time: {result['median_prediction_time']:.2f}s")
print(f" Time Range: {result['min_prediction_time']:.2f}s - {result['max_prediction_time']:.2f}s")
print(f" VRAM Max: {result['vram_max_mb']:.1f}MB")
print(f" VRAM Avg: {result['vram_avg_mb']:.1f}MB")
# Print GPU memory info
gpu_memory = get_gpu_memory()
if gpu_memory and gpu_memory[0] > 0:
print(f" GPU Free Memory: {gpu_memory[0]:.1f}MB")
# Save results
if all_results:
save_results_to_markdown(all_results, "screenspot_v2_results.md", title="ScreenSpot-v2 Benchmark Results")
save_visualizations(all_results, samples)
print("\nBenchmark completed successfully!")
else:
print("\nNo successful evaluations completed.")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,409 @@
#!/usr/bin/env python3
"""
Shared utilities for ScreenSpot-Pro benchmarking and interactive testing.
"""
import dotenv
dotenv.load_dotenv()
import asyncio
import base64
import os
import sys
import subprocess as sp
import statistics
from datetime import datetime
from io import BytesIO
from typing import List, Union, Tuple, Optional
from PIL import Image, ImageDraw
from tqdm import tqdm
import gc
import torch
# Add parent directory to path for imports
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
from agent.agent import ComputerAgent
from models.base import ModelProtocol
def get_gpu_memory() -> List[int]:
"""
Get GPU memory usage using nvidia-smi.
Returns:
List of free memory values in MB for each GPU
"""
try:
command = "nvidia-smi --query-gpu=memory.free --format=csv"
memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
memory_free_values = [int(x.split()[0]) for x in memory_free_info]
return memory_free_values
except (sp.CalledProcessError, FileNotFoundError, IndexError):
# Fallback to torch if nvidia-smi is not available
if torch.cuda.is_available():
device = torch.cuda.current_device()
total = torch.cuda.get_device_properties(device).total_memory / 1024 / 1024
reserved = torch.cuda.memory_reserved(device) / 1024 / 1024
return [int(total - reserved)]
return [0]
def get_vram_usage() -> dict:
"""
Get current VRAM usage statistics.
Returns:
Dictionary with VRAM usage info (in MB)
"""
if torch.cuda.is_available():
device = torch.cuda.current_device()
allocated = torch.cuda.memory_allocated(device) / 1024 / 1024 # Convert to MB
reserved = torch.cuda.memory_reserved(device) / 1024 / 1024 # Convert to MB
total = torch.cuda.get_device_properties(device).total_memory / 1024 / 1024
return {
'allocated_mb': allocated,
'reserved_mb': reserved,
'total_mb': total,
'free_mb': total - reserved
}
else:
return {
'allocated_mb': 0.0,
'reserved_mb': 0.0,
'total_mb': 0.0,
'free_mb': 0.0
}
def get_available_models() -> List[Union[str, ModelProtocol]]:
"""
Get list of available models for testing.
Returns:
List of model strings and model classes
"""
local_provider = "huggingface-local/" # Options: huggingface-local/ or mlx/
# from models.gta1 import GTA1Model
models = [
# === ComputerAgent model strings ===
"openai/computer-use-preview",
"anthropic/claude-opus-4-20250514",
# f"{local_provider}HelloKKMe/GTA1-7B",
# f"{local_provider}HelloKKMe/GTA1-32B",
"openai/computer-use-preview+openai/gpt-4o-mini",
"anthropic/claude-opus-4-20250514+openai/gpt-4o-mini",
# === Reference model classes ===
# GTA1Model("HelloKKMe/GTA1-7B"),
# GTA1Model("HelloKKMe/GTA1-32B"),
]
return models
def is_click_in_bbox(click_coords: Optional[Tuple[int, int]], bbox: List[int]) -> bool:
"""
Check if click coordinates are within the bounding box.
Args:
click_coords: (x, y) coordinates or None
bbox: [x1, y1, x2, y2] bounding box
Returns:
True if click is within bbox, False otherwise
"""
if click_coords is None:
return False
x, y = click_coords
x1, y1, x2, y2 = bbox
return x1 <= x <= x2 and y1 <= y <= y2
def image_to_base64(image: Image.Image) -> str:
"""
Convert PIL Image to base64 string.
Args:
image: PIL Image
Returns:
Base64 encoded image string
"""
buffered = BytesIO()
image.save(buffered, format="PNG")
return base64.b64encode(buffered.getvalue()).decode()
class ModelWrapper:
"""
Wrapper to provide unified interface for both ComputerAgent and custom models.
"""
def __init__(self, model: Union[str, ModelProtocol]):
self.model = model
self.is_computer_agent = isinstance(model, str)
self.agent: Optional[ComputerAgent] = None
self.vram_usage_history: List[float] = [] # Track VRAM usage over time
if self.is_computer_agent:
self.model_name = str(model)
else:
self.model_name = f"{model.__class__.__name__}('{getattr(model, 'model_name', 'unknown')}')"
async def load_model(self) -> None:
"""Load the model."""
if self.is_computer_agent:
self.agent = ComputerAgent(model=str(self.model))
else:
await self.model.load_model() # type: ignore
# Record initial VRAM usage after loading
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
async def unload_model(self) -> None:
"""Unload the model."""
if not self.is_computer_agent:
await self.model.unload_model() # type: ignore
else:
del self.agent
self.agent = None
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
# Record VRAM usage after unloading
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
def get_vram_stats(self) -> dict:
"""Get VRAM usage statistics for this model."""
if not self.vram_usage_history:
return {'max_mb': 0.0, 'avg_mb': 0.0}
return {
'max_mb': max(self.vram_usage_history),
'avg_mb': sum(self.vram_usage_history) / len(self.vram_usage_history)
}
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""Predict click coordinates."""
# Record VRAM usage before prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
if self.is_computer_agent:
if self.agent is None:
await self.load_model()
if self.agent is not None:
image_b64 = image_to_base64(image)
result = await self.agent.predict_click(instruction=instruction, image_b64=image_b64)
# Record VRAM usage after prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
return result
return None
else:
result = await self.model.predict_click(image, instruction) # type: ignore
# Record VRAM usage after prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
return result
def save_results_to_markdown(all_results: List[dict], output_file: str = "screenspot_pro_results.md", title: str = "ScreenSpot-Pro Benchmark Results") -> None:
"""
Save evaluation results to a markdown table.
Args:
all_results: List of evaluation results for each model
output_file: Output markdown file path
title: Title used for the report's top-level heading
"""
with open(output_file, 'w', encoding='utf-8') as f:
f.write(f"# {title}\n\n")
f.write(f"**Evaluation Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
# Summary table
f.write("## Summary\n\n")
f.write("| Model | Total Samples | Correct | Errors | Accuracy | Error Rate | Avg Time (s) | Median Time (s) | Time Range (s) | VRAM Max (GB) | VRAM Avg (GB) |\n")
f.write("|-------|---------------|---------|--------|----------|------------|--------------|-----------------|----------------|---------------|---------------|\n")
for result in all_results:
model_name = result['model_name']
total = result['total_samples']
correct = result['correct_predictions']
errors = result['failed_predictions']
accuracy = result['accuracy'] * 100
error_rate = result['failure_rate'] * 100
avg_time = result.get('avg_prediction_time', 0.0)
median_time = result.get('median_prediction_time', 0.0)
min_time = result.get('min_prediction_time', 0.0)
max_time = result.get('max_prediction_time', 0.0)
time_range = f"{min_time:.2f} - {max_time:.2f}"
vram_max = result.get('vram_max_mb', 0.0) / 1024
vram_avg = result.get('vram_avg_mb', 0.0) / 1024
f.write(f"| {model_name} | {total} | {correct} | {errors} | {accuracy:.2f}% | {error_rate:.2f}% | {avg_time:.2f} | {median_time:.2f} | {time_range} | {vram_max:.1f} | {vram_avg:.1f} |\n")
# Detailed results for each model
for result in all_results:
f.write(f"\n## {result['model_name']} - Detailed Results\n\n")
f.write("| Sample Index | Instruction | BBox | Predicted | Correct | Error | Time (s) |\n")
f.write("|-----------|-------------|------|-----------|---------|-------|----------|\n")
for sample_result in result['results'][:10]: # Show first 10 samples
sample_idx = sample_result['sample_idx']
instruction = sample_result['instruction'][:50] + "..." if len(sample_result['instruction']) > 50 else sample_result['instruction']
bbox = str(sample_result['bbox'])
predicted = str(sample_result['predicted_coords']) if sample_result['predicted_coords'] else "None"
correct = "PASS" if sample_result['is_correct'] else "FAIL"
error = "YES" if sample_result['failed'] else "NO"
pred_time = sample_result.get('prediction_time', 0.0)
f.write(f"| {sample_idx} | {instruction} | {bbox} | {predicted} | {correct} | {error} | {pred_time:.2f} |\n")
if len(result['results']) > 10:
f.write(f"\n*Showing first 10 of {len(result['results'])} samples*\n")
print(f"\nResults saved to: {output_file}")
def save_visualizations(all_results: List[dict], samples, output_dir: str = "output") -> None:
"""
Save visualizations of predicted coordinates vs bboxes to an output folder.
Args:
all_results: List of evaluation results for each model
samples: List of sample dicts with image, bbox, instruction keys
output_dir: Output directory path
"""
os.makedirs(output_dir, exist_ok=True)
for result in all_results:
model_name = result['model_name'].replace('/', '_').replace('\\', '_')
model_dir = os.path.join(output_dir, model_name)
os.makedirs(model_dir, exist_ok=True)
print(f"Saving visualizations for {result['model_name']}...")
# Save first 10 samples for visualization
for i, sample_result in enumerate(tqdm(result['results'][:10], desc=f"Saving {model_name} visualizations")):
# Get sample data using index
sample_idx = sample_result['sample_idx']
if sample_idx < len(samples):
sample = samples[sample_idx]
image = sample['image'].copy() # Make a copy to avoid modifying original
else:
print(f"Warning: Could not find sample at index {sample_idx}")
continue
bbox = sample_result['bbox']
predicted_coords = sample_result['predicted_coords']
is_correct = sample_result['is_correct']
# Draw on image
draw = ImageDraw.Draw(image)
# Draw bounding box (ground truth) in green
x1, y1, x2, y2 = bbox
draw.rectangle([x1, y1, x2, y2], outline="green", width=3)
draw.text((x1, y1-20), "Ground Truth", fill="green")
# Draw predicted click in red or blue
if predicted_coords is not None:
px, py = predicted_coords
color = "blue" if is_correct else "red"
# Draw crosshair
crosshair_size = 15
draw.line([(px-crosshair_size, py), (px+crosshair_size, py)], fill=color, width=3)
draw.line([(px, py-crosshair_size), (px, py+crosshair_size)], fill=color, width=3)
draw.text((px+10, py-20), f"Predicted ({px},{py})", fill=color)
# Add status text
status = "CORRECT" if is_correct else "INCORRECT"
status_color = "blue" if is_correct else "red"
draw.text((10, 10), f"Status: {status}", fill=status_color)
draw.text((10, 30), f"Instruction: {sample_result['instruction'][:50]}...", fill="black")
# Save image
filename = f"sample_{i+1:02d}_idx{sample_idx}_{status.lower()}.png"
filepath = os.path.join(model_dir, filename)
image.save(filepath)
print(f"Visualizations saved to: {model_dir}")
def save_prediction_visualization(image: Image.Image, instruction: str, predictions: List[dict],
output_file: str = "interactive_prediction.png") -> None:
"""
Save visualization of multiple model predictions on a single image.
Args:
image: PIL Image to visualize
instruction: Instruction text
predictions: List of prediction dicts with keys: model_name, coords, error
output_file: Output file path
"""
# Create a copy of the image
vis_image = image.copy()
draw = ImageDraw.Draw(vis_image)
# Colors for different models
colors = ["red", "blue", "orange", "purple", "brown", "pink", "gray", "olive"]
# Draw predictions
for i, pred in enumerate(predictions):
color = colors[i % len(colors)]
model_name = pred['model_name']
coords = pred.get('coords')
error = pred.get('error')
if coords is not None:
px, py = coords
# Draw crosshair
crosshair_size = 20
draw.line([(px-crosshair_size, py), (px+crosshair_size, py)], fill=color, width=4)
draw.line([(px, py-crosshair_size), (px, py+crosshair_size)], fill=color, width=4)
# Draw model name
draw.text((px+15, py+15), f"{model_name}: ({px},{py})", fill=color)
else:
# Draw error text
draw.text((10, 50 + i*20), f"{model_name}: ERROR - {error}", fill=color)
# Add instruction at the top
draw.text((10, 10), f"Instruction: {instruction}", fill="black")
# Save image
vis_image.save(output_file)
print(f"Prediction visualization saved to: {output_file}")
def take_screenshot() -> Image.Image:
"""
Take a screenshot of the current screen.
Returns:
PIL Image of the screenshot
"""
try:
import pyautogui
screenshot = pyautogui.screenshot()
return screenshot
except ImportError:
print("pyautogui not installed. Please install it with: pip install pyautogui")
raise
except Exception as e:
print(f"Error taking screenshot: {e}")
raise

View File

@@ -5,8 +5,7 @@ Example usage of the agent library with docstring-based tool definitions.
import asyncio
import logging
from agent import agent_loop, ComputerAgent
from agent.types import Messages
from agent import ComputerAgent
from computer import Computer
from computer.helpers import sandboxed

View File

@@ -19,10 +19,10 @@ dependencies = [
"pydantic>=2.6.4",
"rich>=13.7.1",
"python-dotenv>=1.0.1",
"cua-computer>=0.3.0,<0.5.0",
"cua-computer>=0.4.0,<0.5.0",
"cua-core>=0.1.8,<0.2.0",
"certifi>=2024.2.2",
"litellm>=1.74.8"
"litellm>=1.74.12"
]
requires-python = ">=3.11"
@@ -38,8 +38,15 @@ uitars-mlx = [
"mlx-vlm>=0.1.27; sys_platform == 'darwin'"
]
uitars-hf = [
"accelerate",
"torch",
"transformers>=4.54.0"
]
glm45v-hf = [
"accelerate",
"torch",
"transformers-v4.55.0-GLM-4.5V-preview"
]
ui = [
"gradio>=5.23.3",
"python-dotenv>=1.0.1",
@@ -47,18 +54,25 @@ ui = [
cli = [
"yaspin>=3.1.0",
]
hud = [
"hud-python==0.2.10",
]
all = [
# omni requirements
"ultralytics>=8.0.0",
"cua-som>=0.1.0,<0.2.0",
# uitars requirements
"mlx-vlm>=0.1.27; sys_platform == 'darwin'",
"accelerate",
"torch",
"transformers>=4.54.0",
# ui requirements
"gradio>=5.23.3",
"python-dotenv>=1.0.1",
# cli requirements
"yaspin>=3.1.0",
# hud requirements
"hud-python==0.2.10",
]
[tool.uv]

View File

@@ -23,6 +23,7 @@ logger = logging.getLogger(__name__)
# This allows the server to run in headless environments
try:
import pyautogui
pyautogui.FAILSAFE = False
logger.info("pyautogui successfully imported, GUI automation available")
except Exception as e:

View File

@@ -1,4 +1,5 @@
import pyautogui
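# Disable PyAutoGUI's fail-safe: moving the mouse into a screen corner would otherwise abort automation with FailSafeException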
pyautogui.FAILSAFE = False
from pynput.mouse import Button, Controller as MouseController
from pynput.keyboard import Key, Controller as KeyboardController
import time

View File

@@ -18,6 +18,7 @@ logger = logging.getLogger(__name__)
# Try to import pyautogui
try:
import pyautogui
pyautogui.FAILSAFE = False
logger.info("pyautogui successfully imported, GUI automation available")
except Exception as e:
logger.error(f"pyautogui import failed: {str(e)}. GUI operations will not work.")

View File

@@ -4,7 +4,7 @@ build-backend = "pdm.backend"
[project]
name = "cua-computer"
version = "0.3.0"
version = "0.4.0"
description = "Computer-Use Interface (CUI) framework powering Cua"
readme = "README.md"
authors = [

View File

@@ -16,6 +16,21 @@
</div>
**cua-mcp-server** is an MCP server for the Computer-Use Agent (CUA), allowing you to run CUA through Claude Desktop or other MCP clients.
## LiteLLM Integration
This MCP server features comprehensive liteLLM integration, allowing you to use any supported LLM provider with a simple model string configuration.
- **Unified Configuration**: Use a single `CUA_MODEL_NAME` environment variable with a model string
- **Automatic Provider Detection**: The agent automatically detects the provider and capabilities from the model string
- **Extensive Provider Support**: Works with Anthropic, OpenAI, local models, and any liteLLM-compatible provider
### Model String Examples:
- **Anthropic**: `"anthropic/claude-3-5-sonnet-20241022"`
- **OpenAI**: `"openai/computer-use-preview"`
- **UI-TARS**: `"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"`
- **Omni + Any LiteLLM**: `"omniparser+litellm/gpt-4o"`, `"omniparser+litellm/claude-3-haiku"`, `"omniparser+ollama_chat/gemma3"`
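Under the hood, the server simply forwards this model string to the Agent SDK; a simplified sketch of what `CUA_MODEL_NAME` turns into (see the server changes later in this diff):

```python
import logging
import os

from agent import ComputerAgent
from computer import Computer

model_name = os.getenv("CUA_MODEL_NAME", "anthropic/claude-3-5-sonnet-20241022")
computer = Computer(verbosity=logging.INFO)

agent = ComputerAgent(
    model=model_name,
    tools=[computer],
    only_n_most_recent_images=int(os.getenv("CUA_MAX_IMAGES", "3")),
)
# (the real server also awaits computer.run() before calling agent.run(messages))
```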
### Get started with Agent
## Prerequisites
@@ -65,10 +80,7 @@ You can then use the script in your MCP configuration like this:
"command": "/bin/bash",
"args": ["~/.cua/start_mcp_server.sh"],
"env": {
"CUA_AGENT_LOOP": "OMNI",
"CUA_MODEL_PROVIDER": "ANTHROPIC",
"CUA_MODEL_NAME": "claude-3-7-sonnet-20250219",
"CUA_PROVIDER_API_KEY": "your-api-key"
"CUA_MODEL_NAME": "anthropic/claude-3-5-sonnet-20241022"
}
}
}
@@ -86,11 +98,7 @@ If you want to develop with the cua-mcp-server directly without installation, yo
"command": "/bin/bash",
"args": ["~/cua/libs/python/mcp-server/scripts/start_mcp_server.sh"],
"env": {
"CUA_AGENT_LOOP": "UITARS",
"CUA_MODEL_PROVIDER": "OAICOMPAT",
"CUA_MODEL_NAME": "ByteDance-Seed/UI-TARS-1.5-7B",
"CUA_PROVIDER_BASE_URL": "https://****************.us-east-1.aws.endpoints.huggingface.cloud/v1",
"CUA_PROVIDER_API_KEY": "your-api-key"
"CUA_MODEL_NAME": "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
}
}
}
@@ -142,10 +150,7 @@ The server is configured using environment variables (can be set in the Claude D
| Variable | Description | Default |
|----------|-------------|---------|
| `CUA_AGENT_LOOP` | Agent loop to use (OPENAI, ANTHROPIC, UITARS, OMNI) | OMNI |
| `CUA_MODEL_PROVIDER` | Model provider (ANTHROPIC, OPENAI, OLLAMA, OAICOMPAT) | ANTHROPIC |
| `CUA_MODEL_NAME` | Model name to use | None (provider default) |
| `CUA_PROVIDER_BASE_URL` | Base URL for provider API | None |
| `CUA_MODEL_NAME` | Model string (e.g., "anthropic/claude-3-5-sonnet-20241022", "openai/computer-use-preview", "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", "omniparser+litellm/gpt-4o", "omniparser+ollama_chat/gemma3") | anthropic/claude-3-5-sonnet-20241022 |
| `CUA_MAX_IMAGES` | Maximum number of images to keep in context | 3 |
## Available Tools

View File

@@ -3,6 +3,7 @@ import base64
import logging
import os
import sys
from tabnanny import verbose
import traceback
from typing import Any, Dict, List, Optional, Union, Tuple
@@ -28,7 +29,7 @@ except ImportError as e:
try:
from computer import Computer
from agent import ComputerAgent, LLMProvider, LLM, AgentLoop
from agent import ComputerAgent
logger.debug("Successfully imported Computer and Agent modules")
except ImportError as e:
@@ -92,49 +93,27 @@ def serve() -> FastMCP:
global_computer = Computer(verbosity=logging.INFO)
await global_computer.run()
# Determine which loop to use
loop_str = os.getenv("CUA_AGENT_LOOP", "OMNI")
loop = getattr(AgentLoop, loop_str)
# Get model name - this now determines the loop and provider
model_name = os.getenv("CUA_MODEL_NAME", "anthropic/claude-3-5-sonnet-20241022")
logger.info(f"Using model: {model_name}")
# Determine provider
provider_str = os.getenv("CUA_MODEL_PROVIDER", "ANTHROPIC")
provider = getattr(LLMProvider, provider_str)
# Get model name (if specified)
model_name = os.getenv("CUA_MODEL_NAME", None)
# Get base URL for provider (if needed)
provider_base_url = os.getenv("CUA_PROVIDER_BASE_URL", None)
# Get api key for provider (if needed)
api_key = os.getenv("CUA_PROVIDER_API_KEY", None)
# Create agent with the specified configuration
# Create agent with the new v0.4.x API
agent = ComputerAgent(
computer=global_computer,
loop=loop,
model=LLM(
provider=provider,
name=model_name,
provider_base_url=provider_base_url,
),
api_key=api_key,
save_trajectory=False,
model=model_name,
only_n_most_recent_images=int(os.getenv("CUA_MAX_IMAGES", "3")),
verbosity=logging.INFO,
tools=[global_computer]
)
# Create messages in the new v0.4.x format
messages = [{"role": "user", "content": task}]
# Collect all results
full_result = ""
async for result in agent.run(task):
logger.info(f"Agent step complete: {result.get('id', 'unknown')}")
ctx.info(f"Agent step complete: {result.get('id', 'unknown')}")
# Add response ID to output
full_result += f"\n[Response ID: {result.get('id', 'unknown')}]\n"
if "content" in result:
full_result += f"Response: {result.get('content', '')}\n"
async for result in agent.run(messages):
logger.info(f"Agent processing step")
ctx.info(f"Agent processing step")
# Process output if available
outputs = result.get("output", [])
@@ -145,25 +124,23 @@ def serve() -> FastMCP:
content = output.get("content", [])
for content_part in content:
if content_part.get("text"):
full_result += f"\nMessage: {content_part.get('text', '')}\n"
elif output_type == "reasoning":
logger.debug(f"Reasoning: {output}")
summary_content = output.get("summary", [])
if summary_content:
for summary_part in summary_content:
if summary_part.get("text"):
full_result += f"\nReasoning: {summary_part.get('text', '')}\n"
full_result += f"Message: {content_part.get('text', '')}\n"
elif output_type == "tool_use":
logger.debug(f"Tool use: {output}")
tool_name = output.get("name", "")
full_result += f"Tool: {tool_name}\n"
elif output_type == "tool_result":
logger.debug(f"Tool result: {output}")
result_content = output.get("content", "")
if isinstance(result_content, list):
for item in result_content:
if item.get("type") == "text":
full_result += f"Result: {item.get('text', '')}\n"
else:
full_result += f"\nReasoning: {output.get('text', output.get('content', ''))}\n"
elif output_type == "computer_call":
logger.debug(f"Computer call: {output}")
action = output.get("action", "")
result_value = output.get("result", "")
full_result += f"\nComputer Action: {action}\nResult: {result_value}\n"
full_result += f"Result: {result_content}\n"
# Add separator between steps
full_result += "\n" + "-" * 40 + "\n"
full_result += "\n" + "-" * 20 + "\n"
logger.info(f"CUA task completed successfully")
ctx.info(f"CUA task completed successfully")
@@ -179,7 +156,21 @@ def serve() -> FastMCP:
error_msg = f"Error running CUA task: {str(e)}\n{traceback.format_exc()}"
logger.error(error_msg)
ctx.error(error_msg)
return f"Error during task execution: {str(e)}"
# Return tuple with error message and a screenshot if possible
try:
if global_computer is not None:
screenshot = await global_computer.interface.screenshot()
return (
f"Error during task execution: {str(e)}",
Image(format="png", data=screenshot)
)
except:
pass
# If we can't get a screenshot, return a placeholder
return (
f"Error during task execution: {str(e)}",
Image(format="png", data=b"")
)
@server.tool()
async def run_multi_cua_tasks(ctx: Context, tasks: List[str]) -> List:

View File

@@ -13,8 +13,8 @@ authors = [
]
dependencies = [
"mcp>=1.6.0,<2.0.0",
"cua-agent[all]>=0.3.0,<0.4.0",
"cua-computer>=0.3.0,<0.4.0",
"cua-agent[all]>=0.4.0,<0.5.0",
"cua-computer>=0.4.0,<0.5.0",
]
[project.scripts]

View File

@@ -379,7 +379,7 @@
"metadata": {},
"outputs": [],
"source": [
"from agent.ui.gradio.app import create_gradio_ui\n",
"from agent.ui.gradio.ui_components import create_gradio_ui\n",
"\n",
"app = create_gradio_ui()\n",
"app.launch(share=False)"

110050
notebooks/eval_osworld.ipynb Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -257,7 +257,7 @@ from pathlib import Path
from dotenv import load_dotenv
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
# Load environment variables from .env.local
load_dotenv(Path(__file__).parent / ".env.local")
@@ -292,7 +292,7 @@ from pathlib import Path
from dotenv import load_dotenv
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
# Load environment variables from .env.local
load_dotenv(Path(__file__).parent / ".env.local")