Mirror of https://github.com/trycua/computer.git, synced 2026-01-05 12:59:58 -06:00
added docs for benchmarks and composed agents
28 docs/content/docs/agent-sdk/benchmarks/index.mdx Normal file
@@ -0,0 +1,28 @@
---
title: Benchmarks
description: Computer Agent SDK benchmarks for agentic GUI tasks
---

The benchmark system evaluates models on GUI grounding tasks, specifically agent loop success rate and click prediction accuracy. It supports both:

- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`, as sketched below)
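A reference implementation only needs to satisfy the protocol the benchmark harness expects. The sketch below is illustrative: the exact members of `ModelProtocol` (here `model_name` and an async `predict_click()`) are assumptions, so check the benchmark repository for the authoritative definition.

```python
from typing import Optional, Protocol, Tuple

from PIL import Image


class ModelProtocol(Protocol):
    """Assumed protocol surface; the real definition lives in the benchmark repo."""

    @property
    def model_name(self) -> str: ...

    async def predict_click(
        self, image: Image.Image, instruction: str
    ) -> Optional[Tuple[int, int]]: ...


class MyGroundingModel:
    """A stub reference implementation, benchmarkable alongside SDK providers."""

    @property
    def model_name(self) -> str:
        return "my-grounding-model"

    async def predict_click(self, image, instruction):
        # Run your own inference here and return (x, y) pixel coordinates,
        # or None when no element matches the instruction.
        return None
```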

## Available Benchmarks

- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
- **[Interactive Testing](./interactive)** - Real-time testing and visualization

## Quick Start

```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks

# Install dependencies
pip install "cua-agent[all]"

# Run a benchmark
python ss-v2.py
```
21 docs/content/docs/agent-sdk/benchmarks/interactive.mdx Normal file
@@ -0,0 +1,21 @@
---
title: Interactive Tool
description: Real-time testing and visualization tool for GUI grounding models
---

This tool lets you test multiple models interactively by providing natural language instructions. It automatically captures a screenshot, tests all configured models against it sequentially, and provides immediate feedback and visual results (see the sketch at the end of this page for how the model list might look).

## Usage

```bash
# Start the interactive tool
cd libs/python/agent/benchmarks
python interactive.py
```

## Commands

- **Type an instruction**: Take a screenshot and test all models against it
- **`screenshot`**: Take a screenshot without running predictions
- **`models`**: List available models
- **`quit`** / **`exit`**: Exit the tool
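As a hypothetical sketch (the actual configuration lives in `interactive.py` and may differ), the list of models under test might look like this; SDK model strings and custom `ModelProtocol` implementations can sit side by side:

```python
# Hypothetical sketch of the model list the interactive tool iterates over;
# the real configuration is defined in interactive.py and may differ.
MODELS = [
    "huggingface-local/HelloKKMe/GTA1-7B",              # SDK provider string
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B",  # unified VLM
    # MyCustomModel(),  # any class implementing ModelProtocol
]
```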
57 docs/content/docs/agent-sdk/benchmarks/introduction.mdx Normal file
@@ -0,0 +1,57 @@
---
title: Introduction
description: Overview of benchmarking in the c/ua agent framework
---

The c/ua agent framework uses benchmarks to test the performance of supported models and providers on various agentic tasks.

## Benchmark Types

Computer-Agent benchmarks evaluate two key capabilities:

- **Plan Generation**: Breaking down complex tasks into a sequence of actions
- **Coordinate Generation**: Predicting precise click locations on GUI elements

## Using State-of-the-Art Models

Let's see how to use state-of-the-art vision-language models in the c/ua agent framework.

### Plan Generation + Coordinate Generation

**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents

This leaderboard tests models that can understand instructions and autonomously perform the full sequence of actions needed to complete tasks.

```python
# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM,
# which makes it suitable for agentic computer-use loops
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open Firefox and go to github.com"):
    pass
# Success! 🎉
```

### Coordinate Generation Only

**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy

This leaderboard tests models that specialize in finding exactly where to click on screen elements, but they need to be told what specific action to take.

```python
# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates, not plans:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings")  # (27, 450)

# This will raise an error:
# agent.run("Open Firefox and go to github.com")
```

### Composed Agent

The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any LiteLLM model can be used as the plan generation model.

```python
# GTA1-7B can be paired with any LLM to form a composed agent;
# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
async for _ in agent.run("Open Firefox and go to github.com"):
    pass
# Success! 🎉
```
8 docs/content/docs/agent-sdk/benchmarks/meta.json Normal file
@@ -0,0 +1,8 @@
{
  "pages": [
    "introduction",
    "screenspot-v2",
    "screenspot-pro",
    "interactive"
  ]
}
25 docs/content/docs/agent-sdk/benchmarks/screenspot-pro.mdx Normal file
@@ -0,0 +1,25 @@
---
title: ScreenSpot-Pro
description: High-resolution GUI grounding benchmark
---

ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.

## Usage

```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-pro.py

# Run with custom sample limit
python ss-pro.py --samples 50
```

## Results

| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |

Results will be populated after running benchmarks with various models.
25 docs/content/docs/agent-sdk/benchmarks/screenspot-v2.mdx Normal file
@@ -0,0 +1,25 @@
---
title: ScreenSpot-v2
description: Standard resolution GUI grounding benchmark
---

ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.

## Usage

```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-v2.py

# Run with custom sample limit
python ss-v2.py --samples 100
```

## Results

| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |

Results will be populated after running benchmarks with various models.
docs/content/docs/agent-sdk/meta.json
@@ -3,13 +3,14 @@
  "description": "Build computer-using agents with the Agent SDK",
  "pages": [
    "agent-loops",
    "supported-agents",
    "chat-history",
    "callbacks",
    "sandboxed-tools",
    "local-models",
    "prompt-caching",
    "usage-tracking",
    "benchmarks",
    "migration-guide"
  ]
}
docs/content/docs/agent-sdk/supported-agents.mdx (deleted file)
@@ -1,34 +0,0 @@
---
title: Supported Agents
---

This page lists all supported agent loops and their compatible models/configurations in cua.

All agent loops are compatible with any LLM provider supported by LiteLLM.

See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.

## Anthropic CUAs

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

## OpenAI CUA Preview

- Computer-use-preview: `computer-use-preview`

## UI-TARS 1.5

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

## Omniparser + LLMs

- `omniparser+vertex_ai/gemini-pro`
- `omniparser+openai/gpt-4o`
- Any LiteLLM-compatible model combined with Omniparser

---

For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).
106 docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx Normal file
@@ -0,0 +1,106 @@
---
title: Composed Agents
description: Combine grounding models with any LLM for computer-use capabilities
---

Composed agents combine the best of both worlds: specialized grounding models for precise click prediction, and powerful LLMs for task planning and reasoning.

Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.

## How Composed Agents Work

1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions into precise coordinates (see the sketch below)
3. **Execution**: Actions are performed using the predicted coordinates
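Conceptually, one step of the handoff looks roughly like the following. Everything here is a hypothetical stand-in: `plan()`, the action object, and the `computer` methods are illustrative names rather than the SDK's actual API; the framework wires this loop up internally when you pass a composed model string.

```python
# Hypothetical sketch of one planning/grounding/execution step.
# plan(), action.kind, action.target, and the computer methods are
# illustrative stand-ins, not the SDK's actual API.
async def composed_step(thinking_model, grounding_model, computer, task: str):
    screenshot = await computer.screenshot()              # current screen state
    action = await thinking_model.plan(task, screenshot)  # e.g. click("login button")
    if action.kind == "click":
        # Grounding phase: turn the element description into coordinates
        x, y = await grounding_model.predict_click(screenshot, action.target)
        await computer.click(x, y)                        # execution phase
```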

## Supported Grounding Models

Any model that supports `predict_click()` can be used as the grounding component:

- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)

## Supported Thinking Models

Any vision-enabled LiteLLM-compatible model can be used as the thinking component:

- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model

## Usage Examples

### GTA1 + Claude 3.5 Sonnet

Combine state-of-the-art grounding with powerful reasoning:

```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)

async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```

### GTA1 + Gemini Pro

Use Google's Gemini for planning with specialized grounding:

```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass
```

### UI-TARS + GPT-4o

Combine two different vision models for enhanced capabilities:

```python
agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass
```

## Benefits of Composed Agents

- **Specialized Grounding**: Use models optimized for click prediction accuracy
- **Flexible Planning**: Choose any LLM for task reasoning and planning
- **Cost Optimization**: Pair a small grounding model with a larger planning model only when a task needs it
- **Performance**: Leverage the strengths of different model architectures

## Capabilities

Composed agents support both capabilities:

```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
    pass

# Direct click prediction (uses the grounding model only)
coords = agent.predict_click("find the submit button")
```

---

For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).
53 docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx Normal file
@@ -0,0 +1,53 @@
---
title: Computer-Use Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---

These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.

All agent loops are compatible with any LLM provider supported by LiteLLM.

See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.

## Anthropic CUAs

Claude models with computer-use capabilities:

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

```python
agent = ComputerAgent("claude-3-5-sonnet-20240620", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
    pass
```

## OpenAI CUA Preview

OpenAI's computer-use preview model:

- Computer-use-preview: `computer-use-preview`

```python
agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
async for _ in agent.run("Take a screenshot and describe what you see"):
    pass
```

## UI-TARS 1.5

Unified vision-language model for computer-use:

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
    pass
```

---

For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).
69 docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx Normal file
@@ -0,0 +1,69 @@
---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---

These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but they cannot perform autonomous task planning.

Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.

## All Computer-Use Agents

All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:

### Anthropic CUAs

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

### OpenAI CUA Preview

- Computer-use-preview: `computer-use-preview`

### UI-TARS 1.5

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

## Specialized Grounding Models

These models are optimized specifically for click prediction and UI element grounding:

### GTA1-7B

State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):

- `huggingface-local/HelloKKMe/GTA1-7B`

```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # (450, 320)

# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```

## Usage Examples

```python
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20240620", tools=[computer])

# Take a screenshot first
screenshot = agent.computer.screenshot()

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```

---

For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
9 docs/content/docs/agent-sdk/supported-agents/meta.json Normal file
@@ -0,0 +1,9 @@
{
  "title": "Supported Agents",
  "description": "Models and configurations supported by the Agent SDK",
  "pages": [
    "computer-use-agents",
    "grounding-models",
    "composed-agents"
  ]
}