added docs for benchmarks and composed agents

This commit is contained in:
Dillon DuPont
2025-08-05 13:02:45 -04:00
parent 74a25f2003
commit 5168b6f082
12 changed files with 403 additions and 35 deletions

View File

@@ -0,0 +1,28 @@
---
title: Benchmarks
description: Computer Agent SDK benchmarks for agentic GUI tasks
---
The benchmark system evaluates models on GUI grounding tasks, measuring agent loop success rate and click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`; see the sketch below)
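For reference agent implementations, the harness accepts any class that satisfies its `ModelProtocol`. The sketch below is illustrative only: the authoritative protocol lives in the benchmark repository, and the `model_name`/`predict_click` shapes shown here are assumptions.
```python
from typing import Optional, Protocol, Tuple

from PIL import Image


class ModelProtocol(Protocol):
    """Assumed shape of the benchmark protocol; see the repo for the real definition."""

    @property
    def model_name(self) -> str: ...

    async def predict_click(
        self, image: Image.Image, instruction: str
    ) -> Optional[Tuple[int, int]]:
        """Return (x, y) pixel coordinates for the described element, or None."""
        ...


class MyReferenceModel:
    """Hypothetical custom model, used only to illustrate the protocol."""

    @property
    def model_name(self) -> str:
        return "my-reference-model"

    async def predict_click(self, image, instruction):
        # Run your own inference here and return pixel coordinates.
        return (100, 200)
```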
## Available Benchmarks
- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
- **[Interactive Testing](./interactive)** - Real-time testing and visualization
## Quick Start
```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks
# Install dependencies
pip install "cua-agent[all]"
# Run a benchmark
python ss-v2.py
```

View File

@@ -0,0 +1,21 @@
---
title: Interactive Tool
description: Real-time testing and visualization tool for GUI grounding models
---
This tool allows you to test multiple models interactively by providing natural language instructions. It automatically captures screenshots and tests all configured models sequentially, providing immediate feedback and visual results.
## Usage
```bash
# Start the interactive tool
cd libs/python/agent/benchmarks
python interactive.py
```
## Commands
- **Type an instruction**: Takes a screenshot and tests all configured models
- **`screenshot`**: Take a screenshot without running predictions
- **`models`**: List available models
- **`quit`/`exit`**: Exit the tool

View File

@@ -0,0 +1,57 @@
---
title: Introduction
description: Overview of benchmarking in the c/ua agent framework
---
The c/ua agent framework uses benchmarks to test how well supported models and providers perform on various agentic tasks.
## Benchmark Types
Computer-Agent benchmarks evaluate two key capabilities:
- **Plan Generation**: Breaking down complex tasks into a sequence of actions
- **Coordinate Generation**: Predicting precise click locations on GUI elements
## Using State-of-the-Art Models
Let's see how to use the SOTA vision-language models in the c/ua agent framework.
### Plan Generation + Coordinate Generation
**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents
This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.
```python
# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM
# This makes it suitable for full computer-use agent loops
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```
### Coordinate Generation Only
**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy
This leaderboard tests models that specialize in finding exactly where to click on screen elements, but they need to be told what specific action to take.
```python
# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates; it cannot plan:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings") # (27, 450)
# This will raise an error:
# agent.run("Open Firefox and go to github.com")
```
### Composed Agent
The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any LiteLLM-compatible model can be used as the plan generation model.
```python
# It can be paired with any LLM to form a composed agent:
# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```

View File

@@ -0,0 +1,8 @@
{
"pages": [
"introduction",
"screenspot-v2",
"screenspot-pro",
"interactive"
]
}

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-Pro
description: High-resolution GUI grounding benchmark
---
ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-pro.py
# Run with custom sample limit
python ss-pro.py --samples 50
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-v2
description: Standard resolution GUI grounding benchmark
---
ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-v2.py
# Run with custom sample limit
python ss-v2.py --samples 100
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -3,13 +3,14 @@
"description": "Build computer-using agents with the Agent SDK",
"pages": [
"agent-loops",
"supported-agents",
"supported-agents",
"chat-history",
"callbacks",
"sandboxed-tools",
"local-models",
"prompt-caching",
"usage-tracking",
"benchmarks",
"migration-guide"
]
}

View File

@@ -1,34 +0,0 @@
---
title: Supported Agents
---
This page lists all supported agent loops and their compatible models/configurations in cua.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.
## Anthropic CUAs
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
## OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
## UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Omniparser + LLMs
- `omniparser+vertex_ai/gemini-pro`
- `omniparser+openai/gpt-4o`
- Any LiteLLM-compatible model combined with Omniparser
---
For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).

View File

@@ -0,0 +1,106 @@
---
title: Composed Agents
description: Combine grounding models with any LLM for computer-use capabilities
---
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
## How Composed Agents Work
1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates (a simplified sketch of this loop follows below)
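Conceptually, one iteration of a composed agent behaves like the sketch below. This is illustrative pseudocode rather than the SDK's actual agent loop, and the helper names (`plan_next_action`, the `interface` calls) are assumptions.
```python
# Illustrative sketch of a single composed-agent step (not the SDK's real internals).
async def composed_step(task, screenshot, thinking_model, grounding_model, computer):
    # 1. Planning: the thinking LLM proposes the next action, describing the
    #    target element in natural language instead of raw coordinates.
    action = await thinking_model.plan_next_action(task, screenshot)
    # e.g. {"type": "click", "element": "the login button"}

    if action["type"] == "click":
        # 2. Grounding: the grounding model converts the description to (x, y).
        x, y = await grounding_model.predict_click(screenshot, action["element"])
        # 3. Execution: the predicted coordinates drive the computer tool.
        await computer.interface.left_click(x, y)
    elif action["type"] == "type":
        await computer.interface.type_text(action["text"])
```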
## Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component:
- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
## Supported Thinking Models
Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
## Usage Examples
### GTA1 + Claude 3.5 Sonnet
Combine state-of-the-art grounding with powerful reasoning:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
tools=[computer]
)
async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```
### GTA1 + Gemini Pro
Use Google's Gemini for planning with specialized grounding:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
tools=[computer]
)
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### UI-TARS + GPT-4o
Combine two different vision models for enhanced capabilities:
```python
agent = ComputerAgent(
"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
tools=[computer]
)
async for _ in agent.run("Help me fill out this form with my personal information"):
pass
```
## Benefits of Composed Agents
- **Specialized Grounding**: Use models optimized for click prediction accuracy
- **Flexible Planning**: Choose any LLM for task reasoning and planning
- **Cost Optimization**: Pair lightweight grounding models with larger planning models, so the more expensive model is used only for planning
- **Performance**: Leverage the strengths of different model architectures
## Capabilities
Composed agents support both the full agent loop and direct click prediction:
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")
# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
pass
# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")
```
---
For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).

View File

@@ -0,0 +1,53 @@
---
title: Computer-Use Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---
These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.
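The snippets on this page assume that `ComputerAgent` has been imported and that a `computer` tool is available. A minimal setup sketch is shown below; the import paths assume the Agent and Computer SDK packages, and the `Computer()` configuration options depend on your OS and provider, so treat them as placeholders.
```python
# Assumed imports from the cua-agent and cua-computer packages.
from agent import ComputerAgent
from computer import Computer

# Placeholder setup: pass the OS/provider options for your environment
# (see the Computer SDK docs for the available arguments).
computer = Computer()
```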
## Anthropic CUAs
Claude models with computer-use capabilities:
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
```python
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
pass
```
## OpenAI CUA Preview
OpenAI's computer-use preview model:
- Computer-use-preview: `computer-use-preview`
```python
agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
async for _ in agent.run("Take a screenshot and describe what you see"):
pass
```
## UI-TARS 1.5
Unified vision-language model for computer-use:
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
pass
```
---
For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

View File

@@ -0,0 +1,69 @@
---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---
These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
## All Computer-Use Agents
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
### Anthropic CUAs
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
### UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Specialized Grounding Models
These models are optimized specifically for click prediction and UI element grounding:
### GTA1-7B
State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
- `huggingface-local/HelloKKMe/GTA1-7B`
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```
## Usage Examples
```python
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
# Take a screenshot first
screenshot = agent.computer.screenshot()
# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")
print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```
---
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).

View File

@@ -0,0 +1,9 @@
{
"title": "Supported Agents",
"description": "Models and configurations supported by the Agent SDK",
"pages": [
"computer-use-agents",
"grounding-models",
"composed-agents"
]
}