Mirror of https://github.com/trycua/computer.git, synced 2026-01-05 12:59:58 -06:00
added docs for benchmarks and composed agents
28 docs/content/docs/agent-sdk/benchmarks/index.mdx Normal file
@@ -0,0 +1,28 @@
---
title: Benchmarks
description: Computer Agent SDK benchmarks for agentic GUI tasks
---

The benchmark system evaluates models on GUI grounding tasks, specifically agent loop success rate and click prediction accuracy. It supports both:

- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`, as sketched below)
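A reference implementation only needs to satisfy the protocol the benchmark harness expects. The sketch below is illustrative: the exact members of `ModelProtocol` (here `model_name` and an async `predict_click()`) are assumptions, so check the benchmark repository for the authoritative definition.

```python
from typing import Optional, Protocol, Tuple

from PIL import Image


class ModelProtocol(Protocol):
    """Assumed protocol surface; the real definition lives in the benchmark repo."""

    @property
    def model_name(self) -> str: ...

    async def predict_click(
        self, image: Image.Image, instruction: str
    ) -> Optional[Tuple[int, int]]: ...


class MyGroundingModel:
    """A stub reference implementation, benchmarkable alongside SDK providers."""

    @property
    def model_name(self) -> str:
        return "my-grounding-model"

    async def predict_click(self, image, instruction):
        # Run your own inference here and return (x, y) pixel coordinates,
        # or None when no element matches the instruction.
        return None
```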

## Available Benchmarks

- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
- **[Interactive Testing](./interactive)** - Real-time testing and visualization

## Quick Start

```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks

# Install dependencies
pip install "cua-agent[all]"

# Run a benchmark
python ss-v2.py
```
21 docs/content/docs/agent-sdk/benchmarks/interactive.mdx Normal file
@@ -0,0 +1,21 @@
---
title: Interactive Tool
description: Real-time testing and visualization tool for GUI grounding models
---

This tool lets you test multiple models interactively by providing natural language instructions. It automatically captures a screenshot, tests all configured models against it sequentially, and provides immediate feedback and visual results (see the sketch at the end of this page for how the model list might look).

## Usage

```bash
# Start the interactive tool
cd libs/python/agent/benchmarks
python interactive.py
```

## Commands

- **Type an instruction**: Take a screenshot and test all models against it
- **`screenshot`**: Take a screenshot without running predictions
- **`models`**: List available models
- **`quit`** / **`exit`**: Exit the tool
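As a hypothetical sketch (the actual configuration lives in `interactive.py` and may differ), the list of models under test might look like this; SDK model strings and custom `ModelProtocol` implementations can sit side by side:

```python
# Hypothetical sketch of the model list the interactive tool iterates over;
# the real configuration is defined in interactive.py and may differ.
MODELS = [
    "huggingface-local/HelloKKMe/GTA1-7B",              # SDK provider string
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B",  # unified VLM
    # MyCustomModel(),  # any class implementing ModelProtocol
]
```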
57 docs/content/docs/agent-sdk/benchmarks/introduction.mdx Normal file
@@ -0,0 +1,57 @@
---
title: Introduction
description: Overview of benchmarking in the c/ua agent framework
---

The c/ua agent framework uses benchmarks to test the performance of supported models and providers on various agentic tasks.

## Benchmark Types

Computer-Agent benchmarks evaluate two key capabilities:

- **Plan Generation**: Breaking down complex tasks into a sequence of actions
- **Coordinate Generation**: Predicting precise click locations on GUI elements

## Using State-of-the-Art Models

Let's see how to use state-of-the-art vision-language models in the c/ua agent framework.

### Plan Generation + Coordinate Generation

**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents

This leaderboard tests models that can understand instructions and autonomously perform the full sequence of actions needed to complete tasks.

```python
# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM,
# which makes it suitable for agentic computer-use loops
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open Firefox and go to github.com"):
    pass
# Success! 🎉
```

### Coordinate Generation Only

**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy

This leaderboard tests models that specialize in finding exactly where to click on screen elements, but they need to be told what specific action to take.

```python
# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates, not plans:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings")  # (27, 450)

# This will raise an error:
# agent.run("Open Firefox and go to github.com")
```

### Composed Agent

The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any LiteLLM model can be used as the plan generation model.

```python
# GTA1-7B can be paired with any LLM to form a composed agent;
# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
async for _ in agent.run("Open Firefox and go to github.com"):
    pass
# Success! 🎉
```
8 docs/content/docs/agent-sdk/benchmarks/meta.json Normal file
@@ -0,0 +1,8 @@
{
  "pages": [
    "introduction",
    "screenspot-v2",
    "screenspot-pro",
    "interactive"
  ]
}
25 docs/content/docs/agent-sdk/benchmarks/screenspot-pro.mdx Normal file
@@ -0,0 +1,25 @@
---
title: ScreenSpot-Pro
description: High-resolution GUI grounding benchmark
---

ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.

## Usage

```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-pro.py

# Run with custom sample limit
python ss-pro.py --samples 50
```

## Results

| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |

Results will be populated after running benchmarks with various models.
25 docs/content/docs/agent-sdk/benchmarks/screenspot-v2.mdx Normal file
@@ -0,0 +1,25 @@
---
title: ScreenSpot-v2
description: Standard resolution GUI grounding benchmark
---

ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.

## Usage

```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-v2.py

# Run with custom sample limit
python ss-v2.py --samples 100
```

## Results

| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |

Results will be populated after running benchmarks with various models.
docs/content/docs/agent-sdk/meta.json
@@ -3,13 +3,14 @@
  "description": "Build computer-using agents with the Agent SDK",
  "pages": [
    "agent-loops",
    "supported-agents",
    "chat-history",
    "callbacks",
    "sandboxed-tools",
    "local-models",
    "prompt-caching",
    "usage-tracking",
    "benchmarks",
    "migration-guide"
  ]
}
docs/content/docs/agent-sdk/supported-agents.mdx (deleted file)
@@ -1,34 +0,0 @@
---
title: Supported Agents
---

This page lists all supported agent loops and their compatible models/configurations in cua.

All agent loops are compatible with any LLM provider supported by LiteLLM.

See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.

## Anthropic CUAs

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

## OpenAI CUA Preview

- Computer-use-preview: `computer-use-preview`

## UI-TARS 1.5

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

## Omniparser + LLMs

- `omniparser+vertex_ai/gemini-pro`
- `omniparser+openai/gpt-4o`
- Any LiteLLM-compatible model combined with Omniparser

---

For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).
106 docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx Normal file
@@ -0,0 +1,106 @@
---
title: Composed Agents
description: Combine grounding models with any LLM for computer-use capabilities
---

Composed agents combine the best of both worlds: specialized grounding models for precise click prediction, and powerful LLMs for task planning and reasoning.

Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.

## How Composed Agents Work

1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions into precise coordinates (see the sketch below)
3. **Execution**: Actions are performed using the predicted coordinates
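Conceptually, one step of the handoff looks roughly like the following. Everything here is a hypothetical stand-in: `plan()`, the action object, and the `computer` methods are illustrative names rather than the SDK's actual API; the framework wires this loop up internally when you pass a composed model string.

```python
# Hypothetical sketch of one planning/grounding/execution step.
# plan(), action.kind, action.target, and the computer methods are
# illustrative stand-ins, not the SDK's actual API.
async def composed_step(thinking_model, grounding_model, computer, task: str):
    screenshot = await computer.screenshot()              # current screen state
    action = await thinking_model.plan(task, screenshot)  # e.g. click("login button")
    if action.kind == "click":
        # Grounding phase: turn the element description into coordinates
        x, y = await grounding_model.predict_click(screenshot, action.target)
        await computer.click(x, y)                        # execution phase
```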

## Supported Grounding Models

Any model that supports `predict_click()` can be used as the grounding component:

- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)

## Supported Thinking Models

Any vision-enabled LiteLLM-compatible model can be used as the thinking component:

- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model

## Usage Examples

### GTA1 + Claude 3.5 Sonnet

Combine state-of-the-art grounding with powerful reasoning:

```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)

async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```

### GTA1 + Gemini Pro

Use Google's Gemini for planning with specialized grounding:

```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass
```

### UI-TARS + GPT-4o

Combine two different vision models for enhanced capabilities:

```python
agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass
```

## Benefits of Composed Agents

- **Specialized Grounding**: Use models optimized for click prediction accuracy
- **Flexible Planning**: Choose any LLM for task reasoning and planning
- **Cost Optimization**: Pair a small grounding model with a larger planning model only when a task needs it
- **Performance**: Leverage the strengths of different model architectures

## Capabilities

Composed agents support both capabilities:

```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
    pass

# Direct click prediction (uses the grounding model only)
coords = agent.predict_click("find the submit button")
```

---

For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).
53 docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx Normal file
@@ -0,0 +1,53 @@
---
title: Computer-Use Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---

These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.

All agent loops are compatible with any LLM provider supported by LiteLLM.

See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.

## Anthropic CUAs

Claude models with computer-use capabilities:

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

```python
agent = ComputerAgent("claude-3-5-sonnet-20240620", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
    pass
```

## OpenAI CUA Preview

OpenAI's computer-use preview model:

- Computer-use-preview: `computer-use-preview`

```python
agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
async for _ in agent.run("Take a screenshot and describe what you see"):
    pass
```

## UI-TARS 1.5

Unified vision-language model for computer-use:

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
    pass
```

---

For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).
69 docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx Normal file
@@ -0,0 +1,69 @@
---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---

These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but they cannot perform autonomous task planning.

Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.

## All Computer-Use Agents

All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:

### Anthropic CUAs

- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`

### OpenAI CUA Preview

- Computer-use-preview: `computer-use-preview`

### UI-TARS 1.5

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a TGI endpoint)

## Specialized Grounding Models

These models are optimized specifically for click prediction and UI element grounding:

### GTA1-7B

State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):

- `huggingface-local/HelloKKMe/GTA1-7B`

```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # (450, 320)

# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```

## Usage Examples

```python
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20240620", tools=[computer])

# Take a screenshot first
screenshot = agent.computer.screenshot()

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```

---

For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
9 docs/content/docs/agent-sdk/supported-agents/meta.json Normal file
@@ -0,0 +1,9 @@
{
  "title": "Supported Agents",
  "description": "Models and configurations supported by the Agent SDK",
  "pages": [
    "computer-use-agents",
    "grounding-models",
    "composed-agents"
  ]
}