diff --git a/docs/content/docs/agent-sdk/benchmarks/index.mdx b/docs/content/docs/agent-sdk/benchmarks/index.mdx
new file mode 100644
index 00000000..59e9b7ad
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/index.mdx
@@ -0,0 +1,28 @@
+---
+title: Benchmarks
+description: Computer Agent SDK benchmarks for agentic GUI tasks
+---
+
+The benchmark system evaluates models on agentic GUI tasks, measuring agent loop success rate and click prediction accuracy. It supports both:
+- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
+- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)
+
+## Available Benchmarks
+
+- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
+- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
+- **[Interactive Testing](./interactive)** - Real-time testing and visualization
+
+## Quick Start
+
+```bash
+# Clone the benchmark repository
+git clone https://github.com/trycua/cua
+cd cua/libs/python/agent/benchmarks
+
+# Install dependencies
+pip install "cua-agent[all]"
+
+# Run a benchmark
+python ss-v2.py
+```
diff --git a/docs/content/docs/agent-sdk/benchmarks/interactive.mdx b/docs/content/docs/agent-sdk/benchmarks/interactive.mdx
new file mode 100644
index 00000000..43170ca4
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/interactive.mdx
@@ -0,0 +1,21 @@
+---
+title: Interactive Tool
+description: Real-time testing and visualization tool for GUI grounding models
+---
+
+This tool allows you to test multiple models interactively by providing natural language instructions. It automatically captures screenshots and tests all configured models sequentially, giving immediate feedback and visual results.
+
+## Usage
+
+```bash
+# Start the interactive tool
+cd libs/python/agent/benchmarks
+python interactive.py
+```
+
+## Commands
+
+- **Type an instruction**: Capture a screenshot and test all models against it
+- **`screenshot`**: Take a screenshot without running predictions
+- **`models`**: List available models
+- **`quit`/`exit`**: Exit the tool
diff --git a/docs/content/docs/agent-sdk/benchmarks/introduction.mdx b/docs/content/docs/agent-sdk/benchmarks/introduction.mdx
new file mode 100644
index 00000000..3f2251f8
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/introduction.mdx
@@ -0,0 +1,57 @@
+---
+title: Introduction
+description: Overview of benchmarking in the c/ua agent framework
+---
+
+The c/ua agent framework uses benchmarks to test the performance of supported models and providers on various agentic tasks.
+
+## Benchmark Types
+
+Computer-Agent benchmarks evaluate two key capabilities:
+- **Plan Generation**: Breaking down complex tasks into a sequence of actions
+- **Coordinate Generation**: Predicting precise click locations on GUI elements
+
+## Using State-of-the-Art Models
+
+Let's look at how to use state-of-the-art (SOTA) vision-language models in the c/ua agent framework.
+
+### Plan Generation + Coordinate Generation
+
+**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents
+
+This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.
+
+```python
+# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM
+# This makes it suitable for computer-use agent loops
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+agent.run("Open Firefox and go to github.com")
+# Success! 🎉
+```
+
+### Coordinate Generation Only
+
+**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy
+
+This leaderboard tests models that specialize in finding exactly where to click on screen elements, but need to be told what specific action to take.
+
+```python
+# GTA1-7B is a SOTA coordinate generation VLM
+# It can only generate coordinates; it cannot plan:
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+agent.predict_click("find the button to open the settings") # (27, 450)
+# This will raise an error:
+# agent.run("Open Firefox and go to github.com")
+```
+
+### Composed Agent
+
+The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any LiteLLM-compatible model can be used as the plan generation model.
+
+```python
+# GTA1-7B can be paired with any LLM to form a composed agent:
+# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
+agent.run("Open Firefox and go to github.com")
+# Success! 🎉
+```
diff --git a/docs/content/docs/agent-sdk/benchmarks/meta.json b/docs/content/docs/agent-sdk/benchmarks/meta.json
new file mode 100644
index 00000000..aa49a156
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/meta.json
@@ -0,0 +1,8 @@
+{
+  "pages": [
+    "introduction",
+    "screenspot-v2",
+    "screenspot-pro",
+    "interactive"
+  ]
+}
\ No newline at end of file
diff --git a/docs/content/docs/agent-sdk/benchmarks/screenspot-pro.mdx b/docs/content/docs/agent-sdk/benchmarks/screenspot-pro.mdx
new file mode 100644
index 00000000..402b919e
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/screenspot-pro.mdx
@@ -0,0 +1,25 @@
+---
+title: ScreenSpot-Pro
+description: High-resolution GUI grounding benchmark
+---
+
+ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.
+
+## Usage
+
+```bash
+# Run the benchmark
+cd libs/python/agent/benchmarks
+python ss-pro.py
+
+# Run with custom sample limit
+python ss-pro.py --samples 50
+```
+
+## Results
+
+| Model | Accuracy | Failure Rate | Samples |
+|-------|----------|--------------|---------|
+| Coming Soon | - | - | - |
+
+Results will be populated after running benchmarks with various models.
diff --git a/docs/content/docs/agent-sdk/benchmarks/screenspot-v2.mdx b/docs/content/docs/agent-sdk/benchmarks/screenspot-v2.mdx
new file mode 100644
index 00000000..6cfcf1c1
--- /dev/null
+++ b/docs/content/docs/agent-sdk/benchmarks/screenspot-v2.mdx
@@ -0,0 +1,25 @@
+---
+title: ScreenSpot-v2
+description: Standard resolution GUI grounding benchmark
+---
+
+ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.
+
+## Usage
+
+```bash
+# Run the benchmark
+cd libs/python/agent/benchmarks
+python ss-v2.py
+
+# Run with custom sample limit
+python ss-v2.py --samples 100
+```
+
+## Results
+
+| Model | Accuracy | Failure Rate | Samples |
+|-------|----------|--------------|---------|
+| Coming Soon | - | - | - |
+
+Results will be populated after running benchmarks with various models.
diff --git a/docs/content/docs/agent-sdk/meta.json b/docs/content/docs/agent-sdk/meta.json
index 933452cb..fadc5a12 100644
--- a/docs/content/docs/agent-sdk/meta.json
+++ b/docs/content/docs/agent-sdk/meta.json
@@ -3,13 +3,14 @@
   "description": "Build computer-using agents with the Agent SDK",
   "pages": [
     "agent-loops",
-    "supported-agents",
+    "supported-agents",
     "chat-history",
     "callbacks",
     "sandboxed-tools",
     "local-models",
     "prompt-caching",
     "usage-tracking",
+    "benchmarks",
     "migration-guide"
   ]
 }
diff --git a/docs/content/docs/agent-sdk/supported-agents.mdx b/docs/content/docs/agent-sdk/supported-agents.mdx
deleted file mode 100644
index 61abf521..00000000
--- a/docs/content/docs/agent-sdk/supported-agents.mdx
+++ /dev/null
@@ -1,34 +0,0 @@
----
-title: Supported Agents
----
-
-This page lists all supported agent loops and their compatible models/configurations in cua.
-
-All agent loops are compatible with any LLM provider supported by LiteLLM.
-
-See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.
-
-## Anthropic CUAs
-
-- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
-- Claude 3.7: `claude-3-7-sonnet-20250219`
-- Claude 3.5: `claude-3-5-sonnet-20240620`
-
-## OpenAI CUA Preview
-
-- Computer-use-preview: `computer-use-preview`
-
-## UI-TARS 1.5
-
-- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
-- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
-
-## Omniparser + LLMs
-
-- `omniparser+vertex_ai/gemini-pro`
-- `omniparser+openai/gpt-4o`
-- Any LiteLLM-compatible model combined with Omniparser
-
----
-
-For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).
diff --git a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
new file mode 100644
index 00000000..50160fd8
--- /dev/null
+++ b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
@@ -0,0 +1,106 @@
+---
+title: Composed Agents
+description: Combine grounding models with any LLM for computer-use capabilities
+---
+
+Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
+
+Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
+
+## How Composed Agents Work
+
+1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
+2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
+3. **Execution**: Actions are performed using the predicted coordinates
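+
+Conceptually, a single step of this loop looks like the sketch below. This is illustrative only: the `plan_next_action` stub stands in for the thinking model, and the real loop is driven internally by `ComputerAgent`.
+
+```python
+# Illustrative sketch of one composed-agent step (not the SDK's internal implementation).
+
+def plan_next_action(task: str) -> tuple[str, str]:
+    # Stub planner: a real thinking model would choose this from the task and a screenshot.
+    return "click", "find the login button"
+
+def composed_step(grounding_agent, task: str) -> tuple[int, int]:
+    # 1. Planning phase: the thinking model proposes an action and a target element
+    action, element_description = plan_next_action(task)
+    if action != "click":
+        raise NotImplementedError(f"sketch only handles clicks, got {action!r}")
+    # 2. Grounding phase: the grounding model resolves the description to coordinates
+    x, y = grounding_agent.predict_click(element_description)
+    # 3. Execution phase: the agent would now perform the click at (x, y) via the computer tool
+    return x, y
+```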
+
+## Supported Grounding Models
+
+Any model that supports `predict_click()` can be used as the grounding component:
+
+- `omniparser` (OSS set-of-marks model)
+- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
+- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
+- `claude-3-5-sonnet-20241022` (Anthropic CUA)
+- `openai/computer-use-preview` (OpenAI CUA)
+
+## Supported Thinking Models
+
+Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
+
+- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
+- **OpenAI**: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
+- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
+- **Local models**: Any Hugging Face vision-language model
+
+## Usage Examples
+
+### GTA1 + Claude 3.5 Sonnet
+
+Combine state-of-the-art grounding with powerful reasoning:
+
+```python
+agent = ComputerAgent(
+    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
+    tools=[computer]
+)
+
+async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
+    pass
+# Success! 🎉
+# - Claude 3.5 Sonnet plans the sequence of actions
+# - GTA1-7B provides precise click coordinates for each UI element
+```
+
+### GTA1 + Gemini Pro
+
+Use Google's Gemini for planning with specialized grounding:
+
+```python
+agent = ComputerAgent(
+    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
+    tools=[computer]
+)
+
+async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
+    pass
+```
+
+### UI-TARS + GPT-4o
+
+Combine two different vision models for enhanced capabilities:
+
+```python
+agent = ComputerAgent(
+    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
+    tools=[computer]
+)
+
+async for _ in agent.run("Help me fill out this form with my personal information"):
+    pass
+```
+
+## Benefits of Composed Agents
+
+- **Specialized Grounding**: Use models optimized for click prediction accuracy
+- **Flexible Planning**: Choose any LLM for task reasoning and planning
+- **Cost Optimization**: Use smaller grounding models with larger planning models only when needed
+- **Performance**: Leverage the strengths of different model architectures
+
+## Capabilities
+
+Composed agents support both `run()` and `predict_click()`:
+
+```python
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
+
+# Full computer-use agent capabilities
+async for _ in agent.run("Complete this online form"):
+    pass
+
+# Direct click prediction (uses grounding model only)
+coords = agent.predict_click("find the submit button")
+```
+
+---
+
+For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).
diff --git a/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
new file mode 100644
index 00000000..e22e63cc
--- /dev/null
+++ b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
@@ -0,0 +1,53 @@
+---
+title: Computer-Use Models
+description: Models that support full computer-use agent capabilities with ComputerAgent.run()
+---
+
+These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.
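+
+The examples on this page assume that a `computer` tool has already been created and that the calls run inside an async context. A minimal setup might look like the sketch below; the `Computer()` constructor arguments are omitted here because they depend on your environment (local VM vs. cloud sandbox), so treat this as an assumption and check the Computer SDK docs for the exact options.
+
+```python
+import asyncio
+
+from agent import ComputerAgent
+from computer import Computer
+
+async def main():
+    # Constructor arguments omitted; configure the sandbox for your environment.
+    async with Computer() as computer:
+        agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
+        async for _ in agent.run("Open Firefox and navigate to github.com"):
+            pass
+
+asyncio.run(main())
+```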
+
+All agent loops are compatible with any LLM provider supported by LiteLLM.
+
+See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.
+
+## Anthropic CUAs
+
+Claude models with computer-use capabilities:
+
+- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
+- Claude 3.7: `claude-3-7-sonnet-20250219`
+- Claude 3.5: `claude-3-5-sonnet-20240620`
+
+```python
+agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
+async for _ in agent.run("Open Firefox and navigate to github.com"):
+    pass
+```
+
+## OpenAI CUA Preview
+
+OpenAI's computer-use preview model:
+
+- Computer-use-preview: `computer-use-preview`
+
+```python
+agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
+async for _ in agent.run("Take a screenshot and describe what you see"):
+    pass
+```
+
+## UI-TARS 1.5
+
+Unified vision-language model for computer-use:
+
+- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
+- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
+
+```python
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
+    pass
+```
+
+---
+
+For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).
diff --git a/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
new file mode 100644
index 00000000..14ff9c1e
--- /dev/null
+++ b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
@@ -0,0 +1,69 @@
+---
+title: Grounding Models
+description: Models that support click prediction with ComputerAgent.predict_click()
+---
+
+These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
+
+Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
+
+## All Computer-Use Agents
+
+All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
+
+### Anthropic CUAs
+- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
+- Claude 3.7: `claude-3-7-sonnet-20250219`
+- Claude 3.5: `claude-3-5-sonnet-20240620`
+
+### OpenAI CUA Preview
+- Computer-use-preview: `computer-use-preview`
+
+### UI-TARS 1.5
+- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
+- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
+
+## Specialized Grounding Models
+
+These models are optimized specifically for click prediction and UI element grounding:
+
+### GTA1-7B
+
+State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
+
+- `huggingface-local/HelloKKMe/GTA1-7B`
+
+```python
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+
+# Predict click coordinates for UI elements
+coords = agent.predict_click("find the submit button")
+print(f"Click coordinates: {coords}") # (450, 320)
+
+# Note: GTA1 cannot perform autonomous task planning
+# This will raise an error:
+# agent.run("Fill out the form and submit it")
+```
+
+## Usage Examples
+
+```python
+# Using any grounding model for click prediction
+agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
+
+# Take a screenshot first
+screenshot = agent.computer.screenshot()
+
+# Predict coordinates for specific elements
+login_coords = agent.predict_click("find the login button")
+search_coords = agent.predict_click("locate the search text field")
+menu_coords = agent.predict_click("find the hamburger menu icon")
+
+print(f"Login button: {login_coords}")
+print(f"Search field: {search_coords}")
+print(f"Menu icon: {menu_coords}")
+```
+
+---
+
+For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
diff --git a/docs/content/docs/agent-sdk/supported-agents/meta.json b/docs/content/docs/agent-sdk/supported-agents/meta.json
new file mode 100644
index 00000000..092fd051
--- /dev/null
+++ b/docs/content/docs/agent-sdk/supported-agents/meta.json
@@ -0,0 +1,9 @@
+{
+  "title": "Supported Agents",
+  "description": "Models and configurations supported by the Agent SDK",
+  "pages": [
+    "computer-use-agents",
+    "grounding-models",
+    "composed-agents"
+  ]
+}