---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---

The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/). Through the HUD integration, the agent controls a computer inside HUD, where tests are run to evaluate the success of each task.

## Installation

First, install the required package:

```bash
pip install "cua-agent[hud]"

# or install hud-python directly
# pip install hud-python==0.4.12
```

## Environment Variables

Before running any evaluations, you'll need to set up your environment variables for HUD and your model providers:

```bash
# HUD access
export HUD_API_KEY="your_hud_api_key"

# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
```
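
Before kicking off a run, it can help to fail fast when keys are missing. A minimal sketch of such a check (`missing_env_vars` is a hypothetical helper for illustration, not part of the SDK):

```python
import os

def missing_env_vars(env=None):
    """Return names of required environment variables that are not set.

    HUD_API_KEY is always required; at least one model provider key
    (OPENAI_API_KEY or ANTHROPIC_API_KEY) must also be present.
    """
    env = os.environ if env is None else env
    missing = []
    if not env.get("HUD_API_KEY"):
        missing.append("HUD_API_KEY")
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        missing.append("OPENAI_API_KEY or ANTHROPIC_API_KEY")
    return missing
```

Calling `missing_env_vars()` before a run surfaces configuration problems immediately instead of partway through an evaluation.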

## Running a Single Task

You can run a single task from a HUD dataset for quick verification.

### Example

```python
# Quick single-task smoke test
from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="hud-evals/OSWorld-Verified-XLang",  # HUD dataset name
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
```
### Parameters

- `task_id` (`int`): Default: `0`
  Index of the task to run from the dataset.

## Running a Full Dataset

To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.

### Example

```python
# Run a small split of OSWorld-Verified in parallel
from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",  # HUD dataset name
    split="train[:10]",   # small subset for a quick run
    max_concurrent=20,    # tune to your infra
    max_steps=50,         # safety cap per task
)

# Environment variables required:
# - HUD_API_KEY (HUD access)
# - OPENAI_API_KEY or ANTHROPIC_API_KEY depending on your chosen model(s)
```
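
Once a run finishes, you'll usually want a quick summary of the returned results. A sketch, assuming each result is a dict with a boolean `success` field (a hypothetical schema — adapt it to what your HUD version actually returns):

```python
def success_rate(results):
    """Fraction of results whose 'success' field is truthy.

    Assumes each result is a dict with a boolean 'success' key;
    this schema is an assumption, not the SDK's documented format.
    """
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("success")) / len(results)
```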

### Parameters

- `job_name` (`str` | `None`):
  Optional human-readable name for the evaluation job (shows up in HUD UI).
- `max_concurrent` (`int`): Default: `30`
  Number of tasks to run in parallel. Scale this based on your infra.
- `max_steps` (`int`): Default: `50`
  Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
  Dataset split or subset (e.g., `"train[:10]"`).
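
The `max_concurrent` cap bounds how many tasks run at once. A rough sketch of the idea using `asyncio.Semaphore` (illustrative only — the integration manages concurrency for you):

```python
import asyncio

async def run_bounded(tasks, worker, max_concurrent=30):
    """Run worker(task) for every task, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(task):
        async with sem:  # waits while max_concurrent workers are active
            return await worker(task)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(t) for t in tasks))
```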

## Additional Parameters

Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs.

- `dataset` (`str` | `Dataset` | `list[dict]`): **Required**
  HUD dataset name (e.g. `"hud-evals/OSWorld-Verified-XLang"`), a loaded `Dataset`, or a list of tasks.
- `model` (`str`): Default: `"computer-use-preview"`
  Model string, e.g. `"openai/computer-use-preview+openai/gpt-5-nano"`. Supports composition with `+` (planning + grounding).
- `allowed_tools` (`list[str]`): Default: `["openai_computer"]`
  Restrict which tools the agent may use.
- `tools` (`list[Any]`):
  Extra tool configs to inject.
- `custom_loop` (`Callable`):
  Optional custom agent loop function. If provided, overrides automatic loop selection.
- `only_n_most_recent_images` (`int`): Default: `5` for full dataset, `None` for single task.
  Retain only the last N screenshots in memory.
- `callbacks` (`list[Any]`):
  Hook functions for logging, telemetry, or side effects.
- `verbosity` (`int`):
  Logging level. Set `2` to debug every call/action.
- `trajectory_dir` (`str` | `dict`):
  Save local copies of trajectories for replay/analysis.
- `max_retries` (`int`): Default: `3`
  Number of retries for failed model/tool calls.
- `screenshot_delay` (`float` | `int`): Default: `0.5`
  Delay (seconds) between screenshots to avoid race conditions.
- `use_prompt_caching` (`bool`): Default: `False`
  Cache repeated prompts to reduce API calls.
- `max_trajectory_budget` (`float` | `dict`):
  Limit on trajectory size/budget (e.g., tokens, steps).
- `telemetry_enabled` (`bool`): Default: `True`
  Whether to send telemetry/traces to HUD.
- `**kwargs` (`Any`):
  Any additional keyword arguments are passed through to the agent loop or model provider.
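
The `+` in the `model` string composes a planning model with a grounding model. A toy sketch of how such a string splits into its two parts (`split_model_string` is a hypothetical helper for illustration; the SDK does its own parsing):

```python
def split_model_string(model):
    """Split 'planning+grounding' into its two parts.

    A string without '+' is returned as the same model for both roles.
    Hypothetical helper for illustration only, not the SDK's parser.
    """
    planning, sep, grounding = model.partition("+")
    if not sep:
        grounding = planning
    return planning.strip(), grounding.strip()
```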

## Available Benchmarks

HUD provides multiple benchmark datasets for realistic evaluation.

1. **[OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified)** – Benchmark on 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, etc.
   *Best for*: evaluating full computer-use agents in realistic environments.
   *Verified variant*: fixes 300+ issues from earlier versions for reliability.

**Coming soon:** SheetBench (spreadsheet automation) and other specialized HUD datasets.

See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

## Tips

* **Debugging:** set `verbosity=2` to see every model call and tool action.
* **Performance:** lower `screenshot_delay` for faster runs; raise it if you see race conditions.
* **Safety:** always set `max_steps` (defaults to 50) to prevent runaway loops.
* **Custom tools:** pass extra `tools=[...]` into the agent config if you need tools beyond `openai_computer`.