---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---

<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>

The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/). The agent controls a computer inside a HUD environment, where tests are run to evaluate the success of each task.

## Installation

First, install the required package:

```bash
pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.4.12
```

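If you want a quick sanity check that the extra installed correctly, importing the two helpers used in the examples below should succeed (this check is illustrative, not required):

```python
# Quick import check: both helpers ship with the HUD integration.
from agent.integrations.hud import run_single_task, run_full_dataset

print(run_single_task, run_full_dataset)
```
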
## Environment Variables

Before running any evaluations, you’ll need to set up your environment variables for HUD and your model providers:

```bash
# HUD access
export HUD_API_KEY="your_hud_api_key"

# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
```

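As an optional, illustrative pre-flight check (not part of the integration), you can confirm from Python that the keys are visible before launching a run; the variable names match the exports above:

```python
import os

# Illustrative pre-flight check: HUD access is required,
# plus at least one model-provider key.
assert os.environ.get("HUD_API_KEY"), "HUD_API_KEY is not set"
assert os.environ.get("OPENAI_API_KEY") or os.environ.get("ANTHROPIC_API_KEY"), (
    "Set OPENAI_API_KEY and/or ANTHROPIC_API_KEY"
)
```
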
## Running a Single Task

You can run a single task from a HUD dataset for quick verification.

### Example

```python
from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="hud-evals/OSWorld-Verified",  # or another HUD dataset
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
```

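The example above uses top-level `await`, which works in a notebook. From a plain Python script, one way to drive the same call is with the standard-library `asyncio` runner, for example:

```python
import asyncio

from agent.integrations.hud import run_single_task


async def main() -> None:
    # Same call as above, wrapped in a coroutine so it can be
    # driven by asyncio.run() outside of a notebook.
    await run_single_task(
        dataset="hud-evals/OSWorld-Verified",
        model="openai/computer-use-preview+openai/gpt-5-nano",
        task_id=155,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
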
### Parameters

- `task_id` (`int`): Default: `0`
  Index of the task to run from the dataset.

## Running a Full Dataset

To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.

### Example

```python
from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",  # can also pass a Dataset or list[dict]
    model="openai/computer-use-preview",
    split="train[:3]",    # try a few tasks to start
    max_concurrent=20,    # tune to your infra
    max_steps=50,         # safety cap per task
)
```

### Parameters

- `job_name` (`str` | `None`):
  Optional human-readable name for the evaluation job (shows up in HUD UI).
- `max_concurrent` (`int`): Default: `30`
  Number of tasks to run in parallel. Scale this based on your infra.
- `max_steps` (`int`): Default: `50`
  Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
  Dataset split or subset (e.g., `"train[:10]"`).

## Additional Parameters

Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs; a combined usage sketch follows the list.

- `dataset` (`str` | `Dataset` | `list[dict]`): **Required**
  HUD dataset name (e.g. `"hud-evals/OSWorld-Verified"`), a loaded `Dataset`, or a list of tasks.
- `model` (`str`): Default: `"computer-use-preview"`
  Model string, e.g. `"openai/computer-use-preview+openai/gpt-5-nano"`. Supports composition with `+` (planning + grounding).
- `allowed_tools` (`list[str]`): Default: `["openai_computer"]`
  Restrict which tools the agent may use.
- `tools` (`list[Any]`):
  Extra tool configs to inject.
- `custom_loop` (`Callable`):
  Optional custom agent loop function. If provided, overrides automatic loop selection.
- `only_n_most_recent_images` (`int`): Default: `5` for full dataset, `None` for single task.
  Retain only the last N screenshots in memory.
- `callbacks` (`list[Any]`):
  Hook functions for logging, telemetry, or side effects.
- `verbosity` (`int`):
  Logging level. Set `2` for debugging every call/action.
- `trajectory_dir` (`str` | `dict`):
  Save local copies of trajectories for replay/analysis.
- `max_retries` (`int`): Default: `3`
  Number of retries for failed model/tool calls.
- `screenshot_delay` (`float` | `int`): Default: `0.5`
  Delay (seconds) between screenshots to avoid race conditions.
- `use_prompt_caching` (`bool`): Default: `False`
  Cache repeated prompts to reduce API calls.
- `max_trajectory_budget` (`float` | `dict`):
  Limit on trajectory size/budget (e.g., tokens, steps).
- `telemetry_enabled` (`bool`): Default: `True`
  Whether to send telemetry/traces to HUD.
- `**kwargs` (`any`):
  Any additional keyword arguments are passed through to the agent loop or model provider.

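As an illustration of how these options compose, here is a sketch that passes several of them to `run_full_dataset`; the specific values (job name, directory, delays) are arbitrary placeholders, and the shared options listed above work the same way with `run_single_task`:

```python
from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview+openai/gpt-5-nano",
    job_name="osworld-smoke-test",    # placeholder name, shows up in the HUD UI
    split="train[:10]",               # start with a small subset
    max_concurrent=10,
    max_steps=50,
    only_n_most_recent_images=5,      # trim screenshot history kept in memory
    verbosity=2,                      # log every call/action
    trajectory_dir="./trajectories",  # placeholder path for local replay copies
    max_retries=3,
    screenshot_delay=1.0,             # raise if you hit race conditions
    use_prompt_caching=False,
    telemetry_enabled=True,
)
```
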
## Available Benchmarks

HUD provides multiple benchmark datasets for realistic evaluation.

1. **[OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified)** – Benchmark on 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, etc.
   *Best for*: evaluating full computer-use agents in realistic environments.
   *Verified variant*: fixes 300+ issues from earlier versions for reliability.

**Coming soon:** SheetBench (spreadsheet automation) and other specialized HUD datasets.

See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

## Tips

* **Debugging:** set `verbosity=2` to see every model call and tool action.
* **Performance:** lower `screenshot_delay` for faster runs; raise it if you see race conditions.
* **Safety:** always set `max_steps` (defaults to 50) to prevent runaway loops.
* **Custom tools:** pass extra `tools=[...]` into the agent config if you need tools beyond `openai_computer`.