---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/). Through the HUD integration, the agent controls a computer inside HUD, where tests are run to evaluate the success of each task.
## Installation
First, install the required package:
```bash
pip install "cua-agent[hud]"
# Or install hud-python directly:
# pip install hud-python==0.4.12
```
## Environment Variables
Before running any evaluations, you'll need to set up your environment variables for HUD and your model providers:
```bash
# HUD access
export HUD_API_KEY="your_hud_api_key"
# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
```
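If you want to fail fast before launching an evaluation, a small sanity check along these lines can confirm the keys are present. This is a minimal sketch; the variable names simply mirror the export commands above:
```python
import os

# HUD key is always required; at least one model provider key must also be set.
assert os.environ.get("HUD_API_KEY"), "HUD_API_KEY is not set"
assert os.environ.get("OPENAI_API_KEY") or os.environ.get("ANTHROPIC_API_KEY"), \
    "Set OPENAI_API_KEY or ANTHROPIC_API_KEY for your model provider"
```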
## Running a Single Task
You can run a single task from a HUD dataset for quick verification.
### Example
```python
from agent.integrations.hud import run_single_task
await run_single_task(
    dataset="hud-evals/OSWorld-Verified",  # or another HUD dataset
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
```
### Parameters
- `task_id` (`int`): Default: `0`
Index of the task to run from the dataset.
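Because `run_single_task` is a coroutine, running it outside a notebook requires an event loop. A minimal script wrapper might look like this (a sketch reusing the dataset and model string from the example above):
```python
import asyncio

from agent.integrations.hud import run_single_task

async def main():
    await run_single_task(
        dataset="hud-evals/OSWorld-Verified",
        model="openai/computer-use-preview+openai/gpt-5-nano",
        task_id=155,
    )

if __name__ == "__main__":
    asyncio.run(main())
```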
## Running a Full Dataset
To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.
### Example
```python
from agent.integrations.hud import run_full_dataset
results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",  # can also pass a Dataset or list[dict]
    model="openai/computer-use-preview",
    split="train[:3]",     # try a few tasks to start
    max_concurrent=20,     # tune to your infra
    max_steps=50,          # safety cap per task
)
```
### Parameters
- `job_name` (`str` | `None`):
Optional human-readable name for the evaluation job (shows up in HUD UI).
- `max_concurrent` (`int`): Default: `30`
Number of tasks to run in parallel. Scale this based on your infra.
- `max_steps` (`int`): Default: `50`
Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
Dataset split or subset (e.g., `"train[:10]"`).
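For example, to label a small smoke-test run so it is easy to find in the HUD UI, you can combine `job_name` with a narrow `split`. This sketch uses only the parameters documented above; the values are illustrative:
```python
from agent.integrations.hud import run_full_dataset

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview",
    job_name="osworld-smoke-test",  # visible in the HUD UI
    split="train[:10]",             # small subset for a quick check
    max_concurrent=10,              # scale to your infra
    max_steps=50,                   # safety cap per task
)
```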
## Additional Parameters
Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs; a combined example follows the list below.
- `dataset` (`str` | `Dataset` | `list[dict]`): **Required**
HUD dataset name (e.g. `"hud-evals/OSWorld-Verified"`), a loaded `Dataset`, or a list of tasks.
- `model` (`str`): Default: `"computer-use-preview"`
Model string, e.g. `"openai/computer-use-preview+openai/gpt-5-nano"`. Supports composition with `+` (planning + grounding).
- `allowed_tools` (`list[str]`): Default: `["openai_computer"]`
Restrict which tools the agent may use.
- `tools` (`list[Any]`):
Extra tool configs to inject.
- `custom_loop` (`Callable`):
Optional custom agent loop function. If provided, overrides automatic loop selection.
- `only_n_most_recent_images` (`int`): Default: `5` for full dataset, `None` for single task.
Retain only the last N screenshots in memory.
- `callbacks` (`list[Any]`):
Hook functions for logging, telemetry, or side effects.
- `verbosity` (`int`):
Logging level. Set `2` for debugging every call/action.
- `trajectory_dir` (`str` | `dict`):
Save local copies of trajectories for replay/analysis.
- `max_retries` (`int`): Default: `3`
Number of retries for failed model/tool calls.
- `screenshot_delay` (`float` | `int`): Default: `0.5`
Delay (seconds) between screenshots to avoid race conditions.
- `use_prompt_caching` (`bool`): Default: `False`
Cache repeated prompts to reduce API calls.
- `max_trajectory_budget` (`float` | `dict`):
Limit on trajectory size/budget (e.g., tokens, steps).
- `telemetry_enabled` (`bool`): Default: `True`
Whether to send telemetry/traces to HUD.
- `**kwargs` (`any`):
Any additional keyword arguments are passed through to the agent loop or model provider.
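Putting a few of these options together, a debugging-oriented run might look like the following sketch. The parameter values are illustrative rather than recommendations:
```python
from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="hud-evals/OSWorld-Verified",
    model="openai/computer-use-preview",
    task_id=0,
    allowed_tools=["openai_computer"],  # restrict the agent to the computer tool
    only_n_most_recent_images=5,        # keep memory usage bounded
    verbosity=2,                        # log every model call and tool action
    trajectory_dir="./trajectories",    # save trajectories for replay/analysis
    max_retries=3,
    screenshot_delay=0.5,
)
```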
## Available Benchmarks
HUD provides multiple benchmark datasets for realistic evaluation.
1. **[OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified)**: benchmark of 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, and more.
   *Best for*: evaluating full computer-use agents in realistic environments.
   *Verified variant*: fixes 300+ issues from earlier versions for reliability.
**Coming soon:** SheetBench (spreadsheet automation) and other specialized HUD datasets.
See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.
## Tips
* **Debugging:** set `verbosity=2` to see every model call and tool action.
* **Performance:** lower `screenshot_delay` for faster runs; raise it if you see race conditions.
* **Safety:** always set `max_steps` (defaults to 50) to prevent runaway loops.
* **Custom tools:** pass extra `tools=[...]` into the agent config if you need tools beyond `openai_computer`.