Extend HUD integration documentation

This commit is contained in:
James Murdza
2025-09-02 15:00:36 -04:00
parent d039c57c68
commit 7850fac399

---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/): the agent controls a computer inside HUD, where tests are run to evaluate its success on each task.
## Installation
First, install the required package:
```bash
pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.4.12
```
## Environment Variables
Before running any evaluations, you'll need to set up your environment variables for HUD and your model providers:
```bash
# HUD access
export HUD_API_KEY="your_hud_api_key"
# Model provider keys (at least one required)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
```
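To fail fast instead of failing mid-evaluation, you can verify these variables before launching a run. A minimal sketch; `check_hud_env` is a hypothetical helper for illustration, not part of the SDK:

```python
import os

def check_hud_env() -> list[str]:
    """Return the names of any missing required environment variables.

    Hypothetical helper for illustration; not part of the SDK.
    """
    missing = []
    if not os.environ.get("HUD_API_KEY"):
        missing.append("HUD_API_KEY")
    # At least one model provider key is required
    if not (os.environ.get("OPENAI_API_KEY") or os.environ.get("ANTHROPIC_API_KEY")):
        missing.append("OPENAI_API_KEY or ANTHROPIC_API_KEY")
    return missing
```

Calling `check_hud_env()` at the top of a script lets you raise a clear error early rather than partway through a long benchmark run.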
## Running a Single Task
You can run a single task from a HUD dataset for quick verification.
### Example
```python
# Quick single-task smoke test
from agent.integrations.hud import run_single_task
await run_single_task(
    dataset="hud-evals/OSWorld-Verified-XLang",  # HUD dataset to draw the task from
    model="openai/computer-use-preview+openai/gpt-5-nano",  # any supported model string
    task_id=155,  # e.g., reopen last closed tab
)
```
### Parameters
- `task_id` (`int`): Default: `0`
Index of the task to run from the dataset.
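`run_single_task` is a coroutine, so outside a notebook (where top-level `await` works) it needs an event loop. A minimal script wrapper, with a placeholder standing in for the real call since it requires HUD credentials:

```python
import asyncio

async def main() -> None:
    # In a real script, replace the placeholder below with:
    #   from agent.integrations.hud import run_single_task
    #   await run_single_task(dataset=..., model=..., task_id=155)
    await asyncio.sleep(0)  # placeholder so this sketch runs standalone

if __name__ == "__main__":
    asyncio.run(main())
```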
## Running a Full Dataset
To benchmark your agent at scale, you can run an entire dataset (or a subset) in parallel.
### Example
```python
from agent.integrations.hud import run_full_dataset
# Run a small split of OSWorld-Verified in parallel
results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",
    split="train[:10]",   # small subset for a quick check
    max_concurrent=20,    # tune to your infra
    max_steps=50          # safety cap per task
)
# Environment variables required:
# - HUD_API_KEY (HUD access)
# - OPENAI_API_KEY or ANTHROPIC_API_KEY depending on your chosen model(s)
```
### Parameters
- `job_name` (`str` | `None`):
Optional human-readable name for the evaluation job (shows up in HUD UI).
- `max_concurrent` (`int`): Default: `30`
Number of tasks to run in parallel. Scale this based on your infra.
- `max_steps` (`int`): Default: `50`
Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
Dataset split or subset (e.g., `"train[:10]"`).
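The `split` value uses slice notation (e.g., `"train[:10]"` selects the first ten tasks of the train split). As a sketch, the per-run knobs above can be collected as keyword arguments; the values and job name here are examples only:

```python
# Per-run knobs collected as keyword arguments (values are examples)
run_kwargs = dict(
    job_name="osworld-quick-check",  # shows up in the HUD UI
    split="train[:10]",              # first 10 tasks of the train split
    max_concurrent=30,               # parallelism; scale to your infra
    max_steps=50,                    # safety cap per task
)
# results = await run_full_dataset(dataset="hud-evals/OSWorld-Verified-XLang", **run_kwargs)
```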
## Additional Parameters
Both single-task and full-dataset runs share a common set of configuration options. These let you fine-tune how the evaluation runs.
- `dataset` (`str` | `Dataset` | `list[dict]`): **Required**
HUD dataset name (e.g. `"hud-evals/OSWorld-Verified-XLang"`), a loaded `Dataset`, or a list of tasks.
- `model` (`str`): Default: `"computer-use-preview"`
Model string, e.g. `"openai/computer-use-preview+openai/gpt-5-nano"`. Supports composition with `+` (planning + grounding).
- `allowed_tools` (`list[str]`): Default: `["openai_computer"]`
Restrict which tools the agent may use.
- `tools` (`list[Any]`):
Extra tool configs to inject.
- `custom_loop` (`Callable`):
Optional custom agent loop function. If provided, overrides automatic loop selection.
- `only_n_most_recent_images` (`int`): Default: `5` for full dataset, `None` for single task.
Retain only the last N screenshots in memory.
- `callbacks` (`list[Any]`):
Hook functions for logging, telemetry, or side effects.
- `verbosity` (`int`):
Logging level. Set `2` for debugging every call/action.
- `trajectory_dir` (`str` | `dict`):
Save local copies of trajectories for replay/analysis.
- `max_retries` (`int`): Default: `3`
Number of retries for failed model/tool calls.
- `screenshot_delay` (`float` | `int`): Default: `0.5`
Delay (seconds) between screenshots to avoid race conditions.
- `use_prompt_caching` (`bool`): Default: `False`
Cache repeated prompts to reduce API calls.
- `max_trajectory_budget` (`float` | `dict`):
Limit on trajectory size/budget (e.g., tokens, steps).
- `telemetry_enabled` (`bool`): Default: `True`
Whether to send telemetry/traces to HUD.
- `**kwargs` (`any`):
Any additional keyword arguments are passed through to the agent loop or model provider.
## Available Benchmarks
HUD provides multiple benchmark datasets for realistic evaluation.
1. **[OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified)** - Benchmark on 369+ real-world desktop tasks across Chrome, LibreOffice, GIMP, VS Code, etc.
*Best for*: evaluating full computer-use agents in realistic environments.
*Verified variant*: fixes 300+ issues from earlier versions for reliability.
**Coming soon:** SheetBench (spreadsheet automation) and other specialized HUD datasets.
See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.
## Tips
* **Debugging:** set `verbosity=2` to see every model call and tool action.
* **Performance:** lower `screenshot_delay` for faster runs; raise it if you see race conditions.
* **Safety:** always set `max_steps` (defaults to 50) to prevent runaway loops.
* **Custom tools:** pass extra `tools=[...]` into the agent config if you need tools beyond `openai_computer`.