Added HUD integration

Dillon DuPont
2025-08-08 12:17:35 -04:00
parent 6ebf06e2fa
commit b9f307a149
5 changed files with 140 additions and 2 deletions

@@ -3,6 +3,7 @@
"introduction",
"screenspot-v2",
"screenspot-pro",
-"interactive"
+"interactive",
+"osworld-verified"
]
}

@@ -0,0 +1,89 @@
---
title: OSWorld-Verified
description: Benchmark ComputerAgent on OSWorld tasks using HUD
---
OSWorld-Verified is a curated subset of OSWorld tasks that can be run using the HUD framework. Use ComputerAgent with HUD to benchmark on these tasks.
## Setup
```bash
pip install hud-python==0.2.10
```
Set environment variables:
```bash
export HUD_API_KEY="your_hud_key"
export ANTHROPIC_API_KEY="your_anthropic_key" # For Claude
export OPENAI_API_KEY="your_openai_key" # For OpenAI
```
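A missing key usually only shows up after the environment has started, so it can help to check the variables first. The sketch below is a minimal pre-flight check. The helper name `missing_env_vars` and the rule that the provider is the part of the model string before the first `/` are assumptions made for illustration, not part of HUD or ComputerAgent.

```python
import os

# Assumption: the provider is the prefix of the model string before the first "/".
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_env_vars(model, env=None):
    """Return the names of required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ["HUD_API_KEY"]
    provider = model.split("/", 1)[0]
    if provider in PROVIDER_KEYS:
        required.append(PROVIDER_KEYS[provider])
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars("anthropic/claude-3-5-sonnet-20241022")
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real process environment.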
## Quick Start
```python
import asyncio

from hud import gym, load_taskset
from agent.integrations.hud import ComputerAgent


async def run_osworld():
    # Load taskset
    taskset = await load_taskset("OSWorld-Verified")
    test = taskset[144]  # Example task

    # Create environment (~2.5 min startup)
    env = await gym.make(test)

    # Create agent
    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",  # any ComputerAgent model string
        environment="linux",
        max_iterations=8
    )

    # Run benchmark
    obs, _ = await env.reset()
    for i in range(agent.max_iterations):
        action, done = await agent.predict(obs)
        obs, reward, terminated, info = await env.step(action)
        if done or terminated:
            break

    # Evaluate results
    result = await env.evaluate()
    await env.close()
    return result


# Run benchmark
result = asyncio.run(run_osworld())
print(f"Success: {result.get('success', False)}")
```
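The control flow of the loop above can be tried without a HUD account or the multi-minute environment startup. The sketch below substitutes stub classes for the environment and the agent. `StubEnv`, `StubAgent`, and `run_episode` are hypothetical names invented for this illustration, and they mimic only the call shapes used in the Quick Start, not the real HUD or ComputerAgent behavior.

```python
import asyncio


class StubEnv:
    """Hypothetical stand-in for a HUD environment; terminates after a fixed number of steps."""

    def __init__(self, steps_to_finish):
        self.steps_to_finish = steps_to_finish
        self.steps = 0

    async def reset(self):
        self.steps = 0
        return {"step": 0}, {}

    async def step(self, action):
        self.steps += 1
        terminated = self.steps >= self.steps_to_finish
        return {"step": self.steps}, 0.0, terminated, {}


class StubAgent:
    """Hypothetical agent that never reports completion on its own."""

    max_iterations = 8

    async def predict(self, obs):
        return f"action-{obs['step']}", False


async def run_episode(env, agent):
    """Same loop shape as the Quick Start; returns the number of steps taken."""
    obs, _ = await env.reset()
    steps = 0
    for _ in range(agent.max_iterations):
        action, done = await agent.predict(obs)
        obs, reward, terminated, info = await env.step(action)
        steps += 1
        if done or terminated:
            break
    return steps


if __name__ == "__main__":
    print(asyncio.run(run_episode(StubEnv(3), StubAgent())))
```

With these stubs the loop stops either when the environment reports termination or when `max_iterations` is exhausted, whichever comes first.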
## Parallel Execution
Run all tasks in parallel using `run_job`:
```python
import asyncio
import logging

from hud import load_taskset, run_job
from agent.integrations.hud import ComputerAgent

logging.basicConfig(level=logging.INFO)


async def main():
    # Load full taskset
    taskset = await load_taskset("OSWorld-Verified")

    # Run parallel job
    job = await run_job(
        ComputerAgent,
        taskset,
        "osworld-computeragent",
        max_steps_per_task=8,
        max_concurrent_tasks=20,
        auto_reply_question=True,
        agent_kwargs={"model": "anthropic/claude-3-5-sonnet-20241022"}
    )

    # Get analytics
    return await job.get_analytics()


analytics = asyncio.run(main())
```
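For intuition about what a concurrency cap such as `max_concurrent_tasks=20` means, the sketch below bounds the number of coroutines in flight with `asyncio.Semaphore`. This is a generic pattern, not HUD's implementation of `run_job`, and `run_bounded`, `demo`, and `fake_task` are hypothetical names used only here.

```python
import asyncio


async def run_bounded(task_fns, max_concurrent):
    """Run zero-argument coroutine functions with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(fn):
        # Each task waits here until one of the `max_concurrent` slots is free.
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(guarded(fn) for fn in task_fns))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_task():
        # Stand-in for one benchmark task; tracks how many run at the same time.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_task] * 10, max_concurrent=3)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

`asyncio.gather` returns results in submission order, so the per-task outcomes line up with the input list even though tasks finish at different times.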

@@ -0,0 +1,43 @@
---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
The HUD integration allows you to use ComputerAgent with the [HUD benchmarking framework](https://www.hud.so/), providing the same interface as existing HUD agents while leveraging ComputerAgent's capabilities.
## Installation
```bash
pip install "cua-agent[hud]"
# or install hud-python directly
# pip install hud-python==0.2.10
```
## Usage
```python
from agent.integrations.hud import ComputerAgent
# Create agent with any ComputerAgent model
agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",  # or any model string
    environment="linux"
)

# Use exactly like other HUD agents
# (call from inside an async function; `observation` comes from a HUD environment)
action, done = await agent.predict(observation)
```
## Environment Variables
Set these environment variables:
- `HUD_API_KEY` - Your HUD API key
- `ANTHROPIC_API_KEY` - For Claude models
- `OPENAI_API_KEY` - For OpenAI models
## Example Benchmarks
1. [OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified) - Benchmark on OSWorld tasks with parallel execution

See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

@@ -0,0 +1,4 @@
{
"title": "Integrations",
"pages": ["hud"]
}

@@ -12,6 +12,7 @@
"prompt-caching",
"usage-tracking",
"benchmarks",
-"migration-guide"
+"migration-guide",
+"integrations"
]
}