Computer Agent Benchmarks
This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.
Overview
The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:
- Computer Agent SDK providers (using model strings like "huggingface-local/HelloKKMe/GTA1-7B")
- Reference agent implementations (custom model classes implementing the ModelProtocol)
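As a rough illustration of what a reference implementation might look like, the sketch below defines a protocol with a single click-prediction method and a trivial model that satisfies it. The names ClickModel, CenterClickModel, and predict_click, and the choice to pass the image as a (width, height) tuple, are assumptions made for this example; the actual ModelProtocol in the repository may differ (for instance, it may be asynchronous or take an image object).

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ClickModel(Protocol):
    """Hypothetical stand-in for the repository's ModelProtocol."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Optional[Tuple[int, int]]:
        """Return (x, y) pixel coordinates, or None if no prediction is made."""
        ...


class CenterClickModel:
    """Toy baseline (assumption): always clicks the center of the image."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Optional[Tuple[int, int]]:
        width, height = image_size
        return (width // 2, height // 2)


model = CenterClickModel()
print(model.predict_click((1920, 1080), "Click the Save button"))
```

A structural protocol like this lets any class with a matching method be benchmarked without inheriting from a shared base class.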
Available Benchmarks
1. ScreenSpot-v2 (ss-v2.py)
- Dataset: ScreenSpot-v2 (click-only GUI grounding)
- Format: Standard resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage
2. ScreenSpot-Pro (ss-pro.py)
- Dataset: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- Format: High-resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage
3. Interactive Testing (interactive.py)
- Real-time testing: Take screenshots and visualize model predictions
- Commands:
  - Type instruction → test all models on last screenshot
  - screenshot → take screenshot
  - models → list available models
  - quit/exit → exit tool
- Output: Visual predictions with crosshairs for each model
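Both ScreenSpot benchmarks above report Accuracy for click prediction. This README does not spell out the scoring rule, so the sketch below assumes the common GUI-grounding convention: a prediction counts as correct when the predicted point falls inside the target element's bounding box, and a missing prediction counts as a miss. The function names and the (x1, y1, x2, y2) box layout are illustrative, not taken from the benchmark scripts.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def is_hit(click: Optional[Point], bbox: BBox) -> bool:
    """True if the click lies inside the box (edges included)."""
    if click is None:  # the model produced no prediction
        return False
    x, y = click
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def click_accuracy(clicks: Sequence[Optional[Point]], bboxes: Sequence[BBox]) -> float:
    """Fraction of samples whose predicted click hits the target box."""
    if len(clicks) != len(bboxes):
        raise ValueError("clicks and bboxes must have the same length")
    if not bboxes:
        return 0.0
    hits = sum(is_hit(c, b) for c, b in zip(clicks, bboxes))
    return hits / len(bboxes)


predictions = [(50, 50), (300, 10), None, (120, 80)]
targets = [(0, 0, 100, 100), (0, 0, 100, 100), (0, 0, 10, 10), (100, 60, 140, 100)]
print(f"Accuracy: {click_accuracy(predictions, targets):.2%}")
```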
Running Benchmarks
1. Configure Models
Edit utils.py to specify which models you want to test in get_available_models().
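For orientation, a model list of this kind might be assembled as in the sketch below. This is not the contents of utils.py: the return type, the CenterClickBaseline class, and the idea of mixing model strings with model instances in one list are assumptions based on the two model kinds described in the overview.

```python
from typing import List, Union


class CenterClickBaseline:
    """Hypothetical reference model: always predicts the image center."""

    def predict_click(self, image_size, instruction):
        width, height = image_size
        return (width // 2, height // 2)


def get_available_models() -> List[Union[str, object]]:
    """Return the models to benchmark (illustrative sketch only)."""
    models: List[Union[str, object]] = []
    # SDK provider, addressed by a model string (example from the overview).
    models.append("huggingface-local/HelloKKMe/GTA1-7B")
    # Reference implementation instance; comment this out to skip it.
    models.append(CenterClickBaseline())
    return models


for entry in get_available_models():
    label = entry if isinstance(entry, str) else type(entry).__name__
    print(label)
```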
2. Run Benchmark
# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50
# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50
# Interactive testing
python interactive.py
Output
Console Output
Model Results:
Accuracy: 85.50% (171/200)
Avg Time: 1.23s (0.89s - 2.45s)
VRAM Usage: 4.5GB (max) / 3.4GB (avg)
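The summary lines above can be derived from per-sample measurements. The sketch below shows one way to do that and to reproduce the same line format; the function name format_summary and its inputs (one correctness flag, one latency in seconds, and one VRAM reading in GB per sample) are assumptions for illustration, not the benchmark's actual code.

```python
from statistics import mean
from typing import List, Sequence


def format_summary(
    correct: Sequence[bool],
    times_s: Sequence[float],
    vram_gb: Sequence[float],
) -> List[str]:
    """Build the three summary lines from per-sample measurements."""
    if not correct or not times_s or not vram_gb:
        raise ValueError("at least one sample is required")
    total = len(correct)
    hits = sum(correct)  # True counts as 1
    return [
        f"Accuracy: {100 * hits / total:.2f}% ({hits}/{total})",
        f"Avg Time: {mean(times_s):.2f}s ({min(times_s):.2f}s - {max(times_s):.2f}s)",
        f"VRAM Usage: {max(vram_gb):.1f}GB (max) / {mean(vram_gb):.1f}GB (avg)",
    ]


lines = format_summary(
    correct=[True, True, False, True],
    times_s=[1.0, 2.0, 3.0, 2.0],
    vram_gb=[3.0, 4.0, 5.0, 4.0],
)
print("\n".join(lines))
```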
Generated Files
- Markdown Report: *_results.md with detailed results tables
- Visualizations: output/ directory with prediction visualizations
- Interactive Output: interactive_output/ for interactive session results
Contributing
To add a new reference model, follow the instructions in contrib.md.