# Computer Agent Benchmarks

This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.

## Overview

The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:

- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)

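A reference model only needs to satisfy the `ModelProtocol` defined for these benchmarks (see [contrib.md](contrib.md)). As a rough illustration only, the sketch below assumes the protocol boils down to a click-prediction method; the class name, method name, and signature here are hypothetical, not the actual interface:

```python
# Hypothetical sketch of a reference model; the real interface is whatever
# ModelProtocol requires (see contrib.md). Names and signatures are assumptions.
from PIL import Image


class MyReferenceModel:
    """Minimal reference model: predicts a click point for an instruction."""

    def predict_click(self, instruction: str, image: Image.Image) -> tuple[int, int]:
        # Return (x, y) pixel coordinates for the predicted click.
        # Placeholder logic: always click the center of the screenshot.
        return image.width // 2, image.height // 2
```
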
## Available Benchmarks

### 1. ScreenSpot-v2 (`ss-v2.py`)

- **Dataset**: ScreenSpot-v2 (click-only GUI grounding)
- **Format**: Standard-resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage

### 2. ScreenSpot-Pro (`ss-pro.py`)

- **Dataset**: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- **Format**: High-resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage

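In both scripts, a prediction presumably counts as correct when the predicted point lands inside the annotated target region (the standard ScreenSpot evaluation), and accuracy is the fraction of correct clicks. A minimal sketch of that check, assuming a `(left, top, right, bottom)` pixel bounding box; the benchmarks' actual annotation format and helper names may differ:

```python
def click_in_bbox(x: float, y: float, bbox: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click (x, y) falls inside the target box.

    Assumes bbox = (left, top, right, bottom) in pixels; the datasets' real
    annotation format may differ.
    """
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


# Accuracy over a run is then, for example:
# accuracy = sum(click_in_bbox(x, y, b) for (x, y), b in results) / len(results)
```
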
### 3. Interactive Testing (`interactive.py`)

- **Real-time testing**: Take screenshots and visualize model predictions
- **Commands**:
  - Type an instruction → test all models on the last screenshot
  - `screenshot` → take a screenshot
  - `models` → list available models
  - `quit`/`exit` → exit the tool
- **Output**: Visual predictions with crosshairs for each model

## Running Benchmarks

### 1. Configure Models

Edit `get_available_models()` in `utils.py` to specify which models you want to test.

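For illustration, a sketch of what that list might look like, assuming `get_available_models()` simply returns the models to benchmark; the actual return type and any reference classes may differ:

```python
# utils.py (illustrative sketch; the real return type may differ)
def get_available_models():
    return [
        # Computer Agent SDK provider, referenced by model string:
        "huggingface-local/HelloKKMe/GTA1-7B",
        # A custom reference model implementing ModelProtocol could go here,
        # e.g. MyReferenceModel() from the sketch above (hypothetical class).
    ]
```
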
### 2. Run Benchmark

```bash
# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50

# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50

# Interactive testing
python interactive.py
```

## Output

### Console Output

```
Model Results:
Accuracy: 85.50% (171/200)
Avg Time: 1.23s (0.89s - 2.45s)
VRAM Usage: 4.5GB (max) / 3.4GB (avg)
```

### Generated Files

- **Markdown Report**: `*_results.md` with detailed results tables
- **Visualizations**: `output/` directory with prediction visualizations
- **Interactive Output**: `interactive_output/` directory for interactive session results

## Contributing

To add a new reference model, follow the instructions in [contrib.md](contrib.md).