Computer Agent SDK providers (using model strings like "huggingface-local/HelloKKMe/GTA1-7B")
Reference agent implementations (custom model classes implementing the ModelProtocol)

Available Benchmarks

1. ScreenSpot-v2 (`ss-v2.py`)

Dataset: ScreenSpot-v2 (click-only GUI grounding)
Format: Standard resolution screenshots
Task: Predict click coordinates given an instruction and image
Metrics: Accuracy, Error Rate, Timing, VRAM usage
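
For click-only grounding, accuracy is commonly scored by checking whether the predicted point falls inside the target element's bounding box. The sketch below illustrates that rule. The `(x_min, y_min, x_max, y_max)` box layout and the helper names `is_hit` and `accuracy` are assumptions for illustration, not necessarily what `ss-v2.py` does internally.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
BBox = Tuple[float, float, float, float]


def is_hit(pred: Optional[Point], bbox: BBox) -> bool:
    """Return True when the predicted click lies inside the target box."""
    if pred is None:  # the model produced no usable coordinates
        return False
    x, y = pred
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


def accuracy(preds: Sequence[Optional[Point]], boxes: Sequence[BBox]) -> float:
    """Fraction of predictions that land inside their target box."""
    if not preds:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)
```

Under this rule a missing prediction counts as a miss, which is one plausible way an error rate could be reported alongside accuracy.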

2. ScreenSpot-Pro (`ss-pro.py`)

Dataset: ScreenSpot-Pro (high-resolution click-only GUI grounding)
Format: High-resolution screenshots
Task: Predict click coordinates given an instruction and image
Metrics: Accuracy, Error Rate, Timing, VRAM usage

3. Interactive Testing (`interactive.py`)

Real-time testing: Take screenshots and visualize model predictions
Commands:
- Type instruction → test all models on last screenshot
- screenshot → take screenshot
- models → list available models
- quit/exit → exit tool
Output: Visual predictions with crosshairs for each model
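
The command handling above could be sketched roughly as follows. `parse_command` and the action labels it returns are hypothetical names used only for illustration; the actual loop in `interactive.py` may be structured differently.

```python
from typing import Optional, Tuple


def parse_command(line: str) -> Tuple[str, Optional[str]]:
    """Map one line of user input to an (action, argument) pair."""
    text = line.strip()
    lowered = text.lower()
    if lowered in ("quit", "exit"):
        return ("quit", None)
    if lowered == "screenshot":
        return ("screenshot", None)
    if lowered == "models":
        return ("models", None)
    if not text:
        return ("noop", None)  # ignore empty input
    # Anything else is treated as an instruction to run on the last screenshot.
    return ("instruction", text)
```

For example, `parse_command("Click the Save button")` returns `("instruction", "Click the Save button")`, while `parse_command("exit")` returns `("quit", None)`.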

Running Benchmarks

1. Configure Models

Edit `get_available_models()` in `utils.py` to specify which models to test.
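
As a rough sketch, the function might return a list that mixes model strings with custom model instances. The `predict_click` method, its signature, and the `FixedPointModel` class are hypothetical placeholders; check the real `ModelProtocol` definition in this repository before copying anything from here.

```python
from typing import List, Optional, Protocol, Tuple, Union


class ModelProtocol(Protocol):
    """Hypothetical stand-in; the real protocol in this repo may differ."""

    def predict_click(self, image: object, instruction: str) -> Optional[Tuple[int, int]]:
        ...


class FixedPointModel:
    """Toy reference model that always clicks the same point (illustration only)."""

    def predict_click(self, image: object, instruction: str) -> Optional[Tuple[int, int]]:
        return (0, 0)


def get_available_models() -> List[Union[str, ModelProtocol]]:
    """Return the models to benchmark."""
    models: List[Union[str, ModelProtocol]] = [
        # Computer Agent SDK provider, referenced by its model string.
        "huggingface-local/HelloKKMe/GTA1-7B",
        # Custom reference implementation (hypothetical class defined above).
        FixedPointModel(),
    ]
    return models
```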

2. Run Benchmark

# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50

# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50

# Interactive testing
python interactive.py
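
The `--samples` flag presumably limits how many dataset items are evaluated. A minimal sketch of how such a flag could be parsed and applied is shown below. Only the flag name comes from the commands above; `build_parser`, `select_samples`, and the "use everything when omitted" default are assumptions.

```python
import argparse
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts the --samples flag."""
    parser = argparse.ArgumentParser(description="Run a click-grounding benchmark")
    parser.add_argument(
        "--samples",
        type=int,
        default=None,  # assumed default: evaluate the whole dataset
        help="maximum number of dataset samples to evaluate",
    )
    return parser


def select_samples(dataset: Sequence[T], limit: Optional[int]) -> List[T]:
    """Return the first `limit` items, or every item when no limit is given."""
    items = list(dataset)
    return items if limit is None else items[:limit]
```

With this sketch, `build_parser().parse_args(["--samples", "50"]).samples` evaluates to `50`.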

Output

Console Output

Model Results:
  Accuracy: 85.50% (171/200)
  Avg Time: 1.23s (0.89s - 2.45s)
  VRAM Usage: 4.5GB (max) / 3.4GB (avg)
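
The accuracy and timing lines can be derived from raw counts and per-sample latencies. The sketch below assumes the parenthesized range after the average time is the minimum and maximum latency; `format_summary` is a hypothetical helper, not part of the benchmark scripts.

```python
from typing import Sequence


def format_summary(correct: int, total: int, times: Sequence[float]) -> str:
    """Build accuracy and timing lines in the style of the console output."""
    acc = 100.0 * correct / total if total else 0.0
    lines = [f"  Accuracy: {acc:.2f}% ({correct}/{total})"]
    if times:
        avg = sum(times) / len(times)
        # Assumed meaning of the range: fastest and slowest single prediction.
        lines.append(
            f"  Avg Time: {avg:.2f}s ({min(times):.2f}s - {max(times):.2f}s)"
        )
    return "\n".join(lines)
```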

Generated Files

Markdown Report: *_results.md with detailed results tables
Visualizations: output/ directory with prediction visualizations
Interactive Output: interactive_output/ for interactive session results

Contributing

To add a new reference model, follow the instructions in contrib.md.
