mirror/computer

mirror of https://github.com/trycua/computer.git synced 2025-12-31 10:29:59 -06:00

Dillon DuPont 8eb662bf4d added base models to benchmark

2025-08-05 12:45:00 -04:00

| Name | Last commit message | Last updated |
| --- | --- | --- |
| models | added grounding+planning composed loop | 2025-08-04 16:32:05 -04:00 |
| .gitignore | added agent benchmarks | 2025-07-30 13:41:58 -04:00 |
| contrib.md | updated docs | 2025-07-30 16:18:12 -04:00 |
| interactive.py | added GTA1 agent and click benchmarks (ss-pro, repl) | 2025-07-29 20:48:44 -04:00 |
| README.md | updated docs | 2025-07-30 16:19:37 -04:00 |
| ss-pro.py | updated metrics | 2025-07-30 16:12:51 -04:00 |
| ss-v2.py | updated metrics | 2025-07-30 16:12:51 -04:00 |
| utils.py | added base models to benchmark | 2025-08-05 12:45:00 -04:00 |

README.md

Computer Agent Benchmarks

This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.

Overview

The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:

- Computer Agent SDK providers (using model strings like "huggingface-local/HelloKKMe/GTA1-7B")
- Reference agent implementations (custom model classes implementing the `ModelProtocol`)
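
To illustrate the second option, here is a minimal sketch of how a reference model could satisfy a protocol through structural typing. The names `ClickModel`, `Screenshot`, `predict_click`, and `CenterClickModel` are hypothetical stand-ins chosen for this example; the actual `ModelProtocol` interface is described in contrib.md and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple, runtime_checkable


@dataclass
class Screenshot:
    """Hypothetical stand-in for an image; only its size matters here."""
    width: int
    height: int


@runtime_checkable
class ClickModel(Protocol):
    """Assumed shape of a reference model (not the repository's real ModelProtocol)."""

    model_name: str

    def predict_click(
        self, image: Screenshot, instruction: str
    ) -> Optional[Tuple[int, int]]:
        """Return (x, y) pixel coordinates, or None if no prediction was made."""
        ...


class CenterClickModel:
    """Trivial baseline that always clicks the center of the screenshot."""

    model_name = "center-baseline"

    def predict_click(
        self, image: Screenshot, instruction: str
    ) -> Optional[Tuple[int, int]]:
        return (image.width // 2, image.height // 2)


model = CenterClickModel()
prediction = model.predict_click(Screenshot(1920, 1080), "Click the Save button")
print(model.model_name, prediction)
```

Because the protocol is structural, `CenterClickModel` does not need to inherit from it; exposing the expected attribute and method is enough.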

Available Benchmarks

1. ScreenSpot-v2 (`ss-v2.py`)

- Dataset: ScreenSpot-v2 (click-only GUI grounding)
- Format: Standard resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage
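
For click-only grounding datasets such as ScreenSpot, a prediction is conventionally counted as correct when the predicted point falls inside the target element's bounding box. The sketch below illustrates that rule for both ScreenSpot benchmarks. It is an assumption about how the scoring might work, not a copy of the benchmark scripts: the helper names and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical, and "error rate" is taken here to mean the share of samples for which the model returned no prediction.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def is_click_correct(pred: Optional[Point], bbox: BBox) -> bool:
    """A click counts as correct if it lies inside the target box (edges included)."""
    if pred is None:
        return False
    x, y = pred
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


def score(
    preds: Sequence[Optional[Point]], bboxes: Sequence[BBox]
) -> Tuple[float, float]:
    """Return (accuracy, error_rate) as fractions in [0, 1]."""
    if len(preds) != len(bboxes):
        raise ValueError("preds and bboxes must have the same length")
    total = len(preds)
    if total == 0:
        return 0.0, 0.0
    correct = sum(is_click_correct(p, b) for p, b in zip(preds, bboxes))
    failed = sum(p is None for p in preds)  # model produced no coordinates
    return correct / total, failed / total


# Synthetic example: two hits, one miss, one failed prediction.
preds = [(50, 50), (10, 200), None, (300, 120)]
bboxes = [(0, 0, 100, 100), (0, 0, 100, 100), (0, 0, 100, 100), (250, 100, 350, 150)]
accuracy, error_rate = score(preds, bboxes)
print(f"accuracy={accuracy:.2%} error_rate={error_rate:.2%}")
```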

2. ScreenSpot-Pro (`ss-pro.py`)

- Dataset: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- Format: High-resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage

3. Interactive Testing (`interactive.py`)

- Real-time testing: Take screenshots and visualize model predictions
- Commands:
  - Type instruction → test all models on last screenshot
  - `screenshot` → take screenshot
  - `models` → list available models
  - `quit`/`exit` → exit tool
- Output: Visual predictions with crosshairs for each model
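
The command handling described above can be sketched as a small dispatcher. This is an illustration only: `parse_command` and its action labels are hypothetical, and the case-insensitive matching and empty-input handling are assumptions, not documented behavior of interactive.py.

```python
from typing import Tuple


def parse_command(line: str) -> Tuple[str, str]:
    """Map one line of user input to an (action, argument) pair."""
    text = line.strip()
    keyword = text.lower()  # assumption: keywords are matched case-insensitively
    if keyword in ("quit", "exit"):
        return ("quit", "")
    if keyword == "screenshot":
        return ("screenshot", "")
    if keyword == "models":
        return ("models", "")
    if not text:
        return ("noop", "")  # assumption: empty input is ignored
    # Anything else is treated as an instruction to test on the last screenshot.
    return ("predict", text)


for raw in ["screenshot", "Click the Save button", "models", "exit"]:
    print(raw, "->", parse_command(raw))
```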

Running Benchmarks

1. Configure Models

Edit `get_available_models()` in `utils.py` to specify which models you want to test.
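
As a rough sketch of what such a configuration could look like, assuming the function returns a flat list that mixes SDK model strings with reference model instances. The return type and the `MyReferenceModel` class are assumptions made for illustration; check the real function in utils.py before editing it.

```python
from typing import List


class MyReferenceModel:
    """Hypothetical reference model; see contrib.md for the real interface."""

    model_name = "my-reference-model"


def get_available_models() -> List[object]:
    """Return the models to benchmark (assumed structure, not the real one)."""
    models: List[object] = []

    # Computer Agent SDK provider, referenced by its model string
    # (this string is the example given in the Overview).
    models.append("huggingface-local/HelloKKMe/GTA1-7B")

    # Reference implementation, passed as an instance of a custom class.
    models.append(MyReferenceModel())

    return models


for entry in get_available_models():
    print(entry if isinstance(entry, str) else entry.model_name)
```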

2. Run Benchmark

    # ScreenSpot-v2 benchmark
    python ss-v2.py --samples 50

    # ScreenSpot-Pro benchmark
    python ss-pro.py --samples 50

    # Interactive testing
    python interactive.py
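
For readers curious how a `--samples` option of this kind is typically wired up, here is a minimal `argparse` sketch. It is not the scripts' actual code: the default value, the help text, and the choice to keep the first N items (instead of a random subset) are all assumptions.

```python
import argparse
from typing import List, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Click-prediction benchmark (sketch)")
    parser.add_argument(
        "--samples",
        type=int,
        default=None,  # assumption: no limit when the flag is omitted
        help="maximum number of dataset samples to evaluate",
    )
    return parser.parse_args(argv)


def limit_samples(dataset: List[dict], samples: Optional[int]) -> List[dict]:
    """Keep the first `samples` items, or everything if no limit is given."""
    if samples is None:
        return dataset
    return dataset[:samples]


# Equivalent to running the script with `--samples 50`.
args = parse_args(["--samples", "50"])
subset = limit_samples([{"id": i} for i in range(200)], args.samples)
print(f"evaluating {len(subset)} samples")
```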

Output

Console Output

    Model Results:
      Accuracy: 85.50% (171/200)
      Avg Time: 1.23s (0.89s - 2.45s)
      VRAM Usage: 4.5GB (max) / 3.4GB (avg)

Generated Files

- Markdown Report: `*_results.md` with detailed results tables
- Visualizations: `output/` directory with prediction visualizations
- Interactive Output: `interactive_output/` for interactive session results
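
As an illustration of the Markdown report, a results table could be rendered along these lines. The column set and the `render_results_table` helper are assumptions made for this sketch, the row is a synthetic placeholder, and the real report layout may differ. Writing the string to disk is left out.

```python
from typing import Dict, List


def render_results_table(results: List[Dict[str, object]]) -> str:
    """Render one Markdown table row per model (assumed columns)."""
    header = "| Model | Accuracy | Avg Time (s) |"
    divider = "| --- | --- | --- |"
    rows = [
        f"| {r['model']} | {r['accuracy']:.2%} | {r['avg_time']:.2f} |"
        for r in results
    ]
    return "\n".join([header, divider, *rows]) + "\n"


# Synthetic placeholder row, not a measured result.
report = render_results_table(
    [{"model": "example-model", "accuracy": 0.5, "avg_time": 1.0}]
)
print(report)
```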

Contributing

To add a new reference model, follow the instructions in contrib.md.
