Computer Agent Benchmarks
This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.
Overview
The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:
- Computer Agent SDK providers (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- Reference agent implementations (custom model classes implementing the `ModelProtocol`)
Available Benchmarks
1. ScreenSpot-v2 (ss-v2.py)
- Dataset: ScreenSpot-v2 (click-only GUI grounding)
- Format: Standard resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage
2. ScreenSpot-Pro (ss-pro.py)
- Dataset: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- Format: High-resolution screenshots
- Task: Predict click coordinates given an instruction and image
- Metrics: Accuracy, Error Rate, Timing, VRAM usage
3. Interactive Testing (interactive.py)
- Real-time testing: Take screenshots and visualize model predictions
- Commands:
  - Type an instruction → take a screenshot and test all models
  - `screenshot` → take a screenshot without prediction
  - `models` → list available models
  - `quit`/`exit` → exit the tool
- Output: Visual predictions with crosshairs for each model
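Each prediction is rendered as a crosshair overlaid on the screenshot. The helper below is only an illustrative sketch of how such an overlay can be drawn with Pillow; the benchmark scripts generate these visualizations for you, so the actual drawing code in this repo may differ.

```python
from PIL import Image, ImageDraw

def draw_crosshair(image: Image.Image, x: int, y: int, size: int = 20) -> Image.Image:
    """Return a copy of `image` with a red crosshair centered on (x, y)."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.line([(x - size, y), (x + size, y)], fill="red", width=3)  # horizontal arm
    draw.line([(x, y - size), (x, y + size)], fill="red", width=3)  # vertical arm
    return annotated
```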
Adding Reference Agent Implementations
1. Implement the ModelProtocol
Create a new file in the `models/` directory implementing the `ModelProtocol`:
```python
from models.base import ModelProtocol
from typing import Optional, Tuple
from PIL import Image


class YourModelName(ModelProtocol):
    def __init__(self, model_path: str):
        self.model_path = model_path
        self._model = None

    @property
    def model_name(self) -> str:
        return self.model_path

    async def load_model(self) -> None:
        """Load the model into memory."""
        # Your model loading logic here
        pass

    async def unload_model(self) -> None:
        """Unload the model from memory."""
        # Your model cleanup logic here
        pass

    async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
        """
        Predict click coordinates for the given image and instruction.

        Args:
            image: PIL Image to analyze
            instruction: Text instruction describing what to click

        Returns:
            Tuple of (x, y) coordinates, or None if prediction fails
        """
        # Your prediction logic here
        x, y = 0, 0  # Placeholder: replace with your model's predicted coordinates
        return (x, y)
```
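Assuming an implementation like the one above, you can exercise it on its own before wiring it into the benchmarks. This is a minimal sketch; the file path and instruction text are placeholders, and the harness in `utils.py` normally drives these calls for you.

```python
import asyncio
from PIL import Image

async def smoke_test() -> None:
    model = YourModelName("path/to/your/model")
    await model.load_model()
    screenshot = Image.open("screenshot.png")  # placeholder: any test screenshot
    coords = await model.predict_click(screenshot, "Click the Submit button")
    print(f"{model.model_name} predicted: {coords}")
    await model.unload_model()

asyncio.run(smoke_test())
```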
2. Register Your Model
Add your model to the `get_available_models()` function in `utils.py`:
```python
def get_available_models() -> List[Union[str, ModelProtocol]]:
    models = [
        # Computer Agent SDK providers
        "huggingface-local/HelloKKMe/GTA1-7B",
        # Reference implementations
        GTA1Model("HelloKKMe/GTA1-7B"),
        YourModelName("path/to/your/model"),  # Add your model here
    ]
    return models
```
Running Benchmarks
1. Configure Models
Edit `utils.py` to specify which models you want to test in `get_available_models()`.
2. Set Sample Count
Edit the benchmark script to change the number of samples:
```python
max_samples = 50  # Set to None to evaluate on the full dataset
```
3. Run Benchmark
```bash
# ScreenSpot-v2 benchmark
python ss-v2.py

# ScreenSpot-Pro benchmark
python ss-pro.py

# Interactive testing
python interactive.py
```
Output
Console Output
```
Model Results:
  Accuracy: 85.50%
  Correct: 171/200
  Errors: 5
  Error Rate: 2.50%
  Avg Time: 1.23s
  Time Range: 0.89s - 2.45s
  VRAM Max: 4.5GB
  VRAM Avg: 3.4GB
```
Generated Files
- Markdown Report: `*_results.md` with detailed results tables
- Visualizations: `output/` directory with prediction visualizations
- Interactive Output: `interactive_output/` for interactive session results
Metrics Tracked
- Accuracy: Percentage of clicks within bounding boxes (see the sketch after this list)
- Error Rate: Percentage of failed predictions
- Timing: Average, min, max prediction times
- VRAM Usage: Maximum and average GPU memory usage
- Per-sample Results: Detailed breakdown for debugging
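Accuracy is the fraction of predicted clicks that land inside the ground-truth bounding box. A minimal sketch of the per-sample check, assuming boxes are stored as `(x1, y1, x2, y2)` pixel coordinates (the actual datasets may use a different layout):

```python
from typing import Optional, Sequence, Tuple

def click_in_bbox(click: Optional[Tuple[int, int]], bbox: Sequence[float]) -> bool:
    """True if the predicted click falls inside the (x1, y1, x2, y2) box."""
    if click is None:  # a missing prediction never counts as a hit
        return False
    x, y = click
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2
```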
Requirements
- Python 3.8+
- PyTorch (for VRAM tracking; see the sketch after this list)
- PIL/Pillow (for image processing)
- datasets (for HuggingFace datasets)
- tqdm (for progress bars)
- Computer Agent SDK
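When a CUDA device is present, GPU memory can be tracked with PyTorch's built-in memory statistics. This is a minimal sketch of the pattern, not the benchmark's actual bookkeeping:

```python
import torch

def reset_vram_stats() -> None:
    """Reset PyTorch's peak-memory counter before a run."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

def peak_vram_gb() -> float:
    """Peak GPU memory allocated since the last reset, in GB (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024**3
```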
Architecture
The benchmark system is designed for:
- Modularity: Easy to add new models and benchmarks
- Flexibility: Works with any iterator of dicts with `image`, `bbox`, and `instruction` keys (see the dataset sketch after this list)
- Performance: VRAM tracking and timing analysis
- Visualization: Automatic generation of prediction visualizations
- No Exception Handling: Fails fast to surface real issues
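Because the evaluator only looks at those three keys, a custom dataset can be a plain generator. A minimal sketch, assuming PIL images and `(x1, y1, x2, y2)` pixel bounding boxes (match whatever convention the ScreenSpot scripts use); the file path and instruction below are placeholders:

```python
from typing import Dict, Iterator
from PIL import Image

def my_dataset() -> Iterator[Dict]:
    """Yield benchmark samples with the keys the evaluator expects."""
    samples = [
        ("screens/login.png", (412, 300, 520, 340), "Click the Sign in button"),
        # ... more (image_path, bbox, instruction) entries
    ]
    for image_path, bbox, instruction in samples:
        yield {
            "image": Image.open(image_path),
            "bbox": bbox,  # assumed (x1, y1, x2, y2) in pixels
            "instruction": instruction,
        }
```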
Results Table
| Model | Dataset | Accuracy | Error Rate | Avg Time | VRAM Max | VRAM Avg |
|---|---|---|---|---|---|---|
| (coming soon) | | | | | | |
Contributing
To add a new benchmark:
- Create a new script following the pattern in `ss-v2.py` (see the sketch below)
- Use the `evaluate_model()` function from `utils`
- Ensure your dataset yields dicts with `image`, `bbox`, and `instruction` keys
- Update this README with benchmark details
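A new benchmark script can therefore stay small. The sketch below only illustrates the wiring; the exact `evaluate_model()` signature, its return value, and whether it is synchronous are assumptions, so check `utils.py` and `ss-v2.py` for the real arguments and reporting helpers. The sample data is a placeholder.

```python
# my_benchmark.py - hypothetical sketch following the ss-v2.py pattern
from PIL import Image

from utils import evaluate_model, get_available_models  # helpers named in this README

# Any iterator of dicts with `image`, `bbox`, and `instruction` keys works;
# this single-sample list is a placeholder (see the dataset sketch above).
dataset = [
    {
        "image": Image.open("screens/login.png"),  # placeholder path
        "bbox": (412, 300, 520, 340),              # assumed (x1, y1, x2, y2)
        "instruction": "Click the Sign in button",
    }
]

max_samples = 50  # set to None to evaluate on the full dataset

for model in get_available_models():
    # NOTE: the evaluate_model() call below is an assumed signature; check utils.py
    results = evaluate_model(model, dataset, max_samples=max_samples)
    print(results)
```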