# Computer Agent Benchmarks
This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.
## Overview
The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`); see the sketch after this list
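
The two styles differ only in how the model is supplied to the benchmark. The sketch below is illustrative: the exact method names and signatures are defined by `ModelProtocol` in this repository (see [contrib.md](contrib.md)); the click-prediction method shown here is an assumption.

```python
from typing import Protocol, Tuple
from PIL import Image

# Illustrative only: the real ModelProtocol in this repo defines the actual
# method names and signatures; a click-prediction method is assumed here.
class ModelProtocol(Protocol):
    @property
    def name(self) -> str: ...

    async def predict_click(self, image: Image.Image, instruction: str) -> Tuple[int, int]: ...


# Style 1: a Computer Agent SDK provider, referenced by model string.
sdk_model = "huggingface-local/HelloKKMe/GTA1-7B"


# Style 2: a reference implementation, i.e. a class conforming to the protocol.
class MyReferenceModel:
    @property
    def name(self) -> str:
        return "my-reference-model"

    async def predict_click(self, image: Image.Image, instruction: str) -> Tuple[int, int]:
        # Return (x, y) pixel coordinates for the predicted click.
        return (0, 0)
```
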
## Available Benchmarks
### 1. ScreenSpot-v2 (`ss-v2.py`)
- **Dataset**: ScreenSpot-v2 (click-only GUI grounding)
- **Format**: Standard resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage
### 2. ScreenSpot-Pro (`ss-pro.py`)
- **Dataset**: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- **Format**: High-resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage (the accuracy check is sketched below)
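
For both ScreenSpot benchmarks, a prediction typically counts as a hit when the predicted point falls inside the ground-truth bounding box of the target element. A minimal sketch of that check, assuming `(left, top, right, bottom)` pixel boxes (the datasets' actual box format may differ):

```python
def is_hit(pred_x: float, pred_y: float, bbox: tuple[float, float, float, float]) -> bool:
    """True if the predicted click lands inside the target bounding box.

    Assumes bbox = (left, top, right, bottom) in pixels; adjust if the
    dataset stores boxes as (x, y, width, height) instead.
    """
    left, top, right, bottom = bbox
    return left <= pred_x <= right and top <= pred_y <= bottom
```
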
### 3. Interactive Testing (`interactive.py`)
- **Real-time testing**: Take screenshots and visualize model predictions
- **Commands** (an example sequence is sketched after this list):
  - Type an instruction → test all models on the last screenshot
  - `screenshot` → take a new screenshot
  - `models` → list available models
  - `quit`/`exit` → exit the tool
- **Output**: Visual predictions with crosshairs for each model
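
Commands are typed at the tool's prompt. An illustrative sequence (the instruction text is an example, and the `#` annotations are explanatory, not part of the input):

```
$ python interactive.py
screenshot                        # capture the current screen
click the blue "Submit" button    # free-text instruction; runs every model on the last screenshot
models                            # list the configured models
quit                              # leave the tool
```
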
## Running Benchmarks
### 1. Configure Models
Edit `get_available_models()` in `utils.py` to specify which models you want to test.
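
A hypothetical `get_available_models()` mixing an SDK model string with a reference implementation might look like the sketch below; the exact return type the benchmark scripts expect is defined in `utils.py` itself.

```python
# utils.py (illustrative sketch)
from my_reference_models import MyReferenceModel  # hypothetical module and class


def get_available_models():
    return [
        # Computer Agent SDK provider, referenced by model string
        "huggingface-local/HelloKKMe/GTA1-7B",
        # Reference implementation conforming to ModelProtocol (see contrib.md)
        MyReferenceModel(),
    ]
```
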
### 2. Run Benchmark
```bash
# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50

# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50

# Interactive testing
python interactive.py
```
## Output
### Console Output
```
Model Results:
Accuracy: 85.50% (171/200)
Avg Time: 1.23s (0.89s - 2.45s)
VRAM Usage: 4.5GB (max) / 3.4GB (avg)
```
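
These summary lines are simple aggregates over per-sample measurements, roughly the arithmetic below; the benchmark scripts' own bookkeeping and field names may differ.

```python
from statistics import mean

# Illustrative per-sample records; the field names are assumptions, not the scripts' real schema.
results = [
    {"hit": True, "seconds": 1.10, "vram_gb": 4.2},
    {"hit": False, "seconds": 0.95, "vram_gb": 4.5},
]

hits = sum(r["hit"] for r in results)
times = [r["seconds"] for r in results]
vram = [r["vram_gb"] for r in results]

print(f"Accuracy: {hits / len(results):.2%} ({hits}/{len(results)})")
print(f"Avg Time: {mean(times):.2f}s ({min(times):.2f}s - {max(times):.2f}s)")
print(f"VRAM Usage: {max(vram):.1f}GB (max) / {mean(vram):.1f}GB (avg)")
```
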
### Generated Files
- **Markdown Report**: `*_results.md` with detailed results tables
- **Visualizations**: `output/` directory with prediction visualizations
- **Interactive Output**: `interactive_output/` for interactive session results
## Contributing
To add a new reference model, follow the instructions in [contrib.md](contrib.md).