Merge branch 'main' into feature/new-logo

This commit is contained in:
Morgan Dean
2025-08-13 17:17:32 +02:00
79 changed files with 118625 additions and 1731 deletions

414
README.md
View File

@@ -16,223 +16,149 @@
**cua** ("koo-ah") is Docker for [Computer-Use Agents](https://www.oneusefulthing.org/p/when-you-give-a-claude-a-mouse) - it enables AI agents to control full operating systems in virtual containers and deploy them locally or to the cloud.
<div align="center">
<video src="https://github.com/user-attachments/assets/c619b4ea-bb8e-4382-860e-f3757e36af20" width="800" controls></video>
<video src="https://github.com/user-attachments/assets/c619b4ea-bb8e-4382-860e-f3757e36af20" width="600" controls></video>
</div>
<details>
<summary><b>Check out more demos of the Computer-Use Agent in action
</b></summary>
<details open>
<summary><b>MCP Server: Work with Claude Desktop and Tableau</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/9f573547-5149-493e-9a72-396f3cff29df" width="800" controls></video>
</div>
</details>
With the Computer SDK, you can:
- automate Windows, Linux, and macOS VMs with a consistent, [pyautogui-like API](https://docs.trycua.com/docs/libraries/computer#interface-actions)
- create & manage VMs [locally](https://docs.trycua.com/docs/computer-sdk/computers#cua-local-containers) or using [cua cloud](https://www.trycua.com/)
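For example, a minimal sketch of the interface, assuming a local macOS VM set up as described in the Usage Guide below:
```python
import asyncio
from computer import Computer

async def main():
    # Sketch: start a local macOS VM (requires the Lume setup described later in this README)
    computer = Computer(os_type="macos")
    await computer.run()
    await computer.interface.left_click(100, 200)        # pyautogui-like mouse action
    await computer.interface.type_text("Hello, world!")  # keyboard action
    screenshot = await computer.interface.screenshot()   # returns image bytes

asyncio.run(main())
```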
<details>
<summary><b>AI-Gradio: Multi-app workflow with browser, VS Code and terminal</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/723a115d-1a07-4c8e-b517-88fbdf53ed0f" width="800" controls></video>
</div>
</details>
With the Agent SDK, you can:
- run computer-use models with a [consistent output](https://docs.trycua.com/docs/agent-sdk/chat-history#message-array-structure)
- run composed agents using UI grounding models and any LLM
- use any liteLLM provider (`openai/`, `openrouter/`, etc.) or our included local providers (`huggingface-local/`, `mlx/`)
- quickly evaluate new UI agent models and UI grounding models
- `anthropic/claude-opus-4-1-20250805` (using [Computer-Use Models](https://docs.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents))
- `openai/computer-use-preview`
- `openrouter/z-ai/glm-4.5v`
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `omniparser+{any LLM}` (using [Composed Agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents))
- `huggingface-local/HelloKKMe/GTA1-7B+{any LLM}`
- `huggingface/HelloKKMe/GTA1-32B+{any LLM}`
- `vllm_hosted/HelloKKMe/GTA1-72B+{any LLM}`
- `human/human` (using [Human-in-the-Loop](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop))
- benchmark on OSWorld-Verified, SheetBench-V2, and more [with a single line of code using HUD](https://docs.trycua.com/docs/agent-sdk/integrations/hud) ([Notebook](https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb))
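For example, a minimal sketch of swapping between these model strings, assuming `computer` is a connected Computer instance from the Computer SDK:
```python
from agent import ComputerAgent

# A hosted computer-use model ...
agent = ComputerAgent(model="anthropic/claude-opus-4-1-20250805", tools=[computer])

# ... or a composed agent: a local grounding model plus any liteLLM planning model
agent = ComputerAgent(model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o", tools=[computer])

async for result in agent.run("Open the settings and enable dark mode"):
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"])
```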
<details>
<summary><b>Notebook: Fix GitHub issue in Cursor</b></summary>
<br>
<div align="center">
<video src="https://github.com/user-attachments/assets/f67f0107-a1e1-46dc-aa9f-0146eb077077" width="800" controls></video>
</div>
</details>
</details><br/>
# 🚀 Quick Start with a Computer-Use Agent UI
**Need to automate desktop tasks? Launch the Computer-Use Agent UI with a single command.**
### Option 1: Fully-managed install with Docker (recommended)
*Docker-based guided install for quick use*
**macOS/Linux/Windows (via WSL):**
```bash
# Requires Docker
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/scripts/playground-docker.sh)"
```
This script will guide you through setup using Docker containers and launch the Computer-Use Agent UI.
---
### Option 2: [Dev Container](./.devcontainer/README.md)
*Best for contributors and development*
This repository includes a [Dev Container](./.devcontainer/README.md) configuration that simplifies setup to a few steps:
1. **Install the Dev Containers extension ([VS Code](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) or [WindSurf](https://docs.windsurf.com/windsurf/advanced#dev-containers-beta))**
2. **Open the repository in the Dev Container:**
- Press `Ctrl+Shift+P` (or `⌘+Shift+P` on macOS)
- Select `Dev Containers: Clone Repository in Container Volume...` and paste the repository URL `https://github.com/trycua/cua.git` (if you have not cloned the repository yet), or select `Dev Containers: Open Folder in Container...` (if you already have a local clone).
> **Note**: On WindSurf, the post install hook might not run automatically. If so, run `/bin/bash .devcontainer/post-install.sh` manually.
3. **Open the VS Code workspace:** Once `post-install.sh` has finished running, open the `.vscode/py.code-workspace` workspace and press ![Open Workspace](https://github.com/user-attachments/assets/923bdd43-8c8f-4060-8d78-75bfa302b48c).
4. **Run the Agent UI example:** Click ![Run Agent UI](https://github.com/user-attachments/assets/7a61ef34-4b22-4dab-9864-f86bf83e290b)
to start the Gradio UI. If prompted to install **debugpy (Python Debugger)** to enable remote debugging, select 'Yes' to proceed.
5. **Access the Gradio UI:** The Gradio UI will be available at `http://localhost:7860` and will automatically forward to your host machine.
---
### Option 3: PyPI
*Direct Python package installation*
```bash
# conda create -yn cua python==3.12
pip install -U "cua-computer[all]" "cua-agent[all]"
python -m agent.ui # Start the agent UI
```
Or check out the [Usage Guide](#-usage-guide) to learn how to use our Python SDK in your own code.
---
## Supported [Agent Loops](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops)
- [UITARS-1.5](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Run locally on Apple Silicon with MLX, or use cloud providers
- [OpenAI CUA](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Use OpenAI's Computer-Use Preview model
- [Anthropic CUA](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Use Anthropic's Computer-Use capabilities
- [OmniParser-v2.0](https://github.com/trycua/cua/blob/main/libs/python/agent/README.md#agent-loops) - Control UI with [Set-of-Marks prompting](https://som-gpt4v.github.io/) using any vision model
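A minimal sketch of selecting these loops, assuming `computer` is a connected Computer instance; the loop is chosen automatically from the model string:
```python
from agent import ComputerAgent

ComputerAgent(model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])  # UI-TARS-1.5
ComputerAgent(model="openai/computer-use-preview", tools=[computer])                      # OpenAI CUA
ComputerAgent(model="anthropic/claude-3-5-sonnet-20240620", tools=[computer])             # Anthropic CUA
ComputerAgent(model="omniparser+openai/gpt-4o", tools=[computer])                         # OmniParser + any vision LLM
```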
## 🖥️ Compatibility
For detailed compatibility information including host OS support, VM emulation capabilities, and model provider compatibility, see the [Compatibility Matrix](./COMPATIBILITY.md).
Missing a model? [Raise a feature request](https://github.com/trycua/cua/issues/new?assignees=&labels=enhancement&projects=&title=%5BAgent%5D%3A+Add+model+support+for+) or [contribute](https://github.com/trycua/cua/blob/main/CONTRIBUTING.md)!
<br/>
# Quick Start
- [Get started with a Computer-Use Agent UI](https://docs.trycua.com/docs/quickstart-ui)
- [Get started with the Computer-Use Agent CLI](https://docs.trycua.com/docs/quickstart-cli)
- [Get Started with the Python SDKs](https://docs.trycua.com/docs/quickstart-devs)
<br/>
# 🐍 Usage Guide
Follow these steps to use Cua in your own Python code. See [Developer Guide](./docs/Developer-Guide.md) for building from source.
### Step 1: Install Lume CLI
# Usage ([Docs](https://docs.trycua.com/docs))
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
pip install cua-agent[all]
```
```python
from agent import ComputerAgent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022",
tools=[computer],
max_trajectory_budget=5.0
)
messages = [{"role": "user", "content": "Take a screenshot and tell me what you see"}]
async for result in agent.run(messages):
for item in result["output"]:
if item["type"] == "message":
print(item["content"][0]["text"])
```
Lume CLI manages high-performance macOS/Linux VMs with near-native speed on Apple Silicon.
### Output format (OpenAI Agent Responses Format):
```json
{
"output": [
# user input
{
"role": "user",
"content": "go to trycua on gh"
},
# first agent turn adds the model output to the history
{
"summary": [
{
"text": "Searching Firefox for Trycua GitHub",
"type": "summary_text"
}
],
"type": "reasoning"
},
{
"action": {
"text": "Trycua GitHub",
"type": "type"
},
"call_id": "call_QI6OsYkXxl6Ww1KvyJc4LKKq",
"status": "completed",
"type": "computer_call"
},
# second agent turn adds the computer output to the history
{
"type": "computer_call_output",
"call_id": "call_QI6OsYkXxl6Ww1KvyJc4LKKq",
"output": {
"type": "input_image",
"image_url": "data:image/png;base64,..."
}
},
# final agent turn adds the agent output text to the history
{
"type": "message",
"role": "assistant",
"content": [
{
"text": "Success! The Trycua GitHub page has been opened.",
"type": "output_text"
}
]
}
],
"usage": {
"prompt_tokens": 150,
"completion_tokens": 75,
"total_tokens": 225,
"response_cost": 0.01,
}
}
```
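As a sketch, the `output` array and `usage` block above can be consumed like this (continuing the `agent.run(messages)` loop from the snippet above):
```python
async for result in agent.run(messages):
    for item in result["output"]:
        if item["type"] == "reasoning":
            print("summary:", item["summary"][0]["text"])
        elif item["type"] == "computer_call":
            print("action:", item["action"]["type"])
        elif item["type"] == "message":
            print("assistant:", item["content"][0]["text"])
    print("cost so far ($):", result["usage"]["response_cost"])
```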
### Step 2: Pull the macOS CUA Image
# Computer ([Docs](https://docs.trycua.com/docs/computer-sdk/computers))
```bash
lume pull macos-sequoia-cua:latest
pip install cua-computer[all]
```
The macOS CUA image contains the default Mac apps and the Computer Server for easy automation.
### Step 3: Install Python SDK
```bash
pip install "cua-computer[all]" "cua-agent[all]"
```
### Step 4: Use in Your Code
```python
import asyncio
from computer import Computer
from agent import ComputerAgent, LLM
async def main():
# Start a local macOS VM
computer = Computer(os_type="macos")
await computer.run()
async with Computer(
os_type="linux",
provider_type="cloud",
name="your-container-name",
api_key="your-api-key"
) as computer:
# Take screenshot
screenshot = await computer.interface.screenshot()
# Or with Cua Cloud Container
computer = Computer(
os_type="linux",
api_key="your_cua_api_key_here",
name="your_container_name_here"
)
# Example: Direct control of a macOS VM with Computer
computer.interface.delay = 0.1 # Wait 0.1 seconds between keyboard/mouse actions
await computer.interface.left_click(100, 200)
await computer.interface.type_text("Hello, world!")
screenshot_bytes = await computer.interface.screenshot()
# Example: Create and run an agent locally using mlx-community/UI-TARS-1.5-7B-6bit
agent = ComputerAgent(
model="mlx/mlx-community/UI-TARS-1.5-7B-6bit",
tools=[computer],
)
async for result in agent.run("Find the trycua/cua repository on GitHub and follow the quick start guide"):
print(result)
if __name__ == "__main__":
asyncio.run(main())
# Click and type
await computer.interface.left_click(100, 100)
await computer.interface.type("Hello!")
```
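Since the block above shows the local-VM and Cua Cloud variants side by side, here is one minimal end-to-end sketch (container name, API key, and model string are placeholders):
```python
import asyncio
from computer import Computer
from agent import ComputerAgent

async def main():
    # Cua Cloud container (placeholders for name/API key); a local VM works the same way
    async with Computer(
        os_type="linux",
        provider_type="cloud",
        name="your-container-name",
        api_key="your-api-key",
    ) as computer:
        agent = ComputerAgent(
            model="anthropic/claude-3-5-sonnet-20241022",
            tools=[computer],
        )
        async for result in agent.run("Find the trycua/cua repository on GitHub"):
            for item in result["output"]:
                if item["type"] == "message":
                    print(item["content"][0]["text"])

if __name__ == "__main__":
    asyncio.run(main())
```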
For ready-to-use examples, check out our [Notebooks](./notebooks/) collection.
### Lume CLI Reference
```bash
# Install Lume CLI and background service
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash
# List all VMs
lume ls
# Pull a VM image
lume pull macos-sequoia-cua:latest
# Create a new VM
lume create my-vm --os macos --cpu 4 --memory 8GB --disk-size 50GB
# Run a VM (creates and starts if it doesn't exist)
lume run macos-sequoia-cua:latest
# Stop a VM
lume stop macos-sequoia-cua_latest
# Delete a VM
lume delete macos-sequoia-cua_latest
```
### Lumier CLI Reference
For advanced container-like virtualization, check out [Lumier](./libs/lumier/README.md) - a Docker interface for macOS and Linux VMs.
```bash
# Install Lume CLI and background service
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash
# Run macOS in a Docker container
docker run -it --rm \
--name lumier-vm \
-p 8006:8006 \
-v $(pwd)/storage:/storage \
-v $(pwd)/shared:/shared \
-e VM_NAME=lumier-vm \
-e VERSION=ghcr.io/trycua/macos-sequoia-cua:latest \
-e CPU_CORES=4 \
-e RAM_SIZE=8192 \
-e HOST_STORAGE_PATH=$(pwd)/storage \
-e HOST_SHARED_PATH=$(pwd)/shared \
trycua/lumier:latest
```
## Resources
# Resources
- [How to use the MCP Server with Claude Desktop or other MCP clients](./libs/python/mcp-server/README.md) - One of the easiest ways to get started with Cua
- [How to use OpenAI Computer-Use, Anthropic, OmniParser, or UI-TARS for your Computer-Use Agent](./libs/python/agent/README.md)
- [How to use Lume CLI for managing desktops](./libs/lume/README.md)
- [Training Computer-Use Models: Collecting Human Trajectories with Cua (Part 1)](https://www.trycua.com/blog/training-computer-use-models-trajectories-1)
- [Build Your Own Operator on macOS (Part 1)](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)
## Modules
@@ -249,112 +175,6 @@ docker run -it --rm \
| [**Core (Python)**](./libs/python/core/README.md) | Python Core utilities | `pip install cua-core` |
| [**Core (Typescript)**](./libs/typescript/core/README.md) | Typescript Core utilities | `npm install @trycua/core` |
## Computer Interface Reference
For complete examples, see [computer_examples.py](./examples/computer_examples.py) or [computer_nb.ipynb](./notebooks/computer_nb.ipynb)
```python
# Shell Actions
result = await computer.interface.run_command(cmd) # Run shell command
# result.stdout, result.stderr, result.returncode
# Mouse Actions
await computer.interface.left_click(x, y) # Left click at coordinates
await computer.interface.right_click(x, y) # Right click at coordinates
await computer.interface.double_click(x, y) # Double click at coordinates
await computer.interface.move_cursor(x, y) # Move cursor to coordinates
await computer.interface.drag_to(x, y, duration) # Drag to coordinates
await computer.interface.get_cursor_position() # Get current cursor position
await computer.interface.mouse_down(x, y, button="left") # Press and hold a mouse button
await computer.interface.mouse_up(x, y, button="left") # Release a mouse button
# Keyboard Actions
await computer.interface.type_text("Hello") # Type text
await computer.interface.press_key("enter") # Press a single key
await computer.interface.hotkey("command", "c") # Press key combination
await computer.interface.key_down("command") # Press and hold a key
await computer.interface.key_up("command") # Release a key
# Scrolling Actions
await computer.interface.scroll(x, y) # Scroll the mouse wheel
await computer.interface.scroll_down(clicks) # Scroll down
await computer.interface.scroll_up(clicks) # Scroll up
# Screen Actions
await computer.interface.screenshot() # Take a screenshot
await computer.interface.get_screen_size() # Get screen dimensions
# Clipboard Actions
await computer.interface.set_clipboard(text) # Set clipboard content
await computer.interface.copy_to_clipboard() # Get clipboard content
# File System Operations
await computer.interface.file_exists(path) # Check if file exists
await computer.interface.directory_exists(path) # Check if directory exists
await computer.interface.read_text(path, encoding="utf-8") # Read file content
await computer.interface.write_text(path, content, encoding="utf-8") # Write file content
await computer.interface.read_bytes(path) # Read file content as bytes
await computer.interface.write_bytes(path, content) # Write file content as bytes
await computer.interface.delete_file(path) # Delete file
await computer.interface.create_dir(path) # Create directory
await computer.interface.delete_dir(path) # Delete directory
await computer.interface.list_dir(path) # List directory contents
# Accessibility
await computer.interface.get_accessibility_tree() # Get accessibility tree
# Delay Configuration
# Set default delay between all actions (in seconds)
computer.interface.delay = 0.5 # 500ms delay between actions
# Or specify delay for individual actions
await computer.interface.left_click(x, y, delay=1.0) # 1 second delay after click
await computer.interface.type_text("Hello", delay=0.2) # 200ms delay after typing
await computer.interface.press_key("enter", delay=0.5) # 500ms delay after key press
# Python Virtual Environment Operations
await computer.venv_install("demo_venv", ["requests", "macos-pyxa"]) # Install packages in a virtual environment
await computer.venv_cmd("demo_venv", "python -c 'import requests; print(requests.get(\"https://httpbin.org/ip\").json())'") # Run a shell command in a virtual environment
await computer.venv_exec("demo_venv", python_function_or_code, *args, **kwargs) # Run a Python function in a virtual environment and return the result / raise an exception
# Example: Use sandboxed functions to execute code in a Cua Container
from computer.helpers import sandboxed
@sandboxed("demo_venv")
def greet_and_print(name):
"""Get the HTML of the current Safari tab"""
import PyXA
safari = PyXA.Application("Safari")
html = safari.current_document.source()
print(f"Hello from inside the container, {name}!")
return {"greeted": name, "safari_html": html}
# When a @sandboxed function is called, it will execute in the container
result = await greet_and_print("Cua")
# Result: {"greeted": "Cua", "safari_html": "<html>...</html>"}
# stdout and stderr are also captured and printed / raised
print("Result from sandboxed function:", result)
```
## ComputerAgent Reference
For complete examples, see [agent_examples.py](./examples/agent_examples.py) or [agent_nb.ipynb](./notebooks/agent_nb.ipynb)
```python
# Import necessary components
from agent import ComputerAgent
# UI-TARS-1.5 agent for local execution with MLX
ComputerAgent(model="mlx/mlx-community/UI-TARS-1.5-7B-6bit")
# OpenAI Computer-Use agent using OPENAI_API_KEY
ComputerAgent(model="computer-use-preview")
# Anthropic Claude agent using ANTHROPIC_API_KEY
ComputerAgent(model="anthropic/claude-3-5-sonnet-20240620")
# OmniParser loop for UI control using Set-of-Marks (SOM) prompting and any vision LLM
ComputerAgent(model="omniparser+ollama_chat/gemma3:12b-it-q4_K_M")
```
## Community
Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas, get assistance, or share your demos!
@@ -409,4 +229,4 @@ Thank you to all our supporters!
<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- ALL-CONTRIBUTORS-LIST:END -->

View File

@@ -29,11 +29,4 @@ async for result in agent.run(prompt):
print("Agent:", result["output"][-1]["content"][0]["text"])
```
We currently support 4 computer-using agent loops:
- Anthropic CUAs
- OpenAI CUA Preview
- UI-TARS 1.5
- Omniparser + LLMs
For a full list of supported models and configurations, see the [Supported Agents](./supported-agents) page.
For a list of supported models and configurations, see the [Supported Agents](./supported-agents/computer-use-agents) page.

View File

@@ -0,0 +1,28 @@
---
title: Benchmarks
description: Computer Agent SDK benchmarks for agentic GUI tasks
---
The benchmark system evaluates models on GUI grounding tasks, specifically agent loop success rate and click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)
## Available Benchmarks
- **[ScreenSpot-v2](./screenspot-v2)** - Standard resolution GUI grounding
- **[ScreenSpot-Pro](./screenspot-pro)** - High-resolution GUI grounding
- **[Interactive Testing](./interactive)** - Real-time testing and visualization
## Quick Start
```bash
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd libs/python/agent/benchmarks
# Install dependencies
pip install "cua-agent[all]"
# Run a benchmark
python ss-v2.py
```

View File

@@ -0,0 +1,21 @@
---
title: Interactive Tool
description: Real-time testing and visualization tool for GUI grounding models
---
This tool allows you to test multiple models interactively by providing natural language instructions. It automatically captures screenshots and tests all configured models sequentially, providing immediate feedback and visual results.
## Usage
```bash
# Start the interactive tool
cd libs/python/agent/benchmarks
python interactive.py
```
## Commands
- **Type instruction**: Screenshot + test all models
- **`screenshot`**: Take screenshot without prediction
- **`models`**: List available models
- **`quit`/`exit`**: Exit the tool

View File

@@ -0,0 +1,57 @@
---
title: Introduction
description: Overview of benchmarking in the c/ua agent framework
---
The c/ua agent framework uses benchmarks to test the performance of supported models and providers on various agentic tasks.
## Benchmark Types
Computer-Agent benchmarks evaluate two key capabilities:
- **Plan Generation**: Breaking down complex tasks into a sequence of actions
- **Coordinate Generation**: Predicting precise click locations on GUI elements
## Using State-of-the-Art Models
Let's see how to use the SOTA vision-language models in the c/ua agent framework.
### Plan Generation + Coordinate Generation
**[OS-World](https://os-world.github.io/)** - Benchmark for complete computer-use agents
This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.
```python
# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM
# This makes it suitable for agentic loops for computer-use
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```
### Coordinate Generation Only
**[GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/)** - Benchmark for click prediction accuracy
This leaderboard tests models that specialize in finding exactly where to click on screen elements, but they need to be told what specific action to take.
```python
# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates; it cannot plan:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings") # (27, 450)
# This will raise an error:
# agent.run("Open Firefox and go to github.com")
```
### Composed Agent
The c/ua agent framework also supports composed agents, which combine a planning model with a clicking model for the best of both worlds. Any liteLLM model can be used as the plan generation model.
```python
# It can be paired with any LLM to form a composed agent:
# "gemini/gemini-1.5-pro" will be used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
```

View File

@@ -0,0 +1,9 @@
{
"pages": [
"introduction",
"screenspot-v2",
"screenspot-pro",
"interactive",
"osworld-verified"
]
}

View File

@@ -0,0 +1,89 @@
---
title: OSWorld-Verified
description: Benchmark ComputerAgent on OSWorld tasks using HUD
---
OSWorld-Verified is a curated subset of OSWorld tasks that can be run using the HUD framework. Use ComputerAgent with HUD to benchmark on these tasks.
## Setup
```bash
pip install hud-python==0.2.10
```
Set environment variables:
```bash
export HUD_API_KEY="your_hud_key"
export ANTHROPIC_API_KEY="your_anthropic_key" # For Claude
export OPENAI_API_KEY="your_openai_key" # For OpenAI
```
## Quick Start
```python
import asyncio
from hud import gym, load_taskset
from agent.integrations.hud import ComputerAgent
async def run_osworld():
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
test = taskset[144] # Example task
# Create environment (~2.5 min startup)
env = await gym.make(test)
# Create agent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022", # any ComputerAgent model string
environment="linux"
)
# Run benchmark
obs, _ = await env.reset()
for i in range(100):
action, done = await agent.predict(obs)
obs, reward, terminated, info = await env.step(action)
if done or terminated:
break
# Evaluate results
result = await env.evaluate()
await env.close()
return result
# Run benchmark
result = asyncio.run(run_osworld())
print(f"Success: {result.get('success', False)}")
```
## Parallel Execution
Run all tasks in parallel using `run_job`:
```python
from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370
# Run benchmark job
job = await run_job(
model="openai/computer-use-preview",
task_or_taskset=taskset,
job_name="test-computeragent-job",
max_concurrent_tasks=5,
# add any extra ComputerAgent kwargs:
verbosity=logging.INFO, # Enable logging
# trajectory_dir=".." # Save trajectories locally
)
# Get results OR view them at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")
```

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-Pro
description: High-resolution GUI grounding benchmark
---
ScreenSpot-Pro is a benchmark for evaluating click prediction accuracy on high-resolution GUI screenshots with complex layouts.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-pro.py
# Run with custom sample limit
python ss-pro.py --samples 50
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -0,0 +1,25 @@
---
title: ScreenSpot-v2
description: Standard resolution GUI grounding benchmark
---
ScreenSpot-v2 is a benchmark for evaluating click prediction accuracy on standard resolution GUI screenshots.
## Usage
```bash
# Run the benchmark
cd libs/python/agent/benchmarks
python ss-v2.py
# Run with custom sample limit
python ss-v2.py --samples 100
```
## Results
| Model | Accuracy | Failure Rate | Samples |
|-------|----------|--------------|---------|
| Coming Soon | - | - | - |
Results will be populated after running benchmarks with various models.

View File

@@ -0,0 +1,130 @@
---
title: Custom Computers
slug: custom-computer-handlers
---
The Agent SDK supports defining custom computer handlers using a simple dictionary interface. This enables integration with custom automation backends, testing frameworks, or specialized computer control systems.
## Example: Defining a Custom Computer Handler
```python
import asyncio
from PIL import Image
# Define your custom computer functions
async def take_screenshot():
"""Your custom screenshot implementation"""
# Return PIL Image, bytes, or base64 string
return Image.new('RGB', (1920, 1080), color='white')
# Create dict-based computer handler - only 'screenshot' is required
custom_computer = {
'screenshot': take_screenshot, # required
# everything below is optional
'environment': 'linux', # linux, mac, windows, browser
'dimensions': (1920, 1080), # (width, height)
'click': lambda x, y, button: print(f"Clicking at ({x}, {y}) with {button} button"),
}
```
You can then use this as a tool for your agent:
```python
from agent import ComputerAgent
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20240620",
tools=[custom_computer],
)
# Agent will automatically convert dict to agent.computers.CustomComputerHandler
await agent.run("Take a screenshot and click at coordinates 100, 200")
```
## Class-Based Implementation
For more complex implementations, you can create a custom class by inheriting from `AsyncComputerHandler`:
```python
from agent.computers import AsyncComputerHandler
from PIL import Image
from typing import Literal, List, Dict, Union, Optional
class MyCustomComputer(AsyncComputerHandler):
"""Custom computer handler implementation."""
def __init__(self):
# Initialize your custom computer interface here
pass
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...
# ==== Anthropic Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
...
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
...
# Use with agent
custom_computer = MyCustomComputer()
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20240620",
tools=[custom_computer],
)
await agent.run("Take a screenshot and click at coordinates 100, 200")
```

View File

@@ -0,0 +1,49 @@
---
title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
The HUD integration allows you to use ComputerAgent with the [HUD benchmarking framework](https://www.hud.so/), providing the same interface as existing HUD agents while leveraging ComputerAgent's capabilities.
## Installation
```bash
pip install "cua-agent[hud]"
## or install hud-python directly
# pip install hud-python==0.2.10
```
## Usage
```python
from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging
# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370
# Run benchmark job
job = await run_job(
model="openai/computer-use-preview",
# model="anthropic/claude-3-5-sonnet-20241022",
# model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
task_or_taskset=taskset,
job_name="test-computeragent-job",
max_concurrent_tasks=5,
# add any extra ComputerAgent kwargs:
verbosity=logging.INFO, # Enable logging
# trajectory_dir=".." # Save trajectories locally
)
# Get results OR view them at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")
```
**Available Benchmarks:**
1. [OSWorld-Verified](/agent-sdk/benchmarks/osworld-verified) - Benchmark on OSWorld tasks
See the [HUD docs](https://docs.hud.so/environment-creation) for more eval environments.

View File

@@ -0,0 +1,4 @@
{
"title": "Integrations",
"pages": ["hud"]
}

View File

@@ -3,13 +3,16 @@
"description": "Build computer-using agents with the Agent SDK",
"pages": [
"agent-loops",
"supported-agents",
"supported-agents",
"chat-history",
"callbacks",
"sandboxed-tools",
"custom-computer-handlers",
"local-models",
"prompt-caching",
"usage-tracking",
"migration-guide"
"benchmarks",
"migration-guide",
"integrations"
]
}

View File

@@ -1,34 +0,0 @@
---
title: Supported Agents
---
This page lists all supported agent loops and their compatible models/configurations in cua.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](./local-models) for how to use Hugging Face and MLX models on your own machine.
## Anthropic CUAs
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
## OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
## UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Omniparser + LLMs
- `omniparser+vertex_ai/gemini-pro`
- `omniparser+openai/gpt-4o`
- Any LiteLLM-compatible model combined with Omniparser
---
For details on agent loop behavior and usage, see [Agent Loops](./agent-loops).

View File

@@ -0,0 +1,106 @@
---
title: Composed Agents
description: Combine grounding models with any LLM for computer-use capabilities
---
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
## How Composed Agents Work
1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates
## Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component:
- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
## Supported Thinking Models
Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-5`, `openai/o3`, `openai/gpt-4o`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
## Usage Examples
### GTA1 + GPT-5
Use OpenAI's GPT-5 for planning with specialized grounding:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
tools=[computer]
)
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### GTA1 + Claude 3.5 Sonnet
Combine state-of-the-art grounding with powerful reasoning:
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
tools=[computer]
)
async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```
### UI-TARS + GPT-4o
Combine two different vision models for enhanced capabilities:
```python
agent = ComputerAgent(
"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
tools=[computer]
)
async for _ in agent.run("Help me fill out this form with my personal information"):
pass
```
## Benefits of Composed Agents
- **Specialized Grounding**: Use models optimized for click prediction accuracy
- **Flexible Planning**: Choose any LLM for task reasoning and planning
- **Cost Optimization**: Use smaller grounding models with larger planning models only when needed
- **Performance**: Leverage the strengths of different model architectures
## Capabilities
Composed agents support both capabilities:
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")
# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
pass
# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")
```
---
For more information on individual model capabilities, see [Computer-Use Agents](./computer-use-agents) and [Grounding Models](./grounding-models).

View File

@@ -0,0 +1,67 @@
---
title: Computer-Use Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---
These models support complete computer-use agent functionality through `ComputerAgent.run()`. They can understand natural language instructions and autonomously perform sequences of actions to complete tasks.
All agent loops are compatible with any LLM provider supported by LiteLLM.
See [Running Models Locally](../local-models) for how to use Hugging Face and MLX models on your own machine.
## Anthropic CUAs
Claude models with computer-use capabilities:
- Claude 4.1: `claude-opus-4-1-20250805`
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
```python
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
pass
```
## OpenAI CUA Preview
OpenAI's computer-use preview model:
- Computer-use-preview: `computer-use-preview`
```python
agent = ComputerAgent("openai/computer-use-preview", tools=[computer])
async for _ in agent.run("Take a screenshot and describe what you see"):
pass
```
## UI-TARS 1.5
Unified vision-language model for computer-use:
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
pass
```
## GLM-4.5V
Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
- `openrouter/z-ai/glm-4.5v`
- `huggingface-local/zai-org/GLM-4.5V`
```python
agent = ComputerAgent("openrouter/z-ai/glm-4.5v", tools=[computer])
async for _ in agent.run("Click on the search bar and type 'hello world'"):
pass
```
---
For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

View File

@@ -0,0 +1,89 @@
---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---
These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
## All Computer-Use Agents
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
### Anthropic CUAs
- Claude 4.1: `claude-opus-4-1-20250805`
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20240620`
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
### UI-TARS 1.5
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
## Specialized Grounding Models
These models are optimized specifically for click prediction and UI element grounding:
### OmniParser
OCR-focused set-of-marks model that requires an LLM for click prediction:
- `omniparser` (requires combination with any LiteLLM vision model)
### GTA1-7B
State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
- `huggingface-local/HelloKKMe/GTA1-7B`
## Usage Examples
```python
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")
print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```
```python
# OmniParser is just for OCR, so it requires an LLM for predict_click
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
# Predict click coordinates using composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: Cannot use omniparser alone for click prediction
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button") # Error!
```
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```
---
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).

View File

@@ -0,0 +1,66 @@
---
title: Human-In-The-Loop
description: Use humans as agents for evaluation, demonstrations, and interactive control
---
The Agent SDK ships with a human tool, providing native support for a human-in-the-loop that you can use to evaluate your environment and tools, or to create demonstrations. Use it by specifying `grounding_model+human/human`, or `human/human` directly.
## Getting Started
To start the human agent tool, simply run:
```bash
python -m agent.human_tool
```
The UI will show you pending completions. Select a completion to take control of the agent.
## Usage Examples
### Direct Human Agent
```python
from agent import ComputerAgent
from agent.computer import computer
agent = ComputerAgent(
"human/human",
tools=[computer]
)
async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
pass
```
### Composed with Grounding Model
```python
agent = ComputerAgent(
"huggingface-local/HelloKKMe/GTA1-7B+human/human",
tools=[computer]
)
async for _ in agent.run("Navigate to the settings page and enable dark mode"):
pass
```
## Features
The human-in-the-loop interface provides:
- **Interactive UI**: Web-based interface for reviewing and responding to agent requests
- **Image Display**: Screenshots with click handlers for direct interaction
- **Action Accordions**: Support for various computer actions (click, type, keypress, etc.)
- **Tool Calls**: Full OpenAI-compatible tool call support
- **Real-time Updates**: Smart polling for responsive UI updates
## Use Cases
- **Evaluation**: Have humans evaluate agent performance and provide ground truth responses
- **Demonstrations**: Create training data by having humans demonstrate tasks
- **Interactive Control**: Take manual control when automated agents need human guidance
- **Testing**: Validate agent, tool, and environment behavior manually
---
For more details on the human tool implementation, see the [Human Tool Documentation](../../tools/human-tool).

View File

@@ -0,0 +1,10 @@
{
"title": "Supported Agents",
"description": "Models and configurations supported by the Agent SDK",
"pages": [
"computer-use-agents",
"grounding-models",
"composed-agents",
"human-in-the-loop"
]
}

View File

@@ -169,18 +169,20 @@ python -m agent.cli openai/computer-use-preview
<Tab value="uv">
```bash
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-3-5-sonnet-20241022
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-opus-4-20250514
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-opus-4-1-20250805
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-sonnet-4-20250514
uv run --with "cua-agent[cli]" -m agent.cli anthropic/claude-3-5-sonnet-20241022
```
</Tab>
<Tab value="conda/pip">
```bash
python -m agent.cli anthropic/claude-3-5-sonnet-20241022
python -m agent.cli anthropic/claude-opus-4-1-20250805
python -m agent.cli anthropic/claude-opus-4-20250514
python -m agent.cli anthropic/claude-sonnet-4-20250514
python -m agent.cli anthropic/claude-3-5-sonnet-20241022
```
</Tab>

View File

@@ -13,7 +13,7 @@ from utils import load_dotenv_files
load_dotenv_files()
# Import the create_gradio_ui function
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
if __name__ == "__main__":
print("Launching Computer-Use Agent Gradio UI with advanced features...")

View File

@@ -37,6 +37,7 @@ pip install "cua-agent[omni]" # Omniparser + any LLM support
pip install "cua-agent[uitars]" # UI-TARS
pip install "cua-agent[uitars-mlx]" # UI-TARS + MLX support
pip install "cua-agent[uitars-hf]" # UI-TARS + Huggingface support
pip install "cua-agent[glm45v-hf]" # GLM-4.5V + Huggingface support
pip install "cua-agent[ui]" # Gradio UI support
```

View File

@@ -5,7 +5,7 @@ agent - Decorator-based Computer Use Agent with liteLLM integration
import logging
import sys
from .decorators import agent_loop
from .decorators import register_agent
from .agent import ComputerAgent
from .types import Messages, AgentResponse
@@ -13,7 +13,7 @@ from .types import Messages, AgentResponse
from . import loops
__all__ = [
"agent_loop",
"register_agent",
"ComputerAgent",
"Messages",
"AgentResponse"

View File

@@ -3,7 +3,9 @@ Adapters package for agent - Custom LLM adapters for LiteLLM
"""
from .huggingfacelocal_adapter import HuggingFaceLocalAdapter
from .human_adapter import HumanAdapter
__all__ = [
"HuggingFaceLocalAdapter",
"HumanAdapter",
]

View File

@@ -1,5 +1,7 @@
import asyncio
import functools
import warnings
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, AsyncIterator, Dict, List, Any, Optional
from litellm.types.utils import GenericStreamingChunk, ModelResponse
from litellm.llms.custom_llm import CustomLLM
@@ -8,7 +10,7 @@ from litellm import completion, acompletion
# Try to import HuggingFace dependencies
try:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from transformers import AutoModelForImageTextToText, AutoProcessor
HF_AVAILABLE = True
except ImportError:
HF_AVAILABLE = False
@@ -28,6 +30,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
self.device = device
self.models = {} # Cache for loaded models
self.processors = {} # Cache for loaded processors
self._executor = ThreadPoolExecutor(max_workers=1) # Single thread pool
def _load_model_and_processor(self, model_name: str):
"""Load model and processor if not already cached.
@@ -40,7 +43,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
"""
if model_name not in self.models:
# Load model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model = AutoModelForImageTextToText.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map=self.device,
@@ -48,7 +51,12 @@ class HuggingFaceLocalAdapter(CustomLLM):
)
# Load processor
processor = AutoProcessor.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(
model_name,
min_pixels=3136,
max_pixels=4096 * 2160,
device_map=self.device
)
# Cache them
self.models[model_name] = model
@@ -141,8 +149,7 @@ class HuggingFaceLocalAdapter(CustomLLM):
)
# Move inputs to the same device as model
if torch.cuda.is_available() and self.device != "cpu":
inputs = inputs.to("cuda")
inputs = inputs.to(model.device)
# Generate response
with torch.no_grad():
@@ -182,7 +189,11 @@ class HuggingFaceLocalAdapter(CustomLLM):
ModelResponse with generated text
"""
# Run _generate in thread pool to avoid blocking
generated_text = await asyncio.to_thread(self._generate, **kwargs)
loop = asyncio.get_event_loop()
generated_text = await loop.run_in_executor(
self._executor,
functools.partial(self._generate, **kwargs)
)
return await acompletion(
model=f"huggingface-local/{kwargs['model']}",
@@ -215,7 +226,11 @@ class HuggingFaceLocalAdapter(CustomLLM):
AsyncIterator of GenericStreamingChunk
"""
# Run _generate in thread pool to avoid blocking
generated_text = await asyncio.to_thread(self._generate, **kwargs)
loop = asyncio.get_event_loop()
generated_text = await loop.run_in_executor(
self._executor,
functools.partial(self._generate, **kwargs)
)
generic_streaming_chunk: GenericStreamingChunk = {
"finish_reason": "stop",

View File

@@ -0,0 +1,348 @@
import os
import asyncio
import requests
from typing import List, Dict, Any, Iterator, AsyncIterator
from litellm.types.utils import GenericStreamingChunk, ModelResponse
from litellm.llms.custom_llm import CustomLLM
from litellm import completion, acompletion
class HumanAdapter(CustomLLM):
"""Human Adapter for human-in-the-loop completions.
This adapter sends completion requests to a human completion server
where humans can review and respond to AI requests.
"""
def __init__(self, base_url: str | None = None, timeout: float = 300.0, **kwargs):
"""Initialize the human adapter.
Args:
base_url: Base URL for the human completion server.
Defaults to HUMAN_BASE_URL environment variable or http://localhost:8002
timeout: Timeout in seconds for waiting for human response
**kwargs: Additional arguments
"""
super().__init__()
self.base_url = base_url or os.getenv('HUMAN_BASE_URL', 'http://localhost:8002')
self.timeout = timeout
# Ensure base_url doesn't end with slash
self.base_url = self.base_url.rstrip('/')
def _queue_completion(self, messages: List[Dict[str, Any]], model: str) -> str:
"""Queue a completion request and return the call ID.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Call ID for tracking the request
Raises:
Exception: If queueing fails
"""
try:
response = requests.post(
f"{self.base_url}/queue",
json={"messages": messages, "model": model},
timeout=10
)
response.raise_for_status()
return response.json()["id"]
except requests.RequestException as e:
raise Exception(f"Failed to queue completion request: {e}")
def _wait_for_completion(self, call_id: str) -> Dict[str, Any]:
"""Wait for human to complete the call.
Args:
call_id: ID of the queued completion call
Returns:
Dict containing response and/or tool_calls
Raises:
TimeoutError: If timeout is exceeded
Exception: If completion fails
"""
import time
start_time = time.time()
while True:
try:
# Check status
status_response = requests.get(f"{self.base_url}/status/{call_id}")
status_response.raise_for_status()
status_data = status_response.json()
if status_data["status"] == "completed":
result = {}
if "response" in status_data and status_data["response"]:
result["response"] = status_data["response"]
if "tool_calls" in status_data and status_data["tool_calls"]:
result["tool_calls"] = status_data["tool_calls"]
return result
elif status_data["status"] == "failed":
error_msg = status_data.get("error", "Unknown error")
raise Exception(f"Completion failed: {error_msg}")
# Check timeout
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response after {self.timeout} seconds")
# Wait before checking again
time.sleep(1.0)
except requests.RequestException as e:
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response: {e}")
# Continue trying if we haven't timed out
time.sleep(1.0)
async def _async_wait_for_completion(self, call_id: str) -> Dict[str, Any]:
"""Async version of wait_for_completion.
Args:
call_id: ID of the queued completion call
Returns:
Dict containing response and/or tool_calls
Raises:
TimeoutError: If timeout is exceeded
Exception: If completion fails
"""
import aiohttp
import time
start_time = time.time()
async with aiohttp.ClientSession() as session:
while True:
try:
# Check status
async with session.get(f"{self.base_url}/status/{call_id}") as response:
response.raise_for_status()
status_data = await response.json()
if status_data["status"] == "completed":
result = {}
if "response" in status_data and status_data["response"]:
result["response"] = status_data["response"]
if "tool_calls" in status_data and status_data["tool_calls"]:
result["tool_calls"] = status_data["tool_calls"]
return result
elif status_data["status"] == "failed":
error_msg = status_data.get("error", "Unknown error")
raise Exception(f"Completion failed: {error_msg}")
# Check timeout
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response after {self.timeout} seconds")
# Wait before checking again
await asyncio.sleep(1.0)
except Exception as e:
if time.time() - start_time > self.timeout:
raise TimeoutError(f"Timeout waiting for human response: {e}")
# Continue trying if we haven't timed out
await asyncio.sleep(1.0)
def _generate_response(self, messages: List[Dict[str, Any]], model: str) -> Dict[str, Any]:
"""Generate a human response for the given messages.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Dict containing response and/or tool_calls
"""
# Queue the completion request
call_id = self._queue_completion(messages, model)
# Wait for human response
response = self._wait_for_completion(call_id)
return response
async def _async_generate_response(self, messages: List[Dict[str, Any]], model: str) -> Dict[str, Any]:
"""Async version of _generate_response.
Args:
messages: Messages in OpenAI format
model: Model name
Returns:
Dict containing response and/or tool_calls
"""
# Queue the completion request (sync operation)
call_id = self._queue_completion(messages, model)
# Wait for human response (async)
response = await self._async_wait_for_completion(call_id)
return response
def completion(self, *args, **kwargs) -> ModelResponse:
"""Synchronous completion method.
Returns:
ModelResponse with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = self._generate_response(messages, model)
# Create ModelResponse with proper structure
from litellm.types.utils import ModelResponse, Choices, Message
import uuid
import time
# Create message content based on response type
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Tool calls response
message = Message(
role="assistant",
content=human_response_data.get("response", ""),
tool_calls=human_response_data["tool_calls"]
)
else:
# Text response
message = Message(
role="assistant",
content=human_response_data.get("response", "")
)
choice = Choices(
finish_reason="stop",
index=0,
message=message
)
result = ModelResponse(
id=f"human-{uuid.uuid4()}",
choices=[choice],
created=int(time.time()),
model=f"human/{model}",
object="chat.completion"
)
return result
async def acompletion(self, *args, **kwargs) -> ModelResponse:
"""Asynchronous completion method.
Returns:
ModelResponse with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = await self._async_generate_response(messages, model)
# Create ModelResponse with proper structure
from litellm.types.utils import ModelResponse, Choices, Message
import uuid
import time
# Create message content based on response type
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Tool calls response
message = Message(
role="assistant",
content=human_response_data.get("response", ""),
tool_calls=human_response_data["tool_calls"]
)
else:
# Text response
message = Message(
role="assistant",
content=human_response_data.get("response", "")
)
choice = Choices(
finish_reason="stop",
index=0,
message=message
)
result = ModelResponse(
id=f"human-{uuid.uuid4()}",
choices=[choice],
created=int(time.time()),
model=f"human/{model}",
object="chat.completion"
)
return result
def streaming(self, *args, **kwargs) -> Iterator[GenericStreamingChunk]:
"""Synchronous streaming method.
Yields:
Streaming chunks with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response_data = self._generate_response(messages, model)
import time
# Handle tool calls vs text response
if "tool_calls" in human_response_data and human_response_data["tool_calls"]:
# Stream tool calls as a single chunk
generic_chunk: GenericStreamingChunk = {
"finish_reason": "tool_calls",
"index": 0,
"is_finished": True,
"text": human_response_data.get("response", ""),
"tool_use": human_response_data["tool_calls"],
"usage": {"completion_tokens": 1, "prompt_tokens": 0, "total_tokens": 1},
}
yield generic_chunk
else:
# Stream text response
response_text = human_response_data.get("response", "")
generic_chunk: GenericStreamingChunk = {
"finish_reason": "stop",
"index": 0,
"is_finished": True,
"text": response_text,
"tool_use": None,
"usage": {"completion_tokens": len(response_text.split()), "prompt_tokens": 0, "total_tokens": len(response_text.split())},
}
yield generic_chunk
async def astreaming(self, *args, **kwargs) -> AsyncIterator[GenericStreamingChunk]:
"""Asynchronous streaming method.
Yields:
Streaming chunks with human-generated text or tool calls
"""
messages = kwargs.get('messages', [])
model = kwargs.get('model', 'human')
# Generate human response
human_response = await self._async_generate_response(messages, model)
# _async_generate_response returns a dict with "response" and optional "tool_calls"
response_text = human_response.get("response", "")
tool_calls = human_response.get("tool_calls")
# Return as a single streaming chunk; tool calls take precedence over plain text
generic_streaming_chunk: GenericStreamingChunk = {
"finish_reason": "tool_calls" if tool_calls else "stop",
"index": 0,
"is_finished": True,
"text": response_text,
"tool_use": tool_calls if tool_calls else None,
"usage": {"completion_tokens": len(response_text.split()), "prompt_tokens": 0, "total_tokens": len(response_text.split())},
}
yield generic_streaming_chunk
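For reference, a minimal sketch of exercising this adapter directly through litellm's custom provider mechanism, mirroring the registration that `ComputerAgent` performs later in this diff. Import paths are assumptions, and the call blocks until a human answers the request in the `human_tool` UI (also added in this diff).

```python
# Sketch only: assumes `python -m agent.human_tool` is running so a human can
# answer, and that HumanAdapter is importable from agent.adapters.
import litellm
from agent.adapters import HumanAdapter  # assumed import path

litellm.custom_provider_map = [
    {"provider": "human", "custom_handler": HumanAdapter()}
]

# Blocks until the request is completed via the human_tool UI / REST API.
resp = litellm.completion(
    model="human/human",
    messages=[{"role": "user", "content": "Click the blue 'Submit' button"}],
)
print(resp.choices[0].message)
```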

View File

@@ -3,18 +3,20 @@ ComputerAgent - Main agent class that selects and runs agent loops
"""
import asyncio
from typing import Dict, List, Any, Optional, AsyncGenerator, Union, cast, Callable, Set
from typing import Dict, List, Any, Optional, AsyncGenerator, Union, cast, Callable, Set, Tuple
from litellm.responses.utils import Usage
from .types import Messages, Computer
from .decorators import find_agent_loop
from .computer_handler import OpenAIComputerHandler, acknowledge_safety_check_callback, check_blocklisted_url
from .types import Messages, AgentCapability
from .decorators import find_agent_config
import json
import litellm
import litellm.utils
import inspect
from .adapters import HuggingFaceLocalAdapter
from .adapters import (
HuggingFaceLocalAdapter,
HumanAdapter,
)
from .callbacks import (
ImageRetentionCallback,
LoggingCallback,
@@ -22,9 +24,14 @@ from .callbacks import (
BudgetManagerCallback,
TelemetryCallback,
)
from .computers import (
AsyncComputerHandler,
is_agent_computer,
make_computer_handler
)
def get_json(obj: Any, max_depth: int = 10) -> Any:
def custom_serializer(o: Any, depth: int = 0, seen: Set[int] = None) -> Any:
def custom_serializer(o: Any, depth: int = 0, seen: Optional[Set[int]] = None) -> Any:
if seen is None:
seen = set()
@@ -117,6 +124,13 @@ def sanitize_message(msg: Any) -> Any:
return sanitized
return msg
def get_output_call_ids(messages: List[Dict[str, Any]]) -> List[str]:
call_ids = []
for message in messages:
if message.get("type") == "computer_call_output" or message.get("type") == "function_call_output":
call_ids.append(message.get("call_id"))
return call_ids
class ComputerAgent:
"""
Main agent class that automatically selects the appropriate agent loop
@@ -204,22 +218,26 @@ class ComputerAgent:
hf_adapter = HuggingFaceLocalAdapter(
device="auto"
)
human_adapter = HumanAdapter()
litellm.custom_provider_map = [
{"provider": "huggingface-local", "custom_handler": hf_adapter}
{"provider": "huggingface-local", "custom_handler": hf_adapter},
{"provider": "human", "custom_handler": human_adapter}
]
litellm.suppress_debug_info = True
# == Initialize computer agent ==
# Find the appropriate agent loop
if custom_loop:
self.agent_loop = custom_loop
self.agent_loop_info = None
self.agent_config_info = None
else:
loop_info = find_agent_loop(model)
if not loop_info:
raise ValueError(f"No agent loop found for model: {model}")
self.agent_loop = loop_info.func
self.agent_loop_info = loop_info
config_info = find_agent_config(model)
if not config_info:
raise ValueError(f"No agent config found for model: {model}")
# Instantiate the agent config class
self.agent_loop = config_info.agent_class()
self.agent_config_info = config_info
self.tool_schemas = []
self.computer_handler = None
@@ -227,10 +245,6 @@ class ComputerAgent:
async def _initialize_computers(self):
"""Initialize computer objects"""
if not self.tool_schemas:
for tool in self.tools:
if hasattr(tool, '_initialized') and not tool._initialized:
await tool.run()
# Process tools and create tool schemas
self.tool_schemas = self._process_tools()
@@ -238,7 +252,7 @@ class ComputerAgent:
computer_handler = None
for schema in self.tool_schemas:
if schema["type"] == "computer":
computer_handler = OpenAIComputerHandler(schema["computer"].interface)
computer_handler = await make_computer_handler(schema["computer"])
break
self.computer_handler = computer_handler
@@ -254,7 +268,7 @@ class ComputerAgent:
for tool in self.tools:
# Check if it's a computer object (has interface attribute)
if hasattr(tool, 'interface'):
if is_agent_computer(tool):
# This is a computer tool - will be handled by agent loop
schemas.append({
"type": "computer",
@@ -389,8 +403,10 @@ class ComputerAgent:
# AGENT OUTPUT PROCESSING
# ============================================================================
async def _handle_item(self, item: Any, computer: Optional[Computer] = None) -> List[Dict[str, Any]]:
async def _handle_item(self, item: Any, computer: Optional[AsyncComputerHandler] = None, ignore_call_ids: Optional[List[str]] = None) -> List[Dict[str, Any]]:
"""Handle each item; may cause a computer action + screenshot."""
if ignore_call_ids and item.get("call_id") and item.get("call_id") in ignore_call_ids:
return []
item_type = item.get("type", None)
@@ -411,6 +427,9 @@ class ComputerAgent:
# Perform computer actions
action = item.get("action")
action_type = action.get("type")
if action_type is None:
print(f"Action type cannot be `None`: action={action}, action_type={action_type}")
return []
# Extract action arguments (all fields except 'type')
action_args = {k: v for k, v in action.items() if k != "type"}
@@ -436,10 +455,12 @@ class ComputerAgent:
acknowledged_checks = []
for check in pending_checks:
check_message = check.get("message", str(check))
if acknowledge_safety_check_callback(check_message):
acknowledged_checks.append(check)
else:
raise ValueError(f"Safety check failed: {check_message}")
acknowledged_checks.append(check)
# TODO: implement a callback for safety checks
# if acknowledge_safety_check_callback(check_message, allow_always=True):
# acknowledged_checks.append(check)
# else:
# raise ValueError(f"Safety check failed: {check_message}")
# Create call output
call_output = {
@@ -452,11 +473,12 @@ class ComputerAgent:
},
}
# Additional URL safety checks for browser environments
if await computer.get_environment() == "browser":
current_url = await computer.get_current_url()
call_output["output"]["current_url"] = current_url
check_blocklisted_url(current_url)
# # Additional URL safety checks for browser environments
# if await computer.get_environment() == "browser":
# current_url = await computer.get_current_url()
# call_output["output"]["current_url"] = current_url
# # TODO: implement a callback for URL safety checks
# # check_blocklisted_url(current_url)
result = [call_output]
await self._on_computer_call_end(item, result)
@@ -511,6 +533,12 @@ class ComputerAgent:
Returns:
AsyncGenerator that yields response chunks
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
capabilities = self.get_capabilities()
if "step" not in capabilities:
raise ValueError(f"Agent loop {self.agent_config_info.agent_class.__name__} does not support step predictions")
await self._initialize_computers()
@@ -525,7 +553,7 @@ class ComputerAgent:
"messages": messages,
"stream": stream,
"model": self.model,
"agent_loop": self.agent_loop.__name__,
"agent_loop": self.agent_config_info.agent_class.__name__,
**merged_kwargs
}
await self._on_run_start(run_kwargs, old_items)
@@ -555,7 +583,7 @@ class ComputerAgent:
}
# Run agent loop iteration
result = await self.agent_loop(
result = await self.agent_loop.predict_step(
**loop_kwargs,
_on_api_start=self._on_api_start,
_on_api_end=self._on_api_end,
@@ -576,9 +604,12 @@ class ComputerAgent:
# Add agent response to new_items
new_items += result.get("output")
# Get output call ids
output_call_ids = get_output_call_ids(result.get("output", []))
# Handle computer actions
for item in result.get("output"):
partial_items = await self._handle_item(item, self.computer_handler)
partial_items = await self._handle_item(item, self.computer_handler, ignore_call_ids=output_call_ids)
new_items += partial_items
# Yield partial response
@@ -591,4 +622,51 @@ class ComputerAgent:
)
}
await self._on_run_end(loop_kwargs, old_items, new_items)
await self._on_run_end(loop_kwargs, old_items, new_items)
async def predict_click(
self,
instruction: str,
image_b64: Optional[str] = None
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Args:
instruction: Instruction for where to click
image_b64: Base64 encoded image (optional, will take screenshot if not provided)
Returns:
None or tuple with (x, y) coordinates
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
capabilities = self.get_capabilities()
if "click" not in capabilities:
raise ValueError(f"Agent loop {self.agent_config_info.agent_class.__name__} does not support click predictions")
if hasattr(self.agent_loop, 'predict_click'):
if not image_b64:
if not self.computer_handler:
raise ValueError("Computer tool or image_b64 is required for predict_click")
image_b64 = await self.computer_handler.screenshot()
return await self.agent_loop.predict_click(
model=self.model,
image_b64=image_b64,
instruction=instruction
)
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by the current agent config.
Returns:
List of capability strings (e.g., ["step", "click"])
"""
if not self.agent_config_info:
raise ValueError("Agent configuration not found")
if hasattr(self.agent_loop, 'get_capabilities'):
return self.agent_loop.get_capabilities()
return ["step"] # Default capability

View File

@@ -9,10 +9,7 @@ import io
import logging
try:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
from presidio_image_redactor import ImageRedactorEngine
# TODO: Add Presidio dependencies
from PIL import Image
PRESIDIO_AVAILABLE = True
except ImportError:
@@ -32,11 +29,7 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
def __init__(
self,
anonymize_text: bool = True,
anonymize_images: bool = True,
entities_to_anonymize: Optional[List[str]] = None,
anonymization_operator: str = "replace",
image_redaction_color: Tuple[int, int, int] = (255, 192, 203) # Pink
# TODO: Any extra kwargs if needed
):
"""
Initialize the PII anonymization callback.
@@ -51,23 +44,10 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
if not PRESIDIO_AVAILABLE:
raise ImportError(
"Presidio is not available. Install with: "
"pip install presidio-analyzer presidio-anonymizer presidio-image-redactor"
"pip install cua-agent[pii-anonymization]"
)
self.anonymize_text = anonymize_text
self.anonymize_images = anonymize_images
self.entities_to_anonymize = entities_to_anonymize
self.anonymization_operator = anonymization_operator
self.image_redaction_color = image_redaction_color
# Initialize Presidio engines
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
self.deanonymizer = DeanonymizeEngine()
self.image_redactor = ImageRedactorEngine()
# Store anonymization mappings for deanonymization
self.anonymization_mappings: Dict[str, Any] = {}
# TODO: Implement __init__
async def on_llm_start(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""
@@ -79,9 +59,6 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
Returns:
List of messages with PII anonymized
"""
if not self.anonymize_text and not self.anonymize_images:
return messages
anonymized_messages = []
for msg in messages:
anonymized_msg = await self._anonymize_message(msg)
@@ -99,9 +76,6 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
Returns:
List of output with PII deanonymized for tool calls
"""
if not self.anonymize_text:
return output
deanonymized_output = []
for item in output:
# Only deanonymize tool calls and computer_call messages
@@ -114,146 +88,9 @@ class PIIAnonymizationCallback(AsyncCallbackHandler):
return deanonymized_output
async def _anonymize_message(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Anonymize PII in a single message."""
msg_copy = message.copy()
# Anonymize text content
if self.anonymize_text:
msg_copy = await self._anonymize_text_content(msg_copy)
# Redact images in computer_call_output
if self.anonymize_images and msg_copy.get("type") == "computer_call_output":
msg_copy = await self._redact_image_content(msg_copy)
return msg_copy
async def _anonymize_text_content(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Anonymize text content in a message."""
msg_copy = message.copy()
# Handle content array
content = msg_copy.get("content", [])
if isinstance(content, str):
anonymized_text, _ = await self._anonymize_text(content)
msg_copy["content"] = anonymized_text
elif isinstance(content, list):
anonymized_content = []
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
text = item.get("text", "")
anonymized_text, _ = await self._anonymize_text(text)
item_copy = item.copy()
item_copy["text"] = anonymized_text
anonymized_content.append(item_copy)
else:
anonymized_content.append(item)
msg_copy["content"] = anonymized_content
return msg_copy
async def _redact_image_content(self, message: Dict[str, Any]) -> Dict[str, Any]:
"""Redact PII from images in computer_call_output messages."""
msg_copy = message.copy()
output = msg_copy.get("output", {})
if isinstance(output, dict) and "image_url" in output:
try:
# Extract base64 image data
image_url = output["image_url"]
if image_url.startswith("data:image/"):
# Parse data URL
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
# Load image with PIL
image = Image.open(io.BytesIO(image_data))
# Redact PII from image
redacted_image = self.image_redactor.redact(image, self.image_redaction_color)
# Convert back to base64
buffer = io.BytesIO()
redacted_image.save(buffer, format="PNG")
redacted_data = base64.b64encode(buffer.getvalue()).decode()
# Update image URL
output_copy = output.copy()
output_copy["image_url"] = f"data:image/png;base64,{redacted_data}"
msg_copy["output"] = output_copy
except Exception as e:
logger.warning(f"Failed to redact image: {e}")
return msg_copy
# TODO: Implement _anonymize_message
return message
async def _deanonymize_item(self, item: Dict[str, Any]) -> Dict[str, Any]:
"""Deanonymize PII in tool calls and computer outputs."""
item_copy = item.copy()
# Handle computer_call arguments
if item.get("type") == "computer_call":
args = item_copy.get("args", {})
if isinstance(args, dict):
deanonymized_args = {}
for key, value in args.items():
if isinstance(value, str):
deanonymized_value, _ = await self._deanonymize_text(value)
deanonymized_args[key] = deanonymized_value
else:
deanonymized_args[key] = value
item_copy["args"] = deanonymized_args
return item_copy
async def _anonymize_text(self, text: str) -> Tuple[str, List[RecognizerResult]]:
"""Anonymize PII in text and return the anonymized text and results."""
if not text.strip():
return text, []
try:
# Analyze text for PII
analyzer_results = self.analyzer.analyze(
text=text,
entities=self.entities_to_anonymize,
language="en"
)
if not analyzer_results:
return text, []
# Anonymize the text
anonymized_result = self.anonymizer.anonymize(
text=text,
analyzer_results=analyzer_results,
operators={entity_type: OperatorConfig(self.anonymization_operator)
for entity_type in set(result.entity_type for result in analyzer_results)}
)
# Store mapping for deanonymization
mapping_key = str(hash(text))
self.anonymization_mappings[mapping_key] = {
"original": text,
"anonymized": anonymized_result.text,
"results": analyzer_results
}
return anonymized_result.text, analyzer_results
except Exception as e:
logger.warning(f"Failed to anonymize text: {e}")
return text, []
async def _deanonymize_text(self, text: str) -> Tuple[str, bool]:
"""Attempt to deanonymize text using stored mappings."""
try:
# Look for matching anonymized text in mappings
for mapping_key, mapping in self.anonymization_mappings.items():
if mapping["anonymized"] == text:
return mapping["original"], True
# If no mapping found, return original text
return text, False
except Exception as e:
logger.warning(f"Failed to deanonymize text: {e}")
return text, False
# TODO: Implement _deanonymize_item
return item

View File

@@ -51,12 +51,14 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
within the trajectory gets its own folder with screenshots and responses.
"""
def __init__(self, trajectory_dir: str):
def __init__(self, trajectory_dir: str, reset_on_run: bool = True):
"""
Initialize trajectory saver.
Args:
trajectory_dir: Base directory to save trajectories
reset_on_run: If True, reset trajectory_id/turn/artifact on each run.
If False, continue using existing trajectory_id if set.
"""
self.trajectory_dir = Path(trajectory_dir)
self.trajectory_id: Optional[str] = None
@@ -64,6 +66,7 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
self.current_artifact: int = 0
self.model: Optional[str] = None
self.total_usage: Dict[str, Any] = {}
self.reset_on_run = reset_on_run
# Ensure trajectory directory exists
self.trajectory_dir.mkdir(parents=True, exist_ok=True)
@@ -113,32 +116,38 @@ class TrajectorySaverCallback(AsyncCallbackHandler):
async def on_run_start(self, kwargs: Dict[str, Any], old_items: List[Dict[str, Any]]) -> None:
"""Initialize trajectory tracking for a new run."""
model = kwargs.get("model", "unknown")
model_name_short = model.split("+")[-1].split("/")[-1].lower()[:16]
if "+" in model:
model_name_short = model.split("+")[0].lower()[:4] + "_" + model_name_short
# Only reset trajectory state if reset_on_run is True or no trajectory exists
if self.reset_on_run or not self.trajectory_id:
model_name_short = model.split("+")[-1].split("/")[-1].lower()[:16]
if "+" in model:
model_name_short = model.split("+")[0].lower()[:4] + "_" + model_name_short
# id format: yyyy-mm-dd_model_hhmmss_uuid[:4]
now = datetime.now()
self.trajectory_id = f"{now.strftime('%Y-%m-%d')}_{model_name_short}_{now.strftime('%H%M%S')}_{str(uuid.uuid4())[:4]}"
self.current_turn = 0
self.current_artifact = 0
self.model = model
self.total_usage = {}
# Create trajectory directory
trajectory_path = self.trajectory_dir / self.trajectory_id
trajectory_path.mkdir(parents=True, exist_ok=True)
# Save trajectory metadata
metadata = {
"trajectory_id": self.trajectory_id,
"created_at": str(uuid.uuid1().time),
"status": "running",
"kwargs": kwargs,
}
with open(trajectory_path / "metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
# id format: yyyy-mm-dd_model_hhmmss_uuid[:4]
now = datetime.now()
self.trajectory_id = f"{now.strftime('%Y-%m-%d')}_{model_name_short}_{now.strftime('%H%M%S')}_{str(uuid.uuid4())[:4]}"
self.current_turn = 0
self.current_artifact = 0
self.model = model
self.total_usage = {}
# Create trajectory directory
trajectory_path = self.trajectory_dir / self.trajectory_id
trajectory_path.mkdir(parents=True, exist_ok=True)
# Save trajectory metadata
metadata = {
"trajectory_id": self.trajectory_id,
"created_at": str(uuid.uuid1().time),
"status": "running",
"kwargs": kwargs,
}
with open(trajectory_path / "metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
else:
# Continue with existing trajectory - just update model if needed
self.model = model
@override
async def on_run_end(self, kwargs: Dict[str, Any], old_items: List[Dict[str, Any]], new_items: List[Dict[str, Any]]) -> None:
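The new `reset_on_run` flag makes it possible to append several runs to one trajectory folder. A hedged sketch (the import paths and the `callbacks` keyword are assumptions, not part of this diff):

```python
# Sketch only: assumes ComputerAgent accepts a `callbacks` kwarg and that these
# import paths exist.
from agent import ComputerAgent                      # assumed import path
from agent.callbacks import TrajectorySaverCallback  # assumed import path

saver = TrajectorySaverCallback("trajectories", reset_on_run=False)
agent = ComputerAgent(model="<model>", callbacks=[saver])
# With reset_on_run=False, successive agent runs keep writing turns into the
# same trajectory_id instead of creating a new folder per run.
```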

View File

@@ -94,14 +94,14 @@ def print_action(action_type: str, details: Dict[str, Any], total_cost: float):
# Format action details
args_str = ""
if action_type == "click" and "x" in details and "y" in details:
args_str = f"({details['x']}, {details['y']})"
args_str = f"_{details['button']}({details['x']}, {details['y']})"
elif action_type == "type" and "text" in details:
text = details["text"]
if len(text) > 50:
text = text[:47] + "..."
args_str = f'"{text}"'
elif action_type == "key" and "key" in details:
args_str = f"'{details['key']}'"
args_str = f'("{text}")'
elif action_type == "key" and "text" in details:
args_str = f"('{details['text']}')"
elif action_type == "scroll" and "x" in details and "y" in details:
args_str = f"({details['x']}, {details['y']})"
@@ -120,7 +120,7 @@ async def ainput(prompt: str = ""):
async def chat_loop(agent, model: str, container_name: str, initial_prompt: str = "", show_usage: bool = True):
"""Main chat loop with the agent."""
print_welcome(model, agent.agent_loop.__name__, container_name)
print_welcome(model, agent.agent_config_info.agent_class.__name__, container_name)
history = []
@@ -130,7 +130,7 @@ async def chat_loop(agent, model: str, container_name: str, initial_prompt: str
total_cost = 0
while True:
if history[-1].get("role") != "user":
if len(history) == 0 or history[-1].get("role") != "user":
# Get user input with prompt
print_colored("> ", end="")
user_input = await ainput()
@@ -260,7 +260,12 @@ Examples:
help="Show total cost of the agent runs"
)
parser.add_argument(
"-r", "--max-retries",
type=int,
default=3,
help="Maximum number of retries for the LLM API calls"
)
args = parser.parse_args()
@@ -327,6 +332,7 @@ Examples:
"model": args.model,
"tools": [computer],
"verbosity": 20 if args.verbose else 30, # DEBUG vs WARNING
"max_retries": args.max_retries
}
if args.images > 0:
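The new `-r/--max-retries` flag (default 3) is forwarded to the agent as `max_retries`, alongside the existing kwargs assembled above. A sketch of the equivalent programmatic configuration (import path assumed; `computer` stands for an already-configured cua Computer):

```python
# Sketch only: mirrors the kwargs the CLI builds above.
from agent import ComputerAgent  # assumed import path

agent_kwargs = {
    "model": "<model>",      # placeholder
    "tools": [computer],     # `computer`: an already-configured cua Computer
    "verbosity": 30,         # WARNING; use 20 for DEBUG (matches --verbose)
    "max_retries": 3,        # LLM API retry budget (new -r/--max-retries flag)
}
agent = ComputerAgent(**agent_kwargs)
```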

View File

@@ -0,0 +1,41 @@
"""
Computer handler factory and interface definitions.
This module provides a factory function to create computer handlers from different
computer interface types, supporting both the ComputerHandler protocol and the
Computer library interface.
"""
from .base import AsyncComputerHandler
from .cua import cuaComputerHandler
from .custom import CustomComputerHandler
from computer import Computer as cuaComputer
def is_agent_computer(computer):
"""Check if the given computer is a ComputerHandler or CUA Computer."""
return isinstance(computer, AsyncComputerHandler) or \
isinstance(computer, cuaComputer) or \
(isinstance(computer, dict)) #and "screenshot" in computer)
async def make_computer_handler(computer):
"""
Create a computer handler from a computer interface.
Args:
computer: Either a ComputerHandler instance, Computer instance, or dict of functions
Returns:
ComputerHandler: A computer handler instance
Raises:
ValueError: If the computer type is not supported
"""
if isinstance(computer, AsyncComputerHandler):
return computer
if isinstance(computer, cuaComputer):
computer_handler = cuaComputerHandler(computer)
await computer_handler._initialize()
return computer_handler
if isinstance(computer, dict):
return CustomComputerHandler(computer)
raise ValueError(f"Unsupported computer type: {type(computer)}")

View File

@@ -0,0 +1,70 @@
"""
Base computer interface protocol for agent interactions.
"""
from typing import Protocol, Literal, List, Dict, Any, Union, Optional, runtime_checkable
@runtime_checkable
class AsyncComputerHandler(Protocol):
"""Protocol defining the interface for computer interactions."""
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...
# ==== Anthropic Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
...
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
...
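Because the protocol is marked `@runtime_checkable`, `isinstance()` only verifies that the listed methods exist, which is what `is_agent_computer` relies on. A minimal illustrative implementer (import path assumed, class name hypothetical):

```python
# Sketch only: any object exposing these async methods satisfies the protocol.
from agent.computers.base import AsyncComputerHandler  # assumed import path

class DummyComputer:
    async def get_environment(self): return "linux"
    async def get_dimensions(self): return (1280, 800)
    async def screenshot(self): return ""  # base64 PNG in a real implementation
    async def click(self, x, y, button="left"): ...
    async def double_click(self, x, y): ...
    async def scroll(self, x, y, scroll_x, scroll_y): ...
    async def type(self, text): ...
    async def wait(self, ms=1000): ...
    async def move(self, x, y): ...
    async def keypress(self, keys): ...
    async def drag(self, path): ...
    async def get_current_url(self): return ""
    async def left_mouse_down(self, x=None, y=None): ...
    async def left_mouse_up(self, x=None, y=None): ...

print(isinstance(DummyComputer(), AsyncComputerHandler))  # True
```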

View File

@@ -3,34 +3,45 @@ Computer handler implementation for OpenAI computer-use-preview protocol.
"""
import base64
from typing import Dict, List, Any, Literal
from .types import Computer
from typing import Dict, List, Any, Literal, Union, Optional
from .base import AsyncComputerHandler
from computer import Computer
class OpenAIComputerHandler:
class cuaComputerHandler(AsyncComputerHandler):
"""Computer handler that implements the Computer protocol using the computer interface."""
def __init__(self, computer_interface):
def __init__(self, cua_computer: Computer):
"""Initialize with a computer interface (from tool schema)."""
self.interface = computer_interface
self.cua_computer = cua_computer
self.interface = None
async def _initialize(self):
if hasattr(self.cua_computer, '_initialized') and not self.cua_computer._initialized:
await self.cua_computer.run()
self.interface = self.cua_computer.interface
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
# For now, return a default - this could be enhanced to detect actual environment
return "windows"
# TODO: detect actual environment
return "linux"
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
assert self.interface is not None
screen_size = await self.interface.get_screen_size()
return screen_size["width"], screen_size["height"]
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
assert self.interface is not None
screenshot_bytes = await self.interface.screenshot()
return base64.b64encode(screenshot_bytes).decode('utf-8')
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
assert self.interface is not None
if button == "left":
await self.interface.left_click(x, y)
elif button == "right":
@@ -41,28 +52,36 @@ class OpenAIComputerHandler:
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
assert self.interface is not None
await self.interface.double_click(x, y)
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
assert self.interface is not None
await self.interface.move_cursor(x, y)
await self.interface.scroll(scroll_x, scroll_y)
async def type(self, text: str) -> None:
"""Type text."""
assert self.interface is not None
await self.interface.type_text(text)
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
assert self.interface is not None
import asyncio
await asyncio.sleep(ms / 1000.0)
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
assert self.interface is not None
await self.interface.move_cursor(x, y)
async def keypress(self, keys: List[str]) -> None:
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
assert self.interface is not None
if isinstance(keys, str):
keys = keys.replace("-", "+").split("+")
if len(keys) == 1:
await self.interface.press_key(keys[0])
else:
@@ -71,6 +90,7 @@ class OpenAIComputerHandler:
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
assert self.interface is not None
if not path:
return
@@ -92,16 +112,13 @@ class OpenAIComputerHandler:
# For now, return empty string
return ""
def acknowledge_safety_check_callback(message: str) -> bool:
"""Safety check callback for user acknowledgment."""
response = input(
f"Safety Check Warning: {message}\nDo you want to acknowledge and proceed? (y/n): "
).lower()
return response.strip() == "y"
def check_blocklisted_url(url: str) -> None:
"""Check if URL is blocklisted (placeholder implementation)."""
# This would contain actual URL checking logic
pass
# ==== Anthropic Computer Action Space ====
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
assert self.interface is not None
await self.interface.mouse_down(x, y, button="left")
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
assert self.interface is not None
await self.interface.mouse_up(x, y, button="left")
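Normally `make_computer_handler` constructs and initializes this handler, but a direct sketch of the wrapping step may help (assuming `computer` is an already-configured cua `Computer`; the import path is assumed):

```python
# Sketch only: `computer` is assumed to be a configured cua Computer instance.
from agent.computers.cua import cuaComputerHandler  # assumed import path

async def demo(computer):
    handler = cuaComputerHandler(computer)
    await handler._initialize()        # starts the VM if needed, binds .interface
    png_b64 = await handler.screenshot()
    await handler.keypress("ctrl-s")   # "-" and "+" separators are both accepted
    print(await handler.get_dimensions())
```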

View File

@@ -0,0 +1,209 @@
"""
Custom computer handler implementation that accepts a dictionary of functions.
"""
import base64
from typing import Dict, List, Any, Literal, Union, Optional, Callable
from PIL import Image
import io
from .base import AsyncComputerHandler
class CustomComputerHandler(AsyncComputerHandler):
"""Computer handler that implements the Computer protocol using a dictionary of custom functions."""
def __init__(self, functions: Dict[str, Callable]):
"""
Initialize with a dictionary of functions.
Args:
functions: Dictionary where keys are method names and values are callable functions.
Only 'screenshot' is required, all others are optional.
Raises:
ValueError: If required 'screenshot' function is not provided.
"""
if 'screenshot' not in functions:
raise ValueError("'screenshot' function is required in functions dictionary")
self.functions = functions
self._last_screenshot_size: Optional[tuple[int, int]] = None
async def _call_function(self, func, *args, **kwargs):
"""
Call a function, handling both async and sync functions.
Args:
func: The function to call
*args: Positional arguments to pass to the function
**kwargs: Keyword arguments to pass to the function
Returns:
The result of the function call
"""
import asyncio
import inspect
if callable(func):
if inspect.iscoroutinefunction(func):
return await func(*args, **kwargs)
else:
return func(*args, **kwargs)
else:
return func
async def _get_value(self, attribute: str):
"""
Get value for an attribute, checking both 'get_{attribute}' and '{attribute}' keys.
Args:
attribute: The attribute name to look for
Returns:
The value from the functions dict, called if callable, returned directly if not
"""
# Check for 'get_{attribute}' first
get_key = f"get_{attribute}"
if get_key in self.functions:
return await self._call_function(self.functions[get_key])
# Check for '{attribute}'
if attribute in self.functions:
return await self._call_function(self.functions[attribute])
return None
def _to_b64_str(self, img: Union[bytes, Image.Image, str]) -> str:
"""
Convert image to base64 string.
Args:
img: Image as bytes, PIL Image, or base64 string
Returns:
str: Base64 encoded image string
"""
if isinstance(img, str):
# Already a base64 string
return img
elif isinstance(img, bytes):
# Raw bytes
return base64.b64encode(img).decode('utf-8')
elif isinstance(img, Image.Image):
# PIL Image
buffer = io.BytesIO()
img.save(buffer, format='PNG')
return base64.b64encode(buffer.getvalue()).decode('utf-8')
else:
raise ValueError(f"Unsupported image type: {type(img)}")
# ==== Computer-Use-Preview Action Space ====
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
result = await self._get_value('environment')
if result is None:
return "linux"
assert result in ["windows", "mac", "linux", "browser"]
return result # type: ignore
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
result = await self._get_value('dimensions')
if result is not None:
return result # type: ignore
# Fallback: use last screenshot size if available
if not self._last_screenshot_size:
await self.screenshot()
assert self._last_screenshot_size is not None, "Failed to get screenshot size"
return self._last_screenshot_size
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
result = await self._call_function(self.functions['screenshot'])
b64_str = self._to_b64_str(result) # type: ignore
# Try to extract dimensions for fallback use
try:
if isinstance(result, Image.Image):
self._last_screenshot_size = result.size
elif isinstance(result, bytes):
# Try to decode bytes to get dimensions
img = Image.open(io.BytesIO(result))
self._last_screenshot_size = img.size
except Exception:
# If we can't get dimensions, that's okay
pass
return b64_str
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
if 'click' in self.functions:
await self._call_function(self.functions['click'], x, y, button)
# No-op if not implemented
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
if 'double_click' in self.functions:
await self._call_function(self.functions['double_click'], x, y)
# No-op if not implemented
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
if 'scroll' in self.functions:
await self._call_function(self.functions['scroll'], x, y, scroll_x, scroll_y)
# No-op if not implemented
async def type(self, text: str) -> None:
"""Type text."""
if 'type' in self.functions:
await self._call_function(self.functions['type'], text)
# No-op if not implemented
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
if 'wait' in self.functions:
await self._call_function(self.functions['wait'], ms)
else:
# Default implementation
import asyncio
await asyncio.sleep(ms / 1000.0)
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
if 'move' in self.functions:
await self._call_function(self.functions['move'], x, y)
# No-op if not implemented
async def keypress(self, keys: Union[List[str], str]) -> None:
"""Press key combination."""
if 'keypress' in self.functions:
await self._call_function(self.functions['keypress'], keys)
# No-op if not implemented
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
if 'drag' in self.functions:
await self._call_function(self.functions['drag'], path)
# No-op if not implemented
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
if 'get_current_url' in self.functions:
return await self._get_value('current_url') # type: ignore
return "" # Default fallback
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
if 'left_mouse_down' in self.functions:
await self._call_function(self.functions['left_mouse_down'], x, y)
# No-op if not implemented
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
if 'left_mouse_up' in self.functions:
await self._call_function(self.functions['left_mouse_up'], x, y)
# No-op if not implemented
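A self-contained sketch of the dict-based handler: sync and async callables can be mixed, and when no `dimensions` entry is supplied, `get_dimensions()` falls back to the size of the last screenshot. The import path is an assumption.

```python
# Sketch only: minimal CustomComputerHandler with one sync and one async function.
import asyncio
from PIL import Image
from agent.computers.custom import CustomComputerHandler  # assumed import path

clicks = []

async def click(x, y, button="left"):
    clicks.append((x, y, button))

handler = CustomComputerHandler({
    "screenshot": lambda: Image.new("RGB", (1024, 768), "white"),  # sync callable
    "click": click,                                                # async callable
})

async def main():
    print(await handler.get_dimensions())  # (1024, 768), derived from the screenshot
    await handler.click(10, 20)
    await handler.keypress("ctrl+c")       # no-op: no 'keypress' function provided
    print(clicks)                          # [(10, 20, 'left')]

asyncio.run(main())
```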

View File

@@ -2,89 +2,51 @@
Decorators for agent - agent_loop decorator
"""
import asyncio
import inspect
from typing import Dict, List, Any, Callable, Optional
from functools import wraps
from .types import AgentLoopInfo
from typing import List, Optional
from .types import AgentConfigInfo
# Global registry
_agent_loops: List[AgentLoopInfo] = []
_agent_configs: List[AgentConfigInfo] = []
def agent_loop(models: str, priority: int = 0):
def register_agent(models: str, priority: int = 0):
"""
Decorator to register an agent loop function.
Decorator to register an AsyncAgentConfig class.
Args:
models: Regex pattern to match supported models
priority: Priority for loop selection (higher = more priority)
priority: Priority for agent selection (higher = more priority)
"""
def decorator(func: Callable):
# Validate function signature
sig = inspect.signature(func)
required_params = {'messages', 'model'}
func_params = set(sig.parameters.keys())
def decorator(agent_class: type):
# Validate that the class implements AsyncAgentConfig protocol
if not hasattr(agent_class, 'predict_step'):
raise ValueError(f"Agent class {agent_class.__name__} must implement predict_step method")
if not hasattr(agent_class, 'predict_click'):
raise ValueError(f"Agent class {agent_class.__name__} must implement predict_click method")
if not hasattr(agent_class, 'get_capabilities'):
raise ValueError(f"Agent class {agent_class.__name__} must implement get_capabilities method")
if not required_params.issubset(func_params):
missing = required_params - func_params
raise ValueError(f"Agent loop function must have parameters: {missing}")
# Register the loop
loop_info = AgentLoopInfo(
func=func,
# Register the agent config
config_info = AgentConfigInfo(
agent_class=agent_class,
models_regex=models,
priority=priority
)
_agent_loops.append(loop_info)
_agent_configs.append(config_info)
# Sort by priority (highest first)
_agent_loops.sort(key=lambda x: x.priority, reverse=True)
_agent_configs.sort(key=lambda x: x.priority, reverse=True)
@wraps(func)
async def wrapper(*args, **kwargs):
# Wrap the function in an asyncio.Queue for cancellation support
queue = asyncio.Queue()
task = None
try:
# Create a task that can be cancelled
async def run_loop():
try:
result = await func(*args, **kwargs)
await queue.put(('result', result))
except Exception as e:
await queue.put(('error', e))
task = asyncio.create_task(run_loop())
# Wait for result or cancellation
event_type, data = await queue.get()
if event_type == 'error':
raise data
return data
except asyncio.CancelledError:
if task:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass
raise
return wrapper
return agent_class
return decorator
def get_agent_loops() -> List[AgentLoopInfo]:
"""Get all registered agent loops"""
return _agent_loops.copy()
def get_agent_configs() -> List[AgentConfigInfo]:
"""Get all registered agent configs"""
return _agent_configs.copy()
def find_agent_loop(model: str) -> Optional[AgentLoopInfo]:
"""Find the best matching agent loop for a model"""
for loop_info in _agent_loops:
if loop_info.matches_model(model):
return loop_info
def find_agent_config(model: str) -> Optional[AgentConfigInfo]:
"""Find the best matching agent config for a model"""
for config_info in _agent_configs:
if config_info.matches_model(model):
return config_info
return None

View File

@@ -0,0 +1,29 @@
"""
Human-in-the-Loop Completion Tool
This package provides a human-in-the-loop completion system that allows
AI agents to request human assistance for complex decisions or responses.
Components:
- server.py: FastAPI server with completion queue management
- ui.py: Gradio UI for human interaction
- __main__.py: Combined server and UI application
Usage:
# Run the server and UI
python -m agent.human_tool
# Or run components separately
python -m agent.human_tool.server # API server only
python -m agent.human_tool.ui # UI only
"""
from .server import CompletionQueue, completion_queue
from .ui import HumanCompletionUI, create_ui
__all__ = [
"CompletionQueue",
"completion_queue",
"HumanCompletionUI",
"create_ui"
]

View File

@@ -0,0 +1,38 @@
#!/usr/bin/env python3
"""
Human-in-the-Loop Completion Server and UI
This module combines the FastAPI server for handling completion requests
with a Gradio UI for human interaction.
"""
import gradio as gr
from fastapi import FastAPI
from .server import app as fastapi_app
from .ui import create_ui
# Create the Gradio demo
gradio_demo = create_ui()
# Mount Gradio on FastAPI
CUSTOM_PATH = "/gradio"
app = gr.mount_gradio_app(fastapi_app, gradio_demo, path=CUSTOM_PATH)
# Add a redirect from root to Gradio UI
@fastapi_app.get("/")
async def redirect_to_ui():
"""Redirect root to Gradio UI."""
return {
"message": "Human Completion Server is running",
"ui_url": "/gradio",
"api_docs": "/docs"
}
if __name__ == "__main__":
import uvicorn
print("🚀 Starting Human-in-the-Loop Completion Server...")
print("📊 API Server: http://localhost:8002")
print("🎨 Gradio UI: http://localhost:8002/gradio")
print("📚 API Docs: http://localhost:8002/docs")
uvicorn.run(app, host="0.0.0.0", port=8002)

View File

@@ -0,0 +1,234 @@
import asyncio
import uuid
from datetime import datetime
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict
from enum import Enum
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
class CompletionStatus(str, Enum):
PENDING = "pending"
COMPLETED = "completed"
FAILED = "failed"
@dataclass
class CompletionCall:
id: str
messages: List[Dict[str, Any]]
model: str
status: CompletionStatus
created_at: datetime
completed_at: Optional[datetime] = None
response: Optional[str] = None
tool_calls: Optional[List[Dict[str, Any]]] = None
error: Optional[str] = None
class ToolCall(BaseModel):
id: str
type: str = "function"
function: Dict[str, Any]
class CompletionRequest(BaseModel):
messages: List[Dict[str, Any]]
model: str
class CompletionResponse(BaseModel):
response: Optional[str] = None
tool_calls: Optional[List[Dict[str, Any]]] = None
class CompletionQueue:
def __init__(self):
self._queue: Dict[str, CompletionCall] = {}
self._pending_order: List[str] = []
self._lock = asyncio.Lock()
async def add_completion(self, messages: List[Dict[str, Any]], model: str) -> str:
"""Add a completion call to the queue."""
async with self._lock:
call_id = str(uuid.uuid4())
completion_call = CompletionCall(
id=call_id,
messages=messages,
model=model,
status=CompletionStatus.PENDING,
created_at=datetime.now()
)
self._queue[call_id] = completion_call
self._pending_order.append(call_id)
return call_id
async def get_pending_calls(self) -> List[Dict[str, Any]]:
"""Get all pending completion calls."""
async with self._lock:
pending_calls = []
for call_id in self._pending_order:
if call_id in self._queue and self._queue[call_id].status == CompletionStatus.PENDING:
call = self._queue[call_id]
pending_calls.append({
"id": call.id,
"model": call.model,
"created_at": call.created_at.isoformat(),
"messages": call.messages
})
return pending_calls
async def get_call_status(self, call_id: str) -> Optional[Dict[str, Any]]:
"""Get the status of a specific completion call."""
async with self._lock:
if call_id not in self._queue:
return None
call = self._queue[call_id]
result = {
"id": call.id,
"status": call.status.value,
"created_at": call.created_at.isoformat(),
"model": call.model,
"messages": call.messages
}
if call.completed_at:
result["completed_at"] = call.completed_at.isoformat()
if call.response:
result["response"] = call.response
if call.tool_calls:
result["tool_calls"] = call.tool_calls
if call.error:
result["error"] = call.error
return result
async def complete_call(self, call_id: str, response: Optional[str] = None, tool_calls: Optional[List[Dict[str, Any]]] = None) -> bool:
"""Mark a completion call as completed with a response or tool calls."""
async with self._lock:
if call_id not in self._queue:
return False
call = self._queue[call_id]
if call.status != CompletionStatus.PENDING:
return False
call.status = CompletionStatus.COMPLETED
call.completed_at = datetime.now()
call.response = response
call.tool_calls = tool_calls
# Remove from pending order
if call_id in self._pending_order:
self._pending_order.remove(call_id)
return True
async def fail_call(self, call_id: str, error: str) -> bool:
"""Mark a completion call as failed with an error."""
async with self._lock:
if call_id not in self._queue:
return False
call = self._queue[call_id]
if call.status != CompletionStatus.PENDING:
return False
call.status = CompletionStatus.FAILED
call.completed_at = datetime.now()
call.error = error
# Remove from pending order
if call_id in self._pending_order:
self._pending_order.remove(call_id)
return True
async def wait_for_completion(self, call_id: str, timeout: float = 300.0) -> Optional[str]:
"""Wait for a completion call to be completed and return the response."""
start_time = asyncio.get_event_loop().time()
while True:
status = await self.get_call_status(call_id)
if not status:
return None
if status["status"] == CompletionStatus.COMPLETED.value:
return status.get("response")
elif status["status"] == CompletionStatus.FAILED.value:
raise Exception(f"Completion failed: {status.get('error', 'Unknown error')}")
# Check timeout
if asyncio.get_event_loop().time() - start_time > timeout:
await self.fail_call(call_id, "Timeout waiting for human response")
raise TimeoutError("Timeout waiting for human response")
# Wait a bit before checking again
await asyncio.sleep(0.5)
# Global queue instance
completion_queue = CompletionQueue()
# FastAPI app
app = FastAPI(title="Human Completion Server", version="1.0.0")
@app.post("/queue", response_model=Dict[str, str])
async def queue_completion(request: CompletionRequest):
"""Add a completion request to the queue."""
call_id = await completion_queue.add_completion(request.messages, request.model)
return {"id": call_id, "status": "queued"}
@app.get("/pending")
async def list_pending():
"""List all pending completion calls."""
pending_calls = await completion_queue.get_pending_calls()
return {"pending_calls": pending_calls}
@app.get("/status/{call_id}")
async def get_status(call_id: str):
"""Get the status of a specific completion call."""
status = await completion_queue.get_call_status(call_id)
if not status:
raise HTTPException(status_code=404, detail="Completion call not found")
return status
@app.post("/complete/{call_id}")
async def complete_call(call_id: str, response: CompletionResponse):
"""Complete a call with a human response."""
success = await completion_queue.complete_call(
call_id,
response=response.response,
tool_calls=response.tool_calls
)
if success:
return {"status": "success", "message": "Call completed"}
else:
raise HTTPException(status_code=404, detail="Call not found or already completed")
@app.post("/fail/{call_id}")
async def fail_call(call_id: str, error: Dict[str, str]):
"""Mark a call as failed."""
success = await completion_queue.fail_call(call_id, error.get("error", "Unknown error"))
if not success:
raise HTTPException(status_code=404, detail="Completion call not found or already completed")
return {"status": "failed"}
@app.get("/")
async def root():
"""Root endpoint."""
return {"message": "Human Completion Server is running"}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8002)
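End to end, the queue is driven purely over HTTP. A sketch of the round trip against the endpoints defined above, assuming the server is running locally (`python -m agent.human_tool`):

```python
# Sketch only: queue a completion, answer it as the human, then read the status.
import requests

BASE = "http://localhost:8002"

# 1. An agent (or test script) queues a completion request.
call_id = requests.post(f"{BASE}/queue", json={
    "model": "human/human",
    "messages": [{"role": "user", "content": "Approve this action?"}],
}).json()["id"]

# 2. A human client lists pending calls and answers one.
pending = requests.get(f"{BASE}/pending").json()["pending_calls"]
print([c["id"] for c in pending])
requests.post(f"{BASE}/complete/{call_id}", json={"response": "Yes, go ahead."})

# 3. The original caller polls /status/{call_id} until it reports "completed".
print(requests.get(f"{BASE}/status/{call_id}").json()["status"])
```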

View File

@@ -0,0 +1,630 @@
import gradio as gr
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime
import requests
from .server import completion_queue
import base64
import io
from PIL import Image
class HumanCompletionUI:
def __init__(self, server_url: str = "http://localhost:8002"):
self.server_url = server_url
self.current_call_id: Optional[str] = None
self.refresh_interval = 2.0 # seconds
self.last_image = None # Store the last image for display
def format_messages_for_chatbot(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Format messages for display in gr.Chatbot with type='messages'."""
formatted = []
for msg in messages:
role = msg.get("role", "user")
content = msg.get("content", "")
tool_calls = msg.get("tool_calls", [])
# Handle different content formats
if isinstance(content, list):
# Multi-modal content - can include text and images
formatted_content = []
for item in content:
if item.get("type") == "text":
text = item.get("text", "")
if text.strip(): # Only add non-empty text
formatted_content.append(text)
elif item.get("type") == "image_url":
image_url = item.get("image_url", {}).get("url", "")
if image_url:
# Check if it's a base64 image or URL
if image_url.startswith("data:image"):
# For base64 images, decode and create gr.Image
try:
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
image = Image.open(io.BytesIO(image_data))
formatted_content.append(gr.Image(value=image))
except Exception as e:
print(f"Error loading image: {e}")
formatted_content.append(f"[Image loading error: {e}]")
else:
# For URL images, create gr.Image with URL
formatted_content.append(gr.Image(value=image_url))
# Determine final content format
if len(formatted_content) == 1:
content = formatted_content[0]
elif len(formatted_content) > 1:
content = formatted_content
else:
content = "[Empty content]"
# Ensure role is valid for Gradio Chatbot
if role not in ["user", "assistant"]:
role = "assistant" if role == "system" else "user"
# Invert roles for better display in human UI context
# (what the AI says becomes "user", what human should respond becomes "assistant")
if role == "user":
role = "assistant"
else:
role = "user"
# Add the main message if it has content
if content and str(content).strip():
formatted.append({"role": role, "content": content})
# Handle tool calls - create separate messages for each tool call
if tool_calls:
for tool_call in tool_calls:
function_name = tool_call.get("function", {}).get("name", "unknown")
arguments_str = tool_call.get("function", {}).get("arguments", "{}")
try:
# Parse arguments to format them nicely
arguments = json.loads(arguments_str)
formatted_args = json.dumps(arguments, indent=2)
except json.JSONDecodeError:
# If parsing fails, use the raw string
formatted_args = arguments_str
# Create a formatted message for the tool call
tool_call_content = f"```json\n{formatted_args}\n```"
formatted.append({
"role": role,
"content": tool_call_content,
"metadata": {"title": f"🛠️ Used {function_name}"}
})
return formatted
def get_pending_calls(self) -> List[Dict[str, Any]]:
"""Get pending calls from the server."""
try:
response = requests.get(f"{self.server_url}/pending", timeout=5)
if response.status_code == 200:
return response.json().get("pending_calls", [])
except Exception as e:
print(f"Error fetching pending calls: {e}")
return []
def complete_call_with_response(self, call_id: str, response: str) -> bool:
"""Complete a call with a text response."""
try:
response_data = {"response": response}
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def complete_call_with_tool_calls(self, call_id: str, tool_calls: List[Dict[str, Any]]) -> bool:
"""Complete a call with tool calls."""
try:
response_data = {"tool_calls": tool_calls}
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def complete_call(self, call_id: str, response: Optional[str] = None, tool_calls: Optional[List[Dict[str, Any]]] = None) -> bool:
"""Complete a call with either a response or tool calls."""
try:
response_data = {}
if response:
response_data["response"] = response
if tool_calls:
response_data["tool_calls"] = tool_calls
response_obj = requests.post(
f"{self.server_url}/complete/{call_id}",
json=response_data,
timeout=10
)
response_obj.raise_for_status()
return True
except requests.RequestException as e:
print(f"Error completing call: {e}")
return False
def get_last_image_from_messages(self, messages: List[Dict[str, Any]]) -> Optional[Any]:
"""Extract the last image from the messages for display above conversation."""
last_image = None
for msg in reversed(messages): # Start from the last message
content = msg.get("content", "")
if isinstance(content, list):
for item in reversed(content): # Get the last image in the message
if item.get("type") == "image_url":
image_url = item.get("image_url", {}).get("url", "")
if image_url:
if image_url.startswith("data:image"):
# For base64 images, create a gr.Image component
try:
header, data = image_url.split(",", 1)
image_data = base64.b64decode(data)
image = Image.open(io.BytesIO(image_data))
return image
except Exception as e:
print(f"Error loading image: {e}")
continue
else:
# For URL images, return the URL
return image_url
return last_image
def refresh_pending_calls(self):
"""Refresh the list of pending calls."""
pending_calls = self.get_pending_calls()
if not pending_calls:
return (
gr.update(choices=["latest"], value="latest"), # dropdown
gr.update(value=None), # image (no image)
gr.update(value=[]), # chatbot (empty messages)
gr.update(interactive=False) # submit button
)
# Sort pending calls by created_at to get oldest first
sorted_calls = sorted(pending_calls, key=lambda x: x.get("created_at", ""))
# Create choices for dropdown
choices = [("latest", "latest")] # Add "latest" option first
for call in sorted_calls:
call_id = call["id"]
model = call.get("model", "unknown")
created_at = call.get("created_at", "")
# Format timestamp
try:
dt = datetime.fromisoformat(created_at.replace('Z', '+00:00'))
time_str = dt.strftime("%H:%M:%S")
except Exception:
time_str = created_at
choice_label = f"{call_id[:8]}... ({model}) - {time_str}"
choices.append((choice_label, call_id))
# Default to "latest" which shows the oldest pending conversation
selected_call_id = "latest"
if selected_call_id == "latest" and sorted_calls:
# Use the oldest call (first in sorted list)
selected_call = sorted_calls[0]
conversation = self.format_messages_for_chatbot(selected_call.get("messages", []))
self.current_call_id = selected_call["id"]
# Get the last image from messages
self.last_image = self.get_last_image_from_messages(selected_call.get("messages", []))
else:
conversation = []
self.current_call_id = None
self.last_image = None
return (
gr.update(choices=choices, value="latest"),
gr.update(value=self.last_image),
gr.update(value=conversation),
gr.update(interactive=bool(choices))
)
def on_call_selected(self, selected_choice):
"""Handle when a call is selected from the dropdown."""
if not selected_choice:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
pending_calls = self.get_pending_calls()
if not pending_calls:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
# Handle "latest" option
if selected_choice == "latest":
# Sort calls by created_at to get oldest first
sorted_calls = sorted(pending_calls, key=lambda x: x.get("created_at", ""))
selected_call = sorted_calls[0] # Get the oldest call
call_id = selected_call["id"]
else:
# Extract call_id from the choice for specific calls
call_id = None
for call in pending_calls:
call_id_short = call["id"][:8]
if call_id_short in selected_choice:
call_id = call["id"]
break
if not call_id:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
# Find the selected call
selected_call = next((c for c in pending_calls if c["id"] == call_id), None)
if not selected_call:
return (
gr.update(value=None), # no image
gr.update(value=[]), # empty chatbot
gr.update(interactive=False)
)
conversation = self.format_messages_for_chatbot(selected_call.get("messages", []))
self.current_call_id = call_id
# Get the last image from messages
self.last_image = self.get_last_image_from_messages(selected_call.get("messages", []))
return (
gr.update(value=self.last_image),
gr.update(value=conversation),
gr.update(interactive=True)
)
def submit_response(self, response_text: str):
"""Submit a text response to the current call."""
if not self.current_call_id:
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ No call selected") # status
)
if not response_text.strip():
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ Response cannot be empty") # status
)
success = self.complete_call_with_response(self.current_call_id, response_text)
if success:
status_msg = "✅ Response submitted successfully!"
return (
gr.update(value=""), # clear response text
gr.update(value=status_msg) # status
)
else:
return (
gr.update(value=response_text), # keep response text
gr.update(value="❌ Failed to submit response") # status
)
def submit_action(self, action_type: str, **kwargs) -> str:
"""Submit a computer action as a tool call."""
if not self.current_call_id:
return "❌ No call selected"
import uuid
# Create tool call structure
action_data = {"type": action_type, **kwargs}
tool_call = {
"id": f"call_{uuid.uuid4().hex[:24]}",
"type": "function",
"function": {
"name": "computer",
"arguments": json.dumps(action_data)
}
}
success = self.complete_call_with_tool_calls(self.current_call_id, [tool_call])
if success:
return f"{action_type.capitalize()} action submitted as tool call"
else:
return f"❌ Failed to submit {action_type} action"
def submit_click_action(self, x: int, y: int, action_type: str = "click", button: str = "left") -> str:
"""Submit a coordinate-based action."""
if action_type == "click":
return self.submit_action(action_type, x=x, y=y, button=button)
else:
return self.submit_action(action_type, x=x, y=y)
def submit_type_action(self, text: str) -> str:
"""Submit a type action."""
return self.submit_action("type", text=text)
def submit_hotkey_action(self, keys: str) -> str:
"""Submit a hotkey action."""
return self.submit_action("keypress", keys=keys)
def submit_description_click(self, description: str, action_type: str = "click", button: str = "left") -> str:
"""Submit a description-based action."""
if action_type == "click":
return self.submit_action(action_type, element_description=description, button=button)
else:
return self.submit_action(action_type, element_description=description)
def wait_for_pending_calls(self, max_seconds: float = 10.0, check_interval: float = 0.2):
"""Wait for pending calls to appear or until max_seconds elapsed.
This method loops and checks for pending calls at regular intervals,
returning as soon as a pending call is found or the maximum wait time is reached.
Args:
max_seconds: Maximum number of seconds to wait
check_interval: How often to check for pending calls (in seconds)
"""
import time
start_time = time.time()
while time.time() - start_time < max_seconds:
# Check if there are any pending calls
pending_calls = self.get_pending_calls()
if pending_calls:
# Found pending calls, return immediately
return self.refresh_pending_calls()
# Wait before checking again
time.sleep(check_interval)
# Max wait time reached, return current state
return self.refresh_pending_calls()
def create_ui():
"""Create the Gradio interface."""
ui_handler = HumanCompletionUI()
with gr.Blocks(title="Human-in-the-Loop Agent Tool") as demo:
gr.Markdown("# 🤖 Human-in-the-Loop Agent Tool")
gr.Markdown("Review AI conversation requests and provide human responses.")
with gr.Row():
with gr.Column(scale=2):
with gr.Group():
screenshot_image = gr.Image(
label="Screenshot",
interactive=False,
height=600
)
# Action type selection for image clicks
with gr.Row():
action_type_radio = gr.Radio(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click",
scale=2
)
action_button_radio = gr.Radio(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left",
visible=True,
scale=1
)
conversation_chatbot = gr.Chatbot(
label="Messages",
type="messages",
height=500,
show_copy_button=True
)
with gr.Column(scale=1):
with gr.Group():
call_dropdown = gr.Dropdown(
label="Select a pending call",
choices=["latest"],
interactive=True,
value="latest"
)
refresh_btn = gr.Button("🔄 Refresh", variant="secondary")
with gr.Group():
response_text = gr.Textbox(
label="Response",
lines=3,
placeholder="Enter your response here..."
)
submit_btn = gr.Button("📤 Submit Response", variant="primary", interactive=False)
# Action Accordions
with gr.Accordion("🖱️ Click Actions", open=False):
with gr.Group():
with gr.Row():
click_x = gr.Number(label="X", value=0, minimum=0)
click_y = gr.Number(label="Y", value=0, minimum=0)
with gr.Row():
click_action_type = gr.Dropdown(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click"
)
click_button = gr.Dropdown(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left"
)
click_submit_btn = gr.Button("Submit Action")
with gr.Accordion("📝 Type Action", open=False):
with gr.Group():
type_text = gr.Textbox(
label="Text to Type",
placeholder="Enter text to type..."
)
type_submit_btn = gr.Button("Submit Type")
with gr.Accordion("⌨️ Keypress Action", open=False):
with gr.Group():
keypress_text = gr.Textbox(
label="Keys",
placeholder="e.g., ctrl+c, alt+tab"
)
keypress_submit_btn = gr.Button("Submit Keypress")
with gr.Accordion("🎯 Description Action", open=False):
with gr.Group():
description_text = gr.Textbox(
label="Element Description",
placeholder="e.g., 'Privacy and security option in left sidebar'"
)
with gr.Row():
description_action_type = gr.Dropdown(
label="Action Type",
choices=["click", "double_click", "move", "left_mouse_up", "left_mouse_down"],
value="click"
)
description_button = gr.Radio(
label="Button (for click only)",
choices=["left", "right", "wheel", "back", "forward"],
value="left"
)
description_submit_btn = gr.Button("Submit Description Action")
status_display = gr.Textbox(
label="Status",
interactive=False,
value="Ready to receive calls..."
)
# Event handlers
refresh_btn.click(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
call_dropdown.change(
fn=ui_handler.on_call_selected,
inputs=[call_dropdown],
outputs=[screenshot_image, conversation_chatbot, submit_btn]
)
def handle_image_click(action_type, button, evt: gr.SelectData):
    # Receive the radio selections as event inputs; reading `component.value` inside a
    # handler only returns the initial value, not the user's current choice.
    if evt.index is not None:
        x, y = evt.index
        # The chained .then() below refreshes the pending-call list, so no blocking
        # wait is needed here.
        return ui_handler.submit_click_action(x, y, action_type or "click", button or "left")
    return "No coordinates selected"
screenshot_image.select(
    fn=handle_image_click,
    inputs=[action_type_radio, action_button_radio],
    outputs=[status_display]
).then(
    fn=ui_handler.wait_for_pending_calls,
    outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Response submission
submit_btn.click(
fn=ui_handler.submit_response,
inputs=[response_text],
outputs=[response_text, status_display]
).then(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Toggle button radio visibility based on action type
def toggle_button_visibility(action_type):
return gr.update(visible=(action_type == "click"))
action_type_radio.change(
fn=toggle_button_visibility,
inputs=[action_type_radio],
outputs=[action_button_radio]
)
# Action accordion handlers
click_submit_btn.click(
fn=ui_handler.submit_click_action,
inputs=[click_x, click_y, click_action_type, click_button],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
type_submit_btn.click(
fn=ui_handler.submit_type_action,
inputs=[type_text],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
keypress_submit_btn.click(
fn=ui_handler.submit_hotkey_action,
inputs=[keypress_text],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
def handle_description_submit(description, action_type, button):
    if description:
        # The chained .then() below refreshes the pending-call list, so no blocking
        # wait is needed here.
        return ui_handler.submit_description_click(description, action_type, button)
    return "Please enter a description"
description_submit_btn.click(
fn=handle_description_submit,
inputs=[description_text, description_action_type, description_button],
outputs=[status_display]
).then(
fn=ui_handler.wait_for_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
# Load initial data
demo.load(
fn=ui_handler.refresh_pending_calls,
outputs=[call_dropdown, screenshot_image, conversation_chatbot, submit_btn]
)
return demo
if __name__ == "__main__":
demo = create_ui()
demo.queue()
demo.launch(server_name="0.0.0.0", server_port=7860)
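
For reference, here is a minimal sketch of driving this UI from an agent run via the `human/human` model string listed earlier; the omitted computer tool and the task text are illustrative assumptions, not part of this diff.

```python
# Minimal sketch (assumptions flagged inline), not part of this diff.
import asyncio
from agent import ComputerAgent  # ComputerAgent API as used elsewhere in this commit

async def main():
    # A computer handler (e.g. a cua Computer instance) would normally be passed via
    # `tools=[...]`; it is omitted here to keep the sketch self-contained.
    agent = ComputerAgent(model="human/human")
    messages = [{"role": "user", "content": "Open the browser and search for 'trycua'"}]
    async for chunk in agent.run(messages):
        messages += chunk["output"]  # append the Responses-style items produced each step

asyncio.run(main())
```

Each pending step then appears in the dropdown above, where a human can answer with text or submit a click/type/keypress tool call.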

View File

@@ -0,0 +1,77 @@
"""HUD integration for ComputerAgent."""
import logging
from typing import Any, Optional, Dict
from hud import run_job as hud_run_job
from .agent import ComputerAgent
from .adapter import ComputerAgentAdapter
from .computer_handler import HUDComputerHandler
async def run_job(
model: str,
task_or_taskset: Any,
job_name: str,
# Job kwargs
auto_reply_question: bool = False,
adapter_cls: Any = None,
adapter_kwargs: Optional[Dict[str, Any]] = None,
max_steps_per_task: int = 20,
run_parallel: bool = True,
job_metadata: Optional[Dict[str, Any]] = None,
show_progress: bool = True,
max_concurrent_env_creations: Optional[int] = 30, # Limits gym.make calls
max_concurrent_agent_predictions: Optional[int] = None, # No limit on LLM calls
max_concurrent_tasks: Optional[int] = 30, # Limits overall task concurrency
**agent_kwargs: Any
) -> Any:
"""
Run a job using ComputerAgent with the specified model.
Args:
model: Model string for ComputerAgent (e.g., "anthropic/claude-3-5-sonnet-20241022")
task_or_taskset: Task or TaskSet to run
job_name: Name for the job
auto_reply_question: Whether to auto-reply to questions
adapter_cls: Custom adapter class (defaults to ComputerAgentAdapter)
adapter_kwargs: Additional kwargs for the adapter
max_steps_per_task: Maximum steps per task
run_parallel: Whether to run tasks in parallel
job_metadata: Additional metadata for the job
show_progress: Whether to show progress
max_concurrent_env_creations: Max concurrent environment creations
max_concurrent_agent_predictions: Max concurrent agent predictions
max_concurrent_tasks: Max concurrent tasks
**agent_kwargs: Additional kwargs to pass to ComputerAgent
Returns:
Job instance from HUD
"""
# Combine the `verbose` and `verbosity` kwargs: `verbose=True` maps to INFO-level verbosity.
if agent_kwargs.pop("verbose", False):
    agent_kwargs["verbosity"] = logging.INFO
# Lower logging levels mean more verbose output, so report verbose when at or below INFO.
verbose = agent_kwargs.get("verbosity", logging.WARNING) <= logging.INFO
# run job
return await hud_run_job(
agent_cls=ComputerAgent,
agent_kwargs={"model": model, **agent_kwargs},
task_or_taskset=task_or_taskset,
job_name=job_name,
auto_reply_question=auto_reply_question,
adapter_cls=adapter_cls,
adapter_kwargs=adapter_kwargs,
max_steps_per_task=max_steps_per_task,
run_parallel=run_parallel,
job_metadata=job_metadata,
show_progress=show_progress,
verbose=verbose,
max_concurrent_env_creations=max_concurrent_env_creations,
max_concurrent_agent_predictions=max_concurrent_agent_predictions,
max_concurrent_tasks=max_concurrent_tasks
)
__all__ = ["ComputerAgent", "ComputerAgentAdapter", "HUDComputerHandler", "run_job"]

View File

@@ -0,0 +1,121 @@
"""HUD Adapter for ComputerAgent integration."""
from __future__ import annotations
from typing import Any, ClassVar
from hud.adapters.common import CLA, Adapter
from hud.adapters.common.types import (
CLAButton,
CLAKey,
ClickAction,
CustomAction,
DragAction,
MoveAction,
Point,
PressAction,
ResponseAction,
ScreenshotFetch,
ScrollAction,
TypeAction,
WaitAction,
)
class ComputerAgentAdapter(Adapter):
"""Adapter for ComputerAgent to work with HUD."""
KEY_MAP: ClassVar[dict[str, CLAKey]] = {
"return": "enter",
"arrowup": "up",
"arrowdown": "down",
"arrowleft": "left",
"arrowright": "right",
"cmd": "ctrl",
"super": "win",
"meta": "win",
}
BUTTON_MAP: ClassVar[dict[str, CLAButton]] = {
"wheel": "middle",
"middle": "middle",
}
def __init__(self) -> None:
super().__init__()
# ComputerAgent default dimensions (can be overridden)
self.agent_width = 1024
self.agent_height = 768
def _map_key(self, key: str) -> CLAKey:
"""Map a key to its standardized form."""
return self.KEY_MAP.get(key.lower(), key.lower()) # type: ignore
def convert(self, data: Any) -> CLA:
"""Convert a ComputerAgent action to a HUD action."""
try:
action_type = data.get("type")
if action_type == "click":
x, y = data.get("x", 0), data.get("y", 0)
button = data.get("button", "left")
button = self.BUTTON_MAP.get(button, button)
if button is None:
button = "left"
converted_action = ClickAction(point=Point(x=x, y=y), button=button)
elif action_type == "double_click":
x, y = data.get("x", 0), data.get("y", 0)
converted_action = ClickAction(point=Point(x=x, y=y), button="left", pattern=[100])
elif action_type == "scroll":
x, y = int(data.get("x", 0)), int(data.get("y", 0))
scroll_x = int(data.get("scroll_x", 0))
scroll_y = int(data.get("scroll_y", 0))
converted_action = ScrollAction(
point=Point(x=x, y=y), scroll=Point(x=scroll_x, y=scroll_y)
)
elif action_type == "type":
text = data.get("text", "")
converted_action = TypeAction(text=text, enter_after=False)
elif action_type == "wait":
ms = data.get("ms", 1000)
converted_action = WaitAction(time=ms)
elif action_type == "move":
x, y = data.get("x", 0), data.get("y", 0)
converted_action = MoveAction(point=Point(x=x, y=y))
elif action_type == "keypress":
keys = data.get("keys", [])
if isinstance(keys, str):
keys = [keys]
converted_action = PressAction(keys=[self._map_key(k) for k in keys])
elif action_type == "drag":
path = data.get("path", [])
points = [Point(x=p.get("x", 0), y=p.get("y", 0)) for p in path]
converted_action = DragAction(path=points)
elif action_type == "screenshot":
converted_action = ScreenshotFetch()
elif action_type == "response":
converted_action = ResponseAction(text=data.get("text", ""))
elif action_type == "custom":
converted_action = CustomAction(action=data.get("action", ""))
else:
raise ValueError(f"Unsupported action type: {action_type}")
# Add reasoning and logs if available
converted_action.reasoning = data.get("reasoning", "")
converted_action.logs = data.get("logs", "")
return converted_action
except Exception as e:
raise ValueError(f"Invalid action: {data}. Error: {e!s}") from e

View File

@@ -0,0 +1,373 @@
"""HUD ComputerAgent wrapper for OSWorld benchmarking."""
import logging
from typing import Any, Literal, Optional, Union, List, Dict
import asyncio
from agent import ComputerAgent as BaseComputerAgent
from agent.responses import make_failed_tool_call_items
from hud.adapters import Adapter
from hud.agent.base import Agent
from hud.utils.common import Observation
from hud.adapters.common.types import LogType
from hud.types import Gym
from .adapter import ComputerAgentAdapter
from .computer_handler import HUDComputerHandler
logger = logging.getLogger(__name__)
BASE_SYSTEM_PROMPT = """
You are an autonomous computer-using agent. Follow these guidelines:
1. Be decisive and complete tasks without asking for confirmation unless absolutely necessary.
2. Use the computer tools to complete the task and do not stop until the task is complete.
3. Do NOT ask questions like "Should I proceed?" or "Would you like me to continue?" - just proceed with the task.
4. When you find what you're looking for (e.g., a file to upload), proceed with the action directly.
5. Only stop when the task is fully complete or if you encounter an error that prevents completion.
6. Trust that the user wants you to complete the entire task they've requested.
7. You must say "Task completed" when the task is complete.
Remember: You have been given permission to complete the requested task autonomously.
""".strip()
class ComputerAgent(Agent[BaseComputerAgent, dict[str, Any]]):
"""
A ComputerAgent wrapper for HUD integration.
This agent wraps the base ComputerAgent to work with HUD environments,
providing the same interface as OperatorAgent but using ComputerAgent internally.
"""
transfer_gyms: dict[Gym, Gym] = {"qa": "hud-browser"}
def __init__(
self,
model: str = "anthropic/claude-3-5-sonnet-20241022",
environment: Literal["windows", "mac", "linux", "browser"] = "linux",
adapter: Optional[Adapter] = None,
name: Optional[str] = None,
**kwargs: Any,
):
"""
Initialize the ComputerAgent for HUD.
Args:
model: The model string for ComputerAgent (e.g., "anthropic/claude-3-5-sonnet-20241022")
environment: The environment type (windows, mac, linux, browser)
adapter: The adapter to use for preprocessing and postprocessing
name: The name of the agent
**kwargs: Additional arguments passed to ComputerAgent
"""
# Create adapter if not provided
adapter = adapter or ComputerAgentAdapter()
if name is None:
name = f"computeragent-{model.split('/')[-1]}"
# Initialize the base Agent class without client (we'll create it later)
super().__init__(client=None, adapter=adapter, name=name)
self.model = model
self.environment = environment
self.kwargs = kwargs
# Default dimensions
self.width = 1024
self.height = 768
# Update dimensions if adapter is provided
if self.adapter:
self.width = self.adapter.agent_width
self.height = self.adapter.agent_height
# Create HUD computer handler
self.hud_computer = HUDComputerHandler(
environment=environment,
dimensions=(self.width, self.height)
)
# Handle trajectory_dir by adding TrajectorySaverCallback
trajectory_dir = kwargs.pop("trajectory_dir", None)
callbacks = kwargs.get("callbacks", [])
if trajectory_dir:
from agent.callbacks.trajectory_saver import TrajectorySaverCallback
trajectory_callback = TrajectorySaverCallback(trajectory_dir, reset_on_run=False)
callbacks = callbacks + [trajectory_callback]
kwargs["callbacks"] = callbacks
# Initialize ComputerAgent with HUD computer handler
self.computer_agent = BaseComputerAgent(
model=model,
tools=[self.hud_computer],
**kwargs
)
# Set the client to the computer_agent for compatibility
self.client = self.computer_agent
# State tracking
self.conversation_history: List[Dict[str, Any]] = []
self.initial_prompt: Optional[str] = None
# System prompt for computer use tasks
self.base_system_prompt = BASE_SYSTEM_PROMPT
async def fetch_response(self, observation: Observation) -> tuple[list[dict[str, Any]], bool]:
"""
Fetch a response from ComputerAgent based on the observation.
Args:
observation: The preprocessed observation, attributes:
screenshot: Base64 encoded PNG string of the screen
text: Text observation, if available
Returns:
tuple[list[dict[str, Any]], bool]: The list of raw actions and a boolean
indicating whether the agent believes the task is complete.
"""
try:
# Update the computer handler with the current screenshot
if observation.screenshot:
self.hud_computer.update_screenshot(observation.screenshot)
# Set up action callback to capture actions
captured_actions = []
action_done = False
async def action_callback(action: Dict[str, Any]) -> None:
"""Callback to capture actions from ComputerAgent."""
nonlocal captured_actions, action_done
captured_actions.append(action)
# Set the action callback
self.hud_computer.set_action_callback(action_callback)
# Prepare the message for ComputerAgent
if not self.conversation_history:
# First interaction - use the observation text as initial prompt
if observation.text:
self.initial_prompt = observation.text
message = f"{self.base_system_prompt}\n\nTask: {observation.text}"
else:
message = f"{self.base_system_prompt}\n\nPlease analyze the current screen and determine what action to take."
input_content = [
{"type": "input_text", "text": message}
]
# Add screenshot if present
if observation.screenshot:
input_content.append(
{
"type": "input_image",
"image_url": f"data:image/png;base64,{observation.screenshot}",
}
)
self.conversation_history.append({"role": "user", "content": input_content})
else:
# Subsequent interactions - check if last action was computer_call
# If so, add computer_call_output with screenshot instead of user message
last_computer_calls = []
for msg in reversed(self.conversation_history):
if msg.get("type") == "computer_call":
call_id = msg.get("call_id")
if call_id:
# Check if this call_id already has a computer_call_output
has_output = any(
m.get("type") == "computer_call_output" and m.get("call_id") == call_id
for m in self.conversation_history
)
if not has_output:
last_computer_calls.append(call_id)
if last_computer_calls:
    # Reuse the screenshot from the observation when available; otherwise take a new
    # one so screenshot_b64 is always defined before it is used below.
    if observation.screenshot:
        screenshot_b64 = observation.screenshot
    else:
        logger.info("No screenshot in observation, taking a new screenshot")
        screenshot_b64 = await self.hud_computer.screenshot()
# Add computer_call_output for each unresponded computer_call
for call_id in reversed(last_computer_calls): # Maintain order
self.conversation_history.append({
"type": "computer_call_output",
"call_id": call_id,
"output": {
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshot_b64}"
}
})
else:
# No computer_call found, add regular user message
message = "Continue with the task based on the current screen state."
input_content = [
{"type": "input_text", "text": message}
]
# Add screenshot if present
if observation.screenshot:
input_content.append(
{
"type": "input_image",
"image_url": f"data:image/png;base64,{observation.screenshot}",
}
)
self.conversation_history.append({"role": "user", "content": input_content})
# If the last message is a reasoning message, change it to output_text
if (self.conversation_history and
self.conversation_history[-1].get("type") == "reasoning" and
self.conversation_history[-1].get("summary")):
reasoning_msg = self.conversation_history[-1]
summary_texts = []
# Extract all summary_text entries
for summary_item in reasoning_msg["summary"]:
if summary_item.get("type") == "summary_text":
summary_texts.append(summary_item.get("text", ""))
# Convert to message format with output_text
if summary_texts:
converted_message = {
"type": "message",
"role": "assistant",
"content": [
{
"text": " ".join(summary_texts),
"type": "output_text"
}
]
}
# Replace the reasoning message with the converted message
self.conversation_history[-1] = converted_message
# Run ComputerAgent
try:
new_items = []
# ComputerAgent.run returns an async generator
try:
async for result in self.computer_agent.run(self.conversation_history, stream=False):
# if the result has computer_call_output, immediately exit
if result.get("output", []) and result.get("output", [])[-1].get("type") == "computer_call_output":
break
# otherwise add agent output to conversation history
new_items += result["output"]
except Exception as e:
# if the last message is reasoning, change it to output_text
if new_items and new_items[-1].get("type") == "reasoning":
new_items[-1] = {
"type": "message",
"role": "assistant",
"content": [
{
"text": new_items[-1].get("summary", [{}])[0].get("text", ""),
"type": "output_text"
}
]
}
# Check if there are any computer_call items in new_items
computer_calls = [item for item in new_items if item.get("type") == "computer_call"]
if computer_calls:
# Remove computer_call items from new_items
new_items = [item for item in new_items if item.get("type") != "computer_call"]
# Add failed tool call items for each computer call
for computer_call in computer_calls:
tool_input = computer_call.get("action", {})
call_id = computer_call.get("call_id")
new_items.extend(make_failed_tool_call_items(
tool_name="computer",
tool_kwargs=tool_input,
error_message=repr(e),
call_id=call_id
))
else:
# add error message to conversation history (fallback for non-computer-call errors)
new_items.append({
"type": "user",
"content": [
{
"type": "input_text",
"text": f"Error during previous attempted action: {repr(e)}"
}
]
})
# Check if we captured any actions
if captured_actions:
# Extract reasoning from the conversation history
reasoning = ""
# Look for the latest reasoning message
for msg in reversed(new_items):
if msg.get("type") == "reasoning" and msg.get("summary"):
reasoning = " ".join([s.get("text", "") for s in msg["summary"] if s.get("type") == "summary_text"])
break
elif msg.get("type") == "message" and msg.get("role") == "assistant":
content = msg.get("content", [])
if isinstance(content, list):
reasoning = " ".join([c.get("text", "") for c in content if c.get("type") == "output_text"])
break
# update conversation history
self.conversation_history += new_items
# Add reasoning and logs to each action
for action in captured_actions:
action["reasoning"] = reasoning
action["logs"] = {"conversation_length": len(self.conversation_history)}
return captured_actions, False
# Check if the last message is "Task completed"
response_text = ""
for msg in reversed(new_items):
if msg.get("type") == "message" and msg.get("role") == "assistant":
content = msg.get("content", [])
for c in content:
if c.get("type") == "output_text":
response_text = c.get("text", response_text)
break
break
done = "task completed" in response_text.lower()
# update conversation history
self.conversation_history += new_items
response_action = {
"type": "response",
"text": response_text,
"reasoning": response_text,
"logs": {"conversation_length": len(self.conversation_history)}
}
# Check if this indicates task completion or failure
if "task is infeasible" in response_text.lower():
response_action = {"type": "custom", "action": "FAIL"}
done = True
return [response_action], done
except Exception as e:
logger.error(f"Error running ComputerAgent: {e}")
# Return an error response
error_action = {
"type": "response",
"text": f"Error occurred: {str(e)}",
"reasoning": f"ComputerAgent encountered an error: {str(e)}",
"logs": {"error": str(e)}
}
return [error_action], True
except Exception as e:
logger.error(f"Error in fetch_response: {e}")
error_action = {
"type": "response",
"text": f"Error in agent processing: {str(e)}",
"reasoning": f"Agent processing error: {str(e)}",
"logs": {"error": str(e)}
}
return [error_action], True
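
For completeness, a rough sketch of exercising the wrapper outside of a full HUD job; it assumes `Observation` accepts `text`/`screenshot` keyword arguments (as the docstring above implies), that the import path matches this diff's layout, and that a provider API key is configured for the chosen model.

```python
# Rough sketch; not part of this diff and not a substitute for hud.run_job.
import asyncio, base64
from io import BytesIO
from PIL import Image
from hud.utils.common import Observation
from agent.integrations.hud.agent import ComputerAgent  # import path assumed

async def main():
    agent = ComputerAgent(model="anthropic/claude-3-5-sonnet-20241022", environment="linux")
    # Fake a blank 1024x768 screenshot, mirroring the fallback in HUDComputerHandler.
    buf = BytesIO()
    Image.new("RGB", (1024, 768), "white").save(buf, format="PNG")
    obs = Observation(text="Open Firefox", screenshot=base64.b64encode(buf.getvalue()).decode())
    actions, done = await agent.fetch_response(obs)
    print(actions, done)

asyncio.run(main())
```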

View File

@@ -0,0 +1,187 @@
"""HUD Computer Handler for ComputerAgent integration."""
import base64
from io import BytesIO
from typing import Literal, Optional, Any, Dict, Callable
from PIL import Image
from agent.computers import AsyncComputerHandler
class HUDComputerHandler(AsyncComputerHandler):
"""Computer handler that interfaces with HUD environment."""
def __init__(
self,
environment: Literal["windows", "mac", "linux", "browser"] = "linux",
dimensions: tuple[int, int] = (1024, 768),
screenshot_callback: Optional[Callable] = None,
action_callback: Optional[Callable] = None,
):
"""
Initialize HUD computer handler.
Args:
environment: The environment type for HUD
dimensions: Screen dimensions as (width, height)
screenshot_callback: Optional callback to get screenshots from HUD environment
action_callback: Optional callback to execute actions in HUD environment
"""
super().__init__()
self._environment = environment
self._dimensions = dimensions
self._screenshot_callback = screenshot_callback
self._action_callback = action_callback
# Store the last screenshot for reuse
self._last_screenshot: Optional[str] = None
def set_screenshot_callback(self, callback: Callable) -> None:
"""Set the screenshot callback."""
self._screenshot_callback = callback
def set_action_callback(self, callback: Callable) -> None:
"""Set the action callback."""
self._action_callback = callback
def update_screenshot(self, screenshot: str) -> None:
"""Update the stored screenshot (base64 string)."""
self._last_screenshot = screenshot
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
return self._environment # type: ignore
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
return self._dimensions
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
if self._screenshot_callback:
screenshot = await self._screenshot_callback()
if isinstance(screenshot, str):
self._last_screenshot = screenshot
return screenshot
elif isinstance(screenshot, Image.Image):
# Convert PIL Image to base64
buffer = BytesIO()
screenshot.save(buffer, format="PNG")
screenshot_b64 = base64.b64encode(buffer.getvalue()).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
elif isinstance(screenshot, bytes):
screenshot_b64 = base64.b64encode(screenshot).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
# Return last screenshot if available, otherwise create a blank one
if self._last_screenshot:
return self._last_screenshot
# Create a blank screenshot as fallback
blank_image = Image.new('RGB', self._dimensions, color='white')
buffer = BytesIO()
blank_image.save(buffer, format="PNG")
screenshot_b64 = base64.b64encode(buffer.getvalue()).decode()
self._last_screenshot = screenshot_b64
return screenshot_b64
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
if self._action_callback:
await self._action_callback({
"type": "click",
"x": x,
"y": y,
"button": button
})
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "double_click",
"x": x,
"y": y
})
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
if self._action_callback:
await self._action_callback({
"type": "scroll",
"x": x,
"y": y,
"scroll_x": scroll_x,
"scroll_y": scroll_y
})
async def type(self, text: str) -> None:
"""Type text."""
if self._action_callback:
await self._action_callback({
"type": "type",
"text": text
})
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
if self._action_callback:
await self._action_callback({
"type": "wait",
"ms": ms
})
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
if self._action_callback:
await self._action_callback({
"type": "move",
"x": x,
"y": y
})
async def keypress(self, keys: list[str] | str) -> None:
"""Press key combination."""
if isinstance(keys, str):
keys = [keys]
if self._action_callback:
await self._action_callback({
"type": "keypress",
"keys": keys
})
async def drag(self, path: list[dict[str, int]]) -> None:
"""Drag along a path of points."""
if self._action_callback:
await self._action_callback({
"type": "drag",
"path": path
})
async def left_mouse_down(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse down at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "left_mouse_down",
"x": x,
"y": y
})
async def left_mouse_up(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Left mouse up at coordinates."""
if self._action_callback:
await self._action_callback({
"type": "left_mouse_up",
"x": x,
"y": y
})
async def get_current_url(self) -> str:
"""Get the current URL."""
if self._action_callback:
return await self._action_callback({
"type": "get_current_url"
})
return ""

View File

@@ -7,5 +7,8 @@ from . import anthropic
from . import openai
from . import uitars
from . import omniparser
from . import gta1
from . import composed_grounded
from . import glm45v
__all__ = ["anthropic", "openai", "uitars", "omniparser"]
__all__ = ["anthropic", "openai", "uitars", "omniparser", "gta1", "composed_grounded", "glm45v"]

File diff suppressed because it is too large

View File

@@ -0,0 +1,76 @@
"""
Base protocol for async agent configurations
"""
from typing import Protocol, List, Dict, Any, Optional, Tuple, Union
from abc import abstractmethod
from ..types import AgentCapability
class AsyncAgentConfig(Protocol):
"""Protocol defining the interface for async agent configurations."""
@abstractmethod
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step based on input items.
Args:
messages: Input items following Responses format (message, function_call, computer_call)
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries for failed API calls
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
Returns:
    Dictionary with "output" (a list of output items) and "usage" (token and cost usage info)
"""
...
@abstractmethod
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
None or tuple with (x, y) coordinates
"""
...
@abstractmethod
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings (e.g., ["step", "click"])
"""
...
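
To make the protocol concrete, here is a toy config that satisfies it; the behavior is purely illustrative, and it is deliberately not registered with `@register_agent`.

```python
# Toy implementation of AsyncAgentConfig; illustrative only.
from typing import Any, Dict, List, Optional, Tuple

class EchoAgentConfig:
    """Never touches the computer: every step just reports the task as complete."""

    async def predict_step(
        self, messages, model, tools=None, max_retries=None, stream=False,
        computer_handler=None, _on_api_start=None, _on_api_end=None,
        _on_usage=None, _on_screenshot=None, **kwargs,
    ) -> Dict[str, Any]:
        output = [{
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Task completed"}],
        }]
        return {"output": output, "usage": {"prompt_tokens": 0, "completion_tokens": 0}}

    async def predict_click(self, model: str, image_b64: str, instruction: str) -> Optional[Tuple[int, int]]:
        return None  # this toy config cannot ground clicks

    def get_capabilities(self) -> List[str]:  # AgentCapability values are capability strings
        return ["step"]
```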

View File

@@ -0,0 +1,318 @@
"""
Composed-grounded agent loop implementation that combines grounding and thinking models.
Uses a two-stage approach: grounding model for element detection, thinking model for reasoning.
"""
import uuid
import asyncio
import json
import base64
from typing import Dict, List, Any, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
from ..responses import (
convert_computer_calls_xy2desc,
convert_responses_items_to_completion_messages,
convert_completion_messages_to_responses_items,
convert_computer_calls_desc2xy,
get_all_element_descriptions
)
from ..agent import find_agent_config
GROUNDED_COMPUTER_TOOL_SCHEMA = {
"type": "function",
"function": {
"name": "computer",
"description": "Control a computer by taking screenshots and interacting with UI elements. This tool uses element descriptions to locate and interact with UI elements on the screen (e.g., 'red submit button', 'search text field', 'hamburger menu icon', 'close button in top right corner').",
"parameters": {
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": [
"screenshot",
"click",
"double_click",
"drag",
"type",
"keypress",
"scroll",
"move",
"wait",
"get_current_url",
"get_dimensions",
"get_environment"
],
"description": "The action to perform"
},
"element_description": {
"type": "string",
"description": "Description of the element to interact with (required for click, double_click, move, scroll actions, and as start/end for drag)"
},
"start_element_description": {
"type": "string",
"description": "Description of the element to start dragging from (required for drag action)"
},
"end_element_description": {
"type": "string",
"description": "Description of the element to drag to (required for drag action)"
},
"text": {
"type": "string",
"description": "The text to type (required for type action)"
},
"keys": {
"type": "string",
"description": "Key combination to press (required for keypress action). Single key for individual key press, multiple keys for combinations (e.g., 'ctrl+c')"
},
"button": {
"type": "string",
"description": "The mouse button to use for click action (left, right, wheel, back, forward) Default: left",
},
"scroll_x": {
"type": "integer",
"description": "Horizontal scroll amount for scroll action (positive for right, negative for left)",
},
"scroll_y": {
"type": "integer",
"description": "Vertical scroll amount for scroll action (positive for down, negative for up)",
},
},
"required": [
"action"
]
}
}
}
def _prepare_tools_for_grounded(tool_schemas: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Prepare tools for grounded API format"""
grounded_tools = []
for schema in tool_schemas:
if schema["type"] == "computer":
grounded_tools.append(GROUNDED_COMPUTER_TOOL_SCHEMA)
else:
grounded_tools.append(schema)
return grounded_tools
def get_last_computer_call_image(messages: List[Dict[str, Any]]) -> Optional[str]:
"""Get the last computer call output image from messages."""
for message in reversed(messages):
if (isinstance(message, dict) and
message.get("type") == "computer_call_output" and
isinstance(message.get("output"), dict) and
message["output"].get("type") == "input_image"):
image_url = message["output"].get("image_url", "")
if image_url.startswith("data:image/png;base64,"):
return image_url.split(",", 1)[1]
return None
@register_agent(r".*\+.*", priority=1)
class ComposedGroundedConfig:
"""
Composed-grounded agent configuration that uses both grounding and thinking models.
The model parameter should be in format: "grounding_model+thinking_model"
e.g., "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro"
"""
def __init__(self):
self.desc2xy: Dict[str, Tuple[float, float]] = {}
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Composed-grounded predict step implementation.
Process:
0. Store last computer call image, if none then take a screenshot
1. Convert computer calls from xy to descriptions
2. Convert responses items to completion messages
3. Call thinking model with litellm.acompletion
4. Convert completion messages to responses items
5. Get all element descriptions and populate desc2xy mapping
6. Convert computer calls from descriptions back to xy coordinates
7. Return output and usage
"""
# Parse the composed model
if "+" not in model:
raise ValueError(f"Composed model must be in format 'grounding_model+thinking_model', got: {model}")
grounding_model, thinking_model = model.split("+", 1)
pre_output_items = []
# Step 0: Store last computer call image, if none then take a screenshot
last_image_b64 = get_last_computer_call_image(messages)
if last_image_b64 is None:
# Take a screenshot
screenshot_b64 = await computer_handler.screenshot() # type: ignore
if screenshot_b64:
call_id = uuid.uuid4().hex
pre_output_items += [
{
"type": "message",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "Taking a screenshot to see the current computer screen."
}
]
},
{
"action": {
"type": "screenshot"
},
"call_id": call_id,
"status": "completed",
"type": "computer_call"
},
{
"type": "computer_call_output",
"call_id": call_id,
"output": {
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshot_b64}"
}
},
]
last_image_b64 = screenshot_b64
# Call screenshot callback if provided
if _on_screenshot:
await _on_screenshot(screenshot_b64)
tool_schemas = _prepare_tools_for_grounded(tools) # type: ignore
# Step 1: Convert computer calls from xy to descriptions
input_messages = messages + pre_output_items
messages_with_descriptions = convert_computer_calls_xy2desc(input_messages, self.desc2xy)
# Step 2: Convert responses items to completion messages
completion_messages = convert_responses_items_to_completion_messages(
messages_with_descriptions,
allow_images_in_tool_results=False
)
# Step 3: Call thinking model with litellm.acompletion
api_kwargs = {
"model": thinking_model,
"messages": completion_messages,
"tools": tool_schemas,
"max_retries": max_retries,
"stream": stream,
**kwargs
}
if use_prompt_caching:
api_kwargs["use_prompt_caching"] = use_prompt_caching
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Make the completion call
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(), # type: ignore
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(usage)
# Step 4: Convert completion messages back to responses items format
response_dict = response.model_dump() # type: ignore
choice_messages = [choice["message"] for choice in response_dict["choices"]]
thinking_output_items = []
for choice_message in choice_messages:
thinking_output_items.extend(convert_completion_messages_to_responses_items([choice_message]))
# Step 5: Get all element descriptions and populate desc2xy mapping
element_descriptions = get_all_element_descriptions(thinking_output_items)
if element_descriptions and last_image_b64:
# Use grounding model to predict coordinates for each description
grounding_agent_conf = find_agent_config(grounding_model)
if grounding_agent_conf:
grounding_agent = grounding_agent_conf.agent_class()
for desc in element_descriptions:
coords = await grounding_agent.predict_click(
model=grounding_model,
image_b64=last_image_b64,
instruction=desc
)
if coords:
self.desc2xy[desc] = coords
# Step 6: Convert computer calls from descriptions back to xy coordinates
final_output_items = convert_computer_calls_desc2xy(thinking_output_items, self.desc2xy)
# Step 7: Return output and usage
return {
"output": pre_output_items + final_output_items,
"usage": usage
}
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates using the grounding model.
For composed models, uses only the grounding model part for click prediction.
"""
# Parse the composed model to get grounding model
if "+" not in model:
raise ValueError(f"Composed model must be in format 'grounding_model+thinking_model', got: {model}")
grounding_model, thinking_model = model.split("+", 1)
# Find and use the grounding agent
grounding_agent_conf = find_agent_config(grounding_model)
if grounding_agent_conf:
grounding_agent = grounding_agent_conf.agent_class()
return await grounding_agent.predict_click(
model=grounding_model,
image_b64=image_b64,
instruction=instruction,
**kwargs
)
return None
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["click", "step"]

View File

@@ -0,0 +1,902 @@
"""
GLM-4.5V agent loop implementation using liteLLM for GLM-4.5V model.
Supports vision-language models for computer control with bounding box parsing.
"""
import asyncio
import json
import base64
import re
from typing import Dict, List, Any, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
from litellm.types.utils import ModelResponse
from litellm.responses.litellm_completion_transformation.transformation import LiteLLMCompletionResponsesConfig
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
from ..responses import (
convert_responses_items_to_completion_messages,
convert_completion_messages_to_responses_items,
make_reasoning_item,
make_output_text_item,
make_click_item,
make_double_click_item,
make_drag_item,
make_keypress_item,
make_scroll_item,
make_type_item,
make_wait_item,
make_input_image_item
)
# GLM-4.5V specific constants
GLM_ACTION_SPACE = """
### {left,right,middle}_click
Call rule: `{left,right,middle}_click(start_box='[x,y]', element_info='')`
{
'name': ['left_click', 'right_click', 'middle_click'],
'description': 'Perform a left/right/middle mouse click at the specified coordinates on the screen.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to perform the click, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being clicked.'
}
},
'required': ['start_box']
}
}
### hover
Call rule: `hover(start_box='[x,y]', element_info='')`
{
'name': 'hover',
'description': 'Move the mouse pointer to the specified coordinates without performing any click action.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to move the mouse pointer, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being hovered over.'
}
},
'required': ['start_box']
}
}
### left_double_click
Call rule: `left_double_click(start_box='[x,y]', element_info='')`
{
'name': 'left_double_click',
'description': 'Perform a left mouse double-click at the specified coordinates on the screen.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] where to perform the double-click, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being double-clicked.'
}
},
'required': ['start_box']
}
}
### left_drag
Call rule: `left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')`
{
'name': 'left_drag',
'description': 'Drag the mouse from starting coordinates to ending coordinates while holding the left mouse button.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Starting coordinates [x1,y1] for the drag operation, normalized to 0-999 range.'
},
'end_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Ending coordinates [x2,y2] for the drag operation, normalized to 0-999 range.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being dragged.'
}
},
'required': ['start_box', 'end_box']
}
}
### key
Call rule: `key(keys='')`
{
'name': 'key',
'description': 'Simulate pressing a single key or combination of keys on the keyboard.',
'parameters': {
'type': 'object',
'properties': {
'keys': {
'type': 'string',
'description': 'The key or key combination to press. Use '+' to separate keys in combinations (e.g., 'ctrl+c', 'alt+tab').'
}
},
'required': ['keys']
}
}
### type
Call rule: `type(content='')`
{
'name': 'type',
'description': 'Type text content into the currently focused text input field. This action only performs typing and does not handle field activation or clearing.',
'parameters': {
'type': 'object',
'properties': {
'content': {
'type': 'string',
'description': 'The text content to be typed into the active text field.'
}
},
'required': ['content']
}
}
### scroll
Call rule: `scroll(start_box='[x,y]', direction='', step=5, element_info='')`
{
'name': 'scroll',
'description': 'Scroll an element at the specified coordinates in the specified direction by a given number of wheel steps.',
'parameters': {
'type': 'object',
'properties': {
'start_box': {
'type': 'array',
'items': {
'type': 'integer'
},
'description': 'Coordinates [x,y] of the element or area to scroll, normalized to 0-999 range.'
},
'direction': {
'type': 'string',
'enum': ['down', 'up'],
'description': 'The direction to scroll: 'down' or 'up'.'
},
'step': {
'type': 'integer',
'default': 5,
'description': 'Number of wheel steps to scroll, default is 5.'
},
'element_info': {
'type': 'string',
'description': 'Optional text description of the UI element being scrolled.'
}
},
'required': ['start_box', 'direction']
}
}
### WAIT
Call rule: `WAIT()`
{
'name': 'WAIT',
'description': 'Wait for 5 seconds before proceeding to the next action.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}
### DONE
Call rule: `DONE()`
{
'name': 'DONE',
'description': 'Indicate that the current task has been completed successfully and no further actions are needed.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}
### FAIL
Call rule: `FAIL()`
{
'name': 'FAIL',
'description': 'Indicate that the current task cannot be completed or is impossible to accomplish.',
'parameters': {
'type': 'object',
'properties': {},
'required': []
}
}"""
def encode_image_to_base64(image_path: str) -> str:
"""Encode image file to base64 string with data URI."""
with open(image_path, "rb") as image_file:
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
return f"data:image/png;base64,{encoded_string}"
def parse_glm_response(response: str) -> Dict[str, Any]:
"""
Parse GLM-4.5V response to extract action and memory.
The special tokens <|begin_of_box|> and <|end_of_box|> mark bounding boxes.
Coordinates are normalized values between 0 and 1000.
"""
# Extract action from between special tokens
pattern = r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>"
match = re.search(pattern, response)
if match:
action = match.group(1).strip()
else:
# Fallback: look for function call patterns
action_pattern = r"[\w_]+\([^)]*\)"
matches = re.findall(action_pattern, response)
action = matches[0] if matches else None
# Extract memory section
memory_pattern = r"Memory:(.*?)$"
memory_match = re.search(memory_pattern, response, re.DOTALL)
memory = memory_match.group(1).strip() if memory_match else "[]"
# Extract action text (everything before Memory:)
action_text_pattern = r'^(.*?)Memory:'
action_text_match = re.search(action_text_pattern, response, re.DOTALL)
action_text = action_text_match.group(1).strip() if action_text_match else response
# Clean up action text by removing special tokens
if action_text:
action_text = action_text.replace("<|begin_of_box|>", "").replace("<|end_of_box|>", "")
return {
"action": action,
"action_text": action_text,
"memory": memory
}
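
A worked example of the parser on a synthetic reply (the string below is illustrative, not captured model output, and the import path is assumed):

```python
from agent.loops.glm45v import parse_glm_response  # import path assumed

sample = (
    "I will open the settings menu. "
    "<|begin_of_box|>left_click(start_box='[512,300]', element_info='Settings gear')<|end_of_box|>\n"
    "Memory:\n[]"
)
parsed = parse_glm_response(sample)
assert parsed["action"] == "left_click(start_box='[512,300]', element_info='Settings gear')"
assert parsed["memory"] == "[]"
assert parsed["action_text"].startswith("I will open the settings menu.")
```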
def get_last_image_from_messages(messages: Messages) -> Optional[str]:
"""Extract the last image from messages for processing."""
for message in reversed(messages):
if isinstance(message, dict):
if message.get("type") == "computer_call_output":
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_url = output.get("image_url", "")
if isinstance(image_url, str) and image_url.startswith("data:image/"):
# Extract base64 part
return image_url.split(",", 1)[1]
elif message.get("role") == "user":
content = message.get("content", [])
if isinstance(content, list):
for item in reversed(content):
if isinstance(item, dict) and item.get("type") == "image_url":
image_url_obj = item.get("image_url", {})
if isinstance(image_url_obj, dict):
image_url = image_url_obj.get("url", "")
if isinstance(image_url, str) and image_url.startswith("data:image/"):
return image_url.split(",", 1)[1]
return None
def convert_responses_items_to_glm45v_pc_prompt(messages: Messages, task: str, memory: str = "") -> List[Dict[str, Any]]:
"""Convert responses items to GLM-4.5V PC prompt format with historical actions.
Args:
messages: List of message items from the conversation
task: The task description
memory: Current memory state
Returns:
List of content items for the prompt (text and image_url items)
"""
action_space = GLM_ACTION_SPACE
# Template head
head_text = f"""You are a GUI Agent, and your primary task is to respond accurately to user requests or questions. In addition to directly answering the user's queries, you can also use tools or perform GUI operations directly until you fulfill the user's request or provide a correct answer. You should carefully read and understand the images and questions provided by the user, and engage in thinking and reflection when appropriate. The coordinates involved are all represented in thousandths (0-999).
# Task:
{task}
# Task Platform
Ubuntu
# Action Space
{action_space}
# Historical Actions and Current Memory
History:"""
# Template tail
tail_text = f"""
Memory:
{memory}
# Output Format
Plain text explanation with action(param='...')
Memory:
[{{"key": "value"}}, ...]
# Some Additional Notes
- I'll give you the most recent 4 history screenshots (shrunk to 50%*50%) along with the historical action steps.
- You should put the key information you *have to remember* in a separate memory part and I'll give it to you in the next round. The content in this part should be a dict list. If you no longer need some given information, you should remove it from the memory. Even if you don't need to remember anything, you should also output an empty list.
- My computer's password is "password", feel free to use it when you need sudo rights.
- For the thunderbird account "anonym-x2024@outlook.com", the password is "gTCI";=@y7|QJ0nDa_kN3Sb&>".
Current Screenshot:
"""
# Build history from messages
history = []
history_images = []
# Group messages into steps
current_step = []
step_num = 0
for message in messages:
msg_type = message.get("type")
if msg_type == "reasoning":
current_step.append(message)
elif msg_type == "message" and message.get("role") == "assistant":
current_step.append(message)
elif msg_type == "computer_call":
current_step.append(message)
elif msg_type == "computer_call_output":
current_step.append(message)
# End of step - process it
if current_step:
step_num += 1
# Extract bot thought from message content
bot_thought = ""
for item in current_step:
if item.get("type") == "message" and item.get("role") == "assistant":
content = item.get("content", [])
for content_item in content:
if content_item.get("type") == "output_text":
bot_thought = content_item.get("text", "")
break
break
# Extract action from computer_call
action_text = ""
for item in current_step:
if item.get("type") == "computer_call":
action = item.get("action", {})
action_type = action.get("type", "")
if action_type == "click":
x, y = action.get("x", 0), action.get("y", 0)
# Convert to 0-999 range (assuming screen dimensions)
# For now, use direct coordinates - this may need adjustment
action_text = f"left_click(start_box='[{x},{y}]')"
elif action_type == "double_click":
x, y = action.get("x", 0), action.get("y", 0)
action_text = f"left_double_click(start_box='[{x},{y}]')"
elif action_type == "right_click":
x, y = action.get("x", 0), action.get("y", 0)
action_text = f"right_click(start_box='[{x},{y}]')"
elif action_type == "drag":
# Handle drag with path
path = action.get("path", [])
if len(path) >= 2:
start = path[0]
end = path[-1]
action_text = f"left_drag(start_box='[{start.get('x', 0)},{start.get('y', 0)}]', end_box='[{end.get('x', 0)},{end.get('y', 0)}]')"
elif action_type == "keypress":
key = action.get("key", "")
action_text = f"key(keys='{key}')"
elif action_type == "type":
text = action.get("text", "")
action_text = f"type(content='{text}')"
elif action_type == "scroll":
x, y = action.get("x", 0), action.get("y", 0)
direction = action.get("direction", "down")
action_text = f"scroll(start_box='[{x},{y}]', direction='{direction}')"
elif action_type == "wait":
action_text = "WAIT()"
break
# Extract screenshot from computer_call_output
screenshot_url = None
for item in current_step:
if item.get("type") == "computer_call_output":
output = item.get("output", {})
if output.get("type") == "input_image":
screenshot_url = output.get("image_url", "")
break
# Store step info
step_info = {
"step_num": step_num,
"bot_thought": bot_thought,
"action_text": action_text,
"screenshot_url": screenshot_url
}
history.append(step_info)
# Store screenshot for last 4 steps
if screenshot_url:
history_images.append(screenshot_url)
current_step = []
# Build content array with head, history, and tail
content = []
current_text = head_text
total_history_steps = len(history)
history_image_count = min(4, len(history_images)) # Last 4 images
for step_idx, step_info in enumerate(history):
step_num = step_info["step_num"]
bot_thought = step_info["bot_thought"]
action_text = step_info["action_text"]
if step_idx < total_history_steps - history_image_count:
# For steps beyond the last 4, use text placeholder
current_text += f"\nstep {step_num}: Screenshot:(Omitted in context.) Thought: {bot_thought}\nAction: {action_text}"
else:
# For the last 4 steps, insert images
current_text += f"\nstep {step_num}: Screenshot:"
content.append({"type": "text", "text": current_text})
# Add image
img_idx = step_idx - (total_history_steps - history_image_count)
if img_idx < len(history_images):
content.append({"type": "image_url", "image_url": {"url": history_images[img_idx]}})
current_text = f" Thought: {bot_thought}\nAction: {action_text}"
# Add tail
current_text += tail_text
content.append({"type": "text", "text": current_text})
return content
def model_dump(obj) -> Dict[str, Any]:
if isinstance(obj, dict):
return {k: model_dump(v) for k, v in obj.items()}
elif hasattr(obj, "model_dump"):
return obj.model_dump()
else:
return obj
def convert_glm_completion_to_responses_items(response: ModelResponse, image_width: int, image_height: int) -> List[Dict[str, Any]]:
"""
Convert GLM-4.5V completion response to responses items format.
Args:
response: LiteLLM ModelResponse from GLM-4.5V
image_width: Original image width for coordinate scaling
image_height: Original image height for coordinate scaling
Returns:
List of response items in the proper format
"""
import uuid
response_items = []
if not response.choices or not response.choices[0].message:
return response_items
message = response.choices[0].message
content = message.content or ""
reasoning_content = getattr(message, 'reasoning_content', None)
# Add reasoning item if present
if reasoning_content:
reasoning_item = model_dump(make_reasoning_item(reasoning_content))
response_items.append(reasoning_item)
# Parse the content to extract action and text
parsed_response = parse_glm_response(content)
action = parsed_response.get("action", "")
action_text = parsed_response.get("action_text", "")
# Add message item with text content (excluding action and memory)
if action_text:
# Remove action from action_text if it's there
clean_text = action_text
if action and action in clean_text:
clean_text = clean_text.replace(action, "").strip()
# Remove memory section
memory_pattern = r"Memory:\s*\[.*?\]\s*$"
clean_text = re.sub(memory_pattern, "", clean_text, flags=re.DOTALL).strip()
if clean_text:
message_item = model_dump(make_output_text_item(clean_text))
response_items.append(message_item)
# Convert action to computer call if present
if action:
call_id = f"call_{uuid.uuid4().hex[:8]}"
# Parse different action types and create appropriate computer calls
if action.startswith("left_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
# Convert from 0-999 to actual pixel coordinates
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_click_item(actual_x, actual_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("right_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_click_item(actual_x, actual_y, button="right"))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("left_double_click"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
if coord_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
computer_call = model_dump(make_double_click_item(actual_x, actual_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("left_drag"):
start_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
end_match = re.search(r"end_box='?\[(\d+),\s*(\d+)\]'?", action)
if start_match and end_match:
x1, y1 = int(start_match.group(1)), int(start_match.group(2))
x2, y2 = int(end_match.group(1)), int(end_match.group(2))
actual_x1 = int((x1 / 999.0) * image_width)
actual_y1 = int((y1 / 999.0) * image_height)
actual_x2 = int((x2 / 999.0) * image_width)
actual_y2 = int((y2 / 999.0) * image_height)
# Create path for drag operation
drag_path = [{"x": actual_x1, "y": actual_y1}, {"x": actual_x2, "y": actual_y2}]
computer_call = model_dump(make_drag_item(drag_path))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("key"):
key_match = re.search(r"keys='([^']+)'", action)
if key_match:
keys = key_match.group(1)
# Split keys by '+' for key combinations, or use as single key
key_list = keys.split('+') if '+' in keys else [keys]
computer_call = model_dump(make_keypress_item(key_list))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("type"):
content_match = re.search(r"content='([^']*)'", action)
if content_match:
content = content_match.group(1)
computer_call = model_dump(make_type_item(content))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action.startswith("scroll"):
coord_match = re.search(r"start_box='?\[(\d+),\s*(\d+)\]'?", action)
direction_match = re.search(r"direction='([^']+)'", action)
if coord_match and direction_match:
x, y = int(coord_match.group(1)), int(coord_match.group(2))
direction = direction_match.group(1)
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
# Convert direction to scroll amounts
scroll_x, scroll_y = 0, 0
if direction == "up":
scroll_y = -5
elif direction == "down":
scroll_y = 5
elif direction == "left":
scroll_x = -5
elif direction == "right":
scroll_x = 5
computer_call = model_dump(make_scroll_item(actual_x, actual_y, scroll_x, scroll_y))
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
elif action == "WAIT()":
computer_call = model_dump(make_wait_item())
computer_call["call_id"] = call_id
computer_call["status"] = "completed"
response_items.append(computer_call)
return response_items
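# Illustrative conversion (values are made up): for a 1920x1080 screenshot, a GLM-4.5V
# action string such as left_click(start_box='[100, 200]') is rescaled from the model's
# 0-999 grid to pixels:
#   x = int((100 / 999.0) * 1920)  # -> 192
#   y = int((200 / 999.0) * 1080)  # -> 216
# and emitted roughly as (the exact fields come from make_click_item):
#   {"type": "computer_call", "call_id": "call_1a2b3c4d", "status": "completed",
#    "action": {"type": "click", "button": "left", "x": 192, "y": 216}}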
@register_agent(models=r"(?i).*GLM-4\.5V.*")
class Glm4vConfig(AsyncAgentConfig):
"""GLM-4.5V agent configuration using liteLLM."""
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step using GLM-4.5V model.
Args:
messages: Input messages following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries for API calls
stream: Whether to stream the response
computer_handler: Computer handler for taking screenshots
use_prompt_caching: Whether to use prompt caching
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
Returns:
Dict with "output" and "usage" keys
"""
# Get the user instruction from the last user message
user_instruction = ""
for message in reversed(messages):
if isinstance(message, dict) and message.get("role") == "user":
content = message.get("content", "")
if isinstance(content, str):
user_instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
user_instruction = item.get("text", "")
break
break
# Get the last image for processing
last_image_b64 = get_last_image_from_messages(messages)
if not last_image_b64 and computer_handler:
# Take a screenshot if no image available
screenshot_b64 = await computer_handler.screenshot()
if screenshot_b64:
last_image_b64 = screenshot_b64
if _on_screenshot:
await _on_screenshot(screenshot_b64)
if not last_image_b64:
raise ValueError("No image available for GLM-4.5V processing")
# Convert responses items to GLM-4.5V PC prompt format with historical actions
prompt_content = convert_responses_items_to_glm45v_pc_prompt(
messages=messages,
task=user_instruction,
memory="[]" # Initialize with empty memory for now
)
# Add the current screenshot to the end
prompt_content.append({
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{last_image_b64}"}
})
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful GUI agent assistant."
},
{
"role": "user",
"content": prompt_content
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
# "max_tokens": 2048,
# "temperature": 0.001,
# "extra_body": {
# "skip_special_tokens": False,
# }
}
# Add API callbacks
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM
response = await litellm.acompletion(**api_kwargs)
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Get image dimensions for coordinate scaling
image_width, image_height = 1920, 1080 # Default dimensions
# Try to get actual dimensions from the image
try:
image_data = base64.b64decode(last_image_b64)
image = Image.open(BytesIO(image_data))
image_width, image_height = image.size
except Exception:
pass # Use default dimensions
# Convert GLM completion response to responses items
response_items = convert_glm_completion_to_responses_items(response, image_width, image_height)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
return agent_response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates using GLM-4.5V model.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple with (x, y) coordinates or None
"""
try:
# Create a simple click instruction prompt
click_prompt = f"""You are a GUI agent. Look at the screenshot and identify where to click for: {instruction}
Respond with a single click action in this format:
left_click(start_box='[x,y]')
Where x,y are coordinates normalized to 0-999 range."""
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful GUI agent assistant."
},
{
"role": "user",
"content": [
{"type": "text", "text": click_prompt},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
]
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": 100,
"temperature": 0.001,
"extra_body": {
"skip_special_tokens": False,
}
}
# Call liteLLM
response = await litellm.acompletion(**api_kwargs)
# Extract response content
response_content = response.choices[0].message.content.strip()
# Parse response for click coordinates
# Look for coordinates in the response, handling special tokens
coord_pattern = r"<\|begin_of_box\|>.*?left_click\(start_box='?\[(\d+),(\d+)\]'?\).*?<\|end_of_box\|>"
match = re.search(coord_pattern, response_content)
if not match:
# Fallback: look for coordinates without special tokens
coord_pattern = r"left_click\(start_box='?\[(\d+),(\d+)\]'?\)"
match = re.search(coord_pattern, response_content)
if match:
x, y = int(match.group(1)), int(match.group(2))
# Get actual image dimensions for scaling
try:
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
image_width, image_height = image.size
except Exception:
# Use default dimensions
image_width, image_height = 1920, 1080
# Convert from 0-999 normalized coordinates to actual pixel coordinates
actual_x = int((x / 999.0) * image_width)
actual_y = int((y / 999.0) * image_height)
return (actual_x, actual_y)
return None
except Exception as e:
# Log error and return None
print(f"Error in predict_click: {e}")
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["step", "click"]

View File

@@ -0,0 +1,178 @@
"""
GTA1 agent loop implementation for click prediction using litellm.acompletion
Paper: https://arxiv.org/pdf/2507.05791
Code: https://github.com/Yan98/GTA1
"""
import asyncio
import json
import re
import base64
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
from io import BytesIO
import uuid
from PIL import Image
import litellm
import math
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''.strip()
def extract_coordinates(raw_string: str) -> Tuple[float, float]:
"""Extract coordinates from model output."""
try:
matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
return tuple(map(float, matches[0])) # type: ignore
except Exception:
return (0.0, 0.0)
def smart_resize(height: int, width: int, factor: int = 28, min_pixels: int = 3136, max_pixels: int = 8847360) -> Tuple[int, int]:
"""Smart resize function similar to qwen_vl_utils."""
# Calculate the total pixels
total_pixels = height * width
# If already within bounds, return original dimensions
if min_pixels <= total_pixels <= max_pixels:
# Round to nearest factor
new_height = (height // factor) * factor
new_width = (width // factor) * factor
return new_height, new_width
# Calculate scaling factor
if total_pixels > max_pixels:
scale = (max_pixels / total_pixels) ** 0.5
else:
scale = (min_pixels / total_pixels) ** 0.5
# Apply scaling
new_height = int(height * scale)
new_width = int(width * scale)
# Round to nearest factor
new_height = (new_height // factor) * factor
new_width = (new_width // factor) * factor
# Ensure minimum size
new_height = max(new_height, factor)
new_width = max(new_width, factor)
return new_height, new_width
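# Worked example (illustrative): a 1920x1080 screenshot has 2,073,600 pixels, which already
# lies within [min_pixels, max_pixels], so smart_resize only snaps each side down to a
# multiple of factor=28:
#   new_height = (1080 // 28) * 28  # -> 1064
#   new_width  = (1920 // 28) * 28  # -> 1904
# predict_click below then maps model outputs back with scale_x = width / resized_width and
# scale_y = height / resized_height.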
@register_agent(models=r".*GTA1.*")
class GTA1Config(AsyncAgentConfig):
"""GTA1 agent configuration implementing AsyncAgentConfig protocol for click prediction."""
def __init__(self):
self.current_model = None
self.last_screenshot_b64 = None
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
raise NotImplementedError()
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[float, float]]:
"""
Predict click coordinates using GTA1 model via litellm.acompletion.
Args:
model: The GTA1 model name
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# Decode base64 image
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
width, height = image.width, image.height
# Smart resize the image (similar to qwen_vl_utils)
resized_height, resized_width = smart_resize(
height, width,
factor=28, # Default factor for Qwen models
min_pixels=3136,
max_pixels=4096 * 2160
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height
# Convert resized image back to base64
buffered = BytesIO()
resized_image.save(buffered, format="PNG")
resized_image_b64 = base64.b64encode(buffered.getvalue()).decode()
# Prepare system and user messages
system_message = {
"role": "system",
"content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width)
}
user_message = {
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{resized_image_b64}"
}
},
{
"type": "text",
"text": instruction
}
]
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": [system_message, user_message],
"max_tokens": 32,
"temperature": 0.0,
**kwargs
}
# Use liteLLM acompletion
response = await litellm.acompletion(**api_kwargs)
# Extract response text
output_text = response.choices[0].message.content # type: ignore
# Extract and rescale coordinates
pred_x, pred_y = extract_coordinates(output_text) # type: ignore
pred_x *= scale_x
pred_y *= scale_y
return (math.floor(pred_x), math.floor(pred_y))
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["click"]

View File

@@ -0,0 +1,6 @@
model,predict_step,predict_point
anthropic,,
openai,,
uitars,,
omniparser,,
gta1,,

View File

@@ -1,5 +1,7 @@
"""
OmniParser agent loop implementation using liteLLM
Paper: https://arxiv.org/abs/2408.00203
Code: https://github.com/microsoft/OmniParser
"""
import asyncio
@@ -9,8 +11,9 @@ import litellm
import inspect
import base64
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..loops.base import AsyncAgentConfig
SOM_TOOL_SCHEMA = {
"type": "function",
@@ -246,94 +249,185 @@ async def replace_computer_call_with_function(item: Dict[str, Any], xy2id: Dict[
return [item]
@agent_loop(models=r"omniparser\+.*|omni\+.*", priority=10)
async def omniparser_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
@register_agent(models=r"omniparser\+.*|omni\+.*", priority=2)
class OmniparserConfig(AsyncAgentConfig):
"""Omniparser agent configuration implementing AsyncAgentConfig protocol."""
Supports OpenAI's computer use preview models.
"""
if not OMNIPARSER_AVAILABLE:
raise ValueError("omniparser loop requires som to be installed. Install it with `pip install cua-som`.")
tools = tools or []
llm_model = model.split('+')[-1]
# Prepare tools for OpenAI API
openai_tools, id2xy = _prepare_tools_for_omniparser(tools)
# Find last computer_call_output
last_computer_call_output = get_last_computer_call_output(messages)
if last_computer_call_output:
image_url = last_computer_call_output.get("output", {}).get("image_url", "")
image_data = image_url.split(",")[-1]
if image_data:
parser = get_parser()
result = parser.parse(image_data)
if _on_screenshot:
await _on_screenshot(result.annotated_image_base64, "annotated_image")
for element in result.elements:
id2xy[element.id] = ((element.bbox.x1 + element.bbox.x2) / 2, (element.bbox.y1 + element.bbox.y2) / 2)
# handle computer calls -> function calls
new_messages = []
for message in messages:
if not isinstance(message, dict):
message = message.__dict__
new_messages += await replace_computer_call_with_function(message, id2xy)
messages = new_messages
# Prepare API call kwargs
api_kwargs = {
"model": llm_model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
}
) -> Dict[str, Any]:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
Supports OpenAI's computer use preview models.
"""
if not OMNIPARSER_AVAILABLE:
raise ValueError("omniparser loop requires som to be installed. Install it with `pip install cua-som`.")
tools = tools or []
llm_model = model.split('+')[-1]
# Prepare tools for OpenAI API
openai_tools, id2xy = _prepare_tools_for_omniparser(tools)
# Find last computer_call_output
last_computer_call_output = get_last_computer_call_output(messages) # type: ignore
if last_computer_call_output:
image_url = last_computer_call_output.get("output", {}).get("image_url", "")
image_data = image_url.split(",")[-1]
if image_data:
parser = get_parser()
result = parser.parse(image_data)
if _on_screenshot:
await _on_screenshot(result.annotated_image_base64, "annotated_image")
for element in result.elements:
id2xy[element.id] = ((element.bbox.x1 + element.bbox.x2) / 2, (element.bbox.y1 + element.bbox.y2) / 2)
# handle computer calls -> function calls
new_messages = []
for message in messages:
if not isinstance(message, dict):
message = message.__dict__
new_messages += await replace_computer_call_with_function(message, id2xy) # type: ignore
messages = new_messages
# Prepare API call kwargs
api_kwargs = {
"model": llm_model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"truncation": "auto",
"num_retries": max_retries,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
print(str(api_kwargs)[:1000])
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(), # type: ignore
"response_cost": response._hidden_params.get("response_cost", 0.0), # type: ignore
}
if _on_usage:
await _on_usage(usage)
# handle som function calls -> xy computer calls
new_output = []
for i in range(len(response.output)): # type: ignore
new_output += await replace_function_with_computer_call(response.output[i].model_dump(), id2xy) # type: ignore
return {
"output": new_output,
"usage": usage
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str,
**kwargs
) -> Optional[Tuple[float, float]]:
"""
Predict click coordinates using OmniParser and LLM.
Uses OmniParser to annotate the image with element IDs, then uses LLM
to identify the correct element ID based on the instruction.
"""
if not OMNIPARSER_AVAILABLE:
return None
# Parse the image with OmniParser to get annotated image and elements
parser = get_parser()
result = parser.parse(image_b64)
# Extract the LLM model from composed model string
llm_model = model.split('+')[-1]
# Create system prompt for element ID prediction
SYSTEM_PROMPT = f'''
You are an expert UI element locator. Given a GUI image annotated with numerical IDs over each interactable element, along with a user's element description, provide the ID of the specified element.
The image shows UI elements with numbered overlays. Each number corresponds to a clickable/interactable element.
Output only the element ID as a single integer.
'''.strip()
# Prepare messages for LLM
messages = [
{
"role": "system",
"content": SYSTEM_PROMPT
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{result.annotated_image_base64}"
}
},
{
"type": "text",
"text": f"Find the element: {instruction}"
}
]
}
]
# Call LLM to predict element ID
response = await litellm.acompletion(
model=llm_model,
messages=messages,
max_tokens=10,
temperature=0.1
)
# Extract element ID from response
response_text = response.choices[0].message.content.strip() # type: ignore
# Try to parse the element ID
try:
element_id = int(response_text)
# Find the element with this ID and return its center coordinates
for element in result.elements:
if element.id == element_id:
center_x = (element.bbox.x1 + element.bbox.x2) / 2
center_y = (element.bbox.y1 + element.bbox.y2) / 2
return (center_x, center_y)
except ValueError:
# If we can't parse the ID, return None
pass
return None
print(str(api_kwargs)[:1000])
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
response.usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response.usage)
# handle som function calls -> xy computer calls
new_output = []
for i in range(len(response.output)):
new_output += await replace_function_with_computer_call(response.output[i].model_dump(), id2xy)
response.output = new_output
return response
def get_capabilities(self) -> List[AgentCapability]:
"""Return the capabilities supported by this agent."""
return ["step"]

View File

@@ -3,31 +3,49 @@ OpenAI computer-use-preview agent loop implementation using liteLLM
"""
import asyncio
import base64
import json
from typing import Dict, List, Any, AsyncGenerator, Union, Optional
from io import BytesIO
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
import litellm
from PIL import Image
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
def _map_computer_tool_to_openai(computer_tool: Any) -> Dict[str, Any]:
async def _map_computer_tool_to_openai(computer_handler: Any) -> Dict[str, Any]:
"""Map a computer tool to OpenAI's computer-use-preview tool schema"""
# Get dimensions from the computer handler
try:
width, height = await computer_handler.get_dimensions()
except Exception:
# Fallback to default dimensions if method fails
width, height = 1024, 768
# Get environment from the computer handler
try:
environment = await computer_handler.get_environment()
except Exception:
# Fallback to default environment if method fails
environment = "linux"
return {
"type": "computer_use_preview",
"display_width": getattr(computer_tool, 'display_width', 1024),
"display_height": getattr(computer_tool, 'display_height', 768),
"environment": getattr(computer_tool, 'environment', "linux") # mac, windows, linux, browser
"display_width": width,
"display_height": height,
"environment": environment # mac, windows, linux, browser
}
def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
async def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
"""Prepare tools for OpenAI API format"""
openai_tools = []
for schema in tool_schemas:
if schema["type"] == "computer":
# Map computer tool to OpenAI format
openai_tools.append(_map_computer_tool_to_openai(schema["computer"]))
computer_tool = await _map_computer_tool_to_openai(schema["computer"])
openai_tools.append(computer_tool)
elif schema["type"] == "function":
# Function tools use OpenAI-compatible schema directly (liteLLM expects this format)
# Schema should be: {type, name, description, parameters}
@@ -36,60 +54,182 @@ def _prepare_tools_for_openai(tool_schemas: List[Dict[str, Any]]) -> Tools:
return openai_tools
@agent_loop(models=r".*computer-use-preview.*", priority=10)
async def openai_computer_use_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
@register_agent(models=r".*computer-use-preview.*")
class OpenAIComputerUseConfig:
"""
OpenAI computer-use-preview agent loop using liteLLM responses.
OpenAI computer-use-preview agent configuration using liteLLM responses.
Supports OpenAI's computer use preview models.
"""
tools = tools or []
# Prepare tools for OpenAI API
openai_tools = _prepare_tools_for_openai(tools)
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
) -> Dict[str, Any]:
"""
Predict the next step based on input items.
Args:
messages: Input items following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
Returns:
Dictionary with "output" (output items) and "usage" array
"""
tools = tools or []
# Prepare tools for OpenAI API
openai_tools = await _prepare_tools_for_openai(tools)
# Extract usage information
response.usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response.usage)
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": messages,
"tools": openai_tools if openai_tools else None,
"stream": stream,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"num_retries": max_retries,
**kwargs
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract usage information
usage = {
**response.usage.model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(usage)
# Return in the expected format
output_dict = response.model_dump()
output_dict["usage"] = usage
return output_dict
return response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
Uses OpenAI computer-use-preview with manually constructed input items
and a prompt that instructs the agent to only output clicks.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# TODO: use computer tool to get dimensions + environment
# Manually construct input items with image and click instruction
input_items = [
{
"role": "user",
"content": f"You are a UI grounding expert. Look at the image and {instruction}. Output ONLY a click action on the target element. No explanations, confirmations, or additional text."
},
{
"role": "user",
"content": [
{
"type": "input_image",
"image_url": f"data:image/png;base64,{image_b64}"
}
]
}
]
# Get image dimensions from base64 data
try:
image_data = base64.b64decode(image_b64)
image = Image.open(BytesIO(image_data))
display_width, display_height = image.size
except Exception:
# Fallback to default dimensions if image parsing fails
display_width, display_height = 1024, 768
# Prepare computer tool for click actions
computer_tool = {
"type": "computer_use_preview",
"display_width": display_width,
"display_height": display_height,
"environment": "windows"
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"input": input_items,
"tools": [computer_tool],
"stream": False,
"reasoning": {"summary": "concise"},
"truncation": "auto",
"max_tokens": 100 # Keep response short for click prediction
}
# Use liteLLM responses
response = await litellm.aresponses(**api_kwargs)
# Extract click coordinates from response output
output_dict = response.model_dump()
output_items = output_dict.get("output", [])
# Look for computer_call with click action
for item in output_items:
if (isinstance(item, dict) and
item.get("type") == "computer_call" and
isinstance(item.get("action"), dict)):
action = item["action"]
if action.get("type") == "click":
x = action.get("x")
y = action.get("y")
if x is not None and y is not None:
return (int(x), int(y))
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["click", "step"]

View File

@@ -1,5 +1,7 @@
"""
UITARS agent loop implementation using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B
Paper: https://arxiv.org/abs/2501.12326
Code: https://github.com/bytedance/UI-TARS
"""
import asyncio
@@ -9,7 +11,7 @@ import base64
import math
import re
import ast
from typing import Dict, List, Any, AsyncGenerator, Union, Optional
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
from io import BytesIO
from PIL import Image
import litellm
@@ -21,8 +23,8 @@ from openai.types.responses.response_input_param import ComputerCallOutput
from openai.types.responses.response_output_message_param import ResponseOutputMessageParam
from openai.types.responses.response_reasoning_item_param import ResponseReasoningItemParam, Summary
from ..decorators import agent_loop
from ..types import Messages, AgentResponse, Tools
from ..decorators import register_agent
from ..types import Messages, AgentResponse, Tools, AgentCapability
from ..responses import (
make_reasoning_item,
make_output_text_item,
@@ -79,6 +81,18 @@ Action: ...
{instruction}
"""
GROUNDING_UITARS_PROMPT_TEMPLATE = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Action: ...
## Action Space
click(point='<|box_start|>(x1,y1)<|box_end|>')
## User Instruction
{instruction}"""
def round_by_factor(number: float, factor: int) -> int:
"""Returns the closest integer to 'number' that is divisible by 'factor'."""
@@ -501,188 +515,301 @@ def convert_uitars_messages_to_litellm(messages: Messages) -> List[Dict[str, Any
return litellm_messages
@agent_loop(models=r"(?i).*ui-?tars.*", priority=10)
async def uitars_loop(
messages: Messages,
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Union[AgentResponse, AsyncGenerator[Dict[str, Any], None]]:
@register_agent(models=r"(?i).*ui-?tars.*")
class UITARSConfig:
"""
UITARS agent loop using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B model.
UITARS agent configuration using liteLLM for ByteDance-Seed/UI-TARS-1.5-7B model.
Supports UITARS vision-language models for computer control.
"""
tools = tools or []
# Create response items
response_items = []
# Find computer tool for screen dimensions
computer_tool = None
for tool_schema in tools:
if tool_schema["type"] == "computer":
computer_tool = tool_schema["computer"]
break
# Get screen dimensions
screen_width, screen_height = 1024, 768
if computer_tool:
try:
screen_width, screen_height = await computer_tool.get_dimensions()
except:
pass
# Process messages to extract instruction and image
instruction = ""
image_data = None
# Convert messages to list if string
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
# Extract instruction and latest screenshot
for message in reversed(messages):
if isinstance(message, dict):
content = message.get("content", "")
async def predict_step(
self,
messages: List[Dict[str, Any]],
model: str,
tools: Optional[List[Dict[str, Any]]] = None,
max_retries: Optional[int] = None,
stream: bool = False,
computer_handler=None,
use_prompt_caching: Optional[bool] = False,
_on_api_start=None,
_on_api_end=None,
_on_usage=None,
_on_screenshot=None,
**kwargs
) -> Dict[str, Any]:
"""
Predict the next step based on input messages.
Args:
messages: Input messages following Responses format
model: Model name to use
tools: Optional list of tool schemas
max_retries: Maximum number of retries
stream: Whether to stream responses
computer_handler: Computer handler instance
_on_api_start: Callback for API start
_on_api_end: Callback for API end
_on_usage: Callback for usage tracking
_on_screenshot: Callback for screenshot events
**kwargs: Additional arguments
# Handle different content formats
if isinstance(content, str):
if not instruction and message.get("role") == "user":
instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict):
if item.get("type") == "text" and not instruction:
instruction = item.get("text", "")
elif item.get("type") == "image_url" and not image_data:
image_url = item.get("image_url", {})
if isinstance(image_url, dict):
image_data = image_url.get("url", "")
else:
image_data = image_url
Returns:
Dictionary with "output" (output items) and "usage" array
"""
tools = tools or []
# Also check for computer_call_output with screenshots
if message.get("type") == "computer_call_output" and not image_data:
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_data = output.get("image_url", "")
# Create response items
response_items = []
if instruction and image_data:
break
if not instruction:
instruction = "Help me complete this task by analyzing the screen and taking appropriate actions."
# Create prompt
user_prompt = UITARS_PROMPT_TEMPLATE.format(
instruction=instruction,
action_space=UITARS_ACTION_SPACE,
language="English"
)
# Convert conversation history to LiteLLM format
history_messages = convert_uitars_messages_to_litellm(messages)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
}
]
# Add current user instruction with screenshot
current_user_message = {
"role": "user",
"content": [
{"type": "text", "text": user_prompt},
# Find computer tool for screen dimensions
computer_tool = None
for tool_schema in tools:
if tool_schema["type"] == "computer":
computer_tool = tool_schema["computer"]
break
# Get screen dimensions
screen_width, screen_height = 1024, 768
if computer_tool:
try:
screen_width, screen_height = await computer_tool.get_dimensions()
except Exception:
pass
# Process messages to extract instruction and image
instruction = ""
image_data = None
# Convert messages to list if string
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
# Extract instruction and latest screenshot
for message in reversed(messages):
if isinstance(message, dict):
content = message.get("content", "")
# Handle different content formats
if isinstance(content, str):
if not instruction and message.get("role") == "user":
instruction = content
elif isinstance(content, list):
for item in content:
if isinstance(item, dict):
if item.get("type") == "text" and not instruction:
instruction = item.get("text", "")
elif item.get("type") == "image_url" and not image_data:
image_url = item.get("image_url", {})
if isinstance(image_url, dict):
image_data = image_url.get("url", "")
else:
image_data = image_url
# Also check for computer_call_output with screenshots
if message.get("type") == "computer_call_output" and not image_data:
output = message.get("output", {})
if isinstance(output, dict) and output.get("type") == "input_image":
image_data = output.get("image_url", "")
if instruction and image_data:
break
if not instruction:
instruction = "Help me complete this task by analyzing the screen and taking appropriate actions."
# Create prompt
user_prompt = UITARS_PROMPT_TEMPLATE.format(
instruction=instruction,
action_space=UITARS_ACTION_SPACE,
language="English"
)
# Convert conversation history to LiteLLM format
history_messages = convert_uitars_messages_to_litellm(messages)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
}
]
}
litellm_messages.append(current_user_message)
# Process image for UITARS
if not image_data:
# Take screenshot if none found in messages
if computer_handler:
image_data = await computer_handler.screenshot()
await _on_screenshot(image_data, "screenshot_before")
# Add screenshot to output items so it can be retained in history
response_items.append(make_input_image_item(image_data))
else:
raise ValueError("No screenshot found in messages and no computer_handler provided")
processed_image, original_width, original_height = process_image_for_uitars(image_data)
encoded_image = pil_to_base64(processed_image)
# Add conversation history
if history_messages:
litellm_messages.extend(history_messages)
else:
litellm_messages.append({
"role": "user",
# Add current user instruction with screenshot
current_user_message = {
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
{"type": "text", "text": user_prompt},
]
})
}
litellm_messages.append(current_user_message)
# Process image for UITARS
if not image_data:
# Take screenshot if none found in messages
if computer_handler:
image_data = await computer_handler.screenshot()
if _on_screenshot:
await _on_screenshot(image_data, "screenshot_before")
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": kwargs.get("max_tokens", 500),
"temperature": kwargs.get("temperature", 0.0),
"do_sample": kwargs.get("temperature", 0.0) > 0.0,
"num_retries": max_retries,
**{k: v for k, v in kwargs.items() if k not in ["max_tokens", "temperature"]}
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse UITARS response
parsed_responses = parse_uitars_response(response_content, original_width, original_height)
# Convert to computer actions
computer_actions = convert_to_computer_actions(parsed_responses, original_width, original_height)
# Add computer actions to response items
thought = parsed_responses[0].get("thought", "")
if thought:
response_items.append(make_reasoning_item(thought))
response_items.extend(computer_actions)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Add screenshot to output items so it can be retained in history
response_items.append(make_input_image_item(image_data))
else:
raise ValueError("No screenshot found in messages and no computer_handler provided")
processed_image, original_width, original_height = process_image_for_uitars(image_data)
encoded_image = pil_to_base64(processed_image)
# Add conversation history
if history_messages:
litellm_messages.extend(history_messages)
else:
litellm_messages.append({
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
]
})
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": kwargs.get("max_tokens", 500),
"temperature": kwargs.get("temperature", 0.0),
"do_sample": kwargs.get("temperature", 0.0) > 0.0,
"num_retries": max_retries,
**{k: v for k, v in kwargs.items() if k not in ["max_tokens", "temperature"]}
}
# Call API start hook
if _on_api_start:
await _on_api_start(api_kwargs)
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Call API end hook
if _on_api_end:
await _on_api_end(api_kwargs, response)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse UITARS response
parsed_responses = parse_uitars_response(response_content, original_width, original_height)
# Convert to computer actions
computer_actions = convert_to_computer_actions(parsed_responses, original_width, original_height)
# Add computer actions to response items
thought = parsed_responses[0].get("thought", "")
if thought:
response_items.append(make_reasoning_item(thought))
response_items.extend(computer_actions)
# Extract usage information
response_usage = {
**LiteLLMCompletionResponsesConfig._transform_chat_completion_usage_to_responses_usage(response.usage).model_dump(),
"response_cost": response._hidden_params.get("response_cost", 0.0),
}
if _on_usage:
await _on_usage(response_usage)
# Create agent response
agent_response = {
"output": response_items,
"usage": response_usage
}
return agent_response
return agent_response
async def predict_click(
self,
model: str,
image_b64: str,
instruction: str
) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates based on image and instruction.
UITARS supports click prediction through its action parsing.
Args:
model: Model name to use
image_b64: Base64 encoded image
instruction: Instruction for where to click
Returns:
Tuple with (x, y) coordinates or None
"""
try:
# Create prompt using grounding template
user_prompt = GROUNDING_UITARS_PROMPT_TEMPLATE.format(
instruction=instruction
)
# Process image for UITARS
processed_image, original_width, original_height = process_image_for_uitars(image_b64)
encoded_image = pil_to_base64(processed_image)
# Prepare messages for liteLLM
litellm_messages = [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{"type": "text", "text": user_prompt},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}}
]
}
]
# Prepare API call kwargs
api_kwargs = {
"model": model,
"messages": litellm_messages,
"max_tokens": 100,
"temperature": 0.0,
"do_sample": False
}
# Call liteLLM with UITARS model
response = await litellm.acompletion(**api_kwargs)
# Extract response content
response_content = response.choices[0].message.content.strip() # type: ignore
# Parse the response to extract click coordinates
# Look for click action with coordinates
click_pattern = r"click\(point='<\|box_start\|>\((\d+),(\d+)\)<\|box_end\|>'\)"
match = re.search(click_pattern, response_content)
if match:
x, y = int(match.group(1)), int(match.group(2))
# Scale coordinates back to original image dimensions
scale_x = original_width / processed_image.width
scale_y = original_height / processed_image.height
scaled_x = int(x * scale_x)
scaled_y = int(y * scale_y)
return (scaled_x, scaled_y)
return None
except Exception as e:
# Log error and return None
print(f"Error in predict_click: {e}")
return None
def get_capabilities(self) -> List[AgentCapability]:
"""
Get list of capabilities supported by this agent config.
Returns:
List of capability strings
"""
return ["step", "click"]

View File

@@ -40,7 +40,7 @@ def make_input_image_item(image_data: Union[str, bytes]) -> EasyInputMessagePara
ResponseInputImageParam(
type="input_image",
image_url=f"data:image/png;base64,{base64.b64encode(image_data).decode('utf-8') if isinstance(image_data, bytes) else image_data}"
)
) # type: ignore
],
role="user",
type="message"
@@ -205,3 +205,524 @@ def make_wait_item(call_id: Optional[str] = None) -> ResponseComputerToolCallPar
status="completed",
type="computer_call"
)
# Extra anthropic computer calls
def make_left_mouse_down_item(x: Optional[int] = None, y: Optional[int] = None, call_id: Optional[str] = None) -> Dict[str, Any]:
return {
"id": random_id(),
"call_id": call_id if call_id else random_id(),
"action": {
"type": "left_mouse_down",
"x": x,
"y": y
},
"pending_safety_checks": [],
"status": "completed",
"type": "computer_call"
}
def make_left_mouse_up_item(x: Optional[int] = None, y: Optional[int] = None, call_id: Optional[str] = None) -> Dict[str, Any]:
return {
"id": random_id(),
"call_id": call_id if call_id else random_id(),
"action": {
"type": "left_mouse_up",
"x": x,
"y": y
},
"pending_safety_checks": [],
"status": "completed",
"type": "computer_call"
}
def make_failed_tool_call_items(tool_name: str, tool_kwargs: Dict[str, Any], error_message: str, call_id: Optional[str] = None) -> List[Dict[str, Any]]:
call_id = call_id if call_id else random_id()
return [
{
"type": "function_call",
"id": random_id(),
"call_id": call_id,
"name": tool_name,
"arguments": json.dumps(tool_kwargs),
},
{
"type": "function_call_output",
"call_id": call_id,
"output": json.dumps({"error": error_message}),
}
]
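# Illustrative result of make_failed_tool_call_items("computer", {"action": "click", "x": 10,
# "y": 20}, "element not found") with made-up IDs: a function_call / function_call_output
# pair sharing one call_id, so the error stays attached to the attempted call:
#   [{"type": "function_call", "id": "...", "call_id": "c1", "name": "computer",
#     "arguments": "{\"action\": \"click\", \"x\": 10, \"y\": 20}"},
#    {"type": "function_call_output", "call_id": "c1",
#     "output": "{\"error\": \"element not found\"}"}]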
# Conversion functions between element descriptions and coordinates
def convert_computer_calls_desc2xy(responses_items: List[Dict[str, Any]], desc2xy: Dict[str, tuple]) -> List[Dict[str, Any]]:
"""
Convert computer calls from element descriptions to x,y coordinates.
Args:
responses_items: List of response items containing computer calls with element_description
desc2xy: Dictionary mapping element descriptions to (x, y) coordinate tuples
Returns:
List of response items with element_description replaced by x,y coordinates
"""
converted_items = []
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"].copy()
# Handle single element_description
if "element_description" in action:
desc = action["element_description"]
if desc in desc2xy:
x, y = desc2xy[desc]
action["x"] = x
action["y"] = y
del action["element_description"]
# Handle start_element_description and end_element_description for drag operations
elif "start_element_description" in action and "end_element_description" in action:
start_desc = action["start_element_description"]
end_desc = action["end_element_description"]
if start_desc in desc2xy and end_desc in desc2xy:
start_x, start_y = desc2xy[start_desc]
end_x, end_y = desc2xy[end_desc]
action["path"] = [{"x": start_x, "y": start_y}, {"x": end_x, "y": end_y}]
del action["start_element_description"]
del action["end_element_description"]
converted_item = item.copy()
converted_item["action"] = action
converted_items.append(converted_item)
else:
converted_items.append(item)
return converted_items
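# Illustrative round trip (data is made up): with
#   desc2xy = {"blue Submit button": (640, 410)}
# a description-based call
#   {"type": "computer_call", "action": {"type": "click", "element_description": "blue Submit button"}}
# becomes
#   {"type": "computer_call", "action": {"type": "click", "x": 640, "y": 410}}
# and convert_computer_calls_xy2desc below applies the inverse mapping.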
def convert_computer_calls_xy2desc(responses_items: List[Dict[str, Any]], desc2xy: Dict[str, tuple]) -> List[Dict[str, Any]]:
"""
Convert computer calls from x,y coordinates to element descriptions.
Args:
responses_items: List of response items containing computer calls with x,y coordinates
desc2xy: Dictionary mapping element descriptions to (x, y) coordinate tuples
Returns:
List of response items with x,y coordinates replaced by element_description
"""
# Create reverse mapping from coordinates to descriptions
xy2desc = {coords: desc for desc, coords in desc2xy.items()}
converted_items = []
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"].copy()
# Handle single x,y coordinates
if "x" in action and "y" in action:
coords = (action["x"], action["y"])
if coords in xy2desc:
action["element_description"] = xy2desc[coords]
del action["x"]
del action["y"]
# Handle path for drag operations
elif "path" in action and isinstance(action["path"], list) and len(action["path"]) == 2:
start_point = action["path"][0]
end_point = action["path"][1]
if ("x" in start_point and "y" in start_point and
"x" in end_point and "y" in end_point):
start_coords = (start_point["x"], start_point["y"])
end_coords = (end_point["x"], end_point["y"])
if start_coords in xy2desc and end_coords in xy2desc:
action["start_element_description"] = xy2desc[start_coords]
action["end_element_description"] = xy2desc[end_coords]
del action["path"]
converted_item = item.copy()
converted_item["action"] = action
converted_items.append(converted_item)
else:
converted_items.append(item)
return converted_items
def get_all_element_descriptions(responses_items: List[Dict[str, Any]]) -> List[str]:
"""
Extract all element descriptions from computer calls in responses items.
Args:
responses_items: List of response items containing computer calls
Returns:
List of unique element descriptions found in computer calls
"""
descriptions = set()
for item in responses_items:
if item.get("type") == "computer_call" and "action" in item:
action = item["action"]
# Handle single element_description
if "element_description" in action:
descriptions.add(action["element_description"])
# Handle start_element_description and end_element_description for drag operations
if "start_element_description" in action:
descriptions.add(action["start_element_description"])
if "end_element_description" in action:
descriptions.add(action["end_element_description"])
return list(descriptions)
# Conversion functions between responses_items and completion messages formats
def convert_responses_items_to_completion_messages(messages: List[Dict[str, Any]], allow_images_in_tool_results: bool = True) -> List[Dict[str, Any]]:
"""Convert responses_items message format to liteLLM completion format.
Args:
messages: List of responses_items format messages
allow_images_in_tool_results: If True, include images in tool role messages.
If False, send tool message + separate user message with image.
"""
completion_messages = []
for message in messages:
msg_type = message.get("type")
role = message.get("role")
# Handle user messages (both with and without explicit type)
if role == "user" or msg_type == "user":
content = message.get("content", "")
if isinstance(content, list):
# Handle list content (images, text blocks)
completion_content = []
for item in content:
if item.get("type") == "input_image":
completion_content.append({
"type": "image_url",
"image_url": {
"url": item.get("image_url")
}
})
elif item.get("type") == "input_text":
completion_content.append({
"type": "text",
"text": item.get("text")
})
elif item.get("type") == "text":
completion_content.append({
"type": "text",
"text": item.get("text")
})
completion_messages.append({
"role": "user",
"content": completion_content
})
elif isinstance(content, str):
# Handle string content
completion_messages.append({
"role": "user",
"content": content
})
# Handle assistant messages
elif role == "assistant" or msg_type == "message":
content = message.get("content", [])
if isinstance(content, list):
text_parts = []
for item in content:
if item.get("type") == "output_text":
text_parts.append(item.get("text", ""))
elif item.get("type") == "text":
text_parts.append(item.get("text", ""))
if text_parts:
completion_messages.append({
"role": "assistant",
"content": "\n".join(text_parts)
})
# Handle reasoning items (convert to assistant message)
elif msg_type == "reasoning":
summary = message.get("summary", [])
text_parts = []
for item in summary:
if item.get("type") == "summary_text":
text_parts.append(item.get("text", ""))
if text_parts:
completion_messages.append({
"role": "assistant",
"content": "\n".join(text_parts)
})
# Handle function calls
elif msg_type == "function_call":
# Add tool call to last assistant message or create new one
if not completion_messages or completion_messages[-1]["role"] != "assistant":
completion_messages.append({
"role": "assistant",
"content": "",
"tool_calls": []
})
if "tool_calls" not in completion_messages[-1]:
completion_messages[-1]["tool_calls"] = []
completion_messages[-1]["tool_calls"].append({
"id": message.get("call_id"),
"type": "function",
"function": {
"name": message.get("name"),
"arguments": message.get("arguments")
}
})
# Handle computer calls
elif msg_type == "computer_call":
# Add tool call to last assistant message or create new one
if not completion_messages or completion_messages[-1]["role"] != "assistant":
completion_messages.append({
"role": "assistant",
"content": "",
"tool_calls": []
})
if "tool_calls" not in completion_messages[-1]:
completion_messages[-1]["tool_calls"] = []
action = message.get("action", {})
completion_messages[-1]["tool_calls"].append({
"id": message.get("call_id"),
"type": "function",
"function": {
"name": "computer",
"arguments": json.dumps(action)
}
})
# Handle function/computer call outputs
elif msg_type in ["function_call_output", "computer_call_output"]:
output = message.get("output")
call_id = message.get("call_id")
if isinstance(output, dict) and output.get("type") == "input_image":
if allow_images_in_tool_results:
# Handle image output as tool response (may not work with all APIs)
completion_messages.append({
"role": "tool",
"tool_call_id": call_id,
"content": [{
"type": "image_url",
"image_url": {
"url": output.get("image_url")
}
}]
})
else:
# Send tool message + separate user message with image (OpenAI compatible)
completion_messages += [{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]"
}, {
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
"url": output.get("image_url")
}
}]
}]
else:
# Handle text output as tool response
completion_messages.append({
"role": "tool",
"tool_call_id": call_id,
"content": str(output)
})
return completion_messages
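# Illustrative conversion (values are made up) for a screenshot result when
# allow_images_in_tool_results=False: a computer_call_output carrying an input_image becomes
# a tool message plus a follow-up user message with the image, the OpenAI-compatible pattern
# that convert_completion_messages_to_responses_items below reverses:
#   {"type": "computer_call_output", "call_id": "c1",
#    "output": {"type": "input_image", "image_url": "data:image/png;base64,..."}}
# ->
#   {"role": "tool", "tool_call_id": "c1",
#    "content": "[Execution completed. See screenshot below]"},
#   {"role": "user", "content": [{"type": "image_url",
#                                 "image_url": {"url": "data:image/png;base64,..."}}]}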
def convert_completion_messages_to_responses_items(completion_messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Convert completion messages format to responses_items message format."""
responses_items = []
skip_next = False
for i, message in enumerate(completion_messages):
if skip_next:
skip_next = False
continue
role = message.get("role")
content = message.get("content")
tool_calls = message.get("tool_calls", [])
# Handle assistant messages with text content
if role == "assistant" and content and isinstance(content, str):
responses_items.append({
"type": "message",
"role": "assistant",
"content": [{
"type": "output_text",
"text": content
}]
})
# Handle tool calls
if tool_calls:
for tool_call in tool_calls:
if tool_call.get("type") == "function":
function = tool_call.get("function", {})
function_name = function.get("name")
if function_name == "computer":
# Parse computer action
try:
action = json.loads(function.get("arguments", "{}"))
# Change key from "action" -> "type"
if action.get("action"):
action["type"] = action["action"]
del action["action"]
responses_items.append({
"type": "computer_call",
"call_id": tool_call.get("id"),
"action": action,
"status": "completed"
})
except json.JSONDecodeError:
# Fallback to function call format
responses_items.append({
"type": "function_call",
"call_id": tool_call.get("id"),
"name": function_name,
"arguments": function.get("arguments", "{}"),
"status": "completed"
})
else:
# Regular function call
responses_items.append({
"type": "function_call",
"call_id": tool_call.get("id"),
"name": function_name,
"arguments": function.get("arguments", "{}"),
"status": "completed"
})
# Handle tool messages (function/computer call outputs)
elif role == "tool" and content:
tool_call_id = message.get("tool_call_id")
if isinstance(content, str):
# Check if this is the "[Execution completed. See screenshot below]" pattern
if content == "[Execution completed. See screenshot below]":
# Look ahead for the next user message with image
next_idx = i + 1
if (next_idx < len(completion_messages) and
completion_messages[next_idx].get("role") == "user" and
isinstance(completion_messages[next_idx].get("content"), list)):
# Found the pattern - extract image from next message
next_content = completion_messages[next_idx]["content"]
for item in next_content:
if item.get("type") == "image_url":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": {
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
}
})
# Skip the next user message since we processed it
skip_next = True
break
else:
# No matching user message, treat as regular text
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": content
})
else:
# Determine if this is a computer call or function call output
try:
# Try to parse as structured output
parsed_content = json.loads(content)
if parsed_content.get("type") == "input_image":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": parsed_content
})
else:
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": content
})
except json.JSONDecodeError:
# Plain text output - could be function or computer call
responses_items.append({
"type": "function_call_output",
"call_id": tool_call_id,
"output": content
})
elif isinstance(content, list):
# Handle structured content (e.g., images)
for item in content:
if item.get("type") == "image_url":
responses_items.append({
"type": "computer_call_output",
"call_id": tool_call_id,
"output": {
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
}
})
elif item.get("type") == "text":
responses_items.append({
"type": "function_call_output",
"call_id": tool_call_id,
"output": item.get("text")
})
# Handle actual user messages
elif role == "user" and content:
if isinstance(content, list):
# Handle structured user content (e.g., text + images)
user_content = []
for item in content:
if item.get("type") == "image_url":
user_content.append({
"type": "input_image",
"image_url": item.get("image_url", {}).get("url")
})
elif item.get("type") == "text":
user_content.append({
"type": "input_text",
"text": item.get("text")
})
if user_content:
responses_items.append({
"role": "user",
"type": "message",
"content": user_content
})
elif isinstance(content, str):
# Handle simple text user message
responses_items.append({
"role": "user",
"content": content
})
return responses_items
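As a quick sanity check of the conversion above, here is a minimal sketch that maps a completion-style history back into responses items. It assumes it runs alongside the converter defined above (adjust the import to wherever these helpers live); the screenshot URL is a placeholder.

```python
# Assumes convert_completion_messages_to_responses_items from above is in scope.
completion_messages = [
    {"role": "user", "content": "Open the settings menu"},
    {
        "role": "assistant",
        "content": "Clicking the gear icon.",
        "tool_calls": [{
            "type": "function",
            "id": "call_1",
            "function": {
                "name": "computer",
                "arguments": '{"action": "click", "x": 512, "y": 384}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1",
     "content": "[Execution completed. See screenshot below]"},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]},
]

items = convert_completion_messages_to_responses_items(completion_messages)
# Expected: a user message, an assistant message, a "computer_call" with
# type "click", and a "computer_call_output" carrying the screenshot.
```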

View File

@@ -9,71 +9,21 @@ from litellm import ResponseInputParam, ResponsesAPIResponse, ToolParam
from collections.abc import Iterable
# Agent input types
Messages = str | ResponseInputParam
Messages = str | ResponseInputParam | List[Dict[str, Any]]
Tools = Optional[Iterable[ToolParam]]
# Agent output types
AgentResponse = ResponsesAPIResponse
AgentCapability = Literal["step", "click"]
# Agent loop registration
class AgentLoopInfo(BaseModel):
"""Information about a registered agent loop"""
func: Callable
# Agent config registration
class AgentConfigInfo(BaseModel):
"""Information about a registered agent config"""
agent_class: type
models_regex: str
priority: int = 0
def matches_model(self, model: str) -> bool:
"""Check if this loop matches the given model"""
"""Check if this agent config matches the given model"""
return bool(re.match(self.models_regex, model))
# Computer tool interface
class Computer(Protocol):
"""Protocol defining the interface for computer interactions."""
async def get_environment(self) -> Literal["windows", "mac", "linux", "browser"]:
"""Get the current environment type."""
...
async def get_dimensions(self) -> tuple[int, int]:
"""Get screen dimensions as (width, height)."""
...
async def screenshot(self) -> str:
"""Take a screenshot and return as base64 string."""
...
async def click(self, x: int, y: int, button: str = "left") -> None:
"""Click at coordinates with specified button."""
...
async def double_click(self, x: int, y: int) -> None:
"""Double click at coordinates."""
...
async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
"""Scroll at coordinates with specified scroll amounts."""
...
async def type(self, text: str) -> None:
"""Type text."""
...
async def wait(self, ms: int = 1000) -> None:
"""Wait for specified milliseconds."""
...
async def move(self, x: int, y: int) -> None:
"""Move cursor to coordinates."""
...
async def keypress(self, keys: List[str]) -> None:
"""Press key combination."""
...
async def drag(self, path: List[Dict[str, int]]) -> None:
"""Drag along specified path."""
...
async def get_current_url(self) -> str:
"""Get current URL (for browser environments)."""
...

View File

@@ -178,13 +178,20 @@ def create_computer_instance(
"""Create or get the global Computer instance."""
global global_computer
if global_computer is None:
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
provider_type=provider_type,
name=name if name else "",
api_key=api_key
)
if provider_type == "localhost":
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
use_host_computer_server=True
)
else:
global_computer = Computer(
verbosity=verbosity,
os_type=os_type,
provider_type=provider_type,
name=name if name else "",
api_key=api_key
)
return global_computer
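A usage sketch of the new localhost branch (keyword names are inferred from the body above; the "localhost"/"cloud" values mirror the provider list in the Gradio UI change below):

```python
# Connects to a computer server running on the host instead of provisioning a VM.
computer = create_computer_instance(os_type="linux", provider_type="localhost")

# Any other provider falls through to the regular constructor path.
cloud_computer = create_computer_instance(
    os_type="linux",
    provider_type="cloud",
    name="my-container",      # illustrative container name
    api_key="your-api-key",   # illustrative key
)
```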

View File

@@ -211,7 +211,7 @@ if __name__ == "__main__":
is_windows = platform.system().lower() == "windows"
is_mac = platform.system().lower() == "darwin"
providers = ["cloud"]
providers = ["cloud", "localhost"]
if is_mac:
providers += ["lume"]
if is_windows:
@@ -403,6 +403,23 @@ if __name__ == "__main__":
type="password",
)
# Provider visibility update function
def update_provider_visibility(provider):
"""Update visibility of container name and API key based on selected provider."""
is_localhost = provider == "localhost"
return [
gr.update(visible=not is_localhost), # container_name
gr.update(visible=not is_localhost and not has_cua_key) # cua_cloud_api_key
]
# Connect provider change event
computer_provider.change(
fn=update_provider_visibility,
inputs=[computer_provider],
outputs=[container_name, cua_cloud_api_key],
queue=False
)
# Connect UI update events
for dropdown in [agent_loop, omni_model_choice, uitars_model_choice, openai_model_choice, anthropic_model_choice]:
dropdown.change(

View File

@@ -0,0 +1,3 @@
output/
interactive_output/
*_results.md

View File

@@ -0,0 +1,68 @@
# Computer Agent Benchmarks
This directory contains benchmarks designed to test agent providers in the Computer Agent SDK against reference agent implementations.
## Overview
The benchmark system evaluates models on GUI grounding tasks, specifically click prediction accuracy. It supports both:
- **Computer Agent SDK providers** (using model strings like `"huggingface-local/HelloKKMe/GTA1-7B"`)
- **Reference agent implementations** (custom model classes implementing the `ModelProtocol`)
## Available Benchmarks
### 1. ScreenSpot-v2 (`ss-v2.py`)
- **Dataset**: ScreenSpot-v2 (click-only GUI grounding)
- **Format**: Standard resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage
### 2. ScreenSpot-Pro (`ss-pro.py`)
- **Dataset**: ScreenSpot-Pro (high-resolution click-only GUI grounding)
- **Format**: High-resolution screenshots
- **Task**: Predict click coordinates given an instruction and image
- **Metrics**: Accuracy, Error Rate, Timing, VRAM usage
### 3. Interactive Testing (`interactive.py`)
- **Real-time testing**: Take screenshots and visualize model predictions
- **Commands**:
- Type instruction → test all models on last screenshot
- `screenshot` → take screenshot
- `models` → list available models
- `quit`/`exit` → exit tool
- **Output**: Visual predictions with crosshairs for each model
## Running Benchmarks
### 1. Configure Models
Edit `utils.py` to specify which models you want to test in `get_available_models()`.
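For example, a pared-down `get_available_models()` mixing a provider string with a reference implementation (the identifiers below are the ones used elsewhere in this directory):

```python
from typing import List, Union

from models.base import ModelProtocol
from models.gta1 import GTA1Model  # reference implementation


def get_available_models() -> List[Union[str, ModelProtocol]]:
    return [
        # Computer Agent SDK provider string
        "huggingface-local/HelloKKMe/GTA1-7B",
        # Reference agent implementation (ModelProtocol)
        GTA1Model("HelloKKMe/GTA1-7B"),
    ]
```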
### 2. Run Benchmark
```bash
# ScreenSpot-v2 benchmark
python ss-v2.py --samples 50
# ScreenSpot-Pro benchmark
python ss-pro.py --samples 50
# Interactive testing
python interactive.py
```
## Output
### Console Output
```
Model Results:
Accuracy: 85.50% (171/200)
Avg Time: 1.23s (0.89s - 2.45s)
VRAM Usage: 4.5GB (max) / 3.4GB (avg)
```
### Generated Files
- **Markdown Report**: `*_results.md` with detailed results tables
- **Visualizations**: `output/` directory with prediction visualizations
- **Interactive Output**: `interactive_output/` for interactive session results
## Contributing
To add a new reference model, follow the instructions in [contrib.md](contrib.md).

View File

@@ -0,0 +1,163 @@
# Contributing Reference Agent Implementations
This guide explains how to add your own reference agent implementations to the benchmark system.
## Adding Reference Agent Implementations
### 1. Implement the ModelProtocol
Create a new file in the `models/` directory implementing the `ModelProtocol`:
```python
from models.base import ModelProtocol
from typing import Optional, Tuple
from PIL import Image
class YourModelName(ModelProtocol):
def __init__(self, model_path: str):
self.model_path = model_path
self._model = None
@property
def model_name(self) -> str:
return self.model_path
async def load_model(self) -> None:
"""Load the model into memory."""
# Your model loading logic here
pass
async def unload_model(self) -> None:
"""Unload the model from memory."""
# Your model cleanup logic here
pass
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
# Your prediction logic here
return (x, y) # Return predicted coordinates
```
### 2. Register Your Model
Add your model to the `get_available_models()` function in `utils.py`:
```python
def get_available_models() -> List[Union[str, ModelProtocol]]:
models = [
# Computer Agent SDK providers
"huggingface-local/HelloKKMe/GTA1-7B",
# Reference implementations
GTA1Model("HelloKKMe/GTA1-7B"),
YourModelName("path/to/your/model"), # Add your model here
]
return models
```
### 3. Test Your Implementation
Before submitting, test your model with the interactive tool:
```bash
python interactive.py
```
This will help you verify that your model loads correctly and produces reasonable predictions.
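You can also exercise the implementation directly with a few lines of asyncio, outside the interactive tool (the module name and screenshot path below are placeholders):

```python
import asyncio

from PIL import Image

from models.your_model import YourModelName  # adjust to your module name


async def smoke_test() -> None:
    model = YourModelName("path/to/your/model")
    await model.load_model()
    try:
        image = Image.open("screenshot.png")  # any test screenshot
        coords = await model.predict_click(image, "Click the Save button")
        print(f"{model.model_name} -> {coords}")
    finally:
        await model.unload_model()


asyncio.run(smoke_test())
```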
## Example: Adding a New Model
Here's a complete example of adding a hypothetical "MyVisionModel":
1. **Create `models/my_vision_model.py`:**
```python
import torch
from transformers import AutoModel, AutoProcessor
from models.base import ModelProtocol
from typing import Optional, Tuple
from PIL import Image
class MyVisionModel(ModelProtocol):
def __init__(self, model_path: str):
self.model_path = model_path
self.model = None
self.processor = None
@property
def model_name(self) -> str:
return f"MyVisionModel({self.model_path})"
async def load_model(self) -> None:
"""Load the model and processor."""
self.processor = AutoProcessor.from_pretrained(self.model_path)
self.model = AutoModel.from_pretrained(
self.model_path,
torch_dtype=torch.float16,
device_map="auto"
)
async def unload_model(self) -> None:
"""Clean up model resources."""
del self.model
del self.processor
self.model = None
self.processor = None
torch.cuda.empty_cache()
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""Predict click coordinates."""
try:
# Preprocess inputs
inputs = self.processor(
text=instruction,
images=image,
return_tensors="pt"
)
# Run inference
with torch.no_grad():
outputs = self.model(**inputs)
# Extract coordinates (model-specific logic)
x, y = self._extract_coordinates(outputs)
return (int(x), int(y))
except Exception as e:
print(f"Prediction failed: {e}")
return None
def _extract_coordinates(self, outputs):
"""Extract x, y coordinates from model outputs."""
# Your model-specific coordinate extraction logic
pass
```
2. **Update `models/__init__.py`:**
```python
from .gta1 import GTA1Model
from .my_vision_model import MyVisionModel
__all__ = ["GTA1Model", "MyVisionModel"]
```
3. **Update `utils.py`:**
```python
from models import GTA1Model, MyVisionModel
def get_available_models() -> List[Union[str, ModelProtocol]]:
models = [
"huggingface-local/HelloKKMe/GTA1-7B",
GTA1Model("HelloKKMe/GTA1-7B"),
MyVisionModel("my-org/my-vision-model"), # Add here
]
return models
```

View File

@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""
Interactive Click Prediction Tool
Takes screenshots and allows testing multiple models interactively.
Models are loaded/unloaded one at a time to avoid memory issues.
"""
import asyncio
import os
from datetime import datetime
from typing import List, Dict, Any
from utils import (
ModelWrapper,
take_screenshot,
save_prediction_visualization,
get_available_models
)
async def predict_with_all_models(image, instruction: str, models) -> List[Dict[str, Any]]:
"""
Predict click coordinates with all models sequentially.
Args:
image: PIL Image to analyze
instruction: Instruction text
models: List of model instances
Returns:
List of prediction results
"""
predictions = []
for model in models:
model_wrapper = ModelWrapper(model)
print(f"\n🔄 Loading {model_wrapper.model_name}...")
try:
# Load model
await model_wrapper.load_model()
# Predict
coords = await model_wrapper.predict_click(image, instruction)
predictions.append({
'model_name': model_wrapper.model_name,
'coords': coords,
'error': None
})
if coords:
print(f"{model_wrapper.model_name}: ({coords[0]}, {coords[1]})")
else:
print(f"{model_wrapper.model_name}: No prediction")
except Exception as e:
print(f"{model_wrapper.model_name}: ERROR - {str(e)}")
predictions.append({
'model_name': model_wrapper.model_name,
'coords': None,
'error': str(e)
})
finally:
# Always unload model to free memory
try:
await model_wrapper.unload_model()
print(f"🗑️ Unloaded {model_wrapper.model_name}")
except Exception as e:
print(f"⚠️ Error unloading {model_wrapper.model_name}: {e}")
return predictions
def print_header():
"""Print the interactive tool header."""
print("=" * 60)
print("🖱️ Interactive Click Prediction Tool")
print("=" * 60)
print("Commands:")
print(" • Type an instruction to test models on last screenshot")
print("'screenshot' - Take a new screenshot")
print("'models' - List available models")
print("'quit' or 'exit' - Exit the tool")
print("=" * 60)
print("💡 Tip: Take a screenshot first, then send instructions to test models!")
def print_models(models):
"""Print available models."""
print("\n📋 Available Models:")
for i, model in enumerate(models, 1):
if isinstance(model, str):
print(f" {i}. {model}")
else:
print(f" {i}. models.{model.__class__.__name__}")
async def main():
"""
Main interactive loop.
"""
print_header()
# Get available models
models = get_available_models()
print_models(models)
# Create output directory for visualizations
output_dir = "interactive_output"
os.makedirs(output_dir, exist_ok=True)
session_count = 0
last_screenshot = None
screenshot_timestamp = None
while True:
try:
# Get user input
print(f"\n{'='*40}")
user_input = input("🎯 Enter instruction (or command): ").strip()
if not user_input:
continue
# Handle commands
if user_input.lower() in ['quit', 'exit', 'q']:
print("👋 Goodbye!")
break
elif user_input.lower() == 'models':
print_models(models)
continue
elif user_input.lower() == 'screenshot':
print("📸 Taking screenshot...")
try:
last_screenshot = take_screenshot()
screenshot_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
screenshot_path = os.path.join(output_dir, f"screenshot_{screenshot_timestamp}.png")
last_screenshot.save(screenshot_path)
print(f"✅ Screenshot captured and saved to: {screenshot_path}")
print(f"📝 Ready for instructions! Screenshot size: {last_screenshot.size}")
except Exception as e:
print(f"❌ Error taking screenshot: {e}")
continue
# Handle instruction input
if last_screenshot is None:
print("⚠️ No screenshot available! Please take a screenshot first using 'screenshot' command.")
continue
session_count += 1
print(f"\n🎯 Session {session_count}: '{user_input}'")
print(f"📷 Using screenshot from: {screenshot_timestamp}")
# Predict with all models using last screenshot
print(f"\n🤖 Testing {len(models)} models on screenshot...")
predictions = await predict_with_all_models(last_screenshot, user_input, models)
# Display results summary
print(f"\n📊 Results Summary:")
print("-" * 50)
for pred in predictions:
if pred['coords']:
print(f"{pred['model_name']}: ({pred['coords'][0]}, {pred['coords'][1]})")
elif pred['error']:
print(f"{pred['model_name']}: ERROR - {pred['error']}")
else:
print(f"{pred['model_name']}: No prediction")
# Save visualization
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
vis_filename = f"session_{session_count:03d}_{timestamp}.png"
vis_path = os.path.join(output_dir, vis_filename)
try:
save_prediction_visualization(last_screenshot, user_input, predictions, vis_path)
print(f"\n💾 Visualization saved to: {vis_path}")
except Exception as e:
print(f"⚠️ Error saving visualization: {e}")
print(f"\n✨ Session {session_count} completed!")
except KeyboardInterrupt:
print("\n\n👋 Interrupted by user. Goodbye!")
break
except Exception as e:
print(f"\n❌ Unexpected error: {e}")
print("Continuing...")
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n👋 Goodbye!")
except Exception as e:
print(f"❌ Fatal error: {e}")

View File

@@ -0,0 +1,3 @@
from .base import ModelProtocol
__all__ = ["ModelProtocol"]

View File

@@ -0,0 +1,36 @@
"""
Base protocol for benchmark models.
"""
from typing import Protocol, Optional, Tuple
from PIL import Image
class ModelProtocol(Protocol):
"""Protocol for benchmark models that can predict click coordinates."""
@property
def model_name(self) -> str:
"""Return the name of the model."""
...
async def load_model(self) -> None:
"""Load the model into memory."""
...
async def unload_model(self) -> None:
"""Unload the model from memory."""
...
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
...

View File

@@ -0,0 +1,162 @@
"""
GTA1 model implementation for benchmarking.
"""
from typing import Optional, Tuple
from PIL import Image
import torch
import re
import gc
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from .base import ModelProtocol
class GTA1Model:
"""Ground truth GTA1 model implementation."""
def __init__(self, model_path: str = "HelloKKMe/GTA1-7B"):
self.model_path = model_path
self.model = None
self.processor = None
self.max_new_tokens = 32
self.system_prompt = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''.strip()
@property
def model_name(self) -> str:
"""Return the name of the model."""
return f"GTA1-{self.model_path.split('/')[-1]}"
async def load_model(self) -> None:
"""Load the model into memory."""
if self.model is None:
print(f"Loading GTA1 model: {self.model_path}")
self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
self.model_path,
torch_dtype=torch.bfloat16,
device_map="auto"
)
self.processor = AutoProcessor.from_pretrained(
self.model_path,
min_pixels=3136,
max_pixels=4096 * 2160
)
print("GTA1 model loaded successfully")
async def unload_model(self) -> None:
"""Unload the model from memory."""
if self.model is not None:
print("Unloading GTA1 model from GPU...")
del self.model
del self.processor
self.model = None
self.processor = None
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
print("GTA1 model unloaded")
def _extract_coordinates(self, raw_string: str) -> Tuple[int, int]:
"""Extract coordinates from model output."""
try:
matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
return tuple(map(int, map(float, matches[0]))) # type: ignore
except:
return (0, 0)
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""
Predict click coordinates for the given image and instruction.
Args:
image: PIL Image to analyze
instruction: Text instruction describing what to click
Returns:
Tuple of (x, y) coordinates or None if prediction fails
"""
if self.model is None or self.processor is None:
await self.load_model()
assert self.processor is not None
assert self.model is not None
try:
width, height = image.width, image.height
# Resize image according to processor requirements
resized_height, resized_width = smart_resize(
image.height,
image.width,
factor=self.processor.image_processor.patch_size * self.processor.image_processor.merge_size,
min_pixels=self.processor.image_processor.min_pixels,
max_pixels=self.processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height
# Prepare messages
system_message = {
"role": "system",
"content": self.system_prompt.format(height=resized_height, width=resized_width)
}
user_message = {
"role": "user",
"content": [
{"type": "image", "image": resized_image},
{"type": "text", "text": instruction}
]
}
# Process inputs
image_inputs, video_inputs = process_vision_info([system_message, user_message]) # type: ignore
text = self.processor.apply_chat_template(
[system_message, user_message],
tokenize=False,
add_generation_prompt=True
)
inputs = self.processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt"
)
inputs = inputs.to(self.model.device)
# Generate prediction
output_ids = self.model.generate(
**inputs,
max_new_tokens=self.max_new_tokens,
do_sample=False,
temperature=1.0,
use_cache=True
)
generated_ids = [
output_ids[len(input_ids):]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = self.processor.batch_decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=True
)[0]
# Extract and rescale coordinates
pred_x, pred_y = self._extract_coordinates(output_text)
pred_x = int(pred_x * scale_x)
pred_y = int(pred_y * scale_y)
return (pred_x, pred_y)
except Exception as e:
print(f"Error in GTA1 prediction: {e}")
return None

View File

@@ -0,0 +1,186 @@
#!/usr/bin/env python3
"""
ScreenSpot-Pro Benchmark Script
Evaluates models on the ScreenSpot-Pro dataset for click prediction accuracy.
Supports both ComputerAgent model strings and custom model classes.
"""
import argparse
import asyncio
import random
import statistics
import time
from typing import Optional
from datasets import load_dataset
from tqdm import tqdm
from utils import (
ModelWrapper,
is_click_in_bbox,
save_results_to_markdown,
save_visualizations,
get_available_models,
get_gpu_memory
)
async def evaluate_model(model_wrapper: ModelWrapper, dataset, max_samples: Optional[int] = None) -> dict:
"""
Evaluate a model on the ScreenSpot-Pro dataset.
Args:
model_wrapper: ModelWrapper instance
dataset: ScreenSpot-Pro dataset (list of samples)
max_samples: Maximum number of samples to evaluate (None for all)
Returns:
Dictionary with evaluation results
"""
print(f"\nEvaluating model: {model_wrapper.model_name}")
# Load model
await model_wrapper.load_model()
total_samples = len(dataset)
if max_samples is not None:
total_samples = min(max_samples, total_samples)
correct_predictions = 0
error_predictions = 0
results = []
for i in tqdm(range(total_samples), desc=f"Evaluating {model_wrapper.model_name}"):
sample = dataset[i]
# Extract sample data
image = sample['image']
instruction = sample['instruction']
bbox = sample['bbox'] # [x1, y1, x2, y2]
sample_id = sample['img_filename']
# Predict click coordinates with timing
start_time = time.time()
click_coords = await model_wrapper.predict_click(image, instruction)
prediction_time = time.time() - start_time
# Check if prediction is correct
is_correct = is_click_in_bbox(click_coords, bbox)
if is_correct:
correct_predictions += 1
results.append({
'sample_idx': i,  # required by the shared reporting/visualization utils
'id': sample_id,
'instruction': instruction,
'bbox': bbox,
'predicted_coords': click_coords,
'is_correct': is_correct,
'failed': False,
'prediction_time': prediction_time
})
# Unload model
await model_wrapper.unload_model()
# Calculate metrics
accuracy = correct_predictions / total_samples if total_samples > 0 else 0.0
error_rate = error_predictions / total_samples if total_samples > 0 else 0.0
# Calculate timing statistics
successful_times = [r['prediction_time'] for r in results if not r['failed']]
avg_prediction_time = sum(successful_times) / len(successful_times) if successful_times else 0.0
median_prediction_time = statistics.median(successful_times) if successful_times else 0.0
min_prediction_time = min(successful_times) if successful_times else 0.0
max_prediction_time = max(successful_times) if successful_times else 0.0
# Get VRAM statistics
vram_stats = model_wrapper.get_vram_stats()
return {
'model_name': model_wrapper.model_name,
'total_samples': total_samples,
'correct_predictions': correct_predictions,
'failed_predictions': error_predictions,
'accuracy': accuracy,
'failure_rate': error_rate,
'avg_prediction_time': avg_prediction_time,
'median_prediction_time': median_prediction_time,
'min_prediction_time': min_prediction_time,
'max_prediction_time': max_prediction_time,
'vram_max_mb': vram_stats['max_mb'],
'vram_avg_mb': vram_stats['avg_mb'],
'results': results
}
async def main():
"""
Main function to run the benchmark.
"""
# Parse command line arguments
parser = argparse.ArgumentParser(description='ScreenSpot-Pro Benchmark Script')
parser.add_argument('--samples', type=int, default=300,
help='Number of samples to evaluate (default: 300)')
parser.add_argument('--seed', type=int, default=42,
help='Random seed for shuffling (default: 42)')
args = parser.parse_args()
# Set random seed
random.seed(args.seed)
# Load dataset
print("Loading ScreenSpot-Pro dataset...")
ds = load_dataset("lmms-lab/ScreenSpot-Pro")
dataset = ds['train'] # type: ignore
# Convert to list to support indexing
dataset_list = list(dataset)
print(f"Dataset loaded: {len(dataset_list)} samples")
# Shuffle dataset with seed
random.shuffle(dataset_list)
print(f"Dataset shuffled with seed {args.seed}")
# Get available models
models = get_available_models()
# Evaluation settings
max_samples = args.samples # Use command line argument
# Run evaluations
all_results = []
for model in models:
model_wrapper = ModelWrapper(model)
result = await evaluate_model(model_wrapper, dataset_list, max_samples)
all_results.append(result)
# Print summary
print(f"\n{result['model_name']} Results:")
print(f" Accuracy: {result['accuracy']*100:.2f}%")
print(f" Correct: {result['correct_predictions']}/{result['total_samples']}")
print(f" Errors: {result['failed_predictions']}")
print(f" Error Rate: {result['failure_rate']*100:.2f}%")
print(f" Avg Time: {result['avg_prediction_time']:.2f}s")
print(f" Median Time: {result['median_prediction_time']:.2f}s")
print(f" Time Range: {result['min_prediction_time']:.2f}s - {result['max_prediction_time']:.2f}s")
print(f" VRAM Max: {result['vram_max_mb']:.1f}MB")
print(f" VRAM Avg: {result['vram_avg_mb']:.1f}MB")
# Print GPU memory info
gpu_memory = get_gpu_memory()
if gpu_memory and gpu_memory[0] > 0:
print(f" GPU Free Memory: {gpu_memory[0]:.1f}MB")
# Save results
if all_results:
save_results_to_markdown(all_results)
save_visualizations(all_results, dataset_list)
print("\nBenchmark completed successfully!")
else:
print("\nNo successful evaluations completed.")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,206 @@
#!/usr/bin/env python3
"""
ScreenSpot-v2 Benchmark Script
Evaluates models on the ScreenSpot-v2 dataset for click prediction accuracy.
Supports both ComputerAgent model strings and custom model classes.
"""
import argparse
import asyncio
import random
import statistics
import time
from typing import Optional
from datasets import load_dataset
from tqdm import tqdm
from utils import (
ModelWrapper,
is_click_in_bbox,
save_results_to_markdown,
save_visualizations,
get_available_models,
get_gpu_memory
)
async def evaluate_model(model_wrapper: ModelWrapper, samples, max_samples: Optional[int] = None) -> dict:
"""
Evaluate a model on any iterable of samples.
Args:
model_wrapper: ModelWrapper instance
samples: Iterable of dicts with keys: image, bbox, instruction
max_samples: Maximum number of samples to evaluate (None for all)
Returns:
Dictionary with evaluation results
"""
print(f"\nEvaluating model: {model_wrapper.model_name}")
# Load model
await model_wrapper.load_model()
# Convert to list if needed and limit samples
if hasattr(samples, '__len__'):
total_samples = len(samples)
if max_samples is not None:
total_samples = min(max_samples, total_samples)
sample_list = list(samples)[:total_samples]
else:
# For iterators, take max_samples or all
sample_list = list(samples)
if max_samples is not None:
sample_list = sample_list[:max_samples]
total_samples = len(sample_list)
correct_predictions = 0
error_predictions = 0
results = []
for i, sample in enumerate(tqdm(sample_list, desc=f"Evaluating {model_wrapper.model_name}")):
# Extract required data (only these 3 keys matter)
image = sample['image']
instruction = sample['instruction']
bbox = sample['bbox'] # [x1, y1, x2, y2]
# Predict click coordinates with timing
start_time = time.time()
click_coords = await model_wrapper.predict_click(image, instruction)
prediction_time = time.time() - start_time
# Check if prediction is correct
is_correct = is_click_in_bbox(click_coords, bbox)
if is_correct:
correct_predictions += 1
results.append({
'sample_idx': i,
'instruction': instruction,
'bbox': bbox,
'predicted_coords': click_coords,
'is_correct': is_correct,
'failed': False,
'prediction_time': prediction_time
})
# Unload model
await model_wrapper.unload_model()
# Calculate metrics
accuracy = correct_predictions / total_samples if total_samples > 0 else 0.0
error_rate = error_predictions / total_samples if total_samples > 0 else 0.0
# Calculate timing statistics
successful_times = [r['prediction_time'] for r in results if not r['failed']]
avg_prediction_time = sum(successful_times) / len(successful_times) if successful_times else 0.0
median_prediction_time = statistics.median(successful_times) if successful_times else 0.0
min_prediction_time = min(successful_times) if successful_times else 0.0
max_prediction_time = max(successful_times) if successful_times else 0.0
# Get VRAM statistics
vram_stats = model_wrapper.get_vram_stats()
return {
'model_name': model_wrapper.model_name,
'total_samples': total_samples,
'correct_predictions': correct_predictions,
'failed_predictions': error_predictions,
'accuracy': accuracy,
'failure_rate': error_rate,
'avg_prediction_time': avg_prediction_time,
'median_prediction_time': median_prediction_time,
'min_prediction_time': min_prediction_time,
'max_prediction_time': max_prediction_time,
'vram_max_mb': vram_stats['max_mb'],
'vram_avg_mb': vram_stats['avg_mb'],
'results': results
}
async def main():
"""
Main function to run the benchmark.
"""
# Parse command line arguments
parser = argparse.ArgumentParser(description='ScreenSpot-v2 Benchmark Script')
parser.add_argument('--samples', type=int, default=500,
help='Number of samples to evaluate (default: 500)')
parser.add_argument('--seed', type=int, default=42,
help='Random seed for shuffling (default: 42)')
args = parser.parse_args()
# Set random seed
random.seed(args.seed)
# Load dataset
print("Loading ScreenSpot-v2 dataset...")
ds = load_dataset("lmms-lab/ScreenSpot-v2")
dataset = ds['train'] # type: ignore
# Convert to simple list of dicts with only required keys
samples = []
for item in dataset:
# Convert dataset item to dict if needed
item_dict = dict(item) if hasattr(item, 'keys') else item
# Convert ScreenSpot-v2 bbox format [x, y, w, h] to [x1, y1, x2, y2]
bbox_xywh = item_dict['bbox'] # type: ignore
x, y, w, h = bbox_xywh
bbox_xyxy = [x, y, x + w, y + h]
samples.append({
'image': item_dict['image'], # type: ignore
'instruction': item_dict['instruction'], # type: ignore
'bbox': bbox_xyxy
})
print(f"Dataset loaded: {len(samples)} samples")
# Shuffle samples with seed
random.shuffle(samples)
print(f"Samples shuffled with seed {args.seed}")
# Get available models
models = get_available_models()
# Evaluation settings
max_samples = args.samples # Use command line argument
# Run evaluations
all_results = []
for model in models:
model_wrapper = ModelWrapper(model)
result = await evaluate_model(model_wrapper, samples, max_samples)
all_results.append(result)
# Print summary
print(f"\n{result['model_name']} Results:")
print(f" Accuracy: {result['accuracy']*100:.2f}%")
print(f" Correct: {result['correct_predictions']}/{result['total_samples']}")
print(f" Errors: {result['failed_predictions']}")
print(f" Error Rate: {result['failure_rate']*100:.2f}%")
print(f" Avg Time: {result['avg_prediction_time']:.2f}s")
print(f" Median Time: {result['median_prediction_time']:.2f}s")
print(f" Time Range: {result['min_prediction_time']:.2f}s - {result['max_prediction_time']:.2f}s")
print(f" VRAM Max: {result['vram_max_mb']:.1f}MB")
print(f" VRAM Avg: {result['vram_avg_mb']:.1f}MB")
# Print GPU memory info
gpu_memory = get_gpu_memory()
if gpu_memory and gpu_memory[0] > 0:
print(f" GPU Free Memory: {gpu_memory[0]:.1f}MB")
# Save results
if all_results:
save_results_to_markdown(all_results, "screenspot_v2_results.md", title="ScreenSpot-v2 Benchmark Results")
save_visualizations(all_results, samples)
print("\nBenchmark completed successfully!")
else:
print("\nNo successful evaluations completed.")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,409 @@
#!/usr/bin/env python3
"""
Shared utilities for ScreenSpot-Pro benchmarking and interactive testing.
"""
import dotenv
dotenv.load_dotenv()
import asyncio
import base64
import os
import sys
import subprocess as sp
import statistics
from datetime import datetime
from io import BytesIO
from typing import List, Union, Tuple, Optional
from PIL import Image, ImageDraw
from tqdm import tqdm
import gc
import torch
# Add parent directory to path for imports
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
from agent.agent import ComputerAgent
from models.base import ModelProtocol
def get_gpu_memory() -> List[int]:
"""
Get GPU memory usage using nvidia-smi.
Returns:
List of free memory values in MB for each GPU
"""
try:
command = "nvidia-smi --query-gpu=memory.free --format=csv"
memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
memory_free_values = [int(x.split()[0]) for x in memory_free_info]
return memory_free_values
except (sp.CalledProcessError, FileNotFoundError, IndexError):
# Fallback to torch if nvidia-smi is not available
if torch.cuda.is_available():
device = torch.cuda.current_device()
total = torch.cuda.get_device_properties(device).total_memory / 1024 / 1024
reserved = torch.cuda.memory_reserved(device) / 1024 / 1024
return [int(total - reserved)]
return [0]
def get_vram_usage() -> dict:
"""
Get current VRAM usage statistics.
Returns:
Dictionary with VRAM usage info (in MB)
"""
if torch.cuda.is_available():
device = torch.cuda.current_device()
allocated = torch.cuda.memory_allocated(device) / 1024 / 1024 # Convert to MB
reserved = torch.cuda.memory_reserved(device) / 1024 / 1024 # Convert to MB
total = torch.cuda.get_device_properties(device).total_memory / 1024 / 1024
return {
'allocated_mb': allocated,
'reserved_mb': reserved,
'total_mb': total,
'free_mb': total - reserved
}
else:
return {
'allocated_mb': 0.0,
'reserved_mb': 0.0,
'total_mb': 0.0,
'free_mb': 0.0
}
def get_available_models() -> List[Union[str, ModelProtocol]]:
"""
Get list of available models for testing.
Returns:
List of model strings and model classes
"""
local_provider = "huggingface-local/" # Options: huggingface-local/ or mlx/
# from models.gta1 import GTA1Model
models = [
# === ComputerAgent model strings ===
"openai/computer-use-preview",
"anthropic/claude-opus-4-20250514",
# f"{local_provider}HelloKKMe/GTA1-7B",
# f"{local_provider}HelloKKMe/GTA1-32B",
"openai/computer-use-preview+openai/gpt-4o-mini",
"anthropic/claude-opus-4-20250514+openai/gpt-4o-mini",
# === Reference model classes ===
# GTA1Model("HelloKKMe/GTA1-7B"),
# GTA1Model("HelloKKMe/GTA1-32B"),
]
return models
def is_click_in_bbox(click_coords: Optional[Tuple[int, int]], bbox: List[int]) -> bool:
"""
Check if click coordinates are within the bounding box.
Args:
click_coords: (x, y) coordinates or None
bbox: [x1, y1, x2, y2] bounding box
Returns:
True if click is within bbox, False otherwise
"""
if click_coords is None:
return False
x, y = click_coords
x1, y1, x2, y2 = bbox
return x1 <= x <= x2 and y1 <= y <= y2
def image_to_base64(image: Image.Image) -> str:
"""
Convert PIL Image to base64 string.
Args:
image: PIL Image
Returns:
Base64 encoded image string
"""
buffered = BytesIO()
image.save(buffered, format="PNG")
return base64.b64encode(buffered.getvalue()).decode()
class ModelWrapper:
"""
Wrapper to provide unified interface for both ComputerAgent and custom models.
"""
def __init__(self, model: Union[str, ModelProtocol]):
self.model = model
self.is_computer_agent = isinstance(model, str)
self.agent: Optional[ComputerAgent] = None
self.vram_usage_history: List[float] = [] # Track VRAM usage over time
if self.is_computer_agent:
self.model_name = str(model)
else:
self.model_name = f"{model.__class__.__name__}('{getattr(model, 'model_name', 'unknown')}')"
async def load_model(self) -> None:
"""Load the model."""
if self.is_computer_agent:
self.agent = ComputerAgent(model=str(self.model))
else:
await self.model.load_model() # type: ignore
# Record initial VRAM usage after loading
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
async def unload_model(self) -> None:
"""Unload the model."""
if not self.is_computer_agent:
await self.model.unload_model() # type: ignore
else:
del self.agent
self.agent = None
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
# Record VRAM usage after unloading
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
def get_vram_stats(self) -> dict:
"""Get VRAM usage statistics for this model."""
if not self.vram_usage_history:
return {'max_mb': 0.0, 'avg_mb': 0.0}
return {
'max_mb': max(self.vram_usage_history),
'avg_mb': sum(self.vram_usage_history) / len(self.vram_usage_history)
}
async def predict_click(self, image: Image.Image, instruction: str) -> Optional[Tuple[int, int]]:
"""Predict click coordinates."""
# Record VRAM usage before prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
if self.is_computer_agent:
if self.agent is None:
await self.load_model()
if self.agent is not None:
image_b64 = image_to_base64(image)
result = await self.agent.predict_click(instruction=instruction, image_b64=image_b64)
# Record VRAM usage after prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
return result
return None
else:
result = await self.model.predict_click(image, instruction) # type: ignore
# Record VRAM usage after prediction
vram_info = get_vram_usage()
self.vram_usage_history.append(vram_info['allocated_mb'])
return result
def save_results_to_markdown(all_results: List[dict], output_file: str = "screenspot_pro_results.md", title: str = "ScreenSpot-Pro Benchmark Results") -> None:
"""
Save evaluation results to a markdown table.
Args:
all_results: List of evaluation results for each model
output_file: Output markdown file path
title: Title used for the report's top-level heading
"""
with open(output_file, 'w', encoding='utf-8') as f:
f.write(f"# {title}\n\n")
f.write(f"**Evaluation Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
# Summary table
f.write("## Summary\n\n")
f.write("| Model | Total Samples | Correct | Errors | Accuracy | Error Rate | Avg Time (s) | Median Time (s) | Time Range (s) | VRAM Max (GB) | VRAM Avg (GB) |\n")
f.write("|-------|---------------|---------|--------|----------|------------|--------------|-----------------|----------------|---------------|---------------|\n")
for result in all_results:
model_name = result['model_name']
total = result['total_samples']
correct = result['correct_predictions']
errors = result['failed_predictions']
accuracy = result['accuracy'] * 100
error_rate = result['failure_rate'] * 100
avg_time = result.get('avg_prediction_time', 0.0)
median_time = result.get('median_prediction_time', 0.0)
min_time = result.get('min_prediction_time', 0.0)
max_time = result.get('max_prediction_time', 0.0)
time_range = f"{min_time:.2f} - {max_time:.2f}"
vram_max = result.get('vram_max_mb', 0.0) / 1024
vram_avg = result.get('vram_avg_mb', 0.0) / 1024
f.write(f"| {model_name} | {total} | {correct} | {errors} | {accuracy:.2f}% | {error_rate:.2f}% | {avg_time:.2f} | {median_time:.2f} | {time_range} | {vram_max:.1f} | {vram_avg:.1f} |\n")
# Detailed results for each model
for result in all_results:
f.write(f"\n## {result['model_name']} - Detailed Results\n\n")
f.write("| Sample Index | Instruction | BBox | Predicted | Correct | Error | Time (s) |\n")
f.write("|-----------|-------------|------|-----------|---------|-------|----------|\n")
for sample_result in result['results'][:10]: # Show first 10 samples
sample_idx = sample_result['sample_idx']
instruction = sample_result['instruction'][:50] + "..." if len(sample_result['instruction']) > 50 else sample_result['instruction']
bbox = str(sample_result['bbox'])
predicted = str(sample_result['predicted_coords']) if sample_result['predicted_coords'] else "None"
correct = "PASS" if sample_result['is_correct'] else "FAIL"
error = "YES" if sample_result['failed'] else "NO"
pred_time = sample_result.get('prediction_time', 0.0)
f.write(f"| {sample_idx} | {instruction} | {bbox} | {predicted} | {correct} | {error} | {pred_time:.2f} |\n")
if len(result['results']) > 10:
f.write(f"\n*Showing first 10 of {len(result['results'])} samples*\n")
print(f"\nResults saved to: {output_file}")
def save_visualizations(all_results: List[dict], samples, output_dir: str = "output") -> None:
"""
Save visualizations of predicted coordinates vs bboxes to an output folder.
Args:
all_results: List of evaluation results for each model
samples: List of sample dicts with image, bbox, instruction keys
output_dir: Output directory path
"""
os.makedirs(output_dir, exist_ok=True)
for result in all_results:
model_name = result['model_name'].replace('/', '_').replace('\\', '_')
model_dir = os.path.join(output_dir, model_name)
os.makedirs(model_dir, exist_ok=True)
print(f"Saving visualizations for {result['model_name']}...")
# Save first 10 samples for visualization
for i, sample_result in enumerate(tqdm(result['results'][:10], desc=f"Saving {model_name} visualizations")):
# Get sample data using index
sample_idx = sample_result['sample_idx']
if sample_idx < len(samples):
sample = samples[sample_idx]
image = sample['image'].copy() # Make a copy to avoid modifying original
else:
print(f"Warning: Could not find sample at index {sample_idx}")
continue
bbox = sample_result['bbox']
predicted_coords = sample_result['predicted_coords']
is_correct = sample_result['is_correct']
# Draw on image
draw = ImageDraw.Draw(image)
# Draw bounding box (ground truth) in green
x1, y1, x2, y2 = bbox
draw.rectangle([x1, y1, x2, y2], outline="green", width=3)
draw.text((x1, y1-20), "Ground Truth", fill="green")
# Draw predicted click in red or blue
if predicted_coords is not None:
px, py = predicted_coords
color = "blue" if is_correct else "red"
# Draw crosshair
crosshair_size = 15
draw.line([(px-crosshair_size, py), (px+crosshair_size, py)], fill=color, width=3)
draw.line([(px, py-crosshair_size), (px, py+crosshair_size)], fill=color, width=3)
draw.text((px+10, py-20), f"Predicted ({px},{py})", fill=color)
# Add status text
status = "CORRECT" if is_correct else "INCORRECT"
status_color = "blue" if is_correct else "red"
draw.text((10, 10), f"Status: {status}", fill=status_color)
draw.text((10, 30), f"Instruction: {sample_result['instruction'][:50]}...", fill="black")
# Save image
filename = f"sample_{i+1:02d}_idx{sample_idx}_{status.lower()}.png"
filepath = os.path.join(model_dir, filename)
image.save(filepath)
print(f"Visualizations saved to: {model_dir}")
def save_prediction_visualization(image: Image.Image, instruction: str, predictions: List[dict],
output_file: str = "interactive_prediction.png") -> None:
"""
Save visualization of multiple model predictions on a single image.
Args:
image: PIL Image to visualize
instruction: Instruction text
predictions: List of prediction dicts with keys: model_name, coords, error
output_file: Output file path
"""
# Create a copy of the image
vis_image = image.copy()
draw = ImageDraw.Draw(vis_image)
# Colors for different models
colors = ["red", "blue", "orange", "purple", "brown", "pink", "gray", "olive"]
# Draw predictions
for i, pred in enumerate(predictions):
color = colors[i % len(colors)]
model_name = pred['model_name']
coords = pred.get('coords')
error = pred.get('error')
if coords is not None:
px, py = coords
# Draw crosshair
crosshair_size = 20
draw.line([(px-crosshair_size, py), (px+crosshair_size, py)], fill=color, width=4)
draw.line([(px, py-crosshair_size), (px, py+crosshair_size)], fill=color, width=4)
# Draw model name
draw.text((px+15, py+15), f"{model_name}: ({px},{py})", fill=color)
else:
# Draw error text
draw.text((10, 50 + i*20), f"{model_name}: ERROR - {error}", fill=color)
# Add instruction at the top
draw.text((10, 10), f"Instruction: {instruction}", fill="black")
# Save image
vis_image.save(output_file)
print(f"Prediction visualization saved to: {output_file}")
def take_screenshot() -> Image.Image:
"""
Take a screenshot of the current screen.
Returns:
PIL Image of the screenshot
"""
try:
import pyautogui
screenshot = pyautogui.screenshot()
return screenshot
except ImportError:
print("pyautogui not installed. Please install it with: pip install pyautogui")
raise
except Exception as e:
print(f"Error taking screenshot: {e}")
raise

View File

@@ -5,8 +5,7 @@ Example usage of the agent library with docstring-based tool definitions.
import asyncio
import logging
from agent import agent_loop, ComputerAgent
from agent.types import Messages
from agent import ComputerAgent
from computer import Computer
from computer.helpers import sandboxed

View File

@@ -19,10 +19,10 @@ dependencies = [
"pydantic>=2.6.4",
"rich>=13.7.1",
"python-dotenv>=1.0.1",
"cua-computer>=0.3.0,<0.5.0",
"cua-computer>=0.4.0,<0.5.0",
"cua-core>=0.1.8,<0.2.0",
"certifi>=2024.2.2",
"litellm>=1.74.8"
"litellm>=1.74.12"
]
requires-python = ">=3.11"
@@ -38,8 +38,15 @@ uitars-mlx = [
"mlx-vlm>=0.1.27; sys_platform == 'darwin'"
]
uitars-hf = [
"accelerate",
"torch",
"transformers>=4.54.0"
]
glm45v-hf = [
"accelerate",
"torch",
"transformers-v4.55.0-GLM-4.5V-preview"
]
ui = [
"gradio>=5.23.3",
"python-dotenv>=1.0.1",
@@ -47,18 +54,25 @@ ui = [
cli = [
"yaspin>=3.1.0",
]
hud = [
"hud-python==0.2.10",
]
all = [
# omni requirements
"ultralytics>=8.0.0",
"cua-som>=0.1.0,<0.2.0",
# uitars requirements
"mlx-vlm>=0.1.27; sys_platform == 'darwin'",
"accelerate",
"torch",
"transformers>=4.54.0",
# ui requirements
"gradio>=5.23.3",
"python-dotenv>=1.0.1",
# cli requirements
"yaspin>=3.1.0",
# hud requirements
"hud-python==0.2.10",
]
[tool.uv]

View File

@@ -23,6 +23,7 @@ logger = logging.getLogger(__name__)
# This allows the server to run in headless environments
try:
import pyautogui
pyautogui.FAILSAFE = False
logger.info("pyautogui successfully imported, GUI automation available")
except Exception as e:

View File

@@ -1,4 +1,5 @@
import pyautogui
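# Disable PyAutoGUI's fail-safe: moving the mouse into a screen corner would otherwise abort automation with FailSafeException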
pyautogui.FAILSAFE = False
from pynput.mouse import Button, Controller as MouseController
from pynput.keyboard import Key, Controller as KeyboardController
import time

View File

@@ -18,6 +18,7 @@ logger = logging.getLogger(__name__)
# Try to import pyautogui
try:
import pyautogui
pyautogui.FAILSAFE = False
logger.info("pyautogui successfully imported, GUI automation available")
except Exception as e:
logger.error(f"pyautogui import failed: {str(e)}. GUI operations will not work.")

View File

@@ -4,7 +4,7 @@ build-backend = "pdm.backend"
[project]
name = "cua-computer"
version = "0.3.0"
version = "0.4.0"
description = "Computer-Use Interface (CUI) framework powering Cua"
readme = "README.md"
authors = [

View File

@@ -16,6 +16,21 @@
</div>
**cua-mcp-server** is an MCP server for the Computer-Use Agent (CUA), allowing you to run CUA through Claude Desktop or other MCP clients.
## LiteLLM Integration
This MCP server features comprehensive liteLLM integration, allowing you to use any supported LLM provider with a simple model string configuration.
- **Unified Configuration**: Use a single `CUA_MODEL_NAME` environment variable with a model string
- **Automatic Provider Detection**: The agent automatically detects the provider and capabilities from the model string
- **Extensive Provider Support**: Works with Anthropic, OpenAI, local models, and any liteLLM-compatible provider
### Model String Examples:
- **Anthropic**: `"anthropic/claude-3-5-sonnet-20241022"`
- **OpenAI**: `"openai/computer-use-preview"`
- **UI-TARS**: `"huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"`
- **Omni + Any LiteLLM**: `"omniparser+litellm/gpt-4o"`, `"omniparser+litellm/claude-3-haiku"`, `"omniparser+ollama_chat/gemma3"`
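Under the hood, the server simply forwards this model string to the Agent SDK; a simplified sketch of what `CUA_MODEL_NAME` turns into (see the server changes later in this diff):

```python
import logging
import os

from agent import ComputerAgent
from computer import Computer

model_name = os.getenv("CUA_MODEL_NAME", "anthropic/claude-3-5-sonnet-20241022")
computer = Computer(verbosity=logging.INFO)

agent = ComputerAgent(
    model=model_name,
    tools=[computer],
    only_n_most_recent_images=int(os.getenv("CUA_MAX_IMAGES", "3")),
)
# (the real server also awaits computer.run() before calling agent.run(messages))
```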
### Get started with Agent
## Prerequisites
@@ -65,10 +80,7 @@ You can then use the script in your MCP configuration like this:
"command": "/bin/bash",
"args": ["~/.cua/start_mcp_server.sh"],
"env": {
"CUA_AGENT_LOOP": "OMNI",
"CUA_MODEL_PROVIDER": "ANTHROPIC",
"CUA_MODEL_NAME": "claude-3-7-sonnet-20250219",
"CUA_PROVIDER_API_KEY": "your-api-key"
"CUA_MODEL_NAME": "anthropic/claude-3-5-sonnet-20241022"
}
}
}
@@ -86,11 +98,7 @@ If you want to develop with the cua-mcp-server directly without installation, yo
"command": "/bin/bash",
"args": ["~/cua/libs/python/mcp-server/scripts/start_mcp_server.sh"],
"env": {
"CUA_AGENT_LOOP": "UITARS",
"CUA_MODEL_PROVIDER": "OAICOMPAT",
"CUA_MODEL_NAME": "ByteDance-Seed/UI-TARS-1.5-7B",
"CUA_PROVIDER_BASE_URL": "https://****************.us-east-1.aws.endpoints.huggingface.cloud/v1",
"CUA_PROVIDER_API_KEY": "your-api-key"
"CUA_MODEL_NAME": "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
}
}
}
@@ -142,10 +150,7 @@ The server is configured using environment variables (can be set in the Claude D
| Variable | Description | Default |
|----------|-------------|---------|
| `CUA_AGENT_LOOP` | Agent loop to use (OPENAI, ANTHROPIC, UITARS, OMNI) | OMNI |
| `CUA_MODEL_PROVIDER` | Model provider (ANTHROPIC, OPENAI, OLLAMA, OAICOMPAT) | ANTHROPIC |
| `CUA_MODEL_NAME` | Model name to use | None (provider default) |
| `CUA_PROVIDER_BASE_URL` | Base URL for provider API | None |
| `CUA_MODEL_NAME` | Model string (e.g., "anthropic/claude-3-5-sonnet-20241022", "openai/computer-use-preview", "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", "omniparser+litellm/gpt-4o", "omniparser+ollama_chat/gemma3") | anthropic/claude-3-5-sonnet-20241022 |
| `CUA_MAX_IMAGES` | Maximum number of images to keep in context | 3 |
## Available Tools

View File

@@ -3,6 +3,7 @@ import base64
import logging
import os
import sys
from tabnanny import verbose
import traceback
from typing import Any, Dict, List, Optional, Union, Tuple
@@ -28,7 +29,7 @@ except ImportError as e:
try:
from computer import Computer
from agent import ComputerAgent, LLMProvider, LLM, AgentLoop
from agent import ComputerAgent
logger.debug("Successfully imported Computer and Agent modules")
except ImportError as e:
@@ -92,49 +93,27 @@ def serve() -> FastMCP:
global_computer = Computer(verbosity=logging.INFO)
await global_computer.run()
# Determine which loop to use
loop_str = os.getenv("CUA_AGENT_LOOP", "OMNI")
loop = getattr(AgentLoop, loop_str)
# Get model name - this now determines the loop and provider
model_name = os.getenv("CUA_MODEL_NAME", "anthropic/claude-3-5-sonnet-20241022")
logger.info(f"Using model: {model_name}")
# Determine provider
provider_str = os.getenv("CUA_MODEL_PROVIDER", "ANTHROPIC")
provider = getattr(LLMProvider, provider_str)
# Get model name (if specified)
model_name = os.getenv("CUA_MODEL_NAME", None)
# Get base URL for provider (if needed)
provider_base_url = os.getenv("CUA_PROVIDER_BASE_URL", None)
# Get api key for provider (if needed)
api_key = os.getenv("CUA_PROVIDER_API_KEY", None)
# Create agent with the specified configuration
# Create agent with the new v0.4.x API
agent = ComputerAgent(
computer=global_computer,
loop=loop,
model=LLM(
provider=provider,
name=model_name,
provider_base_url=provider_base_url,
),
api_key=api_key,
save_trajectory=False,
model=model_name,
only_n_most_recent_images=int(os.getenv("CUA_MAX_IMAGES", "3")),
verbosity=logging.INFO,
tools=[global_computer]
)
# Create messages in the new v0.4.x format
messages = [{"role": "user", "content": task}]
# Collect all results
full_result = ""
async for result in agent.run(task):
logger.info(f"Agent step complete: {result.get('id', 'unknown')}")
ctx.info(f"Agent step complete: {result.get('id', 'unknown')}")
# Add response ID to output
full_result += f"\n[Response ID: {result.get('id', 'unknown')}]\n"
if "content" in result:
full_result += f"Response: {result.get('content', '')}\n"
async for result in agent.run(messages):
logger.info(f"Agent processing step")
ctx.info(f"Agent processing step")
# Process output if available
outputs = result.get("output", [])
@@ -145,25 +124,23 @@ def serve() -> FastMCP:
content = output.get("content", [])
for content_part in content:
if content_part.get("text"):
full_result += f"\nMessage: {content_part.get('text', '')}\n"
elif output_type == "reasoning":
logger.debug(f"Reasoning: {output}")
summary_content = output.get("summary", [])
if summary_content:
for summary_part in summary_content:
if summary_part.get("text"):
full_result += f"\nReasoning: {summary_part.get('text', '')}\n"
full_result += f"Message: {content_part.get('text', '')}\n"
elif output_type == "tool_use":
logger.debug(f"Tool use: {output}")
tool_name = output.get("name", "")
full_result += f"Tool: {tool_name}\n"
elif output_type == "tool_result":
logger.debug(f"Tool result: {output}")
result_content = output.get("content", "")
if isinstance(result_content, list):
for item in result_content:
if item.get("type") == "text":
full_result += f"Result: {item.get('text', '')}\n"
else:
full_result += f"\nReasoning: {output.get('text', output.get('content', ''))}\n"
elif output_type == "computer_call":
logger.debug(f"Computer call: {output}")
action = output.get("action", "")
result_value = output.get("result", "")
full_result += f"\nComputer Action: {action}\nResult: {result_value}\n"
full_result += f"Result: {result_content}\n"
# Add separator between steps
full_result += "\n" + "-" * 40 + "\n"
full_result += "\n" + "-" * 20 + "\n"
logger.info(f"CUA task completed successfully")
ctx.info(f"CUA task completed successfully")
@@ -179,7 +156,21 @@ def serve() -> FastMCP:
error_msg = f"Error running CUA task: {str(e)}\n{traceback.format_exc()}"
logger.error(error_msg)
ctx.error(error_msg)
return f"Error during task execution: {str(e)}"
# Return tuple with error message and a screenshot if possible
try:
if global_computer is not None:
screenshot = await global_computer.interface.screenshot()
return (
f"Error during task execution: {str(e)}",
Image(format="png", data=screenshot)
)
except:
pass
# If we can't get a screenshot, return a placeholder
return (
f"Error during task execution: {str(e)}",
Image(format="png", data=b"")
)
@server.tool()
async def run_multi_cua_tasks(ctx: Context, tasks: List[str]) -> List:

View File

@@ -13,8 +13,8 @@ authors = [
]
dependencies = [
"mcp>=1.6.0,<2.0.0",
"cua-agent[all]>=0.3.0,<0.4.0",
"cua-computer>=0.3.0,<0.4.0",
"cua-agent[all]>=0.4.0,<0.5.0",
"cua-computer>=0.4.0,<0.5.0",
]
[project.scripts]

View File

@@ -379,7 +379,7 @@
"metadata": {},
"outputs": [],
"source": [
"from agent.ui.gradio.app import create_gradio_ui\n",
"from agent.ui.gradio.ui_components import create_gradio_ui\n",
"\n",
"app = create_gradio_ui()\n",
"app.launch(share=False)"

110050
notebooks/eval_osworld.ipynb Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -257,7 +257,7 @@ from pathlib import Path
from dotenv import load_dotenv
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
# Load environment variables from .env.local
load_dotenv(Path(__file__).parent / ".env.local")
@@ -292,7 +292,7 @@ from pathlib import Path
from dotenv import load_dotenv
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
from agent.ui.gradio.app import create_gradio_ui
from agent.ui.gradio.ui_components import create_gradio_ui
# Load environment variables from .env.local
load_dotenv(Path(__file__).parent / ".env.local")