mirror of
https://github.com/trycua/computer.git
synced 2026-05-19 15:38:48 -05:00
Merge pull request #362 from trycua/models/opencua
[Agent] Add OpenCUA, InternVL, and Holo models
@@ -29,20 +29,25 @@ With the Computer SDK, you can:
- create & manage VMs [locally](https://docs.trycua.com/docs/computer-sdk/computers#cua-local-containers) or using [cua cloud](https://www.trycua.com/)

With the Agent SDK, you can:
- run computer-use models with a [consistent output](https://docs.trycua.com/docs/agent-sdk/chat-history#message-array-structure)
- run composed agents using UI grounding models and any LLM
- use any liteLLM provider (`openai/`, `openrouter/`, etc.) or our included local providers (`huggingface-local/`, `mlx/`)
- quickly evaluate new UI agent models and UI grounding models
  - `anthropic/claude-opus-4-1-20250805` (using [Computer-Use Models](https://docs.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents))
  - `openai/computer-use-preview`
  - `openrouter/z-ai/glm-4.5v`
  - `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
  - `omniparser+{any LLM}` (using [Composed Agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents))
  - `huggingface-local/HelloKKMe/GTA1-7B+{any LLM}`
  - `huggingface/HelloKKMe/GTA1-32B+{any LLM}`
  - `vllm_hosted/HelloKKMe/GTA1-72B+{any LLM}`
  - `human/human` (using [Human-in-the-Loop](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop))
- run computer-use models with a [consistent schema](https://docs.trycua.com/docs/agent-sdk/message-format)
- benchmark on OSWorld-Verified, SheetBench-V2, and more [with a single line of code using HUD](https://docs.trycua.com/docs/agent-sdk/integrations/hud) ([Notebook](https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb))
- combine UI grounding models with any LLM using [composed agents](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents)
- use new UI agent models and UI grounding models from the Model Zoo below with just a model string (e.g., `ComputerAgent(model="openai/computer-use-preview")`)
- use API or local inference by changing a prefix (e.g., `openai/`, `openrouter/`, `ollama/`, `huggingface-local/`, `mlx/`, [etc.](https://docs.litellm.ai/docs/providers))
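Switching between API and local inference by changing a prefix amounts to rewriting the part of the model string before the first `/`. The sketch below is an illustration only: `swap_provider` is a hypothetical helper, not part of the Agent SDK, and it does not check that the target provider can actually serve the model.

```python
def swap_provider(model: str, provider: str) -> str:
    """Return `model` with its leading provider prefix replaced by `provider`."""
    # Everything before the first "/" is treated as the provider prefix.
    _, _, rest = model.partition("/")
    if not rest:
        raise ValueError(f"model string has no provider prefix: {model!r}")
    return f"{provider.rstrip('/')}/{rest}"


local = "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
hosted = swap_provider(local, "huggingface/")
print(hosted)  # huggingface/ByteDance-Seed/UI-TARS-1.5-7B
```

The resulting string would then be passed as the `model` argument of `ComputerAgent`, as in the bullet above.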

### CUA Model Zoo 🐨

| [All-in-one CUAs](https://docs.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents) | [UI Grounding Models](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents) | [UI Planning Models](https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents) |
|---|---|---|
| `anthropic/claude-opus-4-1-20250805` | `huggingface-local/xlangai/OpenCUA-{7B,32B}` | any all-in-one CUA |
| `openai/computer-use-preview` | `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}` | any VLM (using LiteLLM, requires `tools` parameter) |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | |
| `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}` | any all-in-one CUA | |
| `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` | | |
| `omniparser+{ui planning}` | | |
| `{ui grounding}+{ui planning}` | | |

- `human/human` → [Human-in-the-Loop](https://docs.trycua.com/docs/agent-sdk/supported-agents/human-in-the-loop)

Missing a model? [Raise a feature request](https://github.com/trycua/cua/issues/new?assignees=&labels=enhancement&projects=&title=%5BAgent%5D%3A+Add+model+support+for+) or [contribute](https://github.com/trycua/cua/blob/main/CONTRIBUTING.md)!
||||
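The Model Zoo table abbreviates model families with a brace shorthand such as `GTA1-{7B,32B,72B}`. A minimal sketch of expanding that shorthand into concrete model strings follows; `expand_model_pattern` is a hypothetical helper written for this illustration, and it assumes a single size list per pattern. Placeholder rows such as `{ui grounding}+{ui planning}` are not size lists and are out of scope.

```python
import re
from typing import List

_BRACES = re.compile(r"\{([^{}]*)\}")


def expand_model_pattern(pattern: str) -> List[str]:
    """Expand the first {a,b,c} group into one string per listed option."""
    match = _BRACES.search(pattern)
    if match is None:
        return [pattern]
    options = [opt.strip() for opt in match.group(1).split(",")]
    # "..." marks an open-ended list in the table; it names no concrete size.
    options = [opt for opt in options if opt and opt != "..."]
    return [pattern[:match.start()] + opt + pattern[match.end():] for opt in options]


for name in expand_model_pattern("huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}"):
    print(name)
```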
@@ -5,32 +5,36 @@ description: Combine grounding models with any LLM for computer-use capabilities
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.

Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
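A composed model string can be taken apart at the `+` separator. The helper below is a hypothetical sketch, not the SDK's own parser, and it assumes that neither model identifier contains a `+` of its own.

```python
from typing import Tuple


def split_composed_model(model: str) -> Tuple[str, str]:
    """Split "grounding_model+planning_model" into its two halves."""
    grounding, sep, planning = model.partition("+")
    if not sep or not grounding or not planning:
        raise ValueError(f"expected 'grounding+planning', got {model!r}")
    return grounding, planning


grounding, planning = split_composed_model(
    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o"
)
print(grounding)  # huggingface-local/HelloKKMe/GTA1-7B
print(planning)   # openai/gpt-4o
```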

## How Composed Agents Work

1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
1. **Planning Phase**: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates
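The three phases above can be sketched as a small control loop. Everything in this block is a stand-in chosen for illustration: `run_step`, the stub planner, the stub grounder, and the fixed coordinates are assumptions, not the SDK's agent loop.

```python
from typing import Callable, List, Tuple, Union

Action = Tuple[str, str]   # (verb, argument), e.g. ("click", "the login button")
Point = Tuple[int, int]    # screen coordinates in pixels


def run_step(
    task: str,
    plan: Callable[[str], List[Action]],
    ground: Callable[[str], Point],
    execute: Callable[[str, Union[str, Point]], str],
) -> List[str]:
    """Plan actions, ground click targets to coordinates, then execute."""
    log: List[str] = []
    for verb, arg in plan(task):                 # 1. planning
        if verb == "click":
            point = ground(arg)                  # 2. grounding
            log.append(execute(verb, point))     # 3. execution with coordinates
        else:
            log.append(execute(verb, arg))       # non-click actions need no grounding
    return log


# Stub components standing in for the planning model, grounding model, and computer.
def fake_plan(task: str) -> List[Action]:
    return [("click", "the login button"), ("type", "username")]


def fake_ground(description: str) -> Point:
    return (450, 320)


def fake_execute(verb: str, payload: Union[str, Point]) -> str:
    return f"{verb}:{payload}"


print(run_step("log in", fake_plan, fake_ground, fake_execute))
# ['click:(450, 320)', 'type:username']
```

Keeping the grounder behind a single callable is what lets any model that offers click prediction be paired with any planner.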

## Supported Grounding Models

Any model that supports `predict_click()` can be used as the grounding component:
Any model that supports `predict_click()` can be used as the grounding component. See the full list on [Grounding Models](./grounding-models).

- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full CU)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)

## Supported Thinking Models
## Supported Planning Models

Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
Any vision-enabled LiteLLM-compatible model can be used as the planning component:

- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
- Any All‑in‑one CUA (planning-capable). See [All‑in‑one CUAs](./computer-use-agents).
- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc.
- Examples:
  - **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
  - **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
  - **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
  - **Local models**: Any Hugging Face vision-language model

## Usage Examples

@@ -1,5 +1,5 @@
---
title: Computer-Use Models
title: All‑in‑one CUA Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---

@@ -36,19 +36,6 @@ async for _ in agent.run("Take a screenshot and describe what you see"):
    pass
```

## UI-TARS 1.5

Unified vision-language model for computer-use:

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)

```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
    pass
```

## GLM-4.5V

Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
@@ -62,6 +49,32 @@ async for _ in agent.run("Click on the search bar and type 'hello world'"):
    pass
```

## InternVL 3.5

InternVL 3.5 family:
- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`

```python
agent = ComputerAgent("huggingface-local/OpenGVLab/InternVL3_5-1B", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
    pass
```

## UI-TARS 1.5

Unified vision-language model for computer-use:

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)

```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
    pass
```

---

CUAs also support direct click prediction. See [Grounding Models](./grounding-models) for details on `predict_click()`.

For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

@@ -7,9 +7,7 @@ These models specialize in UI element grounding and click prediction. They can i

Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.

## All Computer-Use Agents

All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All‑in‑one CUAs](./computer-use-agents).

### Anthropic CUAs

@@ -21,7 +19,7 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`

### UI-TARS 1.5
### UI-TARS 1.5 (Unified VLM with grounding support)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)

@@ -29,18 +27,24 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic

These models are optimized specifically for click prediction and UI element grounding:

### OmniParser
### OpenCUA
- `huggingface-local/xlangai/OpenCUA-{7B,32B}`

### GTA1 Family
- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`

### Holo 1.5 Family
- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`

### InternVL 3.5 Family
- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`

### OmniParser (OCR)

OCR-focused set-of-marks model that requires an LLM for click prediction:

- `omniparser` (requires combination with any LiteLLM vision model)

### GTA1-7B

State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):

- `huggingface-local/HelloKKMe/GTA1-7B`

## Usage Examples

```python
@@ -83,7 +87,6 @@ print(f"Click coordinates: {coords}") # (450, 320)
# agent.run("Fill out the form and submit it")
```


---

For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All‑in‑one CUAs](./computer-use-agents).

@@ -15,54 +15,31 @@ try:
|
||||
except ImportError:
|
||||
HF_AVAILABLE = False
|
||||
|
||||
from .models import load_model as load_model_handler
|
||||
|
||||
class HuggingFaceLocalAdapter(CustomLLM):
|
||||
"""HuggingFace Local Adapter for running vision-language models locally."""
|
||||
|
||||
def __init__(self, device: str = "auto", **kwargs):
|
||||
def __init__(self, device: str = "auto", trust_remote_code: bool = False, **kwargs):
|
||||
"""Initialize the adapter.
|
||||
|
||||
Args:
|
||||
device: Device to load model on ("auto", "cuda", "cpu", etc.)
|
||||
trust_remote_code: Whether to trust remote code
|
||||
**kwargs: Additional arguments
|
||||
"""
|
||||
super().__init__()
|
||||
self.device = device
|
||||
self.models = {} # Cache for loaded models
|
||||
self.processors = {} # Cache for loaded processors
|
||||
self.trust_remote_code = trust_remote_code
|
||||
# Cache for model handlers keyed by model_name
|
||||
self._handlers: Dict[str, Any] = {}
|
||||
self._executor = ThreadPoolExecutor(max_workers=1) # Single thread pool
|
||||
|
||||
def _load_model_and_processor(self, model_name: str):
|
||||
"""Load model and processor if not already cached.
|
||||
|
||||
Args:
|
||||
model_name: Name of the model to load
|
||||
|
||||
Returns:
|
||||
Tuple of (model, processor)
|
||||
"""
|
||||
if model_name not in self.models:
|
||||
# Load model
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_name,
|
||||
torch_dtype=torch.float16,
|
||||
device_map=self.device,
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
# Load processor
|
||||
processor = AutoProcessor.from_pretrained(
|
||||
model_name,
|
||||
min_pixels=3136,
|
||||
max_pixels=4096 * 2160,
|
||||
device_map=self.device
|
||||
)
|
||||
|
||||
# Cache them
|
||||
self.models[model_name] = model
|
||||
self.processors[model_name] = processor
|
||||
|
||||
return self.models[model_name], self.processors[model_name]
|
||||
def _get_handler(self, model_name: str):
|
||||
"""Get or create a model handler for the given model name."""
|
||||
if model_name not in self._handlers:
|
||||
self._handlers[model_name] = load_model_handler(model_name=model_name, device=self.device, trust_remote_code=self.trust_remote_code)
|
||||
return self._handlers[model_name]
|
||||
|
||||
def _convert_messages(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||
"""Convert OpenAI format messages to HuggingFace format.
|
||||
@@ -133,41 +110,13 @@ class HuggingFaceLocalAdapter(CustomLLM):
|
||||
if ignored_kwargs:
|
||||
warnings.warn(f"Ignoring unsupported kwargs: {ignored_kwargs}")
|
||||
|
||||
# Load model and processor
|
||||
model, processor = self._load_model_and_processor(model_name)
|
||||
|
||||
# Convert messages to HuggingFace format
|
||||
hf_messages = self._convert_messages(messages)
|
||||
|
||||
# Apply chat template and tokenize
|
||||
inputs = processor.apply_chat_template(
|
||||
hf_messages,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
)
|
||||
|
||||
# Move inputs to the same device as model
|
||||
inputs = inputs.to(model.device)
|
||||
|
||||
# Generate response
|
||||
with torch.no_grad():
|
||||
generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
|
||||
|
||||
# Trim input tokens from output
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
|
||||
# Decode output
|
||||
output_text = processor.batch_decode(
|
||||
generated_ids_trimmed,
|
||||
skip_special_tokens=True,
|
||||
clean_up_tokenization_spaces=False
|
||||
)
|
||||
|
||||
return output_text[0] if output_text else ""
|
||||
# Delegate to model handler
|
||||
handler = self._get_handler(model_name)
|
||||
generated_text = handler.generate(hf_messages, max_new_tokens=max_new_tokens)
|
||||
return generated_text
|
||||
|
||||
def completion(self, *args, **kwargs) -> ModelResponse:
|
||||
"""Synchronous completion method.
|
||||
|
||||
@@ -0,0 +1,33 @@
from typing import Optional

try:
    from transformers import AutoConfig
    HF_AVAILABLE = True
except ImportError:
    HF_AVAILABLE = False

from .generic import GenericHFModel
from .opencua import OpenCUAModel
from .qwen2_5_vl import Qwen2_5_VLModel
from .internvl import InternVLModel

def load_model(model_name: str, device: str = "auto", trust_remote_code: bool = False):
    """Factory function to load and return the right model handler instance.

    - If the transformers config class name contains 'OpenCUA', return OpenCUAModel
    - If it contains 'Qwen2_5_VL', return Qwen2_5_VLModel
    - If it contains 'InternVL', return InternVLModel
    - Otherwise, return GenericHFModel
    """
    if not HF_AVAILABLE:
        raise ImportError(
            "HuggingFace transformers dependencies not found. Install with: pip install \"cua-agent[uitars-hf]\""
        )
    cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    cls = cfg.__class__.__name__
    print(f"cls: {cls}")
    if "OpenCUA" in cls:
        return OpenCUAModel(model_name=model_name, device=device, trust_remote_code=trust_remote_code)
    elif "Qwen2_5_VL" in cls:
        return Qwen2_5_VLModel(model_name=model_name, device=device, trust_remote_code=trust_remote_code)
    elif "InternVL" in cls:
        return InternVLModel(model_name=model_name, device=device, trust_remote_code=trust_remote_code)
    return GenericHFModel(model_name=model_name, device=device, trust_remote_code=trust_remote_code)
|
||||
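The dispatch in `load_model`, combined with the per-model cache that the adapter keeps in `_handlers`, can be modelled without transformers installed. This is a sketch under stated assumptions: the handler classes are empty stand-ins, the config class names in the usage lines are illustrative, and a table-driven registry is swapped in for the `if`/`elif` chain used in the commit.

```python
from typing import Dict, List, Tuple, Type


class GenericHandler:
    """Stand-in for a model handler; the real ones load weights in __init__."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name


class OpenCUAHandler(GenericHandler):
    pass


class InternVLHandler(GenericHandler):
    pass


# Order matters: the first marker found in the config class name wins.
_REGISTRY: List[Tuple[str, Type[GenericHandler]]] = [
    ("OpenCUA", OpenCUAHandler),
    ("InternVL", InternVLHandler),
]

_cache: Dict[str, GenericHandler] = {}


def pick_handler(config_class_name: str) -> Type[GenericHandler]:
    """Choose a handler class by substring match, falling back to the generic one."""
    for marker, handler_cls in _REGISTRY:
        if marker in config_class_name:
            return handler_cls
    return GenericHandler


def get_handler(model_name: str, config_class_name: str) -> GenericHandler:
    """Create the handler on first use and reuse it afterwards."""
    if model_name not in _cache:
        _cache[model_name] = pick_handler(config_class_name)(model_name)
    return _cache[model_name]


first = get_handler("example/opencua-model", "OpenCUAConfig")
second = get_handler("example/opencua-model", "OpenCUAConfig")
print(type(first).__name__, first is second)  # OpenCUAHandler True
```

Caching by model name means an expensive load would happen once per model even when the adapter serves many requests.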
@@ -0,0 +1,75 @@
from typing import List, Dict, Any, Optional

# Hugging Face imports are wrapped in try/except to avoid a hard dependency at module import
try:
    import torch  # type: ignore
    from transformers import AutoModel, AutoProcessor  # type: ignore
    HF_AVAILABLE = True
except Exception:
    HF_AVAILABLE = False


class GenericHFModel:
    """Generic Hugging Face vision-language model handler.
    Loads the model with AutoModel and its AutoProcessor, then generates text.
    """

    def __init__(self, model_name: str, device: str = "auto", trust_remote_code: bool = False) -> None:
        if not HF_AVAILABLE:
            raise ImportError(
                "HuggingFace transformers dependencies not found. Install with: pip install \"cua-agent[uitars-hf]\""
            )
        self.model_name = model_name
        self.device = device
        self.model = None
        self.processor = None
        self.trust_remote_code = trust_remote_code
        self._load()

    def _load(self) -> None:
        # Load model
        self.model = AutoModel.from_pretrained(
            self.model_name,
            torch_dtype=torch.float16,
            device_map=self.device,
            attn_implementation="sdpa",
            trust_remote_code=self.trust_remote_code,
        )
        # Load processor
        self.processor = AutoProcessor.from_pretrained(
            self.model_name,
            min_pixels=3136,
            max_pixels=4096 * 2160,
            device_map=self.device,
            trust_remote_code=self.trust_remote_code,
        )

    def generate(self, messages: List[Dict[str, Any]], max_new_tokens: int = 128) -> str:
        """Generate text for the given HF-format messages.
        messages: [{ role, content: [{type:'text'|'image', text|image}] }]
        """
        assert self.model is not None and self.processor is not None
        # Apply chat template and tokenize
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        )
        # Move inputs to the same device as model
        inputs = inputs.to(self.model.device)
        # Generate
        with torch.no_grad():
            generated_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Trim prompt tokens from output
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        # Decode
        output_text = self.processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )
        return output_text[0] if output_text else ""
|
||||
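The "trim prompt tokens" step in `generate` relies on the generated sequence starting with a copy of the prompt. A small sketch with plain lists in place of tensors shows the slicing; the token ids are made up for the illustration and `trim_prompt_tokens` is a hypothetical name.

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [list(out[len(inp):]) for inp, out in zip(input_ids, generated_ids)]


prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [55]]
```

Only the trimmed ids are decoded, so the returned text contains the model's continuation and not the echoed prompt.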
@@ -0,0 +1,253 @@
|
||||
from typing import List, Dict, Any, Optional
|
||||
|
||||
# Hugging Face imports are wrapped in try/except to avoid a hard dependency at module import
|
||||
try:
|
||||
import torch # type: ignore
|
||||
from transformers import AutoModel, AutoTokenizer # type: ignore
|
||||
# Attempt to import InternVL's model dependencies
|
||||
import einops as _ # type: ignore
|
||||
import timm as _ # type: ignore
|
||||
from PIL import Image # type: ignore
|
||||
import torchvision.transforms as T # type: ignore
|
||||
from torchvision.transforms.functional import InterpolationMode # type: ignore
|
||||
import base64 # type: ignore
|
||||
from io import BytesIO # type: ignore
|
||||
import requests # type: ignore
|
||||
HF_AVAILABLE = True
|
||||
except Exception:
|
||||
HF_AVAILABLE = False
|
||||
|
||||
|
||||
class InternVLModel:
|
||||
"""InternVL vision-language model handler.
|
||||
Uses InternVL's native `model.chat()` interface with `AutoTokenizer`.
|
||||
Provides preprocessing to support multi-turn conversations with multiple images.
|
||||
"""
|
||||
|
||||
def __init__(self, model_name: str, device: str = "auto", trust_remote_code: bool = False) -> None:
|
||||
if not HF_AVAILABLE:
|
||||
raise ImportError(
|
||||
"InternVL dependencies not found. Install with: pip install \"cua-agent[internvl-hf]\""
|
||||
)
|
||||
self.model_name = model_name
|
||||
self.device = device
|
||||
self.model = None
|
||||
self.tokenizer = None
|
||||
self.trust_remote_code = trust_remote_code
|
||||
self._load()
|
||||
|
||||
def _load(self) -> None:
|
||||
# Load model
|
||||
self.model = AutoModel.from_pretrained(
|
||||
self.model_name,
|
||||
torch_dtype=torch.bfloat16,
|
||||
low_cpu_mem_usage=True,
|
||||
use_flash_attn=True,
|
||||
device_map=self.device,
|
||||
trust_remote_code=self.trust_remote_code,
|
||||
).eval()
|
||||
# Load tokenizer (InternVL requires trust_remote_code=True and often use_fast=False)
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(
|
||||
self.model_name,
|
||||
trust_remote_code=self.trust_remote_code,
|
||||
use_fast=False,
|
||||
)
|
||||
|
||||
# ---- Image preprocessing utilities adapted from InternVL docs ----
|
||||
IMAGENET_MEAN = (0.485, 0.456, 0.406)
|
||||
IMAGENET_STD = (0.229, 0.224, 0.225)
|
||||
|
||||
def _build_transform(self, input_size: int) -> T.Compose:
|
||||
MEAN, STD = self.IMAGENET_MEAN, self.IMAGENET_STD
|
||||
transform = T.Compose([
|
||||
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
|
||||
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
|
||||
T.ToTensor(),
|
||||
T.Normalize(mean=MEAN, std=STD)
|
||||
])
|
||||
return transform
|
||||
|
||||
def _find_closest_aspect_ratio(self, aspect_ratio: float, target_ratios: List[tuple], width: int, height: int, image_size: int):
|
||||
best_ratio_diff = float('inf')
|
||||
best_ratio = (1, 1)
|
||||
area = width * height
|
||||
for ratio in target_ratios:
|
||||
target_aspect_ratio = ratio[0] / ratio[1]
|
||||
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
|
||||
if ratio_diff < best_ratio_diff:
|
||||
best_ratio_diff = ratio_diff
|
||||
best_ratio = ratio
|
||||
elif ratio_diff == best_ratio_diff:
|
||||
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
|
||||
best_ratio = ratio
|
||||
return best_ratio
|
||||
|
||||
def _dynamic_preprocess(self, image: Image.Image, min_num: int = 1, max_num: int = 12, image_size: int = 448, use_thumbnail: bool = True) -> List[Image.Image]:
|
||||
orig_width, orig_height = image.size
|
||||
aspect_ratio = orig_width / orig_height
|
||||
|
||||
target_ratios = set(
|
||||
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
|
||||
i * j <= max_num and i * j >= min_num)
|
||||
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
|
||||
|
||||
target_aspect_ratio = self._find_closest_aspect_ratio(
|
||||
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
|
||||
|
||||
target_width = image_size * target_aspect_ratio[0]
|
||||
target_height = image_size * target_aspect_ratio[1]
|
||||
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
|
||||
|
||||
resized_img = image.resize((target_width, target_height))
|
||||
processed_images: List[Image.Image] = []
|
||||
for i in range(blocks):
|
||||
box = (
|
||||
(i % (target_width // image_size)) * image_size,
|
||||
(i // (target_width // image_size)) * image_size,
|
||||
((i % (target_width // image_size)) + 1) * image_size,
|
||||
((i // (target_width // image_size)) + 1) * image_size
|
||||
)
|
||||
split_img = resized_img.crop(box)
|
||||
processed_images.append(split_img)
|
||||
assert len(processed_images) == blocks
|
||||
if use_thumbnail and len(processed_images) != 1:
|
||||
thumbnail_img = image.resize((image_size, image_size))
|
||||
processed_images.append(thumbnail_img)
|
||||
return processed_images
|
||||
|
||||
def _load_image_from_source(self, src: str) -> Image.Image:
|
||||
"""Load PIL image from various sources: data URL, http(s), or local path."""
|
||||
if src.startswith("data:image/"):
|
||||
# data URL base64
|
||||
header, b64data = src.split(",", 1)
|
||||
img_bytes = base64.b64decode(b64data)
|
||||
return Image.open(BytesIO(img_bytes)).convert('RGB')
|
||||
if src.startswith("http://") or src.startswith("https://"):
|
||||
resp = requests.get(src, timeout=10)
|
||||
resp.raise_for_status()
|
||||
return Image.open(BytesIO(resp.content)).convert('RGB')
|
||||
# Assume local file path
|
||||
return Image.open(src).convert('RGB')
|
||||
|
||||
def _images_to_pixel_values(self, images: List[Image.Image], input_size: int = 448, max_num: int = 12):
|
||||
transform = self._build_transform(input_size=input_size)
|
||||
pixel_values_list = []
|
||||
num_patches_list: List[int] = []
|
||||
for img in images:
|
||||
tiles = self._dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
|
||||
pv = [transform(tile) for tile in tiles]
|
||||
pv = torch.stack(pv)
|
||||
num_patches_list.append(pv.shape[0])
|
||||
pixel_values_list.append(pv)
|
||||
if not pixel_values_list:
|
||||
return None, []
|
||||
pixel_values = torch.cat(pixel_values_list)
|
||||
return pixel_values, num_patches_list
|
||||
|
||||
def generate(self, messages: List[Dict[str, Any]], max_new_tokens: int = 128) -> str:
|
||||
"""Generate text for the given HF-format messages.
|
||||
messages: [{ role, content: [{type:'text'|'image', text|image}] }]
|
||||
|
||||
This implementation constructs InternVL-compatible inputs and uses
|
||||
`model.chat(tokenizer, pixel_values, question, history=...)` to avoid
|
||||
relying on AutoProcessor (which fails for some tokenizers).
|
||||
"""
|
||||
assert self.model is not None and self.tokenizer is not None
|
||||
|
||||
# Build textual context and collect images and the final question
|
||||
context_lines: List[str] = []
|
||||
all_images: List[Image.Image] = []
|
||||
last_user_text_parts: List[str] = []
|
||||
|
||||
for msg in messages:
|
||||
role = msg.get("role", "user")
|
||||
content = msg.get("content", [])
|
||||
if isinstance(content, str):
|
||||
content_items = [{"type": "text", "text": content}]
|
||||
else:
|
||||
content_items = content
|
||||
|
||||
if role == "user":
|
||||
# Collect text and images
|
||||
parts_text: List[str] = []
|
||||
for item in content_items:
|
||||
if item.get("type") == "text":
|
||||
t = item.get("text", "")
|
||||
if t:
|
||||
parts_text.append(t)
|
||||
elif item.get("type") == "image":
|
||||
url = item.get("image", "")
|
||||
if url:
|
||||
try:
|
||||
all_images.append(self._load_image_from_source(url))
|
||||
except Exception:
|
||||
# Ignore failed image loads but keep going
|
||||
pass
|
||||
text = "\n".join(parts_text).strip()
|
||||
if text:
|
||||
context_lines.append(f"User: {text}")
|
||||
# Track last user text separately for question
|
||||
last_user_text_parts = parts_text or last_user_text_parts
|
||||
elif role == "assistant":
|
||||
# Only keep text content for history
|
||||
parts_text = [item.get("text", "") for item in content_items if item.get("type") == "text"]
|
||||
text = "\n".join(parts_text).strip()
|
||||
if text:
|
||||
context_lines.append(f"Assistant: {text}")
|
||||
|
||||
# Prepare pixel values for all collected images (across turns)
|
||||
pixel_values = None
|
||||
num_patches_list: List[int] = []
|
||||
if all_images:
|
||||
pixel_values, num_patches_list = self._images_to_pixel_values(all_images, input_size=448, max_num=12)
|
||||
if pixel_values is not None:
|
||||
# Convert dtype/device as in docs
|
||||
pixel_values = pixel_values.to(torch.bfloat16)
|
||||
# Chat API expects tensors on CUDA when model is on CUDA
|
||||
try:
|
||||
pixel_values = pixel_values.to(self.model.device)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Build question with any prior context and numbered image placeholders
|
||||
if all_images:
|
||||
# Separate images layout: Image-1: <image> ... then question text
|
||||
prefix_lines = [f"Image-{i+1}: <image>" for i in range(len(all_images))]
|
||||
prefix = "\n".join(prefix_lines) + "\n"
|
||||
else:
|
||||
prefix = ""
|
||||
|
||||
last_user_text = "\n".join(last_user_text_parts).strip()
|
||||
# Combine prior text-only turns as context to emulate multi-turn
|
||||
context_text = "\n".join(context_lines[:-1]) if len(context_lines) > 1 else ""
|
||||
base_question = last_user_text if last_user_text else "Describe the image(s) in detail."
|
||||
if context_text:
|
||||
question = (context_text + "\n" + prefix + base_question).strip()
|
||||
else:
|
||||
question = (prefix + base_question).strip()
|
||||
|
||||
# Generation config
|
||||
generation_config = dict(max_new_tokens=max_new_tokens, do_sample=False)
|
||||
|
||||
# Call InternVL chat
|
||||
try:
|
||||
if pixel_values is None:
|
||||
# Pure-text conversation (embed prior turns in question)
|
||||
response = self.model.chat(self.tokenizer, None, question, generation_config)
|
||||
else:
|
||||
# Multi-image: pass num_patches_list if >1 image
|
||||
if len(num_patches_list) > 1:
|
||||
response = self.model.chat(
|
||||
self.tokenizer,
|
||||
pixel_values,
|
||||
question,
|
||||
generation_config,
|
||||
num_patches_list=num_patches_list,
|
||||
)
|
||||
else:
|
||||
response = self.model.chat(self.tokenizer, pixel_values, question, generation_config)
|
||||
except Exception as e:
|
||||
# Fallback: return empty string to avoid crashing the adapter
|
||||
return ""
|
||||
|
||||
return response or ""
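The tile-grid selection performed by `_dynamic_preprocess` and `_find_closest_aspect_ratio` can be traced without PIL or torch. The sketch below re-implements only the arithmetic. The function names are hypothetical, and candidate grids with equal tile counts are ordered deterministically here, which the `set`-based enumeration in the file does not guarantee.

```python
from typing import List, Tuple

Grid = Tuple[int, int]            # (columns, rows)
Box = Tuple[int, int, int, int]   # (left, upper, right, lower)


def candidate_grids(min_num: int = 1, max_num: int = 12) -> List[Grid]:
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def closest_grid(width: int, height: int, grids: List[Grid], tile: int = 448) -> Grid:
    """Pick the grid whose aspect ratio is nearest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * tile * tile * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels.
            best = (cols, rows)
    return best


def tile_boxes(cols: int, rows: int, tile: int = 448) -> List[Box]:
    """Crop boxes, row by row, for an image resized to (cols*tile, rows*tile)."""
    return [
        ((i % cols) * tile, (i // cols) * tile,
         (i % cols + 1) * tile, (i // cols + 1) * tile)
        for i in range(cols * rows)
    ]


grid = closest_grid(1920, 1080, candidate_grids())
print(grid)                        # (4, 2)
print(len(tile_boxes(*grid)) + 1)  # 9: eight tiles plus the thumbnail
```

For a 1920x1080 screenshot both 2x1 and 4x2 match the aspect ratio equally well, and the area test selects the finer 4x2 grid.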
|
||||
@@ -0,0 +1,100 @@
|
||||
from typing import List, Dict, Any
|
||||
import re
|
||||
import base64
|
||||
from io import BytesIO
|
||||
|
||||
try:
|
||||
import torch # type: ignore
|
||||
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor # type: ignore
|
||||
from PIL import Image # type: ignore
|
||||
import blobfile as _ # assert blobfile is installed
|
||||
OPENCUA_AVAILABLE = True
|
||||
except Exception:
|
||||
OPENCUA_AVAILABLE = False
|
||||
|
||||
|
||||
class OpenCUAModel:
|
||||
"""OpenCUA model handler using AutoTokenizer, AutoModel and AutoImageProcessor."""
|
||||
|
||||
def __init__(self, model_name: str, device: str = "auto", trust_remote_code: bool = False) -> None:
|
||||
if not OPENCUA_AVAILABLE:
|
||||
raise ImportError(
|
||||
"OpenCUA requirements not found. Install with: pip install \"cua-agent[opencua-hf]\""
|
||||
)
|
||||
self.model_name = model_name
|
||||
self.device = device
|
||||
self.model = None
|
||||
self.tokenizer = None
|
||||
self.image_processor = None
|
||||
self.trust_remote_code = trust_remote_code
|
||||
self._load()
|
||||
|
||||
def _load(self) -> None:
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(
|
||||
self.model_name, trust_remote_code=self.trust_remote_code
|
||||
)
|
||||
self.model = AutoModel.from_pretrained(
|
||||
self.model_name,
|
||||
torch_dtype="auto",
|
||||
device_map=self.device,
|
||||
trust_remote_code=self.trust_remote_code,
|
||||
attn_implementation="sdpa",
|
||||
)
|
||||
self.image_processor = AutoImageProcessor.from_pretrained(
|
||||
self.model_name, trust_remote_code=self.trust_remote_code
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def _extract_last_image_b64(messages: List[Dict[str, Any]]) -> str:
|
||||
# Expect HF-format messages with content items type: "image" with data URL
|
||||
for msg in reversed(messages):
|
||||
for item in reversed(msg.get("content", [])):
|
||||
if isinstance(item, dict) and item.get("type") == "image":
|
||||
url = item.get("image", "")
|
||||
if isinstance(url, str) and url.startswith("data:image/"):
|
||||
return url.split(",", 1)[1]
|
||||
return ""
|
||||
|
||||
def generate(self, messages: List[Dict[str, Any]], max_new_tokens: int = 512) -> str:
|
||||
assert self.model is not None and self.tokenizer is not None and self.image_processor is not None
|
||||
|
||||
# Tokenize text side using chat template
|
||||
input_ids = self.tokenizer.apply_chat_template(
|
||||
messages, tokenize=True, add_generation_prompt=True
|
||||
)
|
||||
input_ids = torch.tensor([input_ids]).to(self.model.device)
|
||||
|
||||
# Prepare image inputs from last data URL image
|
||||
image_b64 = self._extract_last_image_b64(messages)
|
||||
pixel_values = None
|
||||
grid_thws = None
|
||||
if image_b64:
|
||||
image = Image.open(BytesIO(base64.b64decode(image_b64))).convert("RGB")
|
||||
image_info = self.image_processor.preprocess(images=[image])
|
||||
pixel_values = torch.tensor(image_info["pixel_values"]).to(
|
||||
dtype=torch.bfloat16, device=self.model.device
|
||||
)
|
||||
grid_thws = torch.tensor(image_info["image_grid_thw"]) if "image_grid_thw" in image_info else None
|
||||
|
||||
gen_kwargs: Dict[str, Any] = {
|
||||
"max_new_tokens": max_new_tokens,
|
||||
"temperature": 0,
|
||||
}
|
||||
if pixel_values is not None:
|
||||
gen_kwargs["pixel_values"] = pixel_values
|
||||
if grid_thws is not None:
|
||||
gen_kwargs["grid_thws"] = grid_thws
|
||||
|
||||
with torch.no_grad():
|
||||
generated_ids = self.model.generate(
|
||||
input_ids,
|
||||
**gen_kwargs,
|
||||
)
|
||||
|
||||
# Remove prompt tokens
|
||||
prompt_len = input_ids.shape[1]
|
||||
generated_ids = generated_ids[:, prompt_len:]
|
||||
output_text = self.tokenizer.batch_decode(
|
||||
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)[0]
|
||||
return output_text
|
||||
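Review note: the image lookup in `OpenCUAModel._extract_last_image_b64` can be exercised on its own, without loading any model. The sketch below is a standalone copy of that logic under the same assumption the handler makes (HF-format messages whose image items carry a `data:image/...` URL). The free-function name `extract_last_image_b64`, the explicit list check, and the sample messages are illustrative and not part of the commit.

```python
import base64
from typing import Any, Dict, List


def extract_last_image_b64(messages: List[Dict[str, Any]]) -> str:
    """Return the base64 payload of the most recent data-URL image, or ""."""
    for msg in reversed(messages):
        content = msg.get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no image items
        for item in reversed(content):
            if isinstance(item, dict) and item.get("type") == "image":
                url = item.get("image", "")
                if isinstance(url, str) and url.startswith("data:image/"):
                    # Everything after the first comma is the base64 payload
                    return url.split(",", 1)[1]
    return ""


# Hypothetical conversation: an older screenshot, a text reply, a newer screenshot
payload = base64.b64encode(b"fake-png-bytes").decode("ascii")
messages = [
    {"role": "user", "content": [{"type": "image", "image": "data:image/png;base64,OLD"}]},
    {"role": "assistant", "content": "plain text reply"},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Click the button"},
            {"type": "image", "image": f"data:image/png;base64,{payload}"},
        ],
    },
]
latest = extract_last_image_b64(messages)
```

Only the newest image is returned, which matches the handler's behavior of feeding a single screenshot to the image processor.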
@@ -0,0 +1,75 @@
|
||||
from typing import List, Dict, Any, Optional
|
||||
|
||||
# Hugging Face imports are wrapped in try/except to avoid a hard dependency at module import
|
||||
try:
|
||||
import torch # type: ignore
|
||||
from transformers import AutoModelForImageTextToText, AutoProcessor # type: ignore
|
||||
HF_AVAILABLE = True
|
||||
except Exception:
|
||||
HF_AVAILABLE = False
|
||||
|
||||
|
||||
class Qwen2_5_VLModel:
|
||||
"""Qwen2.5-VL Hugging Face vision-language model handler.
|
||||
Loads an AutoModelForImageTextToText and AutoProcessor and generates text.
|
||||
"""
|
||||
|
||||
def __init__(self, model_name: str, device: str = "auto", trust_remote_code: bool = False) -> None:
|
||||
if not HF_AVAILABLE:
|
||||
raise ImportError(
|
||||
"HuggingFace transformers dependencies not found. Install with: pip install \"cua-agent[uitars-hf]\""
|
||||
)
|
||||
self.model_name = model_name
|
||||
self.device = device
|
||||
self.model = None
|
||||
self.processor = None
|
||||
self.trust_remote_code = trust_remote_code
|
||||
self._load()
|
||||
|
||||
def _load(self) -> None:
|
||||
# Load model
|
||||
self.model = AutoModelForImageTextToText.from_pretrained(
|
||||
self.model_name,
|
||||
torch_dtype=torch.bfloat16,
|
||||
device_map=self.device,
|
||||
attn_implementation="sdpa",
|
||||
trust_remote_code=self.trust_remote_code,
|
||||
)
|
||||
# Load processor
|
||||
self.processor = AutoProcessor.from_pretrained(
|
||||
self.model_name,
|
||||
min_pixels=3136,
|
||||
max_pixels=4096 * 2160,
|
||||
device_map=self.device,
|
||||
trust_remote_code=self.trust_remote_code,
|
||||
)
|
||||
|
||||
def generate(self, messages: List[Dict[str, Any]], max_new_tokens: int = 128) -> str:
|
||||
"""Generate text for the given HF-format messages.
|
||||
messages: [{ role, content: [{type:'text'|'image', text|image}] }]
|
||||
"""
|
||||
assert self.model is not None and self.processor is not None
|
||||
# Apply chat template and tokenize
|
||||
inputs = self.processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
)
|
||||
# Move inputs to the same device as model
|
||||
inputs = inputs.to(self.model.device)
|
||||
# Generate
|
||||
with torch.no_grad():
|
||||
generated_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
|
||||
# Trim prompt tokens from output
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
# Decode
|
||||
output_text = self.processor.batch_decode(
|
||||
generated_ids_trimmed,
|
||||
skip_special_tokens=True,
|
||||
clean_up_tokenization_spaces=False,
|
||||
)
|
||||
return output_text[0] if output_text else ""
|
||||
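Review note: the prompt-trimming step in `Qwen2_5_VLModel.generate` is easy to check in isolation. The sketch below applies the same slicing to plain Python lists instead of tensors; `trim_prompt_tokens` and the token ids are made up for illustration.

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token ids: generate() returns prompt + completion per row
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]
trimmed = trim_prompt_tokens(prompts, outputs)
```

Decoding only the trimmed ids is what keeps the prompt text out of the returned string.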
@@ -171,6 +171,7 @@ class ComputerAgent:
|
||||
use_prompt_caching: Optional[bool] = False,
|
||||
max_trajectory_budget: Optional[float | dict] = None,
|
||||
telemetry_enabled: Optional[bool] = True,
|
||||
trust_remote_code: Optional[bool] = False,
|
||||
**kwargs
|
||||
):
|
||||
"""
|
||||
@@ -190,6 +191,7 @@ class ComputerAgent:
|
||||
use_prompt_caching: If set, use prompt caching to avoid reprocessing the same prompt. Intended for use with Anthropic providers.
|
||||
max_trajectory_budget: If set, adds BudgetManagerCallback to track usage costs and stop when budget is exceeded
|
||||
telemetry_enabled: If set, adds TelemetryCallback to track anonymized usage data. Enabled by default.
|
||||
trust_remote_code: If set, trust remote code when loading local models. Disabled by default.
|
||||
**kwargs: Additional arguments passed to the agent loop
|
||||
"""
|
||||
# If the loop is "human/human", we need to prefix a grounding model fallback
|
||||
@@ -209,6 +211,7 @@ class ComputerAgent:
|
||||
self.use_prompt_caching = use_prompt_caching
|
||||
self.telemetry_enabled = telemetry_enabled
|
||||
self.kwargs = kwargs
|
||||
self.trust_remote_code = trust_remote_code
|
||||
|
||||
# == Add built-in callbacks ==
|
||||
|
||||
@@ -252,7 +255,8 @@ class ComputerAgent:
|
||||
|
||||
# Register local model providers
|
||||
hf_adapter = HuggingFaceLocalAdapter(
|
||||
- device="auto"
+ device="auto",
+ trust_remote_code=self.trust_remote_code or False
|
||||
)
|
||||
human_adapter = HumanAdapter()
|
||||
mlx_adapter = MLXVLMAdapter()
|
||||
|
||||
@@ -18,6 +18,15 @@ try:
|
||||
import json
|
||||
from typing import List, Dict, Any
|
||||
import dotenv
|
||||
import base64
|
||||
import time
|
||||
import platform
|
||||
from pathlib import Path
|
||||
try:
|
||||
from PIL import Image, ImageDraw
|
||||
PIL_AVAILABLE = True
|
||||
except Exception:
|
||||
PIL_AVAILABLE = False
|
||||
from yaspin import yaspin
|
||||
except ImportError:
|
||||
if __name__ == "__main__":
|
||||
@@ -248,6 +257,13 @@ Examples:
|
||||
help="Initial prompt to send to the agent. Leave blank for interactive mode."
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--predict-click",
|
||||
dest="predict_click",
|
||||
type=str,
|
||||
help="Instruction for click prediction. If set, runs predict_click, draws crosshair on a fresh screenshot, saves and opens it."
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-c", "--cache",
|
||||
action="store_true",
|
||||
@@ -331,6 +347,7 @@ Examples:
|
||||
agent_kwargs = {
|
||||
"model": args.model,
|
||||
"tools": [computer],
|
||||
"trust_remote_code": True, # needed for some local models (e.g., InternVL, OpenCUA)
|
||||
"verbosity": 20 if args.verbose else 30, # DEBUG vs WARNING
|
||||
"max_retries": args.max_retries
|
||||
}
|
||||
@@ -353,7 +370,79 @@ Examples:
|
||||
|
||||
agent = ComputerAgent(**agent_kwargs)
|
||||
|
||||
- # Start chat loop
+ # If predict-click mode is requested, run once and exit
|
||||
if args.predict_click:
|
||||
if not PIL_AVAILABLE:
|
||||
print_colored("❌ Pillow (PIL) is required for --predict-click visualization. Install with: pip install pillow", Colors.RED, bold=True)
|
||||
sys.exit(1)
|
||||
|
||||
instruction = args.predict_click
|
||||
print_colored(f"Predicting click for: '{instruction}'", Colors.CYAN)
|
||||
|
||||
# Take a fresh screenshot FIRST
|
||||
try:
|
||||
img_bytes = await computer.interface.screenshot()
|
||||
except Exception as e:
|
||||
print_colored(f"❌ Failed to take screenshot: {e}", Colors.RED, bold=True)
|
||||
sys.exit(1)
|
||||
|
||||
# Encode screenshot to base64 for predict_click
|
||||
try:
|
||||
image_b64 = base64.b64encode(img_bytes).decode("utf-8")
|
||||
except Exception as e:
|
||||
print_colored(f"❌ Failed to encode screenshot: {e}", Colors.RED, bold=True)
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
coords = await agent.predict_click(instruction, image_b64=image_b64)
|
||||
except Exception as e:
|
||||
print_colored(f"❌ predict_click failed: {e}", Colors.RED, bold=True)
|
||||
sys.exit(1)
|
||||
|
||||
if not coords:
|
||||
print_colored("⚠️ No coordinates returned.", Colors.YELLOW)
|
||||
sys.exit(2)
|
||||
|
||||
x, y = coords
|
||||
print_colored(f"✅ Predicted coordinates: ({x}, {y})", Colors.GREEN)
|
||||
|
||||
try:
|
||||
from io import BytesIO
|
||||
with Image.open(BytesIO(img_bytes)) as img:
|
||||
img = img.convert("RGB")
|
||||
draw = ImageDraw.Draw(img)
|
||||
# Draw crosshair
|
||||
size = 12
|
||||
color = (255, 0, 0)
|
||||
draw.line([(x - size, y), (x + size, y)], fill=color, width=3)
|
||||
draw.line([(x, y - size), (x, y + size)], fill=color, width=3)
|
||||
# Optional small circle
|
||||
r = 6
|
||||
draw.ellipse([(x - r, y - r), (x + r, y + r)], outline=color, width=2)
|
||||
|
||||
out_path = Path.cwd() / f"predict_click_{int(time.time())}.png"
|
||||
img.save(out_path)
|
||||
print_colored(f"🖼️ Saved to {out_path}")
|
||||
|
||||
# Open the image with default viewer
|
||||
try:
|
||||
system = platform.system().lower()
|
||||
if system == "windows":
|
||||
os.startfile(str(out_path)) # type: ignore[attr-defined]
|
||||
elif system == "darwin":
|
||||
os.system(f"open \"{out_path}\"")
|
||||
else:
|
||||
os.system(f"xdg-open \"{out_path}\"")
|
||||
except Exception:
|
||||
pass
|
||||
except Exception as e:
|
||||
print_colored(f"❌ Failed to render/save screenshot: {e}", Colors.RED, bold=True)
|
||||
sys.exit(1)
|
||||
|
||||
# Done
|
||||
sys.exit(0)
|
||||
|
||||
# Start chat loop (default interactive mode)
|
||||
await chat_loop(agent, args.model, container_name, args.prompt, args.usage)
|
||||
|
||||
|
||||
|
||||
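Review note: the `--predict-click` branch opens the saved image with `os.system` and a quoted f-string. One possible alternative, sketched here as a suggestion rather than as what the commit does, is to build an argument list and hand it to `subprocess.run`, so that paths with spaces or quotes need no shell escaping. `viewer_command` and `open_image` are hypothetical helper names.

```python
import os
import platform
import subprocess
from pathlib import Path
from typing import List, Optional


def viewer_command(system: str, path: Path) -> Optional[List[str]]:
    """Return the argv for the default image viewer, or None on Windows."""
    system = system.lower()
    if system == "darwin":
        return ["open", str(path)]
    if system == "windows":
        return None  # Windows uses os.startfile instead of a command
    return ["xdg-open", str(path)]


def open_image(path: Path) -> None:
    """Best-effort open; failures are ignored, as in the CLI."""
    try:
        cmd = viewer_command(platform.system(), path)
        if cmd is None:
            os.startfile(str(path))  # type: ignore[attr-defined]  # Windows only
        else:
            subprocess.run(cmd, check=False)
    except OSError:
        pass  # no viewer available; the file is still saved on disk
```

Calling `open_image(out_path)` would replace the platform switch in the CLI; keeping the command construction in a pure function makes it testable without launching a viewer.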
@@ -10,5 +10,19 @@ from . import omniparser
|
||||
from . import gta1
|
||||
from . import composed_grounded
|
||||
from . import glm45v
|
||||
+ from . import opencua
+ from . import internvl
+ from . import holo

- __all__ = ["anthropic", "openai", "uitars", "omniparser", "gta1", "composed_grounded", "glm45v"]
+ __all__ = [
+ "anthropic",
+ "openai",
+ "uitars",
+ "omniparser",
+ "gta1",
+ "composed_grounded",
+ "glm45v",
+ "opencua",
+ "internvl",
+ "holo",
+ ]
|
||||
@@ -126,7 +126,7 @@ def get_last_computer_call_image(messages: List[Dict[str, Any]]) -> Optional[str
|
||||
|
||||
|
||||
@register_agent(r".*\+.*", priority=1)
|
||||
- class ComposedGroundedConfig:
+ class ComposedGroundedConfig(AsyncAgentConfig):
|
||||
"""
|
||||
Composed-grounded agent configuration that uses both grounding and thinking models.
|
||||
|
||||
|
||||
@@ -844,7 +844,7 @@ Where x,y are coordinates normalized to 0-999 range."""
|
||||
api_kwargs = {
|
||||
"model": model,
|
||||
"messages": litellm_messages,
|
||||
- "max_tokens": 100,
+ "max_tokens": 2056,
|
||||
"temperature": 0.001,
|
||||
"extra_body": {
|
||||
"skip_special_tokens": False,
|
||||
@@ -856,6 +856,7 @@ Where x,y are coordinates normalized to 0-999 range."""
|
||||
|
||||
# Extract response content
|
||||
response_content = response.choices[0].message.content.strip()
|
||||
print(response)
|
||||
|
||||
# Parse response for click coordinates
|
||||
# Look for coordinates in the response, handling special tokens
|
||||
@@ -866,7 +867,7 @@ Where x,y are coordinates normalized to 0-999 range."""
|
||||
# Fallback: look for coordinates without special tokens
|
||||
coord_pattern = r"left_click\(start_box='?\[(\d+),(\d+)\]'?\)"
|
||||
match = re.search(coord_pattern, response_content)
|
||||
|
||||
|
||||
if match:
|
||||
x, y = int(match.group(1)), int(match.group(2))
|
||||
|
||||
|
||||
@@ -155,7 +155,7 @@ class GTA1Config(AsyncAgentConfig):
|
||||
api_kwargs = {
|
||||
"model": model,
|
||||
"messages": [system_message, user_message],
|
||||
- "max_tokens": 32,
+ "max_tokens": 2056,
|
||||
"temperature": 0.0,
|
||||
**kwargs
|
||||
}
|
||||
|
||||
@@ -0,0 +1,216 @@
|
||||
"""
|
||||
Holo 1.5 agent loop implementation for click prediction using litellm.acompletion.
|
||||
|
||||
Implements the Holo1.5 grounding behavior:
|
||||
- Prompt asks for absolute pixel coordinates in JSON: {"action":"click_absolute","x":int,"y":int}
|
||||
- Optionally resizes the image using Qwen2-VL smart_resize parameters (via transformers AutoProcessor)
|
||||
- If resized, maps predicted coordinates back to the original screenshot resolution
|
||||
|
||||
Note: We do NOT manually load the model; litellm.acompletion (via HuggingFaceLocalAdapter)
|
||||
will handle loading based on the provided model name.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import base64
|
||||
import json
|
||||
from io import BytesIO
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
|
||||
import litellm
|
||||
from PIL import Image
|
||||
|
||||
from ..decorators import register_agent
|
||||
from .base import AsyncAgentConfig
|
||||
from ..types import AgentCapability
|
||||
|
||||
|
||||
def _strip_hf_prefix(model: str) -> str:
    """Strip provider prefixes like 'huggingface-local/' from model names for HF processor load."""
    if "/" in model and model.lower().startswith("huggingface-local/"):
        return model.split("/", 1)[1]
    return model
|
||||
|
||||
|
||||
def _maybe_smart_resize(image: Image.Image, model: str) -> Tuple[Image.Image, Tuple[int, int]]:
|
||||
"""
|
||||
Try to compute Qwen2-VL smart_resize output size using transformers AutoProcessor.
|
||||
|
||||
Returns (processed_image, (orig_w, orig_h)). If transformers or processor unavailable,
|
||||
returns the original image and size without resizing.
|
||||
"""
|
||||
orig_w, orig_h = image.size
|
||||
try:
|
||||
# Import lazily to avoid hard dependency if not installed
|
||||
from transformers import AutoProcessor # type: ignore
|
||||
from transformers.models.qwen2_vl.image_processing_qwen2_vl import ( # type: ignore
|
||||
smart_resize,
|
||||
)
|
||||
|
||||
processor_name = _strip_hf_prefix(model)
|
||||
processor = AutoProcessor.from_pretrained(processor_name)
|
||||
image_processor = getattr(processor, "image_processor", None)
|
||||
if image_processor is None:
|
||||
return image, (orig_w, orig_h)
|
||||
|
||||
factor = getattr(image_processor, "patch_size", 14) * getattr(image_processor, "merge_size", 1)
|
||||
min_pixels = getattr(image_processor, "min_pixels", 256 * 256)
|
||||
max_pixels = getattr(image_processor, "max_pixels", 1536 * 1536)
|
||||
|
||||
resized_h, resized_w = smart_resize(
|
||||
orig_h,
|
||||
orig_w,
|
||||
factor=factor,
|
||||
min_pixels=min_pixels,
|
||||
max_pixels=max_pixels,
|
||||
)
|
||||
|
||||
if (resized_w, resized_h) == (orig_w, orig_h):
|
||||
return image, (orig_w, orig_h)
|
||||
|
||||
processed = image.resize((resized_w, resized_h), resample=Image.Resampling.LANCZOS)
|
||||
return processed, (orig_w, orig_h)
|
||||
except Exception:
|
||||
# If any failure (no transformers, processor load error), fall back to original
|
||||
return image, (orig_w, orig_h)
|
||||
|
||||
|
||||
def _build_holo_prompt(instruction: str) -> str:
|
||||
"""Construct the Holo1.5 grounding prompt."""
|
||||
# Keep it close to the cookbook while avoiding heavy schema generation
|
||||
schema_hint = '{"action": "click_absolute", "x": <int>, "y": <int>}'
|
||||
return (
|
||||
"Localize an element on the GUI image according to the provided target and output a click position. "
|
||||
f"You must output a valid JSON following the format: {schema_hint} "
|
||||
f"Your target is: {instruction}"
|
||||
)
|
||||
|
||||
|
||||
def _parse_click_json(output_text: str) -> Optional[Tuple[int, int]]:
|
||||
"""
|
||||
Parse JSON from model output and extract x, y ints.
|
||||
Tries to find the first JSON object substring if extra text is present.
|
||||
"""
|
||||
try:
|
||||
# Fast path: direct JSON
|
||||
data = json.loads(output_text)
|
||||
except Exception:
|
||||
# Try to locate a JSON object within the text
|
||||
start = output_text.find("{")
|
||||
end = output_text.rfind("}")
|
||||
if start == -1 or end == -1 or end <= start:
|
||||
return None
|
||||
try:
|
||||
data = json.loads(output_text[start : end + 1])
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
try:
|
||||
x = int(data.get("x"))
|
||||
y = int(data.get("y"))
|
||||
return x, y
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
@register_agent(models=r"(?i).*(Holo1\.5|Hcompany/Holo1\.5).*")
|
||||
class HoloConfig(AsyncAgentConfig):
|
||||
"""Holo is a family of UI grounding models from H Company"""
|
||||
|
||||
async def predict_step(
|
||||
self,
|
||||
messages: List[Dict[str, Any]],
|
||||
model: str,
|
||||
tools: Optional[List[Dict[str, Any]]] = None,
|
||||
max_retries: Optional[int] = None,
|
||||
stream: bool = False,
|
||||
computer_handler=None,
|
||||
_on_api_start=None,
|
||||
_on_api_end=None,
|
||||
_on_usage=None,
|
||||
_on_screenshot=None,
|
||||
**kwargs,
|
||||
) -> Dict[str, Any]:
|
||||
# Holo models are trained only for UI localization, not as all-in-one agents
|
||||
raise NotImplementedError()
|
||||
|
||||
async def predict_click(
|
||||
self,
|
||||
model: str,
|
||||
image_b64: str,
|
||||
instruction: str,
|
||||
**kwargs,
|
||||
) -> Optional[Tuple[int, int]]:
|
||||
"""
|
||||
Predict click coordinates using Holo1.5 via litellm.acompletion.
|
||||
|
||||
- Optionally smart-resizes the image using Qwen2-VL rules if transformers are available
|
||||
- Prompts for JSON with absolute pixel coordinates
|
||||
- Parses x,y and maps back to original screenshot size if resized
|
||||
"""
|
||||
try:
|
||||
img_bytes = base64.b64decode(image_b64)
|
||||
original_img = Image.open(BytesIO(img_bytes))
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
# Optional preprocessing
|
||||
processed_img, (orig_w, orig_h) = _maybe_smart_resize(original_img, model)
|
||||
|
||||
# If we resized, send the resized image; otherwise send original
|
||||
img_to_send = processed_img
|
||||
buf = BytesIO()
|
||||
img_to_send.save(buf, format="PNG")
|
||||
processed_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
|
||||
|
||||
prompt = _build_holo_prompt(instruction)
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {"url": f"data:image/png;base64,{processed_b64}"},
|
||||
},
|
||||
{"type": "text", "text": prompt},
|
||||
],
|
||||
}
|
||||
]
|
||||
|
||||
api_kwargs = {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
# Deterministic, small output
|
||||
"max_tokens": kwargs.get("max_tokens", 256),
|
||||
"temperature": kwargs.get("temperature", 0.0),
|
||||
}
|
||||
|
||||
response = await litellm.acompletion(**api_kwargs)
|
||||
output_text = (response.choices[0].message.content or "").strip() # type: ignore
|
||||
|
||||
coords = _parse_click_json(output_text)
|
||||
if coords is None:
|
||||
return None
|
||||
|
||||
x, y = coords
|
||||
|
||||
# Map back to original size if we resized
|
||||
proc_w, proc_h = img_to_send.size
|
||||
if (proc_w, proc_h) != (orig_w, orig_h):
|
||||
try:
|
||||
sx = orig_w / float(proc_w)
|
||||
sy = orig_h / float(proc_h)
|
||||
x = int(round(x * sx))
|
||||
y = int(round(y * sy))
|
||||
except Exception:
|
||||
# On failure, keep the unscaled coordinates; they are clamped to the original bounds below
|
||||
pass
|
||||
|
||||
# Clamp to original image bounds
|
||||
x = max(0, min(orig_w - 1, x))
|
||||
y = max(0, min(orig_h - 1, y))
|
||||
return x, y
|
||||
|
||||
def get_capabilities(self) -> List[AgentCapability]:
|
||||
return ["click"]
|
||||
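Review note: the two pure steps of `HoloConfig.predict_click` (parsing the JSON answer, then mapping coordinates from the resized image back to the original screenshot) can be tested without a model. The following is a minimal sketch of that flow; the function names `parse_click_json` and `map_to_original`, the narrower exception types, and the sample sizes are illustrative assumptions, not the module's exact code.

```python
import json
from typing import Optional, Tuple


def parse_click_json(text: str) -> Optional[Tuple[int, int]]:
    """Read x, y from a JSON object, tolerating extra text around it."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            data = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    try:
        return int(data["x"]), int(data["y"])
    except (KeyError, TypeError, ValueError):
        return None


def map_to_original(
    x: int, y: int, proc_size: Tuple[int, int], orig_size: Tuple[int, int]
) -> Tuple[int, int]:
    """Scale from the resized image to the original, then clamp to its bounds."""
    proc_w, proc_h = proc_size
    orig_w, orig_h = orig_size
    x = int(round(x * orig_w / proc_w))
    y = int(round(y * orig_h / proc_h))
    return max(0, min(orig_w - 1, x)), max(0, min(orig_h - 1, y))


# Hypothetical model reply on a 1280x720 resize of a 2560x1440 screenshot
coords = parse_click_json('Sure: {"action": "click_absolute", "x": 640, "y": 360}')
mapped = map_to_original(*coords, proc_size=(1280, 720), orig_size=(2560, 1440))
```

The clamp matters because a model can return coordinates just outside the image, and the caller draws or clicks at whatever comes back.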
@@ -0,0 +1,185 @@
|
||||
"""
|
||||
InternVL agent loop implementation for click prediction using litellm.acompletion.
|
||||
|
||||
Implements the ScreenSpot InternVL grounding baseline behavior:
|
||||
- Uses the exact grounding prompt format with <image> and <ref> tags
|
||||
- Expects coordinates in 0-1000 normalized range in formats [[x1,y1,x2,y2]] or [[x,y]]
|
||||
- Converts to pixel coordinates relative to the original screenshot size
|
||||
|
||||
Note: We do NOT manually load the InternVL model; litellm.acompletion (via HuggingFaceLocalAdapter)
|
||||
will handle loading based on the provided model name.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import base64
|
||||
import math
|
||||
import re
|
||||
from io import BytesIO
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
|
||||
from PIL import Image
|
||||
import litellm
|
||||
|
||||
from ..decorators import register_agent
|
||||
from .composed_grounded import ComposedGroundedConfig
|
||||
from ..types import AgentCapability
|
||||
|
||||
|
||||
# Regex patterns for extracting coordinates
|
||||
# Accept optional whitespace and optional decimal fractions
|
||||
_NUM = r"(\d+(?:\.\d+)?)"
|
||||
_POINT_PATTERN = re.compile(r"\[\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]\]")
|
||||
_BBOX_PATTERN = re.compile(
|
||||
r"\[\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]\]"
|
||||
)
|
||||
|
||||
|
||||
def _extract_first_point(text: str) -> Optional[Tuple[float, float]]:
|
||||
"""Extract the first [[x,y]] as normalized (0-1000) floats."""
|
||||
m = _POINT_PATTERN.search(text)
|
||||
if not m:
|
||||
return None
|
||||
try:
|
||||
x = float(m.group(1))
|
||||
y = float(m.group(2))
|
||||
return x, y
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def _extract_last_bbox(text: str) -> Optional[Tuple[float, float, float, float]]:
|
||||
"""Extract the last [[x1,y1,x2,y2]] as normalized (0-1000) floats."""
|
||||
matches = list(_BBOX_PATTERN.finditer(text))
|
||||
if not matches:
|
||||
return None
|
||||
m = matches[-1]
|
||||
try:
|
||||
x1 = float(m.group(1))
|
||||
y1 = float(m.group(2))
|
||||
x2 = float(m.group(3))
|
||||
y2 = float(m.group(4))
|
||||
return x1, y1, x2, y2
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def _scale_norm_to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Scale 0-1000 normalized coordinates to pixel coordinates for given image size."""
    x_px = int(math.floor((x_norm / 1000.0) * width))
    y_px = int(math.floor((y_norm / 1000.0) * height))
    # Clamp to image bounds just in case
    x_px = max(0, min(width - 1, x_px))
    y_px = max(0, min(height - 1, y_px))
    return x_px, y_px
|
||||
|
||||
|
||||
@register_agent(models=r"(?i).*InternVL.*")
|
||||
class InternVLConfig(ComposedGroundedConfig):
|
||||
"""InternVL agent configuration reusing ComposedGroundedConfig for steps and
|
||||
overriding predict_click to implement ScreenSpot InternVL grounding baseline."""
|
||||
|
||||
async def predict_step(
|
||||
self,
|
||||
messages: List[Dict[str, Any]],
|
||||
model: str,
|
||||
tools: Optional[List[Dict[str, Any]]] = None,
|
||||
max_retries: Optional[int] = None,
|
||||
stream: bool = False,
|
||||
computer_handler=None,
|
||||
_on_api_start=None,
|
||||
_on_api_end=None,
|
||||
_on_usage=None,
|
||||
_on_screenshot=None,
|
||||
**kwargs
|
||||
) -> Dict[str, Any]:
|
||||
"""Fallback to a self-composed model"""
|
||||
return await super().predict_step(
|
||||
messages=messages,
|
||||
model=f"{model}+{model}",
|
||||
tools=tools,
|
||||
max_retries=max_retries,
|
||||
stream=stream,
|
||||
computer_handler=computer_handler,
|
||||
_on_api_start=_on_api_start,
|
||||
_on_api_end=_on_api_end,
|
||||
_on_usage=_on_usage,
|
||||
_on_screenshot=_on_screenshot,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
async def predict_click(
|
||||
self,
|
||||
model: str,
|
||||
image_b64: str,
|
||||
instruction: str,
|
||||
**kwargs
|
||||
) -> Optional[Tuple[int, int]]:
|
||||
"""
|
||||
Predict click coordinates using InternVL via litellm.acompletion.
|
||||
|
||||
Behavior mirrors the ScreenSpot InternVL baseline:
|
||||
- Prompt: "<image>\nPlease provide the bounding box coordinate of the UI element this user instruction describes: <ref>{instruction}</ref>. Answer in the format of [[x1, y1, x2, y2]]"
|
||||
- Parse either [[x,y]] point or [[x1,y1,x2,y2]] bbox, using bbox center if point missing
|
||||
- Coordinates are 0-1000 normalized; convert to pixel coordinates for the original screenshot
|
||||
"""
|
||||
try:
|
||||
# Decode image dimensions to scale the normalized outputs
|
||||
img_bytes = base64.b64decode(image_b64)
|
||||
image = Image.open(BytesIO(img_bytes))
|
||||
width, height = image.size
|
||||
except Exception:
|
||||
# If decoding fails, proceed with a safe default size to avoid crash
|
||||
width, height = 1920, 1080
|
||||
|
||||
# Build grounding prompt exactly like the baseline
|
||||
grounding_prompt = (
|
||||
f"Please provide the bounding box coordinate of the UI element this user instruction describes: <ref>{instruction}</ref>. "
|
||||
f"Answer in the format of [[x1, y1, x2, y2]]"
|
||||
)
|
||||
|
||||
# Prepare messages for LiteLLM
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {"url": f"data:image/png;base64,{image_b64}"},
|
||||
},
|
||||
{"type": "text", "text": grounding_prompt},
|
||||
],
|
||||
}
|
||||
]
|
||||
|
||||
# Call acompletion; HuggingFaceLocalAdapter/model handler will handle InternVL loading
|
||||
api_kwargs = {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
# Conservative generation params akin to baseline (deterministic)
|
||||
"max_tokens": kwargs.get("max_tokens", 256),
|
||||
"temperature": kwargs.get("temperature", 0.0),
|
||||
}
|
||||
|
||||
response = await litellm.acompletion(**api_kwargs)
|
||||
output_text = (response.choices[0].message.content or "").strip() # type: ignore
|
||||
|
||||
print(f"InternVL output: {output_text}")
|
||||
|
||||
# Try to parse a point first; if absent, parse bbox and take center
|
||||
point = _extract_first_point(output_text)
|
||||
if point is None:
|
||||
bbox = _extract_last_bbox(output_text)
|
||||
if bbox is None:
|
||||
return None
|
||||
x1, y1, x2, y2 = bbox
|
||||
cx = (x1 + x2) / 2.0
|
||||
cy = (y1 + y2) / 2.0
|
||||
point = (cx, cy)
|
||||
|
||||
x_norm, y_norm = point
|
||||
x_px, y_px = _scale_norm_to_pixels(x_norm, y_norm, width, height)
|
||||
return (x_px, y_px)
|
||||
|
||||
def get_capabilities(self) -> List[AgentCapability]:
|
||||
return ["click", "step"]
|
||||
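Review note: the InternVL post-processing (take a `[[x,y]]` point if present, otherwise the center of the last `[[x1,y1,x2,y2]]` box, then scale from the 0-1000 range to pixels) is sketched below as a self-contained script. The regexes follow the ones in the diff; `normalized_click`, `to_pixels`, and the sample reply are hypothetical names and data.

```python
import math
import re
from typing import Optional, Tuple

NUM = r"(\d+(?:\.\d+)?)"
SEP = r"\s*,\s*"
POINT = re.compile(r"\[\[\s*" + NUM + SEP + NUM + r"\s*\]\]")
BBOX = re.compile(r"\[\[\s*" + SEP.join([NUM] * 4) + r"\s*\]\]")


def normalized_click(text: str) -> Optional[Tuple[float, float]]:
    """Return a 0-1000 normalized click point, or None if nothing parses."""
    point = POINT.search(text)
    if point:
        return float(point.group(1)), float(point.group(2))
    boxes = BBOX.findall(text)
    if not boxes:
        return None
    x1, y1, x2, y2 = map(float, boxes[-1])  # last box, then its center
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Scale normalized coordinates to pixels and clamp to the image."""
    x = int(math.floor(x_norm / 1000.0 * width))
    y = int(math.floor(y_norm / 1000.0 * height))
    return max(0, min(width - 1, x)), max(0, min(height - 1, y))


# Hypothetical reply for a 1920x1080 screenshot
center = normalized_click("The element is at [[400, 200, 600, 300]]")
pixel = to_pixels(*center, width=1920, height=1080)
```

A normalized value of exactly 1000 would land one pixel past the edge, which is why the clamp to `width - 1` and `height - 1` is applied after scaling.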
@@ -0,0 +1,142 @@
|
||||
"""
|
||||
OpenCUA agent loop implementation for click prediction using litellm.acompletion.
Based on the OpenCUA model for GUI grounding tasks.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import re
|
||||
import base64
|
||||
from typing import Dict, List, Any, AsyncGenerator, Union, Optional, Tuple
|
||||
from io import BytesIO
|
||||
import uuid
|
||||
from PIL import Image
|
||||
import litellm
|
||||
import math
|
||||
|
||||
from .composed_grounded import ComposedGroundedConfig
|
||||
from ..decorators import register_agent
|
||||
from ..types import Messages, AgentResponse, Tools, AgentCapability
|
||||
from ..loops.base import AsyncAgentConfig
|
||||
|
||||
def extract_coordinates_from_pyautogui(text: str) -> Optional[Tuple[int, int]]:
    """Extract coordinates from pyautogui.click(x=..., y=...) format."""
    try:
        # Look for pyautogui.click(x=1443, y=343) pattern
        pattern = r"pyautogui\.click\(x=(\d+),\s*y=(\d+)\)"
        match = re.search(pattern, text)
        if match:
            x, y = int(match.group(1)), int(match.group(2))
            return (x, y)
        return None
    except Exception:
        return None
|
||||
|
||||
@register_agent(models=r"(?i).*OpenCUA.*")
|
||||
class OpenCUAConfig(ComposedGroundedConfig):
|
||||
"""OpenCUA agent configuration implementing AsyncAgentConfig protocol for click prediction."""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.current_model = None
|
||||
self.last_screenshot_b64 = None
|
||||
|
||||
async def predict_step(
|
||||
self,
|
||||
messages: List[Dict[str, Any]],
|
||||
model: str,
|
||||
tools: Optional[List[Dict[str, Any]]] = None,
|
||||
max_retries: Optional[int] = None,
|
||||
stream: bool = False,
|
||||
computer_handler=None,
|
||||
_on_api_start=None,
|
||||
_on_api_end=None,
|
||||
_on_usage=None,
|
||||
_on_screenshot=None,
|
||||
**kwargs
|
||||
) -> Dict[str, Any]:
|
||||
"""Fallback to a self-composed model"""
|
||||
return await super().predict_step(
|
||||
messages=messages,
|
||||
model=f"{model}+{model}",
|
||||
tools=tools,
|
||||
max_retries=max_retries,
|
||||
stream=stream,
|
||||
computer_handler=computer_handler,
|
||||
_on_api_start=_on_api_start,
|
||||
_on_api_end=_on_api_end,
|
||||
_on_usage=_on_usage,
|
||||
_on_screenshot=_on_screenshot,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
async def predict_click(
|
||||
self,
|
||||
model: str,
|
||||
image_b64: str,
|
||||
instruction: str,
|
||||
**kwargs
|
||||
) -> Optional[Tuple[int, int]]:
|
||||
"""
|
||||
Predict click coordinates using OpenCUA model via litellm.acompletion.
|
||||
|
||||
Args:
|
||||
model: The OpenCUA model name
|
||||
image_b64: Base64 encoded image
|
||||
instruction: Instruction for where to click
|
||||
|
||||
Returns:
|
||||
Tuple of (x, y) coordinates or None if prediction fails
|
||||
"""
|
||||
# Prepare system message
|
||||
system_prompt = (
|
||||
"You are a GUI agent. You are given a task and a screenshot of the screen. "
|
||||
"You need to perform a series of pyautogui actions to complete the task."
|
||||
)
|
||||
|
||||
system_message = {
|
||||
"role": "system",
|
||||
"content": system_prompt
|
||||
}
|
||||
|
||||
# Prepare user message with image and instruction
|
||||
user_message = {
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": f"data:image/png;base64,{image_b64}"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": f"Click on {instruction}"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Prepare API call kwargs
|
||||
api_kwargs = {
|
||||
"model": model,
|
||||
"messages": [system_message, user_message],
|
||||
"max_new_tokens": 2056,
|
||||
"temperature": 0,
|
||||
**kwargs
|
||||
}
|
||||
|
||||
# Use liteLLM acompletion
|
||||
response = await litellm.acompletion(**api_kwargs)
|
||||
|
||||
# Extract response text
|
||||
output_text = response.choices[0].message.content
|
||||
# print(output_text)
|
||||
|
||||
# Extract coordinates from pyautogui format
|
||||
coordinates = extract_coordinates_from_pyautogui(output_text)
|
||||
|
||||
return coordinates
|
||||
|
||||
def get_capabilities(self) -> List[AgentCapability]:
|
||||
"""Return the capabilities supported by this agent."""
|
||||
return ["click"]
|
||||
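Review note: `extract_coordinates_from_pyautogui` only matches integer coordinates written exactly as `x=..., y=...`. If the model ever emits extra spaces or decimal values, a slightly more tolerant pattern could be used. The sketch below is one such variation, offered as an assumption about possible output formats rather than documented OpenCUA behavior; `extract_click` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

NUM = r"(\d+(?:\.\d+)?)"
CLICK = re.compile(
    r"pyautogui\.click\(\s*x\s*=\s*" + NUM + r"\s*,\s*y\s*=\s*" + NUM + r"\s*\)"
)


def extract_click(text: Optional[str]) -> Optional[Tuple[int, int]]:
    """Parse pyautogui.click(x=..., y=...) and return integer pixel coordinates."""
    if not text:
        return None  # empty or missing model output
    match = CLICK.search(text)
    if match is None:
        return None
    # int(float(...)) truncates any fractional part toward zero
    return int(float(match.group(1))), int(float(match.group(2)))
```

For example, `extract_click("pyautogui.click(x=1443, y=343)")` yields `(1443, 343)`, the same result the stricter pattern in the diff gives for that input.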
@@ -780,7 +780,7 @@ class UITARSConfig:
         api_kwargs = {
             "model": model,
             "messages": litellm_messages,
-            "max_tokens": 100,
+            "max_tokens": 2056,
             "temperature": 0.0,
             "do_sample": False
         }

@@ -46,6 +46,20 @@ glm45v-hf = [
     "torch",
     "transformers-v4.55.0-GLM-4.5V-preview"
 ]
+opencua-hf = [
+    "accelerate",
+    "torch",
+    "transformers==4.53.0",
+    "tiktoken>=0.11.0",
+    "blobfile>=3.0.0"
+]
+internvl-hf = [
+    "accelerate",
+    "torch",
+    "transformers>=4.55.0",
+    "einops",
+    "timm"
+]
 ui = [
     "gradio>=5.23.3",
     "python-dotenv>=1.0.1",
@@ -61,7 +75,13 @@ all = [
     "mlx-vlm>=0.1.27; sys_platform == 'darwin'",
     "accelerate",
     "torch",
-    "transformers>=4.54.0",
+    "transformers>=4.55.0",
+    # internvl requirements,
+    "einops",
+    "timm",
+    # opencua requirements
+    "tiktoken>=0.11.0",
+    "blobfile>=3.0.0",
     # ui requirements
     "gradio>=5.23.3",
     "python-dotenv>=1.0.1",

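Reviewer note: the two dependency hunks together pin `transformers==4.53.0` for `opencua-hf` while `internvl-hf` and `all` require `transformers>=4.55.0`, so no single version satisfies both specifiers. A minimal sketch of that check follows. The `satisfies` helper is hypothetical and only understands `==` and `>=` on dotted numeric versions, which is a simplification and not a full PEP 440 implementation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '4.53.0' into (4, 53, 0); only plain dotted integers are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a single '==' or '>=' specifier (sketch only)."""
    if spec.startswith("=="):
        return parse_version(version) == parse_version(spec[2:])
    if spec.startswith(">="):
        return parse_version(version) >= parse_version(spec[2:])
    raise ValueError(f"unsupported specifier: {spec!r}")


OPENCUA_PIN = "==4.53.0"  # from the opencua-hf extra
ALL_FLOOR = ">=4.55.0"    # from the internvl-hf and all extras

# The only version allowed by the pin fails the floor.
print(satisfies("4.53.0", OPENCUA_PIN), satisfies("4.53.0", ALL_FLOOR))
```

If the stricter pin is intentional, it may be worth stating next to the extras that `opencua-hf` is meant for its own environment.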
@@ -0,0 +1,162 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Composite Agents with Docker Container Computer\n",
    "\n",
    "This notebook walks you through running a composed GUI agent using a Docker-based Computer and OpenRouter for the grounding model, paired with a planning model.\n",
    "\n",
    "We'll use the model string:\n",
    "\n",
    "- `\"openrouter/z-ai/glm-4.5v+openai/gpt-5-nano\"` (grounding + planning)\n",
    "\n",
    "Grounding (left) generates actionable UI coordinates; planning (right) reasons and drives steps."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "- Docker Desktop or Engine installed and running\n",
    "- An OpenRouter account and API key (https://openrouter.ai/)\n",
    "- (Optional) An OpenAI API key if using `openai/gpt-5-nano` for planning\n",
    "- Python 3.12 environment with `cua-agent` installed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install CUA Agent (and extras as needed)\n",
    "!pip install -q \"cua-agent[all]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare a Docker Computer\n",
    "\n",
    "We'll follow the documented Docker provider flow (see `docs/content/docs/computer-sdk/computers.mdx`).\n",
    "\n",
    "If you don't have the image yet, either pull or build it locally. Run these in a terminal, not inside the notebook:\n",
    "\n",
    "```bash\n",
    "# Option 1: Pull from Docker Hub\n",
    "docker pull trycua/cua-ubuntu:latest\n",
    "\n",
    "# Option 2: Build locally (from repo root)\n",
    "cd libs/kasm\n",
    "docker build -t cua-ubuntu:latest .\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set environment keys\n",
    "\n",
    "- Get an OpenRouter API key at https://openrouter.ai/\n",
    "- If using OpenAI for planning, set your OpenAI key as well\n",
    "- You can enter them below to set them for this notebook session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY') or input('Enter your OPENROUTER_API_KEY: ').strip()\n",
    "os.environ['OPENROUTER_API_KEY'] = OPENROUTER_API_KEY\n",
    "\n",
    "# Optional: if planning model uses OpenAI provider\n",
    "OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or input('(Optional) Enter your OPENAI_API_KEY (press Enter to skip): ').strip()\n",
    "if OPENAI_API_KEY:\n",
    "    os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a Docker Computer and a composed agent\n",
    "\n",
    "This uses the documented Docker provider parameters: `os_type=\"linux\"`, `provider_type=\"docker\"`, plus `image` and `name`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from computer import Computer\n",
    "from agent import ComputerAgent\n",
    "\n",
    "async def main():\n",
    "    # Launch & connect to a Docker container running the Computer Server\n",
    "    async with Computer(\n",
    "        os_type='linux',\n",
    "        provider_type='docker',\n",
    "        image='trycua/cua-ubuntu:latest',\n",
    "        name='my-cua-container'\n",
    "    ) as computer:\n",
    "        agent = ComputerAgent(\n",
    "            model='openrouter/z-ai/glm-4.5v+openai/gpt-5-nano',\n",
    "            tools=[computer],\n",
    "            trajectory_dir='trajectories'  # Save agent trajectory (screenshots, api calls)\n",
    "        )\n",
    "\n",
    "        # Simple task to verify end-to-end\n",
    "        async for _ in agent.run('Open a browser and go to example.com'):\n",
    "            pass\n",
    "\n",
    "asyncio.run(main())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Notes\n",
    "\n",
    "- Grounding (OpenRouter `z-ai/glm-4.5v`) + Planning (OpenAI `gpt-5-nano`) can be swapped for other providers/models.\n",
    "- If you prefer to avoid OpenAI, choose a planning model on OpenRouter and update the model string accordingly.\n",
    "- Be sure the planning model supports `vision` input and the `tools` parameter.\n",
    "- The agent emits normalized Agent Responses across providers."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
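Reviewer note: the notebook describes the model string as grounding on the left of `+` and planning on the right. A minimal sketch of that split is shown here, assuming a single `+` separator. `split_composed_model` is a hypothetical helper for illustration and is not the SDK's own parser.

```python
from typing import Tuple


def split_composed_model(model: str) -> Tuple[str, str]:
    """Split '<grounding>+<planning>' into its two halves (illustrative only)."""
    grounding, separator, planning = model.partition("+")
    if not separator or not grounding or not planning:
        raise ValueError(f"expected '<grounding>+<planning>', got {model!r}")
    return grounding, planning


grounding, planning = split_composed_model(
    "openrouter/z-ai/glm-4.5v+openai/gpt-5-nano"
)
print(grounding)  # openrouter/z-ai/glm-4.5v
print(planning)   # openai/gpt-5-nano
```

Swapping the planner, as the Notes cell suggests, then amounts to changing only the part after the `+`.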
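Reviewer note: the last code cell calls `asyncio.run(main())`. `asyncio.run` raises `RuntimeError` when an event loop is already running in the current thread, which is typically the case inside a Jupyter kernel, so `await main()` at the top level of the cell is usually the form that works in a notebook. A small sketch of a launcher that covers both situations follows. `run_entrypoint` is a hypothetical helper, and the `main` below is a stand-in for the notebook's coroutine.

```python
import asyncio
from typing import Any, Callable, Coroutine


def run_entrypoint(entry: Callable[[], Coroutine[Any, Any, Any]]) -> Any:
    """Run entry() to completion, or schedule it if a loop is already running."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain script): asyncio.run is safe here.
        return asyncio.run(entry())
    # A loop is running (e.g. a notebook kernel): return a Task to await.
    return loop.create_task(entry())


async def main() -> str:
    # Stand-in for the notebook's main(); the real one drives the agent.
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    print(run_entrypoint(main))
```

In a notebook cell the returned task would be awaited (`await run_entrypoint(main)`), while a plain script gets the result directly.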