Mirror of https://github.com/trycua/computer.git (synced 2026-01-06 05:20:02 -06:00)

Commit: updated model docs

---
description: Combine grounding models with any LLM for computer-use capabilities
---
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.

Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.

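For example, pairing a GTA1 grounding model with Claude for planning looks like the following sketch (the model combination, the import path, and the `computer` tool instance are illustrative assumptions, not fixed requirements):

```python
from agent import ComputerAgent  # assumption: adjust the import to your installation

# "grounding_model+planning_model": GTA1-7B resolves clicks, Claude plans the steps
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer],  # `computer` is an already-configured computer tool instance
)
```
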
## How Composed Agents Work

1. **Planning Phase**: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates
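
Conceptually, the grounding phase boils down to a `predict_click()` call on the element description produced by the planner; a rough sketch (the variable names are illustrative, and the call may need to be awaited depending on the API):

```python
# Output of the planning phase (illustrative): a natural-language target
target_description = "find the login button"

# Grounding phase: the grounding model maps the description to screen coordinates,
# using the composed agent constructed above
coords = agent.predict_click(target_description)  # assumption: may need `await`

# Execution phase would then click at the predicted coordinates
print(coords)  # e.g. (450, 320)
```
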
## Supported Grounding Models

Any model that supports `predict_click()` can be used as the grounding component. See the full list on [Grounding Models](./grounding-models).

- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full computer use)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
- All‑in‑one CUAs such as `claude-3-5-sonnet-20241022` (Anthropic CUA) and `openai/computer-use-preview` (OpenAI CUA)

## Supported Planning Models

Any vision-enabled LiteLLM-compatible model can be used as the planning component:

- Any All‑in‑one CUA (planning-capable). See [All‑in‑one CUAs](./computer-use-agents).
- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc.
- Examples:
  - **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
  - **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
  - **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
  - **Local models**: Any Hugging Face vision-language model

## Usage Examples
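
A minimal composed-agent run, assuming a `computer` tool instance is already set up (the model pairing and the task string are illustrative):

```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
    tools=[computer],
)

async for _ in agent.run("Open Firefox and navigate to github.com"):
    pass
```
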

---
title: All‑in‑one CUA Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---

```python
async for _ in agent.run("Take a screenshot and describe what you see"):
    pass
```

## GLM-4.5V

Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:

```python
async for _ in agent.run("Click on the search bar and type 'hello world'"):
    pass
```

## InternVL 3.5

InternVL 3.5 family:

- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`

```python
agent = ComputerAgent("huggingface-local/OpenGVLab/InternVL3_5-1B", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
    pass
```

## UI-TARS 1.5

Unified vision-language model for computer-use:

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)

```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
    pass
```

---

CUAs also support direct click prediction. See [Grounding Models](./grounding-models) for details on `predict_click()`.

For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

---

These models specialize in UI element grounding and click prediction.

Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
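
For example, with a grounding model loaded, a click prediction might look like this sketch (the element description is illustrative, and the call may need to be awaited depending on the API):

```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

coords = agent.predict_click("the login button")  # assumption: may need `await`
print(f"Click coordinates: {coords}")  # e.g. (450, 320)
```
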
## All Computer-Use Agents

All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All‑in‑one CUAs](./computer-use-agents).

### Anthropic CUAs
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`

### UI-TARS 1.5 (Unified VLM with grounding support)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)

These models are optimized specifically for click prediction and UI element grounding:

### OpenCUA
- `huggingface-local/xlangai/OpenCUA-{7B,32B}`

### GTA1 Family
State-of-the-art grounding models from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`

### Holo 1.5 Family
- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`

### InternVL 3.5 Family
- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`

### OmniParser (OCR)

OCR-focused set-of-marks model that requires an LLM for click prediction:

- `omniparser` (requires combination with any LiteLLM vision model; see the sketch below)

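Because OmniParser only produces OCR/set-of-marks annotations, it is typically paired with a vision LLM via the composed-agent string described in [Composed Agents](./composed-agents); a sketch with an illustrative pairing:

```python
agent = ComputerAgent("omniparser+openai/gpt-4o", tools=[computer])
```
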
## Usage Examples

```python
# ...
print(f"Click coordinates: {coords}")  # (450, 320)

# agent.run("Fill out the form and submit it")
```

---

For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All‑in‑one CUAs](./computer-use-agents).