updated model docs

Dillon DuPont
2025-09-15 16:41:39 -04:00
parent c5bbd4611a
commit a46c276e70
3 changed files with 61 additions and 41 deletions

View File

@@ -5,32 +5,36 @@ description: Combine grounding models with any LLM for computer-use capabilities
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
## How Composed Agents Work
1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
1. **Planning Phase**: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
3. **Execution**: Actions are performed using the predicted coordinates
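For example, a minimal sketch of that flow using the composed format (this assumes a `computer` tool is already configured, as in the usage examples below):
```python
# Sketch only: grounding model + planning model joined by "+".
# `computer` is assumed to be an already configured computer tool.
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer],
)

# The planning model proposes actions like click("find the login button");
# the grounding model turns that description into coordinates before execution.
async for _ in agent.run("Log into the test account"):
    pass
```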
## Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component:
Any model that supports `predict_click()` can be used as the grounding component. See the full list in [Grounding Models](./grounding-models).
- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI-TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full computer use)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
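As a quick check of the grounding side (a sketch only; `computer` is assumed to be configured as in the usage examples below, and the `predict_click()` call follows the pattern in [Grounding Models](./grounding-models)):
```python
# Sketch: a grounding model used on its own for click prediction.
agent = ComputerAgent("huggingface-local/Hcompany/Holo1.5-7B", tools=[computer])
coords = await agent.predict_click("the login button")
print(coords)  # e.g. (450, 320)
```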
## Supported Thinking Models
## Supported Planning Models
Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
Any vision-enabled LiteLLM-compatible model can be used as the planning component:
- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
- Any all-in-one CUA (planning-capable). See [All-in-one CUAs](./computer-use-agents).
- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc.
- Examples:
- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
- **OpenAI**: `openai/gpt-5`, `openai/o3`, `openai/gpt-4o`
- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- **Local models**: Any Hugging Face vision-language model
## Usage Examples

View File

@@ -1,5 +1,5 @@
---
title: Computer-Use Models
title: All-in-one CUA Models
description: Models that support full computer-use agent capabilities with ComputerAgent.run()
---
@@ -36,19 +36,6 @@ async for _ in agent.run("Take a screenshot and describe what you see"):
pass
```
## UI-TARS 1.5
Unified vision-language model for computer-use:
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
pass
```
## GLM-4.5V
Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
@@ -62,6 +49,32 @@ async for _ in agent.run("Click on the search bar and type 'hello world'"):
pass
```
## InternVL 3.5
InternVL 3.5 family:
- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
```python
agent = ComputerAgent("huggingface-local/OpenGVLab/InternVL3_5-1B", tools=[computer])
async for _ in agent.run("Open Firefox and navigate to github.com"):
pass
```
## UI-TARS 1.5
Unified vision-language model for computer use:
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
```python
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
pass
```
---
CUAs also support direct click prediction. See [Grounding Models](./grounding-models) for details on `predict_click()`.
For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).

View File

@@ -7,9 +7,7 @@ These models specialize in UI element grounding and click prediction. They can i
Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
## All Computer-Use Agents
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All-in-one CUAs](./computer-use-agents).
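For instance (a sketch; `computer` is assumed to be an already configured computer tool, and the `predict_click()` call follows the usage examples below):
```python
# Sketch: an all-in-one CUA used purely for grounding.
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
coords = await agent.predict_click("the Save button in the toolbar")
print(coords)  # e.g. (612, 84)
```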
### Anthropic CUAs
@@ -21,7 +19,7 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
### OpenAI CUA Preview
- Computer-use-preview: `computer-use-preview`
### UI-TARS 1.5
### UI-TARS 1.5 (Unified VLM with grounding support)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
@@ -29,18 +27,24 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
These models are optimized specifically for click prediction and UI element grounding:
### OmniParser
### OpenCUA
- `huggingface-local/xlangai/OpenCUA-{7B,32B}`
### GTA1 Family
- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
### Holo 1.5 Family
- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
### InternVL 3.5 Family
- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
### OmniParser (OCR)
OCR-focused set-of-marks model that requires an LLM for click prediction:
- `omniparser` (requires combination with any LiteLLM vision model)
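A sketch of that pairing using the composed `+` syntax from [Composed Agents](./composed-agents) (the planning model here is illustrative; `computer` is assumed to be an already configured computer tool):
```python
# Sketch: OmniParser supplies set-of-marks grounding; the paired vision LLM
# resolves which element to click.
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
coords = await agent.predict_click("the search field")
```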
### GTA1-7B
State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
- `huggingface-local/HelloKKMe/GTA1-7B`
## Usage Examples
```python
@@ -83,7 +87,6 @@ print(f"Click coordinates: {coords}") # (450, 320)
# agent.run("Fill out the form and submit it")
```
---
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All-in-one CUAs](./computer-use-agents).