From a46c276e70063607e851030fac880e75e9cd21a5 Mon Sep 17 00:00:00 2001
From: Dillon DuPont
Date: Mon, 15 Sep 2025 16:41:39 -0400
Subject: [PATCH] updated model docs

---
 .../supported-agents/composed-agents.mdx     | 32 ++++++++-------
 .../supported-agents/computer-use-agents.mdx | 41 ++++++++++++-------
 .../supported-agents/grounding-models.mdx    | 29 +++++++------
 3 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
index 8040d2e5..485074e2 100644
--- a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
@@ -5,32 +5,36 @@ description: Combine grounding models with any LLM for computer-use capabilities
 
 Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
 
-Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
+Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
 
 ## How Composed Agents Work
 
-1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
+1. **Planning Phase**: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
 2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
 3. **Execution**: Actions are performed using the predicted coordinates
 
 ## Supported Grounding Models
 
-Any model that supports `predict_click()` can be used as the grounding component:
+Any model that supports `predict_click()` can be used as the grounding component. See the full list in [Grounding Models](./grounding-models).
 
-- `omniparser` (OSS set-of-marks model)
-- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
-- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
-- `claude-3-5-sonnet-20241022` (Anthropic CUA)
-- `openai/computer-use-preview` (OpenAI CUA)
+- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
+- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
+- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
+- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full computer use)
+- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
 
-## Supported Thinking Models
+## Supported Planning Models
 
-Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
+Any vision-enabled LiteLLM-compatible model can be used as the planning component:
 
-- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
-- **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
-- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
-- **Local models**: Any Hugging Face vision-language model
+- Any All‑in‑one CUA (planning-capable). See [All‑in‑one CUAs](./computer-use-agents).
+- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc.
+- Examples:
+  - **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
+  - **OpenAI**: `openai/gpt-5`, `openai/o3`, `openai/gpt-4o`
+  - **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
+  - **Local models**: Any Hugging Face vision-language model
 
 ## Usage Examples
 
diff --git a/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
index 44ab41d1..b2487a7c 100644
--- a/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
@@ -1,5 +1,5 @@
 ---
-title: Computer-Use Models
+title: All‑in‑one CUA Models
 description: Models that support full computer-use agent capabilities with ComputerAgent.run()
 ---
 
@@ -36,19 +36,6 @@ async for _ in agent.run("Take a screenshot and describe what you see"):
     pass
 ```
 
-## UI-TARS 1.5
-
-Unified vision-language model for computer-use:
-
-- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
-- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
-
-```python
-agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
-async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
-    pass
-```
-
 ## GLM-4.5V
 
 Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
@@ -62,6 +49,32 @@ async for _ in agent.run("Click on the search bar and type 'hello world'"):
     pass
 ```
 
+## InternVL 3.5
+
+OpenGVLab's InternVL 3.5 vision-language model family with computer-use capabilities:
+- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+
+```python
+agent = ComputerAgent("huggingface-local/OpenGVLab/InternVL3_5-1B", tools=[computer])
+async for _ in agent.run("Open Firefox and navigate to github.com"):
+    pass
+```
+
+## UI-TARS 1.5
+
+Unified vision-language model for computer-use:
+
+- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
+- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
+
+```python
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
+    pass
+```
+
 ---
 
+All‑in‑one CUAs also support direct click prediction. See [Grounding Models](./grounding-models) for details on `predict_click()`.
+
 For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).
diff --git a/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
index 65d254fe..9270f183 100644
--- a/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
@@ -7,9 +7,7 @@ These models specialize in UI element grounding and click prediction. They can i
 
 Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
 
-## All Computer-Use Agents
-
-All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
+All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All‑in‑one CUAs](./computer-use-agents).
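+
+For example, any of the all‑in‑one CUA model strings listed below can be passed to `ComputerAgent` and used for click prediction directly. The snippet is a minimal sketch: it uses the UI-TARS 1.5 model string from this page and assumes an async context and a `computer` tool configured as in the Usage Examples further down:
+
+```python
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+
+# Returns screen coordinates, e.g. (450, 320), for a natural-language element description
+coords = await agent.predict_click("find the login button")
+print(coords)
+```
+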
 ### Anthropic CUAs
 
@@ -21,7 +19,7 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
 ### OpenAI CUA Preview
 - Computer-use-preview: `computer-use-preview`
 
-### UI-TARS 1.5
+### UI-TARS 1.5 (Unified VLM with grounding support)
 - `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
 - `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
 
@@ -29,18 +27,24 @@
 
 These models are optimized specifically for click prediction and UI element grounding:
 
-### OmniParser
+### OpenCUA
+- `huggingface-local/xlangai/OpenCUA-{7B,32B}`
+
+### GTA1 Family
+- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
+
+### Holo 1.5 Family
+- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
+
+### InternVL 3.5 Family
+- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+
+### OmniParser (OCR)
 
 OCR-focused set-of-marks model that requires an LLM for click prediction:
 
 - `omniparser` (requires combination with any LiteLLM vision model)
 
-### GTA1-7B
-
-State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
-
-- `huggingface-local/HelloKKMe/GTA1-7B`
-
 ## Usage Examples
 
 ```python
@@ -83,7 +87,6 @@ print(f"Click coordinates: {coords}") # (450, 320)
 
 # agent.run("Fill out the form and submit it")
 ```
-
 ---
 
-For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
+For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All‑in‑one CUAs](./computer-use-agents).
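+
+As a concrete example of that combination, a grounding model from this page can be paired with a planning LLM using the `"grounding_model+planning_model"` format described on the Composed Agents page. The sketch below is illustrative: it pairs GTA1-7B (listed above) with Claude Sonnet via LiteLLM, and assumes a `computer` tool configured as in the Usage Examples above:
+
+```python
+# The grounding model resolves element descriptions to coordinates (predict_click),
+# while the planning LLM decides which actions to take.
+agent = ComputerAgent(
+    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
+    tools=[computer],
+)
+
+async for _ in agent.run("Open the settings page and enable dark mode"):
+    pass
+```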