From a46c276e70063607e851030fac880e75e9cd21a5 Mon Sep 17 00:00:00 2001
From: Dillon DuPont
Date: Mon, 15 Sep 2025 16:41:39 -0400
Subject: [PATCH] updated model docs

---
 .../supported-agents/composed-agents.mdx     | 32 ++++++++-------
 .../supported-agents/computer-use-agents.mdx | 41 ++++++++++++-------
 .../supported-agents/grounding-models.mdx    | 29 +++++++------
 3 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
index 8040d2e5..485074e2 100644
--- a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
@@ -5,32 +5,36 @@ description: Combine grounding models with any LLM for computer-use capabilities
 
 Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.
 
-Use the format `"grounding_model+thinking_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
+Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
 
 ## How Composed Agents Work
 
-1. **Planning Phase**: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
+1. **Planning Phase**: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
 2. **Grounding Phase**: The grounding model converts element descriptions to precise coordinates
 3. **Execution**: Actions are performed using the predicted coordinates
 
 ## Supported Grounding Models
 
-Any model that supports `predict_click()` can be used as the grounding component:
+Any model that supports `predict_click()` can be used as the grounding component. See the full list in [Grounding Models](./grounding-models).
 
-- `omniparser` (OSS set-of-marks model)
-- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
-- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
-- `claude-3-5-sonnet-20241022` (Anthropic CUA)
-- `openai/computer-use-preview` (OpenAI CUA)
+- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
+- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
+- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
+- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full computer use)
+- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
 
-## Supported Thinking Models
+## Supported Planning Models
 
-Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
+Any vision-enabled LiteLLM-compatible model can be used as the planning component:
 
-- **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
-- **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
-- **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
-- **Local models**: Any Hugging Face vision-language model
+- Any All‑in‑one CUA (planning-capable). See [All‑in‑one CUAs](./computer-use-agents).
+- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc.
+- Examples:
+  - **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
+  - **OpenAI**: `openai/gpt-5`, `openai/o3`, `openai/gpt-4o`
+  - **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
+  - **Local models**: Any Hugging Face vision-language model
 
 ## Usage Examples
 
diff --git a/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
index 44ab41d1..b2487a7c 100644
--- a/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/computer-use-agents.mdx
@@ -1,5 +1,5 @@
 ---
-title: Computer-Use Models
+title: All‑in‑one CUA Models
 description: Models that support full computer-use agent capabilities with ComputerAgent.run()
 ---
 
@@ -36,19 +36,6 @@ async for _ in agent.run("Take a screenshot and describe what you see"):
     pass
 ```
 
-## UI-TARS 1.5
-
-Unified vision-language model for computer-use:
-
-- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
-- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
-
-```python
-agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
-async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
-    pass
-```
-
 ## GLM-4.5V
 
 Zhipu AI's GLM-4.5V vision-language model with computer-use capabilities:
@@ -62,6 +49,32 @@ async for _ in agent.run("Click on the search bar and type 'hello world'"):
     pass
 ```
 
+## InternVL 3.5
+
+OpenGVLab's InternVL 3.5 vision-language model family with computer-use capabilities:
+- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+
+```python
+agent = ComputerAgent("huggingface-local/OpenGVLab/InternVL3_5-1B", tools=[computer])
+async for _ in agent.run("Open Firefox and navigate to github.com"):
+    pass
+```
+
+## UI-TARS 1.5
+
+Unified vision-language model for computer-use:
+
+- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
+- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
+
+```python
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+async for _ in agent.run("Open the settings menu and change the theme to dark mode"):
+    pass
+```
+
 ---
 
+All‑in‑one CUAs also support direct click prediction. See [Grounding Models](./grounding-models) for details on `predict_click()`.
+
 For details on agent loop behavior and usage, see [Agent Loops](../agent-loops).
diff --git a/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
index 65d254fe..9270f183 100644
--- a/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx
@@ -7,9 +7,7 @@ These models specialize in UI element grounding and click prediction. They can i
 
 Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements.
 
-## All Computer-Use Agents
-
-All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`:
+All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All‑in‑one CUAs](./computer-use-agents).
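+
+For example, any of the all‑in‑one CUA model strings listed below can be passed to `ComputerAgent` and used for click prediction directly. The snippet is a minimal sketch: it uses the UI-TARS 1.5 model string from this page and assumes an async context and a `computer` tool configured as in the Usage Examples further down:
+
+```python
+agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
+
+# Returns screen coordinates, e.g. (450, 320), for a natural-language element description
+coords = await agent.predict_click("find the login button")
+print(coords)
+```
+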
 ### Anthropic CUAs
 
@@ -21,7 +19,7 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
 ### OpenAI CUA Preview
 - Computer-use-preview: `computer-use-preview`
 
-### UI-TARS 1.5
+### UI-TARS 1.5 (Unified VLM with grounding support)
 - `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
 - `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires TGI endpoint)
 
@@ -29,18 +27,24 @@
 
 These models are optimized specifically for click prediction and UI element grounding:
 
-### OmniParser
+### OpenCUA
+- `huggingface-local/xlangai/OpenCUA-{7B,32B}`
+
+### GTA1 Family
+- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
+
+### Holo 1.5 Family
+- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
+
+### InternVL 3.5 Family
+- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
+
+### OmniParser (OCR)
 
 OCR-focused set-of-marks model that requires an LLM for click prediction:
 
 - `omniparser` (requires combination with any LiteLLM vision model)
 
-### GTA1-7B
-
-State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
-
-- `huggingface-local/HelloKKMe/GTA1-7B`
-
 ## Usage Examples
 
 ```python
@@ -83,7 +87,6 @@ print(f"Click coordinates: {coords}") # (450, 320)
 
 # agent.run("Fill out the form and submit it")
 ```
-
 ---
 
-For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
+For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All‑in‑one CUAs](./computer-use-agents).
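+
+As a concrete example of that combination, a grounding model from this page can be paired with a planning LLM using the `"grounding_model+planning_model"` format described on the Composed Agents page. The sketch below is illustrative: it pairs GTA1-7B (listed above) with Claude Sonnet via LiteLLM, and assumes a `computer` tool configured as in the Usage Examples above:
+
+```python
+# The grounding model resolves element descriptions to coordinates (predict_click),
+# while the planning LLM decides which actions to take.
+agent = ComputerAgent(
+    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
+    tools=[computer],
+)
+
+async for _ in agent.run("Open the settings page and enable dark mode"):
+    pass
+```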