added moondream3 to docs

Dillon DuPont
2025-10-02 11:03:27 -04:00
parent 0b3c677205
commit b2ddfe2033
3 changed files with 26 additions and 1 deletions

@@ -41,9 +41,10 @@ With the Agent SDK, you can:
|---|---|---|
| `anthropic/claude-sonnet-4-5-20250929` | `huggingface-local/xlangai/OpenCUA-{7B,32B}` | any all-in-one CUA |
| `openai/computer-use-preview` | `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}` | any VLM (using LiteLLM, requires `tools` parameter) |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | any LLM (using LiteLLM, requires `moondream3+` prefix) |
| `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}` | any all-in-one CUA | |
| `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` | | |
| `moondream3+{ui planning}` (supports text-only models) | | |
| `omniparser+{ui planning}` | | |
| `{ui grounding}+{ui planning}` | | |

@@ -23,6 +23,7 @@ Any model that supports `predict_click()` can be used as the grounding component
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UITARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full CU)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
- Moondream3: `moondream3` (requires combination with a LiteLLM vision/text model)
## Supported Planning Models
@@ -83,6 +84,23 @@ async for _ in agent.run("Help me fill out this form with my personal informatio
    pass
```
### Moondream3 + GPT-4o
Use the built-in Moondream3 grounding with any planning model. Moondream3 detects UI elements in the latest screenshot, labels them, and provides a user message listing the detected element names.
```python
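from agent import ComputerAgent
from computer import computer

# Compose Moondream3 (grounding) with GPT-4o (planning) via the "+" separator
agent = ComputerAgent(
    "moondream3+openai/gpt-4o",
    tools=[computer]
)

# Top-level "async for" needs an async context (e.g. a notebook or an async function)
async for _ in agent.run("Close the settings window, then open the Downloads folder"):
    pass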
```
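The model string above follows the `{ui grounding}+{ui planning}` pattern. As a rough illustration only, the sketch below separates the two halves on the first `+`. The helper `split_composed_model` and the `ComposedModel` type are hypothetical names introduced here; they are not part of the Agent SDK, and this document does not show how the SDK actually parses the string.

```python
from typing import NamedTuple, Optional


class ComposedModel(NamedTuple):
    """Hypothetical container for the two halves of a composed model string."""
    grounding: Optional[str]  # None when the string names a single model
    planning: str


def split_composed_model(model: str) -> ComposedModel:
    """Hypothetical helper (not an SDK function): split on the first '+'."""
    grounding, sep, planning = model.partition("+")
    if not sep:
        # No '+' present: treat the whole string as a single model.
        return ComposedModel(None, model)
    if not grounding or not planning:
        raise ValueError(f"expected '<grounding>+<planning>', got {model!r}")
    return ComposedModel(grounding, planning)


parsed = split_composed_model("moondream3+openai/gpt-4o")
print(parsed.grounding)
print(parsed.planning)
```

Under these assumptions, `moondream3` would handle element detection and `openai/gpt-4o` would handle planning, which mirrors the division of labor described for this example.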
## Benefits of Composed Agents
- **Specialized Grounding**: Use models optimized for click prediction accuracy

@@ -45,6 +45,12 @@ OCR-focused set-of-marks model that requires an LLM for click prediction:
- `omniparser` (requires combination with any LiteLLM vision model)
### Moondream3 (Local Grounding)
Moondream3 is a small vision model that can perform UI grounding and click prediction locally.
- `moondream3`
## Usage Examples
```python