mirror of https://github.com/trycua/computer.git
synced 2026-01-06 13:30:06 -06:00
added moondream3 to docs
@@ -41,9 +41,10 @@ With the Agent SDK, you can:

|---|---|---|
| `anthropic/claude-sonnet-4-5-20250929` | `huggingface-local/xlangai/OpenCUA-{7B,32B}` | any all-in-one CUA |
| `openai/computer-use-preview` | `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}` | any VLM (using liteLLM, requires `tools` parameter) |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | any LLM (using liteLLM, requires `moondream3+` prefix) |
| `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}` | any all-in-one CUA | |
| `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` | | |
| `moondream3+{ui planning}` (supports text-only models) | | |
| `omniparser+{ui planning}` | | |
| `{ui grounding}+{ui planning}` | | |
@@ -23,6 +23,7 @@ Any model that supports `predict_click()` can be used as the grounding component

- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full CU)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
- Moondream3: `moondream3` (requires combination with a LiteLLM vision/text model)
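The `predict_click()` contract can be sketched with a minimal stub (hypothetical interface; the SDK's actual method signature may differ):

```python
# Hypothetical sketch of the grounding contract: any object exposing
# predict_click(instruction, image_b64) -> (x, y) can serve as the
# grounding half of a composed agent. A real grounding model would locate
# the described element in the screenshot; this stub returns a fixed point.
class StubGrounder:
    def predict_click(self, instruction: str, image_b64: str) -> tuple[int, int]:
        return (100, 200)  # pixel coordinates (x, y)

grounder = StubGrounder()
x, y = grounder.predict_click("the Submit button", "<base64-encoded screenshot>")
```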
## Supported Planning Models
@@ -83,6 +84,23 @@ async for _ in agent.run("Help me fill out this form with my personal information"):
    pass
```

### Moondream3 + GPT-4o

Use the built-in Moondream3 grounding with any planning model. Moondream3 will detect UI elements on the latest screenshot, label them, and provide a user message listing the detected element names.

```python
from agent import ComputerAgent
from computer import computer

agent = ComputerAgent(
    "moondream3+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Close the settings window, then open the Downloads folder"):
    pass
```
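The element-listing user message that Moondream3 hands to the planner can be illustrated with a small sketch (the exact formatting below is hypothetical, not the SDK's actual output):

```python
def detection_message(element_names: list[str]) -> str:
    # Hypothetical formatting: number each detected element and list its
    # label so a text-only planning model can refer to elements by name.
    lines = [f"[{i}] {name}" for i, name in enumerate(element_names)]
    return "Detected UI elements:\n" + "\n".join(lines)

msg = detection_message(["Close button", "Downloads folder icon"])
```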
## Benefits of Composed Agents

- **Specialized Grounding**: Use models optimized for click prediction accuracy
@@ -45,6 +45,12 @@ OCR-focused set-of-marks model that requires an LLM for click prediction:

- `omniparser` (requires combination with any LiteLLM vision model)

### Moondream3 (Local Grounding)

Moondream3 is a small but powerful model that can perform UI grounding and click prediction.

- `moondream3`

## Usage Examples