diff --git a/docs/content/docs/agent-sdk/integrations/hud.mdx b/docs/content/docs/agent-sdk/integrations/hud.mdx
index cebd36be..b517121e 100644
--- a/docs/content/docs/agent-sdk/integrations/hud.mdx
+++ b/docs/content/docs/agent-sdk/integrations/hud.mdx
@@ -28,6 +28,8 @@ taskset = TaskSet(tasks=taskset[:10]) # limit to 10 tasks instead of all 370
 # Run benchmark job
 job = await run_job(
     model="openai/computer-use-preview",
+    # model="anthropic/claude-3-5-sonnet-20241022",
+    # model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
     task_or_taskset=taskset,
     job_name="test-computeragent-job",
     max_concurrent_tasks=5,
diff --git a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
index 50160fd8..8040d2e5 100644
--- a/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
+++ b/docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
@@ -28,12 +28,26 @@ Any model that supports `predict_click()` can be used as the grounding component
 Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
 
 - **Anthropic**: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
-- **OpenAI**: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
+- **OpenAI**: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
 - **Google**: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
 - **Local models**: Any Hugging Face vision-language model
 
 ## Usage Examples
 
+### GTA1 + GPT-5
+
+Use OpenAI's GPT-5 for planning with specialized grounding:
+
+```python
+agent = ComputerAgent(
+    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
+    tools=[computer]
+)
+
+async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
+    pass
+```
+
 ### GTA1 + Claude 3.5 Sonnet
 
 Combine state-of-the-art grounding with powerful reasoning:
@@ -51,20 +65,6 @@ async for _ in agent.run("Open Firefox, navigate to github.com, and search for '
 # - GTA1-7B provides precise click coordinates for each UI element
 ```
 
-### GTA1 + Gemini Pro
-
-Use Google's Gemini for planning with specialized grounding:
-
-```python
-agent = ComputerAgent(
-    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
-    tools=[computer]
-)
-
-async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
-    pass
-```
-
 ### UI-TARS + GPT-4o
 
 Combine two different vision models for enhanced capabilities: