mirror of
https://github.com/trycua/computer.git
synced 2026-01-01 19:10:30 -06:00
Merge branch 'main' into feature/computer/add-vm-provider
This commit is contained in:
290 README.md
@@ -5,188 +5,244 @@
|
||||
<img alt="Cua logo" height="150" src="img/logo_black.png">
|
||||
</picture>
|
||||
|
||||
<!-- <h1>Cua</h1> -->
|
||||
|
||||
[Discord](https://discord.com/invite/mVnXXpdE85)
|
||||
</div>
|
||||
|
||||
**TL;DR**: **c/ua** (pronounced "koo-ah", short for Computer-Use Agent) is a framework that enables AI agents to control full operating systems within high-performance, lightweight virtual containers. It delivers up to 97% of native speed on Apple Silicon and works with any vision language model.
|
||||
**c/ua** (pronounced "koo-ah") enables AI agents to control full operating systems in high-performance virtual containers with near-native speed on Apple Silicon.
|
||||
|
||||
## What is c/ua?
|
||||
<div align="center">
|
||||
<video src="https://github.com/user-attachments/assets/c619b4ea-bb8e-4382-860e-f3757e36af20" width="800" controls></video>
|
||||
</div>
|
||||
|
||||
**c/ua** offers two primary capabilities in a single integrated framework:
|
||||
# 🚀 Quick Start
|
||||
|
||||
1. **High-Performance Virtualization** - Create and run macOS/Linux virtual machines on Apple Silicon with near-native performance (up to 97% of native speed) using the **Lume CLI** with Apple's `Virtualization.Framework`.
|
||||
Get started with a Computer-Use Agent UI and a VM with a single command:
|
||||
|
||||
2. **Computer-Use Interface & Agent** - A framework that allows AI systems to observe and control these virtual environments - interacting with applications, browsing the web, writing code, and performing complex workflows.
|
||||
|
||||
## Why Use c/ua?
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/scripts/playground.sh)"
|
||||
```
|
||||
|
||||
- **Security & Isolation**: Run AI agents in fully isolated virtual environments instead of giving them access to your main system
|
||||
- **Performance**: [Near-native performance](https://browser.geekbench.com/v6/cpu/compare/11283746?baseline=11102709) on Apple Silicon
|
||||
- **Flexibility**: Run macOS or Linux environments with the same framework
|
||||
- **Reproducibility**: Create consistent, deterministic environments for AI agent workflows
|
||||
- **LLM Integration**: Built-in support for connecting to various LLM providers
|
||||
|
||||
## System Requirements
|
||||
This script will:
|
||||
- Install Lume CLI for VM management (if needed)
|
||||
- Pull the latest macOS CUA image (if needed)
|
||||
- Set up Python environment and install/update required packages
|
||||
- Launch the Computer-Use Agent UI
|
||||
|
||||
#### Supported [Agent Loops](https://github.com/trycua/cua/blob/main/libs/agent/README.md#agent-loops)
|
||||
- [UITARS-1.5](https://github.com/trycua/cua/blob/main/libs/agent/README.md#agent-loops) - Run locally on Apple Silicon with MLX, or use cloud providers
|
||||
- [OpenAI CUA](https://github.com/trycua/cua/blob/main/libs/agent/README.md#agent-loops) - Use OpenAI's Computer-Use Preview model
|
||||
- [Anthropic CUA](https://github.com/trycua/cua/blob/main/libs/agent/README.md#agent-loops) - Use Anthropic's Computer-Use capabilities
|
||||
- [OmniParser-v2.0](https://github.com/trycua/cua/blob/main/libs/agent/README.md#agent-loops) - Control UI with [Set-of-Marks prompting](https://som-gpt4v.github.io/) using any vision model
|
||||
|
||||
### System Requirements
|
||||
|
||||
- Mac with Apple Silicon (M1/M2/M3/M4 series)
|
||||
- macOS 15 (Sequoia) or newer
|
||||
- Python 3.10+ (required for the Computer, Agent, and MCP libraries). We recommend using Conda (or Anaconda) to create a dedicated Python environment.
|
||||
- Disk space for VM images (30GB+ recommended)
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Option 1: Lume CLI Only (VM Management)
|
||||
If you only need the virtualization capabilities:
|
||||
# 💻 For Developers
|
||||
|
||||
### Step 1: Install Lume CLI
|
||||
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
|
||||
```
|
||||
|
||||
Optionally, if you don't want Lume to run as a background service:
|
||||
Lume CLI manages high-performance macOS/Linux VMs with near-native speed on Apple Silicon.
|
||||
|
||||
### Step 2: Pull the macOS CUA Image
|
||||
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh) --no-background-service"
|
||||
lume pull macos-sequoia-cua:latest
|
||||
```
|
||||
|
||||
**Note:** If you choose this option, you'll need to manually start the Lume API service whenever needed by running `lume serve` in your terminal. This applies to Option 2 after completing step 1.
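For example, starting the API service manually (assuming Lume is already installed) is a single command:

```bash
lume serve
```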
|
||||
The macOS CUA image contains the default Mac apps and the Computer Server for easy automation.
|
||||
|
||||
For Lume usage instructions, refer to the [Lume documentation](./libs/lume/README.md).
|
||||
### Step 3: Install Python SDK
|
||||
|
||||
### Option 2: Full Computer-Use Agent Capabilities
|
||||
If you want to use AI agents with virtualized environments:
|
||||
```bash
|
||||
pip install cua-computer "cua-agent[all]"
|
||||
```
|
||||
|
||||
1. Install the Lume CLI:
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
|
||||
```
|
||||
Alternatively, see the [Developer Guide](./docs/Developer-Guide.md) for building from source.
|
||||
|
||||
2. Pull the latest macOS CUA image:
|
||||
```bash
|
||||
lume pull macos-sequoia-cua:latest
|
||||
```
|
||||
### Step 4: Use in Your Code
|
||||
|
||||
3. Install the Python libraries:
|
||||
```bash
|
||||
pip install cua-computer cua-agent[all]
|
||||
```
|
||||
```python
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent, LLM
|
||||
|
||||
4. Use the libraries in your Python code:
|
||||
```python
|
||||
from computer import Computer
|
||||
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
|
||||
async def main():
|
||||
# Start a local macOS VM with a 1024x768 display
|
||||
async with Computer(os_type="macos", display="1024x768") as computer:
|
||||
|
||||
async with Computer(os_type="macos", display="1024x768") as macos_computer:
|
||||
agent = ComputerAgent(
|
||||
computer=macos_computer,
|
||||
loop=AgentLoop.OPENAI, # or AgentLoop.ANTHROPIC, or AgentLoop.UITARS, or AgentLoop.OMNI
|
||||
model=LLM(provider=LLMProvider.OPENAI) # or LLM(provider=LLMProvider.ANTHROPIC)
|
||||
)
|
||||
# Example: Direct control of a macOS VM with Computer
|
||||
await computer.interface.left_click(100, 200)
|
||||
await computer.interface.type_text("Hello, world!")
|
||||
screenshot_bytes = await computer.interface.screenshot()
|
||||
|
||||
# Example: Create and run an agent locally using mlx-community/UI-TARS-1.5-7B-6bit
|
||||
agent = ComputerAgent(
|
||||
computer=computer,
|
||||
loop="UITARS",
|
||||
model=LLM(provider="MLXVLM", name="mlx-community/UI-TARS-1.5-7B-6bit")
|
||||
)
|
||||
await agent.run("Find the trycua/cua repository on GitHub and follow the quick start guide")
|
||||
|
||||
tasks = [
|
||||
"Look for a repository named trycua/cua on GitHub.",
|
||||
]
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
for task in tasks:
|
||||
async for result in agent.run(task):
|
||||
print(result)
|
||||
```
|
||||
|
||||
Explore the [Agent Notebook](./notebooks/) for a ready-to-run example.
|
||||
For ready-to-use examples, check out our [Notebooks](./notebooks/) collection.
|
||||
|
||||
5. Optionally, you can use the Agent with a Gradio UI:
|
||||
### Lume CLI Reference
|
||||
|
||||
```python
|
||||
from utils import load_dotenv_files
|
||||
load_dotenv_files()
|
||||
|
||||
from agent.ui.gradio.app import create_gradio_ui
|
||||
|
||||
app = create_gradio_ui()
|
||||
app.launch(share=False)
|
||||
```
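If this snippet is saved as a standalone script, launching it could look like the following (the filename is illustrative; the `[ui]` extra provides the Gradio dependencies):

```bash
pip install "cua-agent[ui]"   # Gradio UI extra
python launch_ui.py           # hypothetical filename for the snippet above
```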
|
||||
```bash
|
||||
# Install Lume CLI
|
||||
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash
|
||||
|
||||
### Option 3: Build from Source (Nightly)
|
||||
If you want to contribute to the project or need the latest nightly features:
|
||||
# List all VMs
|
||||
lume ls
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/trycua/cua.git
|
||||
cd cua
|
||||
|
||||
# Open the project in VSCode
|
||||
code ./.vscode/py.code-workspace
|
||||
# Pull a VM image
|
||||
lume pull macos-sequoia-cua:latest
|
||||
|
||||
# Build the project
|
||||
./scripts/build.sh
|
||||
```
|
||||
|
||||
See our [Developer-Guide](./docs/Developer-Guide.md) for more information.
|
||||
# Create a new VM
|
||||
lume create my-vm --os macos --cpu 4 --memory 8GB --disk-size 50GB
|
||||
|
||||
## Monorepo Libraries
|
||||
# Run a VM (creates and starts if it doesn't exist)
|
||||
lume run macos-sequoia-cua:latest
|
||||
|
||||
| Library | Description | Installation | Version |
|
||||
|---------|-------------|--------------|---------|
|
||||
| [**Lume**](./libs/lume/README.md) | CLI for running macOS/Linux VMs with near-native performance using Apple's `Virtualization.Framework`. | [](https://github.com/trycua/cua/releases/latest/download/lume.pkg.tar.gz) | [](https://github.com/trycua/cua/releases) |
|
||||
| [**Computer**](./libs/computer/README.md) | Computer-Use Interface (CUI) framework for interacting with macOS/Linux sandboxes | `pip install cua-computer` | [](https://pypi.org/project/cua-computer/) |
|
||||
| [**Agent**](./libs/agent/README.md) | Computer-Use Agent (CUA) framework for running agentic workflows in macOS/Linux dedicated sandboxes | `pip install cua-agent` | [](https://pypi.org/project/cua-agent/) |
|
||||
# Stop a VM
|
||||
lume stop macos-sequoia-cua_latest
|
||||
|
||||
## Docs
|
||||
# Delete a VM
|
||||
lume delete macos-sequoia-cua_latest
|
||||
```
|
||||
|
||||
For the best onboarding experience with the packages in this monorepo, start with the [Computer](./libs/computer/README.md) documentation to learn the core functionality of the Computer sandbox, then explore the [Agent](./libs/agent/README.md) documentation to understand Cua's AI agent capabilities, and finally work through the Notebook examples.
|
||||
For advanced container-like virtualization, check out [Lumier](./libs/lumier/README.md) - a Docker interface for macOS and Linux VMs.
|
||||
|
||||
- [Lume](./libs/lume/README.md)
|
||||
- [Computer](./libs/computer/README.md)
|
||||
- [Agent](./libs/agent/README.md)
|
||||
- [Notebooks](./notebooks/)
|
||||
## Resources
|
||||
|
||||
- [How to use the MCP Server with Claude Desktop or other MCP clients](./libs/mcp-server/README.md) - One of the easiest ways to get started with C/ua
|
||||
- [How to use OpenAI Computer-Use, Anthropic, OmniParser, or UI-TARS for your Computer-Use Agent](./libs/agent/README.md)
|
||||
- [How to use Lume CLI for managing desktops](./libs/lume/README.md)
|
||||
- [Training Computer-Use Models: Collecting Human Trajectories with C/ua (Part 1)](https://www.trycua.com/blog/training-computer-use-models-trajectories-1)
|
||||
- [Build Your Own Operator on macOS (Part 1)](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)
|
||||
|
||||
## Modules
|
||||
|
||||
| Module | Description | Installation |
|
||||
|--------|-------------|---------------|
|
||||
| [**Lume**](./libs/lume/README.md) | VM management for macOS/Linux using Apple's Virtualization.Framework | `curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh \| bash` |
|
||||
| [**Computer**](./libs/computer/README.md) | Interface for controlling virtual machines | `pip install cua-computer` |
|
||||
| [**Agent**](./libs/agent/README.md) | AI agent framework for automating tasks | `pip install cua-agent` |
|
||||
| [**MCP Server**](./libs/mcp-server/README.md) | MCP server for using CUA with Claude Desktop | `pip install cua-mcp-server` |
|
||||
| [**SOM**](./libs/som/README.md) | Set-of-Marks library for Agent | `pip install cua-som` |
|
||||
| [**PyLume**](./libs/pylume/README.md) | Python bindings for Lume | `pip install pylume` |
|
||||
| [**Computer Server**](./libs/computer-server/README.md) | Server component for Computer | `pip install cua-computer-server` |
|
||||
| [**Core**](./libs/core/README.md) | Core utilities | `pip install cua-core` |
|
||||
|
||||
## Computer Interface Reference
|
||||
|
||||
For complete examples, see [computer_examples.py](./examples/computer_examples.py) or [computer_nb.ipynb](./notebooks/computer_nb.ipynb)
|
||||
|
||||
```python
|
||||
# Mouse Actions
|
||||
await computer.interface.left_click(x, y) # Left click at coordinates
|
||||
await computer.interface.right_click(x, y) # Right click at coordinates
|
||||
await computer.interface.double_click(x, y) # Double click at coordinates
|
||||
await computer.interface.move_cursor(x, y) # Move cursor to coordinates
|
||||
await computer.interface.drag_to(x, y, duration) # Drag to coordinates
|
||||
await computer.interface.get_cursor_position() # Get current cursor position
|
||||
|
||||
# Keyboard Actions
|
||||
await computer.interface.type_text("Hello") # Type text
|
||||
await computer.interface.press_key("enter") # Press a single key
|
||||
await computer.interface.hotkey("command", "c") # Press key combination
|
||||
|
||||
# Screen Actions
|
||||
await computer.interface.screenshot() # Take a screenshot
|
||||
await computer.interface.get_screen_size() # Get screen dimensions
|
||||
|
||||
# Clipboard Actions
|
||||
await computer.interface.set_clipboard(text) # Set clipboard content
|
||||
await computer.interface.copy_to_clipboard() # Get clipboard content
|
||||
|
||||
# File System Operations
|
||||
await computer.interface.file_exists(path) # Check if file exists
|
||||
await computer.interface.directory_exists(path) # Check if directory exists
|
||||
await computer.interface.run_command(cmd) # Run shell command
|
||||
|
||||
# Accessibility
|
||||
await computer.interface.get_accessibility_tree() # Get accessibility tree
|
||||
```
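As a rough end-to-end sketch, assuming `screenshot()` returns raw image bytes as in the quick start above, these interface calls can be combined inside the same `async with Computer(...)` block:

```python
import asyncio

from computer import Computer

async def main():
    # Start a local macOS VM and drive it through the interface methods listed above
    async with Computer(os_type="macos", display="1024x768") as computer:
        await computer.interface.left_click(100, 200)        # click at fixed coordinates
        await computer.interface.type_text("Hello, world!")  # type into the focused element
        screenshot_bytes = await computer.interface.screenshot()
        with open("screenshot.png", "wb") as f:              # PNG output is an assumption
            f.write(screenshot_bytes)

asyncio.run(main())
```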
|
||||
|
||||
## ComputerAgent Reference
|
||||
|
||||
For complete examples, see [agent_examples.py](./examples/agent_examples.py) or [agent_nb.ipynb](./notebooks/agent_nb.ipynb)
|
||||
|
||||
```python
|
||||
# Import necessary components
|
||||
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
|
||||
|
||||
# UI-TARS-1.5 agent for local execution with MLX
|
||||
ComputerAgent(loop=AgentLoop.UITARS, model=LLM(provider=LLMProvider.MLXVLM, name="mlx-community/UI-TARS-1.5-7B-6bit"))
|
||||
# OpenAI Computer-Use agent using OPENAI_API_KEY
|
||||
ComputerAgent(loop=AgentLoop.OPENAI, model=LLM(provider=LLMProvider.OPENAI, name="computer-use-preview"))
|
||||
# Anthropic Claude agent using ANTHROPIC_API_KEY
|
||||
ComputerAgent(loop=AgentLoop.ANTHROPIC, model=LLM(provider=LLMProvider.ANTHROPIC))
|
||||
|
||||
# OmniParser loop for UI control using Set-of-Marks (SOM) prompting and any vision LLM
|
||||
ComputerAgent(loop=AgentLoop.OMNI, model=LLM(provider=LLMProvider.OLLAMA, name="gemma3:12b-it-q4_K_M"))
|
||||
# OpenRouter example using OAICOMPAT provider
|
||||
ComputerAgent(
|
||||
loop=AgentLoop.OMNI,
|
||||
model=LLM(
|
||||
provider=LLMProvider.OAICOMPAT,
|
||||
name="openai/gpt-4o-mini",
|
||||
provider_base_url="https://openrouter.ai/api/v1"
|
||||
),
|
||||
api_key="your-openrouter-api-key"
|
||||
)
|
||||
```
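As a minimal sketch, assuming a local Ollama install with the `gemma3:12b-it-q4_K_M` model already pulled, any of these configurations can be attached to a `Computer` instance and consumed as a stream of responses:

```python
import asyncio

from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider

async def main():
    async with Computer(os_type="macos", display="1024x768") as macos_computer:
        agent = ComputerAgent(
            computer=macos_computer,
            loop=AgentLoop.OMNI,
            model=LLM(provider=LLMProvider.OLLAMA, name="gemma3:12b-it-q4_K_M"),
        )
        # agent.run(...) yields responses as they arrive
        async for result in agent.run("Find the trycua/cua repository on GitHub"):
            print(result)

asyncio.run(main())
```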
|
||||
|
||||
## Demos
|
||||
|
||||
Demos of the Computer-Use Agent in action. Share your most impressive demos in Cua's [Discord community](https://discord.com/invite/mVnXXpdE85)!
|
||||
Check out these demos of the Computer-Use Agent in action:
|
||||
|
||||
<details open>
|
||||
<summary><b>MCP Server: Work with Claude Desktop and Tableau </b></summary>
|
||||
<summary><b>MCP Server: Work with Claude Desktop and Tableau</b></summary>
|
||||
<br>
|
||||
<div align="center">
|
||||
<video src="https://github.com/user-attachments/assets/9f573547-5149-493e-9a72-396f3cff29df
|
||||
" width="800" controls></video>
|
||||
<video src="https://github.com/user-attachments/assets/9f573547-5149-493e-9a72-396f3cff29df" width="800" controls></video>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
<details open>
|
||||
<summary><b>AI-Gradio: multi-app workflow requiring browser, VS Code and terminal access</b></summary>
|
||||
<details>
|
||||
<summary><b>AI-Gradio: Multi-app workflow with browser, VS Code and terminal</b></summary>
|
||||
<br>
|
||||
<div align="center">
|
||||
<video src="https://github.com/user-attachments/assets/723a115d-1a07-4c8e-b517-88fbdf53ed0f" width="800" controls></video>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details open>
|
||||
<details>
|
||||
<summary><b>Notebook: Fix GitHub issue in Cursor</b></summary>
|
||||
<br>
|
||||
<div align="center">
|
||||
<video src="https://github.com/user-attachments/assets/f67f0107-a1e1-46dc-aa9f-0146eb077077" width="800" controls></video>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
## Accessory Libraries
|
||||
## Community
|
||||
|
||||
| Library | Description | Installation | Version |
|
||||
|---------|-------------|--------------|---------|
|
||||
| [**Core**](./libs/core/README.md) | Core functionality and utilities used by other Cua packages | `pip install cua-core` | [](https://pypi.org/project/cua-core/) |
|
||||
| [**PyLume**](./libs/pylume/README.md) | Python bindings for Lume | `pip install pylume` | [](https://pypi.org/project/pylume/) |
|
||||
| [**Computer Server**](./libs/computer-server/README.md) | Server component for the Computer-Use Interface (CUI) framework | `pip install cua-computer-server` | [](https://pypi.org/project/cua-computer-server/) |
|
||||
| [**SOM**](./libs/som/README.md) | Set-of-Marks library for Agent | `pip install cua-som` | [](https://pypi.org/project/cua-som/) |
|
||||
|
||||
## Contributing
|
||||
|
||||
We welcome and greatly appreciate contributions to Cua! Whether you're improving documentation, adding new features, fixing bugs, or adding new VM images, your efforts help make Cua better for everyone. For detailed instructions on how to contribute, please refer to our [Contributing Guidelines](CONTRIBUTING.md).
|
||||
|
||||
Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas or get assistance.
|
||||
Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas, get assistance, or share your demos!
|
||||
|
||||
## License
|
||||
|
||||
@@ -194,11 +250,17 @@ Cua is open-sourced under the MIT License - see the [LICENSE](LICENSE) file for
|
||||
|
||||
Microsoft's OmniParser, which is used in this project, is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0) - see the [OmniParser LICENSE](https://github.com/microsoft/OmniParser/blob/master/LICENSE) file for details.
|
||||
|
||||
## Contributing
|
||||
|
||||
We welcome contributions to CUA! Please refer to our [Contributing Guidelines](CONTRIBUTING.md) for details.
|
||||
|
||||
## Trademarks
|
||||
|
||||
Apple, macOS, and Apple Silicon are trademarks of Apple Inc. Ubuntu and Canonical are registered trademarks of Canonical Ltd. Microsoft is a registered trademark of Microsoft Corporation. This project is not affiliated with, endorsed by, or sponsored by Apple Inc., Canonical Ltd., or Microsoft Corporation.
|
||||
|
||||
## Stargazers over time
|
||||
## Stargazers
|
||||
|
||||
Thank you to all our supporters!
|
||||
|
||||
[](https://starchart.cc/trycua/cua)
|
||||
|
||||
|
||||
@@ -36,6 +36,7 @@ async def run_agent_example():
|
||||
# model=LLM(provider=LLMProvider.OPENAI, name="gpt-4o"),
|
||||
# model=LLM(provider=LLMProvider.ANTHROPIC, name="claude-3-7-sonnet-20250219"),
|
||||
# model=LLM(provider=LLMProvider.OLLAMA, name="gemma3:4b-it-q4_K_M"),
|
||||
# model=LLM(provider=LLMProvider.MLXVLM, name="mlx-community/UI-TARS-1.5-7B-4bit"),
|
||||
model=LLM(
|
||||
provider=LLMProvider.OAICOMPAT,
|
||||
name="gemma-3-12b-it",
|
||||
|
||||
@@ -34,6 +34,10 @@ pip install "cua-agent[anthropic]" # Anthropic Cua Loop
|
||||
pip install "cua-agent[uitars]" # UI-Tars support
|
||||
pip install "cua-agent[omni]" # Cua Loop based on OmniParser (includes Ollama for local models)
|
||||
pip install "cua-agent[ui]" # Gradio UI for the agent
|
||||
|
||||
# For local UI-TARS with MLX support, you need to manually install mlx-vlm:
|
||||
pip install "cua-agent[uitars-mlx]"
|
||||
pip install git+https://github.com/ddupont808/mlx-vlm.git@stable/fix/qwen2-position-id # PR: https://github.com/Blaizzy/mlx-vlm/pull/349
|
||||
```
|
||||
|
||||
## Run
|
||||
@@ -136,7 +140,32 @@ The Gradio UI provides:
|
||||
|
||||
### Using UI-TARS
|
||||
|
||||
You can use UI-TARS by first following the [deployment guide](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md). This will give you a provider URL like this: `https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1` which you can use in the gradio UI.
|
||||
The UI-TARS models are available in two forms:
|
||||
|
||||
1. **MLX UI-TARS models** (Default): These models run locally using the MLXVLM provider
|
||||
- `mlx-community/UI-TARS-1.5-7B-4bit` (default) - 4-bit quantized version
|
||||
- `mlx-community/UI-TARS-1.5-7B-6bit` - 6-bit quantized version for higher quality
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
computer=macos_computer,
|
||||
loop=AgentLoop.UITARS,
|
||||
model=LLM(provider=LLMProvider.MLXVLM, name="mlx-community/UI-TARS-1.5-7B-4bit")
|
||||
)
|
||||
```
|
||||
|
||||
2. **OpenAI-compatible UI-TARS**: For using the original ByteDance model
|
||||
- If you want to use the original ByteDance UI-TARS model via an OpenAI-compatible API, follow the [deployment guide](https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md)
|
||||
- This will give you a provider URL like `https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1` which you can use in the code or Gradio UI:
|
||||
|
||||
```python
|
||||
agent = ComputerAgent(
|
||||
computer=macos_computer,
|
||||
loop=AgentLoop.UITARS,
|
||||
model=LLM(provider=LLMProvider.OAICOMPAT, name="tgi",
|
||||
provider_base_url="https://**************.us-east-1.aws.endpoints.huggingface.cloud/v1")
|
||||
)
|
||||
```
|
||||
|
||||
## Agent Loops
|
||||
|
||||
@@ -146,7 +175,7 @@ The `cua-agent` package provides three agent loops variations, based on differen
|
||||
|:-----------|:-----------------|:------------|:-------------|
|
||||
| `AgentLoop.OPENAI` | • `computer_use_preview` | Use OpenAI Operator CUA model | Not Required |
|
||||
| `AgentLoop.ANTHROPIC` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219` | Use Anthropic Computer-Use | Not Required |
|
||||
| `AgentLoop.UITARS` | • `ByteDance-Seed/UI-TARS-1.5-7B` | Uses ByteDance's UI-TARS 1.5 model | Not Required |
|
||||
| `AgentLoop.UITARS` | • `mlx-community/UI-TARS-1.5-7B-4bit` (default)<br>• `mlx-community/UI-TARS-1.5-7B-6bit`<br>• `ByteDance-Seed/UI-TARS-1.5-7B` (via openAI-compatible endpoint) | Uses UI-TARS models with MLXVLM (default) or OAICOMPAT providers | Not Required |
|
||||
| `AgentLoop.OMNI` | • `claude-3-5-sonnet-20240620`<br>• `claude-3-7-sonnet-20250219`<br>• `gpt-4.5-preview`<br>• `gpt-4o`<br>• `gpt-4`<br>• `phi4`<br>• `phi4-mini`<br>• `gemma3`<br>• `...`<br>• `Any Ollama or OpenAI-compatible model` | Use OmniParser for element pixel-detection (SoM) and any VLMs for UI Grounding and Reasoning | OmniParser |
|
||||
|
||||
## AgentResponse
|
||||
|
||||
@@ -131,6 +131,15 @@ class BaseLoop(ABC):
|
||||
An async generator that yields agent responses
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@abstractmethod
|
||||
async def cancel(self) -> None:
|
||||
"""Cancel the currently running agent loop task.
|
||||
|
||||
This method should stop any ongoing processing in the agent loop
|
||||
and clean up resources appropriately.
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
###########################################
|
||||
# EXPERIMENT AND TRAJECTORY MANAGEMENT
|
||||
|
||||
@@ -116,6 +116,7 @@ class LoopFactory:
|
||||
base_dir=trajectory_dir,
|
||||
only_n_most_recent_images=only_n_most_recent_images,
|
||||
provider_base_url=provider_base_url,
|
||||
provider=provider,
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Unsupported loop type: {loop_type}")
|
||||
|
||||
@@ -8,6 +8,7 @@ DEFAULT_MODELS = {
|
||||
LLMProvider.ANTHROPIC: "claude-3-7-sonnet-20250219",
|
||||
LLMProvider.OLLAMA: "gemma3:4b-it-q4_K_M",
|
||||
LLMProvider.OAICOMPAT: "Qwen2.5-VL-7B-Instruct",
|
||||
LLMProvider.MLXVLM: "mlx-community/UI-TARS-1.5-7B-4bit",
|
||||
}
|
||||
|
||||
# Map providers to their environment variable names
|
||||
@@ -16,4 +17,5 @@ ENV_VARS = {
|
||||
LLMProvider.ANTHROPIC: "ANTHROPIC_API_KEY",
|
||||
LLMProvider.OLLAMA: "none",
|
||||
LLMProvider.OAICOMPAT: "none", # OpenAI-compatible API typically doesn't require an API key
|
||||
LLMProvider.MLXVLM: "none", # MLX VLM typically doesn't require an API key
|
||||
}
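A minimal sketch of how these defaults are presumably consumed: constructing an `LLM` without an explicit `name` should fall back to the `DEFAULT_MODELS` entry for the chosen provider (usage shown is an assumption based on the mappings above, not taken from this diff):

```python
from agent import LLM, LLMProvider

# No explicit name: expected to resolve to the provider's DEFAULT_MODELS entry,
# e.g. "mlx-community/UI-TARS-1.5-7B-4bit" for MLXVLM (assumption).
local_model = LLM(provider=LLMProvider.MLXVLM)
```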
|
||||
|
||||
@@ -23,6 +23,7 @@ class LLMProvider(StrEnum):
|
||||
OPENAI = "openai"
|
||||
OLLAMA = "ollama"
|
||||
OAICOMPAT = "oaicompat"
|
||||
MLXVLM = "mlxvlm"
|
||||
|
||||
|
||||
@dataclass
|
||||
|
||||
@@ -101,6 +101,7 @@ class AnthropicLoop(BaseLoop):
|
||||
self.tool_manager = None
|
||||
self.callback_manager = None
|
||||
self.queue = asyncio.Queue() # Initialize queue
|
||||
self.loop_task = None # Store the loop task for cancellation
|
||||
|
||||
# Initialize handlers
|
||||
self.api_handler = AnthropicAPIHandler(self)
|
||||
@@ -169,7 +170,7 @@ class AnthropicLoop(BaseLoop):
|
||||
logger.info("Client initialized successfully")
|
||||
|
||||
# Start loop in background task
|
||||
loop_task = asyncio.create_task(self._run_loop(queue, messages))
|
||||
self.loop_task = asyncio.create_task(self._run_loop(queue, messages))
|
||||
|
||||
# Process and yield messages as they arrive
|
||||
while True:
|
||||
@@ -184,7 +185,7 @@ class AnthropicLoop(BaseLoop):
|
||||
continue
|
||||
|
||||
# Wait for loop to complete
|
||||
await loop_task
|
||||
await self.loop_task
|
||||
|
||||
# Send completion message
|
||||
yield {
|
||||
@@ -200,6 +201,31 @@ class AnthropicLoop(BaseLoop):
|
||||
"content": f"Error: {str(e)}",
|
||||
"metadata": {"title": "❌ Error"},
|
||||
}
|
||||
|
||||
async def cancel(self) -> None:
|
||||
"""Cancel the currently running agent loop task.
|
||||
|
||||
This method stops the ongoing processing in the agent loop
|
||||
by cancelling the loop_task if it exists and is running.
|
||||
"""
|
||||
if self.loop_task and not self.loop_task.done():
|
||||
logger.info("Cancelling Anthropic loop task")
|
||||
self.loop_task.cancel()
|
||||
try:
|
||||
# Wait for the task to be cancelled with a timeout
|
||||
await asyncio.wait_for(self.loop_task, timeout=2.0)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning("Timeout while waiting for loop task to cancel")
|
||||
except asyncio.CancelledError:
|
||||
logger.info("Loop task cancelled successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Error while cancelling loop task: {str(e)}")
|
||||
finally:
|
||||
# Put None in the queue to signal any waiting consumers to stop
|
||||
await self.queue.put(None)
|
||||
logger.info("Anthropic loop task cancelled")
|
||||
else:
|
||||
logger.info("No active Anthropic loop task to cancel")
|
||||
|
||||
###########################################
|
||||
# AGENT LOOP IMPLEMENTATION
|
||||
|
||||
@@ -105,6 +105,7 @@ class OmniLoop(BaseLoop):
|
||||
# Set API client attributes
|
||||
self.client = None
|
||||
self.retry_count = 0
|
||||
self.loop_task = None # Store the loop task for cancellation
|
||||
|
||||
# Initialize handlers
|
||||
self.api_handler = OmniAPIHandler(loop=self)
|
||||
@@ -580,10 +581,55 @@ class OmniLoop(BaseLoop):
|
||||
Yields:
|
||||
Agent response format
|
||||
"""
|
||||
# Initialize the message manager with the provided messages
|
||||
self.message_manager.messages = messages.copy()
|
||||
logger.info(f"Starting OmniLoop run with {len(self.message_manager.messages)} messages")
|
||||
try:
|
||||
logger.info(f"Starting OmniLoop run with {len(messages)} messages")
|
||||
|
||||
# Initialize the message manager with the provided messages
|
||||
self.message_manager.messages = messages.copy()
|
||||
|
||||
# Create queue for response streaming
|
||||
queue = asyncio.Queue()
|
||||
|
||||
# Start loop in background task
|
||||
self.loop_task = asyncio.create_task(self._run_loop(queue, messages))
|
||||
|
||||
# Process and yield messages as they arrive
|
||||
while True:
|
||||
try:
|
||||
item = await queue.get()
|
||||
if item is None: # Stop signal
|
||||
break
|
||||
yield item
|
||||
queue.task_done()
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing queue item: {str(e)}")
|
||||
continue
|
||||
|
||||
# Wait for loop to complete
|
||||
await self.loop_task
|
||||
|
||||
# Send completion message
|
||||
yield {
|
||||
"role": "assistant",
|
||||
"content": "Task completed successfully.",
|
||||
"metadata": {"title": "✅ Complete"},
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in run method: {str(e)}")
|
||||
yield {
|
||||
"role": "assistant",
|
||||
"content": f"Error: {str(e)}",
|
||||
"metadata": {"title": "❌ Error"},
|
||||
}
|
||||
|
||||
async def _run_loop(self, queue: asyncio.Queue, messages: List[Dict[str, Any]]) -> None:
|
||||
"""Internal method to run the agent loop with provided messages.
|
||||
|
||||
Args:
|
||||
queue: Queue to put responses into
|
||||
messages: List of messages in standard OpenAI format
|
||||
"""
|
||||
# Continue running until explicitly told to stop
|
||||
running = True
|
||||
turn_created = False
|
||||
@@ -673,8 +719,8 @@ class OmniLoop(BaseLoop):
|
||||
# Log standardized response for ease of parsing
|
||||
self._log_api_call("agent_response", request=None, response=openai_compatible_response)
|
||||
|
||||
# Yield the response to the caller
|
||||
yield openai_compatible_response
|
||||
# Put the response in the queue
|
||||
await queue.put(openai_compatible_response)
|
||||
|
||||
# Check if we should continue this conversation
|
||||
running = should_continue
|
||||
@@ -688,20 +734,47 @@ class OmniLoop(BaseLoop):
|
||||
|
||||
except Exception as e:
|
||||
attempt += 1
|
||||
error_msg = f"Error in run method (attempt {attempt}/{max_attempts}): {str(e)}"
|
||||
error_msg = f"Error in _run_loop method (attempt {attempt}/{max_attempts}): {str(e)}"
|
||||
logger.error(error_msg)
|
||||
|
||||
# If this is our last attempt, provide more info about the error
|
||||
if attempt >= max_attempts:
|
||||
logger.error(f"Maximum retry attempts reached. Last error was: {str(e)}")
|
||||
|
||||
yield {
|
||||
"error": str(e),
|
||||
await queue.put({
|
||||
"role": "assistant",
|
||||
"content": f"Error: {str(e)}",
|
||||
"metadata": {"title": "❌ Error"},
|
||||
}
|
||||
})
|
||||
|
||||
# Create a brief delay before retrying
|
||||
await asyncio.sleep(1)
|
||||
finally:
|
||||
# Signal that we're done
|
||||
await queue.put(None)
|
||||
|
||||
async def cancel(self) -> None:
|
||||
"""Cancel the currently running agent loop task.
|
||||
|
||||
This method stops the ongoing processing in the agent loop
|
||||
by cancelling the loop_task if it exists and is running.
|
||||
"""
|
||||
if self.loop_task and not self.loop_task.done():
|
||||
logger.info("Cancelling Omni loop task")
|
||||
self.loop_task.cancel()
|
||||
try:
|
||||
# Wait for the task to be cancelled with a timeout
|
||||
await asyncio.wait_for(self.loop_task, timeout=2.0)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning("Timeout while waiting for loop task to cancel")
|
||||
except asyncio.CancelledError:
|
||||
logger.info("Loop task cancelled successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Error while cancelling loop task: {str(e)}")
|
||||
finally:
|
||||
logger.info("Omni loop task cancelled")
|
||||
else:
|
||||
logger.info("No active Omni loop task to cancel")
|
||||
|
||||
async def process_model_response(self, response_text: str) -> Optional[Dict[str, Any]]:
|
||||
"""Process model response to extract tool calls.
|
||||
|
||||
@@ -87,6 +87,7 @@ class OpenAILoop(BaseLoop):
|
||||
self.acknowledge_safety_check_callback = acknowledge_safety_check_callback
|
||||
self.queue = asyncio.Queue() # Initialize queue
|
||||
self.last_response_id = None # Store the last response ID across runs
|
||||
self.loop_task = None # Store the loop task for cancellation
|
||||
|
||||
# Initialize handlers
|
||||
self.api_handler = OpenAIAPIHandler(self)
|
||||
@@ -132,28 +133,28 @@ class OpenAILoop(BaseLoop):
|
||||
logger.info("Starting OpenAI loop run")
|
||||
|
||||
# Create queue for response streaming
|
||||
queue = asyncio.Queue()
|
||||
self.queue = asyncio.Queue()
|
||||
|
||||
# Ensure tool manager is initialized
|
||||
await self.tool_manager.initialize()
|
||||
|
||||
# Start loop in background task
|
||||
loop_task = asyncio.create_task(self._run_loop(queue, messages))
|
||||
self.loop_task = asyncio.create_task(self._run_loop(self.queue, messages))
|
||||
|
||||
# Process and yield messages as they arrive
|
||||
while True:
|
||||
try:
|
||||
item = await queue.get()
|
||||
item = await self.queue.get()
|
||||
if item is None: # Stop signal
|
||||
break
|
||||
yield item
|
||||
queue.task_done()
|
||||
self.queue.task_done()
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing queue item: {str(e)}")
|
||||
continue
|
||||
|
||||
# Wait for loop to complete
|
||||
await loop_task
|
||||
await self.loop_task
|
||||
|
||||
# Send completion message
|
||||
yield {
|
||||
@@ -169,6 +170,31 @@ class OpenAILoop(BaseLoop):
|
||||
"content": f"Error: {str(e)}",
|
||||
"metadata": {"title": "❌ Error"},
|
||||
}
|
||||
|
||||
async def cancel(self) -> None:
|
||||
"""Cancel the currently running agent loop task.
|
||||
|
||||
This method stops the ongoing processing in the agent loop
|
||||
by cancelling the loop_task if it exists and is running.
|
||||
"""
|
||||
if self.loop_task and not self.loop_task.done():
|
||||
logger.info("Cancelling OpenAI loop task")
|
||||
self.loop_task.cancel()
|
||||
try:
|
||||
# Wait for the task to be cancelled with a timeout
|
||||
await asyncio.wait_for(self.loop_task, timeout=2.0)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning("Timeout while waiting for loop task to cancel")
|
||||
except asyncio.CancelledError:
|
||||
logger.info("Loop task cancelled successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Error while cancelling loop task: {str(e)}")
|
||||
finally:
|
||||
# Put None in the queue to signal any waiting consumers to stop
|
||||
await self.queue.put(None)
|
||||
logger.info("OpenAI loop task cancelled")
|
||||
else:
|
||||
logger.info("No active OpenAI loop task to cancel")
|
||||
|
||||
###########################################
|
||||
# AGENT LOOP IMPLEMENTATION
|
||||
@@ -201,16 +227,7 @@ class OpenAILoop(BaseLoop):
|
||||
|
||||
# Emit screenshot callbacks
|
||||
await self.handle_screenshot(screenshot_base64, action_type="initial_state")
|
||||
|
||||
# Save screenshot if requested
|
||||
if self.save_trajectory:
|
||||
# Ensure screenshot_base64 is a string
|
||||
if not isinstance(screenshot_base64, str):
|
||||
logger.warning(
|
||||
"Converting non-string screenshot_base64 to string for _save_screenshot"
|
||||
)
|
||||
self._save_screenshot(screenshot_base64, action_type="state")
|
||||
logger.info("Screenshot saved to trajectory")
|
||||
self._save_screenshot(screenshot_base64, action_type="state")
|
||||
|
||||
# First add any existing user messages that were passed to run()
|
||||
user_query = None
|
||||
@@ -351,6 +368,7 @@ class OpenAILoop(BaseLoop):
|
||||
# Process screenshot through hooks
|
||||
action_type = f"after_{action.get('type', 'action')}"
|
||||
await self.handle_screenshot(screenshot_base64, action_type=action_type)
|
||||
self._save_screenshot(screenshot_base64, action_type=action_type)
|
||||
|
||||
# Create computer_call_output
|
||||
computer_call_output = {
|
||||
@@ -397,6 +415,7 @@ class OpenAILoop(BaseLoop):
|
||||
|
||||
# Process the response
|
||||
# await self.response_handler.process_response(response, queue)
|
||||
self._log_api_call("agent_response", request=None, response=response)
|
||||
await queue.put(response)
|
||||
except Exception as e:
|
||||
logger.error(f"Error executing computer action: {str(e)}")
|
||||
|
||||
263 libs/agent/agent/providers/uitars/clients/mlxvlm.py (Normal file)
@@ -0,0 +1,263 @@
|
||||
"""MLX LVM client implementation."""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import base64
|
||||
import tempfile
|
||||
import os
|
||||
import re
|
||||
import math
|
||||
from typing import Dict, List, Optional, Any, cast, Tuple
|
||||
from PIL import Image
|
||||
|
||||
from .base import BaseUITarsClient
|
||||
import mlx.core as mx
|
||||
from mlx_vlm import load, generate
|
||||
from mlx_vlm.prompt_utils import apply_chat_template
|
||||
from mlx_vlm.utils import load_config
|
||||
from transformers.tokenization_utils import PreTrainedTokenizer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Constants for smart_resize
|
||||
IMAGE_FACTOR = 28
|
||||
MIN_PIXELS = 100 * 28 * 28
|
||||
MAX_PIXELS = 16384 * 28 * 28
|
||||
MAX_RATIO = 200
|
||||
|
||||
def round_by_factor(number: float, factor: int) -> int:
|
||||
"""Returns the closest integer to 'number' that is divisible by 'factor'."""
|
||||
return round(number / factor) * factor
|
||||
|
||||
def ceil_by_factor(number: float, factor: int) -> int:
|
||||
"""Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
|
||||
return math.ceil(number / factor) * factor
|
||||
|
||||
def floor_by_factor(number: float, factor: int) -> int:
|
||||
"""Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
|
||||
return math.floor(number / factor) * factor
|
||||
|
||||
def smart_resize(
|
||||
height: int, width: int, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
|
||||
) -> tuple[int, int]:
|
||||
"""
|
||||
Rescales the image so that the following conditions are met:
|
||||
|
||||
1. Both dimensions (height and width) are divisible by 'factor'.
|
||||
2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
|
||||
3. The aspect ratio of the image is maintained as closely as possible.
|
||||
"""
|
||||
if max(height, width) / min(height, width) > MAX_RATIO:
|
||||
raise ValueError(
|
||||
f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
|
||||
)
|
||||
h_bar = max(factor, round_by_factor(height, factor))
|
||||
w_bar = max(factor, round_by_factor(width, factor))
|
||||
if h_bar * w_bar > max_pixels:
|
||||
beta = math.sqrt((height * width) / max_pixels)
|
||||
h_bar = floor_by_factor(height / beta, factor)
|
||||
w_bar = floor_by_factor(width / beta, factor)
|
||||
elif h_bar * w_bar < min_pixels:
|
||||
beta = math.sqrt(min_pixels / (height * width))
|
||||
h_bar = ceil_by_factor(height * beta, factor)
|
||||
w_bar = ceil_by_factor(width * beta, factor)
|
||||
return h_bar, w_bar
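# Worked example (illustrative arithmetic, not part of the original file): for a
# 1024x768 screenshot, smart_resize(768, 1024) rounds each side to a multiple of 28,
# giving (756, 1036); 756 * 1036 = 783,216 pixels already lies between MIN_PIXELS and
# MAX_PIXELS, so no further scaling is applied and coordinates map almost 1:1.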
|
||||
|
||||
class MLXVLMUITarsClient(BaseUITarsClient):
|
||||
"""MLX LVM client implementation class."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
model: str = "mlx-community/UI-TARS-1.5-7B-4bit"
|
||||
):
|
||||
"""Initialize MLX LVM client.
|
||||
|
||||
Args:
|
||||
model: Model name or path (defaults to mlx-community/UI-TARS-1.5-7B-4bit)
|
||||
"""
|
||||
# Load model and processor
|
||||
model_obj, processor = load(
|
||||
model,
|
||||
processor_kwargs={"min_pixels": MIN_PIXELS, "max_pixels": MAX_PIXELS}
|
||||
)
|
||||
self.config = load_config(model)
|
||||
self.model = model_obj
|
||||
self.processor = processor
|
||||
self.model_name = model
|
||||
|
||||
def _process_coordinates(self, text: str, original_size: Tuple[int, int], model_size: Tuple[int, int]) -> str:
|
||||
"""Process coordinates in box tokens based on image resizing using smart_resize approach.
|
||||
|
||||
Args:
|
||||
text: Text containing box tokens
|
||||
original_size: Original image size (width, height)
|
||||
model_size: Model processed image size (width, height)
|
||||
|
||||
Returns:
|
||||
Text with processed coordinates
|
||||
"""
|
||||
# Find all box tokens
|
||||
box_pattern = r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>"
|
||||
|
||||
def process_coords(match):
|
||||
model_x, model_y = int(match.group(1)), int(match.group(2))
|
||||
# Scale coordinates from model space to original image space
|
||||
# Both original_size and model_size are in (width, height) format
|
||||
new_x = int(model_x * original_size[0] / model_size[0]) # Width
|
||||
new_y = int(model_y * original_size[1] / model_size[1]) # Height
|
||||
return f"<|box_start|>({new_x},{new_y})<|box_end|>"
|
||||
|
||||
return re.sub(box_pattern, process_coords, text)
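# Illustrative example (assumed numbers): with model_size=(1036, 756) and
# original_size=(1024, 768), "<|box_start|>(100,200)<|box_end|>" becomes
# x = int(100 * 1024 / 1036) = 98, y = int(200 * 768 / 756) = 203.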
|
||||
|
||||
async def run_interleaved(
|
||||
self, messages: List[Dict[str, Any]], system: str, max_tokens: Optional[int] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""Run interleaved chat completion.
|
||||
|
||||
Args:
|
||||
messages: List of message dicts
|
||||
system: System prompt
|
||||
max_tokens: Optional max tokens override
|
||||
|
||||
Returns:
|
||||
Response dict
|
||||
"""
|
||||
# Ensure the system message is included
|
||||
if not any(msg.get("role") == "system" for msg in messages):
|
||||
messages = [{"role": "system", "content": system}] + messages
|
||||
|
||||
# Copy the message list so the original is not modified (individual items are copied below)
|
||||
processed_messages = messages.copy()
|
||||
|
||||
# Extract images and process messages
|
||||
images = []
|
||||
original_sizes = {} # Track original sizes of images for coordinate mapping
|
||||
model_sizes = {} # Track model processed sizes
|
||||
image_index = 0
|
||||
|
||||
for msg_idx, msg in enumerate(messages):
|
||||
content = msg.get("content", [])
|
||||
if not isinstance(content, list):
|
||||
continue
|
||||
|
||||
# Create a copy of the content list to modify
|
||||
processed_content = []
|
||||
|
||||
for item_idx, item in enumerate(content):
|
||||
if item.get("type") == "image_url":
|
||||
image_url = item.get("image_url", {}).get("url", "")
|
||||
pil_image = None
|
||||
|
||||
if image_url.startswith("data:image/"):
|
||||
# Extract base64 data
|
||||
base64_data = image_url.split(',')[1]
|
||||
# Convert base64 to PIL Image
|
||||
image_data = base64.b64decode(base64_data)
|
||||
pil_image = Image.open(io.BytesIO(image_data))
|
||||
else:
|
||||
# Handle file path or URL
|
||||
pil_image = Image.open(image_url)
|
||||
|
||||
# Store original image size for coordinate mapping
|
||||
original_size = pil_image.size
|
||||
original_sizes[image_index] = original_size
|
||||
|
||||
# Use smart_resize to determine model size
|
||||
# Note: smart_resize expects (height, width) but PIL gives (width, height)
|
||||
height, width = original_size[1], original_size[0]
|
||||
new_height, new_width = smart_resize(height, width)
|
||||
# Store model size in (width, height) format for consistent coordinate processing
|
||||
model_sizes[image_index] = (new_width, new_height)
|
||||
|
||||
# Resize the image using the calculated dimensions from smart_resize
|
||||
resized_image = pil_image.resize((new_width, new_height))
|
||||
images.append(resized_image)
|
||||
image_index += 1
|
||||
|
||||
# Copy items to processed content list
|
||||
processed_content.append(item.copy())
|
||||
|
||||
# Update the processed message content
|
||||
processed_messages[msg_idx] = msg.copy()
|
||||
processed_messages[msg_idx]["content"] = processed_content
|
||||
|
||||
logger.info(f"resized {len(images)} from {original_sizes[0]} to {model_sizes[0]}")
|
||||
|
||||
# Process user text input with box coordinates after image processing
|
||||
# Swap original_size and model_size arguments for inverse transformation
|
||||
for msg_idx, msg in enumerate(processed_messages):
|
||||
if msg.get("role") == "user" and isinstance(msg.get("content"), str):
|
||||
if "<|box_start|>" in msg.get("content") and original_sizes and model_sizes and 0 in original_sizes and 0 in model_sizes:
|
||||
orig_size = original_sizes[0]
|
||||
model_size = model_sizes[0]
|
||||
# Swap arguments to perform inverse transformation for user input
|
||||
processed_messages[msg_idx]["content"] = self._process_coordinates(msg["content"], model_size, orig_size)
|
||||
|
||||
try:
|
||||
# Format prompt according to model requirements using the processor directly
|
||||
prompt = self.processor.apply_chat_template(
|
||||
processed_messages,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True
|
||||
)
|
||||
tokenizer = cast(PreTrainedTokenizer, self.processor)
|
||||
|
||||
print("generating response...")
|
||||
|
||||
# Generate response
|
||||
text_content, usage = generate(
|
||||
self.model,
|
||||
tokenizer,
|
||||
str(prompt),
|
||||
images,
|
||||
verbose=False,
|
||||
max_tokens=max_tokens
|
||||
)
|
||||
|
||||
from pprint import pprint
|
||||
print("DEBUG - AGENT GENERATION --------")
|
||||
pprint(text_content)
|
||||
print("DEBUG - AGENT GENERATION --------")
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating response: {str(e)}")
|
||||
return {
|
||||
"choices": [
|
||||
{
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": f"Error generating response: {str(e)}"
|
||||
},
|
||||
"finish_reason": "error"
|
||||
}
|
||||
],
|
||||
"model": self.model_name,
|
||||
"error": str(e)
|
||||
}
|
||||
|
||||
# Process coordinates in the response back to original image space
|
||||
if original_sizes and model_sizes and 0 in original_sizes and 0 in model_sizes:
|
||||
# Get original image size and model size (using the first image)
|
||||
orig_size = original_sizes[0]
|
||||
model_size = model_sizes[0]
|
||||
|
||||
# Check if output contains box tokens that need processing
|
||||
if "<|box_start|>" in text_content:
|
||||
# Process coordinates from model space back to original image space
|
||||
text_content = self._process_coordinates(text_content, orig_size, model_size)
|
||||
|
||||
# Format response to match OpenAI format
|
||||
response = {
|
||||
"choices": [
|
||||
{
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": text_content
|
||||
},
|
||||
"finish_reason": "stop"
|
||||
}
|
||||
],
|
||||
"model": self.model_name,
|
||||
"usage": usage
|
||||
}
|
||||
|
||||
return response
|
||||
@@ -23,6 +23,7 @@ from .tools.computer import ToolResult
|
||||
from .prompts import COMPUTER_USE, SYSTEM_PROMPT, MAC_SPECIFIC_NOTES
|
||||
|
||||
from .clients.oaicompat import OAICompatClient
|
||||
from .clients.mlxvlm import MLXVLMUITarsClient
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -44,6 +45,7 @@ class UITARSLoop(BaseLoop):
|
||||
computer: Computer,
|
||||
api_key: str,
|
||||
model: str,
|
||||
provider: Optional[LLMProvider] = None,
|
||||
provider_base_url: Optional[str] = "http://localhost:8000/v1",
|
||||
only_n_most_recent_images: Optional[int] = 2,
|
||||
base_dir: Optional[str] = "trajectories",
|
||||
@@ -64,9 +66,10 @@ class UITARSLoop(BaseLoop):
|
||||
max_retries: Maximum number of retries for API calls
|
||||
retry_delay: Delay between retries in seconds
|
||||
save_trajectory: Whether to save trajectory data
|
||||
provider: The LLM provider to use (defaults to OAICOMPAT if not specified)
|
||||
"""
|
||||
# Set provider before initializing base class
|
||||
self.provider = LLMProvider.OAICOMPAT
|
||||
self.provider = provider or LLMProvider.OAICOMPAT
|
||||
self.provider_base_url = provider_base_url
|
||||
|
||||
# Initialize message manager with image retention config
|
||||
@@ -90,6 +93,7 @@ class UITARSLoop(BaseLoop):
|
||||
# Set API client attributes
|
||||
self.client = None
|
||||
self.retry_count = 0
|
||||
self.loop_task = None # Store the loop task for cancellation
|
||||
|
||||
# Initialize visualization helper
|
||||
self.viz_helper = VisualizationHelper(agent=self)
|
||||
@@ -113,7 +117,7 @@ class UITARSLoop(BaseLoop):
|
||||
logger.error(f"Error initializing tool manager: {str(e)}")
|
||||
logger.warning("Will attempt to initialize tools on first use.")
|
||||
|
||||
# Initialize client for the OAICompat provider
|
||||
# Initialize client for the selected provider
|
||||
try:
|
||||
await self.initialize_client()
|
||||
except Exception as e:
|
||||
@@ -128,18 +132,28 @@ class UITARSLoop(BaseLoop):
|
||||
"""Initialize the appropriate client.
|
||||
|
||||
Implements abstract method from BaseLoop to set up the specific
|
||||
provider client (OAICompat for UI-TARS).
|
||||
provider client based on the configured provider.
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Initializing OAICompat client for UI-TARS with model {self.model}...")
|
||||
|
||||
self.client = OAICompatClient(
|
||||
api_key=self.api_key or "EMPTY", # Local endpoints typically don't require an API key
|
||||
model=self.model,
|
||||
provider_base_url=self.provider_base_url,
|
||||
)
|
||||
|
||||
logger.info(f"Initialized OAICompat client with model {self.model}")
|
||||
if self.provider == LLMProvider.MLXVLM:
|
||||
logger.info(f"Initializing MLX VLM client for UI-TARS with model {self.model}...")
|
||||
|
||||
self.client = MLXVLMUITarsClient(
|
||||
model=self.model,
|
||||
)
|
||||
|
||||
logger.info(f"Initialized MLX VLM client with model {self.model}")
|
||||
else:
|
||||
# Default to OAICompat client for other providers
|
||||
logger.info(f"Initializing OAICompat client for UI-TARS with model {self.model}...")
|
||||
|
||||
self.client = OAICompatClient(
|
||||
api_key=self.api_key or "EMPTY", # Local endpoints typically don't require an API key
|
||||
model=self.model,
|
||||
provider_base_url=self.provider_base_url,
|
||||
)
|
||||
|
||||
logger.info(f"Initialized OAICompat client with model {self.model}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error initializing client: {str(e)}")
|
||||
self.client = None
|
||||
@@ -449,10 +463,55 @@ class UITARSLoop(BaseLoop):
|
||||
Yields:
|
||||
Agent response format
|
||||
"""
|
||||
# Initialize the message manager with the provided messages
|
||||
self.message_manager.messages = messages.copy()
|
||||
logger.info(f"Starting UITARSLoop run with {len(self.message_manager.messages)} messages")
|
||||
try:
|
||||
logger.info(f"Starting UITARSLoop run with {len(messages)} messages")
|
||||
|
||||
# Initialize the message manager with the provided messages
|
||||
self.message_manager.messages = messages.copy()
|
||||
|
||||
# Create queue for response streaming
|
||||
queue = asyncio.Queue()
|
||||
|
||||
# Start loop in background task
|
||||
self.loop_task = asyncio.create_task(self._run_loop(queue, messages))
|
||||
|
||||
# Process and yield messages as they arrive
|
||||
while True:
|
||||
try:
|
||||
item = await queue.get()
|
||||
if item is None: # Stop signal
|
||||
break
|
||||
yield item
|
||||
queue.task_done()
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing queue item: {str(e)}")
|
||||
continue
|
||||
|
||||
# Wait for loop to complete
|
||||
await self.loop_task
|
||||
|
||||
# Send completion message
|
||||
yield {
|
||||
"role": "assistant",
|
||||
"content": "Task completed successfully.",
|
||||
"metadata": {"title": "✅ Complete"},
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in run method: {str(e)}")
|
||||
yield {
|
||||
"role": "assistant",
|
||||
"content": f"Error: {str(e)}",
|
||||
"metadata": {"title": "❌ Error"},
|
||||
}
|
||||
|
||||
async def _run_loop(self, queue: asyncio.Queue, messages: List[Dict[str, Any]]) -> None:
|
||||
"""Internal method to run the agent loop with provided messages.
|
||||
|
||||
Args:
|
||||
queue: Queue to put responses into
|
||||
messages: List of messages in standard OpenAI format
|
||||
"""
|
||||
# Continue running until explicitly told to stop
|
||||
running = True
|
||||
turn_created = False
|
||||
@@ -462,88 +521,117 @@ class UITARSLoop(BaseLoop):
|
||||
attempt = 0
|
||||
max_attempts = 3
|
||||
|
||||
while running and attempt < max_attempts:
|
||||
try:
|
||||
# Create a new turn directory if it's not already created
|
||||
if not turn_created:
|
||||
self._create_turn_dir()
|
||||
turn_created = True
|
||||
try:
|
||||
while running and attempt < max_attempts:
|
||||
try:
|
||||
# Create a new turn directory if it's not already created
|
||||
if not turn_created:
|
||||
self._create_turn_dir()
|
||||
turn_created = True
|
||||
|
||||
# Ensure client is initialized
|
||||
if self.client is None:
|
||||
logger.info("Initializing client...")
|
||||
await self.initialize_client()
|
||||
# Ensure client is initialized
|
||||
if self.client is None:
|
||||
raise RuntimeError("Failed to initialize client")
|
||||
logger.info("Client initialized successfully")
|
||||
logger.info("Initializing client...")
|
||||
await self.initialize_client()
|
||||
if self.client is None:
|
||||
                    raise RuntimeError("Failed to initialize client")
                logger.info("Client initialized successfully")

                # Get current screen
                base64_screenshot = await self._get_current_screen()

                # Add screenshot to message history
                self.message_manager.add_user_message(
                    [
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{base64_screenshot}"},
                        }
                    ]
                )
                logger.info("Added screenshot to message history")

                # Get system prompt
                system_prompt = self._get_system_prompt()

                # Make API call with retries
                response = await self._make_api_call(
                    self.message_manager.messages, system_prompt
                )

                # Handle the response (may execute actions)
                # Returns: (should_continue, action_screenshot_saved)
                should_continue, new_screenshot_saved = await self._handle_response(
                    response, self.message_manager.messages
                )

                # Update whether an action screenshot was saved this turn
                action_screenshot_saved = action_screenshot_saved or new_screenshot_saved

                agent_response = await to_agent_response_format(
                    response,
                    messages,
                    model=self.model,
                )
                # Log standardized response for ease of parsing
                self._log_api_call("agent_response", request=None, response=agent_response)

                # Put the response in the queue
                await queue.put(agent_response)

                # Check if we should continue this conversation
                running = should_continue

                # Create a new turn directory if we're continuing
                if running:
                    turn_created = False

                # Reset attempt counter on success
                attempt = 0

            except Exception as e:
                attempt += 1
                error_msg = f"Error in run method (attempt {attempt}/{max_attempts}): {str(e)}"
                logger.error(error_msg)

                # If this is our last attempt, provide more info about the error
                if attempt >= max_attempts:
                    logger.error(f"Maximum retry attempts reached. Last error was: {str(e)}")

                    await queue.put({
                        "role": "assistant",
                        "content": f"Error: {str(e)}",
                        "metadata": {"title": "❌ Error"},
                    })

                # Create a brief delay before retrying
                await asyncio.sleep(1)
        finally:
            # Signal that we're done
            await queue.put(None)

    async def cancel(self) -> None:
        """Cancel the currently running agent loop task.

        This method stops the ongoing processing in the agent loop
        by cancelling the loop_task if it exists and is running.
        """
        if self.loop_task and not self.loop_task.done():
            logger.info("Cancelling UITARS loop task")
            self.loop_task.cancel()
            try:
                # Wait for the task to be cancelled with a timeout
                await asyncio.wait_for(self.loop_task, timeout=2.0)
            except asyncio.TimeoutError:
                logger.warning("Timeout while waiting for loop task to cancel")
            except asyncio.CancelledError:
                logger.info("Loop task cancelled successfully")
            except Exception as e:
                logger.error(f"Error while cancelling loop task: {str(e)}")
            finally:
                logger.info("UITARS loop task cancelled")
        else:
            logger.info("No active UITARS loop task to cancel")

    ###########################################
    # UTILITY METHODS
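The loop above now streams each standardized response through an asyncio queue and signals completion by enqueuing `None`, which is what makes the new `cancel()` method workable. A minimal sketch of a consumer for that protocol (the consumer function is hypothetical; only `queue` and the `None` sentinel come from the diff):

```python
import asyncio

async def drain_agent_queue(queue: asyncio.Queue) -> None:
    # Read standardized agent responses until the producer enqueues None.
    while True:
        item = await queue.get()
        if item is None:
            break
        print(item)
```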
@@ -105,7 +105,7 @@ async def to_agent_response_format(
            }
        ],
        truncation="auto",
        usage=response["usage"],
        usage=response.get("usage", {}),
        user=None,
        metadata={},
        response=response
@@ -6,7 +6,7 @@ with an advanced UI for model selection and configuration.

Supported Agent Loops and Models:
- AgentLoop.OPENAI: Uses OpenAI Operator CUA model
    • computer_use_preview
    • computer-use-preview

- AgentLoop.ANTHROPIC: Uses Anthropic Computer-Use models
    • claude-3-5-sonnet-20240620
@@ -133,12 +133,12 @@ class GradioChatScreenshotHandler(DefaultCallbackHandler):
MODEL_MAPPINGS = {
    "openai": {
        # Default to operator CUA model
        "default": "computer_use_preview",
        "default": "computer-use-preview",
        # Map standard OpenAI model names to CUA-specific model names
        "gpt-4-turbo": "computer_use_preview",
        "gpt-4o": "computer_use_preview",
        "gpt-4": "computer_use_preview",
        "gpt-4.5-preview": "computer_use_preview",
        "gpt-4-turbo": "computer-use-preview",
        "gpt-4o": "computer-use-preview",
        "gpt-4": "computer-use-preview",
        "gpt-4.5-preview": "computer-use-preview",
        "gpt-4o-mini": "gpt-4o-mini",
    },
    "anthropic": {
@@ -164,8 +164,10 @@ MODEL_MAPPINGS = {
        "claude-3-7-sonnet-20250219": "claude-3-7-sonnet-20250219",
    },
    "uitars": {
        # UI-TARS models default to custom endpoint
        "default": "ByteDance-Seed/UI-TARS-1.5-7B",
        # UI-TARS models using MLXVLM provider
        "default": "mlx-community/UI-TARS-1.5-7B-4bit",
        "mlx-community/UI-TARS-1.5-7B-4bit": "mlx-community/UI-TARS-1.5-7B-4bit",
        "mlx-community/UI-TARS-1.5-7B-6bit": "mlx-community/UI-TARS-1.5-7B-6bit"
    },
    "ollama": {
        # For Ollama models, we keep the original name
@@ -215,7 +217,7 @@ def get_provider_and_model(model_name: str, loop_provider: str) -> tuple:
    # Determine provider and clean model name based on the full string from UI
    cleaned_model_name = model_name  # Default to using the name as-is (for custom)

    if model_name == "Custom model...":
    if model_name == "Custom model (OpenAI compatible API)":
        # Actual model name comes from custom_model_value via model_to_use.
        # Assume OAICOMPAT for custom models unless overridden by URL/key later?
        # get_provider_and_model determines the *initial* provider/model.
@@ -276,8 +278,8 @@ def get_provider_and_model(model_name: str, loop_provider: str) -> tuple:
            break
        # Note: No fallback needed here as we explicitly check against omni keys

    else:  # Handles unexpected formats or the raw custom name if "Custom model..." selected
        # Should only happen if user selected "Custom model..."
    else:  # Handles unexpected formats or the raw custom name if "Custom model (OpenAI compatible API)" selected
        # Should only happen if user selected "Custom model (OpenAI compatible API)"
        # Or if a model name format isn't caught above
        provider = LLMProvider.OAICOMPAT
        cleaned_model_name = (
@@ -288,8 +290,16 @@ def get_provider_and_model(model_name: str, loop_provider: str) -> tuple:
        model_name_to_use = cleaned_model_name
        # agent_loop remains AgentLoop.OMNI
    elif agent_loop == AgentLoop.UITARS:
        provider = LLMProvider.OAICOMPAT
        model_name_to_use = MODEL_MAPPINGS["uitars"]["default"]  # Default
        # For UITARS, use MLXVLM provider for the MLX models, OAICOMPAT for custom
        if model_name == "Custom model (OpenAI compatible API)":
            provider = LLMProvider.OAICOMPAT
            model_name_to_use = "tgi"
        else:
            provider = LLMProvider.MLXVLM
            # Get the model name from the mappings or use as-is if not found
            model_name_to_use = MODEL_MAPPINGS["uitars"].get(
                model_name, model_name if model_name else MODEL_MAPPINGS["uitars"]["default"]
            )
    else:
        # Default to OpenAI if unrecognized loop
        provider = LLMProvider.OPENAI
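To make the new UITARS branch concrete, a hedged sketch of the resolutions it implies; the exact return tuple and values are read off the diff, not verified against the package:

```python
# Illustrative expectations after this change (assumed, not executed):
#   get_provider_and_model("mlx-community/UI-TARS-1.5-7B-4bit", "UITARS")
#       -> (LLMProvider.MLXVLM, "mlx-community/UI-TARS-1.5-7B-4bit", AgentLoop.UITARS)
#   get_provider_and_model("Custom model (OpenAI compatible API)", "UITARS")
#       -> (LLMProvider.OAICOMPAT, "tgi", AgentLoop.UITARS)
```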
@@ -412,25 +422,23 @@ def create_gradio_ui(
    openai_api_key = os.environ.get("OPENAI_API_KEY", "")
    anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY", "")

    # Prepare model choices based on available API keys
    openai_models = []
    anthropic_models = []
    omni_models = []

    if openai_api_key:
        openai_models = ["OpenAI: Computer-Use Preview"]
        omni_models += [
            "OMNI: OpenAI GPT-4o",
            "OMNI: OpenAI GPT-4o mini",
            "OMNI: OpenAI GPT-4.5-preview",
        ]

    if anthropic_api_key:
        anthropic_models = [
            "Anthropic: Claude 3.7 Sonnet (20250219)",
            "Anthropic: Claude 3.5 Sonnet (20240620)",
        ]
        omni_models += ["OMNI: Claude 3.7 Sonnet (20250219)", "OMNI: Claude 3.5 Sonnet (20240620)"]
    # Always show models regardless of API key availability
    openai_models = ["OpenAI: Computer-Use Preview"]
    anthropic_models = [
        "Anthropic: Claude 3.7 Sonnet (20250219)",
        "Anthropic: Claude 3.5 Sonnet (20240620)",
    ]
    omni_models = [
        "OMNI: OpenAI GPT-4o",
        "OMNI: OpenAI GPT-4o mini",
        "OMNI: OpenAI GPT-4.5-preview",
        "OMNI: Claude 3.7 Sonnet (20250219)",
        "OMNI: Claude 3.5 Sonnet (20240620)"
    ]

    # Check if API keys are available
    has_openai_key = bool(openai_api_key)
    has_anthropic_key = bool(anthropic_api_key)

    # Get Ollama models for OMNI
    ollama_models = get_ollama_models()
@@ -441,8 +449,12 @@ def create_gradio_ui(
    provider_to_models = {
        "OPENAI": openai_models,
        "ANTHROPIC": anthropic_models,
        "OMNI": omni_models + ["Custom model..."],  # Add custom model option
        "UITARS": ["Custom model..."],  # UI-TARS options
        "OMNI": omni_models + ["Custom model (OpenAI compatible API)", "Custom model (ollama)"],  # Add custom model options
        "UITARS": [
            "mlx-community/UI-TARS-1.5-7B-4bit",
            "mlx-community/UI-TARS-1.5-7B-6bit",
            "Custom model (OpenAI compatible API)"
        ],  # UI-TARS options with MLX models
    }

    # --- Apply Saved Settings (override defaults if available) ---
@@ -462,9 +474,9 @@ def create_gradio_ui(
        initial_model = anthropic_models[0] if anthropic_models else "No models available"
    else:  # OMNI
        initial_model = omni_models[0] if omni_models else "No models available"
        if "Custom model..." in available_models_for_loop:
        if "Custom model (OpenAI compatible API)" in available_models_for_loop:
            initial_model = (
                "Custom model..."  # Default to custom if available and no other default fits
                "Custom model (OpenAI compatible API)"  # Default to custom if available and no other default fits
            )

    initial_custom_model = saved_settings.get("custom_model", "Qwen2.5-VL-7B-Instruct")
@@ -480,27 +492,129 @@ def create_gradio_ui(
        "Open Safari, search for 'macOS automation tools', and save the first three results as bookmarks",
        "Configure SSH keys and set up a connection to a remote server",
    ]

    # Function to update model choices based on agent loop selection
    def update_model_choices(loop):
        models = provider_to_models.get(loop, [])
        if loop == "OMNI":
            # For OMNI, include the custom model option
            if not models:
                models = ["Custom model..."]
            elif "Custom model..." not in models:
                models.append("Custom model...")

            return gr.update(
                choices=models, value=models[0] if models else "Custom model...", interactive=True
            )
        else:
            # For other providers, use standard dropdown without custom option
            if not models:
                return gr.update(
                    choices=["No models available"], value="No models available", interactive=True
                )
            return gr.update(choices=models, value=models[0] if models else None, interactive=True)

    # Function to generate Python code based on configuration and tasks
    def generate_python_code(agent_loop_choice, provider, model_name, tasks, provider_url, recent_images=3, save_trajectory=True):
        """Generate Python code for the current configuration and tasks.

        Args:
            agent_loop_choice: The agent loop type (e.g., UITARS, OPENAI, ANTHROPIC, OMNI)
            provider: The provider type (e.g., OPENAI, ANTHROPIC, OLLAMA, OAICOMPAT, MLXVLM)
            model_name: The model name
            tasks: List of tasks to execute
            provider_url: The provider base URL for OAICOMPAT providers
            recent_images: Number of recent images to keep in context
            save_trajectory: Whether to save the agent trajectory

        Returns:
            Formatted Python code as a string
        """
        # Format the tasks as a Python list
        tasks_str = ""
        for task in tasks:
            if task and task.strip():
                tasks_str += f'        "{task}",\n'

        # Create the Python code template
        code = f'''import asyncio
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider

async def main():
    async with Computer() as macos_computer:
        agent = ComputerAgent(
            computer=macos_computer,
            loop=AgentLoop.{agent_loop_choice},
            only_n_most_recent_images={recent_images},
            save_trajectory={save_trajectory},'''

        # Add the model configuration based on provider and agent loop
        if agent_loop_choice == "OPENAI":
            # For OPENAI loop, always use OPENAI provider with computer-use-preview
            code += f'''
            model=LLM(
                provider=LLMProvider.OPENAI,
                name="computer-use-preview"
            )'''
        elif agent_loop_choice == "ANTHROPIC":
            # For ANTHROPIC loop, always use ANTHROPIC provider
            code += f'''
            model=LLM(
                provider=LLMProvider.ANTHROPIC,
                name="{model_name}"
            )'''
        elif agent_loop_choice == "UITARS":
            # For UITARS, use MLXVLM for mlx-community models, OAICOMPAT for others
            if provider == LLMProvider.MLXVLM:
                code += f'''
            model=LLM(
                provider=LLMProvider.MLXVLM,
                name="{model_name}"
            )'''
            else:  # OAICOMPAT
                code += f'''
            model=LLM(
                provider=LLMProvider.OAICOMPAT,
                name="{model_name}",
                provider_base_url="{provider_url}"
            )'''
        elif agent_loop_choice == "OMNI":
            # For OMNI, provider can be OPENAI, ANTHROPIC, OLLAMA, or OAICOMPAT
            if provider == LLMProvider.OAICOMPAT:
                code += f'''
            model=LLM(
                provider=LLMProvider.OAICOMPAT,
                name="{model_name}",
                provider_base_url="{provider_url}"
            )'''
            else:  # OPENAI, ANTHROPIC, OLLAMA
                code += f'''
            model=LLM(
                provider=LLMProvider.{provider.name},
                name="{model_name}"
            )'''
        else:
            # Default case - just use the provided provider and model
            code += f'''
            model=LLM(
                provider=LLMProvider.{provider.name},
                name="{model_name}"
            )'''

        code += """
        )
"""

        # Add tasks section if there are tasks
        if tasks_str:
            code += f'''
        # Prompts for the computer-use agent
        tasks = [
{tasks_str.rstrip()}
        ]

        for task in tasks:
            print(f"Executing task: {{task}}")
            async for result in agent.run(task):
                print(result)'''
        else:
            # If no tasks, just add a placeholder for a single task
            code += f'''
        # Execute a single task
        task = "Search for information about CUA on GitHub"
        print(f"Executing task: {{task}}")
        async for result in agent.run(task):
            print(result)'''

        # Add the main block
        code += '''

if __name__ == "__main__":
    asyncio.run(main())'''

        return code

    # Create the Gradio interface with advanced UI
    with gr.Blocks(title="Computer-Use Agent") as demo:
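For reference, this is roughly the script the helper above would emit for the OPENAI loop with default options and no tasks (reconstructed from the template, not captured from a live run):

```python
import asyncio
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider

async def main():
    async with Computer() as macos_computer:
        agent = ComputerAgent(
            computer=macos_computer,
            loop=AgentLoop.OPENAI,
            only_n_most_recent_images=3,
            save_trajectory=True,
            model=LLM(
                provider=LLMProvider.OPENAI,
                name="computer-use-preview"
            )
        )

        # Execute a single task
        task = "Search for information about CUA on GitHub"
        print(f"Executing task: {task}")
        async for result in agent.run(task):
            print(result)

if __name__ == "__main__":
    asyncio.run(main())
```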
@@ -537,50 +651,20 @@ def create_gradio_ui(
            """
        )

        # Add installation prerequisites as a collapsible section
        with gr.Accordion("Prerequisites & Installation", open=False):
            gr.Markdown(
                """
                ## Prerequisites

                Before using the Computer-Use Agent, you need to set up the Lume daemon and pull the macOS VM image.

                ### 1. Install Lume daemon

                While a lume binary is included with Computer, we recommend installing the standalone version with brew, and starting the lume daemon service:

                ```bash
                sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
                ```

                ### 2. Start the Lume daemon service

                In a separate terminal:

                ```bash
                lume serve
                ```

                ### 3. Pull the pre-built macOS image

                ```bash
                lume pull macos-sequoia-cua:latest
                ```

                Initial download requires 80GB storage, but reduces to ~30GB after first run due to macOS's sparse file system.

                VMs are stored in `~/.lume`, and locally cached images are stored in `~/.lume/cache`.

                ### 4. Test the sandbox

                ```bash
                lume run macos-sequoia-cua:latest
                ```

                For more detailed instructions, visit the [CUA GitHub repository](https://github.com/trycua/cua).
                """
            )

        # Add accordion for Python code
        with gr.Accordion("Python Code", open=False):
            code_display = gr.Code(
                language="python",
                value=generate_python_code(
                    initial_loop,
                    LLMProvider.OPENAI,
                    "gpt-4o",
                    [],
                    "https://openrouter.ai/api/v1"
                ),
                interactive=False,
            )

        with gr.Accordion("Configuration", open=True):
            # Configuration options
            agent_loop = gr.Dropdown(
@@ -590,42 +674,244 @@ def create_gradio_ui(
                info="Select the agent loop provider",
            )

            # Create model selection dropdown with custom value support for OMNI
            model_choice = gr.Dropdown(
                choices=provider_to_models.get(initial_loop, ["No models available"]),
                label="LLM Provider and Model",
                value=initial_model,
                info="Select model or choose 'Custom model...' to enter a custom name",
                interactive=True,
            )

            # Create separate model selection dropdowns for each provider type
            # This avoids the Gradio bug with updating choices
            with gr.Group() as model_selection_group:
                # OpenAI models dropdown
                openai_model_choice = gr.Dropdown(
                    choices=openai_models,
                    label="OpenAI Model",
                    value=openai_models[0] if openai_models else "No models available",
                    info="Select OpenAI model",
                    interactive=True,
                    visible=(initial_loop == "OPENAI")
                )

                # Anthropic models dropdown
                anthropic_model_choice = gr.Dropdown(
                    choices=anthropic_models,
                    label="Anthropic Model",
                    value=anthropic_models[0] if anthropic_models else "No models available",
                    info="Select Anthropic model",
                    interactive=True,
                    visible=(initial_loop == "ANTHROPIC")
                )

                # OMNI models dropdown
                omni_model_choice = gr.Dropdown(
                    choices=omni_models + ["Custom model (OpenAI compatible API)", "Custom model (ollama)"],
                    label="OMNI Model",
                    value=omni_models[0] if omni_models else "Custom model (OpenAI compatible API)",
                    info="Select OMNI model or choose a custom model option",
                    interactive=True,
                    visible=(initial_loop == "OMNI")
                )

                # UITARS models dropdown
                uitars_model_choice = gr.Dropdown(
                    choices=provider_to_models.get("UITARS", ["No models available"]),
                    label="UITARS Model",
                    value=provider_to_models.get("UITARS", ["No models available"])[0] if provider_to_models.get("UITARS") else "No models available",
                    info="Select UITARS model",
                    interactive=True,
                    visible=(initial_loop == "UITARS")
                )

                # Hidden field to store the selected model (for compatibility with existing code)
                model_choice = gr.Textbox(visible=False)

                # Add API key inputs for OpenAI and Anthropic
                with gr.Group(visible=not has_openai_key and (initial_loop == "OPENAI" or initial_loop == "OMNI")) as openai_key_group:
                    openai_api_key_input = gr.Textbox(
                        label="OpenAI API Key",
                        placeholder="Enter your OpenAI API key",
                        value="",
                        interactive=True,
                        type="password",
                        info="Required for OpenAI models"
                    )

                with gr.Group(visible=not has_anthropic_key and (initial_loop == "ANTHROPIC" or initial_loop == "OMNI")) as anthropic_key_group:
                    anthropic_api_key_input = gr.Textbox(
                        label="Anthropic API Key",
                        placeholder="Enter your Anthropic API key",
                        value="",
                        interactive=True,
                        type="password",
                        info="Required for Anthropic models"
                    )

                # Function to set OpenAI API key environment variable
                def set_openai_api_key(key):
                    if key and key.strip():
                        os.environ["OPENAI_API_KEY"] = key.strip()
                        print(f"DEBUG - Set OpenAI API key environment variable")
                    return key

                # Function to set Anthropic API key environment variable
                def set_anthropic_api_key(key):
                    if key and key.strip():
                        os.environ["ANTHROPIC_API_KEY"] = key.strip()
                        print(f"DEBUG - Set Anthropic API key environment variable")
                    return key

                # Add change event handlers for API key inputs
                openai_api_key_input.change(
                    fn=set_openai_api_key,
                    inputs=[openai_api_key_input],
                    outputs=[openai_api_key_input],
                    queue=False
                )

                anthropic_api_key_input.change(
                    fn=set_anthropic_api_key,
                    inputs=[anthropic_api_key_input],
                    outputs=[anthropic_api_key_input],
                    queue=False
                )

                # Combined function to update UI based on selections
                def update_ui(loop=None, openai_model=None, anthropic_model=None, omni_model=None, uitars_model=None):
                    # Default values if not provided
                    loop = loop or agent_loop.value

                    # Determine which model value to use for custom model checks
                    model_value = None
                    if loop == "OPENAI" and openai_model:
                        model_value = openai_model
                    elif loop == "ANTHROPIC" and anthropic_model:
                        model_value = anthropic_model
                    elif loop == "OMNI" and omni_model:
                        model_value = omni_model
                    elif loop == "UITARS" and uitars_model:
                        model_value = uitars_model

                    # Show/hide appropriate model dropdown based on loop selection
                    openai_visible = (loop == "OPENAI")
                    anthropic_visible = (loop == "ANTHROPIC")
                    omni_visible = (loop == "OMNI")
                    uitars_visible = (loop == "UITARS")

                    # Show/hide API key inputs based on loop selection
                    show_openai_key = not has_openai_key and (loop == "OPENAI" or (loop == "OMNI" and model_value and "OpenAI" in model_value and "Custom" not in model_value))
                    show_anthropic_key = not has_anthropic_key and (loop == "ANTHROPIC" or (loop == "OMNI" and model_value and "Claude" in model_value and "Custom" not in model_value))

                    # Determine custom model visibility
                    is_custom_openai_api = model_value == "Custom model (OpenAI compatible API)"
                    is_custom_ollama = model_value == "Custom model (ollama)"
                    is_any_custom = is_custom_openai_api or is_custom_ollama

                    # Update the hidden model_choice field based on the visible dropdown
                    model_choice_value = model_value if model_value else ""

                    # Return all UI updates
                    return [
                        # Model dropdowns visibility
                        gr.update(visible=openai_visible),
                        gr.update(visible=anthropic_visible),
                        gr.update(visible=omni_visible),
                        gr.update(visible=uitars_visible),
                        # API key inputs visibility
                        gr.update(visible=show_openai_key),
                        gr.update(visible=show_anthropic_key),
                        # Custom model fields visibility
                        gr.update(visible=is_any_custom),  # Custom model name always visible for any custom option
                        gr.update(visible=is_custom_openai_api),  # Provider base URL only for OpenAI compatible API
                        gr.update(visible=is_custom_openai_api),  # Provider API key only for OpenAI compatible API
                        # Update the hidden model_choice field
                        gr.update(value=model_choice_value)
                    ]

                # Add custom model textbox (only visible when "Custom model..." is selected)
                # Add custom model textbox (visible for both custom model options)
                custom_model = gr.Textbox(
                    label="Custom Model Name",
                    placeholder="Enter custom model name (e.g., Qwen2.5-VL-7B-Instruct)",
                    placeholder="Enter custom model name (e.g., Qwen2.5-VL-7B-Instruct or llama3)",
                    value=initial_custom_model,
                    visible=(initial_model == "Custom model..."),
                    visible=(initial_model == "Custom model (OpenAI compatible API)" or initial_model == "Custom model (ollama)"),
                    interactive=True,
                )

                # Add custom provider base URL textbox (only visible when "Custom model..." is selected)
                # Add custom provider base URL textbox (only visible for OpenAI compatible API)
                provider_base_url = gr.Textbox(
                    label="Provider Base URL",
                    placeholder="Enter provider base URL (e.g., http://localhost:1234/v1)",
                    value=initial_provider_base_url,
                    visible=(initial_model == "Custom model..."),
                    visible=(initial_model == "Custom model (OpenAI compatible API)"),
                    interactive=True,
                )

                # Add custom API key textbox (only visible when "Custom model..." is selected)
                # Add custom API key textbox (only visible for OpenAI compatible API)
                provider_api_key = gr.Textbox(
                    label="Provider API Key",
                    placeholder="Enter provider API key (if required)",
                    value="",
                    visible=(initial_model == "Custom model..."),
                    visible=(initial_model == "Custom model (OpenAI compatible API)"),
                    interactive=True,
                    type="password",
                )

                # Connect agent_loop changes to update all UI elements
                agent_loop.change(
                    fn=update_ui,
                    inputs=[agent_loop, openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice],
                    outputs=[
                        openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice,
                        openai_key_group, anthropic_key_group,
                        custom_model, provider_base_url, provider_api_key,
                        model_choice  # Add model_choice to outputs
                    ],
                    queue=False  # Process immediately without queueing
                )

                # Connect each model dropdown to update UI
                omni_model_choice.change(
                    fn=update_ui,
                    inputs=[agent_loop, openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice],
                    outputs=[
                        openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice,
                        openai_key_group, anthropic_key_group,
                        custom_model, provider_base_url, provider_api_key,
                        model_choice  # Add model_choice to outputs
                    ],
                    queue=False
                )

                uitars_model_choice.change(
                    fn=update_ui,
                    inputs=[agent_loop, openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice],
                    outputs=[
                        openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice,
                        openai_key_group, anthropic_key_group,
                        custom_model, provider_base_url, provider_api_key,
                        model_choice  # Add model_choice to outputs
                    ],
                    queue=False
                )

                openai_model_choice.change(
                    fn=update_ui,
                    inputs=[agent_loop, openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice],
                    outputs=[
                        openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice,
                        openai_key_group, anthropic_key_group,
                        custom_model, provider_base_url, provider_api_key,
                        model_choice  # Add model_choice to outputs
                    ],
                    queue=False
                )

                anthropic_model_choice.change(
                    fn=update_ui,
                    inputs=[agent_loop, openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice],
                    outputs=[
                        openai_model_choice, anthropic_model_choice, omni_model_choice, uitars_model_choice,
                        openai_key_group, anthropic_key_group,
                        custom_model, provider_base_url, provider_api_key,
                        model_choice  # Add model_choice to outputs
                    ],
                    queue=False
                )

            save_trajectory = gr.Checkbox(
                label="Save Trajectory",
@@ -643,6 +929,7 @@ def create_gradio_ui(
                info="Number of recent images to keep in context",
                interactive=True,
            )

        # Right column for chat interface
        with gr.Column(scale=2):
@@ -656,6 +943,9 @@ def create_gradio_ui(
                placeholder="Ask me to perform tasks in a virtual macOS environment"
            )
            clear = gr.Button("Clear")

            # Add cancel button
            cancel_button = gr.Button("Cancel", variant="stop")

            # Add examples
            example_group = gr.Examples(examples=example_messages, inputs=msg)
@@ -666,16 +956,36 @@ def create_gradio_ui(
            history.append(gr.ChatMessage(role="user", content=message))
            return "", history

        # Function to cancel the running agent
        async def cancel_agent_task(history):
            global global_agent
            if global_agent and hasattr(global_agent, '_loop'):
                print("DEBUG - Cancelling agent task")
                # Cancel the agent loop
                if hasattr(global_agent._loop, 'cancel') and callable(global_agent._loop.cancel):
                    await global_agent._loop.cancel()
                    history.append(gr.ChatMessage(role="assistant", content="Task cancelled by user", metadata={"title": "❌ Cancelled"}))
                else:
                    history.append(gr.ChatMessage(role="assistant", content="Could not cancel task: cancel method not found", metadata={"title": "⚠️ Warning"}))
            else:
                history.append(gr.ChatMessage(role="assistant", content="No active agent task to cancel", metadata={"title": "ℹ️ Info"}))
            return history

        # Function to process agent response after user input
        async def process_response(
            history,
            model_choice_value,
            openai_model_value,
            anthropic_model_value,
            omni_model_value,
            uitars_model_value,
            custom_model_value,
            agent_loop_choice,
            save_traj,
            recent_imgs,
            custom_url_value=None,
            custom_api_key=None,
            openai_key_input=None,
            anthropic_key_input=None,
        ):
            if not history:
                yield history
@@ -684,21 +994,53 @@ def create_gradio_ui(
            # Get the last user message
            last_user_message = history[-1]["content"]

            # Get the appropriate model value based on the agent loop
            if agent_loop_choice == "OPENAI":
                model_choice_value = openai_model_value
            elif agent_loop_choice == "ANTHROPIC":
                model_choice_value = anthropic_model_value
            elif agent_loop_choice == "OMNI":
                model_choice_value = omni_model_value
            elif agent_loop_choice == "UITARS":
                model_choice_value = uitars_model_value
            else:
                model_choice_value = "No models available"

            # Determine if this is a custom model selection and which type
            is_custom_openai_api = model_choice_value == "Custom model (OpenAI compatible API)"
            is_custom_ollama = model_choice_value == "Custom model (ollama)"
            is_custom_model_selected = is_custom_openai_api or is_custom_ollama

            # Determine the model name string to analyze: custom or from dropdown
            model_string_to_analyze = (
                custom_model_value
                if model_choice_value == "Custom model..."
                else model_choice_value  # Use the full UI string initially
            )

            # Determine if this is a custom model selection
            is_custom_model_selected = model_choice_value == "Custom model..."
            if is_custom_model_selected:
                model_string_to_analyze = custom_model_value
            else:
                model_string_to_analyze = model_choice_value  # Use the full UI string initially

            try:
                # Get the provider, *cleaned* model name, and agent loop type
                provider, cleaned_model_name_from_func, agent_loop_type = (
                    get_provider_and_model(model_string_to_analyze, agent_loop_choice)
                )
                # Special case for UITARS - use MLXVLM provider or OAICOMPAT for custom
                if agent_loop_choice == "UITARS":
                    if is_custom_openai_api:
                        provider = LLMProvider.OAICOMPAT
                        cleaned_model_name_from_func = custom_model_value
                        agent_loop_type = AgentLoop.UITARS
                        print(f"Using OAICOMPAT provider for custom UITARS model: {custom_model_value}")
                    else:
                        provider = LLMProvider.MLXVLM
                        cleaned_model_name_from_func = model_string_to_analyze
                        agent_loop_type = AgentLoop.UITARS
                        print(f"Using MLXVLM provider for UITARS model: {model_string_to_analyze}")
                # Special case for Ollama custom model
                elif is_custom_ollama and agent_loop_choice == "OMNI":
                    provider = LLMProvider.OLLAMA
                    cleaned_model_name_from_func = custom_model_value
                    agent_loop_type = AgentLoop.OMNI
                    print(f"Using Ollama provider for custom model: {custom_model_value}")
                else:
                    # Get the provider, *cleaned* model name, and agent loop type
                    provider, cleaned_model_name_from_func, agent_loop_type = (
                        get_provider_and_model(model_string_to_analyze, agent_loop_choice)
                    )

                print(f"provider={provider} cleaned_model_name_from_func={cleaned_model_name_from_func} agent_loop_type={agent_loop_type} agent_loop_choice={agent_loop_choice}")

@@ -710,20 +1052,34 @@ def create_gradio_ui(
                    else cleaned_model_name_from_func
                )

                # Determine if OAICOMPAT should be used (only if custom model explicitly selected)
                is_oaicompat = is_custom_model_selected
                # Determine if OAICOMPAT should be used (for OpenAI compatible API custom model)
                is_oaicompat = is_custom_openai_api

                # Get API key based on provider determined by get_provider_and_model
                if is_oaicompat and custom_api_key:
                    # Use custom API key if provided for custom model
                    # Use custom API key if provided for OpenAI compatible API custom model
                    api_key = custom_api_key
                    print(
                        f"DEBUG - Using custom API key for model: {final_model_name_to_send}"
                        f"DEBUG - Using custom API key for OpenAI compatible API model: {final_model_name_to_send}"
                    )
                elif provider == LLMProvider.OLLAMA:
                    # No API key needed for Ollama
                    api_key = ""
                    print(f"DEBUG - No API key needed for Ollama model: {final_model_name_to_send}")
                elif provider == LLMProvider.OPENAI:
                    api_key = openai_api_key or os.environ.get("OPENAI_API_KEY", "")
                    # Use OpenAI key from input if provided, otherwise use environment variable
                    api_key = openai_key_input if openai_key_input else (openai_api_key or os.environ.get("OPENAI_API_KEY", ""))
                    if openai_key_input:
                        # Set the environment variable for the OpenAI API key
                        os.environ["OPENAI_API_KEY"] = openai_key_input
                        print(f"DEBUG - Using provided OpenAI API key from UI and set as environment variable")
                elif provider == LLMProvider.ANTHROPIC:
                    api_key = anthropic_api_key or os.environ.get("ANTHROPIC_API_KEY", "")
                    # Use Anthropic key from input if provided, otherwise use environment variable
                    api_key = anthropic_key_input if anthropic_key_input else (anthropic_api_key or os.environ.get("ANTHROPIC_API_KEY", ""))
                    if anthropic_key_input:
                        # Set the environment variable for the Anthropic API key
                        os.environ["ANTHROPIC_API_KEY"] = anthropic_key_input
                        print(f"DEBUG - Using provided Anthropic API key from UI and set as environment variable")
                else:
                    # For Ollama or default OAICOMPAT (without custom key), no key needed/expected
                    api_key = ""
@@ -742,8 +1098,8 @@ def create_gradio_ui(

                # Create or update the agent
                create_agent(
                    # Provider determined by get_provider_and_model unless custom model selected
                    provider=LLMProvider.OAICOMPAT if is_oaicompat else provider,
                    # Provider determined by special cases and get_provider_and_model
                    provider=provider,
                    agent_loop=agent_loop_type,
                    # Pass the FINAL determined model name (cleaned or custom)
                    model_name=final_model_name_to_send,
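The key handling above boils down to a simple precedence rule per provider; a compact sketch of that rule (function name and parameters are illustrative, not part of the diff):

```python
import os

def resolve_api_key(ui_key: str | None, startup_key: str, env_var: str) -> str:
    # Prefer the key typed into the UI, then the key read at startup,
    # then whatever is already set in the environment.
    return ui_key or startup_key or os.environ.get(env_var, "")
```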
@@ -856,48 +1212,163 @@ def create_gradio_ui(
                # Update with error message
                history.append(gr.ChatMessage(role="assistant", content=f"Error: {str(e)}"))
                yield history

        # Connect the components
        msg.submit(chat_submit, [msg, chatbot_history], [msg, chatbot_history]).then(
            process_response,
            [

        # Connect the submit button to the process_response function
        submit_event = msg.submit(
            fn=chat_submit,
            inputs=[msg, chatbot_history],
            outputs=[msg, chatbot_history],
            queue=False,
        ).then(
            fn=process_response,
            inputs=[
                chatbot_history,
                model_choice,
                openai_model_choice,
                anthropic_model_choice,
                omni_model_choice,
                uitars_model_choice,
                custom_model,
                agent_loop,
                save_trajectory,
                recent_images,
                provider_base_url,
                provider_api_key,
                openai_api_key_input,
                anthropic_api_key_input,
            ],
            [chatbot_history],
            outputs=[chatbot_history],
            queue=True,
        )

        # Clear button functionality
        clear.click(lambda: None, None, chatbot_history, queue=False)

        # Connect agent_loop changes to model selection
        agent_loop.change(
            fn=update_model_choices,
            inputs=[agent_loop],
            outputs=[model_choice],
            queue=False,  # Process immediately without queueing

        # Connect cancel button to cancel function
        cancel_button.click(
            cancel_agent_task,
            [chatbot_history],
            [chatbot_history],
            queue=False  # Process immediately without queueing
        )

        # Show/hide custom model, provider base URL, and API key textboxes based on dropdown selection
        def update_custom_model_visibility(model_value):
            is_custom = model_value == "Custom model..."
            return (
                gr.update(visible=is_custom),
                gr.update(visible=is_custom),
                gr.update(visible=is_custom),
            )

        # Function to update the code display based on configuration and chat history
        def update_code_display(agent_loop, model_choice_val, custom_model_val, chat_history, provider_base_url, recent_images_val, save_trajectory_val):
            # Extract messages from chat history
            messages = []
            if chat_history:
                for msg in chat_history:
                    if msg.get("role") == "user":
                        messages.append(msg.get("content", ""))

            # Determine if this is a custom model selection and which type
            is_custom_openai_api = model_choice_val == "Custom model (OpenAI compatible API)"
            is_custom_ollama = model_choice_val == "Custom model (ollama)"
            is_custom_model_selected = is_custom_openai_api or is_custom_ollama

            # Determine provider and model name based on agent loop
            if agent_loop == "OPENAI":
                # For OPENAI loop, always use OPENAI provider with computer-use-preview
                provider = LLMProvider.OPENAI
                model_name = "computer-use-preview"
            elif agent_loop == "ANTHROPIC":
                # For ANTHROPIC loop, always use ANTHROPIC provider
                provider = LLMProvider.ANTHROPIC
                # Extract model name from the UI string
                if model_choice_val.startswith("Anthropic: Claude "):
                    # Extract the model name based on the UI string
                    model_parts = model_choice_val.replace("Anthropic: Claude ", "").split(" (")
                    version = model_parts[0]  # e.g., "3.7 Sonnet"
                    date = model_parts[1].replace(")", "") if len(model_parts) > 1 else ""  # e.g., "20250219"

                    # Format as claude-3-7-sonnet-20250219 or claude-3-5-sonnet-20240620
                    version = version.replace(".", "-").replace(" ", "-").lower()
                    model_name = f"claude-{version}-{date}"
                else:
                    # Use the model_choice_val directly if it doesn't match the expected format
                    model_name = model_choice_val
            elif agent_loop == "UITARS":
                # For UITARS, use MLXVLM for mlx-community models, OAICOMPAT for custom
                if model_choice_val == "Custom model (OpenAI compatible API)":
                    provider = LLMProvider.OAICOMPAT
                    model_name = custom_model_val
                else:
                    provider = LLMProvider.MLXVLM
                    model_name = model_choice_val
            elif agent_loop == "OMNI":
                # For OMNI, provider can be OPENAI, ANTHROPIC, OLLAMA, or OAICOMPAT
                if is_custom_openai_api:
                    provider = LLMProvider.OAICOMPAT
                    model_name = custom_model_val
                elif is_custom_ollama:
                    provider = LLMProvider.OLLAMA
                    model_name = custom_model_val
                elif model_choice_val.startswith("OMNI: OpenAI "):
                    provider = LLMProvider.OPENAI
                    # Extract model name from UI string (e.g., "OMNI: OpenAI GPT-4o" -> "gpt-4o")
                    model_name = model_choice_val.replace("OMNI: OpenAI ", "").lower().replace(" ", "-")
                elif model_choice_val.startswith("OMNI: Claude "):
                    provider = LLMProvider.ANTHROPIC
                    # Extract model name from UI string (similar to ANTHROPIC loop case)
                    model_parts = model_choice_val.replace("OMNI: Claude ", "").split(" (")
                    version = model_parts[0]  # e.g., "3.7 Sonnet"
                    date = model_parts[1].replace(")", "") if len(model_parts) > 1 else ""  # e.g., "20250219"

                    # Format as claude-3-7-sonnet-20250219 or claude-3-5-sonnet-20240620
                    version = version.replace(".", "-").replace(" ", "-").lower()
                    model_name = f"claude-{version}-{date}"
                elif model_choice_val.startswith("OMNI: Ollama "):
                    provider = LLMProvider.OLLAMA
                    # Extract model name from UI string (e.g., "OMNI: Ollama llama3" -> "llama3")
                    model_name = model_choice_val.replace("OMNI: Ollama ", "")
                else:
                    # Fallback to get_provider_and_model for any other cases
                    provider, model_name, _ = get_provider_and_model(model_choice_val, agent_loop)
            else:
                # Fallback for any other agent loop
                provider, model_name, _ = get_provider_and_model(model_choice_val, agent_loop)

            # Generate and return the code
            return generate_python_code(
                agent_loop,
                provider,
                model_name,
                messages,
                provider_base_url,
                recent_images_val,
                save_trajectory_val
            )

        # Update code display when configuration changes
        agent_loop.change(
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )
        model_choice.change(
            fn=update_custom_model_visibility,
            inputs=[model_choice],
            outputs=[custom_model, provider_base_url, provider_api_key],
            queue=False,  # Process immediately without queueing
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )
        custom_model.change(
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )
        chatbot_history.change(
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )
        recent_images.change(
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )
        save_trajectory.change(
            update_code_display,
            inputs=[agent_loop, model_choice, custom_model, chatbot_history, provider_base_url, recent_images, save_trajectory],
            outputs=[code_display]
        )

    return demo
@@ -37,6 +37,10 @@ openai = [
uitars = [
    "httpx>=0.27.0,<0.29.0",
]
uitars-mlx = [
    # The mlx-vlm package needs to be installed manually with:
    # pip install git+https://github.com/ddupont808/mlx-vlm.git@stable/fix/qwen2-position-id
]
ui = [
    "gradio>=5.23.3,<6.0.0",
    "python-dotenv>=1.0.1,<2.0.0",
@@ -85,6 +89,8 @@ all = [
    "ollama>=0.4.7,<0.5.0",
    "gradio>=5.23.3,<6.0.0",
    "python-dotenv>=1.0.1,<2.0.0"
    # mlx-vlm needs to be installed manually with:
    # pip install git+https://github.com/ddupont808/mlx-vlm.git@stable/fix/qwen2-position-id
]

[tool.pdm]
@@ -343,7 +343,15 @@ class MacOSComputerInterface(BaseComputerInterface):

    # Keyboard Actions
    async def type_text(self, text: str) -> None:
        await self._send_command("type_text", {"text": text})
        # Temporary fix for https://github.com/trycua/cua/issues/165
        # Check if text contains Unicode characters
        if any(ord(char) > 127 for char in text):
            # For Unicode text, use clipboard and paste
            await self.set_clipboard(text)
            await self.hotkey(Key.COMMAND, 'v')
        else:
            # For ASCII text, use the regular typing method
            await self._send_command("type_text", {"text": text})

    async def press(self, key: "KeyType") -> None:
        """Press a single key.
@@ -531,7 +539,7 @@ class MacOSComputerInterface(BaseComputerInterface):
        result = await self._send_command("get_accessibility_tree")
        if not result.get("success", False):
            raise RuntimeError(result.get("error", "Failed to get accessibility tree"))
        return result.get("tree", {})
        return result

    async def get_active_window_bounds(self) -> Dict[str, int]:
        """Get the bounds of the currently active window."""
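The new `type_text` path routes any string containing non-ASCII characters through the clipboard. A quick sketch of the same predicate in isolation (the helper name is illustrative, not part of the diff):

```python
def needs_clipboard_paste(text: str) -> bool:
    # True if any character falls outside 7-bit ASCII, e.g. "café", "日本語", emoji.
    return any(ord(char) > 127 for char in text)

assert needs_clipboard_paste("café") is True
assert needs_clipboard_paste("plain ASCII") is False
```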
@@ -28,6 +28,7 @@ lumier = [
ui = [
    "gradio>=5.23.3,<6.0.0",
    "python-dotenv>=1.0.1,<2.0.0",
    "datasets>=3.6.0,<4.0.0",
]
all = [
    "gradio>=5.23.3,<6.0.0",
@@ -246,6 +246,27 @@ final class DarwinVirtualizationService: BaseVirtualizationService {
        ]
        vzConfig.memoryBalloonDevices = [VZVirtioTraditionalMemoryBalloonDeviceConfiguration()]
        vzConfig.entropyDevices = [VZVirtioEntropyDeviceConfiguration()]

        // Audio configuration
        let soundDeviceConfiguration = VZVirtioSoundDeviceConfiguration()
        let inputAudioStreamConfiguration = VZVirtioSoundDeviceInputStreamConfiguration()
        let outputAudioStreamConfiguration = VZVirtioSoundDeviceOutputStreamConfiguration()

        inputAudioStreamConfiguration.source = VZHostAudioInputStreamSource()
        outputAudioStreamConfiguration.sink = VZHostAudioOutputStreamSink()

        soundDeviceConfiguration.streams = [inputAudioStreamConfiguration, outputAudioStreamConfiguration]
        vzConfig.audioDevices = [soundDeviceConfiguration]

        // Clipboard sharing via Spice agent
        let spiceAgentConsoleDevice = VZVirtioConsoleDeviceConfiguration()
        let spiceAgentPort = VZVirtioConsolePortConfiguration()
        spiceAgentPort.name = VZSpiceAgentPortAttachment.spiceAgentPortName
        let spiceAgentPortAttachment = VZSpiceAgentPortAttachment()
        spiceAgentPortAttachment.sharesClipboard = true
        spiceAgentPort.attachment = spiceAgentPortAttachment
        spiceAgentConsoleDevice.ports[0] = spiceAgentPort
        vzConfig.consoleDevices.append(spiceAgentConsoleDevice)

        // Directory sharing
        let directorySharingDevices = createDirectorySharingDevices(
@@ -376,6 +397,27 @@ final class LinuxVirtualizationService: BaseVirtualizationService {
        ]
        vzConfig.memoryBalloonDevices = [VZVirtioTraditionalMemoryBalloonDeviceConfiguration()]
        vzConfig.entropyDevices = [VZVirtioEntropyDeviceConfiguration()]

        // Audio configuration
        let soundDeviceConfiguration = VZVirtioSoundDeviceConfiguration()
        let inputAudioStreamConfiguration = VZVirtioSoundDeviceInputStreamConfiguration()
        let outputAudioStreamConfiguration = VZVirtioSoundDeviceOutputStreamConfiguration()

        inputAudioStreamConfiguration.source = VZHostAudioInputStreamSource()
        outputAudioStreamConfiguration.sink = VZHostAudioOutputStreamSink()

        soundDeviceConfiguration.streams = [inputAudioStreamConfiguration, outputAudioStreamConfiguration]
        vzConfig.audioDevices = [soundDeviceConfiguration]

        // Clipboard sharing via Spice agent
        let spiceAgentConsoleDevice = VZVirtioConsoleDeviceConfiguration()
        let spiceAgentPort = VZVirtioConsolePortConfiguration()
        spiceAgentPort.name = VZSpiceAgentPortAttachment.spiceAgentPortName
        let spiceAgentPortAttachment = VZSpiceAgentPortAttachment()
        spiceAgentPortAttachment.sharesClipboard = true
        spiceAgentPort.attachment = spiceAgentPortAttachment
        spiceAgentConsoleDevice.ports[0] = spiceAgentPort
        vzConfig.consoleDevices.append(spiceAgentConsoleDevice)

        // Directory sharing
        var directorySharingDevices = createDirectorySharingDevices(
@@ -134,13 +134,35 @@ EOF
extract_json_field() {
    local field_name=$1
    local input=$2
    local result
    result=$(echo "$input" | grep -oP '"'"$field_name"'"\s*:\s*"\K[^"]+')
    if [[ $? -ne 0 ]]; then
        echo ""
    else
        echo "$result"
    local result=""

    # First attempt with jq if available (most reliable JSON parsing)
    if command -v jq &> /dev/null; then
        # Use jq for reliable JSON parsing
        result=$(echo "$input" | jq -r ".$field_name // empty" 2>/dev/null)
        if [[ -n "$result" ]]; then
            echo "$result"
            return 0
        fi
    fi

    # Fallback to grep-based approach with improvements
    # First try for quoted string values
    result=$(echo "$input" | tr -d '\n' | grep -o "\"$field_name\"\s*:\s*\"[^\"]*\"" | sed -E 's/.*":\s*"(.*)"$/\1/')
    if [[ -n "$result" ]]; then
        echo "$result"
        return 0
    fi

    # Try for non-quoted values (numbers, true, false, null)
    result=$(echo "$input" | tr -d '\n' | grep -o "\"$field_name\"\s*:\s*[^,}]*" | sed -E 's/.*":\s*(.*)$/\1/')
    if [[ -n "$result" ]]; then
        echo "$result"
        return 0
    fi

    # Return empty string if field not found
    echo ""
}

extract_json_field_from_file() {
@@ -233,7 +233,7 @@ stop_vm() {
        # still attempt a stop just in case.
        echo "VM status is unknown ('$vm_status') or VM not found during cleanup. Attempting stop anyway."
        lume_stop "$VM_NAME" "$STORAGE_PATH"
        sleep 5000
        sleep 5
        echo "VM '$VM_NAME' stop command issued as a precaution."
    else
        echo "VM status is unknown ('$vm_status') or VM not found. Not attempting stop."
152  scripts/playground.sh  Executable file
@@ -0,0 +1,152 @@
#!/bin/bash

set -e

echo "🚀 Setting up CUA playground environment..."

# Check for Apple Silicon Mac
if [[ $(uname -s) != "Darwin" || $(uname -m) != "arm64" ]]; then
  echo "❌ This script requires an Apple Silicon Mac (M1/M2/M3/M4)."
  exit 1
fi

# Check for macOS 15 (Sequoia) or newer
OSVERSION=$(sw_vers -productVersion)
if [[ $(echo "$OSVERSION 15.0" | tr " " "\n" | sort -V | head -n 1) != "15.0" ]]; then
  echo "❌ This script requires macOS 15 (Sequoia) or newer. You have $OSVERSION."
  exit 1
fi

# Create a temporary directory for our work
TMP_DIR=$(mktemp -d)
cd "$TMP_DIR"

# Function to clean up on exit
cleanup() {
  cd ~
  rm -rf "$TMP_DIR"
}
trap cleanup EXIT

# Install Lume if not already installed
if ! command -v lume &> /dev/null; then
  echo "📦 Installing Lume CLI..."
  curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh | bash

  # Add lume to PATH for this session if it's not already there
  if ! command -v lume &> /dev/null; then
    export PATH="$PATH:$HOME/.local/bin"
  fi
fi

# Pull the macOS CUA image if not already present
if ! lume ls | grep -q "macos-sequoia-cua"; then
  # Check available disk space
  IMAGE_SIZE_GB=30
  AVAILABLE_SPACE_KB=$(df -k $HOME | tail -1 | awk '{print $4}')
  AVAILABLE_SPACE_GB=$(($AVAILABLE_SPACE_KB / 1024 / 1024))

  echo "📊 The macOS CUA image will use approximately ${IMAGE_SIZE_GB}GB of disk space."
  echo "   You currently have ${AVAILABLE_SPACE_GB}GB available on your system."

  # Prompt for confirmation
  read -p "   Continue? [y]/n: " CONTINUE
  CONTINUE=${CONTINUE:-y}

  if [[ $CONTINUE =~ ^[Yy]$ ]]; then
    echo "📥 Pulling macOS CUA image (this may take a while)..."
    lume pull macos-sequoia-cua:latest
  else
    echo "❌ Installation cancelled."
    exit 1
  fi
fi

# Create a Python virtual environment
echo "🐍 Setting up Python environment..."
PYTHON_CMD="python3"

# Check if Python 3.11+ is available
PYTHON_VERSION=$($PYTHON_CMD --version 2>&1 | cut -d" " -f2)
PYTHON_MAJOR=$(echo $PYTHON_VERSION | cut -d. -f1)
PYTHON_MINOR=$(echo $PYTHON_VERSION | cut -d. -f2)

if [ "$PYTHON_MAJOR" -lt 3 ] || ([ "$PYTHON_MAJOR" -eq 3 ] && [ "$PYTHON_MINOR" -lt 11 ]); then
  echo "❌ Python 3.11+ is required. You have $PYTHON_VERSION."
  echo "Please install Python 3.11+ and try again."
  exit 1
fi

# Create a virtual environment
VENV_DIR="$HOME/.cua-venv"
if [ ! -d "$VENV_DIR" ]; then
  $PYTHON_CMD -m venv "$VENV_DIR"
fi

# Activate the virtual environment
source "$VENV_DIR/bin/activate"

# Install required packages
echo "📦 Updating CUA packages..."
pip install -U pip
pip install -U cua-computer "cua-agent[all]"

# Temporary fix for mlx-vlm, see https://github.com/Blaizzy/mlx-vlm/pull/349
pip install git+https://github.com/ddupont808/mlx-vlm.git@stable/fix/qwen2-position-id

# Create a simple demo script
DEMO_DIR="$HOME/.cua-demo"
mkdir -p "$DEMO_DIR"

cat > "$DEMO_DIR/run_demo.py" << 'EOF'
import asyncio
import os
from computer import Computer
from agent import ComputerAgent, LLM, AgentLoop, LLMProvider
from agent.ui.gradio.app import create_gradio_ui

# Try to load API keys from environment
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("\n⚠️ No OpenAI API key found. You'll need to provide one in the UI.")

# Launch the Gradio UI and open it in the browser
app = create_gradio_ui()
app.launch(share=False, inbrowser=True)
EOF

# Create a convenience script to run the demo
cat > "$DEMO_DIR/start_demo.sh" << EOF
#!/bin/bash
source "$VENV_DIR/bin/activate"
cd "$DEMO_DIR"
python run_demo.py
EOF
chmod +x "$DEMO_DIR/start_demo.sh"

echo "✅ Setup complete!"
echo "🖥️ You can start the CUA playground by running: $DEMO_DIR/start_demo.sh"

# Check if the VM is running
echo "🔍 Checking if the macOS CUA VM is running..."
VM_RUNNING=$(lume ls | grep "macos-sequoia-cua" | grep "running" || echo "")

if [ -z "$VM_RUNNING" ]; then
  echo "🚀 Starting the macOS CUA VM in the background..."
  lume run macos-sequoia-cua:latest &
  # Wait a moment for the VM to initialize
  sleep 5
  echo "✅ VM started successfully."
else
  echo "✅ macOS CUA VM is already running."
fi

# Ask if the user wants to start the demo now
echo
read -p "Would you like to start the CUA playground now? (y/n) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  echo "🚀 Starting the CUA playground..."
  echo ""
  "$DEMO_DIR/start_demo.sh"
fi