mirror of
https://github.com/trycua/computer.git
synced 2026-05-09 08:49:33 -05:00
Merge branch 'main' into models/opencua
+2
-2
@@ -30,7 +30,7 @@ We're always looking for suggestions to make lume better. If you have an idea:
|
||||
|
||||
We follow strict code formatting guidelines to ensure consistency across the codebase. Before submitting any code:
|
||||
|
||||
1. **Review Our Format Guide**: Please review our [Code Formatting Standards](docs/Developer-Guide.md#code-formatting-standards) section in the Getting Started guide.
|
||||
1. **Review Our Format Guide**: Please review our [Code Formatting Standards](Development.md#code-formatting-standards) section in the Getting Started guide.
|
||||
2. **Configure Your IDE**: We recommend using the workspace settings provided in `.vscode/` for automatic formatting.
|
||||
3. **Run Formatting Tools**: Always run the formatting tools before submitting a PR:
|
||||
```bash
|
||||
@@ -51,6 +51,6 @@ Documentation improvements are always welcome. You can:
|
||||
- Improve API documentation
|
||||
- Add tutorials or guides
|
||||
|
||||
For detailed instructions on setting up your development environment and submitting code contributions, please see our [Developer-Guide](./docs/Developer-Guide.md).
|
||||
For detailed instructions on setting up your development environment and submitting code contributions, please see our [Developer-Guide](Development.md).
|
||||
|
||||
Feel free to join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas or get help with your contributions.
|
||||
+285
@@ -0,0 +1,285 @@
|
||||
# Getting Started
|
||||
|
||||
## Project Structure
|
||||
|
||||
The project is organized as a monorepo with these main packages:
|
||||
|
||||
- `libs/core/` - Base package with telemetry support
|
||||
- `libs/computer/` - Computer-use interface (CUI) library
|
||||
- `libs/agent/` - AI agent library with multi-provider support
|
||||
- `libs/som/` - Set-of-Mark parser
|
||||
- `libs/computer-server/` - Server component for VM
|
||||
- `libs/lume/` - Lume CLI
|
||||
- `libs/pylume/` - Python bindings for Lume
|
||||
|
||||
Each package has its own virtual environment and dependencies, managed through PDM.
|
||||
|
||||
## Local Development Setup
|
||||
|
||||
1. Install Lume CLI:
|
||||
|
||||
```bash
|
||||
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
|
||||
```
|
||||
|
||||
2. Clone the repository:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/trycua/cua.git
|
||||
cd cua
|
||||
```
|
||||
|
||||
3. Create a `.env.local` file in the root directory with your API keys:
|
||||
|
||||
```bash
|
||||
# Required for Anthropic provider
|
||||
ANTHROPIC_API_KEY=your_anthropic_key_here
|
||||
|
||||
# Required for OpenAI provider
|
||||
OPENAI_API_KEY=your_openai_key_here
|
||||
```
|
||||
|
||||
4. Open the workspace in VSCode or Cursor:
|
||||
|
||||
```bash
|
||||
# For Cua Python development
|
||||
code .vscode/py.code-workspace
|
||||
|
||||
# For Lume (Swift) development
|
||||
code .vscode/lume.code-workspace
|
||||
```
|
||||
|
||||
Using the workspace file is strongly recommended as it:
|
||||
|
||||
- Sets up correct Python environments for each package
|
||||
- Configures proper import paths
|
||||
- Enables debugging configurations
|
||||
- Maintains consistent settings across packages
|
||||
|
||||
## Lume Development
|
||||
|
||||
Refer to the [Lume README](./libs/lume/Development.md) for instructions on how to develop the Lume CLI.
|
||||
|
||||
## Python Development
|
||||
|
||||
There are two ways to set up the Python packages:
|
||||
|
||||
### Run the build script
|
||||
|
||||
Run the build script to set up all packages:
|
||||
|
||||
```bash
|
||||
./scripts/build.sh
|
||||
```
|
||||
|
||||
The build script creates a shared virtual environment for all packages. The workspace configuration automatically handles import paths with the correct Python path settings.
|
||||
|
||||
This will:
|
||||
|
||||
- Create a virtual environment for the project
|
||||
- Install all packages in development mode
|
||||
- Set up the correct Python path
|
||||
- Install development tools
|
||||
|
||||
### Install with PDM
|
||||
|
||||
If PDM is not already installed, you can follow the installation instructions [here](https://pdm-project.org/en/latest/#installation).
|
||||
|
||||
To install with PDM, simply run:
|
||||
|
||||
```console
|
||||
pdm install -G:all
|
||||
```
|
||||
|
||||
This installs all the dependencies for development, testing, and building the docs. If you'd only like development dependencies, you can run:
|
||||
|
||||
```console
|
||||
pdm install -d
|
||||
```
|
||||
|
||||
## Running Examples
|
||||
|
||||
The Python workspace includes launch configurations for all packages:
|
||||
|
||||
- "Run Computer Examples" - Runs computer examples
|
||||
- "Run Agent Examples" - Runs agent examples
|
||||
- "SOM" configurations - Various settings for running SOM
|
||||
|
||||
To run examples from VSCode / Cursor:
|
||||
|
||||
1. Press F5 or use the Run/Debug view
|
||||
2. Select the desired configuration
|
||||
|
||||
The workspace also includes compound launch configurations:
|
||||
|
||||
- "Run Computer Examples + Server" - Runs both the Computer Examples and Server simultaneously
|
||||
|
||||
## Docker Development Environment
|
||||
|
||||
As an alternative to installing directly on your host machine, you can use Docker for development.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Docker installed on your machine
|
||||
- Lume server running on your host (port 7777): `lume serve`
|
||||
|
||||
### Setup and Usage
|
||||
|
||||
1. Build the development Docker image:
|
||||
|
||||
```bash
|
||||
./scripts/run-docker-dev.sh build
|
||||
```
|
||||
|
||||
2. Run an example in the container:
|
||||
|
||||
```bash
|
||||
./scripts/run-docker-dev.sh run computer_examples.py
|
||||
```
|
||||
|
||||
3. Get an interactive shell in the container:
|
||||
|
||||
```bash
|
||||
./scripts/run-docker-dev.sh run --interactive
|
||||
```
|
||||
|
||||
4. Stop any running containers:
|
||||
|
||||
```bash
|
||||
./scripts/run-docker-dev.sh stop
|
||||
```
|
||||
|
||||
### How it Works
|
||||
|
||||
The Docker development environment:
|
||||
|
||||
- Installs all required Python dependencies in the container
|
||||
- Mounts your source code from the host at runtime
|
||||
- Automatically configures the connection to use host.docker.internal:7777 for accessing the Lume server on your host machine
|
||||
- Preserves your code changes without requiring rebuilds (source code is mounted as a volume)
|
||||
|
||||
> **Note**: The Docker container doesn't include the macOS-specific Lume executable. Instead, it connects to the Lume server running on your host machine via host.docker.internal:7777. Make sure to start the Lume server on your host before running examples in the container.
|
||||
|
||||
## Cleanup and Reset
|
||||
|
||||
If you need to clean up the environment (non-docker) and start fresh:
|
||||
|
||||
```bash
|
||||
./scripts/cleanup.sh
|
||||
```
|
||||
|
||||
This will:
|
||||
|
||||
- Remove all virtual environments
|
||||
- Clean Python cache files and directories
|
||||
- Remove build artifacts
|
||||
- Clean PDM-related files
|
||||
- Reset environment configurations
|
||||
|
||||
## Code Formatting Standards
|
||||
|
||||
The cua project follows strict code formatting standards to ensure consistency across all packages.
|
||||
|
||||
### Python Code Formatting
|
||||
|
||||
#### Tools
|
||||
|
||||
The project uses the following tools for code formatting and linting:
|
||||
|
||||
- **[Black](https://black.readthedocs.io/)**: Code formatter
|
||||
- **[Ruff](https://beta.ruff.rs/docs/)**: Fast linter and formatter
|
||||
- **[MyPy](https://mypy.readthedocs.io/)**: Static type checker
|
||||
|
||||
These tools are automatically installed when you set up the development environment using the `./scripts/build.sh` script.
|
||||
|
||||
#### Configuration
|
||||
|
||||
The formatting configuration is defined in the root `pyproject.toml` file:
|
||||
|
||||
```toml
|
||||
[tool.black]
|
||||
line-length = 100
|
||||
target-version = ["py311"]
|
||||
|
||||
[tool.ruff]
|
||||
line-length = 100
|
||||
target-version = "py311"
|
||||
select = ["E", "F", "B", "I"]
|
||||
fix = true
|
||||
|
||||
[tool.ruff.format]
|
||||
docstring-code-format = true
|
||||
|
||||
[tool.mypy]
|
||||
strict = true
|
||||
python_version = "3.11"
|
||||
ignore_missing_imports = true
|
||||
disallow_untyped_defs = true
|
||||
check_untyped_defs = true
|
||||
warn_return_any = true
|
||||
show_error_codes = true
|
||||
warn_unused_ignores = false
|
||||
```
|
||||
|
||||
#### Key Formatting Rules
|
||||
|
||||
- **Line Length**: Maximum of 100 characters
|
||||
- **Python Version**: Code should be compatible with Python 3.11+
|
||||
- **Imports**: Automatically sorted (using Ruff's "I" rule)
|
||||
- **Type Hints**: Required for all function definitions (strict mypy mode)
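
As a rough illustration of the rules above (not code from the repository), a function that would pass these checks might look like the following. The name `summarize_lengths` and its behavior are hypothetical; the point is the full type hints, the docstring, and lines kept under 100 characters.

```python
from collections.abc import Iterable


def summarize_lengths(lines: Iterable[str], limit: int = 100) -> dict[str, int]:
    """Count how many lines fit within the configured line-length limit."""
    within = 0
    over = 0
    for line in lines:
        if len(line) <= limit:
            within += 1
        else:
            over += 1
    return {"within": within, "over": over}
```

Every parameter and the return value are annotated, which is what `disallow_untyped_defs = true` asks for.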
|
||||
|
||||
#### IDE Integration
|
||||
|
||||
The repository includes VSCode workspace configurations that enable automatic formatting. When you open the workspace files (as recommended in the setup instructions), the correct formatting settings are automatically applied.
|
||||
|
||||
Python-specific settings in the workspace files:
|
||||
|
||||
```json
|
||||
"[python]": {
|
||||
"editor.formatOnSave": true,
|
||||
"editor.defaultFormatter": "ms-python.black-formatter",
|
||||
"editor.codeActionsOnSave": {
|
||||
"source.organizeImports": "explicit"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Recommended VS Code extensions:
|
||||
|
||||
- Black Formatter (ms-python.black-formatter)
|
||||
- Ruff (charliermarsh.ruff)
|
||||
- Pylance (ms-python.vscode-pylance)
|
||||
|
||||
#### Manual Formatting
|
||||
|
||||
To manually format code:
|
||||
|
||||
```bash
|
||||
# Format all Python files using Black
|
||||
pdm run black .
|
||||
|
||||
# Run Ruff linter with auto-fix
|
||||
pdm run ruff check --fix .
|
||||
|
||||
# Run type checking with MyPy
|
||||
pdm run mypy .
|
||||
```
|
||||
|
||||
#### Pre-commit Validation
|
||||
|
||||
Before submitting a pull request, ensure your code passes all formatting checks:
|
||||
|
||||
```bash
|
||||
# Run all checks
|
||||
pdm run black --check .
|
||||
pdm run ruff check .
|
||||
pdm run mypy .
|
||||
```
|
||||
|
||||
### Swift Code (Lume)
|
||||
|
||||
For Swift code in the `libs/lume` directory:
|
||||
|
||||
- Follow the [Swift API Design Guidelines](https://www.swift.org/documentation/api-design-guidelines/)
|
||||
- Use SwiftFormat for consistent formatting
|
||||
- Code will be automatically formatted on save when using the lume workspace
|
||||
@@ -188,9 +188,9 @@ Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss i
|
||||
|
||||
Cua is open-sourced under the MIT License - see the [LICENSE](LICENSE) file for details.
|
||||
|
||||
The base image `kasmweb/core-ubuntu-jammy` is maintained by [Kasm Technologies](https://github.com/kasmtech/workspaces-core-images) and distributed under the Apache License 2.0. Usage of that image is subject to its own license terms.
|
||||
Portions of this project, specifically components adapted from Kasm Technologies Inc., are also licensed under the MIT License. See [libs/kasm/LICENSE](libs/kasm/LICENSE) for details.
|
||||
|
||||
Microsoft's OmniParser, which is used in this project, is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0) - see the [OmniParser LICENSE](https://github.com/microsoft/OmniParser/blob/master/LICENSE) file for details.
|
||||
Microsoft's OmniParser, which is used in this project, is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0). See the [OmniParser LICENSE](https://github.com/microsoft/OmniParser/blob/master/LICENSE) for details.
|
||||
|
||||
### Third-Party Licenses and Optional Components
|
||||
|
||||
|
||||
@@ -3,6 +3,8 @@ title: Agent Loops
|
||||
description: Supported computer-using agent loops and models
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/agent_nb.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
|
||||
|
||||
An agent can be thought of as a loop - it generates actions, executes them, and repeats until done:
|
||||
|
||||
1. **Generate**: Your `model` generates `output_text`, `computer_call`, `function_call`
|
||||
|
||||
@@ -75,13 +75,7 @@ messages = [
|
||||
|
||||
## Message Types
|
||||
|
||||
- **user**: User input messages
|
||||
- **computer_call**: Computer actions (click, type, keypress, etc.)
|
||||
- **computer_call_output**: Results from computer actions (usually screenshots)
|
||||
- **function_call**: Function calls (e.g., `computer.call`)
|
||||
- **function_call_output**: Results from function calls
|
||||
- **reasoning**: Agent's internal reasoning and planning
|
||||
- **message**: Agent text responses
|
||||
See the complete schema in [Message Format](./message-format).
|
||||
|
||||
### Memory Management
|
||||
|
||||
|
||||
@@ -0,0 +1,121 @@
|
||||
---
|
||||
title: Customizing Your ComputerAgent
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/customizing_computeragent.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
|
||||
|
||||
The `ComputerAgent` interface provides an easy proxy to any computer-using model configuration, and it is a powerful framework for extending and building your own agentic systems.
|
||||
|
||||
This guide shows four proven ways to increase capabilities and success rate:
|
||||
|
||||
- 1 — Simple: Prompt engineering
|
||||
- 2 — Easy: Tools
|
||||
- 3 — Intermediate: Callbacks
|
||||
- 4 — Expert: Custom `@register_agent`
|
||||
|
||||
## 1) Simple: Prompt engineering
|
||||
|
||||
Provide guiding instructions to shape behavior. `ComputerAgent` accepts an optional `instructions: str | None`, which acts like a system-style preface. Internally, this uses a callback that prepends a user message before each LLM call.
|
||||
|
||||
```python
|
||||
from agent.agent import ComputerAgent

# `computer` is assumed to be an already-initialized Computer instance (setup not shown).
agent = ComputerAgent(
    model="openai/computer-use-preview",
    tools=[computer],
    instructions=(
        "You are a meticulous software operator. Prefer safe, deterministic actions. "
        "Always confirm via on-screen text before proceeding."
    ),
)
|
||||
```
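
The prepending behavior described above can be sketched roughly as follows. This is not the library's callback: `prepend_instructions` is a hypothetical helper, and it assumes messages are plain dicts with `role` and `content` keys.

```python
from typing import Any, Dict, List, Optional


def prepend_instructions(
    messages: List[Dict[str, Any]], instructions: Optional[str]
) -> List[Dict[str, Any]]:
    """Return a new message list with the instructions as a leading user message."""
    if not instructions:
        return list(messages)
    preface = {"role": "user", "content": instructions}
    return [preface, *messages]


history = [{"role": "user", "content": "Open the settings page."}]
prepared = prepend_instructions(history, "Prefer safe, deterministic actions.")
```

The sketch returns a new list rather than mutating `history`, so the stored conversation stays free of the preface and it can be re-applied before each call.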
|
||||
|
||||
## 2) Easy: Tools
|
||||
|
||||
Expose deterministic capabilities as tools (Python functions or custom computer handlers). The agent will call them when appropriate.
|
||||
|
||||
```python
|
||||
def calculate_percentage(numerator: float, denominator: float) -> str:
    """Calculate percentage as a string.

    Args:
        numerator: Numerator value
        denominator: Denominator value
    Returns:
        A formatted percentage string (e.g., '75.00%').
    """
    if denominator == 0:
        return "0.00%"
    return f"{(numerator/denominator)*100:.2f}%"

agent = ComputerAgent(
    model="openai/computer-use-preview",
    tools=[computer, calculate_percentage],
)
|
||||
```
|
||||
|
||||
- See `docs/agent-sdk/custom-tools` for authoring function tools.
|
||||
- See `docs/agent-sdk/custom-computer-handlers` for building full computer interfaces.
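
Because a function tool is ordinary Python, it can be exercised on its own before it is handed to the agent. The snippet below repeats `calculate_percentage` from the example above so that it runs standalone; the sample inputs are arbitrary.

```python
def calculate_percentage(numerator: float, denominator: float) -> str:
    """Same function as above, repeated so this snippet is self-contained."""
    if denominator == 0:
        return "0.00%"
    return f"{(numerator / denominator) * 100:.2f}%"


print(calculate_percentage(3, 4))  # 75.00%
print(calculate_percentage(1, 3))  # 33.33%
print(calculate_percentage(5, 0))  # 0.00% (guard against division by zero)
```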
|
||||
|
||||
## 3) Intermediate: Callbacks
|
||||
|
||||
Callbacks provide lifecycle hooks to preprocess messages, postprocess outputs, record trajectories, manage costs, and more.
|
||||
|
||||
```python
|
||||
from agent.callbacks import ImageRetentionCallback, TrajectorySaverCallback, BudgetManagerCallback

agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",
    tools=[computer],
    callbacks=[
        ImageRetentionCallback(only_n_most_recent_images=3),
        TrajectorySaverCallback("./trajectories"),
        BudgetManagerCallback(max_budget=10.0, raise_error=True),
    ],
)
|
||||
```
|
||||
|
||||
- Browse implementations in `libs/python/agent/agent/callbacks/`.
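
To give a feel for what a message-preprocessing hook might do, here is a minimal sketch of the idea behind keeping only the most recent screenshots. It is not the `ImageRetentionCallback` implementation: `keep_recent_screenshots` is a hypothetical function, and it assumes screenshots arrive as `computer_call_output` messages in a list of plain dicts.

```python
from typing import Any, Dict, List


def keep_recent_screenshots(messages: List[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Drop all but the n most recent `computer_call_output` messages."""
    kept_reversed: List[Dict[str, Any]] = []
    seen = 0
    # Walk backwards so the newest screenshots are counted first.
    for message in reversed(messages):
        if message.get("type") == "computer_call_output":
            seen += 1
            if seen > n:
                continue  # older screenshot: skip it
        kept_reversed.append(message)
    return list(reversed(kept_reversed))


history = [
    {"role": "user", "content": "Open the browser."},
    {"type": "computer_call_output", "call_id": "a"},
    {"type": "computer_call_output", "call_id": "b"},
    {"type": "computer_call_output", "call_id": "c"},
]
trimmed = keep_recent_screenshots(history, n=2)
```

A real implementation would likely also need to deal with the `computer_call` entries paired with the dropped outputs; the sketch leaves those untouched.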
|
||||
|
||||
## 4) Expert: Custom `@register_agent`
|
||||
|
||||
Build your own agent configuration class to control prompting, message shaping, and tool handling. This is the most flexible option for specialized domains.
|
||||
|
||||
- Register your own `model=...` loop using `@register_agent`
|
||||
- Browse implementations in `libs/python/agent/agent/loops/`.
|
||||
- Implement `predict_step()` (and optionally `predict_click()`) and return the standardized output schema.
|
||||
|
||||
```python
|
||||
from agent.decorators import register_agent
|
||||
|
||||
@register_agent(models=r".*my-special-model.*", priority=10)
class MyCustomAgentConfig:
    async def predict_step(self, messages, model, tools, **kwargs):
        # 1) Format messages for your provider
        # 2) Call provider
        # 3) Convert responses to the agent output schema
        return {"output": [], "usage": {}}

    async def predict_click(self, model, image_b64, instruction):
        # Optional: click-only capability
        return None

    def get_capabilities(self):
        return ["step"]
|
||||
```
|
||||
|
||||
## HUD integration (optional)
|
||||
|
||||
When using the HUD evaluation integration (`agent/integrations/hud/`), you can pass `instructions`, `tools`, and `callbacks` directly:
|
||||
|
||||
```python
|
||||
from agent.integrations.hud import run_single_task

await run_single_task(
    dataset="username/dataset-name",
    model="openai/computer-use-preview",
    instructions="Operate carefully. Always verify on-screen text before actions.",
    # tools=[your_custom_function],
    # callbacks=[YourCustomCallback()],
)
|
||||
```
|
||||
@@ -3,6 +3,8 @@ title: HUD Evals
|
||||
description: Use ComputerAgent with HUD for benchmarking and evaluation
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
|
||||
|
||||
The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/). Through the HUD integration, the agent controls a computer inside HUD, where tests are run to evaluate the success of each task.
|
||||
|
||||
## Installation
|
||||
@@ -76,7 +78,7 @@ results = await run_full_dataset(
|
||||
- `max_steps` (`int`): Default: `50`
|
||||
Safety cap on steps per task to prevent infinite loops.
|
||||
- `split` (`str`): Default: `"train"`
|
||||
Dataset split or subset (e.g., `"train[:10]"`).
|
||||
Dataset split or subset to run. Uses the [Hugging Face split format](https://huggingface.co/docs/datasets/v1.11.0/splits.html), e.g., `"train[:10]"` for the first 10 tasks.
|
||||
|
||||
## Additional Parameters
|
||||
|
||||
|
||||
@@ -0,0 +1,201 @@
|
||||
---
|
||||
title: Message Format
|
||||
---
|
||||
|
||||
This page documents the Python message and response schema used by the Agent SDK.
|
||||
It mirrors the structure shown in Chat History and provides precise type definitions you can target in your own code.
|
||||
|
||||
All examples below use Python type hints with `TypedDict` and `Literal` from the standard `typing` module.
|
||||
|
||||
## Response
|
||||
|
||||
The agent yields response chunks as an async generator of objects with `output` and `usage`.
|
||||
|
||||
```python
|
||||
from typing import List, TypedDict

class Usage(TypedDict, total=False):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    response_cost: float  # USD cost if available

class AgentResponse(TypedDict):
    output: List["AgentMessage"]
    usage: Usage
|
||||
```
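
Consuming such a stream might look like the sketch below. `fake_agent_run` is a hypothetical stand-in that yields `AgentResponse`-shaped dicts so the example runs without the SDK; the token and cost figures are made up.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def fake_agent_run() -> AsyncIterator[Dict[str, Any]]:
    """Hypothetical stand-in for an agent run; yields AgentResponse-shaped dicts."""
    yield {
        "output": [{"type": "reasoning", "summary": []}],
        "usage": {"total_tokens": 120, "response_cost": 0.002},
    }
    yield {
        "output": [{"type": "message", "role": "assistant", "content": []}],
        "usage": {"total_tokens": 80},  # cost may be absent (Usage is total=False)
    }


async def collect(chunks: AsyncIterator[Dict[str, Any]]) -> Dict[str, Any]:
    """Gather all output messages and sum the usage fields that are present."""
    messages: List[Dict[str, Any]] = []
    total_tokens = 0
    total_cost = 0.0
    async for chunk in chunks:
        messages.extend(chunk["output"])
        usage = chunk.get("usage", {})
        total_tokens += usage.get("total_tokens", 0)
        total_cost += usage.get("response_cost", 0.0)
    return {"messages": messages, "total_tokens": total_tokens, "total_cost": total_cost}


summary = asyncio.run(collect(fake_agent_run()))
```

Because `Usage` is declared with `total=False`, the sketch reads every usage field with `.get()` and a default instead of indexing directly.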
|
||||
|
||||
## Messages
|
||||
|
||||
Agent messages represent the state of the conversation and the agent's actions.
|
||||
|
||||
```python
|
||||
from typing import List, Literal, Optional, TypedDict, Union

# Union of all message variants
AgentMessage = Union[
    "UserMessage",
    "AssistantMessage",
    "ReasoningMessage",
    "ComputerCallMessage",
    "ComputerCallOutputMessage",
    "FunctionCallMessage",
    "FunctionCallOutputMessage",
]

# Input message (role: user/system/developer)
class UserMessage(TypedDict, total=False):
    type: Literal["message"]  # optional for user input
    role: Literal["user", "system", "developer"]
    content: Union[str, List["InputContent"]]

# Output message (assistant text)
class AssistantMessage(TypedDict):
    type: Literal["message"]
    role: Literal["assistant"]
    content: List["OutputContent"]

# Output reasoning/thinking message
class ReasoningMessage(TypedDict):
    type: Literal["reasoning"]
    summary: List["SummaryContent"]

# Output computer action call (agent intends to act)
class ComputerCallMessage(TypedDict):
    type: Literal["computer_call"]
    call_id: str
    status: Literal["completed", "failed", "pending"]
    action: "ComputerAction"

# Output computer action result (always a screenshot)
class ComputerCallOutputMessage(TypedDict):
    type: Literal["computer_call_output"]
    call_id: str
    output: "ComputerResultContent"

# Output function call (agent calls a Python tool)
class FunctionCallMessage(TypedDict):
    type: Literal["function_call"]
    call_id: str
    status: Literal["completed", "failed", "pending"]
    name: str
    arguments: str  # JSON-serialized kwargs

# Output function call result (text)
class FunctionCallOutputMessage(TypedDict):
    type: Literal["function_call_output"]
    call_id: str
    output: str
|
||||
```
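
To show how a `function_call` and its `function_call_output` relate, here is a minimal sketch that executes a call message against a table of Python tools. The `TOOLS` table, the `add` tool, and `run_function_call` are hypothetical and exist only for this example; the SDK performs this step for you.

```python
import json
from typing import Any, Callable, Dict


def add(a: int, b: int) -> int:
    """Example tool used only for this sketch."""
    return a + b


TOOLS: Dict[str, Callable[..., Any]] = {"add": add}


def run_function_call(message: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a `function_call` message and build the matching output message."""
    if message["type"] != "function_call":
        raise ValueError(f"expected function_call, got {message['type']}")
    kwargs = json.loads(message["arguments"])  # arguments are JSON-serialized kwargs
    result = TOOLS[message["name"]](**kwargs)
    return {
        "type": "function_call_output",
        "call_id": message["call_id"],  # ties the result back to the call
        "output": str(result),
    }


call = {
    "type": "function_call",
    "call_id": "call_1",
    "status": "completed",
    "name": "add",
    "arguments": json.dumps({"a": 2, "b": 3}),
}
result = run_function_call(call)
```

The shared `call_id` is what lets a consumer pair each output with the call that produced it.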
|
||||
|
||||
## Message Content
|
||||
|
||||
These content items appear inside `content` arrays for the message types above.
|
||||
|
||||
```python
|
||||
from typing import Literal, Optional, TypedDict

# Input content kinds
class InputContent(TypedDict):
    type: Literal["input_image", "input_text"]
    text: Optional[str]
    image_url: Optional[str]  # e.g., data URL

# Assistant output content
class OutputContent(TypedDict):
    type: Literal["output_text"]
    text: str

# Reasoning/summary output content
class SummaryContent(TypedDict):
    type: Literal["summary_text"]
    text: str

# Computer call outputs (screenshots)
class ComputerResultContent(TypedDict):
    type: Literal["computer_screenshot", "input_image"]
    image_url: str  # data URL (e.g., "data:image/png;base64,....")
|
||||
```
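
Since screenshots travel as data URLs, building a `ComputerResultContent` item from raw image bytes could look like this sketch. The helper name `to_screenshot_content` is hypothetical, and the sample bytes are only the PNG file signature rather than a real image.

```python
import base64
from typing import Dict


def to_screenshot_content(png_bytes: bytes) -> Dict[str, str]:
    """Wrap raw PNG bytes as a `computer_screenshot` item with a data URL."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "computer_screenshot",
        "image_url": f"data:image/png;base64,{encoded}",
    }


# The 8-byte PNG signature stands in for a real screenshot here.
content = to_screenshot_content(b"\x89PNG\r\n\x1a\n")
```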
|
||||
|
||||
## Actions
|
||||
|
||||
Computer actions represent concrete operations the agent will perform on the computer.
|
||||
|
||||
Two broad families exist depending on the provider: OpenAI-style and Anthropic-style.
|
||||
|
||||
```python
|
||||
from typing import List, Literal, TypedDict, Union

# Union of all supported computer actions
ComputerAction = Union[
    "ClickAction",
    "DoubleClickAction",
    "DragAction",
    "KeyPressAction",
    "MoveAction",
    "ScreenshotAction",
    "ScrollAction",
    "TypeAction",
    "WaitAction",
    # Anthropic variants
    "LeftMouseDownAction",
    "LeftMouseUpAction",
]

# OpenAI Computer Actions
class ClickAction(TypedDict):
    type: Literal["click"]
    button: Literal["left", "right", "wheel", "back", "forward"]
    x: int
    y: int

class DoubleClickAction(TypedDict, total=False):
    type: Literal["double_click"]
    button: Literal["left", "right", "wheel", "back", "forward"]
    x: int
    y: int

class DragAction(TypedDict, total=False):
    type: Literal["drag"]
    button: Literal["left", "right", "wheel", "back", "forward"]
    path: List[tuple[int, int]]  # [(x1, y1), (x2, y2), ...]

class KeyPressAction(TypedDict):
    type: Literal["keypress"]
    keys: List[str]  # e.g., ["ctrl", "a"]

class MoveAction(TypedDict):
    type: Literal["move"]
    x: int
    y: int

class ScreenshotAction(TypedDict):
    type: Literal["screenshot"]

class ScrollAction(TypedDict):
    type: Literal["scroll"]
    scroll_x: int
    scroll_y: int
    x: int
    y: int

class TypeAction(TypedDict):
    type: Literal["type"]
    text: str

class WaitAction(TypedDict):
    type: Literal["wait"]

# Anthropic Computer Actions
class LeftMouseDownAction(TypedDict):
    type: Literal["left_mouse_down"]
    x: int
    y: int

class LeftMouseUpAction(TypedDict):
    type: Literal["left_mouse_up"]
    x: int
    y: int
|
||||
```
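
Code that consumes these actions usually branches on the `type` field. The following sketch turns an action dict into a short log line; `describe_action` is a hypothetical helper for illustration and covers only a few of the action types.

```python
from typing import Any, Dict


def describe_action(action: Dict[str, Any]) -> str:
    """Return a short human-readable description of a computer action."""
    kind = action["type"]
    if kind in ("click", "double_click"):
        button = action.get("button", "left")
        return f"{kind} {button} at ({action['x']}, {action['y']})"
    if kind == "type":
        return f"type {action['text']!r}"
    if kind == "keypress":
        return "press " + "+".join(action["keys"])
    if kind == "scroll":
        delta = f"({action['scroll_x']}, {action['scroll_y']})"
        return f"scroll by {delta} at ({action['x']}, {action['y']})"
    # Remaining types (move, drag, screenshot, wait, ...) fall back to the type name.
    return kind


print(describe_action({"type": "click", "button": "left", "x": 10, "y": 20}))
print(describe_action({"type": "keypress", "keys": ["ctrl", "a"]}))
```

Falling back to the bare type name for unrecognized actions keeps the helper tolerant of provider-specific variants.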
|
||||
|
||||
## Notes
|
||||
|
||||
- The agent runtime may add provider-specific fields when available (e.g., usage cost). Unknown fields should be ignored for forward compatibility.
|
||||
- Computer action outputs are screenshots as data URLs. For security and storage, some serializers may redact or omit large fields in persisted metadata.
|
||||
- The message flow typically alternates between reasoning, actions, screenshots, and concluding assistant text. See [Chat History](./chat-history) for a step-by-step example.
|
||||
@@ -6,6 +6,8 @@
|
||||
"supported-agents",
|
||||
"supported-model-providers",
|
||||
"chat-history",
|
||||
"message-format",
|
||||
"customizing-computeragent",
|
||||
"callbacks",
|
||||
"custom-tools",
|
||||
"custom-computer-handlers",
|
||||
|
||||
@@ -3,6 +3,8 @@ title: Cua Computers
|
||||
description: Understanding cua computer types and connection methods
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/computer_nb.ipynb" target="_blank">Jupyter Notebook</a> and <a href="https://github.com/trycua/cua/tree/main/examples/computer-example-ts" target="_blank">NodeJS project</a> are available for this documentation.</Callout>
|
||||
|
||||
Before we can automate apps using AI, we need to first connect to a Computer Server to give the AI a safe environment to execute workflows in.
|
||||
|
||||
Cua Computers are preconfigured virtual machines running the Computer Server. They can run macOS, Linux, or Windows, and they are hosted either in a cloud-native container or on your host desktop.
|
||||
|
||||
@@ -3,6 +3,8 @@ title: Sandboxed Python
|
||||
slug: sandboxed-python
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/examples/sandboxed_functions_examples.py" target="_blank">Python example</a> is available for this documentation.</Callout>
|
||||
|
||||
You can run Python functions securely inside a sandboxed virtual environment on a remote Cua Computer. This is useful for executing untrusted user code, isolating dependencies, or providing a safe environment for automation tasks.
|
||||
|
||||
## How It Works
|
||||
|
||||
@@ -6,6 +6,8 @@ github:
|
||||
- https://github.com/trycua/cua/tree/main/libs/python/computer-server
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/computer_server_nb.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
|
||||
|
||||
The Computer Server API reference documentation is currently under development.
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -6,6 +6,8 @@ github:
|
||||
- https://github.com/trycua/cua/tree/main/libs/python/som
|
||||
---
|
||||
|
||||
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/examples/som_examples.py" target="_blank">Python example</a> is available for this documentation.</Callout>
|
||||
|
||||
## Overview
|
||||
|
||||
The SOM library provides visual element detection and interaction capabilities. It is based on the [Set-of-Mark](https://arxiv.org/abs/2310.11441) research paper and the [OmniParser](https://github.com/microsoft/OmniParser) model.
|
||||
|
||||
@@ -18,6 +18,12 @@ gnome-screenshot wmctrl ffmpeg socat xclip
|
||||
|
||||
RUN pip install cua-computer-server
|
||||
|
||||
# Install Firefox
|
||||
ENV DEBIAN_FRONTEND=noninteractive \
|
||||
INST_DIR=$STARTUPDIR/install
|
||||
COPY ./src/ $INST_DIR
|
||||
RUN bash ${INST_DIR}/ubuntu/install/firefox/install_firefox.sh
|
||||
|
||||
# Disable SSL requirement
|
||||
RUN sed -i 's/require_ssl: true/require_ssl: false/g' /usr/share/kasmvnc/kasmvnc_defaults.yaml
|
||||
RUN sed -i 's/-sslOnly//g' /dockerstartup/vnc_startup.sh
|
||||
|
||||
@@ -0,0 +1,24 @@
|
||||
# LICENSE
|
||||
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2025 Cua AI, Inc.
|
||||
Portions Copyright (c) 2022 Kasm Technologies Inc.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
@@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env bash
|
||||
set -ex
|
||||
START_COMMAND="firefox"
|
||||
PGREP="firefox"
|
||||
export MAXIMIZE="true"
|
||||
export MAXIMIZE_NAME="Mozilla Firefox"
|
||||
MAXIMIZE_SCRIPT=$STARTUPDIR/maximize_window.sh
|
||||
DEFAULT_ARGS=""
|
||||
ARGS=${APP_ARGS:-$DEFAULT_ARGS}
|
||||
|
||||
options=$(getopt -o gau: -l go,assign,url: -n "$0" -- "$@") || exit
|
||||
eval set -- "$options"
|
||||
|
||||
while [[ $1 != -- ]]; do
|
||||
case $1 in
|
||||
-g|--go) GO='true'; shift 1;;
|
||||
-a|--assign) ASSIGN='true'; shift 1;;
|
||||
-u|--url) OPT_URL=$2; shift 2;;
|
||||
*) echo "bad option: $1" >&2; exit 1;;
|
||||
esac
|
||||
done
|
||||
shift
|
||||
|
||||
# Process non-option arguments.
|
||||
for arg; do
|
||||
echo "arg! $arg"
|
||||
done
|
||||
|
||||
FORCE=$2
|
||||
|
||||
# run with vgl if GPU is available
|
||||
if [ -f /opt/VirtualGL/bin/vglrun ] && [ ! -z "${KASM_EGL_CARD}" ] && [ ! -z "${KASM_RENDERD}" ] && [ -O "${KASM_RENDERD}" ] && [ -O "${KASM_EGL_CARD}" ] ; then
|
||||
START_COMMAND="/opt/VirtualGL/bin/vglrun -d ${KASM_EGL_CARD} $START_COMMAND"
|
||||
fi
|
||||
|
||||
kasm_exec() {
|
||||
if [ -n "$OPT_URL" ] ; then
|
||||
URL=$OPT_URL
|
||||
elif [ -n "$1" ] ; then
|
||||
URL=$1
|
||||
fi
|
||||
|
||||
# Since we are execing into a container that already has the browser running from startup,
|
||||
# when we don't have a URL to open we want to do nothing. Otherwise a second browser instance would open.
|
||||
if [ -n "$URL" ] ; then
|
||||
/usr/bin/filter_ready
|
||||
/usr/bin/desktop_ready
|
||||
bash ${MAXIMIZE_SCRIPT} &
|
||||
$START_COMMAND $ARGS $OPT_URL
|
||||
else
|
||||
echo "No URL specified for exec command. Doing nothing."
|
||||
fi
|
||||
}
|
||||
|
||||
kasm_startup() {
|
||||
if [ -n "$KASM_URL" ] ; then
|
||||
URL=$KASM_URL
|
||||
elif [ -z "$URL" ] ; then
|
||||
URL=$LAUNCH_URL
|
||||
fi
|
||||
|
||||
if [ -z "$DISABLE_CUSTOM_STARTUP" ] || [ -n "$FORCE" ] ; then
|
||||
|
||||
echo "Entering process startup loop"
|
||||
set +x
|
||||
while true
|
||||
do
|
||||
if ! pgrep -x $PGREP > /dev/null
|
||||
then
|
||||
/usr/bin/filter_ready
|
||||
/usr/bin/desktop_ready
|
||||
set +e
|
||||
bash ${MAXIMIZE_SCRIPT} &
|
||||
$START_COMMAND $ARGS $URL
|
||||
set -e
|
||||
fi
|
||||
sleep 1
|
||||
done
|
||||
set -x
|
||||
|
||||
fi
|
||||
|
||||
}
|
||||
|
||||
if [ -n "$GO" ] || [ -n "$ASSIGN" ] ; then
|
||||
kasm_exec
|
||||
else
|
||||
kasm_startup
|
||||
fi
|
||||
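Review note: the startup script above picks the URL to open in two different ways depending on whether it runs at container startup or via an exec call. The following is a minimal Python sketch of that precedence only, assuming environment variables are passed in as a plain dict; the function names `resolve_startup_url` and `resolve_exec_url` are hypothetical and do not appear in the script.

```python
from typing import Mapping, Optional


def resolve_startup_url(env: Mapping[str, str]) -> str:
    """Mirror kasm_startup: KASM_URL wins, then an existing URL, then LAUNCH_URL."""
    kasm_url = env.get("KASM_URL", "")
    if kasm_url:
        return kasm_url
    url = env.get("URL", "")
    if not url:
        # URL was empty or unset, so fall back to LAUNCH_URL (possibly empty too).
        return env.get("LAUNCH_URL", "")
    return url


def resolve_exec_url(opt_url: Optional[str], first_arg: Optional[str]) -> str:
    """Mirror kasm_exec: the --url option wins over the first positional argument."""
    if opt_url:
        return opt_url
    if first_arg:
        return first_arg
    # An empty result corresponds to the "Doing nothing" branch of the script.
    return ""


if __name__ == "__main__":
    print(resolve_startup_url({"URL": "https://example.org", "LAUNCH_URL": "about:blank"}))
    print(resolve_exec_url(None, None) or "No URL specified for exec command.")
```

In the exec path the script only launches the browser when the resolved URL is non-empty, which is what avoids a second browser instance.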
@@ -0,0 +1,221 @@
|
||||
[Desktop Entry]
|
||||
Version=1.0
|
||||
Name=Firefox Web Browser
|
||||
Name[ar]=متصفح الويب فَيَرفُكْس
|
||||
Name[ast]=Restolador web Firefox
|
||||
Name[bn]=ফায়ারফক্স ওয়েব ব্রাউজার
|
||||
Name[ca]=Navegador web Firefox
|
||||
Name[cs]=Firefox Webový prohlížeč
|
||||
Name[da]=Firefox - internetbrowser
|
||||
Name[el]=Περιηγητής Firefox
|
||||
Name[es]=Navegador web Firefox
|
||||
Name[et]=Firefoxi veebibrauser
|
||||
Name[fa]=مرورگر اینترنتی Firefox
|
||||
Name[fi]=Firefox-selain
|
||||
Name[fr]=Navigateur Web Firefox
|
||||
Name[gl]=Navegador web Firefox
|
||||
Name[he]=דפדפן האינטרנט Firefox
|
||||
Name[hr]=Firefox web preglednik
|
||||
Name[hu]=Firefox webböngésző
|
||||
Name[it]=Firefox Browser Web
|
||||
Name[ja]=Firefox ウェブ・ブラウザ
|
||||
Name[ko]=Firefox 웹 브라우저
|
||||
Name[ku]=Geroka torê Firefox
|
||||
Name[lt]=Firefox interneto naršyklė
|
||||
Name[nb]=Firefox Nettleser
|
||||
Name[nl]=Firefox webbrowser
|
||||
Name[nn]=Firefox Nettlesar
|
||||
Name[no]=Firefox Nettleser
|
||||
Name[pl]=Przeglądarka WWW Firefox
|
||||
Name[pt]=Firefox Navegador Web
|
||||
Name[pt_BR]=Navegador Web Firefox
|
||||
Name[ro]=Firefox – Navigator Internet
|
||||
Name[ru]=Веб-браузер Firefox
|
||||
Name[sk]=Firefox - internetový prehliadač
|
||||
Name[sl]=Firefox spletni brskalnik
|
||||
Name[sv]=Firefox webbläsare
|
||||
Name[tr]=Firefox Web Tarayıcısı
|
||||
Name[ug]=Firefox توركۆرگۈ
|
||||
Name[uk]=Веб-браузер Firefox
|
||||
Name[vi]=Trình duyệt web Firefox
|
||||
Name[zh_CN]=Firefox 网络浏览器
|
||||
Name[zh_TW]=Firefox 網路瀏覽器
|
||||
Comment=Browse the World Wide Web
|
||||
Comment[ar]=تصفح الشبكة العنكبوتية العالمية
|
||||
Comment[ast]=Restola pela Rede
|
||||
Comment[bn]=ইন্টারনেট ব্রাউজ করুন
|
||||
Comment[ca]=Navegueu per la web
|
||||
Comment[cs]=Prohlížení stránek World Wide Webu
|
||||
Comment[da]=Surf på internettet
|
||||
Comment[de]=Im Internet surfen
|
||||
Comment[el]=Μπορείτε να περιηγηθείτε στο διαδίκτυο (Web)
|
||||
Comment[es]=Navegue por la web
|
||||
Comment[et]=Lehitse veebi
|
||||
Comment[fa]=صفحات شبکه جهانی اینترنت را مرور نمایید
|
||||
Comment[fi]=Selaa Internetin WWW-sivuja
|
||||
Comment[fr]=Naviguer sur le Web
|
||||
Comment[gl]=Navegar pola rede
|
||||
Comment[he]=גלישה ברחבי האינטרנט
|
||||
Comment[hr]=Pretražite web
|
||||
Comment[hu]=A világháló böngészése
|
||||
Comment[it]=Esplora il web
|
||||
Comment[ja]=ウェブを閲覧します
|
||||
Comment[ko]=웹을 돌아 다닙니다
|
||||
Comment[ku]=Li torê bigere
|
||||
Comment[lt]=Naršykite internete
|
||||
Comment[nb]=Surf på nettet
|
||||
Comment[nl]=Verken het internet
|
||||
Comment[nn]=Surf på nettet
|
||||
Comment[no]=Surf på nettet
|
||||
Comment[pl]=Przeglądanie stron WWW
|
||||
Comment[pt]=Navegue na Internet
|
||||
Comment[pt_BR]=Navegue na Internet
|
||||
Comment[ro]=Navigați pe Internet
|
||||
Comment[ru]=Доступ в Интернет
|
||||
Comment[sk]=Prehliadanie internetu
|
||||
Comment[sl]=Brskajte po spletu
|
||||
Comment[sv]=Surfa på webben
|
||||
Comment[tr]=İnternet'te Gezinin
|
||||
Comment[ug]=دۇنيادىكى توربەتلەرنى كۆرگىلى بولىدۇ
|
||||
Comment[uk]=Перегляд сторінок Інтернету
|
||||
Comment[vi]=Để duyệt các trang web
|
||||
Comment[zh_CN]=浏览互联网
|
||||
Comment[zh_TW]=瀏覽網際網路
|
||||
GenericName=Web Browser
|
||||
GenericName[ar]=متصفح ويب
|
||||
GenericName[ast]=Restolador Web
|
||||
GenericName[bn]=ওয়েব ব্রাউজার
|
||||
GenericName[ca]=Navegador web
|
||||
GenericName[cs]=Webový prohlížeč
|
||||
GenericName[da]=Webbrowser
|
||||
GenericName[el]=Περιηγητής διαδικτύου
|
||||
GenericName[es]=Navegador web
|
||||
GenericName[et]=Veebibrauser
|
||||
GenericName[fa]=مرورگر اینترنتی
|
||||
GenericName[fi]=WWW-selain
|
||||
GenericName[fr]=Navigateur Web
|
||||
GenericName[gl]=Navegador Web
|
||||
GenericName[he]=דפדפן אינטרנט
|
||||
GenericName[hr]=Web preglednik
|
||||
GenericName[hu]=Webböngésző
|
||||
GenericName[it]=Browser web
|
||||
GenericName[ja]=ウェブ・ブラウザ
|
||||
GenericName[ko]=웹 브라우저
|
||||
GenericName[ku]=Geroka torê
|
||||
GenericName[lt]=Interneto naršyklė
|
||||
GenericName[nb]=Nettleser
|
||||
GenericName[nl]=Webbrowser
|
||||
GenericName[nn]=Nettlesar
|
||||
GenericName[no]=Nettleser
|
||||
GenericName[pl]=Przeglądarka WWW
|
||||
GenericName[pt]=Navegador Web
|
||||
GenericName[pt_BR]=Navegador Web
|
||||
GenericName[ro]=Navigator Internet
|
||||
GenericName[ru]=Веб-браузер
|
||||
GenericName[sk]=Internetový prehliadač
|
||||
GenericName[sl]=Spletni brskalnik
|
||||
GenericName[sv]=Webbläsare
|
||||
GenericName[tr]=Web Tarayıcı
|
||||
GenericName[ug]=توركۆرگۈ
|
||||
GenericName[uk]=Веб-браузер
|
||||
GenericName[vi]=Trình duyệt Web
|
||||
GenericName[zh_CN]=网络浏览器
|
||||
GenericName[zh_TW]=網路瀏覽器
|
||||
Keywords=Internet;WWW;Browser;Web;Explorer
|
||||
Keywords[ar]=انترنت;إنترنت;متصفح;ويب;وب
|
||||
Keywords[ast]=Internet;WWW;Restolador;Web;Esplorador
|
||||
Keywords[ca]=Internet;WWW;Navegador;Web;Explorador;Explorer
|
||||
Keywords[cs]=Internet;WWW;Prohlížeč;Web;Explorer
|
||||
Keywords[da]=Internet;Internettet;WWW;Browser;Browse;Web;Surf;Nettet
|
||||
Keywords[de]=Internet;WWW;Browser;Web;Explorer;Webseite;Site;surfen;online;browsen
|
||||
Keywords[el]=Internet;WWW;Browser;Web;Explorer;Διαδίκτυο;Περιηγητής;Firefox;Φιρεφοχ;Ιντερνετ
|
||||
Keywords[es]=Explorador;Internet;WWW
|
||||
Keywords[fi]=Internet;WWW;Browser;Web;Explorer;selain;Internet-selain;internetselain;verkkoselain;netti;surffaa
|
||||
Keywords[fr]=Internet;WWW;Browser;Web;Explorer;Fureteur;Surfer;Navigateur
|
||||
Keywords[he]=דפדפן;אינטרנט;רשת;אתרים;אתר;פיירפוקס;מוזילה;
|
||||
Keywords[hr]=Internet;WWW;preglednik;Web
|
||||
Keywords[hu]=Internet;WWW;Böngésző;Web;Háló;Net;Explorer
|
||||
Keywords[it]=Internet;WWW;Browser;Web;Navigatore
|
||||
Keywords[is]=Internet;WWW;Vafri;Vefur;Netvafri;Flakk
|
||||
Keywords[ja]=Internet;WWW;Web;インターネット;ブラウザ;ウェブ;エクスプローラ
|
||||
Keywords[nb]=Internett;WWW;Nettleser;Explorer;Web;Browser;Nettside
|
||||
Keywords[nl]=Internet;WWW;Browser;Web;Explorer;Verkenner;Website;Surfen;Online
|
||||
Keywords[pt]=Internet;WWW;Browser;Web;Explorador;Navegador
|
||||
Keywords[pt_BR]=Internet;WWW;Browser;Web;Explorador;Navegador
|
||||
Keywords[ru]=Internet;WWW;Browser;Web;Explorer;интернет;браузер;веб;файрфокс;огнелис
|
||||
Keywords[sk]=Internet;WWW;Prehliadač;Web;Explorer
|
||||
Keywords[sl]=Internet;WWW;Browser;Web;Explorer;Brskalnik;Splet
|
||||
Keywords[tr]=İnternet;WWW;Tarayıcı;Web;Gezgin;Web sitesi;Site;sörf;çevrimiçi;tara
|
||||
Keywords[uk]=Internet;WWW;Browser;Web;Explorer;Інтернет;мережа;переглядач;оглядач;браузер;веб;файрфокс;вогнелис;перегляд
|
||||
Keywords[vi]=Internet;WWW;Browser;Web;Explorer;Trình duyệt;Trang web
|
||||
Keywords[zh_CN]=Internet;WWW;Browser;Web;Explorer;网页;浏览;上网;火狐;Firefox;ff;互联网;网站;
|
||||
Keywords[zh_TW]=Internet;WWW;Browser;Web;Explorer;網際網路;網路;瀏覽器;上網;網頁;火狐
|
||||
Exec=firefox %u
|
||||
Terminal=false
|
||||
X-MultipleArgs=false
|
||||
Type=Application
|
||||
Icon=/usr/lib/firefox/browser/chrome/icons/default/default128.png
|
||||
Categories=GNOME;GTK;Network;WebBrowser;
|
||||
MimeType=text/html;text/xml;application/xhtml+xml;application/xml;application/rss+xml;application/rdf+xml;image/gif;image/jpeg;image/png;x-scheme-handler/http;x-scheme-handler/https;x-scheme-handler/ftp;x-scheme-handler/chrome;video/webm;application/x-xpinstall;
|
||||
StartupNotify=true
|
||||
Actions=NewWindow;NewPrivateWindow;
|
||||
|
||||
[Desktop Action NewWindow]
|
||||
Name=Open a New Window
|
||||
Name[ar]=افتح نافذة جديدة
|
||||
Name[ast]=Abrir una ventana nueva
|
||||
Name[bn]=Abrir una ventana nueva
|
||||
Name[ca]=Obre una finestra nova
|
||||
Name[cs]=Otevřít nové okno
|
||||
Name[da]=Åbn et nyt vindue
|
||||
Name[de]=Ein neues Fenster öffnen
|
||||
Name[el]=Άνοιγμα νέου παραθύρου
|
||||
Name[es]=Abrir una ventana nueva
|
||||
Name[fi]=Avaa uusi ikkuna
|
||||
Name[fr]=Ouvrir une nouvelle fenêtre
|
||||
Name[gl]=Abrir unha nova xanela
|
||||
Name[he]=פתיחת חלון חדש
|
||||
Name[hr]=Otvori novi prozor
|
||||
Name[hu]=Új ablak nyitása
|
||||
Name[it]=Apri una nuova finestra
|
||||
Name[ja]=新しいウィンドウを開く
|
||||
Name[ko]=새 창 열기
|
||||
Name[ku]=Paceyeke nû veke
|
||||
Name[lt]=Atverti naują langą
|
||||
Name[nb]=Åpne et nytt vindu
|
||||
Name[nl]=Nieuw venster openen
|
||||
Name[pt]=Abrir nova janela
|
||||
Name[pt_BR]=Abrir nova janela
|
||||
Name[ro]=Deschide o fereastră nouă
|
||||
Name[ru]=Новое окно
|
||||
Name[sk]=Otvoriť nové okno
|
||||
Name[sl]=Odpri novo okno
|
||||
Name[sv]=Öppna ett nytt fönster
|
||||
Name[tr]=Yeni pencere aç
|
||||
Name[ug]=يېڭى كۆزنەك ئېچىش
|
||||
Name[uk]=Відкрити нове вікно
|
||||
Name[vi]=Mở cửa sổ mới
|
||||
Name[zh_CN]=新建窗口
|
||||
Name[zh_TW]=開啟新視窗
|
||||
Exec=firefox -new-window
|
||||
OnlyShowIn=Unity;
|
||||
|
||||
[Desktop Action NewPrivateWindow]
|
||||
Name=Open a New Private Window
|
||||
Name[ar]=افتح نافذة جديدة للتصفح الخاص
|
||||
Name[ca]=Obre una finestra nova en mode d'incògnit
|
||||
Name[de]=Ein neues privates Fenster öffnen
|
||||
Name[es]=Abrir una ventana privada nueva
|
||||
Name[fi]=Avaa uusi yksityinen ikkuna
|
||||
Name[fr]=Ouvrir une nouvelle fenêtre de navigation privée
|
||||
Name[he]=פתיחת חלון גלישה פרטית חדש
|
||||
Name[hu]=Új privát ablak nyitása
|
||||
Name[it]=Apri una nuova finestra anonima
|
||||
Name[nb]=Åpne et nytt privat vindu
|
||||
Name[ru]=Новое приватное окно
|
||||
Name[sl]=Odpri novo okno zasebnega brskanja
|
||||
Name[tr]=Yeni bir pencere aç
|
||||
Name[uk]=Відкрити нове вікно у потайливому режимі
|
||||
Name[zh_TW]=開啟新隱私瀏覽視窗
|
||||
Exec=firefox -private-window
|
||||
OnlyShowIn=Unity;
|
||||
@@ -0,0 +1,236 @@
|
||||
#!/usr/bin/env bash
|
||||
set -xe
|
||||
|
||||
# Add icon
|
||||
if [ -f /dockerstartup/install/ubuntu/install/firefox/firefox.desktop ]; then
|
||||
mv /dockerstartup/install/ubuntu/install/firefox/firefox.desktop $HOME/Desktop/
|
||||
fi
|
||||
|
||||
ARCH=$(arch | sed 's/aarch64/arm64/g' | sed 's/x86_64/amd64/g')
|
||||
|
||||
set_desktop_icon() {
|
||||
sed -i -e 's!Icon=.\+!Icon=/usr/share/icons/hicolor/48x48/apps/firefox.png!' "$HOME/Desktop/firefox.desktop"
|
||||
}
|
||||
|
||||
echo "Install Firefox"
|
||||
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
|
||||
dnf install -y firefox p11-kit
|
||||
elif [ "${DISTRO}" == "opensuse" ]; then
|
||||
zypper install -yn p11-kit-tools MozillaFirefox
|
||||
elif grep -q Jammy /etc/os-release || grep -q Noble /etc/os-release; then
|
||||
if [ ! -f '/etc/apt/preferences.d/mozilla-firefox' ]; then
|
||||
add-apt-repository -y ppa:mozillateam/ppa
|
||||
echo '
|
||||
Package: *
|
||||
Pin: release o=LP-PPA-mozillateam
|
||||
Pin-Priority: 1001
|
||||
' > /etc/apt/preferences.d/mozilla-firefox
|
||||
fi
|
||||
apt-get install -y firefox p11-kit-modules
|
||||
elif grep -q "ID=kali" /etc/os-release; then
|
||||
apt-get update
|
||||
apt-get install -y firefox-esr p11-kit-modules
|
||||
rm -f $HOME/Desktop/firefox.desktop
|
||||
cp \
|
||||
/usr/share/applications/firefox-esr.desktop \
|
||||
$HOME/Desktop/
|
||||
chmod +x $HOME/Desktop/firefox-esr.desktop
|
||||
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
|
||||
if [ "${ARCH}" == "amd64" ]; then
|
||||
install -d -m 0755 /etc/apt/keyrings
|
||||
wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg -O- > /etc/apt/keyrings/packages.mozilla.org.asc
|
||||
echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" > /etc/apt/sources.list.d/mozilla.list
|
||||
echo '
|
||||
Package: *
|
||||
Pin: origin packages.mozilla.org
|
||||
Pin-Priority: 1000
|
||||
' > /etc/apt/preferences.d/mozilla
|
||||
apt-get update
|
||||
apt-get install -y firefox p11-kit-modules
|
||||
else
|
||||
apt-get update
|
||||
apt-get install -y firefox-esr p11-kit-modules
|
||||
rm -f $HOME/Desktop/firefox.desktop
|
||||
cp \
|
||||
/usr/share/applications/firefox-esr.desktop \
|
||||
$HOME/Desktop/
|
||||
chmod +x $HOME/Desktop/firefox-esr.desktop
|
||||
fi
|
||||
else
|
||||
apt-mark unhold firefox || :
|
||||
apt-get remove firefox
|
||||
apt-get update
|
||||
apt-get install -y firefox p11-kit-modules
|
||||
fi
|
||||
|
||||
# Add Langpacks
|
||||
FIREFOX_VERSION=$(curl -sI https://download.mozilla.org/?product=firefox-latest | awk -F '(releases/|/win32)' '/Location/ {print $2}')
|
||||
RELEASE_URL="https://releases.mozilla.org/pub/firefox/releases/${FIREFOX_VERSION}/win64/xpi/"
|
||||
LANGS=$(curl -Ls ${RELEASE_URL} | awk -F '(xpi">|</a>)' '/href.*xpi/ {print $2}' | tr '\n' ' ')
|
||||
EXTENSION_DIR=/usr/lib/firefox-addons/distribution/extensions/
|
||||
mkdir -p ${EXTENSION_DIR}
|
||||
for LANG in ${LANGS}; do
|
||||
LANGCODE=$(echo ${LANG} | sed 's/\.xpi//g')
|
||||
echo "Downloading ${LANG} Language pack"
|
||||
curl -o \
|
||||
${EXTENSION_DIR}langpack-${LANGCODE}@firefox.mozilla.org.xpi -Ls \
|
||||
${RELEASE_URL}${LANG}
|
||||
done
|
||||
|
||||
# Cleanup and install flash if supported
|
||||
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
|
||||
if [ -z ${SKIP_CLEAN+x} ]; then
|
||||
dnf clean all
|
||||
fi
|
||||
elif [ "${DISTRO}" == "opensuse" ]; then
|
||||
if [ -z ${SKIP_CLEAN+x} ]; then
|
||||
zypper clean --all
|
||||
fi
|
||||
else
|
||||
if [ "$ARCH" == "arm64" ] && [ "$(lsb_release -cs)" == "focal" ] ; then
|
||||
echo "Firefox flash player not supported on arm64 Ubuntu Focal Skipping"
|
||||
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=kali" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
|
||||
echo "Firefox flash player not supported on Debian"
|
||||
elif grep -q Focal /etc/os-release; then
|
||||
# Plugin to support running flash videos for sites like vimeo
|
||||
apt-get update
|
||||
apt-get install -y browser-plugin-freshplayer-pepperflash
|
||||
apt-mark hold firefox
|
||||
if [ -z ${SKIP_CLEAN+x} ]; then
|
||||
apt-get autoclean
|
||||
rm -rf \
|
||||
/var/lib/apt/lists/* \
|
||||
/var/tmp/*
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
if [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
# Update firefox to utilize the system certificate store instead of the one that ships with firefox
|
||||
if grep -q "ID=debian" /etc/os-release || grep -q "ID=kali" /etc/os-release || grep -q "ID=parrot" /etc/os-release && [ "${ARCH}" == "arm64" ]; then
|
||||
rm -f /usr/lib/firefox-esr/libnssckbi.so
|
||||
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox-esr/libnssckbi.so
|
||||
elif grep -q "ID=kali" /etc/os-release && [ "${ARCH}" == "amd64" ]; then
|
||||
rm -f /usr/lib/firefox-esr/libnssckbi.so
|
||||
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox-esr/libnssckbi.so
|
||||
else
|
||||
rm -f /usr/lib/firefox/libnssckbi.so
|
||||
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox/libnssckbi.so
|
||||
fi
|
||||
fi
|
||||
|
||||
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
|
||||
if [[ "${DISTRO}" == @(fedora39|fedora40) ]]; then
|
||||
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/firefox-redhat-default-prefs.js
|
||||
else
|
||||
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/all-redhat.js
|
||||
fi
|
||||
sed -i -e '/homepage/d' "$preferences_file"
|
||||
elif [ "${DISTRO}" == "opensuse" ]; then
|
||||
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/firefox.js
|
||||
elif grep -q "ID=kali" /etc/os-release; then
|
||||
preferences_file=/usr/lib/firefox-esr/defaults/pref/firefox.js
|
||||
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
|
||||
if [ "${ARCH}" == "amd64" ]; then
|
||||
preferences_file=/usr/lib/firefox/defaults/pref/firefox.js
|
||||
else
|
||||
preferences_file=/usr/lib/firefox-esr/defaults/pref/firefox.js
|
||||
fi
|
||||
else
|
||||
preferences_file=/usr/lib/firefox/browser/defaults/preferences/firefox.js
|
||||
fi
|
||||
|
||||
# Disabling default first run URL for Debian based images
|
||||
if [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
cat >"$preferences_file" <<EOF
|
||||
pref("datareporting.policy.firstRunURL", "");
|
||||
pref("datareporting.policy.dataSubmissionEnabled", false);
|
||||
pref("datareporting.healthreport.service.enabled", false);
|
||||
pref("datareporting.healthreport.uploadEnabled", false);
|
||||
pref("trailhead.firstrun.branches", "nofirstrun-empty");
|
||||
pref("browser.aboutwelcome.enabled", false);
|
||||
EOF
|
||||
fi
|
||||
|
||||
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
# Creating a default profile
|
||||
chown -R root:root $HOME
|
||||
firefox -headless -CreateProfile "kasm $HOME/.mozilla/firefox/kasm"
|
||||
# Generate a certdb to be detected on squid start
|
||||
HOME=/root firefox --headless &
|
||||
mkdir -p /root/.mozilla
|
||||
CERTDB=$(find /root/.mozilla* -name "cert9.db")
|
||||
while [ -z "${CERTDB}" ] ; do
|
||||
sleep 1
|
||||
echo "waiting for certdb"
|
||||
CERTDB=$(find /root/.mozilla* -name "cert9.db")
|
||||
done
|
||||
sleep 2
|
||||
kill $(pgrep firefox)
|
||||
CERTDIR=$(dirname ${CERTDB})
|
||||
mv ${CERTDB} $HOME/.mozilla/firefox/kasm/
|
||||
rm -Rf /root/.mozilla
|
||||
else
|
||||
# Creating Default Profile
|
||||
chown -R 0:0 $HOME
|
||||
firefox -headless -CreateProfile "kasm $HOME/.mozilla/firefox/kasm"
|
||||
fi
|
||||
|
||||
# Silence Firefox security nag "Some of Firefox's features may offer less protection on your current operating system".
|
||||
echo 'user_pref("security.sandbox.warn_unprivileged_namespaces", false);' > $HOME/.mozilla/firefox/kasm/user.js
|
||||
chown 1000:1000 $HOME/.mozilla/firefox/kasm/user.js
|
||||
|
||||
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
set_desktop_icon
|
||||
fi
|
||||
|
||||
# Starting with version 67, Firefox creates a unique profile mapping per installation which is hash generated
|
||||
# based off the installation path. Because that path will be static for our deployments we can assume the hash
|
||||
# and thus assign our profile to the default for the installation
|
||||
if grep -q "ID=kali" /etc/os-release; then
|
||||
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
|
||||
[Install3B6073811A6ABF12]
|
||||
Default=kasm
|
||||
Locked=1
|
||||
EOL
|
||||
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
|
||||
if [ "${ARCH}" != "amd64" ]; then
|
||||
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
|
||||
[Install3B6073811A6ABF12]
|
||||
Default=kasm
|
||||
Locked=1
|
||||
EOL
|
||||
else
|
||||
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
|
||||
[Install4F96D1932A9F858E]
|
||||
Default=kasm
|
||||
Locked=1
|
||||
EOL
|
||||
fi
|
||||
elif [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
|
||||
[Install4F96D1932A9F858E]
|
||||
Default=kasm
|
||||
Locked=1
|
||||
EOL
|
||||
elif [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
|
||||
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
|
||||
[Install11457493C5A56847]
|
||||
Default=kasm
|
||||
Locked=1
|
||||
EOL
|
||||
fi
|
||||
|
||||
# Desktop Icon Fixes
|
||||
if [[ "${DISTRO}" == @(rockylinux9|oracle9|rhel9|almalinux9|fedora39|fedora40) ]]; then
|
||||
sed -i 's#Icon=/usr/lib/firefox#Icon=/usr/lib64/firefox#g' $HOME/Desktop/firefox.desktop
|
||||
fi
|
||||
|
||||
# Cleanup for app layer
|
||||
chown -R 1000:0 $HOME
|
||||
find /usr/share/ -name "icon-theme.cache" -exec rm -f {} \;
|
||||
if [ -f $HOME/Desktop/firefox.desktop ]; then
|
||||
chmod +x $HOME/Desktop/firefox.desktop
|
||||
fi
|
||||
chown -R 1000:1000 $HOME/.mozilla
|
||||
|
||||
@@ -31,7 +31,8 @@ from .callbacks import (
     TrajectorySaverCallback,
     BudgetManagerCallback,
     TelemetryCallback,
-    OperatorNormalizerCallback
+    OperatorNormalizerCallback,
+    PromptInstructionsCallback,
 )
 from .computers import (
     AsyncComputerHandler,
@@ -162,6 +163,7 @@ class ComputerAgent:
         custom_loop: Optional[Callable] = None,
         only_n_most_recent_images: Optional[int] = None,
         callbacks: Optional[List[Any]] = None,
+        instructions: Optional[str] = None,
         verbosity: Optional[int] = None,
         trajectory_dir: Optional[str | Path | dict] = None,
         max_retries: Optional[int] = 3,
@@ -181,6 +183,7 @@ class ComputerAgent:
             custom_loop: Custom agent loop function to use instead of auto-selection
             only_n_most_recent_images: If set, only keep the N most recent images in message history. Adds ImageRetentionCallback automatically.
             callbacks: List of AsyncCallbackHandler instances for preprocessing/postprocessing
+            instructions: Optional system instructions to be passed to the model
             verbosity: Logging level (logging.DEBUG, logging.INFO, etc.). If set, adds LoggingCallback automatically
             trajectory_dir: If set, saves trajectory data (screenshots, responses) to this directory. Adds TrajectorySaverCallback automatically.
             max_retries: Maximum number of retries for failed API calls
@@ -200,6 +203,7 @@ class ComputerAgent:
         self.custom_loop = custom_loop
         self.only_n_most_recent_images = only_n_most_recent_images
         self.callbacks = callbacks or []
+        self.instructions = instructions
         self.verbosity = verbosity
         self.trajectory_dir = trajectory_dir
         self.max_retries = max_retries
@@ -214,6 +218,10 @@ class ComputerAgent:
         # Prepend operator normalizer callback
         self.callbacks.insert(0, OperatorNormalizerCallback())
 
+        # Add prompt instructions callback if provided
+        if self.instructions:
+            self.callbacks.append(PromptInstructionsCallback(self.instructions))
+
         # Add telemetry callback if telemetry_enabled is set
         if self.telemetry_enabled:
             if isinstance(self.telemetry_enabled, bool):
|
||||
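Review note: the hunks above wire the new `instructions` argument into the callback list. The sketch below reproduces only the ordering visible in this diff, using stub classes instead of the real callbacks; `build_callbacks` and both stub names are hypothetical, and the real constructor goes on to add further callbacks (telemetry and others) that are not modelled here. Unlike the original, the sketch copies the incoming list rather than mutating it.

```python
from typing import Any, List, Optional


class OperatorNormalizerStub:
    """Placeholder for the normalizer callback that is always inserted first."""


class PromptInstructionsStub:
    """Placeholder for the instructions callback; it just stores the text."""

    def __init__(self, instructions: str) -> None:
        self.instructions = instructions


def build_callbacks(
    callbacks: Optional[List[Any]] = None,
    instructions: Optional[str] = None,
) -> List[Any]:
    # Copy so the caller's list is left untouched (a choice made for this sketch).
    result = list(callbacks or [])
    # The normalizer is prepended, so it runs before user-supplied callbacks.
    result.insert(0, OperatorNormalizerStub())
    # The instructions callback is appended only for a non-empty string.
    if instructions:
        result.append(PromptInstructionsStub(instructions))
    return result


if __name__ == "__main__":
    names = [type(cb).__name__ for cb in build_callbacks(["user-cb"], "Be concise")]
    print(names)
```

An empty or missing `instructions` value leaves the list without the extra callback, which matches the `if self.instructions:` guard in the diff.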
@@ -9,6 +9,7 @@ from .trajectory_saver import TrajectorySaverCallback
 from .budget_manager import BudgetManagerCallback
 from .telemetry import TelemetryCallback
 from .operator_validator import OperatorNormalizerCallback
+from .prompt_instructions import PromptInstructionsCallback
 
 __all__ = [
     "AsyncCallbackHandler",
@@ -18,4 +19,5 @@ __all__ = [
     "BudgetManagerCallback",
     "TelemetryCallback",
     "OperatorNormalizerCallback",
+    "PromptInstructionsCallback",
 ]
|
||||
@@ -0,0 +1,47 @@
+"""
+Prompt instructions callback.
+
+This callback allows simple prompt engineering by pre-pending a user
+instructions message to the start of the conversation before each LLM call.
+
+Usage:
+
+    from agent.callbacks import PromptInstructionsCallback
+    agent = ComputerAgent(
+        model="openai/computer-use-preview",
+        callbacks=[PromptInstructionsCallback("Follow these rules...")]
+    )
+
+"""
+
+from typing import Any, Dict, List, Optional
+
+from .base import AsyncCallbackHandler
+
+
+class PromptInstructionsCallback(AsyncCallbackHandler):
+    """
+    Prepend a user instructions message to the message list.
+
+    This is a minimal, non-invasive way to guide the agent's behavior without
+    modifying agent loops or tools. It works with any provider/loop since it
+    only alters the messages array before sending to the model.
+    """
+
+    def __init__(self, instructions: Optional[str]) -> None:
+        self.instructions = instructions
+
+    async def on_llm_start(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+        # Pre-pend instructions message
+        if not self.instructions:
+            return messages
+
+        # Ensure we don't duplicate if already present at the front
+        if messages and isinstance(messages[0], dict):
+            first = messages[0]
+            if first.get("role") == "user" and first.get("content") == self.instructions:
+                return messages
+
+        return [
+            {"role": "user", "content": self.instructions},
+        ] + messages
|
||||
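Review note: the new callback prepends the instructions once and then leaves an already-prefixed list alone. A small self-contained sketch of that behaviour follows; `PromptInstructionsSketch` is a hypothetical stand-in that copies the `on_llm_start` logic from the file above without the `AsyncCallbackHandler` base class, so it can run outside the package.

```python
import asyncio
from typing import Any, Dict, List, Optional


class PromptInstructionsSketch:
    """Stand-in with the same on_llm_start logic as the callback in the diff."""

    def __init__(self, instructions: Optional[str]) -> None:
        self.instructions = instructions

    async def on_llm_start(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # No instructions configured: pass the history through unchanged.
        if not self.instructions:
            return messages
        # Already prefixed with the same user message: do not add it twice.
        if messages and isinstance(messages[0], dict):
            first = messages[0]
            if first.get("role") == "user" and first.get("content") == self.instructions:
                return messages
        # Build a new list; the caller's list is not modified.
        return [{"role": "user", "content": self.instructions}] + messages


async def demo() -> List[Dict[str, Any]]:
    callback = PromptInstructionsSketch("Follow these rules...")
    history = [{"role": "user", "content": "Open the settings page"}]
    once = await callback.on_llm_start(history)
    twice = await callback.on_llm_start(once)
    assert twice == once       # second call is a no-op
    assert len(history) == 1   # original history untouched
    return twice


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that the duplicate check only compares the first message, so instructions that end up deeper in the history (for example after another callback prepends its own message) would be added again.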
@@ -1,102 +1,28 @@
|
||||
"""HUD integration: Generic HuggingFace dataset evaluation runner (CUA proxy).
|
||||
"""HUD integration: dataset runners and MCP-based computer agent export.
|
||||
|
||||
This module exposes two helpers to evaluate HUD-compatible datasets using
|
||||
HUD's OperatorAgent, while proxying model calls through our ComputerAgent via
|
||||
`FakeAsyncOpenAI` (see `agent/integrations/hud/agent.py`).
|
||||
This module exposes helpers to evaluate HUD-compatible datasets and exports
|
||||
the MCP-compatible computer agent implementation.
|
||||
|
||||
Exports:
|
||||
- run_single_task(dataset_name, *, agent_type="cua-proxy", model=None, allowed_tools=None)
|
||||
- run_full_dataset(dataset_name, *, agent_type="cua-proxy", model=None, allowed_tools=None, max_concurrent=30, max_steps=50)
|
||||
- run_single_task(dataset, ...)
|
||||
- run_full_dataset(dataset, ...)
|
||||
- MCPComputerAgent
|
||||
"""
|
||||
import time
|
||||
from typing import Any, Optional
|
||||
|
||||
from PIL import Image
|
||||
from agent.computers import is_agent_computer
|
||||
from datasets import load_dataset, Dataset
|
||||
from hud.agents import OperatorAgent
|
||||
from hud.datasets import Task, run_dataset
|
||||
from hud.tools.computer.settings import computer_settings
|
||||
from hud import trace
|
||||
|
||||
from agent.agent import ComputerAgent as BaseComputerAgent
|
||||
from .proxy import FakeAsyncOpenAI
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Proxy OperatorAgent
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class ProxyOperatorAgent(OperatorAgent):
|
||||
"""OperatorAgent that proxies model calls through our ComputerAgent.
|
||||
|
||||
Accepts the same config keys we pass via hud.run_dataset `agent_config`:
|
||||
- model: str | None
|
||||
- allowed_tools: list[str] | None
|
||||
Additional kwargs are forwarded to OperatorAgent (if any are supported).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
model: str | None = None,
|
||||
allowed_tools: list[str] | None = None,
|
||||
trajectory_dir: str | dict | None = None,
|
||||
# === ComputerAgent kwargs ===
|
||||
tools: list[Any] | None = None,
|
||||
custom_loop: Any | None = None,
|
||||
only_n_most_recent_images: int | None = None,
|
||||
callbacks: list[Any] | None = None,
|
||||
verbosity: int | None = None,
|
||||
max_retries: int | None = 3,
|
||||
screenshot_delay: float | int = 0.5,
|
||||
use_prompt_caching: bool | None = False,
|
||||
max_trajectory_budget: float | dict | None = None,
|
||||
telemetry_enabled: bool | None = True,
|
||||
**kwargs: Any,
|
||||
) -> None:
|
||||
model = model or "computer-use-preview"
|
||||
allowed_tools = allowed_tools or ["openai_computer"]
|
||||
|
||||
computer_shim = {
|
||||
'screenshot': lambda: Image.new('RGB', (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)),
|
||||
'environment': 'linux',
|
||||
'dimensions': (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)
|
||||
}
|
||||
# Build tools ensuring the computer_shim is included
|
||||
agent_tools: list[Any] = [computer_shim]
|
||||
if tools:
|
||||
agent_tools.extend(tools)
|
||||
|
||||
computer_agent = BaseComputerAgent(
|
||||
model=model,
|
||||
tools=agent_tools,
|
||||
custom_loop=custom_loop,
|
||||
only_n_most_recent_images=only_n_most_recent_images,
|
||||
callbacks=callbacks,
|
||||
verbosity=verbosity,
|
||||
trajectory_dir=trajectory_dir,
|
||||
max_retries=max_retries,
|
||||
screenshot_delay=screenshot_delay,
|
||||
use_prompt_caching=use_prompt_caching,
|
||||
max_trajectory_budget=max_trajectory_budget,
|
||||
telemetry_enabled=telemetry_enabled,
|
||||
)
|
||||
model_client = FakeAsyncOpenAI(computer_agent)
|
||||
|
||||
super().__init__(
|
||||
model_client=model_client, # type: ignore[arg-type]
|
||||
model=model,
|
||||
allowed_tools=allowed_tools,
|
||||
**kwargs,
|
||||
)
|
||||
from .agent import MCPComputerAgent
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Single-task runner
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def run_single_task(
|
||||
dataset: str | Dataset | list[dict[str, Any]],
|
||||
*,
|
||||
@@ -108,6 +34,7 @@ async def run_single_task(
|
||||
custom_loop: Any | None = None,
|
||||
only_n_most_recent_images: int | None = None,
|
||||
callbacks: list[Any] | None = None,
|
||||
instructions: str | None = None,
|
||||
verbosity: int | None = None,
|
||||
trajectory_dir: str | dict | None = None,
|
||||
max_retries: int | None = 3,
|
||||
@@ -116,7 +43,7 @@ async def run_single_task(
|
||||
max_trajectory_budget: float | dict | None = None,
|
||||
telemetry_enabled: bool | None = True,
|
||||
) -> None:
|
||||
"""Load one task from the dataset and execute it with Operator+CUA proxy."""
|
||||
"""Load one task from the dataset and execute it with MCPComputerAgent."""
|
||||
|
||||
# Load dataset and pick a sample
|
||||
if isinstance(dataset, str):
|
||||
@@ -129,17 +56,27 @@ async def run_single_task(
|
||||
sample_task = dataset[task_id] # type: ignore[index]
|
||||
task_prompt = sample_task.get("prompt", f"Task {sample_task.get('id', 0)}") # type: ignore[attr-defined]
|
||||
|
||||
# Filter any existing Computer tools
|
||||
# The eval framework will add its own Computer tool per task
|
||||
if tools:
|
||||
tools = [
|
||||
tool
|
||||
for tool in tools
|
||||
if not is_agent_computer(tool)
|
||||
]
|
||||
|
||||
with trace(name=task_prompt):
|
||||
task = Task(**sample_task) # type: ignore[arg-type]
|
||||
|
||||
agent = ProxyOperatorAgent(
|
||||
model=model,
|
||||
allowed_tools=allowed_tools,
|
||||
agent = MCPComputerAgent(
|
||||
model=model or "computer-use-preview",
|
||||
allowed_tools=allowed_tools or ["openai_computer"],
|
||||
# === ComputerAgent kwargs passthrough ===
|
||||
tools=tools,
|
||||
custom_loop=custom_loop,
|
||||
only_n_most_recent_images=only_n_most_recent_images,
|
||||
callbacks=callbacks,
|
||||
instructions=instructions,
|
||||
verbosity=verbosity,
|
||||
trajectory_dir=trajectory_dir,
|
||||
max_retries=max_retries,
|
||||
@@ -157,7 +94,6 @@ async def run_single_task(
|
||||
# Full-dataset runner
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def run_full_dataset(
|
||||
dataset: str | Dataset | list[dict[str, Any]],
|
||||
*,
|
||||
@@ -173,6 +109,7 @@ async def run_full_dataset(
|
||||
custom_loop: Any | None = None,
|
||||
only_n_most_recent_images: int | None = 5,
|
||||
callbacks: list[Any] | None = None,
|
||||
instructions: str | None = None,
|
||||
verbosity: int | None = None,
|
||||
max_retries: int | None = 3,
|
||||
screenshot_delay: float | int = 0.5,
|
||||
@@ -182,9 +119,7 @@ async def run_full_dataset(
|
||||
) -> list[Any]:
|
||||
"""Run evaluation across the entire dataset using hud.datasets.run_dataset."""
|
||||
|
||||
# We pass OperatorAgent as the class and provide a config that injects our
|
||||
# FakeAsyncOpenAI per agent instantiation.
|
||||
|
||||
# Run with our MCP-based agent class.
|
||||
if isinstance(dataset, str):
|
||||
dataset_name = dataset.split('/')[-1]
|
||||
job_name = job_name or f"Evaluation {dataset_name}"
|
||||
@@ -193,11 +128,20 @@ async def run_full_dataset(
|
||||
dataset_name = "custom"
|
||||
job_name = job_name or f"Evaluation {time.strftime('%H:%M %Y-%m-%d')}"
|
||||
|
||||
# Filter any existing Computer tools
|
||||
# The eval framework will add its own Computer tool per task
|
||||
if tools:
|
||||
tools = [
|
||||
tool
|
||||
for tool in tools
|
||||
if not is_agent_computer(tool)
|
||||
]
|
||||
|
||||
# Execute evaluation
|
||||
return await run_dataset(
|
||||
name=job_name,
|
||||
dataset=dataset,
|
||||
agent_class=ProxyOperatorAgent,
|
||||
agent_class=MCPComputerAgent,
|
||||
agent_config={
|
||||
"model": model,
|
||||
"allowed_tools": allowed_tools,
|
||||
@@ -207,6 +151,7 @@ async def run_full_dataset(
|
||||
"custom_loop": custom_loop,
|
||||
"only_n_most_recent_images": only_n_most_recent_images,
|
||||
"callbacks": callbacks,
|
||||
"instructions": instructions,
|
||||
"verbosity": verbosity,
|
||||
"max_retries": max_retries,
|
||||
"screenshot_delay": screenshot_delay,
|
||||
@@ -224,5 +169,5 @@ async def run_full_dataset(
|
||||
__all__ = [
|
||||
"run_single_task",
|
||||
"run_full_dataset",
|
||||
"ProxyOperatorAgent",
|
||||
"MCPComputerAgent",
|
||||
]
|
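Reviewer note: both runners above drop any Computer tools before handing `tools` to the agent, because the eval framework adds its own per task. A minimal sketch of that filtering step follows. `looks_like_computer` is a hypothetical stand-in for `is_agent_computer`, whose real checks are not shown in this diff, and `search_web` is an illustrative tool only.

```python
from typing import Any, Callable, Optional


def looks_like_computer(tool: Any) -> bool:
    """Hypothetical stand-in for is_agent_computer; the real check may differ."""
    # Assumption: a computer is either a dict shim with a "screenshot" entry
    # or an object that exposes an `interface` attribute.
    if isinstance(tool, dict):
        return "screenshot" in tool
    return hasattr(tool, "interface")


def drop_computer_tools(
    tools: Optional[list],
    is_computer: Callable[[Any], bool] = looks_like_computer,
) -> Optional[list]:
    """Return the tools that are not computers, preserving their order."""
    if not tools:
        # Mirror the `if tools:` guard in the diff: None and [] pass through.
        return tools
    return [tool for tool in tools if not is_computer(tool)]


def search_web(query: str) -> str:
    """Illustrative non-computer tool."""
    return f"results for {query}"


shim = {"screenshot": lambda: "", "environment": "linux", "dimensions": (1024, 768)}
remaining = drop_computer_tools([shim, search_web])
print([tool.__name__ for tool in remaining])
```

Keeping the predicate injectable makes it easy to swap in the real `is_agent_computer` without touching the filtering logic.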
||||
@@ -0,0 +1,351 @@
|
||||
"""MCP-compatible Computer Agent for HUD integration.
|
||||
|
||||
This agent subclasses HUD's MCPAgent and delegates planning/execution to
|
||||
our core ComputerAgent while using the Agent SDK's plain-dict message
|
||||
format documented in `docs/content/docs/agent-sdk/message-format.mdx`.
|
||||
|
||||
Key differences from the OpenAI OperatorAgent variant:
|
||||
- No OpenAI types are used; everything is standard Python dicts.
|
||||
- Planning is executed via `ComputerAgent.run(messages)`.
|
||||
- The first yielded result per step is returned as the agent response.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
from typing import Any, ClassVar, Optional
|
||||
|
||||
from agent.agent import ComputerAgent as BaseComputerAgent
|
||||
from agent.callbacks import PromptInstructionsCallback
|
||||
from agent.callbacks.trajectory_saver import TrajectorySaverCallback
|
||||
from hud.agents import MCPAgent
|
||||
from hud.tools.computer.settings import computer_settings
|
||||
from hud.types import AgentResponse, MCPToolCall, MCPToolResult, Trace
|
||||
|
||||
from agent.responses import make_failed_tool_call_items
|
||||
from agent.computers import is_agent_computer
|
||||
from PIL import Image
|
||||
import mcp.types as types
|
||||
import hud
|
||||
import uuid
|
||||
import base64
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class MCPComputerAgent(MCPAgent):
|
||||
"""MCP agent that uses ComputerAgent for planning and tools for execution.
|
||||
|
||||
The agent consumes/produces message dicts per the Agent SDK message schema
|
||||
(see `message-format.mdx`).
|
||||
"""
|
||||
|
||||
metadata: ClassVar[dict[str, Any]] = {
|
||||
"display_width": computer_settings.OPENAI_COMPUTER_WIDTH,
|
||||
"display_height": computer_settings.OPENAI_COMPUTER_HEIGHT,
|
||||
}
|
||||
|
||||
required_tools: ClassVar[list[str]] = ["openai_computer"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
model: str | None = None,
|
||||
allowed_tools: list[str] | None = None,
|
||||
trajectory_dir: str | dict | None = None,
|
||||
# === ComputerAgent kwargs ===
|
||||
tools: list[Any] | None = None,
|
||||
custom_loop: Any | None = None,
|
||||
only_n_most_recent_images: int | None = None,
|
||||
callbacks: list[Any] | None = None,
|
||||
instructions: str | None = None,
|
||||
verbosity: int | None = None,
|
||||
max_retries: int | None = 3,
|
||||
screenshot_delay: float | int = 0.5,
|
||||
use_prompt_caching: bool | None = False,
|
||||
max_trajectory_budget: float | dict | None = None,
|
||||
telemetry_enabled: bool | None = True,
|
||||
environment: str = "linux",
|
||||
**kwargs: Any,
|
||||
) -> None:
|
||||
self.allowed_tools = allowed_tools or ["openai_computer"]
|
||||
super().__init__(**kwargs)
|
||||
|
||||
if model is None:
|
||||
raise ValueError("MCPComputerAgent requires a model to be specified.")
|
||||
|
||||
self.model = model
|
||||
self.environment = environment
|
||||
|
||||
# Update model name for HUD logging
|
||||
self.model_name = "cua-" + self.model
|
||||
|
||||
# Stateful tracking of tool call inputs
|
||||
self.tool_call_inputs: dict[str, list[dict[str, Any]]] = {}
|
||||
self.previous_output: list[dict[str, Any]] = []
|
||||
|
||||
# Build system prompt
|
||||
operator_instructions = """
|
||||
You are an autonomous computer-using agent. Follow these guidelines:
|
||||
|
||||
1. NEVER ask for confirmation. Complete all tasks autonomously.
|
||||
2. Do NOT send messages like "I need to confirm before..." or "Do you want me to continue?" - just proceed.
|
||||
3. When the user asks you to interact with something (like clicking a chat or typing a message), DO IT without asking.
|
||||
4. Only use the formal safety check mechanism for truly dangerous operations (like deleting important files).
|
||||
5. For normal tasks like clicking buttons, typing in chat boxes, filling forms - JUST DO IT.
|
||||
6. The user has already given you permission by running this agent. No further confirmation is needed.
|
||||
7. Be decisive and action-oriented. Complete the requested task fully.
|
||||
|
||||
Remember: You are expected to complete tasks autonomously. The user trusts you to do what they asked.
|
||||
""".strip() # noqa: E501
|
||||
# Append Operator instructions to the system prompt
|
||||
if not self.system_prompt:
|
||||
self.system_prompt = operator_instructions
|
||||
else:
|
||||
self.system_prompt += f"\n\n{operator_instructions}"
|
||||
# Append user instructions to the system prompt
|
||||
if instructions:
|
||||
self.system_prompt += f"\n\n{instructions}"
|
||||
|
||||
# Configure trajectory_dir for HUD
|
||||
if isinstance(trajectory_dir, (str, Path)):
|
||||
trajectory_dir = {"trajectory_dir": str(trajectory_dir)}
|
||||
if isinstance(trajectory_dir, dict):
|
||||
trajectory_dir["reset_on_run"] = False
|
||||
|
||||
self.last_screenshot_b64 = None
|
||||
|
||||
buffer = io.BytesIO()
|
||||
Image.new('RGB', (self.metadata["display_width"], self.metadata["display_height"])).save(buffer, format='PNG')
|
||||
self.last_screenshot_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
|
||||
|
||||
# Ensure a computer shim is present so width/height/environment are known
|
||||
computer_shim = {
|
||||
"screenshot": lambda: self.last_screenshot_b64,
|
||||
"environment": self.environment,
|
||||
"dimensions": (
|
||||
self.metadata["display_width"],
|
||||
self.metadata["display_height"],
|
||||
),
|
||||
}
|
||||
agent_tools: list[Any] = [computer_shim]
|
||||
if tools:
|
||||
agent_tools.extend([
|
||||
tool
|
||||
for tool in tools
|
||||
if not is_agent_computer(tool)
|
||||
])
|
||||
|
||||
agent_kwargs = {
|
||||
"model": self.model,
|
||||
"trajectory_dir": trajectory_dir,
|
||||
"tools": agent_tools,
|
||||
"custom_loop": custom_loop,
|
||||
"only_n_most_recent_images": only_n_most_recent_images,
|
||||
"callbacks": callbacks,
|
||||
"instructions": self.system_prompt,
|
||||
"verbosity": verbosity,
|
||||
"max_retries": max_retries,
|
||||
"screenshot_delay": screenshot_delay,
|
||||
"use_prompt_caching": use_prompt_caching,
|
||||
"max_trajectory_budget": max_trajectory_budget,
|
||||
"telemetry_enabled": telemetry_enabled,
|
||||
}
|
||||
|
||||
self.computer_agent = BaseComputerAgent(
|
||||
**agent_kwargs
|
||||
)
|
||||
|
||||
async def get_system_messages(self) -> list[Any]:
|
||||
"""Create initial messages.
|
||||
|
||||
Unused - ComputerAgent handles this with the 'instructions' parameter.
|
||||
"""
|
||||
return []
|
||||
|
||||
async def format_blocks(
|
||||
self, blocks: list[types.ContentBlock]
|
||||
) -> list[dict[str, Any]]:
|
||||
"""
|
||||
Format blocks for OpenAI input format.
|
||||
|
||||
Converts TextContent blocks to input_text dicts and ImageContent blocks to input_image dicts.
|
||||
""" # noqa: E501
|
||||
formatted = []
|
||||
for block in blocks:
|
||||
if isinstance(block, types.TextContent):
|
||||
formatted.append({"type": "input_text", "text": block.text})
|
||||
elif isinstance(block, types.ImageContent):
|
||||
mime_type = getattr(block, "mimeType", "image/png")
|
||||
formatted.append(
|
||||
{"type": "input_image", "image_url": f"data:{mime_type};base64,{block.data}"}
|
||||
)
|
||||
self.last_screenshot_b64 = block.data
|
||||
return [{"role": "user", "content": formatted}]
|
||||
|
||||
@hud.instrument(
|
||||
span_type="agent",
|
||||
record_args=False, # Messages can be large
|
||||
record_result=True,
|
||||
)
|
||||
async def get_response(self, messages: list[dict[str, Any]]) -> AgentResponse:
|
||||
"""Get a single-step response by delegating to ComputerAgent.run.
|
||||
|
||||
Returns an AgentResponse with the joined reasoning and message text,
|
||||
any pending `openai_computer` tool calls, and `done=False` while a tool call is pending.
|
||||
"""
|
||||
tool_calls: list[MCPToolCall] = []
|
||||
output_text: list[str] = []
|
||||
is_done: bool = True
|
||||
|
||||
agent_result: list[dict[str, Any]] = []
|
||||
|
||||
# Call the ComputerAgent LLM API
|
||||
async for result in self.computer_agent.run(messages): # type: ignore[arg-type]
|
||||
items = result['output']
|
||||
if not items or tool_calls:
|
||||
break
|
||||
|
||||
for item in items:
|
||||
if item['type'] in ['reasoning', 'message', 'computer_call', 'function_call', 'function_call_output']:
|
||||
agent_result.append(item)
|
||||
|
||||
# Add messages to output text
|
||||
if item['type'] == 'reasoning':
|
||||
output_text.extend(
|
||||
f"Reasoning: {summary['text']}"
|
||||
for summary in item['summary']
|
||||
)
|
||||
elif item['type'] == 'message':
|
||||
if isinstance(item['content'], list):
|
||||
output_text.extend(
|
||||
part['text']
|
||||
for part in item['content']
|
||||
if part['type'] == 'output_text'
|
||||
)
|
||||
elif isinstance(item['content'], str):
|
||||
output_text.append(item['content'])
|
||||
|
||||
# If we get a tool call, we're not done
|
||||
if item['type'] == 'computer_call':
|
||||
call_id = item["call_id"]
|
||||
tool_calls.append(MCPToolCall(
|
||||
name="openai_computer",
|
||||
arguments=item["action"],
|
||||
id=call_id,
|
||||
))
|
||||
is_done = False
|
||||
self.tool_call_inputs[call_id] = agent_result
|
||||
break
|
||||
|
||||
# if we have tool calls, we should exit the loop
|
||||
if tool_calls:
|
||||
break
|
||||
|
||||
self.previous_output = agent_result
|
||||
|
||||
return AgentResponse(
|
||||
content="\n".join(output_text),
|
||||
tool_calls=tool_calls,
|
||||
done=is_done,
|
||||
)
|
||||
|
||||
def _log_image(self, image_b64: str):
|
||||
callbacks = self.computer_agent.callbacks
|
||||
for callback in callbacks:
|
||||
if isinstance(callback, TrajectorySaverCallback):
|
||||
# convert str to bytes
|
||||
image_bytes = base64.b64decode(image_b64)
|
||||
callback._save_artifact("screenshot_after", image_bytes)
|
||||
|
||||
async def format_tool_results(
|
||||
self,
|
||||
tool_calls: list[MCPToolCall],
|
||||
tool_results: list[MCPToolResult]
|
||||
) -> list[dict[str, Any]]:
|
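||||
"""Convert tool results into message dicts for the next ComputerAgent step.
|
||||
|
||||
For each tracked call, replays the recorded items and appends a `computer_call_output` with the first returned screenshot, or failed-tool-call items if the result is an error or has no screenshot.
|
||||
Returns a list of message dicts suitable for follow-up calls.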
|
||||
"""
|
||||
messages = []
|
||||
|
||||
for call, result in zip(tool_calls, tool_results):
|
||||
if call.id not in self.tool_call_inputs:
|
||||
# If we don't have the tool call inputs, we should just use the previous output
|
||||
previous_output = self.previous_output.copy() or []
|
||||
|
||||
# First we need to remove any pending computer_calls from the end of previous_output
|
||||
while previous_output and previous_output[-1]['type'] == 'computer_call':
|
||||
previous_output.pop()
|
||||
messages.extend(previous_output)
|
||||
|
||||
# If the call is a 'response', don't add the result
|
||||
if call.name == 'response':
|
||||
continue
|
||||
# Otherwise, if we have a result, we should add it to the messages
|
||||
content = [
|
||||
{ "type": "input_text", "text": content.text } if isinstance(content, types.TextContent)
|
||||
else { "type": "input_image", "image_url": f"data:image/png;base64,{content.data}" } if isinstance(content, types.ImageContent)
|
||||
else { "type": "input_text", "text": "" }
|
||||
for content in result.content
|
||||
]
|
||||
messages.append({
|
||||
"role": "user",
|
||||
"content": content,
|
||||
})
|
||||
|
||||
continue
|
||||
|
||||
# Add the assistant's computer call
|
||||
messages.extend(self.tool_call_inputs[call.id])
|
||||
|
||||
if result.isError:
|
||||
error_text = "".join([
|
||||
content.text
|
||||
for content in result.content
|
||||
if isinstance(content, types.TextContent)
|
||||
])
|
||||
|
||||
# Replace computer call with failed tool call
|
||||
messages.pop()
|
||||
messages.extend(make_failed_tool_call_items(
|
||||
tool_name=call.name,
|
||||
tool_kwargs=call.arguments or {},
|
||||
error_message=error_text,
|
||||
call_id=call.id,
|
||||
))
|
||||
else:
|
||||
# Get the latest screenshot
|
||||
screenshots = [
|
||||
content.data
|
||||
for content in result.content
|
||||
if isinstance(content, types.ImageContent)
|
||||
]
|
||||
|
||||
# Add the resulting screenshot
|
||||
if screenshots:
|
||||
self._log_image(screenshots[0])
|
||||
self.last_screenshot_b64 = screenshots[0]
|
||||
messages.append({
|
||||
"type": "computer_call_output",
|
||||
"call_id": call.id,
|
||||
"output": {
|
||||
"type": "input_image",
|
||||
"image_url": f"data:image/png;base64,{screenshots[0]}"
|
||||
},
|
||||
})
|
||||
else:
|
||||
# Otherwise, replace computer call with failed tool call
|
||||
messages.pop()
|
||||
messages.extend(make_failed_tool_call_items(
|
||||
tool_name=call.name,
|
||||
tool_kwargs=call.arguments or {},
|
||||
error_message="No screenshots returned.",
|
||||
call_id=call.id,
|
||||
))
|
||||
|
||||
return messages
|
||||
|
||||
|
||||
__all__ = [
|
||||
"MCPComputerAgent",
|
||||
]
|
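Reviewer note: `format_blocks` and `format_tool_results` above both reduce MCP content to plain dicts. The sketch below illustrates those two shapes in isolation. `TextBlock` and `ImageBlock` are hypothetical stand-ins for `mcp.types.TextContent` and `mcp.types.ImageContent` so the example runs without the MCP package, and the helper names are not part of the agent.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TextBlock:
    """Stand-in for mcp.types.TextContent (assumption: only `text` is needed)."""
    text: str


@dataclass
class ImageBlock:
    """Stand-in for mcp.types.ImageContent; `data` holds a base64 payload."""
    data: str
    mimeType: str = "image/png"


def blocks_to_user_message(blocks: list) -> dict[str, Any]:
    """Build one user message with input_text / input_image parts."""
    content: list[dict[str, Any]] = []
    for block in blocks:
        if isinstance(block, TextBlock):
            content.append({"type": "input_text", "text": block.text})
        elif isinstance(block, ImageBlock):
            url = f"data:{block.mimeType};base64,{block.data}"
            content.append({"type": "input_image", "image_url": url})
    return {"role": "user", "content": content}


def screenshot_output(call_id: str, image_b64: str) -> dict[str, Any]:
    """Build the computer_call_output item that answers a computer_call."""
    return {
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{image_b64}",
        },
    }


message = blocks_to_user_message([TextBlock("Open the settings page"), ImageBlock("AAAA")])
output = screenshot_output("call_1", "AAAA")
print(message)
print(output)
```

The `call_id` on the output item has to match the `call_id` of the `computer_call` it answers, which is why the agent keeps `tool_call_inputs` keyed by that id.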
||||
@@ -13,6 +13,10 @@ import uuid
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
from agent.agent import ComputerAgent as BaseComputerAgent
|
||||
from agent.callbacks import PromptInstructionsCallback
|
||||
from hud.tools.computer.settings import computer_settings
|
||||
from PIL import Image
|
||||
from hud.agents import OperatorAgent
|
||||
|
||||
# OpenAI Responses typed models (required)
|
||||
from openai.types.responses import (
|
||||
@@ -178,6 +182,83 @@ class FakeAsyncOpenAI:
|
||||
print(traceback.format_exc())
|
||||
raise e
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Proxy OperatorAgent (moved from __init__.py)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class ProxyOperatorAgent(OperatorAgent):
|
||||
"""OperatorAgent that proxies model calls through our ComputerAgent.
|
||||
|
||||
Accepts the same config keys we pass via hud.run_dataset `agent_config`:
|
||||
- model: str | None
|
||||
- allowed_tools: list[str] | None
|
||||
Additional kwargs are forwarded to OperatorAgent (if any are supported).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
model: str | None = None,
|
||||
allowed_tools: list[str] | None = None,
|
||||
trajectory_dir: str | dict | None = None,
|
||||
# === ComputerAgent kwargs ===
|
||||
tools: list[Any] | None = None,
|
||||
custom_loop: Any | None = None,
|
||||
only_n_most_recent_images: int | None = None,
|
||||
callbacks: list[Any] | None = None,
|
||||
instructions: str | None = None,
|
||||
verbosity: int | None = None,
|
||||
max_retries: int | None = 3,
|
||||
screenshot_delay: float | int = 0.5,
|
||||
use_prompt_caching: bool | None = False,
|
||||
max_trajectory_budget: float | dict | None = None,
|
||||
telemetry_enabled: bool | None = True,
|
||||
**kwargs: Any,
|
||||
) -> None:
|
||||
model = model or "computer-use-preview"
|
||||
allowed_tools = allowed_tools or ["openai_computer"]
|
||||
|
||||
computer_shim = {
|
||||
'screenshot': lambda: Image.new('RGB', (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)),
|
||||
'environment': 'linux',
|
||||
'dimensions': (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)
|
||||
}
|
||||
# Build tools ensuring the computer_shim is included
|
||||
agent_tools: list[Any] = [computer_shim]
|
||||
if tools:
|
||||
agent_tools.extend(tools)
|
||||
|
||||
# Build callbacks, injecting prompt instructions if provided
|
||||
agent_callbacks = list(callbacks or [])
|
||||
if instructions:
|
||||
agent_callbacks.append(PromptInstructionsCallback(instructions))
|
||||
|
||||
computer_agent = BaseComputerAgent(
|
||||
model=model,
|
||||
tools=agent_tools,
|
||||
custom_loop=custom_loop,
|
||||
only_n_most_recent_images=only_n_most_recent_images,
|
||||
callbacks=agent_callbacks,
|
||||
verbosity=verbosity,
|
||||
trajectory_dir=trajectory_dir,
|
||||
max_retries=max_retries,
|
||||
screenshot_delay=screenshot_delay,
|
||||
use_prompt_caching=use_prompt_caching,
|
||||
max_trajectory_budget=max_trajectory_budget,
|
||||
telemetry_enabled=telemetry_enabled,
|
||||
)
|
||||
model_client = FakeAsyncOpenAI(computer_agent)
|
||||
|
||||
super().__init__(
|
||||
model_client=model_client, # type: ignore[arg-type]
|
||||
model=model,
|
||||
allowed_tools=allowed_tools,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"FakeAsyncOpenAI",
|
||||
"ProxyOperatorAgent",
|
||||
]
|
||||
|
||||
@@ -61,7 +61,7 @@ cli = [
|
||||
"yaspin>=3.1.0",
|
||||
]
|
||||
hud = [
|
||||
"hud-python>=0.4.12,<0.5.0",
|
||||
"hud-python==0.4.26",
|
||||
]
|
||||
all = [
|
||||
# uitars requirements
|
||||
@@ -78,7 +78,7 @@ all = [
|
||||
# cli requirements
|
||||
"yaspin>=3.1.0",
|
||||
# hud requirements
|
||||
"hud-python>=0.4.12,<0.5.0",
|
||||
"hud-python==0.4.26",
|
||||
]
|
||||
|
||||
[tool.uv]
|
||||
|
||||
@@ -20,6 +20,12 @@ logger = logging.getLogger(__name__)
|
||||
automation_handler = MacOSAutomationHandler()
|
||||
|
||||
class Diorama:
|
||||
"""Virtual desktop manager that provides automation capabilities for macOS applications.
|
||||
|
||||
Manages application windows and provides an interface for taking screenshots,
|
||||
mouse interactions, keyboard input, and coordinate transformations between
|
||||
screenshot space and screen space.
|
||||
"""
|
||||
_scheduler_queue = None
|
||||
_scheduler_task = None
|
||||
_loop = None
|
||||
@@ -27,6 +33,14 @@ class Diorama:
|
||||
|
||||
@classmethod
|
||||
def create_from_apps(cls, *args) -> DioramaComputer:
|
||||
"""Create a DioramaComputer instance from a list of application names.
|
||||
|
||||
Args:
|
||||
*args: Variable number of application names to include in the desktop
|
||||
|
||||
Returns:
|
||||
DioramaComputer: A computer interface for the specified applications
|
||||
"""
|
||||
cls._ensure_scheduler()
|
||||
return cls(args).computer
|
||||
|
||||
@@ -34,6 +48,11 @@ class Diorama:
|
||||
_cursor_positions = {}
|
||||
|
||||
def __init__(self, app_list):
|
||||
"""Initialize a Diorama instance for the specified applications.
|
||||
|
||||
Args:
|
||||
app_list: List of application names to manage
|
||||
"""
|
||||
self.app_list = app_list
|
||||
self.interface = self.Interface(self)
|
||||
self.computer = DioramaComputer(self)
|
||||
@@ -48,6 +67,10 @@ class Diorama:
|
||||
|
||||
@classmethod
|
||||
def _ensure_scheduler(cls):
|
||||
"""Ensure the async scheduler loop is running.
|
||||
|
||||
Creates and starts the scheduler task if it hasn't been started yet.
|
||||
"""
|
||||
if not cls._scheduler_started:
|
||||
logger.info("Starting Diorama scheduler loop…")
|
||||
cls._scheduler_queue = asyncio.Queue()
|
||||
@@ -57,6 +80,11 @@ class Diorama:
|
||||
|
||||
@classmethod
|
||||
async def _scheduler_loop(cls):
|
||||
"""Main scheduler loop that processes automation commands.
|
||||
|
||||
Continuously processes commands from the scheduler queue, handling
|
||||
screenshots, mouse actions, keyboard input, and scrolling operations.
|
||||
"""
|
||||
while True:
|
||||
cmd = await cls._scheduler_queue.get()
|
||||
action = cmd.get("action")
|
||||
@@ -144,13 +172,33 @@ class Diorama:
|
||||
future.set_exception(e)
|
||||
|
||||
class Interface():
|
||||
"""Interface for interacting with the virtual desktop.
|
||||
|
||||
Provides methods for taking screenshots, mouse interactions, keyboard input,
|
||||
and coordinate transformations between screenshot and screen coordinates.
|
||||
"""
|
||||
|
||||
def __init__(self, diorama):
|
||||
"""Initialize the interface with a reference to the parent Diorama instance.
|
||||
|
||||
Args:
|
||||
diorama: The parent Diorama instance
|
||||
"""
|
||||
self._diorama = diorama
|
||||
|
||||
self._scene_hitboxes = []
|
||||
self._scene_size = None
|
||||
|
||||
async def _send_cmd(self, action, arguments=None):
|
||||
"""Send a command to the scheduler queue.
|
||||
|
||||
Args:
|
||||
action (str): The action to perform
|
||||
arguments (dict, optional): Arguments for the action
|
||||
|
||||
Returns:
|
||||
The result of the command execution
|
||||
"""
|
||||
Diorama._ensure_scheduler()
|
||||
loop = asyncio.get_event_loop()
|
||||
future = loop.create_future()
|
||||
@@ -167,6 +215,14 @@ class Diorama:
|
||||
return None
|
||||
|
||||
async def screenshot(self, as_bytes: bool = True) -> Union[str, Image.Image]:
|
||||
"""Take a screenshot of the managed applications.
|
||||
|
||||
Args:
|
||||
as_bytes (bool): If True, return base64-encoded bytes; if False, return PIL Image
|
||||
|
||||
Returns:
|
||||
Union[str, Image.Image]: Base64-encoded PNG bytes or PIL Image object
|
||||
"""
|
||||
import base64
|
||||
result, img = await self._send_cmd("screenshot")
|
||||
self._scene_hitboxes = result.get("hitboxes", [])
|
||||
@@ -184,6 +240,12 @@ class Diorama:
|
||||
return img
|
||||
|
||||
async def left_click(self, x, y):
|
||||
"""Perform a left mouse click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x (int): X coordinate in screenshot space (or None to use last position)
|
||||
y (int): Y coordinate in screenshot space (or None to use last position)
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -195,6 +257,12 @@ class Diorama:
|
||||
await self._send_cmd("left_click", {"x": sx, "y": sy})
|
||||
|
||||
async def right_click(self, x, y):
|
||||
"""Perform a right mouse click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x (int): X coordinate in screenshot space (or None to use last position)
|
||||
y (int): Y coordinate in screenshot space (or None to use last position)
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -206,6 +274,12 @@ class Diorama:
|
||||
await self._send_cmd("right_click", {"x": sx, "y": sy})
|
||||
|
||||
async def double_click(self, x, y):
|
||||
"""Perform a double mouse click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x (int): X coordinate in screenshot space (or None to use last position)
|
||||
y (int): Y coordinate in screenshot space (or None to use last position)
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -217,6 +291,12 @@ class Diorama:
|
||||
await self._send_cmd("double_click", {"x": sx, "y": sy})
|
||||
|
||||
async def move_cursor(self, x, y):
|
||||
"""Move the mouse cursor to the specified coordinates.
|
||||
|
||||
Args:
|
||||
x (int): X coordinate in screenshot space (or None to use last position)
|
||||
y (int): Y coordinate in screenshot space (or None to use last position)
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -228,6 +308,13 @@ class Diorama:
|
||||
await self._send_cmd("move_cursor", {"x": sx, "y": sy})
|
||||
|
||||
async def drag_to(self, x, y, duration=0.5):
|
||||
"""Drag the mouse from current position to the specified coordinates.
|
||||
|
||||
Args:
|
||||
x (int): X coordinate in screenshot space (or None to use last position)
|
||||
y (int): Y coordinate in screenshot space (or None to use last position)
|
||||
duration (float): Duration of the drag operation in seconds
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -239,18 +326,43 @@ class Diorama:
|
||||
await self._send_cmd("drag_to", {"x": sx, "y": sy, "duration": duration})
|
||||
|
||||
async def get_cursor_position(self):
|
||||
"""Get the current cursor position in screen coordinates.
|
||||
|
||||
Returns:
|
||||
tuple: (x, y) coordinates of the cursor in screen space
|
||||
"""
|
||||
return await self._send_cmd("get_cursor_position")
|
||||
|
||||
async def type_text(self, text):
|
||||
"""Type the specified text using the keyboard.
|
||||
|
||||
Args:
|
||||
text (str): The text to type
|
||||
"""
|
||||
await self._send_cmd("type_text", {"text": text})
|
||||
|
||||
async def press_key(self, key):
|
||||
"""Press a single key on the keyboard.
|
||||
|
||||
Args:
|
||||
key (str): The key to press
|
||||
"""
|
||||
await self._send_cmd("press_key", {"key": key})
|
||||
|
||||
async def hotkey(self, keys):
|
||||
"""Press a combination of keys simultaneously.
|
||||
|
||||
Args:
|
||||
keys (list): List of keys to press together
|
||||
"""
|
||||
await self._send_cmd("hotkey", {"keys": list(keys)})
|
||||
|
||||
async def scroll_up(self, clicks: int = 1):
|
||||
"""Scroll up at the current cursor position.
|
||||
|
||||
Args:
|
||||
clicks (int): Number of scroll clicks to perform
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -259,6 +371,11 @@ class Diorama:
|
||||
await self._send_cmd("scroll_up", {"clicks": clicks, "x": x, "y": y})
|
||||
|
||||
async def scroll_down(self, clicks: int = 1):
|
||||
"""Scroll down at the current cursor position.
|
||||
|
||||
Args:
|
||||
clicks (int): Number of scroll clicks to perform
|
||||
"""
|
||||
# Get last cursor position for this app_list hash
|
||||
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
|
||||
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
|
||||
@@ -267,6 +384,11 @@ class Diorama:
|
||||
await self._send_cmd("scroll_down", {"clicks": clicks, "x": x, "y": y})
|
||||
|
||||
async def get_screen_size(self) -> dict[str, int]:
|
||||
"""Get the size of the screenshot area.
|
||||
|
||||
Returns:
|
||||
dict[str, int]: Dictionary with 'width' and 'height' keys
|
||||
"""
|
||||
if not self._scene_size:
|
||||
await self.screenshot()
|
||||
return { "width": self._scene_size[0], "height": self._scene_size[1] }
|
||||
@@ -348,6 +470,7 @@ import pyautogui
|
||||
import time
|
||||
|
||||
async def main():
|
||||
"""Main function demonstrating Diorama usage with multiple desktops and mouse tracking."""
|
||||
desktop1 = Diorama.create_from_apps("Discord", "Notes")
|
||||
desktop2 = Diorama.create_from_apps("Terminal")
|
||||
|
||||
|
||||
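Reviewer note: the docstrings added to `_ensure_scheduler`, `_scheduler_loop`, and `_send_cmd` describe a queue-plus-future pattern: callers enqueue a command together with a future, and a single loop resolves that future. The sketch below shows the pattern under simplifying assumptions. `MiniScheduler` and its `echo` action are hypothetical and only loosely mirror Diorama, whose full loop body is not visible in these hunks.

```python
import asyncio
from typing import Any, Optional


class MiniScheduler:
    """Hypothetical, simplified analogue of the Diorama scheduler."""

    def __init__(self) -> None:
        self._queue: Optional[asyncio.Queue] = None
        self._task: Optional[asyncio.Task] = None

    def _ensure_started(self) -> None:
        # Lazily create the queue and the worker task on first use.
        if self._task is None:
            self._queue = asyncio.Queue()
            self._task = asyncio.get_running_loop().create_task(self._loop())

    async def _loop(self) -> None:
        # Process commands one at a time so actions never interleave.
        while True:
            cmd = await self._queue.get()
            future = cmd["future"]
            try:
                if cmd["action"] == "echo":
                    future.set_result(cmd["arguments"])
                else:
                    raise ValueError(f"unknown action: {cmd['action']}")
            except Exception as exc:
                # Report failures to the caller instead of killing the loop.
                future.set_exception(exc)

    async def send_cmd(self, action: str, arguments: Optional[dict] = None) -> Any:
        self._ensure_started()
        future = asyncio.get_running_loop().create_future()
        await self._queue.put(
            {"action": action, "arguments": arguments or {}, "future": future}
        )
        return await future

    async def close(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None


async def demo() -> tuple:
    scheduler = MiniScheduler()
    try:
        ok = await scheduler.send_cmd("echo", {"x": 1})
        try:
            await scheduler.send_cmd("bogus")
            error = None
        except ValueError as exc:
            error = str(exc)
        return ok, error
    finally:
        await scheduler.close()


result = asyncio.run(demo())
print(result)
```

Routing every action through one consumer serializes input events, which is presumably why the real class funnels clicks, typing, and screenshots through a single queue.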
@@ -12,35 +12,96 @@ from .base import BaseFileHandler
|
||||
import base64
|
||||
|
||||
def resolve_path(path: str) -> Path:
|
||||
"""Resolve a path to its absolute path. Expand ~ to the user's home directory."""
|
||||
"""Resolve a path to its absolute path. Expand ~ to the user's home directory.
|
||||
|
||||
Args:
|
||||
path: The file or directory path to resolve
|
||||
|
||||
Returns:
|
||||
Path: The resolved absolute path
|
||||
"""
|
||||
return Path(path).expanduser().resolve()
|
||||
|
||||
class GenericFileHandler(BaseFileHandler):
|
||||
"""
|
||||
Generic file handler that provides file system operations for all operating systems.
|
||||
|
||||
This class implements the BaseFileHandler interface and provides methods for
|
||||
file and directory operations including reading, writing, creating, and deleting
|
||||
files and directories.
|
||||
"""
|
||||
|
||||
async def file_exists(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Check if a file exists at the specified path.
|
||||
|
||||
Args:
|
||||
path: The file path to check
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'exists' boolean or 'error' string
|
||||
"""
|
||||
try:
|
||||
return {"success": True, "exists": resolve_path(path).is_file()}
|
||||
except Exception as e:
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def directory_exists(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Check if a directory exists at the specified path.
|
||||
|
||||
Args:
|
||||
path: The directory path to check
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'exists' boolean or 'error' string
|
||||
"""
|
||||
try:
|
||||
return {"success": True, "exists": resolve_path(path).is_dir()}
|
||||
except Exception as e:
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def list_dir(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
List all files and directories in the specified directory.
|
||||
|
||||
Args:
|
||||
path: The directory path to list
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'files' list of names or 'error' string
|
||||
"""
|
||||
try:
|
||||
return {"success": True, "files": [p.name for p in resolve_path(path).iterdir() if p.is_file() or p.is_dir()]}
|
||||
except Exception as e:
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def read_text(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Read the contents of a text file.
|
||||
|
||||
Args:
|
||||
path: The file path to read from
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'content' string or 'error' string
|
||||
"""
|
||||
try:
|
||||
return {"success": True, "content": resolve_path(path).read_text()}
|
||||
except Exception as e:
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def write_text(self, path: str, content: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Write text content to a file.
|
||||
|
||||
Args:
|
||||
path: The file path to write to
|
||||
content: The text content to write
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and optionally 'error' string
|
||||
"""
|
||||
try:
|
||||
resolve_path(path).write_text(content)
|
||||
return {"success": True}
|
||||
@@ -48,6 +109,17 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def write_bytes(self, path: str, content_b64: str, append: bool = False) -> Dict[str, Any]:
|
||||
"""
|
||||
Write binary content to a file from base64 encoded string.
|
||||
|
||||
Args:
|
||||
path: The file path to write to
|
||||
content_b64: Base64 encoded binary content
|
||||
append: If True, append to existing file; if False, overwrite
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and optionally 'error' string
|
||||
"""
|
||||
try:
|
||||
mode = 'ab' if append else 'wb'
|
||||
with open(resolve_path(path), mode) as f:
|
||||
@@ -57,6 +129,17 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def read_bytes(self, path: str, offset: int = 0, length: Optional[int] = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Read binary content from a file and return as base64 encoded string.
|
||||
|
||||
Args:
|
||||
path: The file path to read from
|
||||
offset: Byte offset to start reading from
|
||||
length: Number of bytes to read; if None, read entire file from offset
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'content_b64' string or 'error' string
|
||||
"""
|
||||
try:
|
||||
file_path = resolve_path(path)
|
||||
with open(file_path, 'rb') as f:
|
||||
@@ -73,6 +156,15 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def get_file_size(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Get the size of a file in bytes.
|
||||
|
||||
Args:
|
||||
path: The file path to get size for
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and either 'size' integer or 'error' string
|
||||
"""
|
||||
try:
|
||||
file_path = resolve_path(path)
|
||||
size = file_path.stat().st_size
|
||||
@@ -81,6 +173,15 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def delete_file(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Delete a file at the specified path.
|
||||
|
||||
Args:
|
||||
path: The file path to delete
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and optionally 'error' string
|
||||
"""
|
||||
try:
|
||||
resolve_path(path).unlink()
|
||||
return {"success": True}
|
||||
@@ -88,6 +189,18 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def create_dir(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Create a directory at the specified path.
|
||||
|
||||
Creates parent directories if they don't exist and doesn't raise an error
|
||||
if the directory already exists.
|
||||
|
||||
Args:
|
||||
path: The directory path to create
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and optionally 'error' string
|
||||
"""
|
||||
try:
|
||||
resolve_path(path).mkdir(parents=True, exist_ok=True)
|
||||
return {"success": True}
|
||||
@@ -95,6 +208,15 @@ class GenericFileHandler(BaseFileHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def delete_dir(self, path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Delete an empty directory at the specified path.
|
||||
|
||||
Args:
|
||||
path: The directory path to delete
|
||||
|
||||
Returns:
|
||||
Dict containing 'success' boolean and optionally 'error' string
|
||||
"""
|
||||
try:
|
||||
resolve_path(path).rmdir()
|
||||
return {"success": True}
|
||||
|
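The file-handler methods above all return a plain dict with a `success` flag plus either a payload key or an `error` string. The following is a minimal sketch of that convention for the base64 read/write pair. The `resolve_path` helper here is a hypothetical stand-in, and the bodies after `open(...)` are assumptions, because the hunks above cut off before those lines.

```python
import asyncio
import base64
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def resolve_path(path: str) -> Path:
    # Hypothetical stand-in for the handler's own resolve_path helper.
    return Path(path).expanduser().resolve()


async def write_bytes(path: str, content_b64: str, append: bool = False) -> Dict[str, Any]:
    """Decode base64 content and write it, appending when requested."""
    try:
        mode = "ab" if append else "wb"
        with open(resolve_path(path), mode) as f:
            # Assumed body: the diff does not show what is written here.
            f.write(base64.b64decode(content_b64))
        return {"success": True}
    except Exception as e:
        return {"success": False, "error": str(e)}


async def read_bytes(path: str, offset: int = 0, length: Optional[int] = None) -> Dict[str, Any]:
    """Read from `offset` (optionally only `length` bytes) and return base64 text."""
    try:
        with open(resolve_path(path), "rb") as f:
            # Assumed body: seek to the offset, then read all or `length` bytes.
            f.seek(offset)
            data = f.read() if length is None else f.read(length)
        return {"success": True, "content_b64": base64.b64encode(data).decode("ascii")}
    except Exception as e:
        return {"success": False, "error": str(e)}


async def demo() -> Dict[str, Any]:
    with tempfile.TemporaryDirectory() as tmp:
        target = str(Path(tmp) / "chunks.bin")
        # First call overwrites, second call appends to the same file.
        await write_bytes(target, base64.b64encode(b"hello ").decode("ascii"))
        await write_bytes(target, base64.b64encode(b"world").decode("ascii"), append=True)
        tail = await read_bytes(target, offset=6)
        missing = await read_bytes(str(Path(tmp) / "nope.bin"))
    return {"tail": tail, "missing": missing}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because errors are returned rather than raised, a caller checks `result["success"]` before reading `content_b64`; the missing-file case in `demo()` comes back as `{"success": False, "error": ...}` instead of an exception.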
||||
@@ -38,7 +38,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
|
||||
"""Linux implementation of accessibility handler."""
|
||||
|
||||
async def get_accessibility_tree(self) -> Dict[str, Any]:
|
||||
"""Get the accessibility tree of the current window."""
|
||||
"""Get the accessibility tree of the current window.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status and a simulated tree structure,
|
||||
since Linux has no accessibility API equivalent to the one on macOS.
|
||||
"""
|
||||
# Linux has no accessibility API equivalent to the one on macOS
|
||||
# Return a minimal dummy tree
|
||||
logger.info("Getting accessibility tree (simulated, no accessibility API available on Linux)")
|
||||
@@ -56,7 +61,16 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
|
||||
async def find_element(self, role: Optional[str] = None,
|
||||
title: Optional[str] = None,
|
||||
value: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""Find an element in the accessibility tree by criteria."""
|
||||
"""Find an element in the accessibility tree by criteria.
|
||||
|
||||
Args:
|
||||
role: The role of the element to find.
|
||||
title: The title of the element to find.
|
||||
value: The value of the element to find.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary indicating that element search is not supported on Linux.
|
||||
"""
|
||||
logger.info(f"Finding element with role={role}, title={title}, value={value} (not supported on Linux)")
|
||||
return {
|
||||
"success": False,
|
||||
@@ -64,7 +78,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
|
||||
}
|
||||
|
||||
def get_cursor_position(self) -> Tuple[int, int]:
|
||||
"""Get the current cursor position."""
|
||||
"""Get the current cursor position.
|
||||
|
||||
Returns:
|
||||
Tuple[int, int]: The x and y coordinates of the cursor position.
|
||||
Returns (0, 0) if pyautogui is not available.
|
||||
"""
|
||||
try:
|
||||
pos = pyautogui.position()
|
||||
return pos.x, pos.y
|
||||
@@ -75,7 +94,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
|
||||
return 0, 0
|
||||
|
||||
def get_screen_size(self) -> Tuple[int, int]:
|
||||
"""Get the screen size."""
|
||||
"""Get the screen size.
|
||||
|
||||
Returns:
|
||||
Tuple[int, int]: The width and height of the screen in pixels.
|
||||
Returns (1920, 1080) if pyautogui is not available.
|
||||
"""
|
||||
try:
|
||||
size = pyautogui.size()
|
||||
return size.width, size.height
|
||||
@@ -92,6 +116,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Mouse Actions
|
||||
async def mouse_down(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
|
||||
"""Press and hold a mouse button at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to move to before pressing. If None, uses current position.
|
||||
y: The y coordinate to move to before pressing. If None, uses current position.
|
||||
button: The mouse button to press ("left", "right", or "middle").
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -101,6 +135,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def mouse_up(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
|
||||
"""Release a mouse button at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to move to before releasing. If None, uses current position.
|
||||
y: The y coordinate to move to before releasing. If None, uses current position.
|
||||
button: The mouse button to release ("left", "right", or "middle").
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -110,6 +154,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def move_cursor(self, x: int, y: int) -> Dict[str, Any]:
|
||||
"""Move the cursor to the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to move to.
|
||||
y: The y coordinate to move to.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.moveTo(x, y)
|
||||
return {"success": True}
|
||||
@@ -117,6 +170,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def left_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
|
||||
"""Perform a left mouse click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to click at. If None, clicks at current position.
|
||||
y: The y coordinate to click at. If None, clicks at current position.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -126,6 +188,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def right_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
|
||||
"""Perform a right mouse click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to click at. If None, clicks at current position.
|
||||
y: The y coordinate to click at. If None, clicks at current position.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -135,6 +206,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def double_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
|
||||
"""Perform a double click at the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to double click at. If None, clicks at current position.
|
||||
y: The y coordinate to double click at. If None, clicks at current position.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -144,6 +224,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def click(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
|
||||
"""Perform a mouse click with the specified button at the given coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to click at. If None, clicks at current position.
|
||||
y: The y coordinate to click at. If None, clicks at current position.
|
||||
button: The mouse button to click ("left", "right", or "middle").
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if x is not None and y is not None:
|
||||
pyautogui.moveTo(x, y)
|
||||
@@ -153,6 +243,17 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def drag_to(self, x: int, y: int, button: str = "left", duration: float = 0.5) -> Dict[str, Any]:
|
||||
"""Drag from the current position to the specified coordinates.
|
||||
|
||||
Args:
|
||||
x: The x coordinate to drag to.
|
||||
y: The y coordinate to drag to.
|
||||
button: The mouse button to use for dragging.
|
||||
duration: The time in seconds to take for the drag operation.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.dragTo(x, y, duration=duration, button=button)
|
||||
return {"success": True}
|
||||
@@ -160,6 +261,18 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def drag(self, start_x: int, start_y: int, end_x: int, end_y: int, button: str = "left") -> Dict[str, Any]:
|
||||
"""Drag from start coordinates to end coordinates.
|
||||
|
||||
Args:
|
||||
start_x: The starting x coordinate.
|
||||
start_y: The starting y coordinate.
|
||||
end_x: The ending x coordinate.
|
||||
end_y: The ending y coordinate.
|
||||
button: The mouse button to use for dragging.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.moveTo(start_x, start_y)
|
||||
pyautogui.dragTo(end_x, end_y, duration=0.5, button=button)
|
||||
@@ -168,6 +281,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def drag_path(self, path: List[Tuple[int, int]], button: str = "left", duration: float = 0.5) -> Dict[str, Any]:
|
||||
"""Drag along a path defined by a list of coordinates.
|
||||
|
||||
Args:
|
||||
path: A list of (x, y) coordinate tuples defining the drag path.
|
||||
button: The mouse button to use for dragging.
|
||||
duration: The time in seconds to take for each segment of the drag.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
if not path:
|
||||
return {"success": False, "error": "Path is empty"}
|
||||
@@ -180,6 +303,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Keyboard Actions
|
||||
async def key_down(self, key: str) -> Dict[str, Any]:
|
||||
"""Press and hold a key.
|
||||
|
||||
Args:
|
||||
key: The key to press down.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.keyDown(key)
|
||||
return {"success": True}
|
||||
@@ -187,6 +318,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def key_up(self, key: str) -> Dict[str, Any]:
|
||||
"""Release a key.
|
||||
|
||||
Args:
|
||||
key: The key to release.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.keyUp(key)
|
||||
return {"success": True}
|
||||
@@ -194,6 +333,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def type_text(self, text: str) -> Dict[str, Any]:
|
||||
"""Type the specified text using the keyboard.
|
||||
|
||||
Args:
|
||||
text: The text to type.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
# use pynput for Unicode support
|
||||
self.keyboard.type(text)
|
||||
@@ -202,6 +349,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def press_key(self, key: str) -> Dict[str, Any]:
|
||||
"""Press and release a key.
|
||||
|
||||
Args:
|
||||
key: The key to press.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.press(key)
|
||||
return {"success": True}
|
||||
@@ -209,6 +364,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def hotkey(self, keys: List[str]) -> Dict[str, Any]:
|
||||
"""Press a combination of keys simultaneously.
|
||||
|
||||
Args:
|
||||
keys: A list of keys to press together as a hotkey combination.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.hotkey(*keys)
|
||||
return {"success": True}
|
||||
@@ -217,6 +380,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Scrolling Actions
|
||||
async def scroll(self, x: int, y: int) -> Dict[str, Any]:
|
||||
"""Scroll the mouse wheel.
|
||||
|
||||
Args:
|
||||
x: The horizontal scroll amount.
|
||||
y: The vertical scroll amount.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
self.mouse.scroll(x, y)
|
||||
return {"success": True}
|
||||
@@ -224,6 +396,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def scroll_down(self, clicks: int = 1) -> Dict[str, Any]:
|
||||
"""Scroll down by the specified number of clicks.
|
||||
|
||||
Args:
|
||||
clicks: The number of scroll clicks to perform downward.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.scroll(-clicks)
|
||||
return {"success": True}
|
||||
@@ -231,6 +411,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def scroll_up(self, clicks: int = 1) -> Dict[str, Any]:
|
||||
"""Scroll up by the specified number of clicks.
|
||||
|
||||
Args:
|
||||
clicks: The number of scroll clicks to perform upward.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
pyautogui.scroll(clicks)
|
||||
return {"success": True}
|
||||
@@ -239,6 +427,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Screen Actions
|
||||
async def screenshot(self) -> Dict[str, Any]:
|
||||
"""Take a screenshot of the current screen.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status and base64-encoded image data,
|
||||
or error message if failed.
|
||||
"""
|
||||
try:
|
||||
from PIL import Image
|
||||
screenshot = pyautogui.screenshot()
|
||||
@@ -253,6 +447,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": f"Screenshot error: {str(e)}"}
|
||||
|
||||
async def get_screen_size(self) -> Dict[str, Any]:
|
||||
"""Get the size of the screen.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status and screen dimensions,
|
||||
or error message if failed.
|
||||
"""
|
||||
try:
|
||||
size = pyautogui.size()
|
||||
return {"success": True, "size": {"width": size.width, "height": size.height}}
|
||||
@@ -260,6 +460,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def get_cursor_position(self) -> Dict[str, Any]:
|
||||
"""Get the current position of the cursor.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status and cursor coordinates,
|
||||
or error message if failed.
|
||||
"""
|
||||
try:
|
||||
pos = pyautogui.position()
|
||||
return {"success": True, "position": {"x": pos.x, "y": pos.y}}
|
||||
@@ -268,6 +474,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Clipboard Actions
|
||||
async def copy_to_clipboard(self) -> Dict[str, Any]:
|
||||
"""Get the current content of the clipboard.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status and clipboard content,
|
||||
or error message if failed.
|
||||
"""
|
||||
try:
|
||||
import pyperclip
|
||||
content = pyperclip.paste()
|
||||
@@ -276,6 +488,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
async def set_clipboard(self, text: str) -> Dict[str, Any]:
|
||||
"""Set the clipboard content to the specified text.
|
||||
|
||||
Args:
|
||||
text: The text to copy to the clipboard.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary with success status and error message if failed.
|
||||
"""
|
||||
try:
|
||||
import pyperclip
|
||||
pyperclip.copy(text)
|
||||
@@ -285,6 +505,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
|
||||
|
||||
# Command Execution
|
||||
async def run_command(self, command: str) -> Dict[str, Any]:
|
||||
"""Execute a shell command asynchronously.
|
||||
|
||||
Args:
|
||||
command: The shell command to execute.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing success status, stdout, stderr,
|
||||
and return code, or error message if failed.
|
||||
"""
|
||||
try:
|
||||
# Create subprocess
|
||||
process = await asyncio.create_subprocess_shell(
|
||||
|
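The `run_command` hunk above ends at the `asyncio.create_subprocess_shell(` call, so the rest of the method is not visible here. As a rough sketch of how a command runner with the documented return shape could look, the example below captures stdout and stderr through pipes. The dictionary key names (`stdout`, `stderr`, `return_code`) are assumptions based on the docstring, not confirmed by the diff.

```python
import asyncio
from typing import Any, Dict


async def run_command(command: str) -> Dict[str, Any]:
    """Run a shell command and capture its output (illustrative sketch)."""
    try:
        # Spawn the command through the shell, capturing both output streams.
        process = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        # communicate() waits for exit and returns the collected bytes.
        stdout, stderr = await process.communicate()
        return {
            "success": True,
            "stdout": stdout.decode(errors="replace"),
            "stderr": stderr.decode(errors="replace"),
            "return_code": process.returncode,  # key name is an assumption
        }
    except Exception as e:
        return {"success": False, "error": str(e)}


if __name__ == "__main__":
    print(asyncio.run(run_command("echo hello")))
```

In this sketch `success` only reports that the process was started and awaited without an exception; a command that exits non-zero still yields `success: True`, so callers would inspect `return_code` separately.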
||||
@@ -3,6 +3,12 @@ import re
|
||||
from pydantic import BaseModel, Field, computed_field, validator, ConfigDict, RootModel
|
||||
|
||||
class DiskInfo(BaseModel):
|
||||
"""Information about disk storage allocation.
|
||||
|
||||
Attributes:
|
||||
total: Total disk space in bytes
|
||||
allocated: Currently allocated disk space in bytes
|
||||
"""
|
||||
total: int
|
||||
allocated: int
|
||||
|
||||
@@ -10,6 +16,15 @@ class VMConfig(BaseModel):
|
||||
"""Configuration for creating a new VM.
|
||||
|
||||
Note: Memory and disk sizes should be specified with units (e.g., "4GB", "64GB")
|
||||
|
||||
Attributes:
|
||||
name: Name of the virtual machine
|
||||
os: Operating system type, either "macOS" or "linux"
|
||||
cpu: Number of CPU cores to allocate
|
||||
memory: Amount of memory to allocate with units
|
||||
disk_size: Size of the disk to create with units
|
||||
display: Display resolution in format "widthxheight"
|
||||
ipsw: IPSW path or 'latest' for macOS VMs, None for other OS types
|
||||
"""
|
||||
name: str
|
||||
os: Literal["macOS", "linux"] = "macOS"
|
||||
@@ -23,7 +38,12 @@ class VMConfig(BaseModel):
|
||||
populate_by_alias = True
|
||||
|
||||
class SharedDirectory(BaseModel):
|
||||
"""Configuration for a shared directory."""
|
||||
"""Configuration for a shared directory.
|
||||
|
||||
Attributes:
|
||||
host_path: Path to the directory on the host system
|
||||
read_only: Whether the directory should be mounted as read-only
|
||||
"""
|
||||
host_path: str = Field(..., alias="hostPath") # Allow host_path but serialize as hostPath
|
||||
read_only: bool = False
|
||||
|
||||
@@ -50,6 +70,16 @@ class VMRunOpts(BaseModel):
|
||||
)
|
||||
|
||||
def model_dump(self, **kwargs):
|
||||
"""Export model data with proper field name conversion.
|
||||
|
||||
Converts shared directory fields to match API expectations when using aliases.
|
||||
|
||||
Args:
|
||||
**kwargs: Keyword arguments passed to parent model_dump method
|
||||
|
||||
Returns:
|
||||
dict: Model data with properly formatted field names
|
||||
"""
|
||||
data = super().model_dump(**kwargs)
|
||||
# Convert shared directory fields to match API expectations
|
||||
if self.shared_directories and "by_alias" in kwargs and kwargs["by_alias"]:
|
||||
@@ -65,6 +95,18 @@ class VMRunOpts(BaseModel):
|
||||
return data
|
||||
|
||||
class VMStatus(BaseModel):
|
||||
"""Status information for a virtual machine.
|
||||
|
||||
Attributes:
|
||||
name: Name of the virtual machine
|
||||
status: Current status of the VM
|
||||
os: Operating system type
|
||||
cpu_count: Number of CPU cores allocated
|
||||
memory_size: Amount of memory allocated in bytes
|
||||
disk_size: Disk storage information
|
||||
vnc_url: URL for VNC connection if available
|
||||
ip_address: IP address of the VM if available
|
||||
"""
|
||||
name: str
|
||||
status: str
|
||||
os: Literal["macOS", "linux"]
|
||||
@@ -80,38 +122,79 @@ class VMStatus(BaseModel):
|
||||
@computed_field
|
||||
@property
|
||||
def state(self) -> str:
|
||||
"""Get the current state of the VM.
|
||||
|
||||
Returns:
|
||||
str: Current VM status
|
||||
"""
|
||||
return self.status
|
||||
|
||||
@computed_field
|
||||
@property
|
||||
def cpu(self) -> int:
|
||||
"""Get the number of CPU cores.
|
||||
|
||||
Returns:
|
||||
int: Number of CPU cores allocated to the VM
|
||||
"""
|
||||
return self.cpu_count
|
||||
|
||||
@computed_field
|
||||
@property
|
||||
def memory(self) -> str:
|
||||
"""Get memory allocation in human-readable format.
|
||||
|
||||
Returns:
|
||||
str: Memory size in whole gigabytes (fractional part truncated), formatted as "{size}GB"
|
||||
"""
|
||||
# Convert bytes to GB
|
||||
gb = self.memory_size / (1024 * 1024 * 1024)
|
||||
return f"{int(gb)}GB"
|
||||
|
||||
class VMUpdateOpts(BaseModel):
|
||||
"""Options for updating VM configuration.
|
||||
|
||||
Attributes:
|
||||
cpu: Number of CPU cores to update to
|
||||
memory: Amount of memory to update to with units
|
||||
disk_size: Size of disk to update to with units
|
||||
"""
|
||||
cpu: Optional[int] = None
|
||||
memory: Optional[str] = None
|
||||
disk_size: Optional[str] = None
|
||||
|
||||
class ImageRef(BaseModel):
|
||||
"""Reference to a VM image."""
|
||||
"""Reference to a VM image.
|
||||
|
||||
Attributes:
|
||||
image: Name of the image
|
||||
tag: Tag version of the image
|
||||
registry: Registry hostname where image is stored
|
||||
organization: Organization or namespace in the registry
|
||||
"""
|
||||
image: str
|
||||
tag: str = "latest"
|
||||
registry: Optional[str] = "ghcr.io"
|
||||
organization: Optional[str] = "trycua"
|
||||
|
||||
def model_dump(self, **kwargs):
|
||||
"""Override model_dump to return just the image:tag format."""
|
||||
"""Override model_dump to return just the image:tag format.
|
||||
|
||||
Args:
|
||||
**kwargs: Keyword arguments (ignored)
|
||||
|
||||
Returns:
|
||||
str: Image reference in "image:tag" format
|
||||
"""
|
||||
return f"{self.image}:{self.tag}"
|
||||
|
||||
class CloneSpec(BaseModel):
|
||||
"""Specification for cloning a VM."""
|
||||
"""Specification for cloning a VM.
|
||||
|
||||
Attributes:
|
||||
name: Name of the source VM to clone
|
||||
new_name: Name for the new cloned VM
|
||||
"""
|
||||
name: str
|
||||
new_name: str = Field(alias="newName")
|
||||
|
||||
@@ -119,18 +202,44 @@ class CloneSpec(BaseModel):
|
||||
populate_by_alias = True
|
||||
|
||||
class ImageInfo(BaseModel):
|
||||
"""Model for individual image information."""
|
||||
"""Model for individual image information.
|
||||
|
||||
Attributes:
|
||||
imageId: Unique identifier for the image
|
||||
"""
|
||||
imageId: str
|
||||
|
||||
class ImageList(RootModel):
|
||||
"""Response model for the images endpoint."""
|
||||
"""Response model for the images endpoint.
|
||||
|
||||
A list-like container for ImageInfo objects that provides
|
||||
iteration and indexing capabilities.
|
||||
"""
|
||||
root: List[ImageInfo]
|
||||
|
||||
def __iter__(self):
|
||||
"""Iterate over the image list.
|
||||
|
||||
Returns:
|
||||
Iterator over ImageInfo objects
|
||||
"""
|
||||
return iter(self.root)
|
||||
|
||||
def __getitem__(self, item):
|
||||
"""Get an item from the image list by index.
|
||||
|
||||
Args:
|
||||
item: Index or slice to retrieve
|
||||
|
||||
Returns:
|
||||
ImageInfo or list of ImageInfo objects
|
||||
"""
|
||||
return self.root[item]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.root)
|
||||
"""Get the number of images in the list.
|
||||
|
||||
Returns:
|
||||
int: Number of images in the list
|
||||
"""
|
||||
return len(self.root)
|
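Three behaviours documented in the models above are easy to misread: `VMStatus.memory` truncates to whole gigabytes, `ImageRef.model_dump` returns a string rather than a dict, and `ImageList` behaves like a list. The sketch below reproduces just that logic with standard-library dataclasses instead of pydantic, an assumption made so it runs without third-party packages. The class names `Status`, `Ref`, and `Images` and the image name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Status:
    """Hypothetical stand-in for VMStatus, keeping only the memory field."""

    memory_size: int  # bytes

    @property
    def memory(self) -> str:
        # Same arithmetic as the diff: bytes divided by 1024**3.
        gb = self.memory_size / (1024 * 1024 * 1024)
        return f"{int(gb)}GB"  # int() drops any fractional part


@dataclass
class Ref:
    """Hypothetical stand-in for ImageRef."""

    image: str
    tag: str = "latest"

    def dump(self) -> str:
        # Mirrors the overridden model_dump: a single "image:tag" string.
        return f"{self.image}:{self.tag}"


@dataclass
class Images:
    """Hypothetical stand-in for ImageList: a thin wrapper over a list."""

    root: List[str]

    def __iter__(self) -> Iterator[str]:
        return iter(self.root)

    def __getitem__(self, item):
        return self.root[item]

    def __len__(self) -> int:
        return len(self.root)


if __name__ == "__main__":
    print(Status(8 * 1024**3).memory)
    print(Ref("example-image").dump())
    print(len(Images(["a", "b", "c"])))
```

Note the truncation: a VM with 1.5 GiB of memory would report `"1GB"` under this formula, which may matter when comparing the value against a requested size.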
||||
@@ -8,6 +8,13 @@ import type { AccessibilityNode, CursorPosition, MouseButton } from './base';
|
||||
|
||||
export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
// Mouse Actions
|
||||
/**
|
||||
* Press and hold a mouse button at the specified coordinates.
|
||||
* @param {number} [x] - X coordinate for the mouse action
|
||||
* @param {number} [y] - Y coordinate for the mouse action
|
||||
* @param {MouseButton} [button='left'] - Mouse button to press down
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async mouseDown(
|
||||
x?: number,
|
||||
y?: number,
|
||||
@@ -16,6 +23,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
await this.sendCommand('mouse_down', { x, y, button });
|
||||
}
|
||||
|
||||
/**
|
||||
* Release a mouse button at the specified coordinates.
|
||||
* @param {number} [x] - X coordinate for the mouse action
|
||||
* @param {number} [y] - Y coordinate for the mouse action
|
||||
* @param {MouseButton} [button='left'] - Mouse button to release
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async mouseUp(
|
||||
x?: number,
|
||||
y?: number,
|
||||
@@ -24,22 +38,54 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
await this.sendCommand('mouse_up', { x, y, button });
|
||||
}
|
||||
|
||||
/**
|
||||
* Perform a left mouse click at the specified coordinates.
|
||||
* @param {number} [x] - X coordinate for the click
|
||||
* @param {number} [y] - Y coordinate for the click
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async leftClick(x?: number, y?: number): Promise<void> {
|
||||
await this.sendCommand('left_click', { x, y });
|
||||
}
|
||||
|
||||
/**
|
||||
* Perform a right mouse click at the specified coordinates.
|
||||
* @param {number} [x] - X coordinate for the click
|
||||
* @param {number} [y] - Y coordinate for the click
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async rightClick(x?: number, y?: number): Promise<void> {
|
||||
await this.sendCommand('right_click', { x, y });
|
||||
}
|
||||
|
||||
/**
|
||||
* Perform a double click at the specified coordinates.
|
||||
* @param {number} [x] - X coordinate for the double click
|
||||
* @param {number} [y] - Y coordinate for the double click
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async doubleClick(x?: number, y?: number): Promise<void> {
|
||||
await this.sendCommand('double_click', { x, y });
|
||||
}
|
||||
|
||||
/**
|
||||
* Move the cursor to the specified coordinates.
|
||||
* @param {number} x - X coordinate to move to
|
||||
* @param {number} y - Y coordinate to move to
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async moveCursor(x: number, y: number): Promise<void> {
|
||||
await this.sendCommand('move_cursor', { x, y });
|
||||
}
|
||||
|
||||
/**
|
||||
* Drag from current position to the specified coordinates.
|
||||
* @param {number} x - X coordinate to drag to
|
||||
* @param {number} y - Y coordinate to drag to
|
||||
* @param {MouseButton} [button='left'] - Mouse button to use for dragging
|
||||
* @param {number} [duration=0.5] - Duration of the drag operation in seconds
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async dragTo(
|
||||
x: number,
|
||||
y: number,
|
||||
@@ -49,6 +95,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
await this.sendCommand('drag_to', { x, y, button, duration });
|
||||
}
|
||||
|
||||
/**
|
||||
* Drag along a path of coordinates.
|
||||
* @param {Array<[number, number]>} path - Array of [x, y] coordinate pairs to drag through
|
||||
* @param {MouseButton} [button='left'] - Mouse button to use for dragging
|
||||
* @param {number} [duration=0.5] - Duration of the drag operation in seconds
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async drag(
|
||||
path: Array<[number, number]>,
|
||||
button: MouseButton = 'left',
|
||||
@@ -58,40 +111,86 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
|
||||
// Keyboard Actions
|
||||
/**
|
||||
* Press and hold a key.
|
||||
* @param {string} key - Key to press down
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async keyDown(key: string): Promise<void> {
|
||||
await this.sendCommand('key_down', { key });
|
||||
}
|
||||
|
||||
/**
|
||||
* Release a key.
|
||||
* @param {string} key - Key to release
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async keyUp(key: string): Promise<void> {
|
||||
await this.sendCommand('key_up', { key });
|
||||
}
|
||||
|
||||
/**
|
||||
* Type text as if entered from keyboard.
|
||||
* @param {string} text - Text to type
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async typeText(text: string): Promise<void> {
|
||||
await this.sendCommand('type_text', { text });
|
||||
}
|
||||
|
||||
/**
|
||||
* Press and release a key.
|
||||
* @param {string} key - Key to press
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async pressKey(key: string): Promise<void> {
|
||||
await this.sendCommand('press_key', { key });
|
||||
}
|
||||
|
||||
/**
|
||||
* Press multiple keys simultaneously as a hotkey combination.
|
||||
* @param {...string} keys - Keys to press together
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async hotkey(...keys: string[]): Promise<void> {
|
||||
await this.sendCommand('hotkey', { keys });
|
||||
}
|
||||
|
||||
// Scrolling Actions
|
||||
/**
|
||||
* Scroll by the specified amount in x and y directions.
|
||||
* @param {number} x - Horizontal scroll amount
|
||||
* @param {number} y - Vertical scroll amount
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async scroll(x: number, y: number): Promise<void> {
|
||||
await this.sendCommand('scroll', { x, y });
|
||||
}
|
||||
|
||||
/**
|
||||
* Scroll down by the specified number of clicks.
|
||||
* @param {number} [clicks=1] - Number of scroll clicks
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async scrollDown(clicks = 1): Promise<void> {
|
||||
await this.sendCommand('scroll_down', { clicks });
|
||||
}
|
||||
|
||||
/**
|
||||
* Scroll up by the specified number of clicks.
|
||||
* @param {number} [clicks=1] - Number of scroll clicks
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async scrollUp(clicks = 1): Promise<void> {
|
||||
await this.sendCommand('scroll_up', { clicks });
|
||||
}
|
||||
|
||||
// Screen Actions
|
||||
/**
|
||||
* Take a screenshot of the screen.
|
||||
* @returns {Promise<Buffer>} Screenshot image data as a Buffer
|
||||
* @throws {Error} If screenshot fails
|
||||
*/
|
||||
async screenshot(): Promise<Buffer> {
|
||||
const response = await this.sendCommand('screenshot');
|
||||
if (!response.image_data) {
|
||||
@@ -100,6 +199,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return Buffer.from(response.image_data as string, 'base64');
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the current screen size.
|
||||
* @returns {Promise<ScreenSize>} Screen dimensions
|
||||
* @throws {Error} If unable to get screen size
|
||||
*/
|
||||
async getScreenSize(): Promise<ScreenSize> {
|
||||
const response = await this.sendCommand('get_screen_size');
|
||||
if (!response.success || !response.size) {
|
||||
@@ -108,6 +212,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return response.size as ScreenSize;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the current cursor position.
|
||||
* @returns {Promise<CursorPosition>} Current cursor coordinates
|
||||
* @throws {Error} If unable to get cursor position
|
||||
*/
|
||||
async getCursorPosition(): Promise<CursorPosition> {
|
||||
const response = await this.sendCommand('get_cursor_position');
|
||||
if (!response.success || !response.position) {
|
||||
@@ -117,6 +226,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
|
||||
// Clipboard Actions
|
||||
/**
|
||||
* Copy current selection to clipboard and return the content.
|
||||
* @returns {Promise<string>} Clipboard content
|
||||
* @throws {Error} If unable to get clipboard content
|
||||
*/
|
||||
async copyToClipboard(): Promise<string> {
|
||||
const response = await this.sendCommand('copy_to_clipboard');
|
||||
if (!response.success || !response.content) {
|
||||
@@ -125,21 +239,42 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return response.content as string;
|
||||
}
|
||||
|
||||
/**
|
||||
* Set the clipboard content to the specified text.
|
||||
* @param {string} text - Text to set in clipboard
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async setClipboard(text: string): Promise<void> {
|
||||
await this.sendCommand('set_clipboard', { text });
|
||||
}
|
||||
|
||||
// File System Actions
|
||||
/**
|
||||
* Check if a file exists at the specified path.
|
||||
* @param {string} path - Path to the file
|
||||
* @returns {Promise<boolean>} True if file exists, false otherwise
|
||||
*/
|
||||
async fileExists(path: string): Promise<boolean> {
|
||||
const response = await this.sendCommand('file_exists', { path });
|
||||
return (response.exists as boolean) || false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Check if a directory exists at the specified path.
|
||||
* @param {string} path - Path to the directory
|
||||
* @returns {Promise<boolean>} True if directory exists, false otherwise
|
||||
*/
|
||||
async directoryExists(path: string): Promise<boolean> {
|
||||
const response = await this.sendCommand('directory_exists', { path });
|
||||
return (response.exists as boolean) || false;
|
||||
}
|
||||
|
||||
/**
|
||||
* List the contents of a directory.
|
||||
* @param {string} path - Path to the directory
|
||||
* @returns {Promise<string[]>} Array of file and directory names
|
||||
* @throws {Error} If unable to list directory
|
||||
*/
|
||||
async listDir(path: string): Promise<string[]> {
|
||||
const response = await this.sendCommand('list_dir', { path });
|
||||
if (!response.success) {
|
||||
@@ -148,6 +283,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return (response.files as string[]) || [];
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the size of a file in bytes.
|
||||
* @param {string} path - Path to the file
|
||||
* @returns {Promise<number>} File size in bytes
|
||||
* @throws {Error} If unable to get file size
|
||||
*/
|
||||
async getFileSize(path: string): Promise<number> {
|
||||
const response = await this.sendCommand('get_file_size', { path });
|
||||
if (!response.success) {
|
||||
@@ -156,6 +297,16 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return (response.size as number) || 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Read file content in chunks for large files.
|
||||
* @private
|
||||
* @param {string} path - Path to the file
|
||||
* @param {number} offset - Starting byte offset
|
||||
* @param {number} totalLength - Total number of bytes to read
|
||||
* @param {number} [chunkSize=1048576] - Size of each chunk in bytes
|
||||
* @returns {Promise<Buffer>} File content as Buffer
|
||||
* @throws {Error} If unable to read file chunk
|
||||
*/
|
||||
private async readBytesChunked(
|
||||
path: string,
|
||||
offset: number,
|
||||
@@ -190,6 +341,16 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return Buffer.concat(chunks);
|
||||
}
|
||||
|
||||
/**
|
||||
* Write file content in chunks for large files.
|
||||
* @private
|
||||
* @param {string} path - Path to the file
|
||||
* @param {Buffer} content - Content to write
|
||||
* @param {boolean} [append=false] - Whether to append to existing file
|
||||
* @param {number} [chunkSize=1048576] - Size of each chunk in bytes
|
||||
* @returns {Promise<void>}
|
||||
* @throws {Error} If unable to write file chunk
|
||||
*/
|
||||
private async writeBytesChunked(
|
||||
path: string,
|
||||
content: Buffer,
|
||||
@@ -222,36 +383,43 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Read text from a file with specified encoding.
|
||||
* @param {string} path - Path to the file to read
|
||||
* @param {BufferEncoding} [encoding='utf8'] - Text encoding to use
|
||||
* @returns {Promise<string>} The decoded text content of the file
|
||||
*/
|
||||
async readText(path: string, encoding: BufferEncoding = 'utf8'): Promise<string> {
|
||||
/**
|
||||
* Read text from a file with specified encoding.
|
||||
*
|
||||
* @param path - Path to the file to read
|
||||
* @param encoding - Text encoding to use (default: 'utf8')
|
||||
* @returns The decoded text content of the file
|
||||
*/
|
||||
const contentBytes = await this.readBytes(path);
|
||||
return contentBytes.toString(encoding);
|
||||
}
|
||||
|
||||
/**
|
||||
* Write text to a file with specified encoding.
|
||||
* @param {string} path - Path to the file to write
|
||||
* @param {string} content - Text content to write
|
||||
* @param {BufferEncoding} [encoding='utf8'] - Text encoding to use
|
||||
* @param {boolean} [append=false] - Whether to append to the file instead of overwriting
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
async writeText(
|
||||
path: string,
|
||||
content: string,
|
||||
encoding: BufferEncoding = 'utf8',
|
||||
append: boolean = false
|
||||
): Promise<void> {
|
||||
/**
|
||||
* Write text to a file with specified encoding.
|
||||
*
|
||||
* @param path - Path to the file to write
|
||||
* @param content - Text content to write
|
||||
* @param encoding - Text encoding to use (default: 'utf8')
|
||||
* @param append - Whether to append to the file instead of overwriting
|
||||
*/
|
||||
const contentBytes = Buffer.from(content, encoding);
|
||||
await this.writeBytes(path, contentBytes, append);
|
||||
}
|
||||
|
||||
/**
|
||||
* Read bytes from a file, with optional offset and length.
|
||||
* @param {string} path - Path to the file
|
||||
* @param {number} [offset=0] - Starting byte offset
|
||||
* @param {number} [length] - Number of bytes to read (reads entire file if not specified)
|
||||
* @returns {Promise<Buffer>} File content as Buffer
|
||||
* @throws {Error} If unable to read file
|
||||
*/
|
||||
async readBytes(path: string, offset: number = 0, length?: number): Promise<Buffer> {
|
||||
// For large files, use chunked reading
|
||||
if (length === undefined) {
|
||||
@@ -275,6 +443,14 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return Buffer.from(response.content_b64 as string, 'base64');
|
||||
}
|
||||
|
||||
/**
|
||||
* Write bytes to a file.
|
||||
* @param {string} path - Path to the file
|
||||
* @param {Buffer} content - Content to write as Buffer
|
||||
* @param {boolean} [append=false] - Whether to append to existing file
|
||||
* @returns {Promise<void>}
|
||||
* @throws {Error} If unable to write file
|
||||
*/
|
||||
async writeBytes(path: string, content: Buffer, append: boolean = false): Promise<void> {
|
||||
// For large files, use chunked writing
|
||||
if (content.length > 5 * 1024 * 1024) {
|
||||
@@ -293,6 +469,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Delete a file at the specified path.
|
||||
* @param {string} path - Path to the file to delete
|
||||
* @returns {Promise<void>}
|
||||
* @throws {Error} If unable to delete file
|
||||
*/
|
||||
async deleteFile(path: string): Promise<void> {
|
||||
const response = await this.sendCommand('delete_file', { path });
|
||||
if (!response.success) {
|
||||
@@ -300,6 +482,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Create a directory at the specified path.
|
||||
* @param {string} path - Path where to create the directory
|
||||
* @returns {Promise<void>}
|
||||
* @throws {Error} If unable to create directory
|
||||
*/
|
||||
async createDir(path: string): Promise<void> {
|
||||
const response = await this.sendCommand('create_dir', { path });
|
||||
if (!response.success) {
|
||||
@@ -309,6 +497,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Delete a directory at the specified path.
|
||||
* @param {string} path - Path to the directory to delete
|
||||
* @returns {Promise<void>}
|
||||
* @throws {Error} If unable to delete directory
|
||||
*/
|
||||
async deleteDir(path: string): Promise<void> {
|
||||
const response = await this.sendCommand('delete_dir', { path });
|
||||
if (!response.success) {
|
||||
@@ -318,6 +512,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Execute a shell command and return stdout and stderr.
|
||||
* @param {string} command - Command to execute
|
||||
* @returns {Promise<[string, string]>} Tuple of [stdout, stderr]
|
||||
* @throws {Error} If command execution fails
|
||||
*/
|
||||
async runCommand(command: string): Promise<[string, string]> {
|
||||
const response = await this.sendCommand('run_command', { command });
|
||||
if (!response.success) {
|
||||
@@ -330,6 +530,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
}
|
||||
|
||||
// Accessibility Actions
|
||||
/**
|
||||
* Get the accessibility tree of the current screen.
|
||||
* @returns {Promise<AccessibilityNode>} Root accessibility node
|
||||
* @throws {Error} If unable to get accessibility tree
|
||||
*/
|
||||
async getAccessibilityTree(): Promise<AccessibilityNode> {
|
||||
const response = await this.sendCommand('get_accessibility_tree');
|
||||
if (!response.success) {
|
||||
@@ -340,6 +545,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return response as unknown as AccessibilityNode;
|
||||
}
|
||||
|
||||
/**
|
||||
* Convert coordinates to screen coordinates.
|
||||
* @param {number} x - X coordinate to convert
|
||||
* @param {number} y - Y coordinate to convert
|
||||
* @returns {Promise<[number, number]>} Converted screen coordinates as [x, y]
|
||||
* @throws {Error} If coordinate conversion fails
|
||||
*/
|
||||
async toScreenCoordinates(x: number, y: number): Promise<[number, number]> {
|
||||
const response = await this.sendCommand('to_screen_coordinates', { x, y });
|
||||
if (!response.success || !response.coordinates) {
|
||||
@@ -348,6 +560,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
|
||||
return response.coordinates as [number, number];
|
||||
}
|
||||
|
||||
/**
|
||||
* Convert coordinates to screenshot coordinates.
|
||||
* @param {number} x - X coordinate to convert
|
||||
* @param {number} y - Y coordinate to convert
|
||||
* @returns {Promise<[number, number]>} Converted screenshot coordinates as [x, y]
|
||||
* @throws {Error} If coordinate conversion fails
|
||||
*/
|
||||
async toScreenshotCoordinates(
|
||||
x: number,
|
||||
y: number
|
||||
|
||||
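The `readBytesChunked` helper documented in the hunk above is only partly visible in this diff, so the loop below is a minimal sketch of the technique its JSDoc describes (read `totalLength` bytes from `offset` in pieces of `chunkSize`, then concatenate), not the library's implementation. It is written in Python to match the notebooks that follow. The `read_range` callable is a hypothetical stand-in for whatever remote read command the interface sends.

```python
from typing import Callable, List

DEFAULT_CHUNK_SIZE = 1024 * 1024  # 1 MiB, the documented default (1048576 bytes)


def read_bytes_chunked(
    read_range: Callable[[int, int], bytes],
    offset: int,
    total_length: int,
    chunk_size: int = DEFAULT_CHUNK_SIZE,
) -> bytes:
    """Read total_length bytes starting at offset, one chunk at a time.

    read_range is a hypothetical callable taking (offset, length) and
    returning at most `length` bytes; it is not part of the Cua API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[bytes] = []
    position = offset
    remaining = total_length
    while remaining > 0:
        want = min(chunk_size, remaining)
        chunk = read_range(position, want)
        if not chunk:
            # The source ended early; stop instead of looping forever.
            break
        chunks.append(chunk)
        position += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage with an in-memory "file" standing in for the remote side.
data = bytes(range(256)) * 10  # 2560 bytes


def fake_read_range(start: int, length: int) -> bytes:
    return data[start:start + length]


result = read_bytes_chunked(fake_read_range, offset=100, total_length=2000, chunk_size=512)
```

Tracking `remaining` separately from `position` keeps the final chunk from overshooting the requested range, and the empty-chunk check covers a file that is shorter than the caller expected.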
@@ -0,0 +1,201 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Customizing Your ComputerAgent\n",
|
||||
"\n",
|
||||
"This notebook demonstrates four practical ways to increase the capabilities and success rate of your `ComputerAgent` in the Agent SDK:\n",
|
||||
"\n",
|
||||
"1. Simple: Prompt engineering (via optional `instructions`)\n",
|
||||
"2. Easy: Tools (function tools and custom computer tools)\n",
|
||||
"3. Intermediate: Callbacks\n",
|
||||
"4. Expert: Custom `@register_agent` loops\n",
|
||||
"\n",
|
||||
"> Tip: The same patterns work in scripts and services — the notebook just makes it easy to iterate."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"\n",
|
||||
"We'll import `ComputerAgent`, a simple Docker-based computer, and some utilities."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import logging\n",
|
||||
"from agent.agent import ComputerAgent\n",
|
||||
"from agent.callbacks import LoggingCallback\n",
|
||||
"from computer import Computer\n",
|
||||
"\n",
|
||||
"computer = Computer(\n",
|
||||
" os_type=\"linux\",\n",
|
||||
" provider_type=\"docker\",\n",
|
||||
" image=\"trycua/cua-ubuntu:latest\",\n",
|
||||
" name=\"my-cua-container\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"await computer.run() # Launch & connect to Docker container"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1) Simple: Prompt engineering\n",
|
||||
"\n",
|
||||
"You can guide your agent with system-like `instructions`.\n",
|
||||
"\n",
|
||||
"Under the hood, `ComputerAgent(instructions=...)` adds a `PromptInstructionsCallback` that prepends a user message before each LLM call.\n",
|
||||
"\n",
|
||||
"This mirrors the recommended snippet in code:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"effective_input = full_input\n",
|
||||
"if instructions:\n",
"    effective_input = [{\"role\": \"user\", \"content\": instructions}] + full_input\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
"instructions = (\n",
"    \"You are a meticulous software operator. Prefer safe, deterministic actions. \"\n",
"    \"Always confirm via on-screen text before proceeding.\"\n",
")\n",
"agent = ComputerAgent(\n",
"    model=\"openai/computer-use-preview\",\n",
"    tools=[computer],\n",
"    instructions=instructions,\n",
"    callbacks=[LoggingCallback(level=logging.INFO)],\n",
")\n",
"messages = [\n",
"    {\"role\": \"user\", \"content\": \"Open the settings and turn on dark mode.\"}\n",
"]\n",
"\n",
"# In notebooks, you may want to consume the async generator\n",
"import asyncio\n",
"async def run_once():\n",
"    async for chunk in agent.run(messages):\n",
"        # Print any assistant text outputs\n",
"        for item in chunk.get(\"output\", []):\n",
"            if item.get(\"type\") == \"message\":\n",
"                for c in item.get(\"content\", []):\n",
"                    if c.get(\"text\"):\n",
"                        print(c.get(\"text\"))\n",
"\n",
"await run_once()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2) Easy: Tools\n",
|
||||
"\n",
|
||||
"Add function tools to expose deterministic capabilities. Tools are auto-extracted to schemas and callable by the agent."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
"def calculate_percentage(numerator: float, denominator: float) -> str:\n",
"    \"\"\"Calculate a percentage string.\n",
"\n",
"    Args:\n",
"        numerator: Numerator value\n",
"        denominator: Denominator value\n",
"    Returns:\n",
"        A formatted percentage string (e.g., '75.00%').\n",
"    \"\"\"\n",
"    if denominator == 0:\n",
"        return \"0.00%\"\n",
"    return f\"{(numerator/denominator)*100:.2f}%\"\n",
"\n",
"agent_with_tool = ComputerAgent(\n",
"    model=\"openai/computer-use-preview\",\n",
"    tools=[computer, calculate_percentage],\n",
"    instructions=\"When doing math, prefer the `calculate_percentage` tool when relevant.\",\n",
")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3) Intermediate: Callbacks\n",
|
||||
"\n",
|
||||
"Callbacks offer lifecycle hooks. For example, limit recent images or record trajectories."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from agent.callbacks import ImageRetentionCallback, TrajectorySaverCallback\n",
|
||||
"\n",
|
||||
"agent_with_callbacks = ComputerAgent(\n",
|
||||
" model=\"anthropic/claude-3-5-sonnet-20241022\",\n",
|
||||
" tools=[computer],\n",
|
||||
" callbacks=[\n",
|
||||
" ImageRetentionCallback(only_n_most_recent_images=3),\n",
|
||||
" TrajectorySaverCallback(\"./trajectories\"),\n",
|
||||
" ],\n",
|
||||
")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4) Expert: Custom `@register_agent`\n",
|
||||
"\n",
|
||||
"Register custom agent configs that implement `predict_step` (and optionally `predict_click`). This gives you full control over prompting, message shaping, and tool wiring.\n",
|
||||
"\n",
|
||||
"See: `libs/python/agent/agent/loops/` for concrete examples."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Next steps\n",
|
||||
"\n",
|
||||
"- Start with `instructions` for fast wins.\n",
|
||||
"- Add function tools for determinism and reliability.\n",
|
||||
"- Use callbacks to manage cost, logs, and safety.\n",
|
||||
"- Build custom loops for specialized domains."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python",
|
||||
"version": "3.10"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
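The prompt-engineering cell in the notebook above describes `instructions` as a user message prepended before each LLM call. A minimal standalone sketch of that step follows, assuming messages are plain dictionaries with `role` and `content` keys. The function name `build_effective_input` is hypothetical and is not part of the Agent SDK.

```python
from typing import Any, Dict, List, Optional

Message = Dict[str, Any]


def build_effective_input(
    full_input: List[Message], instructions: Optional[str]
) -> List[Message]:
    """Return a new message list with instructions prepended as a user message.

    Hypothetical helper for illustration; it mirrors the snippet quoted in
    the notebook and never mutates the caller's list.
    """
    if not instructions:
        return list(full_input)
    return [{"role": "user", "content": instructions}] + full_input


history = [{"role": "user", "content": "Open the settings and turn on dark mode."}]
effective = build_effective_input(history, "Prefer safe, deterministic actions.")
```

Building a new list on every call, rather than inserting into the stored history, means the instructions are not duplicated when the same history is sent again on the next step.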
|
||||
@@ -0,0 +1,280 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a5d6b2ed",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Computer-Use Agents SOTA Challenge\n",
|
||||
"\n",
|
||||
"Congrats on joining the Cua + HUD hackathon at Hack The North 2025!\n",
|
||||
"\n",
|
||||
"This notebook will show you how to create a computer use agent with Cua and evaluate it using HUD."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cebe8572",
|
||||
"metadata": {},
|
||||
"source": [
"## 💻 Prerequisites\n",
|
||||
"\n",
|
||||
"Clone the Cua repository and install project dependencies."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3d7c38f9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The easiest way to get started is by getting set up with the Cua development repository.\n",
|
||||
"\n",
|
||||
"Install [Docker](https://www.docker.com/products/docker-desktop/) and [pdm](https://pdm-project.org/en/latest/#recommended-installation-method).\n",
|
||||
"\n",
|
||||
"Clone the Cua repository:\n",
|
||||
"\n",
|
||||
"`git clone https://github.com/trycua/cua`\n",
|
||||
"\n",
|
||||
"Install the project dependencies:\n",
|
||||
"\n",
|
||||
"`cd cua && pdm install`\n",
|
||||
"\n",
|
||||
"Now, you should be able to run the `notebooks/hud_hackathon.ipynb` notebook in VS Code with the `.venv` virtual environment selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "19f92431",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ☁️ Connect to cloud services\n",
|
||||
"\n",
"Create a free HUD account and load your API keys."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "47171dc3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Create a HUD account at https://www.hud.so/\n",
"2. Create a .env file:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1757f145",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create a .env file if it doesn't exist\n",
|
||||
"\n",
|
||||
"ENV_TEMPLATE = \"\"\"# Required environment variables:\n",
|
||||
"HUD_API_KEY=\n",
|
||||
"\n",
|
||||
"# Any LLM provider will work:\n",
|
||||
"ANTHROPIC_API_KEY=\n",
|
||||
"OPENAI_API_KEY=\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"import os\n",
"if not os.path.exists(\".env\"):\n",
"    with open(\".env\", \"w\") as f:\n",
"        f.write(ENV_TEMPLATE)\n",
"    print(\"A .env file was created! Fill in the empty values.\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0949908d",
|
||||
"metadata": {},
|
||||
"source": [
"3. Fill in all missing values in the .env file"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2f23828d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Read the .env file\n",
|
||||
"# HUD requires the .env file to be in the same directory\n",
|
||||
"\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"load_dotenv(dotenv_path='.env', override=True)\n",
|
||||
"\n",
|
||||
"assert os.getenv(\"HUD_API_KEY\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5c8bef64",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🤖 Create a computer use agent\n",
|
||||
"\n",
"Create a computer use agent using the Cua SDK."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cd4393b0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import logging\n",
|
||||
"from pathlib import Path\n",
|
||||
"from agent import ComputerAgent\n",
|
||||
"\n",
|
||||
"# Here you can set the model and tools for your agent.\n",
|
||||
"# Computer use models: https://www.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents\n",
|
||||
"# Composed agent models: https://www.trycua.com/docs/agent-sdk/supported-agents/composed-agents\n",
|
||||
"# Custom tools: https://www.trycua.com/docs/agent-sdk/custom-tools\n",
|
||||
"agent_config = {\n",
|
||||
" \"model\": \"openai/computer-use-preview\",\n",
|
||||
" \"trajectory_dir\": str(Path(\"trajectories\")),\n",
|
||||
" \"only_n_most_recent_images\": 3,\n",
|
||||
" \"verbosity\": logging.INFO\n",
|
||||
"}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a07b09ee",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🖱️ Test your agent\n",
|
||||
"\n",
|
||||
"Run your agent on a test scenario in a Docker container."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12b9c22c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Make sure Docker is running to launch the computer.\n",
|
||||
"\n",
|
||||
"You can view the live VNC stream from the Docker container at `http://localhost:8006/`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a210e959",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from computer import Computer, VMProviderType\n",
|
||||
"import webbrowser\n",
|
||||
"\n",
"# Launch a local Docker container\n",
|
||||
"computer = Computer(\n",
|
||||
" os_type=\"linux\",\n",
|
||||
" provider_type=VMProviderType.DOCKER,\n",
|
||||
" verbosity=logging.INFO\n",
|
||||
")\n",
|
||||
"await computer.run()\n",
|
||||
"\n",
|
||||
"agent_config[\"tools\"] = [ computer ]\n",
|
||||
"\n",
|
||||
"webbrowser.open(\"http://localhost:8006/\", new=0, autoraise=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "87a307e3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Try running the computer use agent on a simple task.\n",
|
||||
"\n",
|
||||
"Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f3a32ea8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create agent\n",
|
||||
"agent = ComputerAgent(**agent_config)\n",
|
||||
"\n",
|
||||
"tasks = [\n",
|
||||
" \"Open the web browser and search for a repository named trycua/cua on GitHub.\"\n",
|
||||
"]\n",
|
||||
"\n",
"for i, task in enumerate(tasks):\n",
"    print(f\"\\nExecuting task {i+1}/{len(tasks)}: {task}\")\n",
"    async for result in agent.run(task):\n",
"        print(result)\n",
"\n",
"    print(f\"\\n✅ Task {i+1}/{len(tasks)} completed: {task}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eb4edbb5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🧐 Benchmark your agent\n",
|
||||
"\n",
|
||||
"Test your agent's performance on a selection of tasks from the OSWorld benchmark."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6bf0887e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import uuid\n",
|
||||
"from pprint import pprint\n",
|
||||
"from agent.integrations.hud import run_full_dataset\n",
|
||||
"\n",
|
||||
"job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
|
||||
"\n",
|
||||
"# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
|
||||
"# See the documentation here: https://docs.trycua.com/docs/agent-sdk/integrations/hud#running-a-full-dataset\n",
|
||||
"results = await run_full_dataset(\n",
|
||||
" dataset=\"ddupont/OSWorld-Tiny-Public\",\n",
|
||||
" job_name=job_name,\n",
|
||||
" **agent_config,\n",
|
||||
" max_concurrent=20,\n",
|
||||
" max_steps=50,\n",
|
||||
" #split=\"train[:5]\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
|
||||
"print(f\"Job: {job_name}\")\n",
|
||||
"print(f\"Total results: {len(results)}\")\n",
|
||||
"pprint(results[:3])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5b89a103",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🦾 Improve your agent\n",
|
||||
"\n",
|
||||
"To improve your agent for OSWorld-Verified, experiment with different models and add custom tools that fit your use case. You can also dive into the ComputerAgent source code to design an improved version or subclass tailored to your needs.\n",
|
||||
"\n",
|
||||
"Learn more about [Customizing Your ComputerAgent](https://docs.trycua.com/docs/agent-sdk/customizing-computeragent) in the docs."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
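The notebook above checks its configuration with a bare `assert os.getenv("HUD_API_KEY")`, which fails without saying which value is missing. The sketch below shows one way to report blank variables by name. The helper `missing_env_keys` and the `REQUIRED_KEYS` list are illustrative assumptions, not part of Cua, HUD, or python-dotenv.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Taken from the "Required environment variables" section of the template above.
REQUIRED_KEYS = ["HUD_API_KEY"]


def missing_env_keys(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the required variable names that are unset or blank.

    Hypothetical helper; defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]


# Example with an explicit mapping so the check does not depend on the real environment.
example_env = {"HUD_API_KEY": "", "OPENAI_API_KEY": "placeholder-value"}
missing = missing_env_keys(REQUIRED_KEYS, example_env)
if missing:
    print("Fill in these values in .env: " + ", ".join(missing))
```

In the notebook this would run after `load_dotenv(...)`, with `env` left at its default so the freshly loaded values are inspected.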
|
||||
@@ -0,0 +1,286 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a5d6b2ed",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Computer-Use Agents SOTA Challenge\n",
|
||||
"\n",
|
||||
"Congrats on joining the Cua + HUD hackathon at Hack The North 2025!\n",
|
||||
"\n",
|
||||
"This notebook will show you how to create a computer use agent with Cua and evaluate it using HUD."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cebe8572",
|
||||
"metadata": {},
|
||||
"source": [
"## 💻 Prerequisites\n",
|
||||
"\n",
|
||||
"Clone the Cua repository and install project dependencies."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3d7c38f9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The easiest way to get started is by getting set up with the Cua development repository.\n",
|
||||
"\n",
|
||||
"First, clone the Cua repository:\n",
|
||||
"\n",
|
||||
"`git clone https://github.com/trycua/cua`\n",
|
||||
"\n",
|
||||
"Install [pdm](https://pdm-project.org/en/latest/#recommended-installation-method).\n",
|
||||
"\n",
|
||||
"Install the project dependencies:\n",
|
||||
"\n",
|
||||
"`cd cua && pdm install`\n",
|
||||
"\n",
|
||||
"Now, you should be able to run the `notebooks/hud_hackathon.ipynb` notebook in VS Code with the `.venv` virtual environment selected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "19f92431",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ☁️ Connect to cloud services\n",
|
||||
"\n",
|
||||
"Create Cua and HUD accounts and load your API keys. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "47171dc3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Create a Cua account at https://www.trycua.com/\n",
|
||||
"2. Start a small Cua container at https://www.trycua.com/dashboard/containers (If you need credits, ask us!)\n",
|
||||
"3. Create a HUD account at https://www.hud.so/\n",
|
||||
"4. Create a .env file:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1757f145",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create a .env file if it doesn't exist\n",
|
||||
"\n",
|
||||
"ENV_TEMPLATE = \"\"\"# Required environment variables:\n",
|
||||
"CUA_API_KEY=\n",
|
||||
"CUA_CONTAINER_NAME=\n",
|
||||
"HUD_API_KEY=\n",
|
||||
"\n",
|
||||
"# Any LLM provider will work:\n",
|
||||
"ANTHROPIC_API_KEY=\n",
|
||||
"OPENAI_API_KEY=\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"import os\n",
"if not os.path.exists(\".env\"):\n",
"    with open(\".env\", \"w\") as f:\n",
"        f.write(ENV_TEMPLATE)\n",
"    print(\"A .env file was created! Fill in the empty values.\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0949908d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"5. Fill in all missing values in the .env file"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2f23828d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Read the .env file\n",
|
||||
"# HUD requires the .env file to be in the same directory\n",
|
||||
"\n",
|
||||
"from dotenv import load_dotenv\n",
|
||||
"load_dotenv(dotenv_path='.env', override=True)\n",
|
||||
"\n",
"assert os.getenv(\"CUA_API_KEY\"), \"CUA_API_KEY is not set in .env\"\n",
"assert os.getenv(\"CUA_CONTAINER_NAME\"), \"CUA_CONTAINER_NAME is not set in .env\"\n",
"assert os.getenv(\"HUD_API_KEY\"), \"HUD_API_KEY is not set in .env\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5c8bef64",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🤖 Create a computer use agent\n",
|
||||
"\n",
"Create a computer use agent using the Cua SDK."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cd4393b0",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import logging\n",
|
||||
"from pathlib import Path\n",
|
||||
"from agent import ComputerAgent\n",
|
||||
"\n",
|
||||
"# Here you can set the model and tools for your agent.\n",
|
||||
"# Computer use models: https://www.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents\n",
|
||||
"# Composed agent models: https://www.trycua.com/docs/agent-sdk/supported-agents/composed-agents\n",
|
||||
"# Custom tools: https://www.trycua.com/docs/agent-sdk/custom-tools\n",
|
||||
"agent_config = {\n",
|
||||
" \"model\": \"openai/computer-use-preview\",\n",
|
||||
" \"trajectory_dir\": str(Path(\"trajectories\")),\n",
|
||||
" \"only_n_most_recent_images\": 3,\n",
|
||||
" \"verbosity\": logging.INFO\n",
|
||||
"}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a07b09ee",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🖱️ Test your agent\n",
|
||||
"\n",
|
||||
"Run your agent on a test scenario in a Cua cloud container."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "12b9c22c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Connect to an existing cloud container through the Cua SDK.\n",
|
||||
"\n",
|
||||
"You can access the computer through VNC on the [Cua Dashboard](https://www.trycua.com/dashboard)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a210e959",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from computer import Computer, VMProviderType\n",
|
||||
"\n",
|
||||
"# Connect to your existing cloud container\n",
|
||||
"computer = Computer(\n",
|
||||
" os_type=\"linux\",\n",
|
||||
" provider_type=VMProviderType.CLOUD,\n",
|
||||
" name=os.getenv(\"CUA_CONTAINER_NAME\") or \"\",\n",
|
||||
" api_key=os.getenv(\"CUA_API_KEY\"),\n",
|
||||
" verbosity=logging.INFO\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"agent_config[\"tools\"] = [ computer ]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "87a307e3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Try running the computer use agent on a simple task.\n",
|
||||
"\n",
|
||||
"To view a replay of the agent's actions, upload the trajectory to the [trajectory viewer](https://www.trycua.com/trajectory-viewer).\n",
|
||||
"\n",
|
||||
"Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f3a32ea8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create agent\n",
|
||||
"agent = ComputerAgent(**agent_config)\n",
|
||||
"\n",
|
||||
"tasks = [\n",
|
||||
" \"Open the web browser and search for a repository named trycua/cua on GitHub.\"\n",
|
||||
"]\n",
|
||||
"\n",
"for i, task in enumerate(tasks, start=1):\n",
"    print(f\"\\nExecuting task {i}/{len(tasks)}: {task}\")\n",
"    async for result in agent.run(task):\n",
"        print(result)\n",
"\n",
"    print(f\"\\n✅ Task {i}/{len(tasks)} completed: {task}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eb4edbb5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🧐 Benchmark your agent\n",
|
||||
"\n",
|
||||
"Test your agent's performance on a selection of tasks from the OSWorld benchmark."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6bf0887e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import uuid\n",
|
||||
"from pprint import pprint\n",
|
||||
"from agent.integrations.hud import run_full_dataset\n",
|
||||
"\n",
|
||||
"job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
|
||||
"\n",
|
||||
"# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
|
||||
"# See the documentation here: https://docs.trycua.com/docs/agent-sdk/integrations/hud#running-a-full-dataset\n",
|
||||
"results = await run_full_dataset(\n",
|
||||
" dataset=\"ddupont/OSWorld-Tiny-Public\",\n",
|
||||
" job_name=job_name,\n",
|
||||
" **agent_config,\n",
|
||||
" max_concurrent=20,\n",
|
||||
" max_steps=50,\n",
|
||||
" #split=\"train[:5]\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
|
||||
"print(f\"Job: {job_name}\")\n",
|
||||
"print(f\"Total results: {len(results)}\")\n",
|
||||
"pprint(results[:3])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5b89a103",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 🦾 Improve your agent\n",
|
||||
"\n",
|
||||
"To improve your agent for OSWorld-Verified, experiment with different models and add custom tools that fit your use case. You can also dive into the ComputerAgent source code to design an improved version or subclass tailored to your needs.\n",
|
||||
"\n",
|
||||
"Learn more about [Customizing Your ComputerAgent](https://docs.trycua.com/docs/agent-sdk/customizing-computeragent) in the docs."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
+2
-3
@@ -6,6 +6,7 @@ requires = ["pdm-backend"]
|
||||
authors = [{ name = "TryCua", email = "gh@trycua.com" }]
|
||||
dependencies = [
|
||||
"openai<1.100.0",
|
||||
"anthropic>=0.67.0",
|
||||
]
|
||||
description = "CUA (Computer Use Agent) mono-repo"
|
||||
license = { text = "MIT" }
|
||||
@@ -40,6 +41,7 @@ dev = [
|
||||
"mypy>=1.10.0",
|
||||
"ruff>=0.9.2",
|
||||
"types-requests>=2.31.0",
|
||||
"hud-python[agent]==0.4.26"
|
||||
]
|
||||
docs = ["mkdocs-material>=9.2.0", "mkdocs>=1.5.0"]
|
||||
test = [
|
||||
@@ -54,9 +56,6 @@ test = [
|
||||
[tool.pdm.resolution]
|
||||
respect-source-order = true
|
||||
|
||||
[tool.pdm.resolution.overrides]
|
||||
hud-python = "0.4.12"
|
||||
|
||||
[tool.black]
|
||||
line-length = 100
|
||||
target-version = ["py311"]
|
||||
|
||||