Merge branch 'main' into models/opencua

Dillon DuPont
2025-09-15 15:11:15 -04:00
35 changed files with 9754 additions and 137 deletions
+2 -2
View File
@@ -30,7 +30,7 @@ We're always looking for suggestions to make lume better. If you have an idea:
We follow strict code formatting guidelines to ensure consistency across the codebase. Before submitting any code:
1. **Review Our Format Guide**: Please review our [Code Formatting Standards](docs/Developer-Guide.md#code-formatting-standards) section in the Getting Started guide.
1. **Review Our Format Guide**: Please review our [Code Formatting Standards](Development.md#code-formatting-standards) section in the Getting Started guide.
2. **Configure Your IDE**: We recommend using the workspace settings provided in `.vscode/` for automatic formatting.
3. **Run Formatting Tools**: Always run the formatting tools before submitting a PR:
```bash
@@ -51,6 +51,6 @@ Documentation improvements are always welcome. You can:
- Improve API documentation
- Add tutorials or guides
For detailed instructions on setting up your development environment and submitting code contributions, please see our [Developer-Guide](./docs/Developer-Guide.md).
For detailed instructions on setting up your development environment and submitting code contributions, please see our [Developer-Guide](Development.md).
Feel free to join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss ideas or get help with your contributions.
+285
View File
@@ -0,0 +1,285 @@
# Getting Started
## Project Structure
The project is organized as a monorepo with these main packages:
- `libs/core/` - Base package with telemetry support
- `libs/computer/` - Computer-use interface (CUI) library
- `libs/agent/` - AI agent library with multi-provider support
- `libs/som/` - Set-of-Mark parser
- `libs/computer-server/` - Server component for VM
- `libs/lume/` - Lume CLI
- `libs/pylume/` - Python bindings for Lume
Each package has its own virtual environment and dependencies, managed through PDM.
## Local Development Setup
1. Install Lume CLI:
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
```
2. Clone the repository:
```bash
git clone https://github.com/trycua/cua.git
cd cua
```
3. Create a `.env.local` file in the root directory with your API keys:
```bash
# Required for Anthropic provider
ANTHROPIC_API_KEY=your_anthropic_key_here
# Required for OpenAI provider
OPENAI_API_KEY=your_openai_key_here
```
4. Open the workspace in VSCode or Cursor:
```bash
# For Cua Python development
code .vscode/py.code-workspace
# For Lume (Swift) development
code .vscode/lume.code-workspace
```
Using the workspace file is strongly recommended as it:
- Sets up correct Python environments for each package
- Configures proper import paths
- Enables debugging configurations
- Maintains consistent settings across packages
## Lume Development
Refer to the [Lume Development guide](./libs/lume/Development.md) for instructions on how to develop the Lume CLI.
## Python Development
There are two ways to set up the Python environment:
### Run the build script
Run the build script to set up all packages:
```bash
./scripts/build.sh
```
The build script creates a shared virtual environment for all packages. The workspace configuration automatically handles import paths with the correct Python path settings.
This will:
- Create a virtual environment for the project
- Install all packages in development mode
- Set up the correct Python path
- Install development tools
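To confirm the setup worked, here is a quick sanity check; this is a sketch that assumes the build script puts the shared environment at `.venv` in the repository root and that the `agent` and `computer` modules are importable:
```bash
# Activate the shared virtual environment created by the build script
# (the .venv location is an assumption; adjust if your setup differs)
source .venv/bin/activate

# Verify that the core packages are importable
python -c "import agent, computer; print('cua packages importable')"
```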
### Install with PDM
If PDM is not already installed, you can follow the installation instructions [here](https://pdm-project.org/en/latest/#installation).
To install with PDM, simply run:
```console
pdm install -G:all
```
This installs all the dependencies for development, testing, and building the docs. If you'd only like development dependencies, you can run:
```console
pdm install -d
```
## Running Examples
The Python workspace includes launch configurations for all packages:
- "Run Computer Examples" - Runs computer examples
- "Run Agent Examples" - Runs agent examples
- "SOM" configurations - Various settings for running SOM
To run examples from VSCode / Cursor:
1. Press F5 or use the Run/Debug view
2. Select the desired configuration
The workspace also includes compound launch configurations:
- "Run Computer Examples + Server" - Runs both the Computer Examples and Server simultaneously
## Docker Development Environment
As an alternative to installing directly on your host machine, you can use Docker for development. This approach has several advantages.
### Prerequisites
- Docker installed on your machine
- Lume server running on your host (port 7777): `lume serve`
### Setup and Usage
1. Build the development Docker image:
```bash
./scripts/run-docker-dev.sh build
```
2. Run an example in the container:
```bash
./scripts/run-docker-dev.sh run computer_examples.py
```
3. Get an interactive shell in the container:
```bash
./scripts/run-docker-dev.sh run --interactive
```
4. Stop any running containers:
```bash
./scripts/run-docker-dev.sh stop
```
### How it Works
The Docker development environment:
- Installs all required Python dependencies in the container
- Mounts your source code from the host at runtime
- Automatically configures the connection to use host.docker.internal:7777 for accessing the Lume server on your host machine
- Preserves your code changes without requiring rebuilds (source code is mounted as a volume)
> **Note**: The Docker container doesn't include the macOS-specific Lume executable. Instead, it connects to the Lume server running on your host machine via host.docker.internal:7777. Make sure to start the Lume server on your host before running examples in the container.
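If examples fail to connect, a quick way to confirm the container can reach the Lume server is the sketch below; it assumes `lume serve` is running on the host and that Docker's standard `host.docker.internal` mapping is available:
```bash
# From a shell inside the dev container (see "Get an interactive shell" above),
# check that the Lume server port on the host is reachable:
python3 -c "import socket; socket.create_connection(('host.docker.internal', 7777), timeout=5); print('Lume server reachable')"
```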
## Cleanup and Reset
If you need to clean up the environment (non-docker) and start fresh:
```bash
./scripts/cleanup.sh
```
This will:
- Remove all virtual environments
- Clean Python cache files and directories
- Remove build artifacts
- Clean PDM-related files
- Reset environment configurations
## Code Formatting Standards
The cua project follows strict code formatting standards to ensure consistency across all packages.
### Python Code Formatting
#### Tools
The project uses the following tools for code formatting and linting:
- **[Black](https://black.readthedocs.io/)**: Code formatter
- **[Ruff](https://beta.ruff.rs/docs/)**: Fast linter and formatter
- **[MyPy](https://mypy.readthedocs.io/)**: Static type checker
These tools are automatically installed when you set up the development environment using the `./scripts/build.sh` script.
#### Configuration
The formatting configuration is defined in the root `pyproject.toml` file:
```toml
[tool.black]
line-length = 100
target-version = ["py311"]
[tool.ruff]
line-length = 100
target-version = "py311"
select = ["E", "F", "B", "I"]
fix = true
[tool.ruff.format]
docstring-code-format = true
[tool.mypy]
strict = true
python_version = "3.11"
ignore_missing_imports = true
disallow_untyped_defs = true
check_untyped_defs = true
warn_return_any = true
show_error_codes = true
warn_unused_ignores = false
```
#### Key Formatting Rules
- **Line Length**: Maximum of 100 characters
- **Python Version**: Code should be compatible with Python 3.11+
- **Imports**: Automatically sorted (using Ruff's "I" rule)
- **Type Hints**: Required for all function definitions (strict mypy mode)
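As an illustration only (not taken from the codebase), a snippet that satisfies these rules might look like:
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VMConfig:
    """Configuration for a virtual machine (illustrative example)."""

    name: str
    cpu_count: int = 4
    memory_gb: int = 8


def describe_vm(config: VMConfig, note: Optional[str] = None) -> str:
    """Return a one-line description of the VM, staying under the 100-character limit."""
    suffix = f" ({note})" if note else ""
    return f"{config.name}: {config.cpu_count} CPUs, {config.memory_gb} GB RAM{suffix}"
```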
#### IDE Integration
The repository includes VSCode workspace configurations that enable automatic formatting. When you open the workspace files (as recommended in the setup instructions), the correct formatting settings are automatically applied.
Python-specific settings in the workspace files:
```json
"[python]": {
"editor.formatOnSave": true,
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
}
}
```
Recommended VS Code extensions:
- Black Formatter (ms-python.black-formatter)
- Ruff (charliermarsh.ruff)
- Pylance (ms-python.vscode-pylance)
#### Manual Formatting
To manually format code:
```bash
# Format all Python files using Black
pdm run black .
# Run Ruff linter with auto-fix
pdm run ruff check --fix .
# Run type checking with MyPy
pdm run mypy .
```
#### Pre-commit Validation
Before submitting a pull request, ensure your code passes all formatting checks:
```bash
# Run all checks
pdm run black --check .
pdm run ruff check .
pdm run mypy .
```
### Swift Code (Lume)
For Swift code in the `libs/lume` directory:
- Follow the [Swift API Design Guidelines](https://www.swift.org/documentation/api-design-guidelines/)
- Use SwiftFormat for consistent formatting
- Code will be automatically formatted on save when using the lume workspace
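If you prefer to format from the command line instead of on save, a minimal sketch (assuming SwiftFormat is installed, for example via Homebrew):
```bash
# Install SwiftFormat (one-time; assumes Homebrew is available)
brew install swiftformat

# Format the Lume sources in place
swiftformat libs/lume
```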
+2 -2
View File
@@ -188,9 +188,9 @@ Join our [Discord community](https://discord.com/invite/mVnXXpdE85) to discuss i
Cua is open-sourced under the MIT License - see the [LICENSE](LICENSE) file for details.
The base image `kasmweb/core-ubuntu-jammy` is maintained by [Kasm Technologies](https://github.com/kasmtech/workspaces-core-images) and distributed under the Apache License 2.0. Usage of that image is subject to its own license terms.
Portions of this project, specifically components adapted from Kasm Technologies Inc., are also licensed under the MIT License. See [libs/kasm/LICENSE](libs/kasm/LICENSE) for details.
Microsoft's OmniParser, which is used in this project, is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0) - see the [OmniParser LICENSE](https://github.com/microsoft/OmniParser/blob/master/LICENSE) file for details.
Microsoft's OmniParser, which is used in this project, is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0). See the [OmniParser LICENSE](https://github.com/microsoft/OmniParser/blob/master/LICENSE) for details.
### Third-Party Licenses and Optional Components
@@ -3,6 +3,8 @@ title: Agent Loops
description: Supported computer-using agent loops and models
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/agent_nb.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
An agent can be thought of as a loop - it generates actions, executes them, and repeats until done:
1. **Generate**: Your `model` generates `output_text`, `computer_call`, `function_call`
+1 -7
View File
@@ -75,13 +75,7 @@ messages = [
## Message Types
- **user**: User input messages
- **computer_call**: Computer actions (click, type, keypress, etc.)
- **computer_call_output**: Results from computer actions (usually screenshots)
- **function_call**: Function calls (e.g., `computer.call`)
- **function_call_output**: Results from function calls
- **reasoning**: Agent's internal reasoning and planning
- **message**: Agent text responses
See the complete schema in [Message Format](./message-format).
### Memory Management
@@ -0,0 +1,121 @@
---
title: Customizing Your ComputerAgent
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/customizing_computeragent.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
The `ComputerAgent` interface provides an easy proxy to any computer-using model configuration, and it is a powerful framework for extending and building your own agentic systems.
This guide shows four proven ways to increase an agent's capabilities and success rate:
1. Simple: Prompt engineering
2. Easy: Tools
3. Intermediate: Callbacks
4. Expert: Custom `@register_agent`
## 1) Simple: Prompt engineering
Provide guiding instructions to shape behavior. `ComputerAgent` accepts an optional `instructions: str | None` which acts like a system-style preface. Internally, this uses a callback that prepends a user message before each LLM call.
```python
from agent.agent import ComputerAgent
agent = ComputerAgent(
model="openai/computer-use-preview",
tools=[computer],
instructions=(
"You are a meticulous software operator. Prefer safe, deterministic actions. "
"Always confirm via on-screen text before proceeding."
),
)
```
## 2) Easy: Tools
Expose deterministic capabilities as tools (Python functions or custom computer handlers). The agent will call them when appropriate.
```python
def calculate_percentage(numerator: float, denominator: float) -> str:
"""Calculate percentage as a string.
Args:
numerator: Numerator value
denominator: Denominator value
Returns:
A formatted percentage string (e.g., '75.00%').
"""
if denominator == 0:
return "0.00%"
return f"{(numerator/denominator)*100:.2f}%"
agent = ComputerAgent(
model="openai/computer-use-preview",
tools=[computer, calculate_percentage],
)
```
- See `docs/agent-sdk/custom-tools` for authoring function tools.
- See `docs/agent-sdk/custom-computer-handlers` for building full computer interfaces.
## 3) Intermediate: Callbacks
Callbacks provide lifecycle hooks to preprocess messages, postprocess outputs, record trajectories, manage costs, and more.
```python
from agent.callbacks import ImageRetentionCallback, TrajectorySaverCallback, BudgetManagerCallback
agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022",
tools=[computer],
callbacks=[
ImageRetentionCallback(only_n_most_recent_images=3),
TrajectorySaverCallback("./trajectories"),
BudgetManagerCallback(max_budget=10.0, raise_error=True),
],
)
```
- Browse callback implementations in `libs/python/agent/agent/callbacks/`.
## 4) Expert: Custom `@register_agent`
Build your own agent configuration class to control prompting, message shaping, and tool handling. This is the most flexible option for specialized domains.
- Register your own `model=...` loop using `@register_agent`
- Browse implementations in `libs/python/agent/agent/loops/`.
- Implement `predict_step()` (and optionally `predict_click()`) and return the standardized output schema.
```python
from agent.decorators import register_agent
@register_agent(models=r".*my-special-model.*", priority=10)
class MyCustomAgentConfig:
async def predict_step(self, messages, model, tools, **kwargs):
# 1) Format messages for your provider
# 2) Call provider
# 3) Convert responses to the agent output schema
return {"output": [], "usage": {}}
async def predict_click(self, model, image_b64, instruction):
# Optional: click-only capability
return None
def get_capabilities(self):
return ["step"]
```
## HUD integration (optional)
When using the HUD evaluation integration (`agent/integrations/hud/`), you can pass `instructions`, `tools`, and `callbacks` directly:
```python
from agent.integrations.hud import run_single_task
await run_single_task(
dataset="username/dataset-name",
model="openai/computer-use-preview",
instructions="Operate carefully. Always verify on-screen text before actions.",
# tools=[your_custom_function],
# callbacks=[YourCustomCallback()],
)
```
@@ -3,6 +3,8 @@ title: HUD Evals
description: Use ComputerAgent with HUD for benchmarking and evaluation
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/eval_osworld.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
The HUD integration allows an agent to be benchmarked using the [HUD framework](https://www.hud.so/). Through the HUD integration, the agent controls a computer inside HUD, where tests are run to evaluate the success of each task.
## Installation
@@ -76,7 +78,7 @@ results = await run_full_dataset(
- `max_steps` (`int`): Default: `50`
Safety cap on steps per task to prevent infinite loops.
- `split` (`str`): Default: `"train"`
Dataset split or subset (e.g., `"train[:10]"`).
Dataset split or subset to run. Uses the [Hugging Face split format](https://huggingface.co/docs/datasets/v1.11.0/splits.html), e.g., `"train[:10]"` for the first 10 tasks.
## Additional Parameters
@@ -0,0 +1,201 @@
---
title: Message Format
---
This page documents the Python message and response schema used by the Agent SDK.
It mirrors the structure shown in Chat History and provides precise type definitions you can target in your own code.
All examples below use Python type hints with `TypedDict` and `Literal` from the standard `typing` module.
## Response
The agent yields response chunks as an async generator of objects with `output` and `usage`.
```python
from typing import List, TypedDict
class Usage(TypedDict, total=False):
prompt_tokens: int
completion_tokens: int
total_tokens: int
response_cost: float # USD cost if available
class AgentResponse(TypedDict):
output: List["AgentMessage"]
usage: Usage
```
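For context, a minimal consumption sketch (assuming an already-constructed `agent` and `messages` list; field names follow the schema above):
```python
async def print_run(agent, messages) -> None:
    """Iterate over response chunks from ComputerAgent.run() and tally usage."""
    total_cost = 0.0
    async for chunk in agent.run(messages):  # each chunk carries "output" and "usage"
        for item in chunk["output"]:
            print(item.get("type"))  # e.g. "reasoning", "computer_call", "message"
        total_cost += chunk["usage"].get("response_cost", 0.0)
    print(f"approximate cost: ${total_cost:.4f}")

# Run with: asyncio.run(print_run(agent, messages)) once you have a configured agent and messages.
```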
## Messages
Agent messages represent the state of the conversation and the agent's actions.
```python
from typing import List, Literal, Optional, TypedDict, Union
# Union of all message variants
AgentMessage = Union[
"UserMessage",
"AssistantMessage",
"ReasoningMessage",
"ComputerCallMessage",
"ComputerCallOutputMessage",
"FunctionCallMessage",
"FunctionCallOutputMessage",
]
# Input message (role: user/system/developer)
class UserMessage(TypedDict, total=False):
type: Literal["message"] # optional for user input
role: Literal["user", "system", "developer"]
content: Union[str, List["InputContent"]]
# Output message (assistant text)
class AssistantMessage(TypedDict):
type: Literal["message"]
role: Literal["assistant"]
content: List["OutputContent"]
# Output reasoning/thinking message
class ReasoningMessage(TypedDict):
type: Literal["reasoning"]
summary: List["SummaryContent"]
# Output computer action call (agent intends to act)
class ComputerCallMessage(TypedDict):
type: Literal["computer_call"]
call_id: str
status: Literal["completed", "failed", "pending"]
action: "ComputerAction"
# Output computer action result (always a screenshot)
class ComputerCallOutputMessage(TypedDict):
type: Literal["computer_call_output"]
call_id: str
output: "ComputerResultContent"
# Output function call (agent calls a Python tool)
class FunctionCallMessage(TypedDict):
type: Literal["function_call"]
call_id: str
status: Literal["completed", "failed", "pending"]
name: str
arguments: str # JSON-serialized kwargs
# Output function call result (text)
class FunctionCallOutputMessage(TypedDict):
type: Literal["function_call_output"]
call_id: str
output: str
```
## Message Content
These content items appear inside `content` arrays for the message types above.
```python
# Input content kinds
class InputContent(TypedDict):
type: Literal["input_image", "input_text"]
text: Optional[str]
image_url: Optional[str] # e.g., data URL
# Assistant output content
class OutputContent(TypedDict):
type: Literal["output_text"]
text: str
# Reasoning/summary output content
class SummaryContent(TypedDict):
type: Literal["summary_text"]
text: str
# Computer call outputs (screenshots)
class ComputerResultContent(TypedDict):
type: Literal["computer_screenshot", "input_image"]
image_url: str # data URL (e.g., "data:image/png;base64,....")
```
## Actions
Computer actions represent concrete operations the agent will perform on the computer.
Two broad families exist depending on the provider: OpenAI-style and Anthropic-style.
```python
# Union of all supported computer actions
ComputerAction = Union[
"ClickAction",
"DoubleClickAction",
"DragAction",
"KeyPressAction",
"MoveAction",
"ScreenshotAction",
"ScrollAction",
"TypeAction",
"WaitAction",
# Anthropic variants
"LeftMouseDownAction",
"LeftMouseUpAction",
]
# OpenAI Computer Actions
class ClickAction(TypedDict):
type: Literal["click"]
button: Literal["left", "right", "wheel", "back", "forward"]
x: int
y: int
class DoubleClickAction(TypedDict, total=False):
type: Literal["double_click"]
button: Literal["left", "right", "wheel", "back", "forward"]
x: int
y: int
class DragAction(TypedDict, total=False):
type: Literal["drag"]
button: Literal["left", "right", "wheel", "back", "forward"]
path: List[tuple[int, int]] # [(x1, y1), (x2, y2), ...]
class KeyPressAction(TypedDict):
type: Literal["keypress"]
keys: List[str] # e.g., ["ctrl", "a"]
class MoveAction(TypedDict):
type: Literal["move"]
x: int
y: int
class ScreenshotAction(TypedDict):
type: Literal["screenshot"]
class ScrollAction(TypedDict):
type: Literal["scroll"]
scroll_x: int
scroll_y: int
x: int
y: int
class TypeAction(TypedDict):
type: Literal["type"]
text: str
class WaitAction(TypedDict):
type: Literal["wait"]
# Anthropic Computer Actions
class LeftMouseDownAction(TypedDict):
type: Literal["left_mouse_down"]
x: int
y: int
class LeftMouseUpAction(TypedDict):
type: Literal["left_mouse_up"]
x: int
y: int
```
## Notes
- The agent runtime may add provider-specific fields when available (e.g., usage cost). Unknown fields should be ignored for forward compatibility.
- Computer action outputs are screenshots as data URLs. For security and storage, some serializers may redact or omit large fields in persisted metadata.
- The message flow typically alternates between reasoning, actions, screenshots, and concluding assistant text. See [Chat History](./chat-history) for a step-by-step example.
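To make that flow concrete, a hand-written (illustrative, not captured from a real run) sequence for a single step might look like:
```python
example_step = [
    {"type": "reasoning",
     "summary": [{"type": "summary_text", "text": "I need to open the settings menu."}]},
    {"type": "computer_call", "call_id": "call_1", "status": "completed",
     "action": {"type": "click", "button": "left", "x": 512, "y": 64}},
    {"type": "computer_call_output", "call_id": "call_1",
     "output": {"type": "computer_screenshot", "image_url": "data:image/png;base64,..."}},
    {"type": "message", "role": "assistant",
     "content": [{"type": "output_text", "text": "Settings menu is open."}]},
]
```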
+2
View File
@@ -6,6 +6,8 @@
"supported-agents",
"supported-model-providers",
"chat-history",
"message-format",
"customizing-computeragent",
"callbacks",
"custom-tools",
"custom-computer-handlers",
@@ -3,6 +3,8 @@ title: Cua Computers
description: Understanding cua computer types and connection methods
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/computer_nb.ipynb" target="_blank">Jupyter Notebook</a> and <a href="https://github.com/trycua/cua/tree/main/examples/computer-example-ts" target="_blank">NodeJS project</a> are available for this documentation.</Callout>
Before we can automate apps using AI, we need to first connect to a Computer Server to give the AI a safe environment to execute workflows in.
Cua Computers are preconfigured virtual machines running the Computer Server. They can be either macOS, Linux, or Windows. They're found in either a cloud-native container, or on your host desktop.
@@ -3,6 +3,8 @@ title: Sandboxed Python
slug: sandboxed-python
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/examples/sandboxed_functions_examples.py" target="_blank">Python example</a> is available for this documentation.</Callout>
You can run Python functions securely inside a sandboxed virtual environment on a remote Cua Computer. This is useful for executing untrusted user code, isolating dependencies, or providing a safe environment for automation tasks.
## How It Works
@@ -6,6 +6,8 @@ github:
- https://github.com/trycua/cua/tree/main/libs/python/computer-server
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/notebooks/computer_server_nb.ipynb" target="_blank">Jupyter Notebook</a> is available for this documentation.</Callout>
The Computer Server API reference documentation is currently under development.
## Overview
@@ -6,6 +6,8 @@ github:
- https://github.com/trycua/cua/tree/main/libs/python/som
---
<Callout>A corresponding <a href="https://github.com/trycua/cua/blob/main/examples/som_examples.py" target="_blank">Python example</a> is available for this documentation.</Callout>
## Overview
The SOM library provides visual element detection and interaction capabilities. It is based on the [Set-of-Mark](https://arxiv.org/abs/2310.11441) research paper and the [OmniParser](https://github.com/microsoft/OmniParser) model.
+6
View File
@@ -18,6 +18,12 @@ gnome-screenshot wmctrl ffmpeg socat xclip
RUN pip install cua-computer-server
# Install Firefox
ENV DEBIAN_FRONTEND=noninteractive \
INST_DIR=$STARTUPDIR/install
COPY ./src/ $INST_DIR
RUN bash ${INST_DIR}/ubuntu/install/firefox/install_firefox.sh
# Disable SSL requirement
RUN sed -i 's/require_ssl: true/require_ssl: false/g' /usr/share/kasmvnc/kasmvnc_defaults.yaml
RUN sed -i 's/-sslOnly//g' /dockerstartup/vnc_startup.sh
+24
View File
@@ -0,0 +1,24 @@
# LICENSE
MIT License
Copyright (c) 2025 Cua AI, Inc.
Portions Copyright (c) 2022 Kasm Technologies Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -ex
START_COMMAND="firefox"
PGREP="firefox"
export MAXIMIZE="true"
export MAXIMIZE_NAME="Mozilla Firefox"
MAXIMIZE_SCRIPT=$STARTUPDIR/maximize_window.sh
DEFAULT_ARGS=""
ARGS=${APP_ARGS:-$DEFAULT_ARGS}
options=$(getopt -o gau: -l go,assign,url: -n "$0" -- "$@") || exit
eval set -- "$options"
while [[ $1 != -- ]]; do
case $1 in
-g|--go) GO='true'; shift 1;;
-a|--assign) ASSIGN='true'; shift 1;;
-u|--url) OPT_URL=$2; shift 2;;
*) echo "bad option: $1" >&2; exit 1;;
esac
done
shift
# Process non-option arguments.
for arg; do
echo "arg! $arg"
done
FORCE=$2
# run with vgl if GPU is available
if [ -f /opt/VirtualGL/bin/vglrun ] && [ ! -z "${KASM_EGL_CARD}" ] && [ ! -z "${KASM_RENDERD}" ] && [ -O "${KASM_RENDERD}" ] && [ -O "${KASM_EGL_CARD}" ] ; then
START_COMMAND="/opt/VirtualGL/bin/vglrun -d ${KASM_EGL_CARD} $START_COMMAND"
fi
kasm_exec() {
if [ -n "$OPT_URL" ] ; then
URL=$OPT_URL
elif [ -n "$1" ] ; then
URL=$1
fi
# Since we are execing into a container that already has the browser running from startup,
# when we don't have a URL to open we want to do nothing. Otherwise a second browser instance would open.
if [ -n "$URL" ] ; then
/usr/bin/filter_ready
/usr/bin/desktop_ready
bash ${MAXIMIZE_SCRIPT} &
$START_COMMAND $ARGS $OPT_URL
else
echo "No URL specified for exec command. Doing nothing."
fi
}
kasm_startup() {
if [ -n "$KASM_URL" ] ; then
URL=$KASM_URL
elif [ -z "$URL" ] ; then
URL=$LAUNCH_URL
fi
if [ -z "$DISABLE_CUSTOM_STARTUP" ] || [ -n "$FORCE" ] ; then
echo "Entering process startup loop"
set +x
while true
do
if ! pgrep -x $PGREP > /dev/null
then
/usr/bin/filter_ready
/usr/bin/desktop_ready
set +e
bash ${MAXIMIZE_SCRIPT} &
$START_COMMAND $ARGS $URL
set -e
fi
sleep 1
done
set -x
fi
}
if [ -n "$GO" ] || [ -n "$ASSIGN" ] ; then
kasm_exec
else
kasm_startup
fi
@@ -0,0 +1,221 @@
[Desktop Entry]
Version=1.0
Name=Firefox Web Browser
Name[ar]=متصفح الويب فَيَرفُكْس
Name[ast]=Restolador web Firefox
Name[bn]=ফায়ারফক্স ওয়েব ব্রাউজার
Name[ca]=Navegador web Firefox
Name[cs]=Firefox Webový prohlížeč
Name[da]=Firefox - internetbrowser
Name[el]=Περιηγητής Firefox
Name[es]=Navegador web Firefox
Name[et]=Firefoxi veebibrauser
Name[fa]=مرورگر اینترنتی Firefox
Name[fi]=Firefox-selain
Name[fr]=Navigateur Web Firefox
Name[gl]=Navegador web Firefox
Name[he]=דפדפן האינטרנט Firefox
Name[hr]=Firefox web preglednik
Name[hu]=Firefox webböngésző
Name[it]=Firefox Browser Web
Name[ja]=Firefox ウェブ・ブラウザ
Name[ko]=Firefox 웹 브라우저
Name[ku]=Geroka torê Firefox
Name[lt]=Firefox interneto naršyklė
Name[nb]=Firefox Nettleser
Name[nl]=Firefox webbrowser
Name[nn]=Firefox Nettlesar
Name[no]=Firefox Nettleser
Name[pl]=Przeglądarka WWW Firefox
Name[pt]=Firefox Navegador Web
Name[pt_BR]=Navegador Web Firefox
Name[ro]=Firefox Navigator Internet
Name[ru]=Веб-браузер Firefox
Name[sk]=Firefox - internetový prehliadač
Name[sl]=Firefox spletni brskalnik
Name[sv]=Firefox webbläsare
Name[tr]=Firefox Web Tarayıcısı
Name[ug]=Firefox توركۆرگۈ
Name[uk]=Веб-браузер Firefox
Name[vi]=Trình duyệt web Firefox
Name[zh_CN]=Firefox 网络浏览器
Name[zh_TW]=Firefox 網路瀏覽器
Comment=Browse the World Wide Web
Comment[ar]=تصفح الشبكة العنكبوتية العالمية
Comment[ast]=Restola pela Rede
Comment[bn]=ইন্টারনেট ব্রাউজ করুন
Comment[ca]=Navegueu per la web
Comment[cs]=Prohlížení stránek World Wide Webu
Comment[da]=Surf på internettet
Comment[de]=Im Internet surfen
Comment[el]=Μπορείτε να περιηγηθείτε στο διαδίκτυο (Web)
Comment[es]=Navegue por la web
Comment[et]=Lehitse veebi
Comment[fa]=صفحات شبکه جهانی اینترنت را مرور نمایید
Comment[fi]=Selaa Internetin WWW-sivuja
Comment[fr]=Naviguer sur le Web
Comment[gl]=Navegar pola rede
Comment[he]=גלישה ברחבי האינטרנט
Comment[hr]=Pretražite web
Comment[hu]=A világháló böngészése
Comment[it]=Esplora il web
Comment[ja]=ウェブを閲覧します
Comment[ko]=웹을 돌아 다닙니다
Comment[ku]=Li torê bigere
Comment[lt]=Naršykite internete
Comment[nb]=Surf på nettet
Comment[nl]=Verken het internet
Comment[nn]=Surf på nettet
Comment[no]=Surf på nettet
Comment[pl]=Przeglądanie stron WWW
Comment[pt]=Navegue na Internet
Comment[pt_BR]=Navegue na Internet
Comment[ro]=Navigați pe Internet
Comment[ru]=Доступ в Интернет
Comment[sk]=Prehliadanie internetu
Comment[sl]=Brskajte po spletu
Comment[sv]=Surfa på webben
Comment[tr]=İnternet'te Gezinin
Comment[ug]=دۇنيادىكى توربەتلەرنى كۆرگىلى بولىدۇ
Comment[uk]=Перегляд сторінок Інтернету
Comment[vi]=Để duyệt các trang web
Comment[zh_CN]=浏览互联网
Comment[zh_TW]=瀏覽網際網路
GenericName=Web Browser
GenericName[ar]=متصفح ويب
GenericName[ast]=Restolador Web
GenericName[bn]=ওয়েব ব্রাউজার
GenericName[ca]=Navegador web
GenericName[cs]=Webový prohlížeč
GenericName[da]=Webbrowser
GenericName[el]=Περιηγητής διαδικτύου
GenericName[es]=Navegador web
GenericName[et]=Veebibrauser
GenericName[fa]=مرورگر اینترنتی
GenericName[fi]=WWW-selain
GenericName[fr]=Navigateur Web
GenericName[gl]=Navegador Web
GenericName[he]=דפדפן אינטרנט
GenericName[hr]=Web preglednik
GenericName[hu]=Webböngésző
GenericName[it]=Browser web
GenericName[ja]=ウェブ・ブラウザ
GenericName[ko]=웹 브라우저
GenericName[ku]=Geroka torê
GenericName[lt]=Interneto naršyklė
GenericName[nb]=Nettleser
GenericName[nl]=Webbrowser
GenericName[nn]=Nettlesar
GenericName[no]=Nettleser
GenericName[pl]=Przeglądarka WWW
GenericName[pt]=Navegador Web
GenericName[pt_BR]=Navegador Web
GenericName[ro]=Navigator Internet
GenericName[ru]=Веб-браузер
GenericName[sk]=Internetový prehliadač
GenericName[sl]=Spletni brskalnik
GenericName[sv]=Webbläsare
GenericName[tr]=Web Tarayıcı
GenericName[ug]=توركۆرگۈ
GenericName[uk]=Веб-браузер
GenericName[vi]=Trình duyệt Web
GenericName[zh_CN]=网络浏览器
GenericName[zh_TW]=網路瀏覽器
Keywords=Internet;WWW;Browser;Web;Explorer
Keywords[ar]=انترنت;إنترنت;متصفح;ويب;وب
Keywords[ast]=Internet;WWW;Restolador;Web;Esplorador
Keywords[ca]=Internet;WWW;Navegador;Web;Explorador;Explorer
Keywords[cs]=Internet;WWW;Prohlížeč;Web;Explorer
Keywords[da]=Internet;Internettet;WWW;Browser;Browse;Web;Surf;Nettet
Keywords[de]=Internet;WWW;Browser;Web;Explorer;Webseite;Site;surfen;online;browsen
Keywords[el]=Internet;WWW;Browser;Web;Explorer;Διαδίκτυο;Περιηγητής;Firefox;Φιρεφοχ;Ιντερνετ
Keywords[es]=Explorador;Internet;WWW
Keywords[fi]=Internet;WWW;Browser;Web;Explorer;selain;Internet-selain;internetselain;verkkoselain;netti;surffaa
Keywords[fr]=Internet;WWW;Browser;Web;Explorer;Fureteur;Surfer;Navigateur
Keywords[he]=דפדפן;אינטרנט;רשת;אתרים;אתר;פיירפוקס;מוזילה;
Keywords[hr]=Internet;WWW;preglednik;Web
Keywords[hu]=Internet;WWW;Böngésző;Web;Háló;Net;Explorer
Keywords[it]=Internet;WWW;Browser;Web;Navigatore
Keywords[is]=Internet;WWW;Vafri;Vefur;Netvafri;Flakk
Keywords[ja]=Internet;WWW;Web;インターネット;ブラウザ;ウェブ;エクスプローラ
Keywords[nb]=Internett;WWW;Nettleser;Explorer;Web;Browser;Nettside
Keywords[nl]=Internet;WWW;Browser;Web;Explorer;Verkenner;Website;Surfen;Online
Keywords[pt]=Internet;WWW;Browser;Web;Explorador;Navegador
Keywords[pt_BR]=Internet;WWW;Browser;Web;Explorador;Navegador
Keywords[ru]=Internet;WWW;Browser;Web;Explorer;интернет;браузер;веб;файрфокс;огнелис
Keywords[sk]=Internet;WWW;Prehliadač;Web;Explorer
Keywords[sl]=Internet;WWW;Browser;Web;Explorer;Brskalnik;Splet
Keywords[tr]=İnternet;WWW;Tarayıcı;Web;Gezgin;Web sitesi;Site;sörf;çevrimiçi;tara
Keywords[uk]=Internet;WWW;Browser;Web;Explorer;Інтернет;мережа;переглядач;оглядач;браузер;веб;файрфокс;вогнелис;перегляд
Keywords[vi]=Internet;WWW;Browser;Web;Explorer;Trình duyệt;Trang web
Keywords[zh_CN]=Internet;WWW;Browser;Web;Explorer;网页;浏览;上网;火狐;Firefox;ff;互联网;网站;
Keywords[zh_TW]=Internet;WWW;Browser;Web;Explorer;網際網路;網路;瀏覽器;上網;網頁;火狐
Exec=firefox %u
Terminal=false
X-MultipleArgs=false
Type=Application
Icon=/usr/lib/firefox/browser/chrome/icons/default/default128.png
Categories=GNOME;GTK;Network;WebBrowser;
MimeType=text/html;text/xml;application/xhtml+xml;application/xml;application/rss+xml;application/rdf+xml;image/gif;image/jpeg;image/png;x-scheme-handler/http;x-scheme-handler/https;x-scheme-handler/ftp;x-scheme-handler/chrome;video/webm;application/x-xpinstall;
StartupNotify=true
Actions=NewWindow;NewPrivateWindow;
[Desktop Action NewWindow]
Name=Open a New Window
Name[ar]=افتح نافذة جديدة
Name[ast]=Abrir una ventana nueva
Name[bn]=Abrir una ventana nueva
Name[ca]=Obre una finestra nova
Name[cs]=Otevřít nové okno
Name[da]=Åbn et nyt vindue
Name[de]=Ein neues Fenster öffnen
Name[el]=Άνοιγμα νέου παραθύρου
Name[es]=Abrir una ventana nueva
Name[fi]=Avaa uusi ikkuna
Name[fr]=Ouvrir une nouvelle fenêtre
Name[gl]=Abrir unha nova xanela
Name[he]=פתיחת חלון חדש
Name[hr]=Otvori novi prozor
Name[hu]=Új ablak nyitása
Name[it]=Apri una nuova finestra
Name[ja]=新しいウィンドウを開く
Name[ko]=새 창 열기
Name[ku]=Paceyeke nû veke
Name[lt]=Atverti naują langą
Name[nb]=Åpne et nytt vindu
Name[nl]=Nieuw venster openen
Name[pt]=Abrir nova janela
Name[pt_BR]=Abrir nova janela
Name[ro]=Deschide o fereastră nouă
Name[ru]=Новое окно
Name[sk]=Otvoriť nové okno
Name[sl]=Odpri novo okno
Name[sv]=Öppna ett nytt fönster
Name[tr]=Yeni pencere aç
Name[ug]=يېڭى كۆزنەك ئېچىش
Name[uk]=Відкрити нове вікно
Name[vi]=Mở cửa sổ mới
Name[zh_CN]=新建窗口
Name[zh_TW]=開啟新視窗
Exec=firefox -new-window
OnlyShowIn=Unity;
[Desktop Action NewPrivateWindow]
Name=Open a New Private Window
Name[ar]=افتح نافذة جديدة للتصفح الخاص
Name[ca]=Obre una finestra nova en mode d'incògnit
Name[de]=Ein neues privates Fenster öffnen
Name[es]=Abrir una ventana privada nueva
Name[fi]=Avaa uusi yksityinen ikkuna
Name[fr]=Ouvrir une nouvelle fenêtre de navigation privée
Name[he]=פתיחת חלון גלישה פרטית חדש
Name[hu]=Új privát ablak nyitása
Name[it]=Apri una nuova finestra anonima
Name[nb]=Åpne et nytt privat vindu
Name[ru]=Новое приватное окно
Name[sl]=Odpri novo okno zasebnega brskanja
Name[tr]=Yeni bir pencere aç
Name[uk]=Відкрити нове вікно у потайливому режимі
Name[zh_TW]=開啟新隱私瀏覽視窗
Exec=firefox -private-window
OnlyShowIn=Unity;
@@ -0,0 +1,236 @@
#!/usr/bin/env bash
set -xe
# Add icon
if [ -f /dockerstartup/install/ubuntu/install/firefox/firefox.desktop ]; then
mv /dockerstartup/install/ubuntu/install/firefox/firefox.desktop $HOME/Desktop/
fi
ARCH=$(arch | sed 's/aarch64/arm64/g' | sed 's/x86_64/amd64/g')
set_desktop_icon() {
sed -i -e 's!Icon=.\+!Icon=/usr/share/icons/hicolor/48x48/apps/firefox.png!' "$HOME/Desktop/firefox.desktop"
}
echo "Install Firefox"
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
dnf install -y firefox p11-kit
elif [ "${DISTRO}" == "opensuse" ]; then
zypper install -yn p11-kit-tools MozillaFirefox
elif grep -q Jammy /etc/os-release || grep -q Noble /etc/os-release; then
if [ ! -f '/etc/apt/preferences.d/mozilla-firefox' ]; then
add-apt-repository -y ppa:mozillateam/ppa
echo '
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
' > /etc/apt/preferences.d/mozilla-firefox
fi
apt-get install -y firefox p11-kit-modules
elif grep -q "ID=kali" /etc/os-release; then
apt-get update
apt-get install -y firefox-esr p11-kit-modules
rm -f $HOME/Desktop/firefox.desktop
cp \
/usr/share/applications/firefox-esr.desktop \
$HOME/Desktop/
chmod +x $HOME/Desktop/firefox-esr.desktop
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
if [ "${ARCH}" == "amd64" ]; then
install -d -m 0755 /etc/apt/keyrings
wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg -O- > /etc/apt/keyrings/packages.mozilla.org.asc
echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" > /etc/apt/sources.list.d/mozilla.list
echo '
Package: *
Pin: origin packages.mozilla.org
Pin-Priority: 1000
' > /etc/apt/preferences.d/mozilla
apt-get update
apt-get install -y firefox p11-kit-modules
else
apt-get update
apt-get install -y firefox-esr p11-kit-modules
rm -f $HOME/Desktop/firefox.desktop
cp \
/usr/share/applications/firefox-esr.desktop \
$HOME/Desktop/
chmod +x $HOME/Desktop/firefox-esr.desktop
fi
else
apt-mark unhold firefox || :
apt-get remove firefox
apt-get update
apt-get install -y firefox p11-kit-modules
fi
# Add Langpacks
FIREFOX_VERSION=$(curl -sI https://download.mozilla.org/?product=firefox-latest | awk -F '(releases/|/win32)' '/Location/ {print $2}')
RELEASE_URL="https://releases.mozilla.org/pub/firefox/releases/${FIREFOX_VERSION}/win64/xpi/"
LANGS=$(curl -Ls ${RELEASE_URL} | awk -F '(xpi">|</a>)' '/href.*xpi/ {print $2}' | tr '\n' ' ')
EXTENSION_DIR=/usr/lib/firefox-addons/distribution/extensions/
mkdir -p ${EXTENSION_DIR}
for LANG in ${LANGS}; do
LANGCODE=$(echo ${LANG} | sed 's/\.xpi//g')
echo "Downloading ${LANG} Language pack"
curl -o \
${EXTENSION_DIR}langpack-${LANGCODE}@firefox.mozilla.org.xpi -Ls \
${RELEASE_URL}${LANG}
done
# Cleanup and install flash if supported
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
if [ -z ${SKIP_CLEAN+x} ]; then
dnf clean all
fi
elif [ "${DISTRO}" == "opensuse" ]; then
if [ -z ${SKIP_CLEAN+x} ]; then
zypper clean --all
fi
else
if [ "$ARCH" == "arm64" ] && [ "$(lsb_release -cs)" == "focal" ] ; then
echo "Firefox flash player not supported on arm64 Ubuntu Focal Skipping"
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=kali" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
echo "Firefox flash player not supported on Debian"
elif grep -q Focal /etc/os-release; then
# Plugin to support running flash videos for sites like vimeo
apt-get update
apt-get install -y browser-plugin-freshplayer-pepperflash
apt-mark hold firefox
if [ -z ${SKIP_CLEAN+x} ]; then
apt-get autoclean
rm -rf \
/var/lib/apt/lists/* \
/var/tmp/*
fi
fi
fi
if [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
# Update firefox to utilize the system certificate store instead of the one that ships with firefox
if grep -q "ID=debian" /etc/os-release || grep -q "ID=kali" /etc/os-release || grep -q "ID=parrot" /etc/os-release && [ "${ARCH}" == "arm64" ]; then
rm -f /usr/lib/firefox-esr/libnssckbi.so
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox-esr/libnssckbi.so
elif grep -q "ID=kali" /etc/os-release && [ "${ARCH}" == "amd64" ]; then
rm -f /usr/lib/firefox-esr/libnssckbi.so
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox-esr/libnssckbi.so
else
rm -f /usr/lib/firefox/libnssckbi.so
ln /usr/lib/$(arch)-linux-gnu/pkcs11/p11-kit-trust.so /usr/lib/firefox/libnssckbi.so
fi
fi
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|fedora39|fedora40) ]]; then
if [[ "${DISTRO}" == @(fedora39|fedora40) ]]; then
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/firefox-redhat-default-prefs.js
else
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/all-redhat.js
fi
sed -i -e '/homepage/d' "$preferences_file"
elif [ "${DISTRO}" == "opensuse" ]; then
preferences_file=/usr/lib64/firefox/browser/defaults/preferences/firefox.js
elif grep -q "ID=kali" /etc/os-release; then
preferences_file=/usr/lib/firefox-esr/defaults/pref/firefox.js
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
if [ "${ARCH}" == "amd64" ]; then
preferences_file=/usr/lib/firefox/defaults/pref/firefox.js
else
preferences_file=/usr/lib/firefox-esr/defaults/pref/firefox.js
fi
else
preferences_file=/usr/lib/firefox/browser/defaults/preferences/firefox.js
fi
# Disabling default first run URL for Debian based images
if [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
cat >"$preferences_file" <<EOF
pref("datareporting.policy.firstRunURL", "");
pref("datareporting.policy.dataSubmissionEnabled", false);
pref("datareporting.healthreport.service.enabled", false);
pref("datareporting.healthreport.uploadEnabled", false);
pref("trailhead.firstrun.branches", "nofirstrun-empty");
pref("browser.aboutwelcome.enabled", false);
EOF
fi
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
# Creating a default profile
chown -R root:root $HOME
firefox -headless -CreateProfile "kasm $HOME/.mozilla/firefox/kasm"
# Generate a certdb to be detected on squid start
HOME=/root firefox --headless &
mkdir -p /root/.mozilla
CERTDB=$(find /root/.mozilla* -name "cert9.db")
while [ -z "${CERTDB}" ] ; do
sleep 1
echo "waiting for certdb"
CERTDB=$(find /root/.mozilla* -name "cert9.db")
done
sleep 2
kill $(pgrep firefox)
CERTDIR=$(dirname ${CERTDB})
mv ${CERTDB} $HOME/.mozilla/firefox/kasm/
rm -Rf /root/.mozilla
else
# Creating Default Profile
chown -R 0:0 $HOME
firefox -headless -CreateProfile "kasm $HOME/.mozilla/firefox/kasm"
fi
# Silence Firefox security nag "Some of Firefox's features may offer less protection on your current operating system".
echo 'user_pref("security.sandbox.warn_unprivileged_namespaces", false);' > $HOME/.mozilla/firefox/kasm/user.js
chown 1000:1000 $HOME/.mozilla/firefox/kasm/user.js
if [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
set_desktop_icon
fi
# Starting with version 67, Firefox creates a unique profile mapping per installation which is hash generated
# based off the installation path. Because that path will be static for our deployments we can assume the hash
# and thus assign our profile to the default for the installation
if grep -q "ID=kali" /etc/os-release; then
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
[Install3B6073811A6ABF12]
Default=kasm
Locked=1
EOL
elif grep -q "ID=debian" /etc/os-release || grep -q "ID=parrot" /etc/os-release; then
if [ "${ARCH}" != "amd64" ]; then
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
[Install3B6073811A6ABF12]
Default=kasm
Locked=1
EOL
else
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
[Install4F96D1932A9F858E]
Default=kasm
Locked=1
EOL
fi
elif [[ "${DISTRO}" != @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
[Install4F96D1932A9F858E]
Default=kasm
Locked=1
EOL
elif [[ "${DISTRO}" == @(oracle8|rockylinux9|rockylinux8|oracle9|rhel9|almalinux9|almalinux8|opensuse|fedora39|fedora40) ]]; then
cat >>$HOME/.mozilla/firefox/profiles.ini <<EOL
[Install11457493C5A56847]
Default=kasm
Locked=1
EOL
fi
# Desktop Icon Fixes
if [[ "${DISTRO}" == @(rockylinux9|oracle9|rhel9|almalinux9|fedora39|fedora40) ]]; then
sed -i 's#Icon=/usr/lib/firefox#Icon=/usr/lib64/firefox#g' $HOME/Desktop/firefox.desktop
fi
# Cleanup for app layer
chown -R 1000:0 $HOME
find /usr/share/ -name "icon-theme.cache" -exec rm -f {} \;
if [ -f $HOME/Desktop/firefox.desktop ]; then
chmod +x $HOME/Desktop/firefox.desktop
fi
chown -R 1000:1000 $HOME/.mozilla
+9 -1
View File
@@ -31,7 +31,8 @@ from .callbacks import (
TrajectorySaverCallback,
BudgetManagerCallback,
TelemetryCallback,
OperatorNormalizerCallback
OperatorNormalizerCallback,
PromptInstructionsCallback,
)
from .computers import (
AsyncComputerHandler,
@@ -162,6 +163,7 @@ class ComputerAgent:
custom_loop: Optional[Callable] = None,
only_n_most_recent_images: Optional[int] = None,
callbacks: Optional[List[Any]] = None,
instructions: Optional[str] = None,
verbosity: Optional[int] = None,
trajectory_dir: Optional[str | Path | dict] = None,
max_retries: Optional[int] = 3,
@@ -181,6 +183,7 @@ class ComputerAgent:
custom_loop: Custom agent loop function to use instead of auto-selection
only_n_most_recent_images: If set, only keep the N most recent images in message history. Adds ImageRetentionCallback automatically.
callbacks: List of AsyncCallbackHandler instances for preprocessing/postprocessing
instructions: Optional system instructions to be passed to the model
verbosity: Logging level (logging.DEBUG, logging.INFO, etc.). If set, adds LoggingCallback automatically
trajectory_dir: If set, saves trajectory data (screenshots, responses) to this directory. Adds TrajectorySaverCallback automatically.
max_retries: Maximum number of retries for failed API calls
@@ -200,6 +203,7 @@ class ComputerAgent:
self.custom_loop = custom_loop
self.only_n_most_recent_images = only_n_most_recent_images
self.callbacks = callbacks or []
self.instructions = instructions
self.verbosity = verbosity
self.trajectory_dir = trajectory_dir
self.max_retries = max_retries
@@ -214,6 +218,10 @@ class ComputerAgent:
# Prepend operator normalizer callback
self.callbacks.insert(0, OperatorNormalizerCallback())
# Add prompt instructions callback if provided
if self.instructions:
self.callbacks.append(PromptInstructionsCallback(self.instructions))
# Add telemetry callback if telemetry_enabled is set
if self.telemetry_enabled:
if isinstance(self.telemetry_enabled, bool):
@@ -9,6 +9,7 @@ from .trajectory_saver import TrajectorySaverCallback
from .budget_manager import BudgetManagerCallback
from .telemetry import TelemetryCallback
from .operator_validator import OperatorNormalizerCallback
from .prompt_instructions import PromptInstructionsCallback
__all__ = [
"AsyncCallbackHandler",
@@ -18,4 +19,5 @@ __all__ = [
"BudgetManagerCallback",
"TelemetryCallback",
"OperatorNormalizerCallback",
"PromptInstructionsCallback",
]
@@ -0,0 +1,47 @@
"""
Prompt instructions callback.
This callback allows simple prompt engineering by pre-pending a user
instructions message to the start of the conversation before each LLM call.
Usage:
from agent.callbacks import PromptInstructionsCallback
agent = ComputerAgent(
model="openai/computer-use-preview",
callbacks=[PromptInstructionsCallback("Follow these rules...")]
)
"""
from typing import Any, Dict, List, Optional
from .base import AsyncCallbackHandler
class PromptInstructionsCallback(AsyncCallbackHandler):
"""
Prepend a user instructions message to the message list.
This is a minimal, non-invasive way to guide the agent's behavior without
modifying agent loops or tools. It works with any provider/loop since it
only alters the messages array before sending to the model.
"""
def __init__(self, instructions: Optional[str]) -> None:
self.instructions = instructions
async def on_llm_start(self, messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
# Pre-pend instructions message
if not self.instructions:
return messages
# Ensure we don't duplicate if already present at the front
if messages and isinstance(messages[0], dict):
first = messages[0]
if first.get("role") == "user" and first.get("content") == self.instructions:
return messages
return [
{"role": "user", "content": self.instructions},
] + messages
@@ -1,102 +1,28 @@
"""HUD integration: Generic HuggingFace dataset evaluation runner (CUA proxy).
"""HUD integration: dataset runners and MCP-based computer agent export.
This module exposes two helpers to evaluate HUD-compatible datasets using
HUD's OperatorAgent, while proxying model calls through our ComputerAgent via
`FakeAsyncOpenAI` (see `agent/integrations/hud/agent.py`).
This module exposes helpers to evaluate HUD-compatible datasets and exports
the MCP-compatible computer agent implementation.
Exports:
- run_single_task(dataset_name, *, agent_type="cua-proxy", model=None, allowed_tools=None)
- run_full_dataset(dataset_name, *, agent_type="cua-proxy", model=None, allowed_tools=None, max_concurrent=30, max_steps=50)
- run_single_task(dataset, ...)
- run_full_dataset(dataset, ...)
- MCPComputerAgent
"""
import time
from typing import Any, Optional
from PIL import Image
from agent.computers import is_agent_computer
from datasets import load_dataset, Dataset
from hud.agents import OperatorAgent
from hud.datasets import Task, run_dataset
from hud.tools.computer.settings import computer_settings
from hud import trace
from agent.agent import ComputerAgent as BaseComputerAgent
from .proxy import FakeAsyncOpenAI
# ---------------------------------------------------------------------------
# Proxy OperatorAgent
# ---------------------------------------------------------------------------
class ProxyOperatorAgent(OperatorAgent):
"""OperatorAgent that proxies model calls through our ComputerAgent.
Accepts the same config keys we pass via hud.run_dataset `agent_config`:
- model: str | None
- allowed_tools: list[str] | None
Additional kwargs are forwarded to OperatorAgent (if any are supported).
"""
def __init__(
self,
*,
model: str | None = None,
allowed_tools: list[str] | None = None,
trajectory_dir: str | dict | None = None,
# === ComputerAgent kwargs ===
tools: list[Any] | None = None,
custom_loop: Any | None = None,
only_n_most_recent_images: int | None = None,
callbacks: list[Any] | None = None,
verbosity: int | None = None,
max_retries: int | None = 3,
screenshot_delay: float | int = 0.5,
use_prompt_caching: bool | None = False,
max_trajectory_budget: float | dict | None = None,
telemetry_enabled: bool | None = True,
**kwargs: Any,
) -> None:
model = model or "computer-use-preview"
allowed_tools = allowed_tools or ["openai_computer"]
computer_shim = {
'screenshot': lambda: Image.new('RGB', (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)),
'environment': 'linux',
'dimensions': (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)
}
# Build tools ensuring the computer_shim is included
agent_tools: list[Any] = [computer_shim]
if tools:
agent_tools.extend(tools)
computer_agent = BaseComputerAgent(
model=model,
tools=agent_tools,
custom_loop=custom_loop,
only_n_most_recent_images=only_n_most_recent_images,
callbacks=callbacks,
verbosity=verbosity,
trajectory_dir=trajectory_dir,
max_retries=max_retries,
screenshot_delay=screenshot_delay,
use_prompt_caching=use_prompt_caching,
max_trajectory_budget=max_trajectory_budget,
telemetry_enabled=telemetry_enabled,
)
model_client = FakeAsyncOpenAI(computer_agent)
super().__init__(
model_client=model_client, # type: ignore[arg-type]
model=model,
allowed_tools=allowed_tools,
**kwargs,
)
from .agent import MCPComputerAgent
# ---------------------------------------------------------------------------
# Single-task runner
# ---------------------------------------------------------------------------
async def run_single_task(
dataset: str | Dataset | list[dict[str, Any]],
*,
@@ -108,6 +34,7 @@ async def run_single_task(
custom_loop: Any | None = None,
only_n_most_recent_images: int | None = None,
callbacks: list[Any] | None = None,
instructions: str | None = None,
verbosity: int | None = None,
trajectory_dir: str | dict | None = None,
max_retries: int | None = 3,
@@ -116,7 +43,7 @@ async def run_single_task(
max_trajectory_budget: float | dict | None = None,
telemetry_enabled: bool | None = True,
) -> None:
"""Load one task from the dataset and execute it with Operator+CUA proxy."""
"""Load one task from the dataset and execute it with MCPComputerAgent."""
# Load dataset and pick a sample
if isinstance(dataset, str):
@@ -129,17 +56,27 @@ async def run_single_task(
sample_task = dataset[task_id] # type: ignore[index]
task_prompt = sample_task.get("prompt", f"Task {sample_task.get('id', 0)}") # type: ignore[attr-defined]
# Filter any existing Computer tools
# The eval framework will add its own Computer tool per task
if tools:
tools = [
tool
for tool in tools
if not is_agent_computer(tool)
]
with trace(name=task_prompt):
task = Task(**sample_task) # type: ignore[arg-type]
agent = ProxyOperatorAgent(
model=model,
allowed_tools=allowed_tools,
agent = MCPComputerAgent(
model=model or "computer-use-preview",
allowed_tools=allowed_tools or ["openai_computer"],
# === ComputerAgent kwargs passthrough ===
tools=tools,
custom_loop=custom_loop,
only_n_most_recent_images=only_n_most_recent_images,
callbacks=callbacks,
instructions=instructions,
verbosity=verbosity,
trajectory_dir=trajectory_dir,
max_retries=max_retries,
@@ -157,7 +94,6 @@ async def run_single_task(
# Full-dataset runner
# ---------------------------------------------------------------------------
async def run_full_dataset(
dataset: str | Dataset | list[dict[str, Any]],
*,
@@ -173,6 +109,7 @@ async def run_full_dataset(
custom_loop: Any | None = None,
only_n_most_recent_images: int | None = 5,
callbacks: list[Any] | None = None,
instructions: str | None = None,
verbosity: int | None = None,
max_retries: int | None = 3,
screenshot_delay: float | int = 0.5,
@@ -182,9 +119,7 @@ async def run_full_dataset(
) -> list[Any]:
"""Run evaluation across the entire dataset using hud.datasets.run_dataset."""
# We pass OperatorAgent as the class and provide a config that injects our
# FakeAsyncOpenAI per agent instantiation.
# Run with our MCP-based agent class.
if isinstance(dataset, str):
dataset_name = dataset.split('/')[-1]
job_name = job_name or f"Evaluation {dataset_name}"
@@ -193,11 +128,20 @@ async def run_full_dataset(
dataset_name = "custom"
job_name = job_name or f"Evaluation {time.strftime('%H:%M %Y-%m-%d')}"
# Filter any existing Computer tools
# The eval framework will add its own Computer tool per task
if tools:
tools = [
tool
for tool in tools
if not is_agent_computer(tool)
]
# Execute evaluation
return await run_dataset(
name=job_name,
dataset=dataset,
agent_class=ProxyOperatorAgent,
agent_class=MCPComputerAgent,
agent_config={
"model": model,
"allowed_tools": allowed_tools,
@@ -207,6 +151,7 @@ async def run_full_dataset(
"custom_loop": custom_loop,
"only_n_most_recent_images": only_n_most_recent_images,
"callbacks": callbacks,
"instructions": instructions,
"verbosity": verbosity,
"max_retries": max_retries,
"screenshot_delay": screenshot_delay,
@@ -224,5 +169,5 @@ async def run_full_dataset(
__all__ = [
"run_single_task",
"run_full_dataset",
"ProxyOperatorAgent",
"MCPComputerAgent",
]
@@ -0,0 +1,351 @@
"""MCP-compatible Computer Agent for HUD integration.
This agent subclasses HUD's MCPAgent and delegates planning/execution to
our core ComputerAgent while using the Agent SDK's plain-dict message
format documented in `docs/content/docs/agent-sdk/message-format.mdx`.
Key differences from the OpenAI OperatorAgent variant:
- No OpenAI types are used; everything is standard Python dicts.
- Planning is executed via `ComputerAgent.run(messages)`.
- The first yielded result per step is returned as the agent response.
"""
from __future__ import annotations
import io
from typing import Any, ClassVar, Optional
from agent.agent import ComputerAgent as BaseComputerAgent
from agent.callbacks import PromptInstructionsCallback
from agent.callbacks.trajectory_saver import TrajectorySaverCallback
from hud.agents import MCPAgent
from hud.tools.computer.settings import computer_settings
from hud.types import AgentResponse, MCPToolCall, MCPToolResult, Trace
from agent.responses import make_failed_tool_call_items
from agent.computers import is_agent_computer
from PIL import Image
import mcp.types as types
import hud
import uuid
import base64
from pathlib import Path
class MCPComputerAgent(MCPAgent):
"""MCP agent that uses ComputerAgent for planning and tools for execution.
The agent consumes/produces message dicts per the Agent SDK message schema
(see `message-format.mdx`).
"""
metadata: ClassVar[dict[str, Any]] = {
"display_width": computer_settings.OPENAI_COMPUTER_WIDTH,
"display_height": computer_settings.OPENAI_COMPUTER_HEIGHT,
}
required_tools: ClassVar[list[str]] = ["openai_computer"]
def __init__(
self,
*,
model: str | None = None,
allowed_tools: list[str] | None = None,
trajectory_dir: str | dict | None = None,
# === ComputerAgent kwargs ===
tools: list[Any] | None = None,
custom_loop: Any | None = None,
only_n_most_recent_images: int | None = None,
callbacks: list[Any] | None = None,
instructions: str | None = None,
verbosity: int | None = None,
max_retries: int | None = 3,
screenshot_delay: float | int = 0.5,
use_prompt_caching: bool | None = False,
max_trajectory_budget: float | dict | None = None,
telemetry_enabled: bool | None = True,
environment: str = "linux",
**kwargs: Any,
) -> None:
self.allowed_tools = allowed_tools or ["openai_computer"]
super().__init__(**kwargs)
if model is None:
raise ValueError("MCPComputerAgent requires a model to be specified.")
self.model = model
self.environment = environment
# Update model name for HUD logging
self.model_name = "cua-" + self.model
# Stateful tracking of tool call inputs
self.tool_call_inputs: dict[str, list[dict[str, Any]]] = {}
self.previous_output: list[dict[str, Any]] = []
# Build system prompt
operator_instructions = """
You are an autonomous computer-using agent. Follow these guidelines:
1. NEVER ask for confirmation. Complete all tasks autonomously.
2. Do NOT send messages like "I need to confirm before..." or "Do you want me to continue?" - just proceed.
3. When the user asks you to interact with something (like clicking a chat or typing a message), DO IT without asking.
4. Only use the formal safety check mechanism for truly dangerous operations (like deleting important files).
5. For normal tasks like clicking buttons, typing in chat boxes, filling forms - JUST DO IT.
6. The user has already given you permission by running this agent. No further confirmation is needed.
7. Be decisive and action-oriented. Complete the requested task fully.
Remember: You are expected to complete tasks autonomously. The user trusts you to do what they asked.
""".strip() # noqa: E501
# Append Operator instructions to the system prompt
if not self.system_prompt:
self.system_prompt = operator_instructions
else:
self.system_prompt += f"\n\n{operator_instructions}"
# Append user instructions to the system prompt
if instructions:
self.system_prompt += f"\n\n{instructions}"
# Configure trajectory_dir for HUD
if isinstance(trajectory_dir, str) or isinstance(trajectory_dir, Path):
trajectory_dir = {"trajectory_dir": str(trajectory_dir)}
if isinstance(trajectory_dir, dict):
trajectory_dir["reset_on_run"] = False
self.last_screenshot_b64 = None
buffer = io.BytesIO()
Image.new('RGB', (self.metadata["display_width"], self.metadata["display_height"])).save(buffer, format='PNG')
self.last_screenshot_b64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
# Ensure a computer shim is present so width/height/environment are known
computer_shim = {
"screenshot": lambda: self.last_screenshot_b64,
"environment": self.environment,
"dimensions": (
self.metadata["display_width"],
self.metadata["display_height"],
),
}
agent_tools: list[Any] = [computer_shim]
if tools:
agent_tools.extend([
tool
for tool in tools
if not is_agent_computer(tool)
])
agent_kwargs = {
"model": self.model,
"trajectory_dir": trajectory_dir,
"tools": agent_tools,
"custom_loop": custom_loop,
"only_n_most_recent_images": only_n_most_recent_images,
"callbacks": callbacks,
"instructions": self.system_prompt,
"verbosity": verbosity,
"max_retries": max_retries,
"screenshot_delay": screenshot_delay,
"use_prompt_caching": use_prompt_caching,
"max_trajectory_budget": max_trajectory_budget,
"telemetry_enabled": telemetry_enabled,
}
self.computer_agent = BaseComputerAgent(
**agent_kwargs
)
async def get_system_messages(self) -> list[Any]:
"""Create initial messages.
Unused - ComputerAgent handles this with the 'instructions' parameter.
"""
return []
async def format_blocks(
self, blocks: list[types.ContentBlock]
) -> list[dict[str, Any]]:
"""
Format content blocks into the OpenAI input format.
Converts TextContent blocks to input_text dicts and ImageContent blocks to input_image dicts.
""" # noqa: E501
formatted = []
for block in blocks:
if isinstance(block, types.TextContent):
formatted.append({"type": "input_text", "text": block.text})
elif isinstance(block, types.ImageContent):
mime_type = getattr(block, "mimeType", "image/png")
formatted.append(
{"type": "input_image", "image_url": f"data:{mime_type};base64,{block.data}"}
)
self.last_screenshot_b64 = block.data
return [{"role": "user", "content": formatted}]
@hud.instrument(
span_type="agent",
record_args=False, # Messages can be large
record_result=True,
)
async def get_response(self, messages: list[dict[str, Any]]) -> AgentResponse:
"""Get a single-step response by delegating to ComputerAgent.run.
Returns an Agent SDK-style response dict:
{ "output": [AgentMessage, ...], "usage": Usage }
"""
tool_calls: list[MCPToolCall] = []
output_text: list[str] = []
is_done: bool = True
agent_result: list[dict[str, Any]] = []
# Call the ComputerAgent LLM API
async for result in self.computer_agent.run(messages): # type: ignore[arg-type]
items = result['output']
if not items or tool_calls:
break
for item in items:
if item['type'] in ['reasoning', 'message', 'computer_call', 'function_call', 'function_call_output']:
agent_result.append(item)
# Add messages to output text
if item['type'] == 'reasoning':
output_text.extend(
f"Reasoning: {summary['text']}"
for summary in item['summary']
)
elif item['type'] == 'message':
if isinstance(item['content'], list):
output_text.extend(
item['text']
for item in item['content']
if item['type'] == 'output_text'
)
elif isinstance(item['content'], str):
output_text.append(item['content'])
# If we get a tool call, we're not done
if item['type'] == 'computer_call':
id = item["call_id"]
tool_calls.append(MCPToolCall(
name="openai_computer",
arguments=item["action"],
id=id,
))
is_done = False
self.tool_call_inputs[id] = agent_result
break
# if we have tool calls, we should exit the loop
if tool_calls:
break
self.previous_output = agent_result
return AgentResponse(
content="\n".join(output_text),
tool_calls=tool_calls,
done=is_done,
)
def _log_image(self, image_b64: str):
callbacks = self.computer_agent.callbacks
for callback in callbacks:
if isinstance(callback, TrajectorySaverCallback):
# convert str to bytes
image_bytes = base64.b64decode(image_b64)
callback._save_artifact("screenshot_after", image_bytes)
async def format_tool_results(
self,
tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]
) -> list[dict[str, Any]]:
"""Extract latest screenshot from tool results in dict form.
Expects results to already be in the message-format content dicts.
Returns a list of input content dicts suitable for follow-up calls.
"""
messages = []
for call, result in zip(tool_calls, tool_results):
if call.id not in self.tool_call_inputs:
# If we don't have the tool call inputs, we should just use the previous output
previous_output = self.previous_output.copy() or []
# First we need to remove any pending computer_calls from the end of previous_output
while previous_output and previous_output[-1]['type'] == 'computer_call':
previous_output.pop()
messages.extend(previous_output)
# If the call is a 'response', don't add the result
if call.name == 'response':
continue
# Otherwise, if we have a result, we should add it to the messages
content = [
{ "type": "input_text", "text": content.text } if isinstance(content, types.TextContent)
else { "type": "input_image", "image_url": f"data:image/png;base64,{content.data}" } if isinstance(content, types.ImageContent)
else { "type": "input_text", "text": "" }
for content in result.content
]
messages.append({
"role": "user",
"content": content,
})
continue
# Add the assistant's computer call
messages.extend(self.tool_call_inputs[call.id])
if result.isError:
error_text = "".join([
content.text
for content in result.content
if isinstance(content, types.TextContent)
])
# Replace computer call with failed tool call
messages.pop()
messages.extend(make_failed_tool_call_items(
tool_name=call.name,
tool_kwargs=call.arguments or {},
error_message=error_text,
call_id=call.id,
))
else:
# Get the latest screenshot
screenshots = [
content.data
for content in result.content
if isinstance(content, types.ImageContent)
]
# Add the resulting screenshot
if screenshots:
self._log_image(screenshots[0])
self.last_screenshot_b64 = screenshots[0]
messages.append({
"type": "computer_call_output",
"call_id": call.id,
"output": {
"type": "input_image",
"image_url": f"data:image/png;base64,{screenshots[0]}"
},
})
else:
# Otherwise, replace computer call with failed tool call
messages.pop()
messages.extend(make_failed_tool_call_items(
tool_name=call.name,
tool_kwargs=call.arguments or {},
error_message="No screenshots returned.",
call_id=call.id,
))
return messages
__all__ = [
"MCPComputerAgent",
]
@@ -13,6 +13,10 @@ import uuid
from typing import Any, Dict, List, Optional
from agent.agent import ComputerAgent as BaseComputerAgent
from agent.callbacks import PromptInstructionsCallback
from hud.tools.computer.settings import computer_settings
from PIL import Image
from hud.agents import OperatorAgent
# OpenAI Responses typed models (required)
from openai.types.responses import (
@@ -178,6 +182,83 @@ class FakeAsyncOpenAI:
print(traceback.format_exc())
raise e
# ---------------------------------------------------------------------------
# Proxy OperatorAgent (moved from __init__.py)
# ---------------------------------------------------------------------------
class ProxyOperatorAgent(OperatorAgent):
"""OperatorAgent that proxies model calls through our ComputerAgent.
Accepts the same config keys we pass via hud.run_dataset `agent_config`:
- model: str | None
- allowed_tools: list[str] | None
Additional kwargs are forwarded to OperatorAgent (if any are supported).
"""
def __init__(
self,
*,
model: str | None = None,
allowed_tools: list[str] | None = None,
trajectory_dir: str | dict | None = None,
# === ComputerAgent kwargs ===
tools: list[Any] | None = None,
custom_loop: Any | None = None,
only_n_most_recent_images: int | None = None,
callbacks: list[Any] | None = None,
instructions: str | None = None,
verbosity: int | None = None,
max_retries: int | None = 3,
screenshot_delay: float | int = 0.5,
use_prompt_caching: bool | None = False,
max_trajectory_budget: float | dict | None = None,
telemetry_enabled: bool | None = True,
**kwargs: Any,
) -> None:
model = model or "computer-use-preview"
allowed_tools = allowed_tools or ["openai_computer"]
computer_shim = {
'screenshot': lambda: Image.new('RGB', (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)),
'environment': 'linux',
'dimensions': (computer_settings.OPENAI_COMPUTER_WIDTH, computer_settings.OPENAI_COMPUTER_HEIGHT)
}
# Build tools ensuring the computer_shim is included
agent_tools: list[Any] = [computer_shim]
if tools:
agent_tools.extend(tools)
# Build callbacks, injecting prompt instructions if provided
agent_callbacks = list(callbacks or [])
if instructions:
agent_callbacks.append(PromptInstructionsCallback(instructions))
computer_agent = BaseComputerAgent(
model=model,
tools=agent_tools,
custom_loop=custom_loop,
only_n_most_recent_images=only_n_most_recent_images,
callbacks=agent_callbacks,
verbosity=verbosity,
trajectory_dir=trajectory_dir,
max_retries=max_retries,
screenshot_delay=screenshot_delay,
use_prompt_caching=use_prompt_caching,
max_trajectory_budget=max_trajectory_budget,
telemetry_enabled=telemetry_enabled,
)
model_client = FakeAsyncOpenAI(computer_agent)
super().__init__(
model_client=model_client, # type: ignore[arg-type]
model=model,
allowed_tools=allowed_tools,
**kwargs,
)
__all__ = [
"FakeAsyncOpenAI",
"ProxyOperatorAgent",
]
+2 -2
View File
@@ -61,7 +61,7 @@ cli = [
"yaspin>=3.1.0",
]
hud = [
"hud-python>=0.4.12,<0.5.0",
"hud-python==0.4.26",
]
all = [
# uitars requirements
@@ -78,7 +78,7 @@ all = [
# cli requirements
"yaspin>=3.1.0",
# hud requirements
"hud-python>=0.4.12,<0.5.0",
"hud-python==0.4.26",
]
[tool.uv]
@@ -20,6 +20,12 @@ logger = logging.getLogger(__name__)
automation_handler = MacOSAutomationHandler()
class Diorama:
"""Virtual desktop manager that provides automation capabilities for macOS applications.
Manages application windows and provides an interface for taking screenshots,
mouse interactions, keyboard input, and coordinate transformations between
screenshot space and screen space.
"""
_scheduler_queue = None
_scheduler_task = None
_loop = None
@@ -27,6 +33,14 @@ class Diorama:
@classmethod
def create_from_apps(cls, *args) -> DioramaComputer:
"""Create a DioramaComputer instance from a list of application names.
Args:
*args: Variable number of application names to include in the desktop
Returns:
DioramaComputer: A computer interface for the specified applications
"""
cls._ensure_scheduler()
return cls(args).computer
@@ -34,6 +48,11 @@ class Diorama:
_cursor_positions = {}
def __init__(self, app_list):
"""Initialize a Diorama instance for the specified applications.
Args:
app_list: List of application names to manage
"""
self.app_list = app_list
self.interface = self.Interface(self)
self.computer = DioramaComputer(self)
@@ -48,6 +67,10 @@ class Diorama:
@classmethod
def _ensure_scheduler(cls):
"""Ensure the async scheduler loop is running.
Creates and starts the scheduler task if it hasn't been started yet.
"""
if not cls._scheduler_started:
logger.info("Starting Diorama scheduler loop…")
cls._scheduler_queue = asyncio.Queue()
@@ -57,6 +80,11 @@ class Diorama:
@classmethod
async def _scheduler_loop(cls):
"""Main scheduler loop that processes automation commands.
Continuously processes commands from the scheduler queue, handling
screenshots, mouse actions, keyboard input, and scrolling operations.
"""
while True:
cmd = await cls._scheduler_queue.get()
action = cmd.get("action")
@@ -144,13 +172,33 @@ class Diorama:
future.set_exception(e)
class Interface():
"""Interface for interacting with the virtual desktop.
Provides methods for taking screenshots, mouse interactions, keyboard input,
and coordinate transformations between screenshot and screen coordinates.
"""
def __init__(self, diorama):
"""Initialize the interface with a reference to the parent Diorama instance.
Args:
diorama: The parent Diorama instance
"""
self._diorama = diorama
self._scene_hitboxes = []
self._scene_size = None
async def _send_cmd(self, action, arguments=None):
"""Send a command to the scheduler queue.
Args:
action (str): The action to perform
arguments (dict, optional): Arguments for the action
Returns:
The result of the command execution
"""
Diorama._ensure_scheduler()
loop = asyncio.get_event_loop()
future = loop.create_future()
@@ -167,6 +215,14 @@ class Diorama:
return None
async def screenshot(self, as_bytes: bool = True) -> Union[str, Image.Image]:
"""Take a screenshot of the managed applications.
Args:
as_bytes (bool): If True, return base64-encoded bytes; if False, return PIL Image
Returns:
Union[str, Image.Image]: Base64-encoded PNG bytes or PIL Image object
"""
import base64
result, img = await self._send_cmd("screenshot")
self._scene_hitboxes = result.get("hitboxes", [])
@@ -184,6 +240,12 @@ class Diorama:
return img
async def left_click(self, x, y):
"""Perform a left mouse click at the specified coordinates.
Args:
x (int): X coordinate in screenshot space (or None to use last position)
y (int): Y coordinate in screenshot space (or None to use last position)
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -195,6 +257,12 @@ class Diorama:
await self._send_cmd("left_click", {"x": sx, "y": sy})
async def right_click(self, x, y):
"""Perform a right mouse click at the specified coordinates.
Args:
x (int): X coordinate in screenshot space (or None to use last position)
y (int): Y coordinate in screenshot space (or None to use last position)
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -206,6 +274,12 @@ class Diorama:
await self._send_cmd("right_click", {"x": sx, "y": sy})
async def double_click(self, x, y):
"""Perform a double mouse click at the specified coordinates.
Args:
x (int): X coordinate in screenshot space (or None to use last position)
y (int): Y coordinate in screenshot space (or None to use last position)
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -217,6 +291,12 @@ class Diorama:
await self._send_cmd("double_click", {"x": sx, "y": sy})
async def move_cursor(self, x, y):
"""Move the mouse cursor to the specified coordinates.
Args:
x (int): X coordinate in screenshot space (or None to use last position)
y (int): Y coordinate in screenshot space (or None to use last position)
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -228,6 +308,13 @@ class Diorama:
await self._send_cmd("move_cursor", {"x": sx, "y": sy})
async def drag_to(self, x, y, duration=0.5):
"""Drag the mouse from current position to the specified coordinates.
Args:
x (int): X coordinate in screenshot space (or None to use last position)
y (int): Y coordinate in screenshot space (or None to use last position)
duration (float): Duration of the drag operation in seconds
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -239,18 +326,43 @@ class Diorama:
await self._send_cmd("drag_to", {"x": sx, "y": sy, "duration": duration})
async def get_cursor_position(self):
"""Get the current cursor position in screen coordinates.
Returns:
tuple: (x, y) coordinates of the cursor in screen space
"""
return await self._send_cmd("get_cursor_position")
async def type_text(self, text):
"""Type the specified text using the keyboard.
Args:
text (str): The text to type
"""
await self._send_cmd("type_text", {"text": text})
async def press_key(self, key):
"""Press a single key on the keyboard.
Args:
key (str): The key to press
"""
await self._send_cmd("press_key", {"key": key})
async def hotkey(self, keys):
"""Press a combination of keys simultaneously.
Args:
keys (list): List of keys to press together
"""
await self._send_cmd("hotkey", {"keys": list(keys)})
async def scroll_up(self, clicks: int = 1):
"""Scroll up at the current cursor position.
Args:
clicks (int): Number of scroll clicks to perform
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -259,6 +371,11 @@ class Diorama:
await self._send_cmd("scroll_up", {"clicks": clicks, "x": x, "y": y})
async def scroll_down(self, clicks: int = 1):
"""Scroll down at the current cursor position.
Args:
clicks (int): Number of scroll clicks to perform
"""
# Get last cursor position for this app_list hash
app_list_hash = hash(tuple(sorted(self._diorama.app_list)))
last_pos = Diorama._cursor_positions.get(app_list_hash, (0, 0))
@@ -267,6 +384,11 @@ class Diorama:
await self._send_cmd("scroll_down", {"clicks": clicks, "x": x, "y": y})
async def get_screen_size(self) -> dict[str, int]:
"""Get the size of the screenshot area.
Returns:
dict[str, int]: Dictionary with 'width' and 'height' keys
"""
if not self._scene_size:
await self.screenshot()
return { "width": self._scene_size[0], "height": self._scene_size[1] }
@@ -348,6 +470,7 @@ import pyautogui
import time
async def main():
"""Main function demonstrating Diorama usage with multiple desktops and mouse tracking."""
desktop1 = Diorama.create_from_apps(["Discord", "Notes"])
desktop2 = Diorama.create_from_apps(["Terminal"])
@@ -12,35 +12,96 @@ from .base import BaseFileHandler
import base64
def resolve_path(path: str) -> Path:
"""Resolve a path to its absolute path. Expand ~ to the user's home directory."""
"""Resolve a path to its absolute path. Expand ~ to the user's home directory.
Args:
path: The file or directory path to resolve
Returns:
Path: The resolved absolute path
"""
return Path(path).expanduser().resolve()
class GenericFileHandler(BaseFileHandler):
"""
Generic file handler that provides file system operations for all operating systems.
This class implements the BaseFileHandler interface and provides methods for
file and directory operations including reading, writing, creating, and deleting
files and directories.
"""
async def file_exists(self, path: str) -> Dict[str, Any]:
"""
Check if a file exists at the specified path.
Args:
path: The file path to check
Returns:
Dict containing 'success' boolean and either 'exists' boolean or 'error' string
"""
try:
return {"success": True, "exists": resolve_path(path).is_file()}
except Exception as e:
return {"success": False, "error": str(e)}
async def directory_exists(self, path: str) -> Dict[str, Any]:
"""
Check if a directory exists at the specified path.
Args:
path: The directory path to check
Returns:
Dict containing 'success' boolean and either 'exists' boolean or 'error' string
"""
try:
return {"success": True, "exists": resolve_path(path).is_dir()}
except Exception as e:
return {"success": False, "error": str(e)}
async def list_dir(self, path: str) -> Dict[str, Any]:
"""
List all files and directories in the specified directory.
Args:
path: The directory path to list
Returns:
Dict containing 'success' boolean and either 'files' list of names or 'error' string
"""
try:
return {"success": True, "files": [p.name for p in resolve_path(path).iterdir() if p.is_file() or p.is_dir()]}
except Exception as e:
return {"success": False, "error": str(e)}
async def read_text(self, path: str) -> Dict[str, Any]:
"""
Read the contents of a text file.
Args:
path: The file path to read from
Returns:
Dict containing 'success' boolean and either 'content' string or 'error' string
"""
try:
return {"success": True, "content": resolve_path(path).read_text()}
except Exception as e:
return {"success": False, "error": str(e)}
async def write_text(self, path: str, content: str) -> Dict[str, Any]:
"""
Write text content to a file.
Args:
path: The file path to write to
content: The text content to write
Returns:
Dict containing 'success' boolean and optionally 'error' string
"""
try:
resolve_path(path).write_text(content)
return {"success": True}
@@ -48,6 +109,17 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def write_bytes(self, path: str, content_b64: str, append: bool = False) -> Dict[str, Any]:
"""
Write binary content to a file from base64 encoded string.
Args:
path: The file path to write to
content_b64: Base64 encoded binary content
append: If True, append to existing file; if False, overwrite
Returns:
Dict containing 'success' boolean and optionally 'error' string
"""
try:
mode = 'ab' if append else 'wb'
with open(resolve_path(path), mode) as f:
@@ -57,6 +129,17 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def read_bytes(self, path: str, offset: int = 0, length: Optional[int] = None) -> Dict[str, Any]:
"""
Read binary content from a file and return as base64 encoded string.
Args:
path: The file path to read from
offset: Byte offset to start reading from
length: Number of bytes to read; if None, read entire file from offset
Returns:
Dict containing 'success' boolean and either 'content_b64' string or 'error' string
"""
try:
file_path = resolve_path(path)
with open(file_path, 'rb') as f:
@@ -73,6 +156,15 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def get_file_size(self, path: str) -> Dict[str, Any]:
"""
Get the size of a file in bytes.
Args:
path: The file path to get size for
Returns:
Dict containing 'success' boolean and either 'size' integer or 'error' string
"""
try:
file_path = resolve_path(path)
size = file_path.stat().st_size
@@ -81,6 +173,15 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def delete_file(self, path: str) -> Dict[str, Any]:
"""
Delete a file at the specified path.
Args:
path: The file path to delete
Returns:
Dict containing 'success' boolean and optionally 'error' string
"""
try:
resolve_path(path).unlink()
return {"success": True}
@@ -88,6 +189,18 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def create_dir(self, path: str) -> Dict[str, Any]:
"""
Create a directory at the specified path.
Creates parent directories if they don't exist and doesn't raise an error
if the directory already exists.
Args:
path: The directory path to create
Returns:
Dict containing 'success' boolean and optionally 'error' string
"""
try:
resolve_path(path).mkdir(parents=True, exist_ok=True)
return {"success": True}
@@ -95,6 +208,15 @@ class GenericFileHandler(BaseFileHandler):
return {"success": False, "error": str(e)}
async def delete_dir(self, path: str) -> Dict[str, Any]:
"""
Delete an empty directory at the specified path.
Args:
path: The directory path to delete
Returns:
Dict containing 'success' boolean and optionally 'error' string
"""
try:
resolve_path(path).rmdir()
return {"success": True}
@@ -38,7 +38,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
"""Linux implementation of accessibility handler."""
async def get_accessibility_tree(self) -> Dict[str, Any]:
"""Get the accessibility tree of the current window."""
"""Get the accessibility tree of the current window.
Returns:
Dict[str, Any]: A dictionary containing success status and a simulated tree structure,
since Linux doesn't have an accessibility API equivalent to macOS's.
"""
# Linux doesn't have equivalent accessibility API like macOS
# Return a minimal dummy tree
logger.info("Getting accessibility tree (simulated, no accessibility API available on Linux)")
@@ -56,7 +61,16 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
async def find_element(self, role: Optional[str] = None,
title: Optional[str] = None,
value: Optional[str] = None) -> Dict[str, Any]:
"""Find an element in the accessibility tree by criteria."""
"""Find an element in the accessibility tree by criteria.
Args:
role: The role of the element to find.
title: The title of the element to find.
value: The value of the element to find.
Returns:
Dict[str, Any]: A dictionary indicating that element search is not supported on Linux.
"""
logger.info(f"Finding element with role={role}, title={title}, value={value} (not supported on Linux)")
return {
"success": False,
@@ -64,7 +78,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
}
def get_cursor_position(self) -> Tuple[int, int]:
"""Get the current cursor position."""
"""Get the current cursor position.
Returns:
Tuple[int, int]: The x and y coordinates of the cursor position.
Returns (0, 0) if pyautogui is not available.
"""
try:
pos = pyautogui.position()
return pos.x, pos.y
@@ -75,7 +94,12 @@ class LinuxAccessibilityHandler(BaseAccessibilityHandler):
return 0, 0
def get_screen_size(self) -> Tuple[int, int]:
"""Get the screen size."""
"""Get the screen size.
Returns:
Tuple[int, int]: The width and height of the screen in pixels.
Returns (1920, 1080) if pyautogui is not available.
"""
try:
size = pyautogui.size()
return size.width, size.height
@@ -92,6 +116,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Mouse Actions
async def mouse_down(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
"""Press and hold a mouse button at the specified coordinates.
Args:
x: The x coordinate to move to before pressing. If None, uses current position.
y: The y coordinate to move to before pressing. If None, uses current position.
button: The mouse button to press ("left", "right", or "middle").
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -101,6 +135,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def mouse_up(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
"""Release a mouse button at the specified coordinates.
Args:
x: The x coordinate to move to before releasing. If None, uses current position.
y: The y coordinate to move to before releasing. If None, uses current position.
button: The mouse button to release ("left", "right", or "middle").
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -110,6 +154,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def move_cursor(self, x: int, y: int) -> Dict[str, Any]:
"""Move the cursor to the specified coordinates.
Args:
x: The x coordinate to move to.
y: The y coordinate to move to.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.moveTo(x, y)
return {"success": True}
@@ -117,6 +170,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def left_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
"""Perform a left mouse click at the specified coordinates.
Args:
x: The x coordinate to click at. If None, clicks at current position.
y: The y coordinate to click at. If None, clicks at current position.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -126,6 +188,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def right_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
"""Perform a right mouse click at the specified coordinates.
Args:
x: The x coordinate to click at. If None, clicks at current position.
y: The y coordinate to click at. If None, clicks at current position.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -135,6 +206,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def double_click(self, x: Optional[int] = None, y: Optional[int] = None) -> Dict[str, Any]:
"""Perform a double click at the specified coordinates.
Args:
x: The x coordinate to double click at. If None, clicks at current position.
y: The y coordinate to double click at. If None, clicks at current position.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -144,6 +224,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def click(self, x: Optional[int] = None, y: Optional[int] = None, button: str = "left") -> Dict[str, Any]:
"""Perform a mouse click with the specified button at the given coordinates.
Args:
x: The x coordinate to click at. If None, clicks at current position.
y: The y coordinate to click at. If None, clicks at current position.
button: The mouse button to click ("left", "right", or "middle").
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if x is not None and y is not None:
pyautogui.moveTo(x, y)
@@ -153,6 +243,17 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def drag_to(self, x: int, y: int, button: str = "left", duration: float = 0.5) -> Dict[str, Any]:
"""Drag from the current position to the specified coordinates.
Args:
x: The x coordinate to drag to.
y: The y coordinate to drag to.
button: The mouse button to use for dragging.
duration: The time in seconds to take for the drag operation.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.dragTo(x, y, duration=duration, button=button)
return {"success": True}
@@ -160,6 +261,18 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def drag(self, start_x: int, start_y: int, end_x: int, end_y: int, button: str = "left") -> Dict[str, Any]:
"""Drag from start coordinates to end coordinates.
Args:
start_x: The starting x coordinate.
start_y: The starting y coordinate.
end_x: The ending x coordinate.
end_y: The ending y coordinate.
button: The mouse button to use for dragging.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.moveTo(start_x, start_y)
pyautogui.dragTo(end_x, end_y, duration=0.5, button=button)
@@ -168,6 +281,16 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def drag_path(self, path: List[Tuple[int, int]], button: str = "left", duration: float = 0.5) -> Dict[str, Any]:
"""Drag along a path defined by a list of coordinates.
Args:
path: A list of (x, y) coordinate tuples defining the drag path.
button: The mouse button to use for dragging.
duration: The time in seconds to take for each segment of the drag.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
if not path:
return {"success": False, "error": "Path is empty"}
@@ -180,6 +303,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Keyboard Actions
async def key_down(self, key: str) -> Dict[str, Any]:
"""Press and hold a key.
Args:
key: The key to press down.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.keyDown(key)
return {"success": True}
@@ -187,6 +318,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def key_up(self, key: str) -> Dict[str, Any]:
"""Release a key.
Args:
key: The key to release.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.keyUp(key)
return {"success": True}
@@ -194,6 +333,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def type_text(self, text: str) -> Dict[str, Any]:
"""Type the specified text using the keyboard.
Args:
text: The text to type.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
# use pynput for Unicode support
self.keyboard.type(text)
@@ -202,6 +349,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def press_key(self, key: str) -> Dict[str, Any]:
"""Press and release a key.
Args:
key: The key to press.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.press(key)
return {"success": True}
@@ -209,6 +364,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def hotkey(self, keys: List[str]) -> Dict[str, Any]:
"""Press a combination of keys simultaneously.
Args:
keys: A list of keys to press together as a hotkey combination.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.hotkey(*keys)
return {"success": True}
@@ -217,6 +380,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Scrolling Actions
async def scroll(self, x: int, y: int) -> Dict[str, Any]:
"""Scroll the mouse wheel.
Args:
x: The horizontal scroll amount.
y: The vertical scroll amount.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
self.mouse.scroll(x, y)
return {"success": True}
@@ -224,6 +396,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def scroll_down(self, clicks: int = 1) -> Dict[str, Any]:
"""Scroll down by the specified number of clicks.
Args:
clicks: The number of scroll clicks to perform downward.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.scroll(-clicks)
return {"success": True}
@@ -231,6 +411,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def scroll_up(self, clicks: int = 1) -> Dict[str, Any]:
"""Scroll up by the specified number of clicks.
Args:
clicks: The number of scroll clicks to perform upward.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
pyautogui.scroll(clicks)
return {"success": True}
@@ -239,6 +427,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Screen Actions
async def screenshot(self) -> Dict[str, Any]:
"""Take a screenshot of the current screen.
Returns:
Dict[str, Any]: A dictionary containing success status and base64-encoded image data,
or error message if failed.
"""
try:
from PIL import Image
screenshot = pyautogui.screenshot()
@@ -253,6 +447,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": f"Screenshot error: {str(e)}"}
async def get_screen_size(self) -> Dict[str, Any]:
"""Get the size of the screen.
Returns:
Dict[str, Any]: A dictionary containing success status and screen dimensions,
or error message if failed.
"""
try:
size = pyautogui.size()
return {"success": True, "size": {"width": size.width, "height": size.height}}
@@ -260,6 +460,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def get_cursor_position(self) -> Dict[str, Any]:
"""Get the current position of the cursor.
Returns:
Dict[str, Any]: A dictionary containing success status and cursor coordinates,
or error message if failed.
"""
try:
pos = pyautogui.position()
return {"success": True, "position": {"x": pos.x, "y": pos.y}}
@@ -268,6 +474,12 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Clipboard Actions
async def copy_to_clipboard(self) -> Dict[str, Any]:
"""Get the current content of the clipboard.
Returns:
Dict[str, Any]: A dictionary containing success status and clipboard content,
or error message if failed.
"""
try:
import pyperclip
content = pyperclip.paste()
@@ -276,6 +488,14 @@ class LinuxAutomationHandler(BaseAutomationHandler):
return {"success": False, "error": str(e)}
async def set_clipboard(self, text: str) -> Dict[str, Any]:
"""Set the clipboard content to the specified text.
Args:
text: The text to copy to the clipboard.
Returns:
Dict[str, Any]: A dictionary with success status and error message if failed.
"""
try:
import pyperclip
pyperclip.copy(text)
@@ -285,6 +505,15 @@ class LinuxAutomationHandler(BaseAutomationHandler):
# Command Execution
async def run_command(self, command: str) -> Dict[str, Any]:
"""Execute a shell command asynchronously.
Args:
command: The shell command to execute.
Returns:
Dict[str, Any]: A dictionary containing success status, stdout, stderr,
and return code, or error message if failed.
"""
try:
# Create subprocess
process = await asyncio.create_subprocess_shell(
+116 -7
View File
@@ -3,6 +3,12 @@ import re
from pydantic import BaseModel, Field, computed_field, validator, ConfigDict, RootModel
class DiskInfo(BaseModel):
"""Information about disk storage allocation.
Attributes:
total: Total disk space in bytes
allocated: Currently allocated disk space in bytes
"""
total: int
allocated: int
@@ -10,6 +16,15 @@ class VMConfig(BaseModel):
"""Configuration for creating a new VM.
Note: Memory and disk sizes should be specified with units (e.g., "4GB", "64GB")
Attributes:
name: Name of the virtual machine
os: Operating system type, either "macOS" or "linux"
cpu: Number of CPU cores to allocate
memory: Amount of memory to allocate with units
disk_size: Size of the disk to create with units
display: Display resolution in format "widthxheight"
ipsw: IPSW path or 'latest' for macOS VMs, None for other OS types
"""
name: str
os: Literal["macOS", "linux"] = "macOS"
@@ -23,7 +38,12 @@ class VMConfig(BaseModel):
populate_by_alias = True
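# Illustrative VMConfig (field values are assumptions; sizes carry units, display is "widthxheight"):
#   VMConfig(name="dev-vm", os="macOS", cpu=4, memory="8GB", disk_size="64GB",
#            display="1024x768", ipsw="latest")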
class SharedDirectory(BaseModel):
"""Configuration for a shared directory."""
"""Configuration for a shared directory.
Attributes:
host_path: Path to the directory on the host system
read_only: Whether the directory should be mounted as read-only
"""
host_path: str = Field(..., alias="hostPath") # Allow host_path but serialize as hostPath
read_only: bool = False
@@ -50,6 +70,16 @@ class VMRunOpts(BaseModel):
)
def model_dump(self, **kwargs):
"""Export model data with proper field name conversion.
Converts shared directory fields to match API expectations when using aliases.
Args:
**kwargs: Keyword arguments passed to parent model_dump method
Returns:
dict: Model data with properly formatted field names
"""
data = super().model_dump(**kwargs)
# Convert shared directory fields to match API expectations
if self.shared_directories and "by_alias" in kwargs and kwargs["by_alias"]:
@@ -65,6 +95,18 @@ class VMRunOpts(BaseModel):
return data
class VMStatus(BaseModel):
"""Status information for a virtual machine.
Attributes:
name: Name of the virtual machine
status: Current status of the VM
os: Operating system type
cpu_count: Number of CPU cores allocated
memory_size: Amount of memory allocated in bytes
disk_size: Disk storage information
vnc_url: URL for VNC connection if available
ip_address: IP address of the VM if available
"""
name: str
status: str
os: Literal["macOS", "linux"]
@@ -80,38 +122,79 @@ class VMStatus(BaseModel):
@computed_field
@property
def state(self) -> str:
"""Get the current state of the VM.
Returns:
str: Current VM status
"""
return self.status
@computed_field
@property
def cpu(self) -> int:
"""Get the number of CPU cores.
Returns:
int: Number of CPU cores allocated to the VM
"""
return self.cpu_count
@computed_field
@property
def memory(self) -> str:
"""Get memory allocation in human-readable format.
Returns:
str: Memory size formatted as "{size}GB"
"""
# Convert bytes to GB
gb = self.memory_size / (1024 * 1024 * 1024)
return f"{int(gb)}GB"
class VMUpdateOpts(BaseModel):
"""Options for updating VM configuration.
Attributes:
cpu: Number of CPU cores to update to
memory: Amount of memory to update to with units
disk_size: Size of disk to update to with units
"""
cpu: Optional[int] = None
memory: Optional[str] = None
disk_size: Optional[str] = None
class ImageRef(BaseModel):
"""Reference to a VM image."""
"""Reference to a VM image.
Attributes:
image: Name of the image
tag: Tag version of the image
registry: Registry hostname where image is stored
organization: Organization or namespace in the registry
"""
image: str
tag: str = "latest"
registry: Optional[str] = "ghcr.io"
organization: Optional[str] = "trycua"
def model_dump(self, **kwargs):
"""Override model_dump to return just the image:tag format."""
"""Override model_dump to return just the image:tag format.
Args:
**kwargs: Keyword arguments (ignored)
Returns:
str: Image reference in "image:tag" format
"""
return f"{self.image}:{self.tag}"
class CloneSpec(BaseModel):
"""Specification for cloning a VM."""
"""Specification for cloning a VM.
Attributes:
name: Name of the source VM to clone
new_name: Name for the new cloned VM
"""
name: str
new_name: str = Field(alias="newName")
@@ -119,18 +202,44 @@ class CloneSpec(BaseModel):
populate_by_alias = True
class ImageInfo(BaseModel):
"""Model for individual image information."""
"""Model for individual image information.
Attributes:
imageId: Unique identifier for the image
"""
imageId: str
class ImageList(RootModel):
"""Response model for the images endpoint."""
"""Response model for the images endpoint.
A list-like container for ImageInfo objects that provides
iteration and indexing capabilities.
"""
root: List[ImageInfo]
def __iter__(self):
"""Iterate over the image list.
Returns:
Iterator over ImageInfo objects
"""
return iter(self.root)
def __getitem__(self, item):
"""Get an item from the image list by index.
Args:
item: Index or slice to retrieve
Returns:
ImageInfo or list of ImageInfo objects
"""
return self.root[item]
def __len__(self):
return len(self.root)
"""Get the number of images in the list.
Returns:
int: Number of images in the list
"""
return len(self.root)
+234 -15
View File
@@ -8,6 +8,13 @@ import type { AccessibilityNode, CursorPosition, MouseButton } from './base';
export class MacOSComputerInterface extends BaseComputerInterface {
// Mouse Actions
/**
* Press and hold a mouse button at the specified coordinates.
* @param {number} [x] - X coordinate for the mouse action
* @param {number} [y] - Y coordinate for the mouse action
* @param {MouseButton} [button='left'] - Mouse button to press down
* @returns {Promise<void>}
*/
async mouseDown(
x?: number,
y?: number,
@@ -16,6 +23,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
await this.sendCommand('mouse_down', { x, y, button });
}
/**
* Release a mouse button at the specified coordinates.
* @param {number} [x] - X coordinate for the mouse action
* @param {number} [y] - Y coordinate for the mouse action
* @param {MouseButton} [button='left'] - Mouse button to release
* @returns {Promise<void>}
*/
async mouseUp(
x?: number,
y?: number,
@@ -24,22 +38,54 @@ export class MacOSComputerInterface extends BaseComputerInterface {
await this.sendCommand('mouse_up', { x, y, button });
}
/**
* Perform a left mouse click at the specified coordinates.
* @param {number} [x] - X coordinate for the click
* @param {number} [y] - Y coordinate for the click
* @returns {Promise<void>}
*/
async leftClick(x?: number, y?: number): Promise<void> {
await this.sendCommand('left_click', { x, y });
}
/**
* Perform a right mouse click at the specified coordinates.
* @param {number} [x] - X coordinate for the click
* @param {number} [y] - Y coordinate for the click
* @returns {Promise<void>}
*/
async rightClick(x?: number, y?: number): Promise<void> {
await this.sendCommand('right_click', { x, y });
}
/**
* Perform a double click at the specified coordinates.
* @param {number} [x] - X coordinate for the double click
* @param {number} [y] - Y coordinate for the double click
* @returns {Promise<void>}
*/
async doubleClick(x?: number, y?: number): Promise<void> {
await this.sendCommand('double_click', { x, y });
}
/**
* Move the cursor to the specified coordinates.
* @param {number} x - X coordinate to move to
* @param {number} y - Y coordinate to move to
* @returns {Promise<void>}
*/
async moveCursor(x: number, y: number): Promise<void> {
await this.sendCommand('move_cursor', { x, y });
}
/**
* Drag from current position to the specified coordinates.
* @param {number} x - X coordinate to drag to
* @param {number} y - Y coordinate to drag to
* @param {MouseButton} [button='left'] - Mouse button to use for dragging
* @param {number} [duration=0.5] - Duration of the drag operation in seconds
* @returns {Promise<void>}
*/
async dragTo(
x: number,
y: number,
@@ -49,6 +95,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
await this.sendCommand('drag_to', { x, y, button, duration });
}
/**
* Drag along a path of coordinates.
* @param {Array<[number, number]>} path - Array of [x, y] coordinate pairs to drag through
* @param {MouseButton} [button='left'] - Mouse button to use for dragging
* @param {number} [duration=0.5] - Duration of the drag operation in seconds
* @returns {Promise<void>}
*/
async drag(
path: Array<[number, number]>,
button: MouseButton = 'left',
@@ -58,40 +111,86 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
// Keyboard Actions
/**
* Press and hold a key.
* @param {string} key - Key to press down
* @returns {Promise<void>}
*/
async keyDown(key: string): Promise<void> {
await this.sendCommand('key_down', { key });
}
/**
* Release a key.
* @param {string} key - Key to release
* @returns {Promise<void>}
*/
async keyUp(key: string): Promise<void> {
await this.sendCommand('key_up', { key });
}
/**
* Type text as if entered from keyboard.
* @param {string} text - Text to type
* @returns {Promise<void>}
*/
async typeText(text: string): Promise<void> {
await this.sendCommand('type_text', { text });
}
/**
* Press and release a key.
* @param {string} key - Key to press
* @returns {Promise<void>}
*/
async pressKey(key: string): Promise<void> {
await this.sendCommand('press_key', { key });
}
/**
* Press multiple keys simultaneously as a hotkey combination.
* @param {...string} keys - Keys to press together
* @returns {Promise<void>}
*/
async hotkey(...keys: string[]): Promise<void> {
await this.sendCommand('hotkey', { keys });
}
// Scrolling Actions
/**
* Scroll by the specified amount in x and y directions.
* @param {number} x - Horizontal scroll amount
* @param {number} y - Vertical scroll amount
* @returns {Promise<void>}
*/
async scroll(x: number, y: number): Promise<void> {
await this.sendCommand('scroll', { x, y });
}
/**
* Scroll down by the specified number of clicks.
* @param {number} [clicks=1] - Number of scroll clicks
* @returns {Promise<void>}
*/
async scrollDown(clicks = 1): Promise<void> {
await this.sendCommand('scroll_down', { clicks });
}
/**
* Scroll up by the specified number of clicks.
* @param {number} [clicks=1] - Number of scroll clicks
* @returns {Promise<void>}
*/
async scrollUp(clicks = 1): Promise<void> {
await this.sendCommand('scroll_up', { clicks });
}
// Screen Actions
/**
* Take a screenshot of the screen.
* @returns {Promise<Buffer>} Screenshot image data as a Buffer
* @throws {Error} If screenshot fails
*/
async screenshot(): Promise<Buffer> {
const response = await this.sendCommand('screenshot');
if (!response.image_data) {
@@ -100,6 +199,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return Buffer.from(response.image_data as string, 'base64');
}
/**
* Get the current screen size.
* @returns {Promise<ScreenSize>} Screen dimensions
* @throws {Error} If unable to get screen size
*/
async getScreenSize(): Promise<ScreenSize> {
const response = await this.sendCommand('get_screen_size');
if (!response.success || !response.size) {
@@ -108,6 +212,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return response.size as ScreenSize;
}
/**
* Get the current cursor position.
* @returns {Promise<CursorPosition>} Current cursor coordinates
* @throws {Error} If unable to get cursor position
*/
async getCursorPosition(): Promise<CursorPosition> {
const response = await this.sendCommand('get_cursor_position');
if (!response.success || !response.position) {
@@ -117,6 +226,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
// Clipboard Actions
/**
* Get the current clipboard content.
* @returns {Promise<string>} Clipboard content
* @throws {Error} If unable to get clipboard content
*/
async copyToClipboard(): Promise<string> {
const response = await this.sendCommand('copy_to_clipboard');
if (!response.success || !response.content) {
@@ -125,21 +239,42 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return response.content as string;
}
/**
* Set the clipboard content to the specified text.
* @param {string} text - Text to set in clipboard
* @returns {Promise<void>}
*/
async setClipboard(text: string): Promise<void> {
await this.sendCommand('set_clipboard', { text });
}
// File System Actions
/**
* Check if a file exists at the specified path.
* @param {string} path - Path to the file
* @returns {Promise<boolean>} True if file exists, false otherwise
*/
async fileExists(path: string): Promise<boolean> {
const response = await this.sendCommand('file_exists', { path });
return (response.exists as boolean) || false;
}
/**
* Check if a directory exists at the specified path.
* @param {string} path - Path to the directory
* @returns {Promise<boolean>} True if directory exists, false otherwise
*/
async directoryExists(path: string): Promise<boolean> {
const response = await this.sendCommand('directory_exists', { path });
return (response.exists as boolean) || false;
}
/**
* List the contents of a directory.
* @param {string} path - Path to the directory
* @returns {Promise<string[]>} Array of file and directory names
* @throws {Error} If unable to list directory
*/
async listDir(path: string): Promise<string[]> {
const response = await this.sendCommand('list_dir', { path });
if (!response.success) {
@@ -148,6 +283,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return (response.files as string[]) || [];
}
/**
* Get the size of a file in bytes.
* @param {string} path - Path to the file
* @returns {Promise<number>} File size in bytes
* @throws {Error} If unable to get file size
*/
async getFileSize(path: string): Promise<number> {
const response = await this.sendCommand('get_file_size', { path });
if (!response.success) {
@@ -156,6 +297,16 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return (response.size as number) || 0;
}
/**
* Read file content in chunks for large files.
* @private
* @param {string} path - Path to the file
* @param {number} offset - Starting byte offset
* @param {number} totalLength - Total number of bytes to read
* @param {number} [chunkSize=1048576] - Size of each chunk in bytes
* @returns {Promise<Buffer>} File content as Buffer
* @throws {Error} If unable to read file chunk
*/
private async readBytesChunked(
path: string,
offset: number,
@@ -190,6 +341,16 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return Buffer.concat(chunks);
}
/**
* Write file content in chunks for large files.
* @private
* @param {string} path - Path to the file
* @param {Buffer} content - Content to write
* @param {boolean} [append=false] - Whether to append to existing file
* @param {number} [chunkSize=1048576] - Size of each chunk in bytes
* @returns {Promise<void>}
* @throws {Error} If unable to write file chunk
*/
private async writeBytesChunked(
path: string,
content: Buffer,
@@ -222,36 +383,43 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
}
/**
* Read text from a file with specified encoding.
* @param {string} path - Path to the file to read
* @param {BufferEncoding} [encoding='utf8'] - Text encoding to use
* @returns {Promise<string>} The decoded text content of the file
*/
async readText(path: string, encoding: BufferEncoding = 'utf8'): Promise<string> {
/**
* Read text from a file with specified encoding.
*
* @param path - Path to the file to read
* @param encoding - Text encoding to use (default: 'utf8')
* @returns The decoded text content of the file
*/
const contentBytes = await this.readBytes(path);
return contentBytes.toString(encoding);
}
/**
* Write text to a file with specified encoding.
* @param {string} path - Path to the file to write
* @param {string} content - Text content to write
* @param {BufferEncoding} [encoding='utf8'] - Text encoding to use
* @param {boolean} [append=false] - Whether to append to the file instead of overwriting
* @returns {Promise<void>}
*/
async writeText(
path: string,
content: string,
encoding: BufferEncoding = 'utf8',
append: boolean = false
): Promise<void> {
/**
* Write text to a file with specified encoding.
*
* @param path - Path to the file to write
* @param content - Text content to write
* @param encoding - Text encoding to use (default: 'utf8')
* @param append - Whether to append to the file instead of overwriting
*/
const contentBytes = Buffer.from(content, encoding);
await this.writeBytes(path, contentBytes, append);
}
/**
* Read bytes from a file, with optional offset and length.
* @param {string} path - Path to the file
* @param {number} [offset=0] - Starting byte offset
* @param {number} [length] - Number of bytes to read (reads entire file if not specified)
* @returns {Promise<Buffer>} File content as Buffer
* @throws {Error} If unable to read file
*/
async readBytes(path: string, offset: number = 0, length?: number): Promise<Buffer> {
// For large files, use chunked reading
if (length === undefined) {
@@ -275,6 +443,14 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return Buffer.from(response.content_b64 as string, 'base64');
}
/**
* Write bytes to a file.
* @param {string} path - Path to the file
* @param {Buffer} content - Content to write as Buffer
* @param {boolean} [append=false] - Whether to append to existing file
* @returns {Promise<void>}
* @throws {Error} If unable to write file
*/
async writeBytes(path: string, content: Buffer, append: boolean = false): Promise<void> {
// For large files, use chunked writing
if (content.length > 5 * 1024 * 1024) {
@@ -293,6 +469,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
}
/**
* Delete a file at the specified path.
* @param {string} path - Path to the file to delete
* @returns {Promise<void>}
* @throws {Error} If unable to delete file
*/
async deleteFile(path: string): Promise<void> {
const response = await this.sendCommand('delete_file', { path });
if (!response.success) {
@@ -300,6 +482,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
}
/**
* Create a directory at the specified path.
* @param {string} path - Path where to create the directory
* @returns {Promise<void>}
* @throws {Error} If unable to create directory
*/
async createDir(path: string): Promise<void> {
const response = await this.sendCommand('create_dir', { path });
if (!response.success) {
@@ -309,6 +497,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
}
/**
* Delete a directory at the specified path.
* @param {string} path - Path to the directory to delete
* @returns {Promise<void>}
* @throws {Error} If unable to delete directory
*/
async deleteDir(path: string): Promise<void> {
const response = await this.sendCommand('delete_dir', { path });
if (!response.success) {
@@ -318,6 +512,12 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
}
/**
* Execute a shell command and return stdout and stderr.
* @param {string} command - Command to execute
* @returns {Promise<[string, string]>} Tuple of [stdout, stderr]
* @throws {Error} If command execution fails
*/
async runCommand(command: string): Promise<[string, string]> {
const response = await this.sendCommand('run_command', { command });
if (!response.success) {
@@ -330,6 +530,11 @@ export class MacOSComputerInterface extends BaseComputerInterface {
}
// Accessibility Actions
/**
* Get the accessibility tree of the current screen.
* @returns {Promise<AccessibilityNode>} Root accessibility node
* @throws {Error} If unable to get accessibility tree
*/
async getAccessibilityTree(): Promise<AccessibilityNode> {
const response = await this.sendCommand('get_accessibility_tree');
if (!response.success) {
@@ -340,6 +545,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return response as unknown as AccessibilityNode;
}
/**
* Convert coordinates to screen coordinates.
* @param {number} x - X coordinate to convert
* @param {number} y - Y coordinate to convert
* @returns {Promise<[number, number]>} Converted screen coordinates as [x, y]
* @throws {Error} If coordinate conversion fails
*/
async toScreenCoordinates(x: number, y: number): Promise<[number, number]> {
const response = await this.sendCommand('to_screen_coordinates', { x, y });
if (!response.success || !response.coordinates) {
@@ -348,6 +560,13 @@ export class MacOSComputerInterface extends BaseComputerInterface {
return response.coordinates as [number, number];
}
/**
* Convert coordinates to screenshot coordinates.
* @param {number} x - X coordinate to convert
* @param {number} y - Y coordinate to convert
* @returns {Promise<[number, number]>} Converted screenshot coordinates as [x, y]
* @throws {Error} If coordinate conversion fails
*/
async toScreenshotCoordinates(
x: number,
y: number
+201
View File
@@ -0,0 +1,201 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Customizing Your ComputerAgent\n",
"\n",
"This notebook demonstrates four practical ways to increase the capabilities and success rate of your `ComputerAgent` in the Agent SDK:\n",
"\n",
"1. Simple: Prompt engineering (via optional `instructions`)\n",
"2. Easy: Tools (function tools and custom computer tools)\n",
"3. Intermediate: Callbacks\n",
"4. Expert: Custom `@register_agent` loops\n",
"\n",
"> Tip: The same patterns work in scripts and services — the notebook just makes it easy to iterate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"We'll import `ComputerAgent`, a simple Docker-based computer, and some utilities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"from agent.agent import ComputerAgent\n",
"from agent.callbacks import LoggingCallback\n",
"from computer import Computer\n",
"\n",
"computer = Computer(\n",
" os_type=\"linux\",\n",
" provider_type=\"docker\",\n",
" image=\"trycua/cua-ubuntu:latest\",\n",
" name=\"my-cua-container\"\n",
")\n",
"\n",
"await computer.run() # Launch & connect to Docker container"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Simple: Prompt engineering\n",
"\n",
"You can guide your agent with system-like `instructions`.\n",
"\n",
"Under the hood, `ComputerAgent(instructions=...)` adds a `PromptInstructionsCallback` that prepends a user message before each LLM call.\n",
"\n",
"This mirrors the recommended snippet in code:\n",
"\n",
"```python\n",
"effective_input = full_input\n",
"if instructions:\n",
" effective_input = [{\"role\": \"user\", \"content\": instructions}] + full_input\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"instructions = (\n",
" \"You are a meticulous software operator. Prefer safe, deterministic actions. \"\n",
" \"Always confirm via on-screen text before proceeding.\"\n",
")\n",
"agent = ComputerAgent(\n",
" model=\"openai/computer-use-preview\",\n",
" tools=[computer],\n",
" instructions=instructions,\n",
" callbacks=[LoggingCallback(level=logging.INFO)],\n",
")\n",
"messages = [\n",
" {\"role\": \"user\", \"content\": \"Open the settings and turn on dark mode.\"}\n",
"]\n",
"\n",
"# In notebooks, you may want to consume the async generator\n",
"import asyncio\n",
"async def run_once():\n",
" async for chunk in agent.run(messages):\n",
" # Print any assistant text outputs\n",
" for item in chunk.get(\"output\", []):\n",
" if item.get(\"type\") == \"message\":\n",
" for c in item.get(\"content\", []):\n",
" if c.get(\"text\"):\n",
" print(c.get(\"text\"))\n",
"\n",
"await run_once()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Easy: Tools\n",
"\n",
"Add function tools to expose deterministic capabilities. Tools are auto-extracted to schemas and callable by the agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_percentage(numerator: float, denominator: float) -> str:\n",
" \"\"\"Calculate a percentage string.\n",
"\n",
" Args:\n",
" numerator: Numerator value\n",
" denominator: Denominator value\n",
" Returns:\n",
" A formatted percentage string (e.g., '75.00%').\n",
" \"\"\"\n",
" if denominator == 0:\n",
" return \"0.00%\"\n",
" return f\"{(numerator/denominator)*100:.2f}%\"\n",
"\n",
"agent_with_tool = ComputerAgent(\n",
" model=\"openai/computer-use-preview\",\n",
" tools=[computer, calculate_percentage],\n",
" instructions=\"When doing math, prefer the `calculate_percentage` tool when relevant.\",\n",
")\n"
]
},
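{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, exercise the tool-enabled agent with the same streaming pattern shown in section 1. This is only a sketch: the math prompt is an illustration, and whether the model actually calls `calculate_percentage` is up to the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reuse the streaming pattern from section 1 with the tool-enabled agent.\n",
"math_messages = [\n",
"    {\"role\": \"user\", \"content\": \"What percentage is 42 of 120? Use the calculate_percentage tool.\"}\n",
"]\n",
"\n",
"async for chunk in agent_with_tool.run(math_messages):\n",
"    for item in chunk.get(\"output\", []):\n",
"        if item.get(\"type\") == \"message\":\n",
"            for c in item.get(\"content\", []):\n",
"                if c.get(\"text\"):\n",
"                    print(c.get(\"text\"))\n"
]
},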
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Intermediate: Callbacks\n",
"\n",
"Callbacks offer lifecycle hooks. For example, limit recent images or record trajectories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from agent.callbacks import ImageRetentionCallback, TrajectorySaverCallback\n",
"\n",
"agent_with_callbacks = ComputerAgent(\n",
" model=\"anthropic/claude-3-5-sonnet-20241022\",\n",
" tools=[computer],\n",
" callbacks=[\n",
" ImageRetentionCallback(only_n_most_recent_images=3),\n",
" TrajectorySaverCallback(\"./trajectories\"),\n",
" ],\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4) Expert: Custom `@register_agent`\n",
"\n",
"Register custom agent configs that implement `predict_step` (and optionally `predict_click`). This gives you full control over prompting, message shaping, and tool wiring.\n",
"\n",
"See: `libs/python/agent/agent/loops/` for concrete examples."
]
},
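{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal, hedged sketch of what a custom agent config could look like. Only the `@register_agent` decorator and the `predict_step` / `predict_click` method names come from the description above; the import path, decorator arguments, and exact signatures are assumptions, so mirror a concrete config in `libs/python/agent/agent/loops/` before adapting it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the import path, decorator argument, and method signatures below\n",
"# are assumptions; check libs/python/agent/agent/loops/ for the real interfaces.\n",
"from agent import register_agent  # assumed import path\n",
"\n",
"@register_agent(models=r\"my-org/.*\")  # assumed: decorator takes a model-name pattern\n",
"class MyAgentConfig:\n",
"    async def predict_step(self, messages, model, tools, **kwargs):\n",
"        # Shape the prompt, call your model of choice, and return output items.\n",
"        guided = [{\"role\": \"user\", \"content\": \"Follow the runbook strictly.\"}] + messages\n",
"        # ... call the model with `guided` here ...\n",
"        return {\"output\": [{\"type\": \"message\", \"role\": \"assistant\",\n",
"                            \"content\": [{\"type\": \"output_text\", \"text\": \"...\"}]}]}\n",
"\n",
"    async def predict_click(self, image_b64, instruction, **kwargs):\n",
"        # Optional grounding hook: return (x, y) for a click request (signature assumed).\n",
"        return (0, 0)\n"
]
},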
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"- Start with `instructions` for fast wins.\n",
"- Add function tools for determinism and reliability.\n",
"- Use callbacks to manage cost, logs, and safety.\n",
"- Build custom loops for specialized domains."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
+280
View File
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a5d6b2ed",
"metadata": {},
"source": [
"# Computer-Use Agents SOTA Challenge\n",
"\n",
"Congrats on joining the Cua + HUD hackathon at Hack The North 2025!\n",
"\n",
"This notebook will show you how to create a computer use agent with Cua and evaluate it using HUD."
]
},
{
"cell_type": "markdown",
"id": "cebe8572",
"metadata": {},
"source": [
"## 💻 Prequisites\n",
"\n",
"Clone the Cua repository and install project dependencies."
]
},
{
"cell_type": "markdown",
"id": "3d7c38f9",
"metadata": {},
"source": [
"The easiest way to get started is by getting set up with the Cua development repository.\n",
"\n",
"Install [Docker](https://www.docker.com/products/docker-desktop/) and [pdm](https://pdm-project.org/en/latest/#recommended-installation-method).\n",
"\n",
"Clone the Cua repository:\n",
"\n",
"`git clone https://github.com/trycua/cua`\n",
"\n",
"Install the project dependencies:\n",
"\n",
"`cd cua && pdm install`\n",
"\n",
"Now, you should be able to run the `notebooks/hud_hackathon.ipynb` notebook in VS Code with the `.venv` virtual environment selected."
]
},
{
"cell_type": "markdown",
"id": "19f92431",
"metadata": {},
"source": [
"## ☁️ Connect to cloud services\n",
"\n",
"Create a free HUD accounts and load your API keys. "
]
},
{
"cell_type": "markdown",
"id": "47171dc3",
"metadata": {},
"source": [
"1. Create a HUD account at https://www.hud.so/\n",
"4. Create a .env file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1757f145",
"metadata": {},
"outputs": [],
"source": [
"# Create a .env file if it doesn't exist\n",
"\n",
"ENV_TEMPLATE = \"\"\"# Required environment variables:\n",
"HUD_API_KEY=\n",
"\n",
"# Any LLM provider will work:\n",
"ANTHROPIC_API_KEY=\n",
"OPENAI_API_KEY=\n",
"\"\"\"\n",
"\n",
"import os\n",
"if not os.path.exists(\".env\"):\n",
" open(\".env\", \"w\").write(ENV_TEMPLATE)\n",
" print(\"A .env file was created! Fill in the empty values.\")"
]
},
{
"cell_type": "markdown",
"id": "0949908d",
"metadata": {},
"source": [
"5. Fill in all missing values in the .env file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f23828d",
"metadata": {},
"outputs": [],
"source": [
"# Read the .env file\n",
"# HUD requires the .env file to be in the same directory\n",
"\n",
"from dotenv import load_dotenv\n",
"load_dotenv(dotenv_path='.env', override=True)\n",
"\n",
"assert os.getenv(\"HUD_API_KEY\")"
]
},
{
"cell_type": "markdown",
"id": "5c8bef64",
"metadata": {},
"source": [
"## 🤖 Create a computer use agent\n",
"\n",
"Create and a computer use agent using the Cua SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd4393b0",
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"from pathlib import Path\n",
"from agent import ComputerAgent\n",
"\n",
"# Here you can set the model and tools for your agent.\n",
"# Computer use models: https://www.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents\n",
"# Composed agent models: https://www.trycua.com/docs/agent-sdk/supported-agents/composed-agents\n",
"# Custom tools: https://www.trycua.com/docs/agent-sdk/custom-tools\n",
"agent_config = {\n",
" \"model\": \"openai/computer-use-preview\",\n",
" \"trajectory_dir\": str(Path(\"trajectories\")),\n",
" \"only_n_most_recent_images\": 3,\n",
" \"verbosity\": logging.INFO\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "a07b09ee",
"metadata": {},
"source": [
"## 🖱️ Test your agent\n",
"\n",
"Run your agent on a test scenario in a Docker container."
]
},
{
"cell_type": "markdown",
"id": "12b9c22c",
"metadata": {},
"source": [
"Make sure Docker is running to launch the computer.\n",
"\n",
"You can view the live VNC stream from the Docker container at `http://localhost:8006/`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a210e959",
"metadata": {},
"outputs": [],
"source": [
"from computer import Computer, VMProviderType\n",
"import webbrowser\n",
"\n",
"# Connect to your existing cloud container\n",
"computer = Computer(\n",
" os_type=\"linux\",\n",
" provider_type=VMProviderType.DOCKER,\n",
" verbosity=logging.INFO\n",
")\n",
"await computer.run()\n",
"\n",
"agent_config[\"tools\"] = [ computer ]\n",
"\n",
"webbrowser.open(\"http://localhost:8006/\", new=0, autoraise=True)"
]
},
{
"cell_type": "markdown",
"id": "87a307e3",
"metadata": {},
"source": [
"Try running the computer use agent on a simple task.\n",
"\n",
"Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a32ea8",
"metadata": {},
"outputs": [],
"source": [
"# Create agent\n",
"agent = ComputerAgent(**agent_config)\n",
"\n",
"tasks = [\n",
" \"Open the web browser and search for a repository named trycua/cua on GitHub.\"\n",
"]\n",
"\n",
"for i, task in enumerate(tasks):\n",
" print(f\"\\nExecuting task {i}/{len(tasks)}: {task}\")\n",
" async for result in agent.run(task):\n",
" print(result)\n",
" pass\n",
"\n",
" print(f\"\\n✅ Task {i+1}/{len(tasks)} completed: {task}\")"
]
},
{
"cell_type": "markdown",
"id": "eb4edbb5",
"metadata": {},
"source": [
"## 🧐 Benchmark your agent\n",
"\n",
"Test your agent's performance on a selection of tasks from the OSWorld benchmark."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6bf0887e",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from pprint import pprint\n",
"from agent.integrations.hud import run_full_dataset\n",
"\n",
"job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
"\n",
"# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
"# See the documentation here: https://docs.trycua.com/docs/agent-sdk/integrations/hud#running-a-full-dataset\n",
"results = await run_full_dataset(\n",
" dataset=\"ddupont/OSWorld-Tiny-Public\",\n",
" job_name=job_name,\n",
" **agent_config,\n",
" max_concurrent=20,\n",
" max_steps=50,\n",
" #split=\"train[:5]\"\n",
")\n",
"\n",
"# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
"print(f\"Job: {job_name}\")\n",
"print(f\"Total results: {len(results)}\")\n",
"pprint(results[:3])"
]
},
{
"cell_type": "markdown",
"id": "5b89a103",
"metadata": {},
"source": [
"## 🦾 Improve your agent\n",
"\n",
"To improve your agent for OSWorld-Verified, experiment with different models and add custom tools that fit your use case. You can also dive into the ComputerAgent source code to design an improved version or subclass tailored to your needs.\n",
"\n",
"Learn more about [Customizing Your ComputerAgent](https://docs.trycua.com/docs/agent-sdk/customizing-computeragent) in the docs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
+286
View File
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a5d6b2ed",
"metadata": {},
"source": [
"# Computer-Use Agents SOTA Challenge\n",
"\n",
"Congrats on joining the Cua + HUD hackathon at Hack The North 2025!\n",
"\n",
"This notebook will show you how to create a computer use agent with Cua and evaluate it using HUD."
]
},
{
"cell_type": "markdown",
"id": "cebe8572",
"metadata": {},
"source": [
"## 💻 Prequisites\n",
"\n",
"Clone the Cua repository and install project dependencies."
]
},
{
"cell_type": "markdown",
"id": "3d7c38f9",
"metadata": {},
"source": [
"The easiest way to get started is by getting set up with the Cua development repository.\n",
"\n",
"First, clone the Cua repository:\n",
"\n",
"`git clone https://github.com/trycua/cua`\n",
"\n",
"Install [pdm](https://pdm-project.org/en/latest/#recommended-installation-method).\n",
"\n",
"Install the project dependencies:\n",
"\n",
"`cd cua && pdm install`\n",
"\n",
"Now, you should be able to run the `notebooks/hud_hackathon.ipynb` notebook in VS Code with the `.venv` virtual environment selected."
]
},
{
"cell_type": "markdown",
"id": "19f92431",
"metadata": {},
"source": [
"## ☁️ Connect to cloud services\n",
"\n",
"Create Cua and HUD accounts and load your API keys. "
]
},
{
"cell_type": "markdown",
"id": "47171dc3",
"metadata": {},
"source": [
"1. Create a Cua account at https://www.trycua.com/\n",
"2. Start a small Cua container at https://www.trycua.com/dashboard/containers (If you need credits, ask us!)\n",
"3. Create a HUD account at https://www.hud.so/\n",
"4. Create a .env file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1757f145",
"metadata": {},
"outputs": [],
"source": [
"# Create a .env file if it doesn't exist\n",
"\n",
"ENV_TEMPLATE = \"\"\"# Required environment variables:\n",
"CUA_API_KEY=\n",
"CUA_CONTAINER_NAME=\n",
"HUD_API_KEY=\n",
"\n",
"# Any LLM provider will work:\n",
"ANTHROPIC_API_KEY=\n",
"OPENAI_API_KEY=\n",
"\"\"\"\n",
"\n",
"import os\n",
"if not os.path.exists(\".env\"):\n",
" open(\".env\", \"w\").write(ENV_TEMPLATE)\n",
" print(\"A .env file was created! Fill in the empty values.\")"
]
},
{
"cell_type": "markdown",
"id": "0949908d",
"metadata": {},
"source": [
"5. Fill in all missing values in the .env file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f23828d",
"metadata": {},
"outputs": [],
"source": [
"# Read the .env file\n",
"# HUD requires the .env file to be in the same directory\n",
"\n",
"from dotenv import load_dotenv\n",
"load_dotenv(dotenv_path='.env', override=True)\n",
"\n",
"assert os.getenv(\"CUA_API_KEY\")\n",
"assert os.getenv(\"CUA_CONTAINER_NAME\")\n",
"assert os.getenv(\"HUD_API_KEY\")"
]
},
{
"cell_type": "markdown",
"id": "5c8bef64",
"metadata": {},
"source": [
"## 🤖 Create a computer use agent\n",
"\n",
"Create and a computer use agent using the Cua SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd4393b0",
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"from pathlib import Path\n",
"from agent import ComputerAgent\n",
"\n",
"# Here you can set the model and tools for your agent.\n",
"# Computer use models: https://www.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents\n",
"# Composed agent models: https://www.trycua.com/docs/agent-sdk/supported-agents/composed-agents\n",
"# Custom tools: https://www.trycua.com/docs/agent-sdk/custom-tools\n",
"agent_config = {\n",
" \"model\": \"openai/computer-use-preview\",\n",
" \"trajectory_dir\": str(Path(\"trajectories\")),\n",
" \"only_n_most_recent_images\": 3,\n",
" \"verbosity\": logging.INFO\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "a07b09ee",
"metadata": {},
"source": [
"## 🖱️ Test your agent\n",
"\n",
"Run your agent on a test scenario in a Cua cloud container."
]
},
{
"cell_type": "markdown",
"id": "12b9c22c",
"metadata": {},
"source": [
"Connect to an existing cloud container through the Cua SDK.\n",
"\n",
"You can access the computer through VNC on the [Cua Dashboard](https://www.trycua.com/dashboard)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a210e959",
"metadata": {},
"outputs": [],
"source": [
"from computer import Computer, VMProviderType\n",
"\n",
"# Connect to your existing cloud container\n",
"computer = Computer(\n",
" os_type=\"linux\",\n",
" provider_type=VMProviderType.CLOUD,\n",
" name=os.getenv(\"CUA_CONTAINER_NAME\") or \"\",\n",
" api_key=os.getenv(\"CUA_API_KEY\"),\n",
" verbosity=logging.INFO\n",
")\n",
"\n",
"agent_config[\"tools\"] = [ computer ]"
]
},
{
"cell_type": "markdown",
"id": "87a307e3",
"metadata": {},
"source": [
"Try running the computer use agent on a simple task.\n",
"\n",
"To view a replay of the agent's actions, upload the trajectory to the [trajectory viewer](https://www.trycua.com/trajectory-viewer).\n",
"\n",
"Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3a32ea8",
"metadata": {},
"outputs": [],
"source": [
"# Create agent\n",
"agent = ComputerAgent(**agent_config)\n",
"\n",
"tasks = [\n",
" \"Open the web browser and search for a repository named trycua/cua on GitHub.\"\n",
"]\n",
"\n",
"for i, task in enumerate(tasks):\n",
" print(f\"\\nExecuting task {i}/{len(tasks)}: {task}\")\n",
" async for result in agent.run(task):\n",
" print(result)\n",
" pass\n",
"\n",
" print(f\"\\n✅ Task {i+1}/{len(tasks)} completed: {task}\")"
]
},
{
"cell_type": "markdown",
"id": "eb4edbb5",
"metadata": {},
"source": [
"## 🧐 Benchmark your agent\n",
"\n",
"Test your agent's performance on a selection of tasks from the OSWorld benchmark."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6bf0887e",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from pprint import pprint\n",
"from agent.integrations.hud import run_full_dataset\n",
"\n",
"job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
"\n",
"# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
"# See the documentation here: https://docs.trycua.com/docs/agent-sdk/integrations/hud#running-a-full-dataset\n",
"results = await run_full_dataset(\n",
" dataset=\"ddupont/OSWorld-Tiny-Public\",\n",
" job_name=job_name,\n",
" **agent_config,\n",
" max_concurrent=20,\n",
" max_steps=50,\n",
" #split=\"train[:5]\"\n",
")\n",
"\n",
"# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
"print(f\"Job: {job_name}\")\n",
"print(f\"Total results: {len(results)}\")\n",
"pprint(results[:3])"
]
},
{
"cell_type": "markdown",
"id": "5b89a103",
"metadata": {},
"source": [
"## 🦾 Improve your agent\n",
"\n",
"To improve your agent for OSWorld-Verified, experiment with different models and add custom tools that fit your use case. You can also dive into the ComputerAgent source code to design an improved version or subclass tailored to your needs.\n",
"\n",
"Learn more about [Customizing Your ComputerAgent](https://docs.trycua.com/docs/agent-sdk/customizing-computeragent) in the docs."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Generated
+6424
View File
File diff suppressed because it is too large
+2 -3
View File
@@ -6,6 +6,7 @@ requires = ["pdm-backend"]
authors = [{ name = "TryCua", email = "gh@trycua.com" }]
dependencies = [
"openai<1.100.0",
"anthropic>=0.67.0",
]
description = "CUA (Computer Use Agent) mono-repo"
license = { text = "MIT" }
@@ -40,6 +41,7 @@ dev = [
"mypy>=1.10.0",
"ruff>=0.9.2",
"types-requests>=2.31.0",
"hud-python[agent]==0.4.26"
]
docs = ["mkdocs-material>=9.2.0", "mkdocs>=1.5.0"]
test = [
@@ -54,9 +56,6 @@ test = [
[tool.pdm.resolution]
respect-source-order = true
[tool.pdm.resolution.overrides]
hud-python = "0.4.12"
[tool.black]
line-length = 100
target-version = ["py311"]