mirror of
https://github.com/trycua/computer.git
synced 2026-01-07 14:00:04 -06:00
Merge pull request #80 from trycua/docs/operator-blog-1
Add operator blogpost - part 1
316
notebooks/blog/build-your-own-operator-on-macos-1.ipynb
Normal file
@@ -0,0 +1,316 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build Your Own Operator on macOS - Part 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Welcome to Part 1 of our tutorial series on building a Computer Use Automation (CUA) operator using OpenAI. For the complete guide, check out our [full blog post](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll learn how to combine the OpenAI Responses API (using the `computer-use-preview` model) with the CUA macOS interface to automate tasks. Instead of using Playwright as in the original [OpenAI CUA docs](https://platform.openai.com/docs/guides/tools-computer-use), we use the CUA computer to control a macOS sandbox (via the Lume CLI) and execute actions such as clicking and typing. The loop is as follows:\n",
    "\n",
    "1. **Initialize the CUA sandbox** using the `cua-computer` Python package.\n",
    "2. **Capture an initial screenshot** of the sandbox.\n",
    "3. **Send the screenshot and a user prompt** to the OpenAI Responses API.\n",
    "4. **Receive a `computer_call` action** from the API.\n",
    "5. **Map and execute the action** on the CUA interface (e.g. move the cursor and click, or type text).\n",
    "6. **Capture a new screenshot** and repeat the loop as needed.\n",
    "\n",
    "Note: For this example, you must have your OpenAI API key set up and the Lume daemon running with a downloaded macOS VM image (see the installation instructions in the CUA documentation)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "- Install the `cua-computer` package and set up the Lume daemon as described in its documentation.\n",
    "- Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).\n",
    "- This notebook uses asynchronous Python (async/await)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install the required packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install cua-computer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install openai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import required modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "import base64\n",
    "import openai\n",
    "\n",
    "from computer import Computer\n",
    "\n",
    "# Ensure your OpenAI API key is set\n",
    "openai.api_key = input(\"Enter your OpenAI API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mapping OpenAI Actions to CUA Methods\n",
    "\n",
    "The following helper function converts a `computer_call` action from the OpenAI Responses API into the corresponding commands on the CUA interface. For example, if the API returns a `click` action, we move the cursor to the given coordinates and perform a left or right click in the CUA sandbox. All actions are executed through `computer.interface`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def execute_action(computer, action):\n",
    "    action_type = action.type\n",
    "\n",
    "    if action_type == \"click\":\n",
    "        x = action.x\n",
    "        y = action.y\n",
    "        button = action.button\n",
    "        print(f\"Executing click at ({x}, {y}) with button '{button}'\")\n",
    "        await computer.interface.move_cursor(x, y)\n",
    "        if button == \"right\":\n",
    "            await computer.interface.right_click()\n",
    "        else:\n",
    "            await computer.interface.left_click()\n",
    "\n",
    "    elif action_type == \"type\":\n",
    "        text = action.text\n",
    "        print(f\"Typing text: {text}\")\n",
    "        await computer.interface.type_text(text)\n",
    "\n",
    "    elif action_type == \"scroll\":\n",
    "        x = action.x\n",
    "        y = action.y\n",
    "        scroll_x = action.scroll_x\n",
    "        scroll_y = action.scroll_y\n",
    "        print(f\"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})\")\n",
    "        await computer.interface.move_cursor(x, y)\n",
    "        # Assuming CUA provides a scroll method; only the vertical offset is applied\n",
    "        await computer.interface.scroll(scroll_y)\n",
    "\n",
    "    elif action_type == \"keypress\":\n",
    "        keys = action.keys\n",
    "        for key in keys:\n",
    "            print(f\"Pressing key: {key}\")\n",
    "            # Map common key names to CUA equivalents\n",
    "            if key.lower() == \"enter\":\n",
    "                await computer.interface.press_key(\"return\")\n",
    "            elif key.lower() == \"space\":\n",
    "                await computer.interface.press_key(\"space\")\n",
    "            else:\n",
    "                await computer.interface.press_key(key)\n",
    "\n",
    "    elif action_type == \"wait\":\n",
    "        # The wait action may not carry a duration, so fall back to 2 seconds (an assumed default)\n",
    "        wait_time = getattr(action, \"time\", 2)\n",
    "        print(f\"Waiting for {wait_time} seconds\")\n",
    "        await asyncio.sleep(wait_time)\n",
    "\n",
    "    elif action_type == \"screenshot\":\n",
    "        print(\"Taking screenshot\")\n",
    "        # This is handled automatically in the main loop, but we can take an extra one if requested\n",
    "        screenshot = await computer.interface.screenshot()\n",
    "        return screenshot\n",
    "\n",
    "    else:\n",
    "        print(f\"Unrecognized action: {action_type}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The CUA/OpenAI Loop\n",
    "\n",
    "This cell defines a loop that:\n",
    "\n",
    "1. Initializes the CUA computer instance (connecting to a macOS sandbox).\n",
    "2. Captures a screenshot of the current state.\n",
    "3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` model and the `computer_use_preview` tool.\n",
    "4. Processes the returned `computer_call` action and executes it using our helper function.\n",
    "5. Captures an updated screenshot after the action and sends it back as a `computer_call_output`.\n",
    "\n",
    "Steps 4 and 5 repeat until the API returns no further `computer_call` actions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def cua_openai_loop():\n",
    "    # Initialize the CUA computer instance (macOS sandbox)\n",
    "    async with Computer(\n",
    "        display=\"1024x768\",\n",
    "        memory=\"4GB\",\n",
    "        cpu=\"2\",\n",
    "        os=\"macos\"\n",
    "    ) as computer:\n",
    "        await computer.run()\n",
    "\n",
    "        # Capture the initial screenshot\n",
    "        screenshot = await computer.interface.screenshot()\n",
    "        screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')\n",
    "\n",
    "        # Initial request to start the loop\n",
    "        response = openai.responses.create(\n",
    "            model=\"computer-use-preview\",\n",
    "            tools=[{\n",
    "                \"type\": \"computer_use_preview\",\n",
    "                \"display_width\": 1024,\n",
    "                \"display_height\": 768,\n",
    "                \"environment\": \"mac\"\n",
    "            }],\n",
    "            input=[\n",
    "                {  # type: ignore\n",
    "                    \"role\": \"user\",\n",
    "                    \"content\": [\n",
    "                        {\"type\": \"input_text\", \"text\": \"Open Safari, download and install Cursor.\"},\n",
    "                        {\"type\": \"input_image\", \"image_url\": f\"data:image/png;base64,{screenshot_base64}\"}\n",
    "                    ]\n",
    "                }\n",
    "            ],\n",
    "            truncation=\"auto\"\n",
    "        )\n",
    "\n",
    "        # Continue the loop until no more computer_call actions\n",
    "        while True:\n",
    "            # Check for computer_call actions\n",
    "            computer_calls = [item for item in response.output if item and item.type == \"computer_call\"]\n",
    "            if not computer_calls:\n",
    "                print(\"No more computer calls. Loop complete.\")\n",
    "                break\n",
    "\n",
    "            # Get the first computer call\n",
    "            call = computer_calls[0]\n",
    "            last_call_id = call.call_id\n",
    "            action = call.action\n",
    "            print(\"Received action from OpenAI Responses API:\", action)\n",
    "\n",
    "            # Handle any pending safety checks\n",
    "            if call.pending_safety_checks:\n",
    "                print(\"Safety checks pending:\", call.pending_safety_checks)\n",
    "                # In a real implementation, you would want to get user confirmation here\n",
    "                acknowledged_checks = call.pending_safety_checks\n",
    "            else:\n",
    "                acknowledged_checks = []\n",
    "\n",
    "            # Execute the action\n",
    "            await execute_action(computer, action)\n",
    "            await asyncio.sleep(1)  # Allow time for changes to take effect\n",
    "\n",
    "            # Capture new screenshot after action\n",
    "            new_screenshot = await computer.interface.screenshot()\n",
    "            new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')\n",
    "\n",
    "            # Send the screenshot back as computer_call_output\n",
    "            response = openai.responses.create(\n",
    "                model=\"computer-use-preview\",\n",
    "                previous_response_id=response.id,  # Link to previous response\n",
    "                tools=[{\n",
    "                    \"type\": \"computer_use_preview\",\n",
    "                    \"display_width\": 1024,\n",
    "                    \"display_height\": 768,\n",
    "                    \"environment\": \"mac\"\n",
    "                }],\n",
    "                input=[{  # type: ignore\n",
    "                    \"type\": \"computer_call_output\",\n",
    "                    \"call_id\": last_call_id,\n",
    "                    \"acknowledged_safety_checks\": acknowledged_checks,\n",
    "                    \"output\": {\n",
    "                        \"type\": \"input_image\",\n",
    "                        \"image_url\": f\"data:image/png;base64,{new_screenshot_base64}\"\n",
    "                    }\n",
    "                }],\n",
    "                truncation=\"auto\"\n",
    "            )\n",
    "\n",
    "        # End the session\n",
    "        await computer.stop()\n",
    "\n",
    "# Run the loop\n",
    "await cua_openai_loop()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final Remarks\n",
    "\n",
    "This notebook demonstrates a CUA/OpenAI loop where:\n",
    "\n",
    "- A macOS sandbox is controlled using the CUA interface.\n",
    "- A screenshot and prompt are sent to the OpenAI Responses API.\n",
    "- Each returned action (e.g. a click or type command) is executed via the CUA interface, and a fresh screenshot is sent back until no more `computer_call` items are returned.\n",
    "\n",
    "In a production setting, you would ask the user to confirm pending safety checks rather than acknowledging them automatically, and handle responses that contain more than one action."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cua",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
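
The loop in the notebook acknowledges every pending safety check without asking anyone, and its own comment notes that a real implementation would want user confirmation. A minimal sketch of such a confirmation step, under stated assumptions, might look like the following. The helper names `confirm_safety_checks` and `build_call_output` and the `ask` parameter are hypothetical and are not part of `cua-computer` or the OpenAI SDK. The sketch only assumes that each pending check exposes `id`, `code`, and `message` attributes, and that declining any check should abort the run.

```python
import base64


def confirm_safety_checks(pending_checks, ask=input):
    """Ask the user to approve each pending check and return the approved ones.

    `ask` is a hypothetical hook (defaults to input) so the prompt can be
    replaced in tests or in a GUI. Raises RuntimeError on the first refusal.
    """
    acknowledged = []
    for check in pending_checks:
        answer = ask(f"Safety check {check.code}: {check.message} Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            raise RuntimeError(f"Safety check declined: {check.code}")
        acknowledged.append(
            {"id": check.id, "code": check.code, "message": check.message}
        )
    return acknowledged


def build_call_output(call_id, screenshot_png, acknowledged_checks):
    """Build the computer_call_output item in the same shape the notebook sends."""
    encoded = base64.b64encode(screenshot_png).decode("utf-8")
    return {
        "type": "computer_call_output",
        "call_id": call_id,
        "acknowledged_safety_checks": acknowledged_checks,
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{encoded}",
        },
    }
```

Inside the loop, `confirm_safety_checks(call.pending_safety_checks or [])` could stand in for the automatic acknowledgement, and `build_call_output(last_call_id, new_screenshot, acknowledged_checks)` could produce the item passed in `input=[...]`.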