Merge pull request #80 from trycua/docs/operator-blog-1

Add operator blogpost - part 1
Authored by f-trycua on 2025-04-01 07:48:39 -07:00, committed via GitHub.

@@ -0,0 +1,316 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Build Your Own Operator on macOS - Part 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Welcome to Part 1 of our tutorial series on building a Computer Use Automation (CUA) operator using OpenAI. For the complete guide, check out our [full blog post](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll learn how to combine the OpenAI Responses API (using the `computer-use-preview` model) with the CUA macOS interface to automate tasks. Instead of using Playwright like in the original [OpenAI CUA docs](https://platform.openai.com/docs/guides/tools-computer-use), we use the CUA computer to control a macOS sandbox (via the Lume CLI) and execute actions such as clicking and typing. The loop is as follows:\n",
"\n",
"1. **Initialize the CUA sandbox** using the `cua-computer` py package.\n",
"2. **Capture an initial screenshot** of the sandbox.\n",
"3. **Send the screenshot and a user prompt** to the OpenAI Responses API.\n",
"4. **Receive a `computer_call` action** from the API.\n",
"5. **Map and execute the action** on the CUA interface (e.g. move the cursor and click, type text, etc.).\n",
"6. **Capture a new screenshot** and repeat the loop as needed.\n",
"\n",
"Note: For this example, you must have your OpenAI API key set up and the Lume daemon running with a downloaded macOS VM image (see installation instructions in the CUA documentation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"- Install the `cua-computer` package and set up the Lume daemon as described in its documentation.\n",
"- Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).\n",
"- This notebook uses asynchronous Python (async/await)."
]
},
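{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before continuing, it can help to confirm that the Lume CLI is installed and a VM image is available. A minimal sanity check, assuming your installed Lume version provides the `lume ls` listing command (check `lume --help` if yours differs):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (assumption: `lume ls` lists local VMs in your Lume version).\n",
"# If this fails, install the Lume CLI and start the daemon per the CUA docs.\n",
"!lume ls"
]
},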
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install cua-computer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install openai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import required modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"import base64\n",
"import openai\n",
"\n",
"from computer import Computer\n",
"\n",
"# Ensure your OpenAI API key is set\n",
"openai.api_key = input(\"Enter your OpenAI API key: \")"
]
},
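{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells below use the module-level `openai.responses.create(...)` convenience API. Equivalently, you can construct an explicit client object, the style the OpenAI Python SDK (v1+) generally favors; a brief sketch (the rest of the notebook keeps the module-level style):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Explicit client, equivalent to the module-level calls used below.\n",
"# Passing the key directly works even without OPENAI_API_KEY in the env.\n",
"client = OpenAI(api_key=openai.api_key)"
]
},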
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mapping OpenAI Actions to CUA Methods\n",
"\n",
"The following helper function converts a `computer_call` action from the OpenAI Responses API into corresponding commands on the CUA interface. For example, if the API instructs a `click` action, we move the cursor and perform a left click on the Cua Sandbox. We will use the computer interface to execute the actions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"async def execute_action(computer, action):\n",
" action_type = action.type\n",
" \n",
" if action_type == \"click\":\n",
" x = action.x\n",
" y = action.y\n",
" button = action.button\n",
" print(f\"Executing click at ({x}, {y}) with button '{button}'\")\n",
" await computer.interface.move_cursor(x, y)\n",
" if button == \"right\":\n",
" await computer.interface.right_click()\n",
" else:\n",
" await computer.interface.left_click()\n",
" \n",
" elif action_type == \"type\":\n",
" text = action.text\n",
" print(f\"Typing text: {text}\")\n",
" await computer.interface.type_text(text)\n",
" \n",
" elif action_type == \"scroll\":\n",
" x = action.x\n",
" y = action.y\n",
" scroll_x = action.scroll_x\n",
" scroll_y = action.scroll_y\n",
" print(f\"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})\")\n",
" await computer.interface.move_cursor(x, y)\n",
" await computer.interface.scroll(scroll_y) # Assuming CUA provides a scroll method\n",
" \n",
" elif action_type == \"keypress\":\n",
" keys = action.keys\n",
" for key in keys:\n",
" print(f\"Pressing key: {key}\")\n",
" # Map common key names to CUA equivalents\n",
" if key.lower() == \"enter\":\n",
" await computer.interface.press_key(\"return\")\n",
" elif key.lower() == \"space\":\n",
" await computer.interface.press_key(\"space\")\n",
" else:\n",
" await computer.interface.press_key(key)\n",
" \n",
" elif action_type == \"wait\":\n",
" wait_time = action.time\n",
" print(f\"Waiting for {wait_time} seconds\")\n",
" await asyncio.sleep(wait_time)\n",
" \n",
" elif action_type == \"screenshot\":\n",
" print(\"Taking screenshot\")\n",
" # This is handled automatically in the main loop, but we can take an extra one if requested\n",
" screenshot = await computer.interface.screenshot()\n",
" return screenshot\n",
" \n",
" else:\n",
" print(f\"Unrecognized action: {action_type}\")"
]
},
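{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the mapping (not part of the main loop), you can hand `execute_action` a hand-built object with the same shape as an API action. The `SimpleNamespace` stand-in below is purely hypothetical; in the real loop the action comes from the Responses API, and `computer` must be a running `Computer` instance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from types import SimpleNamespace\n",
"\n",
"# Hypothetical stand-in mirroring the shape of an OpenAI `computer_call` action.\n",
"mock_action = SimpleNamespace(type=\"click\", x=512, y=384, button=\"left\")\n",
"\n",
"# Uncomment inside a live session with a running `computer` instance:\n",
"# await execute_action(computer, mock_action)"
]
},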
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The CUA/OpenAI Loop\n",
"\n",
"This cell defines a loop that:\n",
"\n",
"1. Initializes the CUA computer instance (connecting to a macOS sandbox).\n",
"2. Captures a screenshot of the current state.\n",
"3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` tool.\n",
"4. Processes the returned `computer_call` action and executes it using our helper function.\n",
"5. Captures an updated screenshot after the action (this example runs one iteration, but you can wrap it in a loop).\n",
"\n",
"For a full loop, you would repeat these steps until no further actions are returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async def cua_openai_loop():\n",
" # Initialize the CUA computer instance (macOS sandbox)\n",
" async with Computer(\n",
" display=\"1024x768\",\n",
" memory=\"4GB\",\n",
" cpu=\"2\",\n",
" os=\"macos\"\n",
" ) as computer:\n",
" await computer.run()\n",
" \n",
" # Capture the initial screenshot\n",
" screenshot = await computer.interface.screenshot()\n",
" screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')\n",
"\n",
" # Initial request to start the loop\n",
" response = openai.responses.create(\n",
" model=\"computer-use-preview\",\n",
" tools=[{\n",
" \"type\": \"computer_use_preview\",\n",
" \"display_width\": 1024,\n",
" \"display_height\": 768,\n",
" \"environment\": \"mac\"\n",
" }],\n",
" input=[\n",
" { # type: ignore\n",
" \"role\": \"user\", \n",
" \"content\": [\n",
" {\"type\": \"input_text\", \"text\": \"Open Safari, download and install Cursor.\"},\n",
" {\"type\": \"input_image\", \"image_url\": f\"data:image/png;base64,{screenshot_base64}\"}\n",
" ]\n",
" }\n",
" ],\n",
" truncation=\"auto\"\n",
" )\n",
"\n",
" # Continue the loop until no more computer_call actions\n",
" while True:\n",
" # Check for computer_call actions\n",
" computer_calls = [item for item in response.output if item and item.type == \"computer_call\"]\n",
" if not computer_calls:\n",
" print(\"No more computer calls. Loop complete.\")\n",
" break\n",
"\n",
" # Get the first computer call\n",
" call = computer_calls[0]\n",
" last_call_id = call.call_id\n",
" action = call.action\n",
" print(\"Received action from OpenAI Responses API:\", action)\n",
"\n",
" # Handle any pending safety checks\n",
" if call.pending_safety_checks:\n",
" print(\"Safety checks pending:\", call.pending_safety_checks)\n",
" # In a real implementation, you would want to get user confirmation here\n",
" acknowledged_checks = call.pending_safety_checks\n",
" else:\n",
" acknowledged_checks = []\n",
"\n",
" # Execute the action\n",
" await execute_action(computer, action)\n",
" await asyncio.sleep(1) # Allow time for changes to take effect\n",
"\n",
" # Capture new screenshot after action\n",
" new_screenshot = await computer.interface.screenshot()\n",
" new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')\n",
"\n",
" # Send the screenshot back as computer_call_output\n",
" response = openai.responses.create(\n",
" model=\"computer-use-preview\",\n",
" previous_response_id=response.id, # Link to previous response\n",
" tools=[{\n",
" \"type\": \"computer_use_preview\",\n",
" \"display_width\": 1024,\n",
" \"display_height\": 768,\n",
" \"environment\": \"mac\"\n",
" }],\n",
" input=[{ # type: ignore\n",
" \"type\": \"computer_call_output\",\n",
" \"call_id\": last_call_id,\n",
" \"acknowledged_safety_checks\": acknowledged_checks,\n",
" \"output\": {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/png;base64,{new_screenshot_base64}\"\n",
" }\n",
" }],\n",
" truncation=\"auto\"\n",
" )\n",
"\n",
" # End the session\n",
" await computer.stop()\n",
"\n",
"# Run the loop\n",
"await cua_openai_loop()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Remarks\n",
"\n",
"This notebook demonstrates a single iteration of a CUA/OpenAI loop where:\n",
"\n",
"- A macOS sandbox is controlled using the CUA interface.\n",
"- A screenshot and prompt are sent to the OpenAI Responses API.\n",
"- The returned action (e.g. a click or type command) is executed via the CUA interface.\n",
"\n",
"In a production setting, you would wrap the action-response cycle in a loop, handling multiple actions and safety checks as needed."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "cua",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}