Merge pull request #80 from trycua/docs/operator-blog-1

Add operator blogpost - part 1
Authored by f-trycua on 2025-04-01 07:48:39 -07:00, committed via GitHub.

@@ -0,0 +1,316 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Build Your Own Operator on macOS - Part 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Welcome to Part 1 of our tutorial series on building a Computer Use Automation (CUA) operator using OpenAI. For the complete guide, check out our [full blog post](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll learn how to combine the OpenAI Responses API (using the `computer-use-preview` model) with the CUA macOS interface to automate tasks. Instead of using Playwright like in the original [OpenAI CUA docs](https://platform.openai.com/docs/guides/tools-computer-use), we use the CUA computer to control a macOS sandbox (via the Lume CLI) and execute actions such as clicking and typing. The loop is as follows:\n",
"\n",
"1. **Initialize the CUA sandbox** using the `cua-computer` py package.\n",
"2. **Capture an initial screenshot** of the sandbox.\n",
"3. **Send the screenshot and a user prompt** to the OpenAI Responses API.\n",
"4. **Receive a `computer_call` action** from the API.\n",
"5. **Map and execute the action** on the CUA interface (e.g. move the cursor and click, type text, etc.).\n",
"6. **Capture a new screenshot** and repeat the loop as needed.\n",
"\n",
"Note: For this example, you must have your OpenAI API key set up and the Lume daemon running with a downloaded macOS VM image (see installation instructions in the CUA documentation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"- Install the `cua-computer` package and set up the Lume daemon as described in its documentation.\n",
"- Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).\n",
"- This notebook uses asynchronous Python (async/await)."
]
},
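{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before continuing, it can help to confirm that the Lume CLI is installed and a VM image is available. A minimal sanity check, assuming your installed Lume version provides the `lume ls` listing command (check `lume --help` if yours differs):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (assumption: `lume ls` lists local VMs in your Lume version).\n",
"# If this fails, install the Lume CLI and start the daemon per the CUA docs.\n",
"!lume ls"
]
},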
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install cua-computer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install openai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import required modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"import base64\n",
"import openai\n",
"\n",
"from computer import Computer\n",
"\n",
"# Ensure your OpenAI API key is set\n",
"openai.api_key = input(\"Enter your OpenAI API key: \")"
]
},
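{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells below use the module-level `openai.responses.create(...)` convenience API. Equivalently, you can construct an explicit client object, the style the OpenAI Python SDK (v1+) generally favors; a brief sketch (the rest of the notebook keeps the module-level style):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Explicit client, equivalent to the module-level calls used below.\n",
"# Passing the key directly works even without OPENAI_API_KEY in the env.\n",
"client = OpenAI(api_key=openai.api_key)"
]
},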
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mapping OpenAI Actions to CUA Methods\n",
"\n",
"The following helper function converts a `computer_call` action from the OpenAI Responses API into corresponding commands on the CUA interface. For example, if the API instructs a `click` action, we move the cursor and perform a left click on the Cua Sandbox. We will use the computer interface to execute the actions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"async def execute_action(computer, action):\n",
" action_type = action.type\n",
" \n",
" if action_type == \"click\":\n",
" x = action.x\n",
" y = action.y\n",
" button = action.button\n",
" print(f\"Executing click at ({x}, {y}) with button '{button}'\")\n",
" await computer.interface.move_cursor(x, y)\n",
" if button == \"right\":\n",
" await computer.interface.right_click()\n",
" else:\n",
" await computer.interface.left_click()\n",
" \n",
" elif action_type == \"type\":\n",
" text = action.text\n",
" print(f\"Typing text: {text}\")\n",
" await computer.interface.type_text(text)\n",
" \n",
" elif action_type == \"scroll\":\n",
" x = action.x\n",
" y = action.y\n",
" scroll_x = action.scroll_x\n",
" scroll_y = action.scroll_y\n",
" print(f\"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})\")\n",
" await computer.interface.move_cursor(x, y)\n",
" await computer.interface.scroll(scroll_y) # Assuming CUA provides a scroll method\n",
" \n",
" elif action_type == \"keypress\":\n",
" keys = action.keys\n",
" for key in keys:\n",
" print(f\"Pressing key: {key}\")\n",
" # Map common key names to CUA equivalents\n",
" if key.lower() == \"enter\":\n",
" await computer.interface.press_key(\"return\")\n",
" elif key.lower() == \"space\":\n",
" await computer.interface.press_key(\"space\")\n",
" else:\n",
" await computer.interface.press_key(key)\n",
" \n",
" elif action_type == \"wait\":\n",
" wait_time = action.time\n",
" print(f\"Waiting for {wait_time} seconds\")\n",
" await asyncio.sleep(wait_time)\n",
" \n",
" elif action_type == \"screenshot\":\n",
" print(\"Taking screenshot\")\n",
" # This is handled automatically in the main loop, but we can take an extra one if requested\n",
" screenshot = await computer.interface.screenshot()\n",
" return screenshot\n",
" \n",
" else:\n",
" print(f\"Unrecognized action: {action_type}\")"
]
},
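{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the mapping (not part of the main loop), you can hand `execute_action` a hand-built object with the same shape as an API action. The `SimpleNamespace` stand-in below is purely hypothetical; in the real loop the action comes from the Responses API, and `computer` must be a running `Computer` instance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from types import SimpleNamespace\n",
"\n",
"# Hypothetical stand-in mirroring the shape of an OpenAI `computer_call` action.\n",
"mock_action = SimpleNamespace(type=\"click\", x=512, y=384, button=\"left\")\n",
"\n",
"# Uncomment inside a live session with a running `computer` instance:\n",
"# await execute_action(computer, mock_action)"
]
},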
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The CUA/OpenAI Loop\n",
"\n",
"This cell defines a loop that:\n",
"\n",
"1. Initializes the CUA computer instance (connecting to a macOS sandbox).\n",
"2. Captures a screenshot of the current state.\n",
"3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` tool.\n",
"4. Processes the returned `computer_call` action and executes it using our helper function.\n",
"5. Captures an updated screenshot after the action (this example runs one iteration, but you can wrap it in a loop).\n",
"\n",
"For a full loop, you would repeat these steps until no further actions are returned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async def cua_openai_loop():\n",
" # Initialize the CUA computer instance (macOS sandbox)\n",
" async with Computer(\n",
" display=\"1024x768\",\n",
" memory=\"4GB\",\n",
" cpu=\"2\",\n",
" os=\"macos\"\n",
" ) as computer:\n",
" await computer.run()\n",
" \n",
" # Capture the initial screenshot\n",
" screenshot = await computer.interface.screenshot()\n",
" screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')\n",
"\n",
" # Initial request to start the loop\n",
" response = openai.responses.create(\n",
" model=\"computer-use-preview\",\n",
" tools=[{\n",
" \"type\": \"computer_use_preview\",\n",
" \"display_width\": 1024,\n",
" \"display_height\": 768,\n",
" \"environment\": \"mac\"\n",
" }],\n",
" input=[\n",
" { # type: ignore\n",
" \"role\": \"user\", \n",
" \"content\": [\n",
" {\"type\": \"input_text\", \"text\": \"Open Safari, download and install Cursor.\"},\n",
" {\"type\": \"input_image\", \"image_url\": f\"data:image/png;base64,{screenshot_base64}\"}\n",
" ]\n",
" }\n",
" ],\n",
" truncation=\"auto\"\n",
" )\n",
"\n",
" # Continue the loop until no more computer_call actions\n",
" while True:\n",
" # Check for computer_call actions\n",
" computer_calls = [item for item in response.output if item and item.type == \"computer_call\"]\n",
" if not computer_calls:\n",
" print(\"No more computer calls. Loop complete.\")\n",
" break\n",
"\n",
" # Get the first computer call\n",
" call = computer_calls[0]\n",
" last_call_id = call.call_id\n",
" action = call.action\n",
" print(\"Received action from OpenAI Responses API:\", action)\n",
"\n",
" # Handle any pending safety checks\n",
" if call.pending_safety_checks:\n",
" print(\"Safety checks pending:\", call.pending_safety_checks)\n",
" # In a real implementation, you would want to get user confirmation here\n",
" acknowledged_checks = call.pending_safety_checks\n",
" else:\n",
" acknowledged_checks = []\n",
"\n",
" # Execute the action\n",
" await execute_action(computer, action)\n",
" await asyncio.sleep(1) # Allow time for changes to take effect\n",
"\n",
" # Capture new screenshot after action\n",
" new_screenshot = await computer.interface.screenshot()\n",
" new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')\n",
"\n",
" # Send the screenshot back as computer_call_output\n",
" response = openai.responses.create(\n",
" model=\"computer-use-preview\",\n",
" previous_response_id=response.id, # Link to previous response\n",
" tools=[{\n",
" \"type\": \"computer_use_preview\",\n",
" \"display_width\": 1024,\n",
" \"display_height\": 768,\n",
" \"environment\": \"mac\"\n",
" }],\n",
" input=[{ # type: ignore\n",
" \"type\": \"computer_call_output\",\n",
" \"call_id\": last_call_id,\n",
" \"acknowledged_safety_checks\": acknowledged_checks,\n",
" \"output\": {\n",
" \"type\": \"input_image\",\n",
" \"image_url\": f\"data:image/png;base64,{new_screenshot_base64}\"\n",
" }\n",
" }],\n",
" truncation=\"auto\"\n",
" )\n",
"\n",
" # End the session\n",
" await computer.stop()\n",
"\n",
"# Run the loop\n",
"await cua_openai_loop()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Remarks\n",
"\n",
"This notebook demonstrates a single iteration of a CUA/OpenAI loop where:\n",
"\n",
"- A macOS sandbox is controlled using the CUA interface.\n",
"- A screenshot and prompt are sent to the OpenAI Responses API.\n",
"- The returned action (e.g. a click or type command) is executed via the CUA interface.\n",
"\n",
"In a production setting, you would wrap the action-response cycle in a loop, handling multiple actions and safety checks as needed."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "cua",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}