From 83f79de0d30cdf5fe2f366298b24a8b67d5bfe49 Mon Sep 17 00:00:00 2001
From: f-trycua
Date: Tue, 1 Apr 2025 16:46:40 +0200
Subject: [PATCH] Add operator blogpost - pt 1

---
 .../build-your-own-operator-on-macos-1.ipynb | 316 ++++++++++++++++++
 1 file changed, 316 insertions(+)
 create mode 100644 notebooks/blog/build-your-own-operator-on-macos-1.ipynb

diff --git a/notebooks/blog/build-your-own-operator-on-macos-1.ipynb b/notebooks/blog/build-your-own-operator-on-macos-1.ipynb
new file mode 100644
index 00000000..22db332d
--- /dev/null
+++ b/notebooks/blog/build-your-own-operator-on-macos-1.ipynb
@@ -0,0 +1,316 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Build Your Own Operator on macOS - Part 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Welcome to Part 1 of our tutorial series on building a Computer Use Automation (CUA) operator using OpenAI. For the complete guide, check out our [full blog post](https://www.trycua.com/blog/build-your-own-operator-on-macos-1)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll learn how to combine the OpenAI Responses API (using the `computer-use-preview` model) with the CUA macOS interface to automate tasks. Instead of using Playwright as in the original [OpenAI CUA docs](https://platform.openai.com/docs/guides/tools-computer-use), we use the CUA computer to control a macOS sandbox (via the Lume CLI) and execute actions such as clicking and typing. The loop works as follows:\n",
+    "\n",
+    "1. **Initialize the CUA sandbox** using the `cua-computer` Python package.\n",
+    "2. **Capture an initial screenshot** of the sandbox.\n",
+    "3. **Send the screenshot and a user prompt** to the OpenAI Responses API.\n",
+    "4. **Receive a `computer_call` action** from the API.\n",
+    "5. **Map and execute the action** on the CUA interface (e.g. move the cursor and click, type text, etc.).\n",
+    "6. **Capture a new screenshot** and repeat the loop as needed.\n",
+    "\n",
+    "Note: For this example, you must have your OpenAI API key set up and the Lume daemon running with a downloaded macOS VM image (see the installation instructions in the CUA documentation)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "- Install the `cua-computer` package and set up the Lume daemon as described in its documentation.\n",
+    "- Ensure you have an OpenAI API key (set as an environment variable or in your OpenAI configuration).\n",
+    "- This notebook uses asynchronous Python (async/await)."
+   ]
+  },
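+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, the next cell verifies that an `OPENAI_API_KEY` environment variable is present (the variable name the OpenAI SDK reads by default). If it prints `False`, you can still paste the key interactively in the setup cell further down."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# Optional: the OpenAI SDK falls back to this variable when no key is set explicitly\n",
+    "print(\"OPENAI_API_KEY set:\", bool(os.environ.get(\"OPENAI_API_KEY\")))"
+   ]
+  },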
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Install the required packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install cua-computer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install openai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Import required modules"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "import base64\n",
+    "from getpass import getpass\n",
+    "\n",
+    "import openai\n",
+    "\n",
+    "from computer import Computer\n",
+    "\n",
+    "# Ensure your OpenAI API key is set (getpass avoids echoing the key)\n",
+    "openai.api_key = getpass(\"Enter your OpenAI API key: \")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Mapping OpenAI Actions to CUA Methods\n",
+    "\n",
+    "The following helper function converts a `computer_call` action from the OpenAI Responses API into the corresponding commands on the CUA interface. For example, if the API requests a `click` action, we move the cursor and perform a left click in the CUA sandbox."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "async def execute_action(computer, action):\n",
+    "    action_type = action.type\n",
+    "\n",
+    "    if action_type == \"click\":\n",
+    "        x = action.x\n",
+    "        y = action.y\n",
+    "        button = action.button\n",
+    "        print(f\"Executing click at ({x}, {y}) with button '{button}'\")\n",
+    "        await computer.interface.move_cursor(x, y)\n",
+    "        if button == \"right\":\n",
+    "            await computer.interface.right_click()\n",
+    "        else:\n",
+    "            await computer.interface.left_click()\n",
+    "\n",
+    "    elif action_type == \"type\":\n",
+    "        text = action.text\n",
+    "        print(f\"Typing text: {text}\")\n",
+    "        await computer.interface.type_text(text)\n",
+    "\n",
+    "    elif action_type == \"scroll\":\n",
+    "        x = action.x\n",
+    "        y = action.y\n",
+    "        scroll_x = action.scroll_x\n",
+    "        scroll_y = action.scroll_y\n",
+    "        print(f\"Scrolling at ({x}, {y}) with offsets (scroll_x={scroll_x}, scroll_y={scroll_y})\")\n",
+    "        await computer.interface.move_cursor(x, y)\n",
+    "        # Assuming CUA provides a scroll method; the horizontal offset is ignored here\n",
+    "        await computer.interface.scroll(scroll_y)\n",
+    "\n",
+    "    elif action_type == \"keypress\":\n",
+    "        keys = action.keys\n",
+    "        for key in keys:\n",
+    "            print(f\"Pressing key: {key}\")\n",
+    "            # Map common key names to CUA equivalents\n",
+    "            if key.lower() == \"enter\":\n",
+    "                await computer.interface.press_key(\"return\")\n",
+    "            elif key.lower() == \"space\":\n",
+    "                await computer.interface.press_key(\"space\")\n",
+    "            else:\n",
+    "                await computer.interface.press_key(key)\n",
+    "\n",
+    "    elif action_type == \"wait\":\n",
+    "        wait_time = action.time\n",
+    "        print(f\"Waiting for {wait_time} seconds\")\n",
+    "        await asyncio.sleep(wait_time)\n",
+    "\n",
+    "    elif action_type == \"screenshot\":\n",
+    "        print(\"Taking screenshot\")\n",
+    "        # The main loop already captures a screenshot after every action,\n",
+    "        # but we can take an extra one if the API explicitly requests it\n",
+    "        screenshot = await computer.interface.screenshot()\n",
+    "        return screenshot\n",
+    "\n",
+    "    else:\n",
+    "        print(f\"Unrecognized action: {action_type}\")"
+   ]
+  },
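+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before wiring `execute_action` into the full loop, you can smoke-test it with stand-in action objects. The `SimpleNamespace` values below merely mimic the attribute shape of a Responses API `computer_call` action (`type`, `x`, `y`, `button`, `text`); they are illustrative stubs, not real API objects, and the coordinates are arbitrary. This assumes a `Computer` instance is already running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from types import SimpleNamespace\n",
+    "\n",
+    "async def smoke_test(computer):\n",
+    "    # Stand-ins for API-returned actions; attribute names mirror the\n",
+    "    # computer_call actions handled by execute_action above\n",
+    "    fake_click = SimpleNamespace(type=\"click\", x=100, y=200, button=\"left\")\n",
+    "    await execute_action(computer, fake_click)\n",
+    "\n",
+    "    fake_type = SimpleNamespace(type=\"type\", text=\"hello from the smoke test\")\n",
+    "    await execute_action(computer, fake_type)\n",
+    "\n",
+    "# Uncomment inside a running Computer session:\n",
+    "# await smoke_test(computer)"
+   ]
+  },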
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The CUA/OpenAI Loop\n",
+    "\n",
+    "This cell defines a loop that:\n",
+    "\n",
+    "1. Initializes the CUA computer instance (connecting to a macOS sandbox).\n",
+    "2. Captures a screenshot of the current state.\n",
+    "3. Sends the screenshot (with a user prompt) to the OpenAI Responses API using the `computer-use-preview` tool.\n",
+    "4. Processes the returned `computer_call` action and executes it using our helper function.\n",
+    "5. Captures an updated screenshot after the action and sends it back as a `computer_call_output`.\n",
+    "\n",
+    "Steps 4 and 5 repeat until the API returns no further `computer_call` actions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "async def cua_openai_loop():\n",
+    "    # Initialize the CUA computer instance (macOS sandbox)\n",
+    "    async with Computer(\n",
+    "        display=\"1024x768\",\n",
+    "        memory=\"4GB\",\n",
+    "        cpu=\"2\",\n",
+    "        os=\"macos\"\n",
+    "    ) as computer:\n",
+    "        await computer.run()\n",
+    "\n",
+    "        # Capture the initial screenshot\n",
+    "        screenshot = await computer.interface.screenshot()\n",
+    "        screenshot_base64 = base64.b64encode(screenshot).decode('utf-8')\n",
+    "\n",
+    "        # Initial request to start the loop\n",
+    "        response = openai.responses.create(\n",
+    "            model=\"computer-use-preview\",\n",
+    "            tools=[{\n",
+    "                \"type\": \"computer_use_preview\",\n",
+    "                \"display_width\": 1024,\n",
+    "                \"display_height\": 768,\n",
+    "                \"environment\": \"mac\"\n",
+    "            }],\n",
+    "            input=[\n",
+    "                {  # type: ignore\n",
+    "                    \"role\": \"user\",\n",
+    "                    \"content\": [\n",
+    "                        {\"type\": \"input_text\", \"text\": \"Open Safari, download and install Cursor.\"},\n",
+    "                        {\"type\": \"input_image\", \"image_url\": f\"data:image/png;base64,{screenshot_base64}\"}\n",
+    "                    ]\n",
+    "                }\n",
+    "            ],\n",
+    "            truncation=\"auto\"\n",
+    "        )\n",
+    "\n",
+    "        # Continue the loop until no more computer_call actions\n",
+    "        while True:\n",
+    "            # Check for computer_call actions\n",
+    "            computer_calls = [item for item in response.output if item.type == \"computer_call\"]\n",
+    "            if not computer_calls:\n",
+    "                print(\"No more computer calls. Loop complete.\")\n",
+    "                break\n",
+    "\n",
+    "            # Get the first computer call\n",
+    "            call = computer_calls[0]\n",
+    "            last_call_id = call.call_id\n",
+    "            action = call.action\n",
+    "            print(\"Received action from OpenAI Responses API:\", action)\n",
+    "\n",
+    "            # Handle any pending safety checks\n",
+    "            if call.pending_safety_checks:\n",
+    "                print(\"Safety checks pending:\", call.pending_safety_checks)\n",
+    "                # In a real implementation you would ask the user to confirm\n",
+    "                # each check before proceeding (see the sketch after this cell)\n",
+    "                acknowledged_checks = call.pending_safety_checks\n",
+    "            else:\n",
+    "                acknowledged_checks = []\n",
+    "\n",
+    "            # Execute the action\n",
+    "            await execute_action(computer, action)\n",
+    "            await asyncio.sleep(1)  # Allow time for changes to take effect\n",
+    "\n",
+    "            # Capture a new screenshot after the action\n",
+    "            new_screenshot = await computer.interface.screenshot()\n",
+    "            new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')\n",
+    "\n",
+    "            # Send the screenshot back as computer_call_output\n",
+    "            response = openai.responses.create(\n",
+    "                model=\"computer-use-preview\",\n",
+    "                previous_response_id=response.id,  # Link to the previous response\n",
+    "                tools=[{\n",
+    "                    \"type\": \"computer_use_preview\",\n",
+    "                    \"display_width\": 1024,\n",
+    "                    \"display_height\": 768,\n",
+    "                    \"environment\": \"mac\"\n",
+    "                }],\n",
+    "                input=[{  # type: ignore\n",
+    "                    \"type\": \"computer_call_output\",\n",
+    "                    \"call_id\": last_call_id,\n",
+    "                    \"acknowledged_safety_checks\": acknowledged_checks,\n",
+    "                    \"output\": {\n",
+    "                        \"type\": \"input_image\",\n",
+    "                        \"image_url\": f\"data:image/png;base64,{new_screenshot_base64}\"\n",
+    "                    }\n",
+    "                }],\n",
+    "                truncation=\"auto\"\n",
+    "            )\n",
+    "\n",
+    "        # End the session\n",
+    "        await computer.stop()\n",
+    "\n",
+    "# Run the loop\n",
+    "await cua_openai_loop()"
+   ]
+  },
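+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The loop above auto-acknowledges pending safety checks, which is fine for a demo but not for real use. Below is a minimal sketch of an interactive confirmation step. `confirm_safety_checks` is a hypothetical helper (it is not part of the `cua-computer` or `openai` packages), and it assumes each pending check carries a human-readable `message` attribute as described in the OpenAI computer-use documentation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def confirm_safety_checks(pending_checks):\n",
+    "    \"\"\"Ask the user to approve each pending safety check before proceeding.\"\"\"\n",
+    "    acknowledged = []\n",
+    "    for check in pending_checks:\n",
+    "        # Each check is assumed to expose a human-readable message\n",
+    "        answer = input(f\"Safety check: {check.message}\\\\nProceed? [y/N] \")\n",
+    "        if answer.strip().lower() == \"y\":\n",
+    "            acknowledged.append(check)\n",
+    "        else:\n",
+    "            raise RuntimeError(\"Safety check rejected; stopping the loop.\")\n",
+    "    return acknowledged\n",
+    "\n",
+    "# In the loop, instead of auto-acknowledging:\n",
+    "# acknowledged_checks = confirm_safety_checks(call.pending_safety_checks)"
+   ]
+  },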
Loop complete.\")\n", + " break\n", + "\n", + " # Get the first computer call\n", + " call = computer_calls[0]\n", + " last_call_id = call.call_id\n", + " action = call.action\n", + " print(\"Received action from OpenAI Responses API:\", action)\n", + "\n", + " # Handle any pending safety checks\n", + " if call.pending_safety_checks:\n", + " print(\"Safety checks pending:\", call.pending_safety_checks)\n", + " # In a real implementation, you would want to get user confirmation here\n", + " acknowledged_checks = call.pending_safety_checks\n", + " else:\n", + " acknowledged_checks = []\n", + "\n", + " # Execute the action\n", + " await execute_action(computer, action)\n", + " await asyncio.sleep(1) # Allow time for changes to take effect\n", + "\n", + " # Capture new screenshot after action\n", + " new_screenshot = await computer.interface.screenshot()\n", + " new_screenshot_base64 = base64.b64encode(new_screenshot).decode('utf-8')\n", + "\n", + " # Send the screenshot back as computer_call_output\n", + " response = openai.responses.create(\n", + " model=\"computer-use-preview\",\n", + " previous_response_id=response.id, # Link to previous response\n", + " tools=[{\n", + " \"type\": \"computer_use_preview\",\n", + " \"display_width\": 1024,\n", + " \"display_height\": 768,\n", + " \"environment\": \"mac\"\n", + " }],\n", + " input=[{ # type: ignore\n", + " \"type\": \"computer_call_output\",\n", + " \"call_id\": last_call_id,\n", + " \"acknowledged_safety_checks\": acknowledged_checks,\n", + " \"output\": {\n", + " \"type\": \"input_image\",\n", + " \"image_url\": f\"data:image/png;base64,{new_screenshot_base64}\"\n", + " }\n", + " }],\n", + " truncation=\"auto\"\n", + " )\n", + "\n", + " # End the session\n", + " await computer.stop()\n", + "\n", + "# Run the loop\n", + "await cua_openai_loop()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Final Remarks\n", + "\n", + "This notebook demonstrates a single iteration of a CUA/OpenAI loop where:\n", + "\n", + "- A macOS sandbox is controlled using the CUA interface.\n", + "- A screenshot and prompt are sent to the OpenAI Responses API.\n", + "- The returned action (e.g. a click or type command) is executed via the CUA interface.\n", + "\n", + "In a production setting, you would wrap the action-response cycle in a loop, handling multiple actions and safety checks as needed." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "cua", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}