computer/notebooks/sota_hackathon.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a5d6b2ed",
   "metadata": {},
   "source": [
    "# Computer-Use Agents SOTA Challenge\n",
    "\n",
    "Congrats on joining the Cua + HUD hackathon at Hack The North 2025!\n",
    "\n",
    "This notebook will show you how to create a computer use agent with Cua and evaluate it using HUD."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cebe8572",
   "metadata": {},
   "source": [
    "## 💻 Prerequisites\n",
    "\n",
    "Clone the Cua repository and install project dependencies."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d7c38f9",
   "metadata": {},
   "source": [
    "The easiest way to get started is by getting set up with the Cua development repository.\n",
    "\n",
    "Install [Docker](https://www.docker.com/products/docker-desktop/) and [pdm](https://pdm-project.org/en/latest/#recommended-installation-method).\n",
    "\n",
    "Clone the Cua repository:\n",
    "\n",
    "`git clone https://github.com/trycua/cua`\n",
    "\n",
    "Install the project dependencies:\n",
    "\n",
    "`cd cua && pdm install`\n",
    "\n",
    "You should now be able to run the `notebooks/sota_hackathon.ipynb` notebook in VS Code with the `.venv` virtual environment selected."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19f92431",
   "metadata": {},
   "source": [
    "## ☁️ Connect to cloud services\n",
    "\n",
    "Create a free HUD account and load your API keys."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47171dc3",
   "metadata": {},
   "source": [
    "1. Create a HUD account at https://www.hud.so/\n",
    "2. Create a `.env` file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1757f145",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a .env file if it doesn't exist\n",
    "\n",
    "ENV_TEMPLATE = \"\"\"# Required environment variables:\n",
    "HUD_API_KEY=\n",
    "\n",
    "# Any LLM provider will work:\n",
    "ANTHROPIC_API_KEY=\n",
    "OPENAI_API_KEY=\n",
    "\"\"\"\n",
    "\n",
    "import os\n",
    "if not os.path.exists(\".env\"):\n",
    "    with open(\".env\", \"w\") as f:\n",
    "        f.write(ENV_TEMPLATE)\n",
    "    print(\"A .env file was created! Fill in the empty values.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0949908d",
   "metadata": {},
   "source": [
    "3. Fill in all missing values in the `.env` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f23828d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the .env file\n",
    "# HUD requires the .env file to be in the same directory\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "load_dotenv(dotenv_path='.env', override=True)\n",
    "\n",
    "assert os.getenv(\"HUD_API_KEY\"), \"HUD_API_KEY is not set; add it to the .env file\""
   ]
  },
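  {
   "cell_type": "markdown",
   "id": "c41e7a90",
   "metadata": {},
   "source": [
    "The `.env` template lists two LLM provider keys, and only one of them needs a value. As a minimal sketch, the cell below reports which of those keys are currently set. The helper `configured_providers` is a hypothetical name introduced here for illustration; it is not part of Cua or HUD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d02b6f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Hypothetical helper (not part of Cua or HUD): lists which LLM provider\n",
    "# keys from the .env template are set to a non-empty value.\n",
    "PROVIDER_KEYS = (\"ANTHROPIC_API_KEY\", \"OPENAI_API_KEY\")\n",
    "\n",
    "def configured_providers(env=None):\n",
    "    # Default to the process environment; accept a dict for easy testing.\n",
    "    env = os.environ if env is None else env\n",
    "    return [key for key in PROVIDER_KEYS if env.get(key)]\n",
    "\n",
    "print(\"Provider keys set:\", configured_providers() or \"none\")"
   ]
  },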
  {
   "cell_type": "markdown",
   "id": "5c8bef64",
   "metadata": {},
   "source": [
    "## 🤖 Create a computer use agent\n",
    "\n",
    "Create a computer use agent using the Cua SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd4393b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "from pathlib import Path\n",
    "from agent import ComputerAgent\n",
    "\n",
    "# Here you can set the model and tools for your agent.\n",
    "# Computer use models: https://www.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents\n",
    "# Composed agent models: https://www.trycua.com/docs/agent-sdk/supported-agents/composed-agents\n",
    "# Custom tools: https://www.trycua.com/docs/agent-sdk/custom-tools\n",
    "agent_config = {\n",
    "    \"model\": \"openai/computer-use-preview\",\n",
    "    \"trajectory_dir\": str(Path(\"trajectories\")),\n",
    "    \"only_n_most_recent_images\": 3,\n",
    "    \"verbosity\": logging.INFO\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a07b09ee",
   "metadata": {},
   "source": [
    "## 🖱️ Test your agent\n",
    "\n",
    "Run your agent on a test scenario in a Docker container."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12b9c22c",
   "metadata": {},
   "source": [
    "Make sure Docker is running to launch the computer.\n",
    "\n",
    "You can view the live VNC stream from the Docker container at `http://localhost:8006/`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a210e959",
   "metadata": {},
   "outputs": [],
   "source": [
    "from computer import Computer, VMProviderType\n",
    "import webbrowser\n",
    "\n",
    "# Launch a local Docker container to act as the agent's computer\n",
    "computer = Computer(\n",
    "    os_type=\"linux\",\n",
    "    provider_type=VMProviderType.DOCKER,\n",
    "    verbosity=logging.INFO\n",
    ")\n",
    "await computer.run()\n",
    "\n",
    "agent_config[\"tools\"] = [ computer ]\n",
    "\n",
    "webbrowser.open(\"http://localhost:8006/\", new=0, autoraise=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87a307e3",
   "metadata": {},
   "source": [
    "Try running the computer use agent on a simple task.\n",
    "\n",
    "Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3a32ea8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create agent\n",
    "agent = ComputerAgent(**agent_config)\n",
    "\n",
    "tasks = [\n",
    "    \"Open the web browser and search for a repository named trycua/cua on GitHub.\"\n",
    "]\n",
    "\n",
    "for i, task in enumerate(tasks):\n",
    "    print(f\"\\nExecuting task {i+1}/{len(tasks)}: {task}\")\n",
    "    async for result in agent.run(task):\n",
    "        print(result)\n",
    "\n",
    "    print(f\"\\n✅ Task {i+1}/{len(tasks)} completed: {task}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb4edbb5",
   "metadata": {},
   "source": [
    "## 🧐 Benchmark your agent\n",
    "\n",
    "Test your agent's performance on a selection of tasks from the OSWorld benchmark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bf0887e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "from pprint import pprint\n",
    "from agent.integrations.hud import run_full_dataset\n",
    "\n",
    "job_name = f\"osworld-test-{str(uuid.uuid4())[:4]}\"\n",
    "\n",
    "# Full dataset evaluation (runs via HUD's run_dataset under the hood)\n",
    "# See the documentation here: https://docs.trycua.com/docs/agent-sdk/integrations/hud#running-a-full-dataset\n",
    "results = await run_full_dataset(\n",
    "    dataset=\"ddupont/OSWorld-Tiny-Public\",\n",
    "    job_name=job_name,\n",
    "    **agent_config,\n",
    "    max_concurrent=20,\n",
    "    max_steps=50,\n",
    "    #split=\"train[:5]\"\n",
    ")\n",
    "\n",
    "# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed\n",
    "print(f\"Job: {job_name}\")\n",
    "print(f\"Total results: {len(results)}\")\n",
    "pprint(results[:3])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b89a103",
   "metadata": {},
   "source": [
    "## 🦾 Improve your agent\n",
    "\n",
    "To improve your agent for OSWorld-Verified, experiment with different models and add custom tools that fit your use case. You can also dive into the ComputerAgent source code to design an improved version or subclass tailored to your needs.\n",
    "\n",
    "Learn more about [Customizing Your ComputerAgent](https://docs.trycua.com/docs/agent-sdk/customizing-computeragent) in the docs."
   ]
  }
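  ,
  {
   "cell_type": "markdown",
   "id": "e58a13d7",
   "metadata": {},
   "source": [
    "As one possible starting point for a custom tool, the sketch below defines a plain Python function with type hints and a docstring. It assumes that such a function can be added to the agent's `tools` list alongside the computer; confirm this against the Custom tools docs linked in the agent configuration cell before relying on it. `word_count` is a hypothetical example, not a Cua built-in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fa9c046",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical custom tool: a plain function the agent could call.\n",
    "def word_count(text: str) -> int:\n",
    "    \"\"\"Count the whitespace-separated words in a piece of text.\"\"\"\n",
    "    return len(text.split())\n",
    "\n",
    "# Assumed registration step (verify against the Custom tools docs):\n",
    "# agent_config[\"tools\"] = [computer, word_count]\n",
    "\n",
    "# Quick local check that the function behaves as expected\n",
    "print(word_count(\"Open the web browser\"))"
   ]
  }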
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}