Merge branch 'main' into fix/nextjs-vuln

This commit is contained in:
Morgan Dean
2025-12-03 12:07:10 -08:00
25 changed files with 3959 additions and 2927 deletions

View File

@@ -2,8 +2,7 @@ name: Lint & Format Check
on:
pull_request:
branches:
- main
push:
branches:
- main

View File

@@ -15,6 +15,8 @@ repos:
name: TypeScript type check
entry: node ./scripts/typescript-typecheck.js
language: node
files: \.(ts|tsx)$
pass_filenames: false
- repo: https://github.com/PyCQA/isort
rev: 7.0.0

View File

@@ -21,7 +21,6 @@ The Playground connects to your existing Cua sandboxes—the same ones you use w
<video src="https://github.com/user-attachments/assets/9fef0f30-1024-4833-8b7a-6a2c02d8eb99" width="600" controls></video>
</div>
Sign up at [cua.ai/signin](https://cua.ai/signin) and grab your API key from the dashboard. Then navigate to the Playground:
1. Navigate to Dashboard > Playground
@@ -33,6 +32,7 @@ Sign up at [cua.ai/signin](https://cua.ai/signin) and grab your API key from the
Example use cases:
**Prompt Testing**
```
❌ "Check the website"
✅ "Navigate to example.com in Firefox and take a screenshot of the homepage"
@@ -42,6 +42,7 @@ Example use cases:
Run the same task with different models to compare quality, speed, and cost.
**Debugging Agent Behavior**
1. Send: "Find the login button and click it"
2. View tool calls to see each mouse movement
3. Check screenshots to verify the agent found the right element

View File

@@ -51,7 +51,6 @@ When you request an Anthropic model through Cua, we automatically route to the b
Sign up at [cua.ai/signin](https://cua.ai/signin) and create your API key from **Dashboard > API Keys > New API Key** (save it immediately—you won't see it again).
Use it with the Agent SDK (make sure to set your environment variable):
```python
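# NOTE: the diff view truncates the original snippet here. The lines below are
# an illustrative sketch, not the author's original code; the ComputerAgent and
# Computer usage is inferred from the Computer calls elsewhere in this commit.
import asyncio
import os

from agent import ComputerAgent
from computer import Computer

async def main():
    # Assumes CUA_API_KEY is set in your environment, as noted above
    computer = Computer(
        os_type="linux",
        provider_type="cloud",
        name="my-sandbox",  # hypothetical sandbox name
        api_key=os.getenv("CUA_API_KEY"),
    )
    await computer.run()
    agent = ComputerAgent(model="anthropic/claude-sonnet-4-20250514", tools=[computer])
    async for result in agent.run("Take a screenshot of the desktop"):
        print(result)

asyncio.run(main())
```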

View File

@@ -29,13 +29,13 @@ A few papers stand out for their immediate relevance to anyone building or deplo
## Summary Statistics
| Category | Count |
|----------|-------|
| Benchmarks & Datasets | 18 |
| Safety & Security | 12 |
| Grounding & Visual Reasoning | 14 |
| Agent Architectures & Training | 11 |
| Adversarial Attacks | 8 |
| Category | Count |
| ------------------------------ | ----- |
| Benchmarks & Datasets | 18 |
| Safety & Security | 12 |
| Grounding & Visual Reasoning | 14 |
| Agent Architectures & Training | 11 |
| Adversarial Attacks | 8 |
**Total Papers:** 45
@@ -56,6 +56,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** The first comprehensive benchmark for evaluating GUI agents on macOS. Features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with support for 5 languages (English, Chinese, Arabic, Japanese, Russian). Reveals a dramatic gap: proprietary agents achieve 30%+ success rate while open-source models lag below 5%. Also includes safety benchmarking for deception attacks.
**Key Findings:**
- Proprietary computer-use agents lead at above 30% success rate
- Open-source lightweight models struggle below 5%, highlighting the need for macOS domain adaptation
- Multilingual benchmarks expose weaknesses, especially in Arabic (28.8% degradation vs English)
@@ -70,6 +71,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A comprehensive safety benchmark built on OSWorld for testing computer-use agents across three harm categories: deliberate user misuse, prompt injection attacks, and model misbehavior. Includes 150 tasks spanning harassment, copyright infringement, disinformation, data exfiltration, and more. Proposes an automated judge achieving high agreement with human annotations (0.76-0.79 F1 score).
**Key Findings:**
- All tested models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro) tend to directly comply with many deliberate misuse queries
- Models are relatively vulnerable to static prompt injections
- Models occasionally perform unsafe actions without explicit malicious prompts
@@ -83,6 +85,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A comprehensive open-source framework for scaling computer-use agent data and foundation models. Introduces AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications/websites. OpenCUA-72B achieves 45% success rate on OSWorld-Verified, establishing new state-of-the-art among open-source models.
**Key Contributions:**
- Annotation infrastructure for capturing human computer-use demonstrations
- AgentNet: large-scale dataset across 3 OSes and 200+ apps
- Scalable pipeline transforming demonstrations into state-action pairs with reflective Chain-of-Thought reasoning
@@ -97,6 +100,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A benchmark of 130 realistic, high-quality, long-horizon tasks for agentic search systems (like Deep Research), requiring real-time web browsing and extensive information synthesis. Constructed with 1000+ hours of human labor. Introduces Agent-as-a-Judge framework using tree-structured rubric design for automated evaluation.
**Key Findings:**
- OpenAI Deep Research achieves 50-70% of human performance while spending half the time
- First systematic evaluation of ten frontier agentic search systems vs. human performance
- Addresses the challenge of evaluating time-varying, complex answers
@@ -110,6 +114,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Addresses GUI grounding—mapping natural language to specific UI actions—as a critical bottleneck in agent development. Introduces OSWorld-G benchmark (564 annotated samples) and Jedi dataset (4 million synthetic examples), the largest computer-use grounding dataset. Improved grounding directly enhances agentic capabilities, boosting OSWorld performance from 23% to 51%.
**Key Contributions:**
- OSWorld-G: comprehensive benchmark for diverse grounding tasks (text matching, element recognition, layout understanding, precise manipulation)
- Jedi: 4M examples through multi-perspective task decoupling
- Demonstrates compositional generalization to novel interfaces
@@ -123,6 +128,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Evaluates potential safety risks of MLLM-based agents during real-world computer manipulation. Features 492 risky tasks spanning web, social media, multimedia, OS, email, and office software. Categorizes risks into user-originated and environmental risks, evaluating both risk goal intention and completion.
**Key Findings:**
- Current computer-use agents face significant safety risks in real-world scenarios
- Safety principles designed for dialogue scenarios don't transfer well to computer-use
- Highlights necessity and urgency of safety alignment for computer-use agents
@@ -136,6 +142,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A benchmark featuring high-fidelity, deterministic replicas of 11 widely-used websites across e-commerce, travel, communication, and professional networking. Contains 112 practical tasks requiring both information retrieval and state-changing actions. Enables reproducible evaluation without safety risks.
**Key Findings:**
- Best frontier language models achieve only 41% success rate
- Highlights critical gaps in autonomous web navigation and task completion
- Supports scalable post-training data generation
@@ -149,6 +156,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** An RL-based framework for GUI grounding incorporating seed data curation, dense policy gradients, and self-evolutionary reinforcement finetuning using attention maps. With only 3K training samples, the 7B model achieves state-of-the-art on three grounding benchmarks, outperforming UI-TARS-72B by 24.2% on ScreenSpot-Pro.
**Key Results:**
- 47.3% accuracy on ScreenSpot-Pro with 7B model
- Outperforms 72B models with a fraction of the training data
- Demonstrates effectiveness of RL for high-resolution, complex environments
@@ -162,6 +170,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A generative adversarial framework that manipulates agent decision-making using diffusion-based semantic injections. Combines negative prompt degradation with positive semantic optimization. Without model access, produces visually natural images that induce consistent decision biases in agents.
**Key Findings:**
- Consistently induces decision-level preference redirection on LLaVA-34B, Gemma3, GPT-4o, and Mistral-3.2
- Outperforms baselines (SPSA, Bandit, standard diffusion)
- Exposes vulnerability: autonomous agents can be misled through visually subtle, semantically-guided manipulations
@@ -175,6 +184,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** An extensible benchmark simulating a small software company environment where AI agents interact like digital workers: browsing the web, writing code, running programs, and communicating with coworkers. Tests agents on real professional tasks with important implications for industry adoption and labor market effects.
**Key Findings:**
- Best agent achieves 30% autonomous task completion
- Simpler tasks are solvable autonomously
- More difficult long-horizon tasks remain beyond current systems' reach
@@ -188,6 +198,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A comprehensive benchmark for VLMs in video game QA, encompassing visual unit testing, visual regression testing, needle-in-a-haystack challenges, glitch detection, and bug report generation for both images and videos. Addresses the need for standardized benchmarks in this labor-intensive domain.
**Key Focus:**
- First benchmark specifically designed for video game QA with VLMs
- Covers wide range of QA activities across images and videos
- Addresses lack of automation in game development workflows
@@ -201,6 +212,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** End-to-end benchmark for evaluating web agent security against prompt injection attacks. Tests realistic scenarios where even simple, low-effort human-written injections can deceive top-tier AI models including those with advanced reasoning.
**Key Findings:**
- Attacks partially succeed in up to 86% of cases
- State-of-the-art agents often struggle to fully complete attacker goals
- Reveals "security by incompetence"—agents' limitations sometimes prevent full attack success
@@ -214,6 +226,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Measures whether AI web-navigation agents follow the privacy principle of "data minimization"—using sensitive information only when truly necessary to complete a task. Simulates realistic web interaction scenarios end-to-end.
**Key Findings:**
- Agents built on GPT-4, Llama-3, and Claude are prone to inadvertent use of unnecessary sensitive information
- Proposes prompting-based defense that reduces information leakage
- End-to-end benchmarking provides more realistic measure than probing LLMs about privacy
@@ -227,6 +240,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. Creates unified simulation integrating realistic 3D indoor/outdoor environments with functional web interfaces. Tasks include cooking from online recipes, navigating with dynamic map data, and interpreting landmarks using web knowledge.
**Key Contributions:**
- Unified platform combining 3D environments with web interfaces
- Benchmark spanning cooking, navigation, shopping, tourism, and geolocation
- Reveals significant performance gaps between AI systems and humans
@@ -240,6 +254,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** The first attempt to model UI interactions for precision engineering tasks. Features 41K+ annotated video recordings of CAD operations with time horizons up to 20x longer than existing datasets. Proposes VideoCADFormer for learning CAD interactions directly from video.
**Key Contributions:**
- Large-scale synthetic dataset for CAD UI interactions
- VQA benchmark for evaluating spatial reasoning and video understanding
- Reveals challenges in precise action grounding and long-horizon dependencies
@@ -253,6 +268,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Introduces a pre-operative critic mechanism that provides feedback before action execution by reasoning about potential outcomes. Proposes Suggestion-aware Group Relative Policy Optimization (S-GRPO) for building the GUI-Critic-R1 model with fully automated data generation.
**Key Results:**
- Significant advantages in critic accuracy compared to current MLLMs
- Improved success rates and operational efficiency on GUI automation benchmarks
- Works across both mobile and web domains
@@ -266,7 +282,8 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Introduces multi-turn RL framework enabling dynamic zooming into predicted coordinates during reasoning.
**Key Results:**
- 86.4% on V*Bench for visual search
- 86.4% on V\*Bench for visual search
- Outperforms supervised fine-tuning and conventional RL across spatial reasoning, visual search, and web-based grounding
- Grounding amplifies region exploration, subgoal setting, and visual verification
@@ -279,6 +296,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A VLM-based method for coordinate-free GUI grounding using an attention-based action head. Enables proposing one or more action regions in a single forward pass with a grounding verifier for selection.
**Key Results:**
- GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro with Qwen2.5-VL, outperforming UI-TARS-72B (38.1)
- Improved generalization to unseen resolutions and layouts
- Fine-tuning only ~100M parameters achieves SOTA performance
@@ -292,11 +310,13 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Extensive analysis of the R1-Zero paradigm (online RL + chain-of-thought reasoning) for GUI grounding. Identifies issues: longer reasoning chains lead to worse performance, reward hacking via box size exploitation, and overfitting easy examples.
**Solutions Proposed:**
- Fast Thinking Template for direct answer generation
- Box size constraint in reward function
- Difficulty-aware scaling in RL objective
**Key Results:**
- GUI-G1-3B achieves 90.3% on ScreenSpot and 37.1% on ScreenSpot-Pro
- Outperforms larger UI-TARS-7B with only 3B parameters
@@ -309,6 +329,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Framework integrating self-reflection and error correction into end-to-end multimodal GUI models through GUI-specific pre-training, offline SFT, and online reflection tuning. Enables self-reflection emergence with fully automated data generation.
**Key Contributions:**
- Scalable pipelines for automatic reflection/correction data from successful trajectories
- GUI-Reflection Task Suite for reflection-oriented abilities
- Diverse environment for online training on mobile devices
@@ -323,6 +344,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A generalist agent capable of multimodal computer interaction (text, images, audio, video). Integrates tool-based and pure vision agents within highly modular architecture, enabling collaborative step-by-step task solving.
**Key Results:**
- 7.27-point accuracy gain over Claude-Computer-Use on OSWorld
- Evaluated on pure vision benchmarks (OSWorld), general benchmarks (GAIA), and tool-intensive benchmarks (SWE-Bench)
- Demonstrates value of modular, collaborative agent architecture
@@ -336,6 +358,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A fine-grained adversarial attack framework that modifies VLM perception of only key objects while preserving semantics of remaining regions. Unlike broad semantic disruption, this targeted approach reduces conflicts with task context, making VLMs output valid but incorrect decisions that affect agent actions in the physical world.
**Key Contributions:**
- AdvEDM-R: removes semantics of specific objects from images
- AdvEDM-A: adds semantics of new objects into images
- Demonstrates fine-grained control with excellent attack performance in embodied decision-making tasks
@@ -349,6 +372,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A vision-centric reasoning benchmark grounded in challenging perceptual tasks. Unlike prior benchmarks, it moves beyond shallow perception ("see") to require fine-grained observation and analytical reasoning ("observe"). Features natural adversarial image pairs and annotated reasoning chains for process evaluation.
**Key Findings:**
- Tests 20 leading MLLMs including 12 foundation models and 8 reasoning-enhanced models
- Existing reasoning strategies (chain-of-thought, self-criticism) result in unstable and redundant reasoning
- Repeated image observation improves performance across models
@@ -363,6 +387,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** First systematic investigation of backdoor vulnerabilities in VLA models. Proposes Objective-Decoupled Optimization with two stages: explicit feature-space separation to isolate trigger representations, and conditional control deviations activated only by triggers.
**Key Findings:**
- Consistently achieves near-100% attack success rates with minimal impact on clean task accuracy
- Robust against common input perturbations, task transfers, and model fine-tuning
- Exposes critical security vulnerabilities in current VLA deployments under Training-as-a-Service paradigm
@@ -376,6 +401,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Benchmark for proactively inferring user goals from multimodal contextual observations for wearable assistant agents (smart glasses). Dataset comprises ~30 hours from 363 participants across 3,482 recordings with visual, audio, digital, and longitudinal context.
**Key Findings:**
- Humans achieve 93% MCQ accuracy; best VLM reaches ~84%
- For open-ended generation, best models produce relevant goals only ~57% of the time
- Smaller models (suited for wearables) achieve ~49% accuracy
@@ -390,6 +416,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A game-theoretic multi-agent framework formulating reasoning as a non-zero-sum game between base agents (visual perception specialists) and a critical agent (logic/fact verification). Features uncertainty-aware controller for dynamic agent collaboration with multi-round debates.
**Key Results:**
- Boosts small-to-mid scale models (Qwen2.5-VL-7B, InternVL3-14B) by 5-6%
- Enhances strong models like GPT-4o by 2-3%
- Modular, scalable, and generalizable framework
@@ -403,6 +430,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Introduces Grounded Reasoning with Images and Texts—a method for training MLLMs to generate reasoning chains interleaving natural language with explicit bounding box coordinates. Uses GRPO-GR reinforcement learning with rewards focused on answer accuracy and grounding format.
**Key Contributions:**
- Exceptional data efficiency: requires as few as 20 image-question-answer triplets
- Successfully unifies reasoning and grounding abilities
- Eliminates need for reasoning chain annotations or explicit bounding box labels
@@ -416,6 +444,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** First multimodal safety alignment framework. Introduces BeaverTails-V (first dataset with dual preference annotations for helpfulness and safety), and Beaver-Guard-V (multi-level guardrail system defending against unsafe queries and adversarial attacks).
**Key Results:**
- Guard model improves precursor model's safety by average of 40.9% over five filtering rounds
- Safe RLHF-V enhances model safety by 34.2% and helpfulness by 34.3%
- First exploration of multi-modal safety alignment within constrained optimization
@@ -429,6 +458,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** An inference-time approach that quantifies visual token uncertainty and selectively masks uncertain tokens. Decomposes uncertainty into aleatoric and epistemic components, focusing on epistemic uncertainty for perception-related errors.
**Key Results:**
- Significantly reduces object hallucinations
- Enhances reliability and quality of LVLM outputs across diverse visual contexts
- Validated on CHAIR, THRONE, and MMBench benchmarks
@@ -442,6 +472,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A unified LVLM integrating segmentation-aware perception and controllable object-centric generation. Uses dual-branch visual encoder for global semantic context and fine-grained spatial details, with MoVQGAN-based visual tokenizer for discrete visual tokens.
**Key Contributions:**
- Progressive multi-stage training pipeline
- Segmentation masks jointly optimized as spatial condition prompts
- Bridges segmentation-aware perception with fine-grained visual synthesis
@@ -455,6 +486,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Introduces Multi-Model Monte Carlo Tree Search (M3CTS) for generating diverse Long Chain-of-Thought reasoning trajectories. Proposes fine-grained Direct Preference Optimization (fDPO) with segment-specific preference granularity guided by spatial reward mechanism.
**Key Results:**
- fDPO achieves 4.1% and 9.0% gains over standard DPO on spatial quality and quantity tasks
- SpatialReasoner-R1 sets new SOTA on SpatialRGPT-Bench, outperforming strongest baseline by 9.8%
- Maintains competitive performance on general vision-language tasks
@@ -468,6 +500,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A two-stage reinforcement fine-tuning framework: SFT with curated Chain-of-Thought data activates reasoning potential, followed by RL based on Group Relative Policy Optimization (GRPO) for domain shift adaptability.
**Key Advantages:**
- State-of-the-art results outperforming both open-source and proprietary models
- Robust performance under domain shifts across various tasks
- Excellent data efficiency in few-shot learning scenarios
@@ -481,6 +514,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Reveals that safe images can be exploited for jailbreaking when combined with additional safe images and prompts, exploiting LVLMs' universal reasoning capabilities and safety snowball effect. Proposes Safety Snowball Agent (SSA) framework.
**Key Findings:**
- SSA can use nearly any image to induce LVLMs to produce unsafe content
- Achieves high jailbreak success rates against latest LVLMs
- Exploits inherent LVLM properties rather than alignment flaws
@@ -494,6 +528,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Uncovers novel attack vector: Malicious Image Patches (MIPs)—adversarially perturbed screen regions that induce OS agents to perform harmful actions. MIPs can be embedded in wallpapers or shared on social media to exfiltrate sensitive data.
**Key Findings:**
- MIPs generalize across user prompts and screen configurations
- Can hijack multiple OS agents during execution of benign instructions
- Exposes critical security vulnerabilities requiring attention before widespread deployment
@@ -507,6 +542,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A framework leveraging instruction-driven routing and sparsification for VLA efficiency. Features 3-stage progressive architecture inspired by human multimodal coordination: Encoder-FiLM Aggregation Routing, LLM-FiLM Pruning Routing, and V-L-A Coupled Attention.
**Key Results:**
- 97.4% success rate on LIBERO benchmark, 70.0% on real-world robotic tasks
- Reduces training costs by 2.5x and inference latency by 2.8x compared to OpenVLA
- Achieves state-of-the-art performance
@@ -520,6 +556,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Novel off-policy RL algorithm applying direct policy updates for positive samples and conservative, regularized updates for negative ones. Augmented with Successful Transition Replay (STR) for prioritizing successful interactions.
**Key Results:**
- At least 17% relative increase over existing methods on AndroidWorld benchmark
- Substantially fewer computational resources than GPT-4o-based methods
- 5-60x faster inference
@@ -533,6 +570,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** An API-centric stress testing framework that uncovers intent integrity violations in LLM agents. Uses semantic partitioning to organize tasks into meaningful categories, with targeted mutations to expose subtle agent errors while preserving user intent.
**Key Contributions:**
- Datatype-aware strategy memory for retrieving effective mutation patterns
- Lightweight predictor for ranking mutations by error likelihood
- Generalizes to stronger target models using smaller LLMs for test generation
@@ -546,6 +584,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** A dual-system framework bridging high-level reasoning with low-level action execution. Trains multimodal LLM to generate embodied reasoning plans guided by action-aligned visual rewards, compressed into visual plan latents for downstream action execution.
**Key Capabilities:**
- Few-shot adaptation
- Long-horizon planning
- Self-correction behaviors in complex embodied AI tasks
@@ -559,6 +598,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Automated attack framework that constructs chains of images with risky visual thoughts to challenge VLMs. Exploits the conflict between logical processing and safety protocols, leading to unsafe content generation.
**Key Results:**
- Improves average attack success rate by 26.71% (from 63.70% to 90.41%)
- Tested on 9 open-source and 6 commercial VLMs
- Outperforms state-of-the-art methods
@@ -572,6 +612,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** First web-based benchmark evaluating MLLM agents on diverse CAPTCHA puzzles. Spans 20 modern CAPTCHA types (225 total) with novel metric: CAPTCHA Reasoning Depth quantifying cognitive and motor steps required.
**Key Findings:**
- Humans achieve 93.3% success rate
- State-of-the-art agents achieve at most 40.0% (Browser-Use OpenAI-o3)
- Highlights significant gap between human and agent capabilities
@@ -585,7 +626,8 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Introduces pixel-space reasoning framework where VLMs use visual operations (zoom-in, select-frame) to directly inspect and infer from visual evidence. Two-phase training: instruction tuning on synthesized traces, then RL with curiosity-driven rewards.
**Key Results:**
- 84% on V*Bench, 74% on TallyQA-Complex, 84% on InfographicsVQA
- 84% on V\*Bench, 74% on TallyQA-Complex, 84% on InfographicsVQA
- Highest accuracy achieved by any open-source 7B model
- Enables proactive information gathering from complex visual inputs
@@ -598,6 +640,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Brain-inspired framework decomposing interactions into three biologically plausible phases: Blink (rapid detection via saccadic-like attention), Think (higher-level reasoning/planning), and Link (executable command generation for motor control).
**Key Innovations:**
- Automated annotation pipeline for blink data
- BTL Reward: first rule-based reward mechanism driven by both process and outcome
- Competitive performance on static GUI understanding and dynamic interaction tasks
@@ -611,6 +654,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Simulation environment engine enabling flexible definition of screens, icons, and navigation graphs with full environment access for agent training/evaluation. Demonstrates progressive training approach from SFT to multi-turn RL.
**Key Findings:**
- Supervised fine-tuning enables memorization of fundamental knowledge
- Single-turn RL enhances generalization to unseen scenarios
- Multi-turn RL encourages exploration strategies through interactive trial and error
@@ -624,6 +668,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Reasoning-enhanced framework integrating structured reasoning, action prediction, and history summarization. Uses Chain-of-Thought analyses combining progress estimation and decision reasoning, trained via SFT and GRPO with history-aware rewards.
**Key Results:**
- State-of-the-art under identical training data conditions
- Particularly strong in out-of-domain scenarios
- Robust reasoning and generalization across diverse GUI navigation tasks
@@ -637,6 +682,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil
**Summary:** Self-improving framework addressing trajectory verification and training data scalability. Features UI-Genie-RM (image-text interleaved reward model) and self-improvement pipeline with reward-guided exploration and outcome verification.
**Key Contributions:**
- UI-Genie-RM-517k: first reward-specific dataset for GUI agents
- UI-Genie-Agent-16k: high-quality synthetic trajectories without manual annotation
- State-of-the-art across multiple GUI agent benchmarks through three generations of self-improvement

View File

@@ -250,13 +250,15 @@ $ cua get my-dev-sandbox --json
**Computer Server Health Check:**
The `cua get` command automatically probes the computer-server when the sandbox is running:
- Checks OS type via `https://{host}:8443/status`
- Checks version via `https://{host}:8443/cmd`
- Shows "Computer Server Status: healthy" when both probes succeed
- Uses a 3-second timeout for each probe
<Callout type="info">
The computer server status is only checked for running sandboxes. Stopped or suspended sandboxes will not show computer server information.
The computer server status is only checked for running sandboxes. Stopped or suspended sandboxes
will not show computer server information.
</Callout>
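For illustration, a rough Python equivalent of the status probe (the CLI itself is TypeScript; the endpoint path and 3-second timeout come from the list above, and the hostname is a placeholder):
```python
# Sketch only: probes the computer-server /status endpoint the way `cua get` does.
import asyncio
import aiohttp

async def probe_status(host: str) -> bool:
    timeout = aiohttp.ClientTimeout(total=3)  # 3-second timeout per probe
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(f"https://{host}:8443/status") as resp:
                return resp.ok  # healthy when the probe succeeds
    except Exception:
        return False  # unreachable, TLS failure, or timed out

print(asyncio.run(probe_status("my-dev-sandbox.example.com")))
```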
### `cua start`

View File

@@ -4,12 +4,12 @@ title: Configuration
The server is configured using environment variables (can be set in the Claude Desktop config):
| Variable | Description | Default |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| Variable | Description | Default |
| ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| `CUA_MODEL_NAME` | Model string (e.g., "anthropic/claude-sonnet-4-20250514", "openai/computer-use-preview", "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", "omniparser+litellm/gpt-4o", "omniparser+ollama_chat/gemma3") | anthropic/claude-sonnet-4-20250514 |
| `ANTHROPIC_API_KEY` | Your Anthropic API key (required for Anthropic models) | None |
| `CUA_MAX_IMAGES` | Maximum number of images to keep in context | 3 |
| `CUA_USE_HOST_COMPUTER_SERVER` | Target your local desktop instead of a VM. Set to "true" to use your host system. **Warning:** AI models may perform risky actions. | false |
| `ANTHROPIC_API_KEY` | Your Anthropic API key (required for Anthropic models) | None |
| `CUA_MAX_IMAGES` | Maximum number of images to keep in context | 3 |
| `CUA_USE_HOST_COMPUTER_SERVER` | Target your local desktop instead of a VM. Set to "true" to use your host system. **Warning:** AI models may perform risky actions. | false |
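As a quick illustration, the same variables can also be exported programmatically before launching the server (a sketch with placeholder values, not required setup):
```python
# Placeholder values; in practice set these in your Claude Desktop config or shell.
import os

os.environ["CUA_MODEL_NAME"] = "anthropic/claude-sonnet-4-20250514"  # default model
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # required for Anthropic models
os.environ["CUA_MAX_IMAGES"] = "3"  # keep at most 3 images in context
os.environ["CUA_USE_HOST_COMPUTER_SERVER"] = "false"  # keep actions inside a VM
```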
## Model Configuration
@@ -17,7 +17,7 @@ The `CUA_MODEL_NAME` environment variable supports various model providers throu
### Supported Providers
- **Anthropic**: `anthropic/claude-sonnet-4-20250514`,
- **Anthropic**: `anthropic/claude-sonnet-4-20250514`,
- **OpenAI**: `openai/computer-use-preview`, `openai/gpt-4o`
- **Local Models**: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- **Omni + LiteLLM**: `omniparser+litellm/gpt-4o`, `omniparser+litellm/claude-3-haiku`

View File

@@ -0,0 +1,119 @@
"""
Browser Tool Example
Demonstrates how to use the BrowserTool to control a browser programmatically
via the computer server. The browser runs visibly on the XFCE desktop so visual
agents can see it.
Prerequisites:
- Computer server running (Docker container or local)
- For Docker: Container should be running with browser tool support
- For local: Playwright and Firefox must be installed
Usage:
python examples/browser_tool_example.py
"""
import asyncio
import logging
import sys
from pathlib import Path
# Add the libs path to sys.path
libs_path = Path(__file__).parent.parent / "libs" / "python"
sys.path.insert(0, str(libs_path))
# Import the Computer interface and BrowserTool
from agent.tools.browser_tool import BrowserTool
from computer import Computer
# Configure logging to see what's happening
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
async def test_browser_tool():
"""Test the BrowserTool with various commands."""
# Initialize the computer interface
# For local testing, use provider_type="docker"
# For provider_type="cloud", provide name and api_key
computer = Computer(provider_type="docker", os_type="linux", image="cua-xfce:dev")
await computer.run()
# Initialize the browser tool with the computer interface
browser = BrowserTool(interface=computer)
logger.info("Testing Browser Tool...")
try:
# Test 0: Take a screenshot (pre-init)
logger.info("Test 0: Taking a screenshot...")
screenshot_bytes = await browser.screenshot()
screenshot_path = Path(__file__).parent / "browser_screenshot_init.png"
with open(screenshot_path, "wb") as f:
f.write(screenshot_bytes)
logger.info(f"Screenshot captured: {len(screenshot_bytes)} bytes")
# Test 1: Visit a URL
logger.info("Test 1: Visiting a URL...")
result = await browser.visit_url("https://www.trycua.com")
logger.info(f"Visit URL result: {result}")
# Wait a bit for the page to load
await asyncio.sleep(2)
# Test 2: Take a screenshot
logger.info("Test 2: Taking a screenshot...")
screenshot_bytes = await browser.screenshot()
screenshot_path = Path(__file__).parent / "browser_screenshot.png"
with open(screenshot_path, "wb") as f:
f.write(screenshot_bytes)
logger.info(f"Screenshot captured: {len(screenshot_bytes)} bytes")
# Wait a bit
await asyncio.sleep(1)
# Test 3: Visit bot detector
logger.info("Test 3: Visiting bot detector...")
result = await browser.visit_url("https://bot-detector.rebrowser.net/")
logger.info(f"Visit URL result: {result}")
# Test 4: Web search
logger.info("Test 4: Performing a web search...")
result = await browser.web_search("Python programming")
logger.info(f"Web search result: {result}")
# Wait a bit
await asyncio.sleep(2)
# Test 5: Scroll
logger.info("Test 5: Scrolling the page...")
result = await browser.scroll(delta_x=0, delta_y=500)
logger.info(f"Scroll result: {result}")
# Wait a bit
await asyncio.sleep(1)
# Test 6: Click (example coordinates - adjust based on your screen)
logger.info("Test 6: Clicking at coordinates...")
result = await browser.click(x=500, y=300)
logger.info(f"Click result: {result}")
# Wait a bit
await asyncio.sleep(1)
# Test 7: Type text (if there's a focused input field)
logger.info("Test 7: Typing text...")
result = await browser.type("Hello from BrowserTool!")
logger.info(f"Type result: {result}")
logger.info("All tests completed!")
except Exception as e:
logger.error(f"Error during testing: {e}", exc_info=True)
if __name__ == "__main__":
asyncio.run(test_browser_tool())

View File

@@ -8,6 +8,7 @@ from . import (
composed_grounded,
gelato,
gemini,
generic_vlm,
glm45v,
gta1,
holo,
@@ -16,7 +17,6 @@ from . import (
omniparser,
openai,
opencua,
generic_vlm,
uiins,
uitars,
uitars2,
@@ -24,19 +24,19 @@ from . import (
__all__ = [
"anthropic",
"openai",
"uitars",
"omniparser",
"gta1",
"composed_grounded",
"glm45v",
"opencua",
"internvl",
"holo",
"moondream3",
"gelato",
"gemini",
"generic_vlm",
"glm45v",
"gta1",
"holo",
"internvl",
"moondream3",
"omniparser",
"openai",
"opencua",
"uiins",
"gelato",
"uitars",
"uitars2",
]

View File

@@ -442,7 +442,7 @@ def get_all_element_descriptions(responses_items: List[Dict[str, Any]]) -> List[
# Conversion functions between responses_items and completion messages formats
def convert_responses_items_to_completion_messages(
messages: List[Dict[str, Any]],
messages: List[Dict[str, Any]],
allow_images_in_tool_results: bool = True,
send_multiple_user_images_per_parallel_tool_results: bool = False,
) -> List[Dict[str, Any]]:
@@ -573,25 +573,33 @@ def convert_responses_items_to_completion_messages(
"computer_call_output",
]
# Send tool message + separate user message with image (OpenAI compatible)
completion_messages += [
{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]",
},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": output.get("image_url")}}
],
},
] if send_multiple_user_images_per_parallel_tool_results or (not is_next_message_image_result) else [
{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]",
},
]
completion_messages += (
[
{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]",
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": output.get("image_url")},
}
],
},
]
if send_multiple_user_images_per_parallel_tool_results
or (not is_next_message_image_result)
else [
{
"role": "tool",
"tool_call_id": call_id,
"content": "[Execution completed. See screenshot below]",
},
]
)
else:
# Handle text output as tool response
completion_messages.append(

View File

@@ -0,0 +1,6 @@
"""Tools for agent interactions."""
from .browser_tool import BrowserTool
__all__ = ["BrowserTool"]

View File

@@ -0,0 +1,135 @@
"""
Browser Tool for agent interactions.
Allows agents to control a browser programmatically via Playwright.
"""
import logging
from typing import TYPE_CHECKING, Optional
if TYPE_CHECKING:
from computer.interface import GenericComputerInterface
logger = logging.getLogger(__name__)
class BrowserTool:
"""
Browser tool that uses the computer SDK's interface to control a browser.
Implements the Fara/Magentic-One agent interface for browser control.
"""
def __init__(
self,
interface: "GenericComputerInterface",
):
"""
Initialize the BrowserTool.
Args:
interface: A GenericComputerInterface instance that provides playwright_exec
"""
self.interface = interface
self.logger = logger
async def _execute_command(self, command: str, params: dict) -> dict:
"""
Execute a browser command via the computer interface.
Args:
command: Command name
params: Command parameters
Returns:
Response dictionary
"""
try:
result = await self.interface.playwright_exec(command, params)
if not result.get("success"):
self.logger.error(
f"Browser command '{command}' failed: {result.get('error', 'Unknown error')}"
)
return result
except Exception as e:
self.logger.error(f"Error executing browser command '{command}': {e}")
return {"success": False, "error": str(e)}
async def visit_url(self, url: str) -> dict:
"""
Navigate to a URL.
Args:
url: URL to visit
Returns:
Response dictionary with success status and current URL
"""
return await self._execute_command("visit_url", {"url": url})
async def click(self, x: int, y: int) -> dict:
"""
Click at coordinates.
Args:
x: X coordinate
y: Y coordinate
Returns:
Response dictionary with success status
"""
return await self._execute_command("click", {"x": x, "y": y})
async def type(self, text: str) -> dict:
"""
Type text into the focused element.
Args:
text: Text to type
Returns:
Response dictionary with success status
"""
return await self._execute_command("type", {"text": text})
async def scroll(self, delta_x: int, delta_y: int) -> dict:
"""
Scroll the page.
Args:
delta_x: Horizontal scroll delta
delta_y: Vertical scroll delta
Returns:
Response dictionary with success status
"""
return await self._execute_command("scroll", {"delta_x": delta_x, "delta_y": delta_y})
async def web_search(self, query: str) -> dict:
"""
Navigate to a Google search for the query.
Args:
query: Search query
Returns:
Response dictionary with success status and current URL
"""
return await self._execute_command("web_search", {"query": query})
async def screenshot(self) -> bytes:
"""
Take a screenshot of the current browser page.
Returns:
Screenshot image data as bytes (PNG format)
"""
import base64
result = await self._execute_command("screenshot", {})
if result.get("success") and result.get("screenshot"):
# Decode base64 screenshot to bytes
screenshot_b64 = result["screenshot"]
screenshot_bytes = base64.b64decode(screenshot_b64)
return screenshot_bytes
else:
error = result.get("error", "Unknown error")
raise RuntimeError(f"Failed to take screenshot: {error}")

View File

@@ -0,0 +1,361 @@
"""
Browser manager using Playwright for programmatic browser control.
This allows agents to control a browser that runs visibly on the XFCE desktop.
"""
import asyncio
import logging
import os
from typing import Any, Dict, Optional
try:
from playwright.async_api import Browser, BrowserContext, Page, async_playwright
except ImportError:
async_playwright = None
Browser = None
BrowserContext = None
Page = None
logger = logging.getLogger(__name__)
class BrowserManager:
"""
Manages a Playwright browser instance that runs visibly on the XFCE desktop.
Uses persistent context to maintain cookies and sessions.
"""
def __init__(self):
"""Initialize the BrowserManager."""
self.playwright = None
self.browser: Optional[Browser] = None
self.context: Optional[BrowserContext] = None
self.page: Optional[Page] = None
self._initialized = False
self._initialization_error: Optional[str] = None
self._lock = asyncio.Lock()
async def _ensure_initialized(self):
"""Ensure the browser is initialized."""
# Check if browser was closed and needs reinitialization
if self._initialized:
try:
# Check if context is still valid by trying to access it
if self.context:
# Try to get pages - this will raise if context is closed
_ = self.context.pages
# If we get here, context is still alive
return
else:
# Context was closed, need to reinitialize
self._initialized = False
logger.warning("Browser context was closed, will reinitialize...")
except Exception as e:
# Context is dead, need to reinitialize
logger.warning(f"Browser context is dead ({e}), will reinitialize...")
self._initialized = False
self.context = None
self.page = None
# Clean up playwright if it exists
if self.playwright:
try:
await self.playwright.stop()
except Exception:
pass
self.playwright = None
async with self._lock:
# Double-check after acquiring lock (another thread might have initialized it)
if self._initialized:
try:
if self.context:
_ = self.context.pages
return
except Exception:
self._initialized = False
self.context = None
self.page = None
if self.playwright:
try:
await self.playwright.stop()
except Exception:
pass
self.playwright = None
if async_playwright is None:
raise RuntimeError(
"playwright is not installed. Please install it with: pip install playwright && playwright install --with-deps firefox"
)
try:
# Get display from environment or default to :1
display = os.environ.get("DISPLAY", ":1")
logger.info(f"Initializing browser with DISPLAY={display}")
# Start playwright
self.playwright = await async_playwright().start()
# Launch Firefox with persistent context (keeps cookies/sessions)
# headless=False is CRITICAL so the visual agent can see it
user_data_dir = os.path.join(os.path.expanduser("~"), ".playwright-firefox")
os.makedirs(user_data_dir, exist_ok=True)
# launch_persistent_context returns a BrowserContext, not a Browser
# Note: Removed --kiosk mode so the desktop remains visible
self.context = await self.playwright.firefox.launch_persistent_context(
user_data_dir=user_data_dir,
headless=False, # CRITICAL: visible for visual agent
viewport={"width": 1024, "height": 768},
# Removed --kiosk to allow desktop visibility
)
# Add init script to make the browser less detectable
await self.context.add_init_script(
"""const defaultGetter = Object.getOwnPropertyDescriptor(
Navigator.prototype,
"webdriver"
).get;
defaultGetter.apply(navigator);
defaultGetter.toString();
Object.defineProperty(Navigator.prototype, "webdriver", {
set: undefined,
enumerable: true,
configurable: true,
get: new Proxy(defaultGetter, {
apply: (target, thisArg, args) => {
Reflect.apply(target, thisArg, args);
return false;
},
}),
});
const patchedGetter = Object.getOwnPropertyDescriptor(
Navigator.prototype,
"webdriver"
).get;
patchedGetter.apply(navigator);
patchedGetter.toString();"""
)
# Get the first page or create one
pages = self.context.pages
if pages:
self.page = pages[0]
else:
self.page = await self.context.new_page()
self._initialized = True
logger.info("Browser initialized successfully")
except Exception as e:
logger.error(f"Failed to initialize browser: {e}")
import traceback
logger.error(traceback.format_exc())
# Record the error so execute_command can surface it, then re-raise
self._initialization_error = str(e)
raise
async def _execute_command_impl(self, cmd: str, params: Dict[str, Any]) -> Dict[str, Any]:
"""Internal implementation of command execution."""
if cmd == "visit_url":
url = params.get("url")
if not url:
return {"success": False, "error": "url parameter is required"}
await self.page.goto(url, wait_until="domcontentloaded", timeout=30000)
return {"success": True, "url": self.page.url}
elif cmd == "click":
x = params.get("x")
y = params.get("y")
if x is None or y is None:
return {"success": False, "error": "x and y parameters are required"}
await self.page.mouse.click(x, y)
return {"success": True}
elif cmd == "type":
text = params.get("text")
if text is None:
return {"success": False, "error": "text parameter is required"}
await self.page.keyboard.type(text)
return {"success": True}
elif cmd == "scroll":
delta_x = params.get("delta_x", 0)
delta_y = params.get("delta_y", 0)
await self.page.mouse.wheel(delta_x, delta_y)
return {"success": True}
elif cmd == "web_search":
query = params.get("query")
if not query:
return {"success": False, "error": "query parameter is required"}
# Navigate to Google search (URL-encode the query so spaces and special characters survive)
from urllib.parse import quote_plus
search_url = f"https://www.google.com/search?q={quote_plus(query)}"
await self.page.goto(search_url, wait_until="domcontentloaded", timeout=30000)
return {"success": True, "url": self.page.url}
elif cmd == "screenshot":
# Take a screenshot and return as base64
import base64
screenshot_bytes = await self.page.screenshot(type="png")
screenshot_b64 = base64.b64encode(screenshot_bytes).decode("utf-8")
return {"success": True, "screenshot": screenshot_b64}
else:
return {"success": False, "error": f"Unknown command: {cmd}"}
async def execute_command(self, cmd: str, params: Dict[str, Any]) -> Dict[str, Any]:
"""
Execute a browser command with automatic recovery.
Args:
cmd: Command name (visit_url, click, type, scroll, web_search, screenshot)
params: Command parameters
Returns:
Result dictionary with success status and any data
"""
max_retries = 2
for attempt in range(max_retries):
try:
await self._ensure_initialized()
except Exception as e:
error_msg = getattr(self, "_initialization_error", None) or str(e)
logger.error(f"Browser initialization failed: {error_msg}")
return {
"success": False,
"error": f"Browser initialization failed: {error_msg}. "
f"Make sure Playwright and Firefox are installed, and DISPLAY is set correctly.",
}
# Check if page is still valid and get a new one if needed
page_valid = False
try:
if self.page is not None and not self.page.is_closed():
# Try to access page.url to check if it's still valid
_ = self.page.url
page_valid = True
except Exception as e:
logger.warning(f"Page is invalid: {e}, will get a new page...")
self.page = None
# Get a valid page if we don't have one
if not page_valid or self.page is None:
try:
if self.context:
pages = self.context.pages
if pages:
# Find first non-closed page
for p in pages:
try:
if not p.is_closed():
self.page = p
logger.info("Reusing existing open page")
page_valid = True
break
except Exception:
continue
# If no valid page found, create a new one
if not page_valid:
self.page = await self.context.new_page()
logger.info("Created new page")
except Exception as e:
logger.error(f"Failed to get new page: {e}, browser may be closed")
# Browser was closed - force reinitialization
self._initialized = False
self.context = None
self.page = None
if self.playwright:
try:
await self.playwright.stop()
except Exception:
pass
self.playwright = None
# If this isn't the last attempt, continue to retry
if attempt < max_retries - 1:
logger.info("Browser was closed, retrying with fresh initialization...")
continue
else:
return {
"success": False,
"error": f"Browser was closed and cannot be recovered: {e}",
}
# Try to execute the command
try:
return await self._execute_command_impl(cmd, params)
except Exception as e:
error_str = str(e)
logger.error(f"Error executing command {cmd}: {e}")
# Check if this is a "browser/page/context closed" error
if any(keyword in error_str.lower() for keyword in ["closed", "target", "context"]):
logger.warning(
f"Browser/page was closed during command execution (attempt {attempt + 1}/{max_retries})"
)
# Force reinitialization
self._initialized = False
self.context = None
self.page = None
if self.playwright:
try:
await self.playwright.stop()
except Exception:
pass
self.playwright = None
# If this isn't the last attempt, retry
if attempt < max_retries - 1:
logger.info("Retrying command after browser reinitialization...")
continue
else:
return {
"success": False,
"error": f"Command failed after {max_retries} attempts: {error_str}",
}
else:
# Not a browser closed error, return immediately
import traceback
logger.error(traceback.format_exc())
return {"success": False, "error": error_str}
# Should never reach here, but just in case
return {"success": False, "error": "Command failed after all retries"}
async def close(self):
"""Close the browser and cleanup resources."""
async with self._lock:
try:
if self.context:
await self.context.close()
self.context = None
if self.browser:
await self.browser.close()
self.browser = None
if self.playwright:
await self.playwright.stop()
self.playwright = None
self.page = None
self._initialized = False
logger.info("Browser closed successfully")
except Exception as e:
logger.error(f"Error closing browser: {e}")
# Global instance
_browser_manager: Optional[BrowserManager] = None
def get_browser_manager() -> BrowserManager:
"""Get or create the global BrowserManager instance."""
global _browser_manager
if _browser_manager is None:
_browser_manager = BrowserManager()
return _browser_manager
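A minimal usage sketch for the manager above (the import path is assumed from the server's relative `from .browser import get_browser_manager`; requires Playwright with Firefox installed and a running X display):
```python
import asyncio

from computer_server.browser import get_browser_manager  # assumed package path

async def main():
    manager = get_browser_manager()  # lazily launches Firefox on first command
    result = await manager.execute_command("visit_url", {"url": "https://example.com"})
    print(result)  # e.g. {"success": True, "url": "https://example.com/"}
    shot = await manager.execute_command("screenshot", {})
    print(shot["success"], len(shot.get("screenshot", "")))  # base64-encoded PNG
    await manager.close()

asyncio.run(main())
```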

View File

@@ -25,6 +25,7 @@ from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, StreamingResponse
from .handlers.factory import HandlerFactory
from .browser import get_browser_manager
# Authentication session TTL (in seconds). Override via env var CUA_AUTH_TTL_SECONDS. Default: 60s
AUTH_SESSION_TTL_SECONDS: int = int(os.environ.get("CUA_AUTH_TTL_SECONDS", "60"))
@@ -749,5 +750,71 @@ async def agent_response_endpoint(
return JSONResponse(content=payload, headers=headers)
@app.post("/playwright_exec")
async def playwright_exec_endpoint(
request: Request,
container_name: Optional[str] = Header(None, alias="X-Container-Name"),
api_key: Optional[str] = Header(None, alias="X-API-Key"),
):
"""
Execute Playwright browser commands.
Headers:
- X-Container-Name: Container name for cloud authentication
- X-API-Key: API key for cloud authentication
Body:
{
"command": "visit_url|click|type|scroll|web_search",
"params": {...}
}
"""
# Parse request body
try:
body = await request.json()
command = body.get("command")
params = body.get("params", {})
except Exception as e:
raise HTTPException(status_code=400, detail=f"Invalid JSON body: {str(e)}")
if not command:
raise HTTPException(status_code=400, detail="Command is required")
# Check if CONTAINER_NAME is set (indicating cloud provider)
server_container_name = os.environ.get("CONTAINER_NAME")
# If cloud provider, perform authentication
if server_container_name:
logger.info(
f"Cloud provider detected. CONTAINER_NAME: {server_container_name}. Performing authentication..."
)
# Validate required headers
if not container_name:
raise HTTPException(status_code=401, detail="Container name required")
if not api_key:
raise HTTPException(status_code=401, detail="API key required")
# Validate with AuthenticationManager
is_authenticated = await auth_manager.auth(container_name, api_key)
if not is_authenticated:
raise HTTPException(status_code=401, detail="Authentication failed")
# Get browser manager and execute command
try:
browser_manager = get_browser_manager()
result = await browser_manager.execute_command(command, params)
if result.get("success"):
return JSONResponse(content=result)
else:
raise HTTPException(status_code=400, detail=result.get("error", "Command failed"))
except HTTPException:
# Re-raise the 400 above instead of masking it as a 500
raise
except Exception as e:
logger.error(f"Error executing playwright command: {str(e)}")
logger.error(traceback.format_exc())
raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
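For reference, a hedged sketch of calling the new endpoint directly with `aiohttp` (host, port, and credentials are placeholders; the two headers are only checked when the server runs as a cloud provider):
```python
import asyncio
import aiohttp

async def visit(url: str) -> None:
    payload = {"command": "visit_url", "params": {"url": url}}
    headers = {
        "Content-Type": "application/json",
        "X-Container-Name": "my-sandbox",  # cloud auth only
        "X-API-Key": "cua_...",  # cloud auth only
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://my-sandbox.example.com:8443/playwright_exec",
            json=payload,
            headers=headers,
        ) as resp:
            print(resp.status, await resp.json())

asyncio.run(visit("https://example.com"))
```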

View File

@@ -24,6 +24,7 @@ dependencies = [
"pyperclip>=1.9.0",
"websockets>=12.0",
"pywinctl>=0.4.1",
"playwright>=1.40.0",
# OS-specific runtime deps
"pyobjc-framework-Cocoa>=10.1; sys_platform == 'darwin'",
"pyobjc-framework-Quartz>=10.1; sys_platform == 'darwin'",

View File

@@ -953,6 +953,35 @@ class Computer:
"""
return await self.interface.to_screenshot_coordinates(x, y)
async def playwright_exec(self, command: str, params: Optional[Dict] = None) -> Dict[str, Any]:
"""
Execute a Playwright browser command.
Args:
command: The browser command to execute (visit_url, click, type, scroll, web_search, screenshot)
params: Command parameters
Returns:
Dict containing the command result
Examples:
# Navigate to a URL
await computer.playwright_exec("visit_url", {"url": "https://example.com"})
# Click at coordinates
await computer.playwright_exec("click", {"x": 100, "y": 200})
# Type text
await computer.playwright_exec("type", {"text": "Hello, world!"})
# Scroll
await computer.playwright_exec("scroll", {"delta_x": 0, "delta_y": -100})
# Web search
await computer.playwright_exec("web_search", {"query": "computer use agent"})
"""
return await self.interface.playwright_exec(command, params)
# Add virtual environment management functions to computer interface
async def venv_install(self, venv_name: str, requirements: list[str]):
"""Install packages in a virtual environment.

View File

@@ -661,6 +661,56 @@ class GenericComputerInterface(BaseComputerInterface):
return screenshot_x, screenshot_y
# Playwright browser control
async def playwright_exec(self, command: str, params: Optional[Dict] = None) -> Dict[str, Any]:
"""
Execute a Playwright browser command.
Args:
command: The browser command to execute (visit_url, click, type, scroll, web_search, screenshot)
params: Command parameters
Returns:
Dict containing the command result
Examples:
# Navigate to a URL
await interface.playwright_exec("visit_url", {"url": "https://example.com"})
# Click at coordinates
await interface.playwright_exec("click", {"x": 100, "y": 200})
# Type text
await interface.playwright_exec("type", {"text": "Hello, world!"})
# Scroll
await interface.playwright_exec("scroll", {"delta_x": 0, "delta_y": -100})
# Web search
await interface.playwright_exec("web_search", {"query": "computer use agent"})
"""
protocol = "https" if self.api_key else "http"
port = "8443" if self.api_key else "8000"
url = f"{protocol}://{self.ip_address}:{port}/playwright_exec"
payload = {"command": command, "params": params or {}}
headers = {"Content-Type": "application/json"}
if self.api_key:
headers["X-API-Key"] = self.api_key
if self.vm_name:
headers["X-Container-Name"] = self.vm_name
try:
async with aiohttp.ClientSession() as session:
async with session.post(url, json=payload, headers=headers) as response:
if response.status == 200:
return await response.json()
else:
error_text = await response.text()
return {"success": False, "error": error_text}
except Exception as e:
return {"success": False, "error": str(e)}
# Websocket Methods
async def _keep_alive(self):
"""Keep the WebSocket connection alive with automatic reconnection."""

View File

@@ -45,7 +45,9 @@ class CloudProvider(BaseVMProvider):
        # Fall back to environment variable if api_key not provided
        if api_key is None:
            api_key = os.getenv("CUA_API_KEY")
        assert api_key, "api_key required for CloudProvider (provide via parameter or CUA_API_KEY environment variable)"
        assert (
            api_key
        ), "api_key required for CloudProvider (provide via parameter or CUA_API_KEY environment variable)"
        self.api_key = api_key
        self.verbose = verbose
        self.api_base = (api_base or DEFAULT_API_BASE).rstrip("/")

View File

@@ -14,7 +14,7 @@ export async function runCli() {
' env Export API key to .env file\n' +
' logout Clear stored credentials\n' +
'\n' +
' cua sb <command> Create and manage cloud sandboxes\n' +
' cua sb <command> Create and manage cloud sandboxes\n' +
' list View all your sandboxes\n' +
' create Provision a new sandbox\n' +
' get Get detailed info about a sandbox\n' +

View File

@@ -29,7 +29,7 @@ async function fetchSandboxDetails(
  const sandboxes = (await listRes.json()) as SandboxItem[];
  const sandbox = sandboxes.find((s) => s.name === name);

  if (!sandbox) {
    console.error('Sandbox not found');
    process.exit(1);
@@ -53,24 +53,32 @@ async function fetchSandboxDetails(
  }

  // Probe computer-server if requested and sandbox is running
  if (options.probeComputerServer && sandbox.status === 'running' && sandbox.host) {
  if (
    options.probeComputerServer &&
    sandbox.status === 'running' &&
    sandbox.host
  ) {
    let statusProbeSuccess = false;
    let versionProbeSuccess = false;

    try {
      // Probe OS type
      const statusUrl = `https://${sandbox.host}:8443/status`;
      const statusController = new AbortController();
      const statusTimeout = setTimeout(() => statusController.abort(), 3000);

      try {
        const statusRes = await fetch(statusUrl, {
          signal: statusController.signal,
        });
        clearTimeout(statusTimeout);

        if (statusRes.ok) {
          const statusData = await statusRes.json() as { status: string; os_type: string; features?: string[] };
          const statusData = (await statusRes.json()) as {
            status: string;
            os_type: string;
            features?: string[];
          };
          result.os_type = statusData.os_type;
          statusProbeSuccess = true;
        }
@@ -82,7 +90,7 @@ async function fetchSandboxDetails(
      const versionUrl = `https://${sandbox.host}:8443/cmd`;
      const versionController = new AbortController();
      const versionTimeout = setTimeout(() => versionController.abort(), 3000);

      try {
        const versionRes = await fetch(versionUrl, {
          method: 'POST',
@@ -98,12 +106,16 @@ async function fetchSandboxDetails(
          signal: versionController.signal,
        });
        clearTimeout(versionTimeout);

        if (versionRes.ok) {
          const versionDataRaw = await versionRes.text();
          if (versionDataRaw.startsWith('data: ')) {
            const jsonStr = versionDataRaw.slice(6);
            const versionData = JSON.parse(jsonStr) as { success: boolean; protocol: number; package: string };
            const versionData = JSON.parse(jsonStr) as {
              success: boolean;
              protocol: number;
              package: string;
            };
            if (versionData.package) {
              result.computer_server_version = versionData.package;
              versionProbeSuccess = true;
@@ -116,7 +128,7 @@ async function fetchSandboxDetails(
    } catch (err) {
      // General error - skip probing
    }

    // Set computer server status based on probe results
    if (statusProbeSuccess && versionProbeSuccess) {
      result.computer_server_status = 'healthy';
@@ -394,23 +406,25 @@ const getHandler = async (argv: Record<string, unknown>) => {
  console.log(`Name: ${details.name}`);
  console.log(`Status: ${details.status}`);
  console.log(`Host: ${details.host}`);

  if (showPasswords) {
    console.log(`Password: ${details.password}`);
  }

  if (details.os_type) {
    console.log(`OS Type: ${details.os_type}`);
  }

  if (details.computer_server_version) {
    console.log(`Computer Server Version: ${details.computer_server_version}`);
    console.log(
      `Computer Server Version: ${details.computer_server_version}`
    );
  }

  if (details.computer_server_status) {
    console.log(`Computer Server Status: ${details.computer_server_status}`);
  }

  if (showVncUrl) {
    console.log(`VNC URL: ${details.vnc_url}`);
  }

28
libs/xfce/Development.md Normal file
View File

@@ -0,0 +1,28 @@
# Development

## Building the Development Docker Image

To build the XFCE container with local computer-server changes:

```bash
cd libs/xfce
docker build -f Dockerfile.dev -t cua-xfce:dev ..
```

The build context is set to the parent directory to allow copying the local `computer-server` source.

## Tagging the Image

To tag the dev image as latest:

```bash
docker tag cua-xfce:dev cua-xfce:latest
```

## Running the Development Container

```bash
docker run -p 6901:6901 -p 8000:8000 cua-xfce:dev
```

Access noVNC at: http://localhost:6901
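
Once the container is running, one quick smoke test (a sketch, assuming the default ports above) is to probe the computer-server's `/status` endpoint on the API port:

```python
# Smoke test for the dev container; assumes the API is on localhost:8000.
import asyncio

import aiohttp


async def check():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://localhost:8000/status") as resp:
            # Expect a JSON body with "status" and "os_type" fields
            print(resp.status, await resp.json())


asyncio.run(check())
```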

View File

@@ -107,6 +107,10 @@ RUN mkdir -p /home/cua/.cache && \
# Install computer-server using Python 3.12 pip
RUN python3.12 -m pip install cua-computer-server
# Install playwright and Firefox dependencies
RUN python3.12 -m pip install playwright && \
python3.12 -m playwright install --with-deps firefox
# Fix any cache files created by pip
RUN chown -R cua:cua /home/cua/.cache

159
libs/xfce/Dockerfile.dev Normal file
View File

@@ -0,0 +1,159 @@
# CUA Docker XFCE Container - Development Version
# Vanilla XFCE desktop with noVNC and computer-server (from local source)
FROM ubuntu:22.04
# Avoid prompts from apt
ENV DEBIAN_FRONTEND=noninteractive
# Set environment variables
ENV HOME=/home/cua
ENV DISPLAY=:1
ENV VNC_PORT=5901
ENV NOVNC_PORT=6901
ENV API_PORT=8000
ENV VNC_RESOLUTION=1024x768
ENV VNC_COL_DEPTH=24
# Install system dependencies first (including sudo)
RUN apt-get update && apt-get install -y \
# System utilities
sudo \
unzip \
zip \
xdg-utils \
# Desktop environment
xfce4 \
xfce4-terminal \
dbus-x11 \
# VNC server
tigervnc-standalone-server \
tigervnc-common \
# noVNC dependencies
# python will be installed via deadsnakes as 3.12
git \
net-tools \
netcat \
supervisor \
# Computer-server dependencies
# python-tk/dev for 3.12 will be installed later
gnome-screenshot \
wmctrl \
ffmpeg \
socat \
xclip \
# Browser
wget \
software-properties-common \
# Build tools
build-essential \
libncursesw5-dev \
libssl-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/*
# Install Python 3.12 from deadsnakes (keep system python3 for apt)
RUN add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get update && apt-get install -y \
python3.12 python3.12-venv python3.12-dev python3.12-tk && \
python3.12 -m ensurepip --upgrade && \
python3.12 -m pip install --upgrade pip setuptools wheel && \
rm -rf /var/lib/apt/lists/*
# Ensure 'python' points to Python 3.12
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 2
# Remove screensavers and power manager to avoid popups and lock screens
RUN apt-get remove -y \
xfce4-power-manager \
xfce4-power-manager-data \
xfce4-power-manager-plugins \
xfce4-screensaver \
light-locker \
xscreensaver \
xscreensaver-data || true
# Create user after sudo is installed
RUN useradd -m -s /bin/bash -G sudo cua && \
echo "cua:cua" | chpasswd && \
echo "cua ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
# Install Firefox from Mozilla PPA (snap-free) - inline to avoid script issues
RUN apt-get update && \
add-apt-repository -y ppa:mozillateam/ppa && \
echo 'Package: *\nPin: release o=LP-PPA-mozillateam\nPin-Priority: 1001' > /etc/apt/preferences.d/mozilla-firefox && \
apt-get update && \
apt-get install -y firefox && \
echo 'pref("datareporting.policy.firstRunURL", "");\npref("datareporting.policy.dataSubmissionEnabled", false);\npref("datareporting.healthreport.service.enabled", false);\npref("datareporting.healthreport.uploadEnabled", false);\npref("trailhead.firstrun.branches", "nofirstrun-empty");\npref("browser.aboutwelcome.enabled", false);' > /usr/lib/firefox/browser/defaults/preferences/firefox.js && \
update-alternatives --install /usr/bin/x-www-browser x-www-browser /usr/bin/firefox 100 && \
update-alternatives --install /usr/bin/gnome-www-browser gnome-www-browser /usr/bin/firefox 100 && \
rm -rf /var/lib/apt/lists/*
# Install noVNC
RUN git clone https://github.com/novnc/noVNC.git /opt/noVNC && \
git clone https://github.com/novnc/websockify /opt/noVNC/utils/websockify && \
ln -s /opt/noVNC/vnc.html /opt/noVNC/index.html
# Pre-create cache directory with correct ownership before pip install
RUN mkdir -p /home/cua/.cache && \
chown -R cua:cua /home/cua/.cache
# Copy local computer-server source and install it
COPY python/computer-server /tmp/computer-server
RUN python3.12 -m pip install /tmp/computer-server && \
rm -rf /tmp/computer-server
# Install playwright and Firefox dependencies
RUN python3.12 -m pip install playwright && \
python3.12 -m playwright install --with-deps firefox
# Fix any cache files created by pip
RUN chown -R cua:cua /home/cua/.cache
# Copy startup scripts
COPY xfce/src/supervisor/ /etc/supervisor/conf.d/
COPY xfce/src/scripts/ /usr/local/bin/
# Make scripts executable
RUN chmod +x /usr/local/bin/*.sh
# Setup VNC
RUN chown -R cua:cua /home/cua
USER cua
WORKDIR /home/cua
# Create VNC directory (no password needed with SecurityTypes None)
RUN mkdir -p $HOME/.vnc
# Configure XFCE for first start
RUN mkdir -p $HOME/.config/xfce4/xfconf/xfce-perchannel-xml $HOME/.config/xfce4 $HOME/.config/autostart
# Copy XFCE config to disable browser launching and welcome screens
COPY --chown=cua:cua xfce/src/xfce-config/helpers.rc $HOME/.config/xfce4/helpers.rc
COPY --chown=cua:cua xfce/src/xfce-config/xfce4-session.xml $HOME/.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-session.xml
COPY --chown=cua:cua xfce/src/xfce-config/xfce4-power-manager.xml $HOME/.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-power-manager.xml
# Disable autostart for screensaver, lock screen, and power manager
RUN echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-tips-autostart.desktop && \
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-screensaver.desktop && \
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/light-locker.desktop && \
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-power-manager.desktop && \
chown -R cua:cua $HOME/.config
# Create storage and shared directories, and Firefox cache directory
RUN mkdir -p $HOME/storage $HOME/shared $HOME/.cache/dconf $HOME/.mozilla/firefox && \
chown -R cua:cua $HOME/storage $HOME/shared $HOME/.cache $HOME/.mozilla $HOME/.vnc
USER root
# Expose ports
EXPOSE $VNC_PORT $NOVNC_PORT $API_PORT
# Start services via supervisor
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/supervisord.conf"]

View File

@@ -10,4 +10,4 @@ echo "X server is ready"
# Start computer-server
export DISPLAY=:1
python3 -m computer_server --port ${API_PORT:-8000}
python -m computer_server --port ${API_PORT:-8000}

5714
uv.lock generated

File diff suppressed because it is too large