Mirror of https://github.com/trycua/computer.git, synced 2026-01-07 22:10:02 -06:00

Merge branch 'main' into fix/nextjs-vuln

.github/workflows/lint.yml (vendored)
@@ -2,8 +2,7 @@ name: Lint & Format Check

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main
@@ -15,6 +15,8 @@ repos:
      name: TypeScript type check
      entry: node ./scripts/typescript-typecheck.js
      language: node
      files: \.(ts|tsx)$
      pass_filenames: false

  - repo: https://github.com/PyCQA/isort
    rev: 7.0.0
@@ -21,7 +21,6 @@ The Playground connects to your existing Cua sandboxes—the same ones you use w

<video src="https://github.com/user-attachments/assets/9fef0f30-1024-4833-8b7a-6a2c02d8eb99" width="600" controls></video>
</div>

Sign up at [cua.ai/signin](https://cua.ai/signin) and grab your API key from the dashboard. Then navigate to the Playground:

1. Navigate to Dashboard > Playground

@@ -33,6 +32,7 @@ Sign up at [cua.ai/signin](https://cua.ai/signin) and grab your API key from the

Example use cases:

**Prompt Testing**

```
❌ "Check the website"
✅ "Navigate to example.com in Firefox and take a screenshot of the homepage"
```
@@ -42,6 +42,7 @@ Example use cases:

Run the same task with different models to compare quality, speed, and cost.

**Debugging Agent Behavior**

1. Send: "Find the login button and click it"
2. View tool calls to see each mouse movement
3. Check screenshots to verify the agent found the right element
@@ -51,7 +51,6 @@ When you request an Anthropic model through Cua, we automatically route to the b

Sign up at [cua.ai/signin](https://cua.ai/signin) and create your API key from **Dashboard > API Keys > New API Key** (save it immediately—you won't see it again).

Use it with the Agent SDK (make sure to set your environment variable):

```python
# Illustrative setup only; the variable name CUA_API_KEY is an assumption,
# so check the Agent SDK docs for the exact name your version expects.
import os

os.environ["CUA_API_KEY"] = "your-api-key"
```
||||
@@ -29,13 +29,13 @@ A few papers stand out for their immediate relevance to anyone building or deplo
|
||||
|
||||
## Summary Statistics
|
||||
|
||||
| Category | Count |
|
||||
|----------|-------|
|
||||
| Benchmarks & Datasets | 18 |
|
||||
| Safety & Security | 12 |
|
||||
| Grounding & Visual Reasoning | 14 |
|
||||
| Agent Architectures & Training | 11 |
|
||||
| Adversarial Attacks | 8 |
|
||||
| Category | Count |
|
||||
| ------------------------------ | ----- |
|
||||
| Benchmarks & Datasets | 18 |
|
||||
| Safety & Security | 12 |
|
||||
| Grounding & Visual Reasoning | 14 |
|
||||
| Agent Architectures & Training | 11 |
|
||||
| Adversarial Attacks | 8 |
|
||||
|
||||
**Total Papers:** 45
|
||||
|
||||
@@ -56,6 +56,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** The first comprehensive benchmark for evaluating GUI agents on macOS. Features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with support for 5 languages (English, Chinese, Arabic, Japanese, Russian). Reveals a dramatic gap: proprietary agents achieve 30%+ success rate while open-source models lag below 5%. Also includes safety benchmarking for deception attacks.

**Key Findings:**

- Proprietary computer-use agents lead at above 30% success rate
- Open-source lightweight models struggle below 5%, highlighting the need for macOS domain adaptation
- Multilingual benchmarks expose weaknesses, especially in Arabic (28.8% degradation vs English)

@@ -70,6 +71,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A comprehensive safety benchmark built on OSWorld for testing computer-use agents across three harm categories: deliberate user misuse, prompt injection attacks, and model misbehavior. Includes 150 tasks spanning harassment, copyright infringement, disinformation, data exfiltration, and more. Proposes an automated judge achieving high agreement with human annotations (0.76-0.79 F1 score).

**Key Findings:**

- All tested models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro) tend to directly comply with many deliberate misuse queries
- Models are relatively vulnerable to static prompt injections
- Models occasionally perform unsafe actions without explicit malicious prompts

@@ -83,6 +85,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A comprehensive open-source framework for scaling computer-use agent data and foundation models. Introduces AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications/websites. OpenCUA-72B achieves 45% success rate on OSWorld-Verified, establishing new state-of-the-art among open-source models.

**Key Contributions:**

- Annotation infrastructure for capturing human computer-use demonstrations
- AgentNet: large-scale dataset across 3 OSes and 200+ apps
- Scalable pipeline transforming demonstrations into state-action pairs with reflective Chain-of-Thought reasoning

@@ -97,6 +100,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A benchmark of 130 realistic, high-quality, long-horizon tasks for agentic search systems (like Deep Research), requiring real-time web browsing and extensive information synthesis. Constructed with 1000+ hours of human labor. Introduces Agent-as-a-Judge framework using tree-structured rubric design for automated evaluation.

**Key Findings:**

- OpenAI Deep Research achieves 50-70% of human performance while spending half the time
- First systematic evaluation of ten frontier agentic search systems vs. human performance
- Addresses the challenge of evaluating time-varying, complex answers

@@ -110,6 +114,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Addresses GUI grounding—mapping natural language to specific UI actions—as a critical bottleneck in agent development. Introduces OSWorld-G benchmark (564 annotated samples) and Jedi dataset (4 million synthetic examples), the largest computer-use grounding dataset. Improved grounding directly enhances agentic capabilities, boosting OSWorld performance from 23% to 51%.

**Key Contributions:**

- OSWorld-G: comprehensive benchmark for diverse grounding tasks (text matching, element recognition, layout understanding, precise manipulation)
- Jedi: 4M examples through multi-perspective task decoupling
- Demonstrates compositional generalization to novel interfaces
@@ -123,6 +128,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Evaluates potential safety risks of MLLM-based agents during real-world computer manipulation. Features 492 risky tasks spanning web, social media, multimedia, OS, email, and office software. Categorizes risks into user-originated and environmental risks, evaluating both risk goal intention and completion.

**Key Findings:**

- Current computer-use agents face significant safety risks in real-world scenarios
- Safety principles designed for dialogue scenarios don't transfer well to computer-use
- Highlights the necessity and urgency of safety alignment for computer-use agents

@@ -136,6 +142,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A benchmark featuring high-fidelity, deterministic replicas of 11 widely-used websites across e-commerce, travel, communication, and professional networking. Contains 112 practical tasks requiring both information retrieval and state-changing actions. Enables reproducible evaluation without safety risks.

**Key Findings:**

- Best frontier language models achieve only 41% success rate
- Highlights critical gaps in autonomous web navigation and task completion
- Supports scalable post-training data generation

@@ -149,6 +156,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** An RL-based framework for GUI grounding incorporating seed data curation, dense policy gradients, and self-evolutionary reinforcement finetuning using attention maps. With only 3K training samples, the 7B model achieves state-of-the-art on three grounding benchmarks, outperforming UI-TARS-72B by 24.2% on ScreenSpot-Pro.

**Key Results:**

- 47.3% accuracy on ScreenSpot-Pro with 7B model
- Outperforms 72B models with a fraction of the training data
- Demonstrates effectiveness of RL for high-resolution, complex environments

@@ -162,6 +170,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A generative adversarial framework that manipulates agent decision-making using diffusion-based semantic injections. Combines negative prompt degradation with positive semantic optimization. Without model access, produces visually natural images that induce consistent decision biases in agents.

**Key Findings:**

- Consistently induces decision-level preference redirection on LLaVA-34B, Gemma3, GPT-4o, and Mistral-3.2
- Outperforms baselines (SPSA, Bandit, standard diffusion)
- Exposes vulnerability: autonomous agents can be misled through visually subtle, semantically-guided manipulations

@@ -175,6 +184,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** An extensible benchmark simulating a small software company environment where AI agents interact like digital workers: browsing the web, writing code, running programs, and communicating with coworkers. Tests agents on real professional tasks with important implications for industry adoption and labor market effects.

**Key Findings:**

- Best agent achieves 30% autonomous task completion
- Simpler tasks are solvable autonomously
- More difficult long-horizon tasks remain beyond current systems' reach
@@ -188,6 +198,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A comprehensive benchmark for VLMs in video game QA, encompassing visual unit testing, visual regression testing, needle-in-a-haystack challenges, glitch detection, and bug report generation for both images and videos. Addresses the need for standardized benchmarks in this labor-intensive domain.

**Key Focus:**

- First benchmark specifically designed for video game QA with VLMs
- Covers wide range of QA activities across images and videos
- Addresses lack of automation in game development workflows

@@ -201,6 +212,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** End-to-end benchmark for evaluating web agent security against prompt injection attacks. Tests realistic scenarios where even simple, low-effort human-written injections can deceive top-tier AI models including those with advanced reasoning.

**Key Findings:**

- Attacks partially succeed in up to 86% of cases
- State-of-the-art agents often struggle to fully complete attacker goals
- Reveals "security by incompetence"—agents' limitations sometimes prevent full attack success

@@ -214,6 +226,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Measures whether AI web-navigation agents follow the privacy principle of "data minimization"—using sensitive information only when truly necessary to complete a task. Simulates realistic web interaction scenarios end-to-end.

**Key Findings:**

- Agents built on GPT-4, Llama-3, and Claude are prone to inadvertent use of unnecessary sensitive information
- Proposes prompting-based defense that reduces information leakage
- End-to-end benchmarking provides more realistic measure than probing LLMs about privacy

@@ -227,6 +240,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. Creates unified simulation integrating realistic 3D indoor/outdoor environments with functional web interfaces. Tasks include cooking from online recipes, navigating with dynamic map data, and interpreting landmarks using web knowledge.

**Key Contributions:**

- Unified platform combining 3D environments with web interfaces
- Benchmark spanning cooking, navigation, shopping, tourism, and geolocation
- Reveals significant performance gaps between AI systems and humans

@@ -240,6 +254,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** The first attempt to model UI interactions for precision engineering tasks. Features 41K+ annotated video recordings of CAD operations with time horizons up to 20x longer than existing datasets. Proposes VideoCADFormer for learning CAD interactions directly from video.

**Key Contributions:**

- Large-scale synthetic dataset for CAD UI interactions
- VQA benchmark for evaluating spatial reasoning and video understanding
- Reveals challenges in precise action grounding and long-horizon dependencies
@@ -253,6 +268,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Introduces a pre-operative critic mechanism that provides feedback before action execution by reasoning about potential outcomes. Proposes Suggestion-aware Group Relative Policy Optimization (S-GRPO) for building the GUI-Critic-R1 model with fully automated data generation.

**Key Results:**

- Significant advantages in critic accuracy compared to current MLLMs
- Improved success rates and operational efficiency on GUI automation benchmarks
- Works across both mobile and web domains

@@ -266,7 +282,8 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Introduces multi-turn RL framework enabling dynamic zooming into predicted coordinates during reasoning.

**Key Results:**

- 86.4% on V\*Bench for visual search
- Outperforms supervised fine-tuning and conventional RL across spatial reasoning, visual search, and web-based grounding
- Grounding amplifies region exploration, subgoal setting, and visual verification

@@ -279,6 +296,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A VLM-based method for coordinate-free GUI grounding using an attention-based action head. Enables proposing one or more action regions in a single forward pass with a grounding verifier for selection.

**Key Results:**

- GUI-Actor-7B achieves 44.6 on ScreenSpot-Pro with Qwen2.5-VL, outperforming UI-TARS-72B (38.1)
- Improved generalization to unseen resolutions and layouts
- Fine-tuning only ~100M parameters achieves SOTA performance

@@ -292,11 +310,13 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Extensive analysis of the R1-Zero paradigm (online RL + chain-of-thought reasoning) for GUI grounding. Identifies issues: longer reasoning chains lead to worse performance, reward hacking via box size exploitation, and overfitting to easy examples.

**Solutions Proposed:**

- Fast Thinking Template for direct answer generation
- Box size constraint in reward function
- Difficulty-aware scaling in RL objective

**Key Results:**

- GUI-G1-3B achieves 90.3% on ScreenSpot and 37.1% on ScreenSpot-Pro
- Outperforms larger UI-TARS-7B with only 3B parameters
@@ -309,6 +329,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Framework integrating self-reflection and error correction into end-to-end multimodal GUI models through GUI-specific pre-training, offline SFT, and online reflection tuning. Enables self-reflection emergence with fully automated data generation.

**Key Contributions:**

- Scalable pipelines for automatic reflection/correction data from successful trajectories
- GUI-Reflection Task Suite for reflection-oriented abilities
- Diverse environment for online training on mobile devices

@@ -323,6 +344,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A generalist agent capable of multimodal computer interaction (text, images, audio, video). Integrates tool-based and pure vision agents within highly modular architecture, enabling collaborative step-by-step task solving.

**Key Results:**

- 7.27 accuracy gain over Claude-Computer-Use on OSWorld
- Evaluated on pure vision benchmarks (OSWorld), general benchmarks (GAIA), and tool-intensive benchmarks (SWE-Bench)
- Demonstrates value of modular, collaborative agent architecture

@@ -336,6 +358,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A fine-grained adversarial attack framework that modifies VLM perception of only key objects while preserving semantics of remaining regions. Unlike broad semantic disruption, this targeted approach reduces conflicts with task context, making VLMs output valid but incorrect decisions that affect agent actions in the physical world.

**Key Contributions:**

- AdvEDM-R: removes semantics of specific objects from images
- AdvEDM-A: adds semantics of new objects into images
- Demonstrates fine-grained control with excellent attack performance in embodied decision-making tasks

@@ -349,6 +372,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A vision-centric reasoning benchmark grounded in challenging perceptual tasks. Unlike prior benchmarks, it moves beyond shallow perception ("see") to require fine-grained observation and analytical reasoning ("observe"). Features natural adversarial image pairs and annotated reasoning chains for process evaluation.

**Key Findings:**

- Tests 20 leading MLLMs including 12 foundation models and 8 reasoning-enhanced models
- Existing reasoning strategies (chain-of-thought, self-criticism) result in unstable and redundant reasoning
- Repeated image observation improves performance across models

@@ -363,6 +387,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** First systematic investigation of backdoor vulnerabilities in VLA models. Proposes Objective-Decoupled Optimization with two stages: explicit feature-space separation to isolate trigger representations, and conditional control deviations activated only by triggers.

**Key Findings:**

- Consistently achieves near-100% attack success rates with minimal impact on clean task accuracy
- Robust against common input perturbations, task transfers, and model fine-tuning
- Exposes critical security vulnerabilities in current VLA deployments under Training-as-a-Service paradigm
@@ -376,6 +401,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Benchmark for proactively inferring user goals from multimodal contextual observations for wearable assistant agents (smart glasses). Dataset comprises ~30 hours from 363 participants across 3,482 recordings with visual, audio, digital, and longitudinal context.

**Key Findings:**

- Humans achieve 93% MCQ accuracy; best VLM reaches ~84%
- For open-ended generation, best models produce relevant goals only ~57% of the time
- Smaller models (suited for wearables) achieve ~49% accuracy

@@ -390,6 +416,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A game-theoretic multi-agent framework formulating reasoning as a non-zero-sum game between base agents (visual perception specialists) and a critical agent (logic/fact verification). Features uncertainty-aware controller for dynamic agent collaboration with multi-round debates.

**Key Results:**

- Boosts small-to-mid scale models (Qwen2.5-VL-7B, InternVL3-14B) by 5-6%
- Enhances strong models like GPT-4o by 2-3%
- Modular, scalable, and generalizable framework

@@ -403,6 +430,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Introduces Grounded Reasoning with Images and Texts—a method for training MLLMs to generate reasoning chains interleaving natural language with explicit bounding box coordinates. Uses GRPO-GR reinforcement learning with rewards focused on answer accuracy and grounding format.

**Key Contributions:**

- Exceptional data efficiency: requires as few as 20 image-question-answer triplets
- Successfully unifies reasoning and grounding abilities
- Eliminates need for reasoning chain annotations or explicit bounding box labels

@@ -416,6 +444,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** First multimodal safety alignment framework. Introduces BeaverTails-V (first dataset with dual preference annotations for helpfulness and safety), and Beaver-Guard-V (multi-level guardrail system defending against unsafe queries and adversarial attacks).

**Key Results:**

- Guard model improves precursor model's safety by an average of 40.9% over five filtering rounds
- Safe RLHF-V enhances model safety by 34.2% and helpfulness by 34.3%
- First exploration of multi-modal safety alignment within constrained optimization

@@ -429,6 +458,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** An inference-time approach that quantifies visual token uncertainty and selectively masks uncertain tokens. Decomposes uncertainty into aleatoric and epistemic components, focusing on epistemic uncertainty for perception-related errors.

**Key Results:**

- Significantly reduces object hallucinations
- Enhances reliability and quality of LVLM outputs across diverse visual contexts
- Validated on CHAIR, THRONE, and MMBench benchmarks
@@ -442,6 +472,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A unified LVLM integrating segmentation-aware perception and controllable object-centric generation. Uses dual-branch visual encoder for global semantic context and fine-grained spatial details, with MoVQGAN-based visual tokenizer for discrete visual tokens.

**Key Contributions:**

- Progressive multi-stage training pipeline
- Segmentation masks jointly optimized as spatial condition prompts
- Bridges segmentation-aware perception with fine-grained visual synthesis

@@ -455,6 +486,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Introduces Multi-Model Monte Carlo Tree Search (M3CTS) for generating diverse Long Chain-of-Thought reasoning trajectories. Proposes fine-grained Direct Preference Optimization (fDPO) with segment-specific preference granularity guided by spatial reward mechanism.

**Key Results:**

- fDPO achieves 4.1% and 9.0% gains over standard DPO on spatial quality and quantity tasks
- SpatialReasoner-R1 sets new SOTA on SpatialRGPT-Bench, outperforming strongest baseline by 9.8%
- Maintains competitive performance on general vision-language tasks

@@ -468,6 +500,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A two-stage reinforcement fine-tuning framework: SFT with curated Chain-of-Thought data activates reasoning potential, followed by RL based on Group Relative Policy Optimization (GRPO) for domain shift adaptability.

**Key Advantages:**

- State-of-the-art results outperforming both open-source and proprietary models
- Robust performance under domain shifts across various tasks
- Excellent data efficiency in few-shot learning scenarios

@@ -481,6 +514,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Reveals that safe images can be exploited for jailbreaking when combined with additional safe images and prompts, exploiting LVLMs' universal reasoning capabilities and safety snowball effect. Proposes Safety Snowball Agent (SSA) framework.

**Key Findings:**

- SSA can use nearly any image to induce LVLMs to produce unsafe content
- Achieves high jailbreak success rates against latest LVLMs
- Exploits inherent LVLM properties rather than alignment flaws

@@ -494,6 +528,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Uncovers novel attack vector: Malicious Image Patches (MIPs)—adversarially perturbed screen regions that induce OS agents to perform harmful actions. MIPs can be embedded in wallpapers or shared on social media to exfiltrate sensitive data.

**Key Findings:**

- MIPs generalize across user prompts and screen configurations
- Can hijack multiple OS agents during execution of benign instructions
- Exposes critical security vulnerabilities requiring attention before widespread deployment
@@ -507,6 +542,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A framework leveraging instruction-driven routing and sparsification for VLA efficiency. Features 3-stage progressive architecture inspired by human multimodal coordination: Encoder-FiLM Aggregation Routing, LLM-FiLM Pruning Routing, and V-L-A Coupled Attention.

**Key Results:**

- 97.4% success rate on LIBERO benchmark, 70.0% on real-world robotic tasks
- Reduces training costs by 2.5x and inference latency by 2.8x compared to OpenVLA
- Achieves state-of-the-art performance

@@ -520,6 +556,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Novel off-policy RL algorithm applying direct policy updates for positive samples and conservative, regularized updates for negative ones. Augmented with Successful Transition Replay (STR) for prioritizing successful interactions.

**Key Results:**

- At least 17% relative increase over existing methods on AndroidWorld benchmark
- Substantially fewer computational resources than GPT-4o-based methods
- 5-60x faster inference

@@ -533,6 +570,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** An API-centric stress testing framework that uncovers intent integrity violations in LLM agents. Uses semantic partitioning to organize tasks into meaningful categories, with targeted mutations to expose subtle agent errors while preserving user intent.

**Key Contributions:**

- Datatype-aware strategy memory for retrieving effective mutation patterns
- Lightweight predictor for ranking mutations by error likelihood
- Generalizes to stronger target models using smaller LLMs for test generation

@@ -546,6 +584,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** A dual-system framework bridging high-level reasoning with low-level action execution. Trains multimodal LLM to generate embodied reasoning plans guided by action-aligned visual rewards, compressed into visual plan latents for downstream action execution.

**Key Capabilities:**

- Few-shot adaptation
- Long-horizon planning
- Self-correction behaviors in complex embodied AI tasks

@@ -559,6 +598,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Automated attack framework that constructs chains of images with risky visual thoughts to challenge VLMs. Exploits the conflict between logical processing and safety protocols, leading to unsafe content generation.

**Key Results:**

- Improves average attack success rate by 26.71% (from 63.70% to 90.41%)
- Tested on 9 open-source and 6 commercial VLMs
- Outperforms state-of-the-art methods

@@ -572,6 +612,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** First web-based benchmark evaluating MLLM agents on diverse CAPTCHA puzzles. Spans 20 modern CAPTCHA types (225 total) with novel metric: CAPTCHA Reasoning Depth quantifying cognitive and motor steps required.

**Key Findings:**

- Humans achieve 93.3% success rate
- State-of-the-art agents achieve at most 40.0% (Browser-Use OpenAI-o3)
- Highlights significant gap between human and agent capabilities
@@ -585,7 +626,8 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Introduces pixel-space reasoning framework where VLMs use visual operations (zoom-in, select-frame) to directly inspect and infer from visual evidence. Two-phase training: instruction tuning on synthesized traces, then RL with curiosity-driven rewards.

**Key Results:**

- 84% on V\*Bench, 74% on TallyQA-Complex, 84% on InfographicsVQA
- Highest accuracy achieved by any open-source 7B model
- Enables proactive information gathering from complex visual inputs

@@ -598,6 +640,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Brain-inspired framework decomposing interactions into three biologically plausible phases: Blink (rapid detection via saccadic-like attention), Think (higher-level reasoning/planning), and Link (executable command generation for motor control).

**Key Innovations:**

- Automated annotation pipeline for blink data
- BTL Reward: first rule-based reward mechanism driven by both process and outcome
- Competitive performance on static GUI understanding and dynamic interaction tasks

@@ -611,6 +654,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Simulation environment engine enabling flexible definition of screens, icons, and navigation graphs with full environment access for agent training/evaluation. Demonstrates progressive training approach from SFT to multi-turn RL.

**Key Findings:**

- Supervised fine-tuning enables memorization of fundamental knowledge
- Single-turn RL enhances generalization to unseen scenarios
- Multi-turn RL encourages exploration strategies through interactive trial and error

@@ -624,6 +668,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Reasoning-enhanced framework integrating structured reasoning, action prediction, and history summarization. Uses Chain-of-Thought analyses combining progress estimation and decision reasoning, trained via SFT and GRPO with history-aware rewards.

**Key Results:**

- State-of-the-art under identical training data conditions
- Particularly strong in out-of-domain scenarios
- Robust reasoning and generalization across diverse GUI navigation tasks

@@ -637,6 +682,7 @@ We'll be at NeurIPS in San Diego. If you're working on computer-use agents, buil

**Summary:** Self-improving framework addressing trajectory verification and training data scalability. Features UI-Genie-RM (image-text interleaved reward model) and self-improvement pipeline with reward-guided exploration and outcome verification.

**Key Contributions:**

- UI-Genie-RM-517k: first reward-specific dataset for GUI agents
- UI-Genie-Agent-16k: high-quality synthetic trajectories without manual annotation
- State-of-the-art across multiple GUI agent benchmarks through three generations of self-improvement
@@ -250,13 +250,15 @@ $ cua get my-dev-sandbox --json

**Computer Server Health Check:**

The `cua get` command automatically probes the computer-server when the sandbox is running:

- Checks OS type via `https://{host}:8443/status`
- Checks version via `https://{host}:8443/cmd`
- Shows "Computer Server Status: healthy" when both probes succeed
- Uses a 3-second timeout for each probe

<Callout type="info">
The computer server status is only checked for running sandboxes. Stopped or suspended sandboxes
will not show computer server information.
</Callout>

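The probe logic described above can be sketched as follows. This is an illustrative sketch, not the CLI's actual implementation; the `probe_urls`/`probe` helpers and the example host are assumptions:

```python
import urllib.request


def probe_urls(host: str) -> dict:
    """Build the two endpoints `cua get` probes on a running sandbox."""
    return {
        "status": f"https://{host}:8443/status",  # OS type probe
        "cmd": f"https://{host}:8443/cmd",        # version probe
    }


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True when the endpoint answers within the 3-second timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```

When both probes return True, the CLI would report "Computer Server Status: healthy".
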
### `cua start`
@@ -4,12 +4,12 @@ title: Configuration

The server is configured using environment variables (can be set in the Claude Desktop config):

| Variable                       | Description                                                                                                                                                                                               | Default                            |
| ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| `CUA_MODEL_NAME`               | Model string (e.g., "anthropic/claude-sonnet-4-20250514", "openai/computer-use-preview", "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", "omniparser+litellm/gpt-4o", "omniparser+ollama_chat/gemma3") | anthropic/claude-sonnet-4-20250514 |
| `ANTHROPIC_API_KEY`            | Your Anthropic API key (required for Anthropic models)                                                                                                                                                     | None                               |
| `CUA_MAX_IMAGES`               | Maximum number of images to keep in context                                                                                                                                                                | 3                                  |
| `CUA_USE_HOST_COMPUTER_SERVER` | Target your local desktop instead of a VM. Set to "true" to use your host system. **Warning:** AI models may perform risky actions.                                                                        | false                              |
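
As an illustration, these variables might be wired into a Claude Desktop config (`claude_desktop_config.json`) like this; the server entry name and `cua-mcp-server` command are hypothetical placeholders, not documented values:

```json
{
  "mcpServers": {
    "cua-agent": {
      "command": "cua-mcp-server",
      "env": {
        "CUA_MODEL_NAME": "anthropic/claude-sonnet-4-20250514",
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "CUA_MAX_IMAGES": "3"
      }
    }
  }
}
```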

## Model Configuration

@@ -17,7 +17,7 @@ The `CUA_MODEL_NAME` environment variable supports various model providers throu

### Supported Providers

- **Anthropic**: `anthropic/claude-sonnet-4-20250514`
- **OpenAI**: `openai/computer-use-preview`, `openai/gpt-4o`
- **Local Models**: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- **Omni + LiteLLM**: `omniparser+litellm/gpt-4o`, `omniparser+litellm/claude-3-haiku`

119
examples/browser_tool_example.py
Normal file
@@ -0,0 +1,119 @@
"""
Browser Tool Example

Demonstrates how to use the BrowserTool to control a browser programmatically
via the computer server. The browser runs visibly on the XFCE desktop so visual
agents can see it.

Prerequisites:
- Computer server running (Docker container or local)
- For Docker: Container should be running with browser tool support
- For local: Playwright and Firefox must be installed

Usage:
    python examples/browser_tool_example.py
"""

import asyncio
import logging
import sys
from pathlib import Path

# Add the libs path to sys.path
libs_path = Path(__file__).parent.parent / "libs" / "python"
sys.path.insert(0, str(libs_path))

from agent.tools.browser_tool import BrowserTool

# Import Computer interface and BrowserTool
from computer import Computer

# Configure logging to see what's happening
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


async def test_browser_tool():
    """Test the BrowserTool with various commands."""

    # Initialize the computer interface
    # For local testing, use provider_type="docker"
    # For provider_type="cloud", provide name and api_key
    computer = Computer(provider_type="docker", os_type="linux", image="cua-xfce:dev")
    await computer.run()

    # Initialize the browser tool with the computer interface
    browser = BrowserTool(interface=computer)

    logger.info("Testing Browser Tool...")

    try:
        # Test 0: Take a screenshot (pre-init)
        logger.info("Test 0: Taking a screenshot...")
        screenshot_bytes = await browser.screenshot()
        screenshot_path = Path(__file__).parent / "browser_screenshot_init.png"
        with open(screenshot_path, "wb") as f:
            f.write(screenshot_bytes)
        logger.info(f"Screenshot captured: {len(screenshot_bytes)} bytes")

        # Test 1: Visit a URL
        logger.info("Test 1: Visiting a URL...")
        result = await browser.visit_url("https://www.trycua.com")
        logger.info(f"Visit URL result: {result}")

        # Wait a bit for the page to load
        await asyncio.sleep(2)

        # Test 2: Take a screenshot
        logger.info("Test 2: Taking a screenshot...")
        screenshot_bytes = await browser.screenshot()
        screenshot_path = Path(__file__).parent / "browser_screenshot.png"
        with open(screenshot_path, "wb") as f:
            f.write(screenshot_bytes)
        logger.info(f"Screenshot captured: {len(screenshot_bytes)} bytes")

        # Wait a bit
        await asyncio.sleep(1)

        # Test 3: Visit bot detector
        logger.info("Test 3: Visiting bot detector...")
        result = await browser.visit_url("https://bot-detector.rebrowser.net/")
        logger.info(f"Visit URL result: {result}")

        # Test 4: Web search
        logger.info("Test 4: Performing a web search...")
        result = await browser.web_search("Python programming")
        logger.info(f"Web search result: {result}")

        # Wait a bit
        await asyncio.sleep(2)

        # Test 5: Scroll
        logger.info("Test 5: Scrolling the page...")
        result = await browser.scroll(delta_x=0, delta_y=500)
        logger.info(f"Scroll result: {result}")

        # Wait a bit
        await asyncio.sleep(1)

        # Test 6: Click (example coordinates - adjust based on your screen)
        logger.info("Test 6: Clicking at coordinates...")
        result = await browser.click(x=500, y=300)
        logger.info(f"Click result: {result}")

        # Wait a bit
        await asyncio.sleep(1)

        # Test 7: Type text (if there's a focused input field)
        logger.info("Test 7: Typing text...")
        result = await browser.type("Hello from BrowserTool!")
        logger.info(f"Type result: {result}")

        logger.info("All tests completed!")

    except Exception as e:
        logger.error(f"Error during testing: {e}", exc_info=True)


if __name__ == "__main__":
    asyncio.run(test_browser_tool())


@@ -8,6 +8,7 @@ from . import (
    composed_grounded,
    gelato,
    gemini,
    generic_vlm,
    glm45v,
    gta1,
    holo,
@@ -16,7 +17,6 @@ from . import (
    omniparser,
    openai,
    opencua,
    uiins,
    uitars,
    uitars2,
@@ -24,19 +24,19 @@ from . import (

__all__ = [
    "anthropic",
    "composed_grounded",
    "gelato",
    "gemini",
    "generic_vlm",
    "glm45v",
    "gta1",
    "holo",
    "internvl",
    "moondream3",
    "omniparser",
    "openai",
    "opencua",
    "uiins",
    "uitars",
    "uitars2",
]

@@ -442,7 +442,7 @@ def get_all_element_descriptions(responses_items: List[Dict[str, Any]]) -> List[

# Conversion functions between responses_items and completion messages formats
def convert_responses_items_to_completion_messages(
    messages: List[Dict[str, Any]],
    allow_images_in_tool_results: bool = True,
    send_multiple_user_images_per_parallel_tool_results: bool = False,
) -> List[Dict[str, Any]]:
@@ -573,25 +573,33 @@ def convert_responses_items_to_completion_messages(
                "computer_call_output",
            ]
            # Send tool message + separate user message with image (OpenAI compatible)
            completion_messages += (
                [
                    {
                        "role": "tool",
                        "tool_call_id": call_id,
                        "content": "[Execution completed. See screenshot below]",
                    },
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image_url",
                                "image_url": {"url": output.get("image_url")},
                            }
                        ],
                    },
                ]
                if send_multiple_user_images_per_parallel_tool_results
                or (not is_next_message_image_result)
                else [
                    {
                        "role": "tool",
                        "tool_call_id": call_id,
                        "content": "[Execution completed. See screenshot below]",
                    },
                ]
            )
        else:
            # Handle text output as tool response
            completion_messages.append(

6
libs/python/agent/agent/tools/__init__.py
Normal file
@@ -0,0 +1,6 @@
"""Tools for agent interactions."""

from .browser_tool import BrowserTool

__all__ = ["BrowserTool"]

135
libs/python/agent/agent/tools/browser_tool.py
Normal file
@@ -0,0 +1,135 @@
"""
Browser Tool for agent interactions.
Allows agents to control a browser programmatically via Playwright.
"""

import logging
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from computer.interface import GenericComputerInterface

logger = logging.getLogger(__name__)


class BrowserTool:
    """
    Browser tool that uses the computer SDK's interface to control a browser.
    Implements the Fara/Magentic-One agent interface for browser control.
    """

    def __init__(
        self,
        interface: "GenericComputerInterface",
    ):
        """
        Initialize the BrowserTool.

        Args:
            interface: A GenericComputerInterface instance that provides playwright_exec
        """
        self.interface = interface
        self.logger = logger

    async def _execute_command(self, command: str, params: dict) -> dict:
        """
        Execute a browser command via the computer interface.

        Args:
            command: Command name
            params: Command parameters

        Returns:
            Response dictionary
        """
        try:
            result = await self.interface.playwright_exec(command, params)
            if not result.get("success"):
                self.logger.error(
                    f"Browser command '{command}' failed: {result.get('error', 'Unknown error')}"
                )
            return result
        except Exception as e:
            self.logger.error(f"Error executing browser command '{command}': {e}")
            return {"success": False, "error": str(e)}

    async def visit_url(self, url: str) -> dict:
        """
        Navigate to a URL.

        Args:
            url: URL to visit

        Returns:
            Response dictionary with success status and current URL
        """
        return await self._execute_command("visit_url", {"url": url})

    async def click(self, x: int, y: int) -> dict:
        """
        Click at coordinates.

        Args:
            x: X coordinate
            y: Y coordinate

        Returns:
            Response dictionary with success status
        """
        return await self._execute_command("click", {"x": x, "y": y})

    async def type(self, text: str) -> dict:
        """
        Type text into the focused element.

        Args:
            text: Text to type

        Returns:
            Response dictionary with success status
        """
        return await self._execute_command("type", {"text": text})

    async def scroll(self, delta_x: int, delta_y: int) -> dict:
        """
        Scroll the page.

        Args:
            delta_x: Horizontal scroll delta
            delta_y: Vertical scroll delta

        Returns:
            Response dictionary with success status
        """
        return await self._execute_command("scroll", {"delta_x": delta_x, "delta_y": delta_y})

    async def web_search(self, query: str) -> dict:
        """
        Navigate to a Google search for the query.

        Args:
            query: Search query

        Returns:
            Response dictionary with success status and current URL
        """
        return await self._execute_command("web_search", {"query": query})

    async def screenshot(self) -> bytes:
        """
        Take a screenshot of the current browser page.

        Returns:
            Screenshot image data as bytes (PNG format)
        """
        import base64

        result = await self._execute_command("screenshot", {})
        if result.get("success") and result.get("screenshot"):
            # Decode base64 screenshot to bytes
            screenshot_b64 = result["screenshot"]
            screenshot_bytes = base64.b64decode(screenshot_b64)
            return screenshot_bytes
        else:
            error = result.get("error", "Unknown error")
            raise RuntimeError(f"Failed to take screenshot: {error}")

361
libs/python/computer-server/computer_server/browser.py
Normal file
@@ -0,0 +1,361 @@
"""
Browser manager using Playwright for programmatic browser control.
This allows agents to control a browser that runs visibly on the XFCE desktop.
"""

import asyncio
import logging
import os
import urllib.parse
from typing import Any, Dict, Optional

try:
    from playwright.async_api import Browser, BrowserContext, Page, async_playwright
except ImportError:
    async_playwright = None
    Browser = None
    BrowserContext = None
    Page = None

logger = logging.getLogger(__name__)


class BrowserManager:
    """
    Manages a Playwright browser instance that runs visibly on the XFCE desktop.
    Uses persistent context to maintain cookies and sessions.
    """

    def __init__(self):
        """Initialize the BrowserManager."""
        self.playwright = None
        self.browser: Optional[Browser] = None
        self.context: Optional[BrowserContext] = None
        self.page: Optional[Page] = None
        self._initialized = False
        self._initialization_error: Optional[str] = None
        self._lock = asyncio.Lock()

    async def _ensure_initialized(self):
        """Ensure the browser is initialized."""
        # Check if browser was closed and needs reinitialization
        if self._initialized:
            try:
                # Check if context is still valid by trying to access it
                if self.context:
                    # Try to get pages - this will raise if context is closed
                    _ = self.context.pages
                    # If we get here, context is still alive
                    return
                else:
                    # Context was closed, need to reinitialize
                    self._initialized = False
                    logger.warning("Browser context was closed, will reinitialize...")
            except Exception as e:
                # Context is dead, need to reinitialize
                logger.warning(f"Browser context is dead ({e}), will reinitialize...")
                self._initialized = False
                self.context = None
                self.page = None
                # Clean up playwright if it exists
                if self.playwright:
                    try:
                        await self.playwright.stop()
                    except Exception:
                        pass
                    self.playwright = None

        async with self._lock:
            # Double-check after acquiring lock (another task might have initialized it)
            if self._initialized:
                try:
                    if self.context:
                        _ = self.context.pages
                        return
                except Exception:
                    self._initialized = False
                    self.context = None
                    self.page = None
                    if self.playwright:
                        try:
                            await self.playwright.stop()
                        except Exception:
                            pass
                        self.playwright = None

            if async_playwright is None:
                raise RuntimeError(
                    "playwright is not installed. Please install it with: pip install playwright && playwright install --with-deps firefox"
                )

            try:
                # Get display from environment or default to :1
                display = os.environ.get("DISPLAY", ":1")
                logger.info(f"Initializing browser with DISPLAY={display}")

                # Start playwright
                self.playwright = await async_playwright().start()

                # Launch Firefox with persistent context (keeps cookies/sessions)
                # headless=False is CRITICAL so the visual agent can see it
                user_data_dir = os.path.join(os.path.expanduser("~"), ".playwright-firefox")
                os.makedirs(user_data_dir, exist_ok=True)

                # launch_persistent_context returns a BrowserContext, not a Browser
                # Note: Removed --kiosk mode so the desktop remains visible
                self.context = await self.playwright.firefox.launch_persistent_context(
                    user_data_dir=user_data_dir,
                    headless=False,  # CRITICAL: visible for visual agent
                    viewport={"width": 1024, "height": 768},
                    # Removed --kiosk to allow desktop visibility
                )

                # Add init script to make the browser less detectable
                await self.context.add_init_script(
                    """const defaultGetter = Object.getOwnPropertyDescriptor(
                      Navigator.prototype,
                      "webdriver"
                    ).get;
                    defaultGetter.apply(navigator);
                    defaultGetter.toString();
                    Object.defineProperty(Navigator.prototype, "webdriver", {
                      set: undefined,
                      enumerable: true,
                      configurable: true,
                      get: new Proxy(defaultGetter, {
                        apply: (target, thisArg, args) => {
                          Reflect.apply(target, thisArg, args);
                          return false;
                        },
                      }),
                    });
                    const patchedGetter = Object.getOwnPropertyDescriptor(
                      Navigator.prototype,
                      "webdriver"
                    ).get;
                    patchedGetter.apply(navigator);
                    patchedGetter.toString();"""
                )

                # Get the first page or create one
                pages = self.context.pages
                if pages:
                    self.page = pages[0]
                else:
                    self.page = await self.context.new_page()

                self._initialized = True
                logger.info("Browser initialized successfully")

            except Exception as e:
                logger.error(f"Failed to initialize browser: {e}")
                import traceback

                logger.error(traceback.format_exc())
                # Record the error so execute_command can report it, then re-raise
                self._initialization_error = str(e)
                raise

    async def _execute_command_impl(self, cmd: str, params: Dict[str, Any]) -> Dict[str, Any]:
        """Internal implementation of command execution."""
        if cmd == "visit_url":
            url = params.get("url")
            if not url:
                return {"success": False, "error": "url parameter is required"}
            await self.page.goto(url, wait_until="domcontentloaded", timeout=30000)
            return {"success": True, "url": self.page.url}

        elif cmd == "click":
            x = params.get("x")
            y = params.get("y")
            if x is None or y is None:
                return {"success": False, "error": "x and y parameters are required"}
            await self.page.mouse.click(x, y)
            return {"success": True}

        elif cmd == "type":
            text = params.get("text")
            if text is None:
                return {"success": False, "error": "text parameter is required"}
            await self.page.keyboard.type(text)
            return {"success": True}

        elif cmd == "scroll":
            delta_x = params.get("delta_x", 0)
            delta_y = params.get("delta_y", 0)
            await self.page.mouse.wheel(delta_x, delta_y)
            return {"success": True}

        elif cmd == "web_search":
            query = params.get("query")
            if not query:
                return {"success": False, "error": "query parameter is required"}
            # Navigate to a Google search (quote the query so spaces and
            # special characters survive URL encoding)
            search_url = f"https://www.google.com/search?q={urllib.parse.quote_plus(query)}"
            await self.page.goto(search_url, wait_until="domcontentloaded", timeout=30000)
            return {"success": True, "url": self.page.url}

        elif cmd == "screenshot":
            # Take a screenshot and return as base64
            import base64

            screenshot_bytes = await self.page.screenshot(type="png")
            screenshot_b64 = base64.b64encode(screenshot_bytes).decode("utf-8")
            return {"success": True, "screenshot": screenshot_b64}

        else:
            return {"success": False, "error": f"Unknown command: {cmd}"}

    async def execute_command(self, cmd: str, params: Dict[str, Any]) -> Dict[str, Any]:
        """
        Execute a browser command with automatic recovery.

        Args:
            cmd: Command name (visit_url, click, type, scroll, web_search, screenshot)
            params: Command parameters

        Returns:
            Result dictionary with success status and any data
        """
        max_retries = 2
        for attempt in range(max_retries):
            try:
                await self._ensure_initialized()
            except Exception as e:
                error_msg = getattr(self, "_initialization_error", None) or str(e)
                logger.error(f"Browser initialization failed: {error_msg}")
                return {
                    "success": False,
                    "error": f"Browser initialization failed: {error_msg}. "
                    f"Make sure Playwright and Firefox are installed, and DISPLAY is set correctly.",
                }

            # Check if page is still valid and get a new one if needed
            page_valid = False
            try:
                if self.page is not None and not self.page.is_closed():
                    # Try to access page.url to check if it's still valid
                    _ = self.page.url
                    page_valid = True
            except Exception as e:
                logger.warning(f"Page is invalid: {e}, will get a new page...")
                self.page = None

            # Get a valid page if we don't have one
            if not page_valid or self.page is None:
                try:
                    if self.context:
                        pages = self.context.pages
                        if pages:
                            # Find first non-closed page
                            for p in pages:
                                try:
                                    if not p.is_closed():
                                        self.page = p
                                        logger.info("Reusing existing open page")
                                        page_valid = True
                                        break
                                except Exception:
                                    continue

                        # If no valid page found, create a new one
                        if not page_valid:
                            self.page = await self.context.new_page()
                            logger.info("Created new page")
                except Exception as e:
                    logger.error(f"Failed to get new page: {e}, browser may be closed")
                    # Browser was closed - force reinitialization
                    self._initialized = False
                    self.context = None
                    self.page = None
                    if self.playwright:
                        try:
                            await self.playwright.stop()
                        except Exception:
                            pass
                        self.playwright = None

                    # If this isn't the last attempt, continue to retry
                    if attempt < max_retries - 1:
                        logger.info("Browser was closed, retrying with fresh initialization...")
                        continue
                    else:
                        return {
                            "success": False,
                            "error": f"Browser was closed and cannot be recovered: {e}",
                        }

            # Try to execute the command
            try:
                return await self._execute_command_impl(cmd, params)
            except Exception as e:
                error_str = str(e)
                logger.error(f"Error executing command {cmd}: {e}")

                # Check if this is a "browser/page/context closed" error
                if any(keyword in error_str.lower() for keyword in ["closed", "target", "context"]):
                    logger.warning(
                        f"Browser/page was closed during command execution (attempt {attempt + 1}/{max_retries})"
                    )

                    # Force reinitialization
                    self._initialized = False
                    self.context = None
                    self.page = None
                    if self.playwright:
                        try:
                            await self.playwright.stop()
                        except Exception:
                            pass
                        self.playwright = None

                    # If this isn't the last attempt, retry
                    if attempt < max_retries - 1:
                        logger.info("Retrying command after browser reinitialization...")
                        continue
                    else:
                        return {
                            "success": False,
                            "error": f"Command failed after {max_retries} attempts: {error_str}",
                        }
                else:
                    # Not a browser closed error, return immediately
                    import traceback

                    logger.error(traceback.format_exc())
                    return {"success": False, "error": error_str}

        # Should never reach here, but just in case
        return {"success": False, "error": "Command failed after all retries"}

    async def close(self):
        """Close the browser and cleanup resources."""
        async with self._lock:
            try:
                if self.context:
                    await self.context.close()
                    self.context = None
                if self.browser:
                    await self.browser.close()
                    self.browser = None

                if self.playwright:
                    await self.playwright.stop()
                    self.playwright = None

                self.page = None
                self._initialized = False
                logger.info("Browser closed successfully")
            except Exception as e:
                logger.error(f"Error closing browser: {e}")


# Global instance
_browser_manager: Optional[BrowserManager] = None


def get_browser_manager() -> BrowserManager:
    """Get or create the global BrowserManager instance."""
    global _browser_manager
    if _browser_manager is None:
        _browser_manager = BrowserManager()
    return _browser_manager

@@ -25,6 +25,7 @@ from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse, StreamingResponse
|
||||
|
||||
from .handlers.factory import HandlerFactory
|
||||
from .browser import get_browser_manager
|
||||
|
||||
# Authentication session TTL (in seconds). Override via env var CUA_AUTH_TTL_SECONDS. Default: 60s
|
||||
AUTH_SESSION_TTL_SECONDS: int = int(os.environ.get("CUA_AUTH_TTL_SECONDS", "60"))
|
||||
@@ -749,5 +750,71 @@ async def agent_response_endpoint(
|
||||
return JSONResponse(content=payload, headers=headers)
|
||||
|
||||
|
||||
@app.post("/playwright_exec")
|
||||
async def playwright_exec_endpoint(
|
||||
request: Request,
|
||||
container_name: Optional[str] = Header(None, alias="X-Container-Name"),
|
||||
api_key: Optional[str] = Header(None, alias="X-API-Key"),
|
||||
):
|
||||
"""
|
||||
Execute Playwright browser commands.
|
||||
|
||||
Headers:
|
||||
- X-Container-Name: Container name for cloud authentication
|
||||
- X-API-Key: API key for cloud authentication
|
||||
|
||||
Body:
|
        {
            "command": "visit_url|click|type|scroll|web_search",
            "params": {...}
        }
    """
    # Parse request body
    try:
        body = await request.json()
        command = body.get("command")
        params = body.get("params", {})
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON body: {str(e)}")

    if not command:
        raise HTTPException(status_code=400, detail="Command is required")

    # Check if CONTAINER_NAME is set (indicating cloud provider)
    server_container_name = os.environ.get("CONTAINER_NAME")

    # If cloud provider, perform authentication
    if server_container_name:
        logger.info(
            f"Cloud provider detected. CONTAINER_NAME: {server_container_name}. Performing authentication..."
        )

        # Validate required headers
        if not container_name:
            raise HTTPException(status_code=401, detail="Container name required")

        if not api_key:
            raise HTTPException(status_code=401, detail="API key required")

        # Validate with AuthenticationManager
        is_authenticated = await auth_manager.auth(container_name, api_key)
        if not is_authenticated:
            raise HTTPException(status_code=401, detail="Authentication failed")

    # Get browser manager and execute command
    try:
        browser_manager = get_browser_manager()
        result = await browser_manager.execute_command(command, params)

        if result.get("success"):
            return JSONResponse(content=result)
        else:
            raise HTTPException(status_code=400, detail=result.get("error", "Command failed"))
    except Exception as e:
        logger.error(f"Error executing playwright command: {str(e)}")
        logger.error(traceback.format_exc())
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

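For reference, the request shape this handler parses can be sketched in isolation. This is a minimal sketch, assuming only what the handler checks (a JSON body with `command` and optional `params`, plus the `X-API-Key` / `X-Container-Name` headers used for cloud authentication); `build_playwright_request` is a hypothetical helper name, not part of the server:

```python
import json


def build_playwright_request(command, params=None, api_key=None, container_name=None):
    """Build the JSON body and headers /playwright_exec expects (hypothetical helper)."""
    if not command:
        # The server answers 400 "Command is required" in this case.
        raise ValueError("Command is required")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only validated when CONTAINER_NAME is set server-side (cloud provider).
        headers["X-API-Key"] = api_key
    if container_name:
        headers["X-Container-Name"] = container_name
    body = json.dumps({"command": command, "params": params or {}})
    return body, headers


body, headers = build_playwright_request("visit_url", {"url": "https://example.com"})
print(body)  # → {"command": "visit_url", "params": {"url": "https://example.com"}}
```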
@@ -24,6 +24,7 @@ dependencies = [
    "pyperclip>=1.9.0",
    "websockets>=12.0",
    "pywinctl>=0.4.1",
    "playwright>=1.40.0",
    # OS-specific runtime deps
    "pyobjc-framework-Cocoa>=10.1; sys_platform == 'darwin'",
    "pyobjc-framework-Quartz>=10.1; sys_platform == 'darwin'",

@@ -953,6 +953,35 @@ class Computer:
        """
        return await self.interface.to_screenshot_coordinates(x, y)

    async def playwright_exec(self, command: str, params: Optional[Dict] = None) -> Dict[str, Any]:
        """
        Execute a Playwright browser command.

        Args:
            command: The browser command to execute (visit_url, click, type, scroll, web_search)
            params: Command parameters

        Returns:
            Dict containing the command result

        Examples:
            # Navigate to a URL
            await computer.playwright_exec("visit_url", {"url": "https://example.com"})

            # Click at coordinates
            await computer.playwright_exec("click", {"x": 100, "y": 200})

            # Type text
            await computer.playwright_exec("type", {"text": "Hello, world!"})

            # Scroll
            await computer.playwright_exec("scroll", {"delta_x": 0, "delta_y": -100})

            # Web search
            await computer.playwright_exec("web_search", {"query": "computer use agent"})
        """
        return await self.interface.playwright_exec(command, params)

    # Add virtual environment management functions to computer interface
    async def venv_install(self, venv_name: str, requirements: list[str]):
        """Install packages in a virtual environment.

@@ -661,6 +661,56 @@ class GenericComputerInterface(BaseComputerInterface):

        return screenshot_x, screenshot_y

    # Playwright browser control
    async def playwright_exec(self, command: str, params: Optional[Dict] = None) -> Dict[str, Any]:
        """
        Execute a Playwright browser command.

        Args:
            command: The browser command to execute (visit_url, click, type, scroll, web_search)
            params: Command parameters

        Returns:
            Dict containing the command result

        Examples:
            # Navigate to a URL
            await interface.playwright_exec("visit_url", {"url": "https://example.com"})

            # Click at coordinates
            await interface.playwright_exec("click", {"x": 100, "y": 200})

            # Type text
            await interface.playwright_exec("type", {"text": "Hello, world!"})

            # Scroll
            await interface.playwright_exec("scroll", {"delta_x": 0, "delta_y": -100})

            # Web search
            await interface.playwright_exec("web_search", {"query": "computer use agent"})
        """
        protocol = "https" if self.api_key else "http"
        port = "8443" if self.api_key else "8000"
        url = f"{protocol}://{self.ip_address}:{port}/playwright_exec"

        payload = {"command": command, "params": params or {}}
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["X-API-Key"] = self.api_key
        if self.vm_name:
            headers["X-Container-Name"] = self.vm_name

        try:
            async with aiohttp.ClientSession() as session:
                async with session.post(url, json=payload, headers=headers) as response:
                    if response.status == 200:
                        return await response.json()
                    else:
                        error_text = await response.text()
                        return {"success": False, "error": error_text}
        except Exception as e:
            return {"success": False, "error": str(e)}

    # Websocket Methods
    async def _keep_alive(self):
        """Keep the WebSocket connection alive with automatic reconnection."""

@@ -45,7 +45,9 @@ class CloudProvider(BaseVMProvider):
         # Fall back to environment variable if api_key not provided
         if api_key is None:
             api_key = os.getenv("CUA_API_KEY")
-        assert api_key, "api_key required for CloudProvider (provide via parameter or CUA_API_KEY environment variable)"
+        assert (
+            api_key
+        ), "api_key required for CloudProvider (provide via parameter or CUA_API_KEY environment variable)"
         self.api_key = api_key
         self.verbose = verbose
         self.api_base = (api_base or DEFAULT_API_BASE).rstrip("/")

@@ -14,7 +14,7 @@ export async function runCli() {
    ' env Export API key to .env file\n' +
    ' logout Clear stored credentials\n' +
    '\n' +
    ' cua sb <command> Create and manage cloud sandboxes\n' +
    ' list View all your sandboxes\n' +
    ' create Provision a new sandbox\n' +
    ' get Get detailed info about a sandbox\n' +

@@ -29,7 +29,7 @@ async function fetchSandboxDetails(

  const sandboxes = (await listRes.json()) as SandboxItem[];
  const sandbox = sandboxes.find((s) => s.name === name);

  if (!sandbox) {
    console.error('Sandbox not found');
    process.exit(1);
@@ -53,24 +53,32 @@ async function fetchSandboxDetails(
  }

  // Probe computer-server if requested and sandbox is running
-  if (options.probeComputerServer && sandbox.status === 'running' && sandbox.host) {
+  if (
+    options.probeComputerServer &&
+    sandbox.status === 'running' &&
+    sandbox.host
+  ) {
    let statusProbeSuccess = false;
    let versionProbeSuccess = false;

    try {
      // Probe OS type
      const statusUrl = `https://${sandbox.host}:8443/status`;
      const statusController = new AbortController();
      const statusTimeout = setTimeout(() => statusController.abort(), 3000);

      try {
        const statusRes = await fetch(statusUrl, {
          signal: statusController.signal,
        });
        clearTimeout(statusTimeout);

        if (statusRes.ok) {
-          const statusData = await statusRes.json() as { status: string; os_type: string; features?: string[] };
+          const statusData = (await statusRes.json()) as {
+            status: string;
+            os_type: string;
+            features?: string[];
+          };
          result.os_type = statusData.os_type;
          statusProbeSuccess = true;
        }
@@ -82,7 +90,7 @@ async function fetchSandboxDetails(
      const versionUrl = `https://${sandbox.host}:8443/cmd`;
      const versionController = new AbortController();
      const versionTimeout = setTimeout(() => versionController.abort(), 3000);

      try {
        const versionRes = await fetch(versionUrl, {
          method: 'POST',
@@ -98,12 +106,16 @@ async function fetchSandboxDetails(
          signal: versionController.signal,
        });
        clearTimeout(versionTimeout);

        if (versionRes.ok) {
          const versionDataRaw = await versionRes.text();
          if (versionDataRaw.startsWith('data: ')) {
            const jsonStr = versionDataRaw.slice(6);
-            const versionData = JSON.parse(jsonStr) as { success: boolean; protocol: number; package: string };
+            const versionData = JSON.parse(jsonStr) as {
+              success: boolean;
+              protocol: number;
+              package: string;
+            };
            if (versionData.package) {
              result.computer_server_version = versionData.package;
              versionProbeSuccess = true;
@@ -116,7 +128,7 @@ async function fetchSandboxDetails(
    } catch (err) {
      // General error - skip probing
    }

    // Set computer server status based on probe results
    if (statusProbeSuccess && versionProbeSuccess) {
      result.computer_server_status = 'healthy';
@@ -394,23 +406,25 @@ const getHandler = async (argv: Record<string, unknown>) => {
  console.log(`Name: ${details.name}`);
  console.log(`Status: ${details.status}`);
  console.log(`Host: ${details.host}`);

  if (showPasswords) {
    console.log(`Password: ${details.password}`);
  }

  if (details.os_type) {
    console.log(`OS Type: ${details.os_type}`);
  }

  if (details.computer_server_version) {
-    console.log(`Computer Server Version: ${details.computer_server_version}`);
+    console.log(
+      `Computer Server Version: ${details.computer_server_version}`
+    );
  }

  if (details.computer_server_status) {
    console.log(`Computer Server Status: ${details.computer_server_status}`);
  }

  if (showVncUrl) {
    console.log(`VNC URL: ${details.vnc_url}`);
  }

28	libs/xfce/Development.md	Normal file
@@ -0,0 +1,28 @@
|
||||
# Development
|
||||
|
||||
## Building the Development Docker Image
|
||||
|
||||
To build the XFCE container with local computer-server changes:
|
||||
|
||||
```bash
|
||||
cd libs/xfce
|
||||
docker build -f Dockerfile.dev -t cua-xfce:dev ..
|
||||
```
|
||||
|
||||
The build context is set to the parent directory to allow copying the local `computer-server` source.
|
||||
|
||||
## Tagging the Image
|
||||
|
||||
To tag the dev image as latest:
|
||||
|
||||
```bash
|
||||
docker tag cua-xfce:dev cua-xfce:latest
|
||||
```
|
||||
|
||||
## Running the Development Container
|
||||
|
||||
```bash
|
||||
docker run -p 6901:6901 -p 8000:8000 cua-xfce:dev
|
||||
```
|
||||
|
||||
Access noVNC at: http://localhost:6901
|
||||
@@ -107,6 +107,10 @@ RUN mkdir -p /home/cua/.cache && \
# Install computer-server using Python 3.12 pip
RUN python3.12 -m pip install cua-computer-server

# Install playwright and Firefox dependencies
RUN python3.12 -m pip install playwright && \
    python3.12 -m playwright install --with-deps firefox

# Fix any cache files created by pip
RUN chown -R cua:cua /home/cua/.cache

159	libs/xfce/Dockerfile.dev	Normal file
@@ -0,0 +1,159 @@
|
||||
# CUA Docker XFCE Container - Development Version
|
||||
# Vanilla XFCE desktop with noVNC and computer-server (from local source)
|
||||
|
||||
FROM ubuntu:22.04
|
||||
|
||||
# Avoid prompts from apt
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
# Set environment variables
|
||||
ENV HOME=/home/cua
|
||||
ENV DISPLAY=:1
|
||||
ENV VNC_PORT=5901
|
||||
ENV NOVNC_PORT=6901
|
||||
ENV API_PORT=8000
|
||||
ENV VNC_RESOLUTION=1024x768
|
||||
ENV VNC_COL_DEPTH=24
|
||||
|
||||
# Install system dependencies first (including sudo)
|
||||
RUN apt-get update && apt-get install -y \
|
||||
# System utilities
|
||||
sudo \
|
||||
unzip \
|
||||
zip \
|
||||
xdg-utils \
|
||||
# Desktop environment
|
||||
xfce4 \
|
||||
xfce4-terminal \
|
||||
dbus-x11 \
|
||||
# VNC server
|
||||
tigervnc-standalone-server \
|
||||
tigervnc-common \
|
||||
# noVNC dependencies
|
||||
# python will be installed via deadsnakes as 3.12 \
|
||||
git \
|
||||
net-tools \
|
||||
netcat \
|
||||
supervisor \
|
||||
# Computer-server dependencies
|
||||
# python-tk/dev for 3.12 will be installed later \
|
||||
gnome-screenshot \
|
||||
wmctrl \
|
||||
ffmpeg \
|
||||
socat \
|
||||
xclip \
|
||||
# Browser
|
||||
wget \
|
||||
software-properties-common \
|
||||
# Build tools
|
||||
build-essential \
|
||||
libncursesw5-dev \
|
||||
libssl-dev \
|
||||
libsqlite3-dev \
|
||||
tk-dev \
|
||||
libgdbm-dev \
|
||||
libc6-dev \
|
||||
libbz2-dev \
|
||||
libffi-dev \
|
||||
zlib1g-dev \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Install Python 3.12 from deadsnakes (keep system python3 for apt)
|
||||
RUN add-apt-repository -y ppa:deadsnakes/ppa && \
|
||||
apt-get update && apt-get install -y \
|
||||
python3.12 python3.12-venv python3.12-dev python3.12-tk && \
|
||||
python3.12 -m ensurepip --upgrade && \
|
||||
python3.12 -m pip install --upgrade pip setuptools wheel && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Ensure 'python' points to Python 3.12
|
||||
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 2
|
||||
|
||||
# Remove screensavers and power manager to avoid popups and lock screens
|
||||
RUN apt-get remove -y \
|
||||
xfce4-power-manager \
|
||||
xfce4-power-manager-data \
|
||||
xfce4-power-manager-plugins \
|
||||
xfce4-screensaver \
|
||||
light-locker \
|
||||
xscreensaver \
|
||||
xscreensaver-data || true
|
||||
|
||||
# Create user after sudo is installed
|
||||
RUN useradd -m -s /bin/bash -G sudo cua && \
|
||||
echo "cua:cua" | chpasswd && \
|
||||
echo "cua ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
|
||||
|
||||
# Install Firefox from Mozilla PPA (snap-free) - inline to avoid script issues
|
||||
RUN apt-get update && \
|
||||
add-apt-repository -y ppa:mozillateam/ppa && \
|
||||
echo 'Package: *\nPin: release o=LP-PPA-mozillateam\nPin-Priority: 1001' > /etc/apt/preferences.d/mozilla-firefox && \
|
||||
apt-get update && \
|
||||
apt-get install -y firefox && \
|
||||
echo 'pref("datareporting.policy.firstRunURL", "");\npref("datareporting.policy.dataSubmissionEnabled", false);\npref("datareporting.healthreport.service.enabled", false);\npref("datareporting.healthreport.uploadEnabled", false);\npref("trailhead.firstrun.branches", "nofirstrun-empty");\npref("browser.aboutwelcome.enabled", false);' > /usr/lib/firefox/browser/defaults/preferences/firefox.js && \
|
||||
update-alternatives --install /usr/bin/x-www-browser x-www-browser /usr/bin/firefox 100 && \
|
||||
update-alternatives --install /usr/bin/gnome-www-browser gnome-www-browser /usr/bin/firefox 100 && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Install noVNC
|
||||
RUN git clone https://github.com/novnc/noVNC.git /opt/noVNC && \
|
||||
git clone https://github.com/novnc/websockify /opt/noVNC/utils/websockify && \
|
||||
ln -s /opt/noVNC/vnc.html /opt/noVNC/index.html
|
||||
|
||||
# Pre-create cache directory with correct ownership before pip install
|
||||
RUN mkdir -p /home/cua/.cache && \
|
||||
chown -R cua:cua /home/cua/.cache
|
||||
|
||||
# Copy local computer-server source and install it
|
||||
COPY python/computer-server /tmp/computer-server
|
||||
RUN python3.12 -m pip install /tmp/computer-server && \
|
||||
rm -rf /tmp/computer-server
|
||||
|
||||
# Install playwright and Firefox dependencies
|
||||
RUN python3.12 -m pip install playwright && \
|
||||
python3.12 -m playwright install --with-deps firefox
|
||||
|
||||
# Fix any cache files created by pip
|
||||
RUN chown -R cua:cua /home/cua/.cache
|
||||
|
||||
# Copy startup scripts
|
||||
COPY xfce/src/supervisor/ /etc/supervisor/conf.d/
|
||||
COPY xfce/src/scripts/ /usr/local/bin/
|
||||
|
||||
# Make scripts executable
|
||||
RUN chmod +x /usr/local/bin/*.sh
|
||||
|
||||
# Setup VNC
|
||||
RUN chown -R cua:cua /home/cua
|
||||
USER cua
|
||||
WORKDIR /home/cua
|
||||
|
||||
# Create VNC directory (no password needed with SecurityTypes None)
|
||||
RUN mkdir -p $HOME/.vnc
|
||||
|
||||
# Configure XFCE for first start
|
||||
RUN mkdir -p $HOME/.config/xfce4/xfconf/xfce-perchannel-xml $HOME/.config/xfce4 $HOME/.config/autostart
|
||||
|
||||
# Copy XFCE config to disable browser launching and welcome screens
|
||||
COPY --chown=cua:cua xfce/src/xfce-config/helpers.rc $HOME/.config/xfce4/helpers.rc
|
||||
COPY --chown=cua:cua xfce/src/xfce-config/xfce4-session.xml $HOME/.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-session.xml
|
||||
COPY --chown=cua:cua xfce/src/xfce-config/xfce4-power-manager.xml $HOME/.config/xfce4/xfconf/xfce-perchannel-xml/xfce4-power-manager.xml
|
||||
|
||||
# Disable autostart for screensaver, lock screen, and power manager
|
||||
RUN echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-tips-autostart.desktop && \
|
||||
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-screensaver.desktop && \
|
||||
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/light-locker.desktop && \
|
||||
echo "[Desktop Entry]\nHidden=true" > $HOME/.config/autostart/xfce4-power-manager.desktop && \
|
||||
chown -R cua:cua $HOME/.config
|
||||
|
||||
# Create storage and shared directories, and Firefox cache directory
|
||||
RUN mkdir -p $HOME/storage $HOME/shared $HOME/.cache/dconf $HOME/.mozilla/firefox && \
|
||||
chown -R cua:cua $HOME/storage $HOME/shared $HOME/.cache $HOME/.mozilla $HOME/.vnc
|
||||
|
||||
USER root
|
||||
|
||||
# Expose ports
|
||||
EXPOSE $VNC_PORT $NOVNC_PORT $API_PORT
|
||||
|
||||
# Start services via supervisor
|
||||
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/supervisord.conf"]
|
||||
@@ -10,4 +10,4 @@ echo "X server is ready"

 # Start computer-server
 export DISPLAY=:1
-python3 -m computer_server --port ${API_PORT:-8000}
+python -m computer_server --port ${API_PORT:-8000}