CUA - Computer Use Agent
Open-source framework for AI agents that autonomously interact with computers through visual understanding
Agents run in sandboxed macOS, Linux, and Windows VMs: the AI perceives the screen, reasons about the interface, and executes actions. Supports Claude, GPT-4o, Gemini, UI-TARS, and many other models.
Architecture
How CUA works under the hood
┌─────────────────────────────────────────────────┐
│ Agent SDK (agent package)                       │
├─────────────────────────────────────────────────┤
│ ComputerAgent                                   │
│  ├─ Agent Loops (anthropic, openai, gemini)     │
│  ├─ Callbacks (trajectory, budget, logging)     │
│  ├─ Adapters (HuggingFace, MLX, CUA cloud)      │
│  └─ Tools (computers, functions)                │
├─────────────────────────────────────────────────┤
│ Computer SDK (computer package)                 │
├─────────────────────────────────────────────────┤
│ Computer Class                                  │
│  ├─ VM Providers (lume, docker, cloud)          │
│  ├─ Interfaces (macOS, linux, windows)          │
│  └─ Actions (click, type, scroll, screenshot)   │
├─────────────────────────────────────────────────┤
│ Infrastructure (Lume, Core, SOM)                │
├─────────────────────────────────────────────────┤
│ Lume (macOS VM manager - Swift)                 │
│ Computer Server (runs inside VM)                │
│ Set-of-Mark grounding                           │
└─────────────────────────────────────────────────┘
Agent Layer
ComputerAgent orchestrates the reasoning loop: it takes a screenshot, sends it to the LLM, parses the returned action, and executes it on the computer.
Computer Layer
The Computer class manages the VM lifecycle and connects via WebSocket to the computer-server running inside the VM.
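A minimal lifecycle sketch using only calls documented later in this page (start the VM, grab one screenshot, shut down):

import asyncio
from computer import Computer

async def main():
    computer = Computer(os_type="macos", provider_type="lume")
    try:
        await computer.run()                         # boot the VM, connect WebSocket
        png = await computer.interface.screenshot()  # raw screenshot bytes
        print(f"screenshot: {len(png)} bytes")
    finally:
        await computer.stop()                        # shut down / clean up

asyncio.run(main())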
Infrastructure Layer
Lume provides near-native macOS VMs on Apple Silicon; Docker and CUA Cloud cover Linux; Windows Sandbox covers Windows.
Running on macOS
Complete setup guide for Apple Silicon
Requirements
- Apple Silicon Mac (M1, M2, M3, M4)
- macOS 13.0 (Ventura) or later
- Python 3.12 or 3.13 (NOT 3.14 - dependency issues)
- 16GB RAM recommended (8GB minimum)
- 50GB free disk space for VM images
Step 1: Install Lume (macOS VM Manager)
Lume uses Apple's Virtualization.framework for near-native performance. It manages macOS VM lifecycle and provides an HTTP API.
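Once the daemon is installed and running (see below), that HTTP API can be queried directly. The default port 7777 appears in the Computer Configuration section; the exact endpoint path here is an assumption, so verify it against the Lume docs:

# List VMs via the Lume daemon's HTTP API
# (port 7777 is the default; the /lume/vms path is an assumption)
curl http://localhost:7777/lume/vms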
# Install Lume CLI and daemon
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
# Verify installation
lume --version
# Test by listing available images
lume images
Step 2: Clone Repository & Setup Python
# Clone the repository
git clone https://github.com/trycua/cua.git
cd cua
# Install uv (fast Python package manager)
pip install uv
# Sync all Python dependencies (uses pyproject.toml workspace)
uv sync
# Set up pre-commit hooks (optional but recommended)
uv run pre-commit install
Step 3: Configure API Keys
Create a .env.local file in the root directory with your API keys:
# .env.local
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
# Optional: For CUA cloud provider
CUA_API_KEY=cua_your-key-here
# Optional: Disable telemetry
CUA_TELEMETRY_DISABLED=1
Step 4: Pull a macOS VM Image
# Pull the CUA-optimized macOS Sequoia image (has computer-server pre-installed)
lume pull macos-sequoia-cua:latest
# Or for a vanilla macOS install
lume pull macos-sequoia-vanilla:latest
# List pulled images
lume images
Step 5: Run Your First Agent
# Run the example script
uv run python examples/agent_ui_examples.py
# Or create your own script
cat << 'EOF' > my_agent.py
import asyncio
from agent import ComputerAgent
from computer import Computer

async def main():
    # Create a macOS VM
    computer = Computer(
        os_type="macos",
        display="1024x768",
        memory="8GB",
        cpu="4",
        provider_type="lume"
    )
    try:
        await computer.run()  # Start VM

        agent = ComputerAgent(
            model="anthropic/claude-sonnet-4-5-20250929",
            tools=[computer],
            verbosity=20  # logging.INFO
        )

        async for result in agent.run([{
            "role": "user",
            "content": "Open Safari and search for 'hello world'"
        }]):
            print(result)
    finally:
        await computer.stop()

asyncio.run(main())
EOF
uv run python my_agent.py
Agent Execution Flow
How actions are performed step by step
User sends instruction
e.g., 'Open Safari and search for Python tutorials'
Agent takes screenshot
Captures current VM screen state
Screenshot + instruction sent to LLM
Claude/GPT-4o/etc. receives image and reasons about what to do
LLM returns action
e.g., {type: 'left_click', x: 100, y: 50} or {type: 'type_text', text: 'hello'}
Action executed on VM
Computer interface sends command via WebSocket to computer-server in VM
Loop continues
Take new screenshot, send to LLM, repeat until task complete or budget exhausted
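The same flow as an illustrative sketch. Here ask_llm is a hypothetical stand-in for the provider call, not a CUA API; ComputerAgent.run implements all of this internally, with retries, callbacks, and budget tracking:

import asyncio
from computer import Computer

async def ask_llm(messages, screenshot):
    # Hypothetical stand-in for the provider API call: a real loop would
    # send the screenshot + history and parse an action dict back.
    return None  # e.g. {"type": "left_click", "x": 100, "y": 50}

async def agent_loop(instruction: str, computer: Computer, max_steps: int = 20):
    messages = [{"role": "user", "content": instruction}]   # 1. instruction
    for _ in range(max_steps):                              # budget guard
        shot = await computer.interface.screenshot()        # 2. perceive
        action = await ask_llm(messages, shot)              # 3-4. reason
        if action is None:                                  # task complete
            return
        if action["type"] == "left_click":                  # 5. act
            await computer.interface.left_click(action["x"], action["y"])
        elif action["type"] == "type_text":
            await computer.interface.type_text(action["text"])
        messages.append({"role": "assistant", "content": str(action)})  # 6. loop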
Callback Lifecycle
on_run_start() # Agent run begins
│
├─► on_llm_start() # Before API call (preprocess messages)
│ └─► on_api_start() # API request begins
│ └─► on_api_end() # API response received
│ └─► on_llm_end() # After API call (postprocess)
│
├─► on_computer_call_start() # Before executing action
│ └─► Execute action (click, type, etc.)
│ └─► on_screenshot() # Screenshot captured
│ └─► on_computer_call_end()
│
├─► on_usage() # Token usage tracked
├─► on_run_continue() # Check if should continue
│
└─► Loop until done
│
on_run_end() # Agent run complete
Supported Models
Switch between different AI providers and local models
| Model | Computer-Use | Grounding | Tools | Type |
|---|---|---|---|---|
| Claude Sonnet/Haiku 4 | ✓ | ✓ | ✓ | Full Agent |
| OpenAI computer-use-preview | ✓ | ✓ | | Full Agent |
| Gemini computer-use-preview | ✓ | ✓ | | Full Agent |
| Qwen3 VL | ✓ | ✓ | ✓ | Full Agent |
| GLM-4.5V | ✓ | ✓ | ✓ | Full Agent |
| UI-TARS / UI-TARS-2 | ✓ | ✓ | ✓ | Full Agent |
| InternVL | ✓ | ✓ | ✓ | Full Agent |
| Moondream3 | | ✓ | | Grounding |
| OmniParser | | ✓ | | Grounding |
| OpenCUA / GTA / Holo | | ✓ | | Grounding |
Switching Models
# Full computer-use models (API-based)
agent = ComputerAgent(model="anthropic/claude-sonnet-4-5-20250929")
agent = ComputerAgent(model="openai/computer-use-preview")
agent = ComputerAgent(model="gemini-2.5-computer-use-preview")
# Via OpenRouter (access many models)
agent = ComputerAgent(model="openrouter/qwen/qwen3-vl-235b-a22b-instruct")
# Local inference with HuggingFace Transformers
agent = ComputerAgent(model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B")
# Local inference with Ollama
agent = ComputerAgent(model="ollama/llava")
# MLX (Apple Silicon optimized)
agent = ComputerAgent(model="mlx/llava-v1.6-mistral-7b")Model Composition (Grounding + Planning)
Combine a specialized grounding model with a general LLM for planning:
# Format: {grounding-model}+{planning-llm}
# UI-TARS for clicking + Claude for reasoning
agent = ComputerAgent(
    model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+anthropic/claude-sonnet-4-5"
)
# Moondream3 grounding + GPT-4o planning
agent = ComputerAgent(model="moondream3+openai/gpt-4o")
# OmniParser grounding + any LLM
agent = ComputerAgent(model="omniparser+openai/gpt-4o")
# Human-in-the-loop for click verification
agent = ComputerAgent(model="openai/computer-use-preview+human/human")Computer Configuration
All VM and interface options
from computer import Computer
computer = Computer(
    # Display & Resources
    display="1024x768",          # Screen resolution (WIDTHxHEIGHT)
    memory="8GB",                # RAM allocation
    cpu="4",                     # CPU cores

    # OS Selection
    os_type="macos",             # "macos", "linux", or "windows"
    image=None,                  # VM image (auto-selected if None)
    name="my-vm",                # VM identifier

    # Provider
    provider_type="lume",        # lume, docker, cloud, winsandbox
    host="localhost",            # Provider host
    port=7777,                   # Provider API port
    noVNC_port=8006,             # Web VNC interface port

    # Storage
    storage="/path/to/storage",  # Persistent storage path
    ephemeral=False,             # True = destroy on stop
    shared_directories=[         # Mount host directories
        "/Users/me/shared"
    ],

    # Cloud/Remote
    api_key=None,                    # CUA_API_KEY (from env if not set)
    use_host_computer_server=False,  # Use localhost instead of VM

    # Logging
    verbosity=20,                # logging.INFO
    telemetry_enabled=True,      # Anonymous usage tracking

    # Experiments
    experiments=["app-use"]      # macOS app-specific automation
)

VM Providers

| Provider | OS | Description |
|---|---|---|
| lume | macOS | Apple Virtualization.framework. Near-native performance on Apple Silicon. Best for macOS VMs. |
| docker | Linux | Docker containers with a desktop environment. Fast startup, good for Linux automation. |
| cloud | Linux/macOS | CUA Cloud service. No local VM needed. Requires CUA_API_KEY. |
| winsandbox | Windows | Windows Sandbox. Always ephemeral. Requires a Windows host. |
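Switching providers is just a constructor change. For example, a Linux desktop in a Docker container instead of a macOS VM (image left to the provider default):

from computer import Computer

# Same workflow, but a Linux container via the docker provider
computer = Computer(
    os_type="linux",
    provider_type="docker",
    display="1024x768",
)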
Agent Configuration
All ComputerAgent options
import logging
from agent import ComputerAgent
from agent.callbacks import TrajectorySaverCallback, BudgetManagerCallback, LoggingCallback

agent = ComputerAgent(
    # Core
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],            # Computer + function tools

    # Control
    custom_loop=None,            # Override default agent loop
    max_retries=3,               # API retry attempts
    screenshot_delay=0.5,        # Delay before screenshots (seconds)

    # Memory Optimization
    only_n_most_recent_images=3, # Keep only N screenshots in context
    use_prompt_caching=False,    # Anthropic prompt caching

    # Callbacks
    callbacks=[                  # Built-in and custom callbacks
        TrajectorySaverCallback(path="./trajectories"),
        BudgetManagerCallback(max_usd=5.0),
        LoggingCallback(level=logging.DEBUG),
    ],
    instructions="Custom system prompt here",

    # Tracking
    trajectory_dir="./trajectories",  # Save actions to disk
    max_trajectory_budget=5.0,        # Cost budget in USD
    telemetry_enabled=True,

    # Logging
    verbosity=10,                # logging.DEBUG

    # Model Provider Overrides
    api_key=None,                # Override provider API key
    api_base=None,               # Override provider URL
    trust_remote_code=False,     # For HuggingFace models

    # Additional LLM kwargs
    temperature=0.7,
    max_tokens=4096,
)

- only_n_most_recent_images: prevents context overflow by keeping only the last N screenshots.
- max_trajectory_budget: stops the run when cost exceeds this USD amount.
- trajectory_dir: saves screenshots + actions for replay/debugging.
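The tools list can carry function tools alongside the computer, per the "Tools (computers, functions)" entry in the architecture diagram. A sketch, assuming plain Python callables with docstrings are accepted (check the SDK docs for the exact contract):

from agent import ComputerAgent

def read_notes(path: str) -> str:
    """Return the contents of a local notes file (docstring doubles as the tool description)."""
    with open(path) as f:
        return f.read()

# `computer` is a Computer instance as configured above
agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer, read_notes],  # computer + a function tool (assumed contract)
)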
Interface Methods
All available computer actions
Mouse Actions
# Clicks
await computer.interface.left_click(x, y)
await computer.interface.right_click(x, y)
await computer.interface.double_click(x, y)
await computer.interface.middle_click(x, y)
# Movement
await computer.interface.move_cursor(x, y)
# Drag
await computer.interface.drag_to(x, y)
await computer.interface.drag([(x1, y1), (x2, y2)])
# Scroll
await computer.interface.scroll(x, y, dx, dy)
Keyboard Actions
# Type text
await computer.interface.type_text("hello world")
# Press key
await computer.interface.press_key("enter")
await computer.interface.press_key("escape")
# Hotkeys (modifier combos)
await computer.interface.hotkey("command", "space")
await computer.interface.hotkey("command", "shift", "4")
await computer.interface.hotkey("control", "c")Clipboard & Screenshot
# Screenshot
screenshot_bytes = await computer.interface.screenshot()
# Clipboard
await computer.interface.set_clipboard("text to paste")
content = await computer.interface.copy_to_clipboard()
# Get screen dimensions
width, height = await computer.interface.get_dimensions()
# Get environment info
env = await computer.interface.get_environment()
macOS-Specific
# Accessibility tree (UI element hierarchy)
tree = await computer.interface.get_accessibility_tree()
# Diorama commands (app-specific automation)
result = await computer.interface.diorama_cmd(
    "open_app",
    {"app_name": "Safari"}
)
# Run shell command
output = await computer.interface.run_command("ls -la")
Built-in Callbacks
Extend agent behavior
TrajectorySaverCallback
Saves screenshots and actions to disk for replay and debugging
from agent.callbacks import TrajectorySaverCallback
callback = TrajectorySaverCallback(
    path="./trajectories",
    save_screenshots=True
)

BudgetManagerCallback
Stops agent when cost budget is exceeded
from agent.callbacks import BudgetManagerCallback
callback = BudgetManagerCallback(
    max_usd=5.0,
    warn_at=0.8  # Warn at 80%
)

ImageRetentionCallback
Keeps only N most recent images in context
from agent.callbacks import ImageRetentionCallback
callback = ImageRetentionCallback(
    keep_last=3
)

LoggingCallback
Detailed execution logging
from agent.callbacks import LoggingCallback
import logging
callback = LoggingCallback(
    level=logging.DEBUG
)

Custom Callback
from agent.callbacks import AsyncCallbackHandler
class MyCallback(AsyncCallbackHandler):
    async def on_run_start(self, messages):
        print("Agent starting...")

    async def on_computer_call_start(self, action, call_id):
        print(f"Executing: {action}")

    async def on_computer_call_end(self, action, call_id, result):
        print(f"Completed: {action}")

    async def on_screenshot(self, screenshot_bytes):
        # Save or process screenshot
        pass

    async def on_usage(self, usage):
        print(f"Tokens: {usage['total_tokens']}")

    async def on_run_end(self, result):
        print(f"Agent finished. Cost: ${result.get('cost', 0):.4f}")

agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    callbacks=[MyCallback()]
)

Complete Examples
Basic macOS Automation
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        display="1024x768",
        memory="8GB",
        provider_type="lume"
    )
    try:
        await computer.run()

        agent = ComputerAgent(
            model="anthropic/claude-sonnet-4-5-20250929",
            tools=[computer],
            only_n_most_recent_images=3,
            trajectory_dir="./trajectories"
        )

        messages = [{
            "role": "user",
            "content": "Open Safari, go to github.com, and take a screenshot"
        }]

        async for result in agent.run(messages):
            for item in result.get("output", []):
                if item.get("type") == "message":
                    print(f"Agent: {item['content'][0]['text']}")
                elif item.get("type") == "computer_call":
                    print(f"Action: {item['action']}")
    finally:
        await computer.stop()

asyncio.run(main())

Using Local Model (UI-TARS)
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume"
    )
    await computer.run()

    # Use UI-TARS locally (requires a GPU with enough VRAM)
    agent = ComputerAgent(
        model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B",
        tools=[computer],
        trust_remote_code=True  # Required for HuggingFace models
    )

    async for result in agent.run([{
        "role": "user",
        "content": "Click on the Safari icon in the dock"
    }]):
        print(result)

    await computer.stop()

asyncio.run(main())

With Budget & Trajectory Saving
import asyncio
import logging
from agent import ComputerAgent
from agent.callbacks import TrajectorySaverCallback, BudgetManagerCallback
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume",
        storage="./vm-storage"  # Persistent storage
    )
    await computer.run()

    agent = ComputerAgent(
        model="anthropic/claude-sonnet-4-5-20250929",
        tools=[computer],
        callbacks=[
            TrajectorySaverCallback(path="./trajectories"),
            BudgetManagerCallback(max_usd=2.0),
        ],
        only_n_most_recent_images=5,
        verbosity=logging.INFO,
        instructions="Be concise. Take direct actions."
    )

    async for result in agent.run([{
        "role": "user",
        "content": """
        1. Open System Settings
        2. Navigate to General
        3. Check the About section
        """
    }]):
        usage = result.get("usage", {})
        if usage:
            print(f"Tokens used: {usage.get('total_tokens', 0)}")

    await computer.stop()

asyncio.run(main())

Composed Grounding + Planning
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume"
    )
    await computer.run()

    # Moondream3 for click prediction + Claude for reasoning
    agent = ComputerAgent(
        model="moondream3+anthropic/claude-sonnet-4-5-20250929",
        tools=[computer]
    )

    async for result in agent.run([{
        "role": "user",
        "content": "Find and click on the Finder icon"
    }]):
        print(result)

    await computer.stop()

asyncio.run(main())

Lume CLI Reference
Manage macOS VMs
# List available images
lume images
# Pull an image
lume pull macos-sequoia-cua:latest
# Run a VM (interactive)
lume run macos-sequoia-cua:latest
# List running VMs
lume list
# Stop a VM
lume stop <vm-name>
# Delete a VM
lume delete <vm-name>
# Get VM info
lume info <vm-name>
# Clone a VM (for snapshots)
lume clone <source> <destination>
Troubleshooting
Lume install fails
Ensure you're on an Apple Silicon Mac running macOS 13+, and check that the Xcode Command Line Tools are installed: xcode-select --install
VM won't start / hangs
Check available disk space (need ~50GB). Try: lume delete <vm-name> && lume pull <image> to reset.
Python dependency errors
Use Python 3.12 or 3.13 (NOT 3.14). Run: uv python install 3.13 && uv sync
API rate limits
Add delays between actions with the screenshot_delay parameter, and use budget callbacks to cap costs (see the sketch below).
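Both knobs appear in Agent Configuration above; combined:

from agent import ComputerAgent
from agent.callbacks import BudgetManagerCallback

# Slow the loop down and cap spend to stay under provider rate limits
# (`computer` is a Computer instance as configured earlier)
agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    screenshot_delay=2.0,  # seconds before each screenshot
    callbacks=[BudgetManagerCallback(max_usd=2.0)],
)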
Screenshots are black/wrong
The VM may not be fully booted. Add an initial delay, or check that the VM is responsive via VNC (localhost:8006); see the polling sketch below.
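One way to wait for boot, using only documented calls: poll for a non-empty screenshot before handing the computer to the agent (the helper name and 60-second cap are ours, not CUA APIs):

import asyncio

async def wait_for_boot(computer, timeout: float = 60.0) -> bool:
    """Poll until the VM returns a non-empty screenshot."""
    loop = asyncio.get_event_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        try:
            if await computer.interface.screenshot():
                return True
        except Exception:
            pass  # interface not ready yet
        await asyncio.sleep(2)
    return False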
Actions not executing
Check that computer-server is running inside the VM. For custom images, ensure computer-server is installed.
Key Takeaways
Sandboxed by design
All actions happen in VMs, not on your host. Safe for experimentation and production automation.
Model-agnostic
Switch between Claude, GPT-4o, Gemini, or local models like UI-TARS. Compose grounding + planning.
Callback extensibility
Built-in callbacks for budgets, trajectories, logging. Create custom callbacks for any behavior.
Near-native macOS
Lume uses Apple Virtualization.framework for near-native performance on Apple Silicon.