CUA - Computer Use Agent
Open-source framework for AI agents that autonomously interact with computers through visual understanding
Agents run in sandboxed macOS, Linux, and Windows VMs: the AI perceives the screen, reasons about the interface, and executes actions. Supports Claude, GPT-4o, Gemini, UI-TARS, and many other models.
Architecture
How CUA works under the hood
┌─────────────────────────────────────────────────┐
│ Agent SDK (agent package)                       │
├─────────────────────────────────────────────────┤
│ ComputerAgent                                   │
│  ├─ Agent Loops (anthropic, openai, gemini)     │
│  ├─ Callbacks (trajectory, budget, logging)     │
│  ├─ Adapters (HuggingFace, MLX, CUA cloud)      │
│  └─ Tools (computers, functions)                │
├─────────────────────────────────────────────────┤
│ Computer SDK (computer package)                 │
├─────────────────────────────────────────────────┤
│ Computer Class                                  │
│  ├─ VM Providers (lume, docker, cloud)          │
│  ├─ Interfaces (macOS, linux, windows)          │
│  └─ Actions (click, type, scroll, screenshot)   │
├─────────────────────────────────────────────────┤
│ Infrastructure (Lume, Core, SOM)                │
├─────────────────────────────────────────────────┤
│ Lume (macOS VM manager - Swift)                 │
│ Computer Server (runs inside VM)                │
│ Set-of-Mark grounding                           │
└─────────────────────────────────────────────────┘
Agent Layer
ComputerAgent orchestrates the reasoning loop: it takes a screenshot, sends it to the LLM, parses the returned action, and executes it on the computer.
Computer Layer
The Computer class manages the VM lifecycle and connects via WebSocket to the computer-server running inside the VM.
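A minimal lifecycle sketch using only calls documented later in this page (start the VM, grab one screenshot, shut down):

import asyncio
from computer import Computer

async def main():
    computer = Computer(os_type="macos", provider_type="lume")
    try:
        await computer.run()                         # boot the VM, connect WebSocket
        png = await computer.interface.screenshot()  # raw screenshot bytes
        print(f"screenshot: {len(png)} bytes")
    finally:
        await computer.stop()                        # shut down / clean up

asyncio.run(main())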
Infrastructure Layer
Lume provides near-native macOS VMs on Apple Silicon; Docker and CUA Cloud cover Linux; Windows Sandbox covers Windows.
Running on macOS
Complete setup guide for Apple Silicon
Requirements
- Apple Silicon Mac (M1, M2, M3, M4)
- macOS 13.0 (Ventura) or later
- Python 3.12 or 3.13 (NOT 3.14 - dependency issues)
- 16GB RAM recommended (8GB minimum)
- 50GB free disk space for VM images
Step 1: Install Lume (macOS VM Manager)
Lume uses Apple's Virtualization.framework for near-native performance. It manages macOS VM lifecycle and provides an HTTP API.
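Once the daemon is installed and running (see below), that HTTP API can be queried directly. The default port 7777 appears in the Computer Configuration section; the exact endpoint path here is an assumption, so verify it against the Lume docs:

# List VMs via the Lume daemon's HTTP API
# (port 7777 is the default; the /lume/vms path is an assumption)
curl http://localhost:7777/lume/vms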
# Install Lume CLI and daemon
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/lume/scripts/install.sh)"
# Verify installation
lume --version
# Test by listing available images
lume images
Step 2: Clone Repository & Setup Python
# Clone the repository
git clone https://github.com/trycua/cua.git
cd cua
# Install uv (fast Python package manager)
pip install uv
# Sync all Python dependencies (uses pyproject.toml workspace)
uv sync
# Set up pre-commit hooks (optional but recommended)
uv run pre-commit install
Step 3: Configure API Keys
Create a .env.local file in the root directory with your API keys:
# .env.local
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
# Optional: For CUA cloud provider
CUA_API_KEY=cua_your-key-here
# Optional: Disable telemetry
CUA_TELEMETRY_DISABLED=1
Step 4: Pull a macOS VM Image
# Pull the CUA-optimized macOS Sequoia image (has computer-server pre-installed)
lume pull macos-sequoia-cua:latest
# Or for a vanilla macOS install
lume pull macos-sequoia-vanilla:latest
# List pulled images
lume images
Step 5: Run Your First Agent
# Run the example script
uv run python examples/agent_ui_examples.py
# Or create your own script
cat << 'EOF' > my_agent.py
import asyncio
from agent import ComputerAgent
from computer import Computer

async def main():
    # Create a macOS VM
    computer = Computer(
        os_type="macos",
        display="1024x768",
        memory="8GB",
        cpu="4",
        provider_type="lume"
    )
    try:
        await computer.run()  # Start VM

        agent = ComputerAgent(
            model="anthropic/claude-sonnet-4-5-20250929",
            tools=[computer],
            verbosity=20  # logging.INFO
        )

        async for result in agent.run([{
            "role": "user",
            "content": "Open Safari and search for 'hello world'"
        }]):
            print(result)
    finally:
        await computer.stop()

asyncio.run(main())
EOF
uv run python my_agent.py
Agent Execution Flow
How actions are performed step by step
User sends instruction
e.g., 'Open Safari and search for Python tutorials'
Agent takes screenshot
Captures current VM screen state
Screenshot + instruction sent to LLM
Claude/GPT-4o/etc. receives image and reasons about what to do
LLM returns action
e.g., {type: 'left_click', x: 100, y: 50} or {type: 'type_text', text: 'hello'}
Action executed on VM
Computer interface sends command via WebSocket to computer-server in VM
Loop continues
Take new screenshot, send to LLM, repeat until task complete or budget exhausted
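The same flow as an illustrative sketch. Here ask_llm is a hypothetical stand-in for the provider call, not a CUA API; ComputerAgent.run implements all of this internally, with retries, callbacks, and budget tracking:

import asyncio
from computer import Computer

async def ask_llm(messages, screenshot):
    # Hypothetical stand-in for the provider API call: a real loop would
    # send the screenshot + history and parse an action dict back.
    return None  # e.g. {"type": "left_click", "x": 100, "y": 50}

async def agent_loop(instruction: str, computer: Computer, max_steps: int = 20):
    messages = [{"role": "user", "content": instruction}]   # 1. instruction
    for _ in range(max_steps):                              # budget guard
        shot = await computer.interface.screenshot()        # 2. perceive
        action = await ask_llm(messages, shot)              # 3-4. reason
        if action is None:                                  # task complete
            return
        if action["type"] == "left_click":                  # 5. act
            await computer.interface.left_click(action["x"], action["y"])
        elif action["type"] == "type_text":
            await computer.interface.type_text(action["text"])
        messages.append({"role": "assistant", "content": str(action)})  # 6. loop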
Callback Lifecycle
on_run_start() # Agent run begins
│
├─► on_llm_start() # Before API call (preprocess messages)
│ └─► on_api_start() # API request begins
│ └─► on_api_end() # API response received
│ └─► on_llm_end() # After API call (postprocess)
│
├─► on_computer_call_start() # Before executing action
│ └─► Execute action (click, type, etc.)
│ └─► on_screenshot() # Screenshot captured
│ └─► on_computer_call_end()
│
├─► on_usage() # Token usage tracked
├─► on_run_continue() # Check if should continue
│
└─► Loop until done
│
on_run_end() # Agent run complete
Supported Models
Switch between different AI providers and local models
| Model | Computer-Use | Grounding | Tools | Type |
|---|---|---|---|---|
| Claude Sonnet/Haiku 4 | ✓ | ✓ | ✓ | Full Agent |
| OpenAI computer-use-preview | ✓ | ✓ | | Full Agent |
| Gemini computer-use-preview | ✓ | ✓ | | Full Agent |
| Qwen3 VL | ✓ | ✓ | ✓ | Full Agent |
| GLM-4.5V | ✓ | ✓ | ✓ | Full Agent |
| UI-TARS / UI-TARS-2 | ✓ | ✓ | ✓ | Full Agent |
| InternVL | ✓ | ✓ | ✓ | Full Agent |
| Moondream3 | | ✓ | | Grounding |
| OmniParser | | ✓ | | Grounding |
| OpenCUA / GTA / Holo | | ✓ | | Grounding |
Switching Models
# Full computer-use models (API-based)
agent = ComputerAgent(model="anthropic/claude-sonnet-4-5-20250929")
agent = ComputerAgent(model="openai/computer-use-preview")
agent = ComputerAgent(model="gemini-2.5-computer-use-preview")
# Via OpenRouter (access many models)
agent = ComputerAgent(model="openrouter/qwen/qwen3-vl-235b-a22b-instruct")
# Local inference with HuggingFace Transformers
agent = ComputerAgent(model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B")
# Local inference with Ollama
agent = ComputerAgent(model="ollama/llava")
# MLX (Apple Silicon optimized)
agent = ComputerAgent(model="mlx/llava-v1.6-mistral-7b")Model Composition (Grounding + Planning)
Combine a specialized grounding model with a general LLM for planning:
# Format: {grounding-model}+{planning-llm}
# UI-TARS for clicking + Claude for reasoning
agent = ComputerAgent(
    model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+anthropic/claude-sonnet-4-5"
)
# Moondream3 grounding + GPT-4o planning
agent = ComputerAgent(model="moondream3+openai/gpt-4o")
# OmniParser grounding + any LLM
agent = ComputerAgent(model="omniparser+openai/gpt-4o")
# Human-in-the-loop for click verification
agent = ComputerAgent(model="openai/computer-use-preview+human/human")Computer Configuration
All VM and interface options
from computer import Computer
computer = Computer(
    # Display & Resources
    display="1024x768",          # Screen resolution (WIDTHxHEIGHT)
    memory="8GB",                # RAM allocation
    cpu="4",                     # CPU cores

    # OS Selection
    os_type="macos",             # "macos", "linux", or "windows"
    image=None,                  # VM image (auto-selected if None)
    name="my-vm",                # VM identifier

    # Provider
    provider_type="lume",        # lume, docker, cloud, winsandbox
    host="localhost",            # Provider host
    port=7777,                   # Provider API port
    noVNC_port=8006,             # Web VNC interface port

    # Storage
    storage="/path/to/storage",  # Persistent storage path
    ephemeral=False,             # True = destroy on stop
    shared_directories=[         # Mount host directories
        "/Users/me/shared"
    ],

    # Cloud/Remote
    api_key=None,                    # CUA_API_KEY (from env if not set)
    use_host_computer_server=False,  # Use localhost instead of VM

    # Logging
    verbosity=20,                # logging.INFO
    telemetry_enabled=True,      # Anonymous usage tracking

    # Experiments
    experiments=["app-use"]      # macOS app-specific automation
)

VM Providers

| Provider | OS | Description |
|---|---|---|
| lume | macOS | Apple Virtualization.framework. Near-native performance on Apple Silicon. Best for macOS VMs. |
| docker | Linux | Docker containers with a desktop environment. Fast startup, good for Linux automation. |
| cloud | Linux/macOS | CUA Cloud service. No local VM needed. Requires CUA_API_KEY. |
| winsandbox | Windows | Windows Sandbox. Always ephemeral. Requires a Windows host. |
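Switching providers is just a constructor change. For example, a Linux desktop in a Docker container instead of a macOS VM (image left to the provider default):

from computer import Computer

# Same workflow, but a Linux container via the docker provider
computer = Computer(
    os_type="linux",
    provider_type="docker",
    display="1024x768",
)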
Agent Configuration
All ComputerAgent options
import logging
from agent import ComputerAgent
from agent.callbacks import TrajectorySaverCallback, BudgetManagerCallback, LoggingCallback

agent = ComputerAgent(
    # Core
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],            # Computer + function tools

    # Control
    custom_loop=None,            # Override default agent loop
    max_retries=3,               # API retry attempts
    screenshot_delay=0.5,        # Delay before screenshots (seconds)

    # Memory Optimization
    only_n_most_recent_images=3, # Keep only N screenshots in context
    use_prompt_caching=False,    # Anthropic prompt caching

    # Callbacks
    callbacks=[                  # Built-in and custom callbacks
        TrajectorySaverCallback(path="./trajectories"),
        BudgetManagerCallback(max_usd=5.0),
        LoggingCallback(level=logging.DEBUG),
    ],
    instructions="Custom system prompt here",

    # Tracking
    trajectory_dir="./trajectories",  # Save actions to disk
    max_trajectory_budget=5.0,        # Cost budget in USD
    telemetry_enabled=True,

    # Logging
    verbosity=10,                # logging.DEBUG

    # Model Provider Overrides
    api_key=None,                # Override provider API key
    api_base=None,               # Override provider URL
    trust_remote_code=False,     # For HuggingFace models

    # Additional LLM kwargs
    temperature=0.7,
    max_tokens=4096,
)

- only_n_most_recent_images: prevents context overflow by keeping only the last N screenshots.
- max_trajectory_budget: stops the run when cost exceeds this USD amount.
- trajectory_dir: saves screenshots + actions for replay/debugging.
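The tools list can carry function tools alongside the computer, per the "Tools (computers, functions)" entry in the architecture diagram. A sketch, assuming plain Python callables with docstrings are accepted (check the SDK docs for the exact contract):

from agent import ComputerAgent

def read_notes(path: str) -> str:
    """Return the contents of a local notes file (docstring doubles as the tool description)."""
    with open(path) as f:
        return f.read()

# `computer` is a Computer instance as configured above
agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer, read_notes],  # computer + a function tool (assumed contract)
)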
Interface Methods
All available computer actions
Mouse Actions
# Clicks
await computer.interface.left_click(x, y)
await computer.interface.right_click(x, y)
await computer.interface.double_click(x, y)
await computer.interface.middle_click(x, y)
# Movement
await computer.interface.move_cursor(x, y)
# Drag
await computer.interface.drag_to(x, y)
await computer.interface.drag([(x1, y1), (x2, y2)])
# Scroll
await computer.interface.scroll(x, y, dx, dy)
Keyboard Actions
# Type text
await computer.interface.type_text("hello world")
# Press key
await computer.interface.press_key("enter")
await computer.interface.press_key("escape")
# Hotkeys (modifier combos)
await computer.interface.hotkey("command", "space")
await computer.interface.hotkey("command", "shift", "4")
await computer.interface.hotkey("control", "c")Clipboard & Screenshot
# Screenshot
screenshot_bytes = await computer.interface.screenshot()
# Clipboard
await computer.interface.set_clipboard("text to paste")
content = await computer.interface.copy_to_clipboard()
# Get screen dimensions
width, height = await computer.interface.get_dimensions()
# Get environment info
env = await computer.interface.get_environment()
macOS-Specific
# Accessibility tree (UI element hierarchy)
tree = await computer.interface.get_accessibility_tree()
# Diorama commands (app-specific automation)
result = await computer.interface.diorama_cmd(
    "open_app",
    {"app_name": "Safari"}
)
# Run shell command
output = await computer.interface.run_command("ls -la")
Built-in Callbacks
Extend agent behavior
TrajectorySaverCallback
Saves screenshots and actions to disk for replay and debugging
from agent.callbacks import TrajectorySaverCallback
callback = TrajectorySaverCallback(
    path="./trajectories",
    save_screenshots=True
)

BudgetManagerCallback
Stops agent when cost budget is exceeded
from agent.callbacks import BudgetManagerCallback
callback = BudgetManagerCallback(
    max_usd=5.0,
    warn_at=0.8  # Warn at 80%
)

ImageRetentionCallback
Keeps only N most recent images in context
from agent.callbacks import ImageRetentionCallback
callback = ImageRetentionCallback(
    keep_last=3
)

LoggingCallback
Detailed execution logging
from agent.callbacks import LoggingCallback
import logging
callback = LoggingCallback(
    level=logging.DEBUG
)

Custom Callback
from agent.callbacks import AsyncCallbackHandler
class MyCallback(AsyncCallbackHandler):
    async def on_run_start(self, messages):
        print("Agent starting...")

    async def on_computer_call_start(self, action, call_id):
        print(f"Executing: {action}")

    async def on_computer_call_end(self, action, call_id, result):
        print(f"Completed: {action}")

    async def on_screenshot(self, screenshot_bytes):
        # Save or process screenshot
        pass

    async def on_usage(self, usage):
        print(f"Tokens: {usage['total_tokens']}")

    async def on_run_end(self, result):
        print(f"Agent finished. Cost: ${result.get('cost', 0):.4f}")

agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    callbacks=[MyCallback()]
)

Complete Examples
Basic macOS Automation
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        display="1024x768",
        memory="8GB",
        provider_type="lume"
    )
    try:
        await computer.run()

        agent = ComputerAgent(
            model="anthropic/claude-sonnet-4-5-20250929",
            tools=[computer],
            only_n_most_recent_images=3,
            trajectory_dir="./trajectories"
        )

        messages = [{
            "role": "user",
            "content": "Open Safari, go to github.com, and take a screenshot"
        }]

        async for result in agent.run(messages):
            for item in result.get("output", []):
                if item.get("type") == "message":
                    print(f"Agent: {item['content'][0]['text']}")
                elif item.get("type") == "computer_call":
                    print(f"Action: {item['action']}")
    finally:
        await computer.stop()

asyncio.run(main())

Using Local Model (UI-TARS)
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume"
    )
    await computer.run()

    # Use UI-TARS locally (requires a GPU with enough VRAM)
    agent = ComputerAgent(
        model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B",
        tools=[computer],
        trust_remote_code=True  # Required for HuggingFace models
    )

    async for result in agent.run([{
        "role": "user",
        "content": "Click on the Safari icon in the dock"
    }]):
        print(result)

    await computer.stop()

asyncio.run(main())

With Budget & Trajectory Saving
import asyncio
import logging
from agent import ComputerAgent
from agent.callbacks import TrajectorySaverCallback, BudgetManagerCallback
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume",
        storage="./vm-storage"  # Persistent storage
    )
    await computer.run()

    agent = ComputerAgent(
        model="anthropic/claude-sonnet-4-5-20250929",
        tools=[computer],
        callbacks=[
            TrajectorySaverCallback(path="./trajectories"),
            BudgetManagerCallback(max_usd=2.0),
        ],
        only_n_most_recent_images=5,
        verbosity=logging.INFO,
        instructions="Be concise. Take direct actions."
    )

    async for result in agent.run([{
        "role": "user",
        "content": """
        1. Open System Settings
        2. Navigate to General
        3. Check the About section
        """
    }]):
        usage = result.get("usage", {})
        if usage:
            print(f"Tokens used: {usage.get('total_tokens', 0)}")

    await computer.stop()

asyncio.run(main())

Composed Grounding + Planning
import asyncio
from agent import ComputerAgent
from computer import Computer
async def main():
    computer = Computer(
        os_type="macos",
        provider_type="lume"
    )
    await computer.run()

    # Moondream3 for click prediction + Claude for reasoning
    agent = ComputerAgent(
        model="moondream3+anthropic/claude-sonnet-4-5-20250929",
        tools=[computer]
    )

    async for result in agent.run([{
        "role": "user",
        "content": "Find and click on the Finder icon"
    }]):
        print(result)

    await computer.stop()

asyncio.run(main())

Lume CLI Reference
Manage macOS VMs
# List available images
lume images
# Pull an image
lume pull macos-sequoia-cua:latest
# Run a VM (interactive)
lume run macos-sequoia-cua:latest
# List running VMs
lume list
# Stop a VM
lume stop <vm-name>
# Delete a VM
lume delete <vm-name>
# Get VM info
lume info <vm-name>
# Clone a VM (for snapshots)
lume clone <source> <destination>
Troubleshooting
Lume install fails
Ensure you're on an Apple Silicon Mac running macOS 13+, and check that the Xcode Command Line Tools are installed: xcode-select --install
VM won't start / hangs
Check available disk space (need ~50GB). Try: lume delete <vm-name> && lume pull <image> to reset.
Python dependency errors
Use Python 3.12 or 3.13 (NOT 3.14). Run: uv python install 3.13 && uv sync
API rate limits
Add delays between actions with the screenshot_delay parameter, and use budget callbacks to cap costs (see the sketch below).
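Both knobs appear in Agent Configuration above; combined:

from agent import ComputerAgent
from agent.callbacks import BudgetManagerCallback

# Slow the loop down and cap spend to stay under provider rate limits
# (`computer` is a Computer instance as configured earlier)
agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    screenshot_delay=2.0,  # seconds before each screenshot
    callbacks=[BudgetManagerCallback(max_usd=2.0)],
)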
Screenshots are black/wrong
The VM may not be fully booted. Add an initial delay, or check that the VM is responsive via VNC (localhost:8006); see the polling sketch below.
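One way to wait for boot, using only documented calls: poll for a non-empty screenshot before handing the computer to the agent (the helper name and 60-second cap are ours, not CUA APIs):

import asyncio

async def wait_for_boot(computer, timeout: float = 60.0) -> bool:
    """Poll until the VM returns a non-empty screenshot."""
    loop = asyncio.get_event_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        try:
            if await computer.interface.screenshot():
                return True
        except Exception:
            pass  # interface not ready yet
        await asyncio.sleep(2)
    return False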
Actions not executing
Check that computer-server is running inside the VM. For custom images, ensure computer-server is installed.
Key Takeaways
Sandboxed by design
All actions happen in VMs, not on your host. Safe for experimentation and production automation.
Model-agnostic
Switch between Claude, GPT-4o, Gemini, or local models like UI-TARS. Compose grounding + planning.
Callback extensibility
Built-in callbacks for budgets, trajectories, logging. Create custom callbacks for any behavior.
Near-native macOS
Lume uses Apple Virtualization.framework for near-native performance on Apple Silicon.