Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control.
The fundamental architecture of computer use agents: observe screen, reason about next action, execute action, repeat. This loop integrates vision models with action execution through an iterative pipeline.
Key components: perception (screenshot capture), reasoning (a vision model that chooses the next action), and action execution (mouse and keyboard control).
Critical insight: Vision agents are completely still during "thinking" phase (1-5 seconds), creating a detectable pause pattern.
When to use: Building any computer use agent from scratch, Integrating vision models with desktop control, Understanding agent behavior patterns
from anthropic import Anthropic
from PIL import Image
import base64
import pyautogui
import time

class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """
def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
self.client = client
self.model = model
self.max_steps = 50 # Prevent runaway loops
self.action_delay = 0.5 # Seconds between actions
def capture_screenshot(self) -> str:
"""Capture screen and return base64 encoded image."""
screenshot = pyautogui.screenshot()
# Resize for token efficiency (1280x800 is a good balance)
screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
import io
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
return base64.b64encode(buffer.getvalue()).decode()
def execute_action(self, action: dict) -> dict:
"""Execute mouse/keyboard action on the computer."""
action_type = action.get("type")
if action_type == "click":
x, y = action["x"], action["y"]
button = action.get("button", "left")
pyautogui.click(x, y, button=button)
return {"success": True, "action": f"clicked at ({x}, {y})"}
elif action_type == "type":
text = action["text"]
pyautogui.typewrite(text, interval=0.02)
return {"success": True, "action": f"typed {len(text)} chars"}
elif action_type == "key":
key = action["key"]
pyautogui.press(key)
return {"success": True, "action": f"pressed {key}"}
elif action_type == "scroll":
direction = action.get("direction", "down")
amount = action.get("amount", 3)
scroll = -amount if direction == "down" else amount
pyautogui.scroll(scroll)
return {"success": True, "action": f"scrolled {direction}"}
elif action_type == "move":
x, y = action["x"], action["y"]
pyautogui.moveTo(x, y)
return {"success": True, "action": f"moved to ({x}, {y})"}
else:
return {"success": False, "error": f"Unknown action: {action_type}"}
def run(self, task: str) -> dict:
"""
Run perception-reasoning-action loop until task complete.
The loop:
1. Screenshot current state
2. Send to vision model with task context
3. Parse action from response
4. Execute action
5. Repeat until done or max steps
"""
messages = []
step_count = 0
system_prompt = """You are a computer use agent. You can see the screen
and control mouse/keyboard.
Available actions (respond with JSON):
- {"type": "click", "x": 100, "y": 200, "button": "left"}
- {"type": "type", "text": "hello world"}
- {"type": "key", "key": "enter"}
- {"type": "scroll", "direction": "down", "amount": 3}
- {"type": "done", "result": "task completed successfully"}
Always respond with ONLY a JSON action object.
Be precise with coordinates - click exactly where needed.
If you see an error, try to recover.
"""
while step_count < self.max_steps:
step_count += 1
# 1. PERCEPTION: Capture current screen
screenshot_b64 = self.capture_screenshot()
# 2. REASONING: Send to vision model
user_content = [
{"type": "text", "text": f"Task: {task}\n\nStep {step_count}. What action should I take?"},
{"type": "image", "source": {
"type": "base64",
"media_type": "image/png",
"data": screenshot_b64
}}
]
messages.append({"role": "user", "content": user_content})
response = self.client.messages.create(
model=self.model,
max_tokens=1024,
system=system_prompt,
messages=messages
)
assistant_message = response.content[0].text
messages.append({"role": "assistant", "content": assistant_message})
# 3. Parse action from response
import json
try:
action = json.loads(assistant_message)
except json.JSONDecodeError:
# Try to extract JSON from response
import re
match = re.search(r'\{[^}]+\}', assistant_message)
if match:
action = json.loads(match.group())
else:
continue
# Check if done
if action.get("type") == "done":
return {
"success": True,
"result": action.get("result"),
"steps": step_count
}
# 4. ACTION: Execute
result = self.execute_action(action)
# Small delay for UI to update
time.sleep(self.action_delay)
return {
"success": False,
"error": "Max steps reached",
"steps": step_count
        }

agent = ComputerUseAgent(Anthropic())
result = agent.run("Open Chrome and search for 'weather today'")
Computer use agents MUST run in isolated, sandboxed environments. Never give agents direct access to your main system - the security risks are too high. Use Docker containers with virtual desktops.
Key isolation requirements: no new privileges, dropped capabilities, network isolation (none, or an explicit allowlist), a read-only root filesystem, CPU and memory limits, and a non-root user.
The goal is "blast radius minimization" - if the agent goes wrong, damage is contained to the sandbox.
When to use: Deploying any computer use agent, Testing agent behavior safely, Running untrusted automation tasks
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    firefox \
    python3 \
    python3-pip \
    supervisor

RUN useradd -m -s /bin/bash agent && \
    mkdir -p /home/agent/.vnc

COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

RUN apt-get install -y --no-install-recommends libcap2-bin && \
    setcap -r /usr/bin/python3 || true

COPY --chown=agent:agent . /app
WORKDIR /app
COPY supervisord.conf /etc/supervisor/conf.d/
EXPOSE 5900
USER agent
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
version: '3.8'

services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control

    # Security constraints
    security_opt:
      - no-new-privileges:true
      - seccomp:seccomp-profile.json

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Network isolation
    networks:
      - agent-network

    # No access to host filesystem
    volumes:
      - agent-tmp:/tmp

    # Read-only root filesystem
    read_only: true
    tmpfs:
      - /run
      - /var/run

    # Environment
    environment:
      - DISPLAY=:99
      - NO_PROXY=localhost

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet by default

volumes:
  agent-tmp:
import subprocess
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class SandboxConfig:
    """Configuration for agent sandbox."""
    network_allowed: list[str] = None  # Allowed domains
    max_runtime_seconds: int = 300
    max_memory_mb: int = 2048
    allow_downloads: bool = False
    allow_clipboard: bool = False
class SandboxedAgent:
    """
    Run computer use agent in Docker sandbox.
    """
def __init__(self, config: SandboxConfig):
self.config = config
self.container_id: Optional[str] = None
def start(self):
"""Start sandboxed environment."""
# Build network rules
network_rules = ""
if self.config.network_allowed:
for domain in self.config.network_allowed:
network_rules += f"--add-host={domain}:$(dig +short {domain}) "
else:
network_rules = "--network=none"
cmd = f"""
docker run -d \
--name computer-use-sandbox-$$ \
--security-opt no-new-privileges \
--cap-drop ALL \
--memory {self.config.max_memory_mb}m \
--cpus 2 \
--read-only \
--tmpfs /tmp \
{network_rules} \
computer-use-agent:latest
"""
result = subprocess.run(cmd, shell=True, capture_output=True)
self.container_id = result.stdout.decode().strip()
# Set up kill timer
subprocess.Popen([
"sh", "-c",
f"sleep {self.config.max_runtime_seconds} && docker kill {self.container_id}"
])
return self.container_id
def execute_task(self, task: str) -> dict:
"""Execute task in sandbox."""
if not self.container_id:
self.start()
# Send task to agent via API
import requests
response = requests.post(
f"http://localhost:8080/task",
json={"task": task},
timeout=self.config.max_runtime_seconds
)
return response.json()
def stop(self):
"""Stop and remove sandbox."""
if self.container_id:
subprocess.run(f"docker rm -f {self.container_id}", shell=True)
            self.container_id = None

Official implementation pattern using Claude's computer use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use, and Anthropic now describes Claude Opus 4.5 as the "best model in the world for computer use."
Key capabilities: screenshot capture, mouse and keyboard control, bash command execution, and file editing via the text editor tool.
Tool versions: computer_20241022, bash_20241022, and text_editor_20241022, used with the computer-use-2024-10-22 beta flag.
Critical limitation: "Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate" - Anthropic docs
When to use: Building production computer use agents, Need highest quality vision understanding, Full desktop control (not just browser)
from anthropic import Anthropic
from anthropic.types.beta import (
    BetaToolComputerUse20241022,
    BetaToolBash20241022,
    BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io

class AnthropicComputerUse:
    """
    Official Anthropic Computer Use implementation.
Requires:
- Docker container with virtual display
- VNC for viewing agent actions
- Proper tool implementations
"""
def __init__(self):
self.client = Anthropic()
self.model = "claude-sonnet-4-20250514" # Best for computer use
self.screen_size = (1280, 800)
def get_tools(self) -> list:
"""Define computer use tools."""
return [
BetaToolComputerUse20241022(
type="computer_20241022",
name="computer",
display_width_px=self.screen_size[0],
display_height_px=self.screen_size[1],
),
BetaToolBash20241022(
type="bash_20241022",
name="bash",
),
BetaToolTextEditor20241022(
type="text_editor_20241022",
name="str_replace_editor",
),
]
def execute_tool(self, name: str, input: dict) -> dict:
"""Execute a tool and return result."""
if name == "computer":
return self._handle_computer_action(input)
elif name == "bash":
return self._handle_bash(input)
elif name == "str_replace_editor":
return self._handle_editor(input)
else:
return {"error": f"Unknown tool: {name}"}
def _handle_computer_action(self, input: dict) -> dict:
"""Handle computer control actions."""
action = input.get("action")
if action == "screenshot":
# Capture via xdotool/scrot
subprocess.run(["scrot", "/tmp/screenshot.png"])
with open("/tmp/screenshot.png", "rb") as f:
img_data = f.read()
# Resize for efficiency
img = Image.open(io.BytesIO(img_data))
img = img.resize(self.screen_size, Image.LANCZOS)
buffer = io.BytesIO()
img.save(buffer, format="PNG")
return {
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": base64.b64encode(buffer.getvalue()).decode()
}
}
elif action == "mouse_move":
x, y = input.get("coordinate", [0, 0])
subprocess.run(["xdotool", "mousemove", str(x), str(y)])
return {"success": True}
elif action == "left_click":
subprocess.run(["xdotool", "click", "1"])
return {"success": True}
elif action == "right_click":
subprocess.run(["xdotool", "click", "3"])
return {"success": True}
elif action == "double_click":
subprocess.run(["xdotool", "click", "--repeat", "2", "1"])
return {"success": True}
elif action == "type":
text = input.get("text", "")
# Use xdotool type with delay for reliability
subprocess.run(["xdotool", "type", "--delay", "50", text])
return {"success": True}
elif action == "key":
key = input.get("key", "")
# Map common key names
key_map = {
"return": "Return",
"enter": "Return",
"tab": "Tab",
"escape": "Escape",
"backspace": "BackSpace",
}
xdotool_key = key_map.get(key.lower(), key)
subprocess.run(["xdotool", "key", xdotool_key])
return {"success": True}
elif action == "scroll":
direction = input.get("direction", "down")
amount = input.get("amount", 3)
button = "5" if direction == "down" else "4"
for _ in range(amount):
subprocess.run(["xdotool", "click", button])
return {"success": True}
return {"error": f"Unknown action: {action}"}
def _handle_bash(self, input: dict) -> dict:
"""Execute bash command."""
command = input.get("command", "")
# Security: Sanitize and limit commands
dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
for pattern in dangerous_patterns:
if pattern in command:
return {"error": "Dangerous command blocked"}
try:
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=30
)
return {
"stdout": result.stdout[:10000], # Limit output
"stderr": result.stderr[:1000],
"returncode": result.returncode
}
except subprocess.TimeoutExpired:
return {"error": "Command timed out"}
def _handle_editor(self, input: dict) -> dict:
"""Handle text editor operations."""
command = input.get("command")
path = input.get("path")
if command == "view":
try:
with open(path, "r") as f:
content = f.read()
return {"content": content[:50000]} # Limit size
except Exception as e:
return {"error": str(e)}
elif command == "str_replace":
old_str = input.get("old_str")
new_str = input.get("new_str")
try:
with open(path, "r") as f:
content = f.read()
if old_str not in content:
return {"error": "old_str not found in file"}
content = content.replace(old_str, new_str, 1)
with open(path, "w") as f:
f.write(content)
return {"success": True}
except Exception as e:
return {"error": str(e)}
return {"error": f"Unknown editor command: {command}"}
def run_task(self, task: str, max_steps: int = 50) -> dict:
"""Run computer use task with agentic loop."""
messages = [{"role": "user", "content": task}]
tools = self.get_tools()
for step in range(max_steps):
response = self.client.beta.messages.create(
model=self.model,
max_tokens=4096,
tools=tools,
messages=messages,
betas=["computer-use-2024-10-22"]
)
# Check for completion
if response.stop_reason == "end_turn":
return {
"success": True,
"result": response.content[0].text if response.content else "",
"steps": step + 1
}
# Handle tool use
if response.stop_reason == "tool_use":
messages.append({"role": "assistant", "content": response.content})
tool_results = []
for block in response.content:
if block.type == "tool_use":
result = self.execute_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
messages.append({"role": "user", "content": tool_results})
return {"success": False, "error": "Max steps reached"}For browser-only automation, using structured DOM access is more efficient than pixel-based computer use. Playwright MCP allows LLMs to control browsers using accessibility snapshots rather than screenshots.
Advantages over vision-based control: no image tokens (far cheaper), no screenshot round-trips (faster), and deterministic selectors instead of pixel coordinates.
When to use vision vs structured: reach for vision when there is no accessible DOM (canvas-heavy apps, native desktop software, visual verification); prefer structured DOM access for standard web pages and forms.
When to use: Browser-only automation tasks, Form filling and web interactions, When speed and cost matter more than visual understanding
from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import Optional
import asyncio

@dataclass
class BrowserAction:
    """Structured browser action."""
    action: str  # click, type, navigate, scroll, extract
    selector: Optional[str] = None
    text: Optional[str] = None
    url: Optional[str] = None

class BrowserUseAgent:
    """
    Browser automation using Playwright with structured commands.
    More efficient than pixel-based for web tasks.
    """
def __init__(self):
self.browser = None
self.page = None
async def start(self, headless: bool = True):
"""Start browser session."""
self.playwright = await async_playwright().start()
self.browser = await self.playwright.chromium.launch(headless=headless)
self.page = await self.browser.new_page()
async def get_page_snapshot(self) -> dict:
"""
Get structured snapshot of page for LLM.
Uses accessibility tree for efficiency.
"""
# Get accessibility tree
snapshot = await self.page.accessibility.snapshot()
# Get simplified DOM info
elements = await self.page.evaluate('''() => {
const interactable = [];
const selector = 'a, button, input, select, textarea, [role="button"]';
document.querySelectorAll(selector).forEach((el, i) => {
const rect = el.getBoundingClientRect();
if (rect.width > 0 && rect.height > 0) {
interactable.push({
index: i,
tag: el.tagName.toLowerCase(),
text: el.textContent?.trim().slice(0, 100),
type: el.type,
placeholder: el.placeholder,
name: el.name,
id: el.id,
class: el.className
});
}
});
return interactable;
}''')
return {
"url": self.page.url,
"title": await self.page.title(),
"accessibility_tree": snapshot,
"interactable_elements": elements[:50] # Limit for token efficiency
}
async def execute_action(self, action: BrowserAction) -> dict:
"""Execute structured browser action."""
try:
if action.action == "navigate":
await self.page.goto(action.url, wait_until="domcontentloaded")
return {"success": True, "url": self.page.url}
elif action.action == "click":
await self.page.click(action.selector, timeout=5000)
await self.page.wait_for_load_state("networkidle", timeout=5000)
return {"success": True}
elif action.action == "type":
await self.page.fill(action.selector, action.text)
return {"success": True}
elif action.action == "scroll":
direction = action.text or "down"
distance = 500 if direction == "down" else -500
await self.page.evaluate(f"window.scrollBy(0, {distance})")
return {"success": True}
elif action.action == "extract":
# Extract text content
if action.selector:
text = await self.page.text_content(action.selector)
else:
text = await self.page.text_content("body")
return {"success": True, "text": text[:5000]}
elif action.action == "screenshot":
# Fall back to vision when needed
screenshot = await self.page.screenshot(type="png")
import base64
return {
"success": True,
"image": base64.b64encode(screenshot).decode()
}
except Exception as e:
return {"success": False, "error": str(e)}
return {"success": False, "error": f"Unknown action: {action.action}"}
async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
"""
Run browser task with LLM decision making.
Uses structured DOM instead of screenshots.
"""
system_prompt = """You are a browser automation agent. You receive
page snapshots with interactable elements and decide actions.
Respond with JSON action:
- {"action": "navigate", "url": "https://..."}
- {"action": "click", "selector": "button.submit"}
- {"action": "type", "selector": "input[name='email']", "text": "..."}
- {"action": "scroll", "text": "down"}
- {"action": "extract", "selector": ".results"}
- {"action": "done", "result": "task completed"}
Use CSS selectors based on the element info provided.
Prefer id > name > class > text content for selectors.
"""
messages = []
for step in range(max_steps):
# Get current page state
snapshot = await self.get_page_snapshot()
user_message = f"""Task: {task}
Current page:
URL: {snapshot['url']}
Title: {snapshot['title']}
Interactable elements:
{snapshot['interactable_elements']}
What action should I take?"""
messages.append({"role": "user", "content": user_message})
# Get LLM decision
response = llm_client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=messages
)
assistant_text = response.content[0].text
messages.append({"role": "assistant", "content": assistant_text})
# Parse and execute
import json
action_dict = json.loads(assistant_text)
if action_dict.get("action") == "done":
return {"success": True, "result": action_dict.get("result")}
action = BrowserAction(**action_dict)
result = await self.execute_action(action)
if not result.get("success"):
messages.append({
"role": "user",
"content": f"Action failed: {result.get('error')}"
})
await asyncio.sleep(0.5) # Rate limit
return {"success": False, "error": "Max steps reached"}
async def close(self):
"""Clean up browser."""
if self.browser:
await self.browser.close()
if hasattr(self, 'playwright'):
            await self.playwright.stop()

async def main():
    agent = BrowserUseAgent()
    await agent.start(headless=False)
from anthropic import Anthropic
result = await agent.run_with_llm(
"Go to weather.com and find the weather for New York",
Anthropic()
)
print(result)
    await agent.close()

asyncio.run(main())
For sensitive actions, agents should pause and ask for human confirmation. "ChatGPT agent also pauses and asks for confirmation prior to taking sensitive steps such as completing a purchase."
Sensitivity levels: LOW (auto-approve), MEDIUM (log, optionally confirm), HIGH (always confirm), CRITICAL (confirm with a full review of details).
When to use: Actions with real-world consequences, Financial transactions, Authentication flows, File modifications
from enum import Enum
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional

class ActionSeverity(Enum):
    LOW = "low"            # Auto-approve
    MEDIUM = "medium"      # Log, optional confirm
    HIGH = "high"          # Always confirm
    CRITICAL = "critical"  # Confirm + review details

@dataclass
class SensitiveAction:
    """Action that may need user confirmation."""
    action_type: str
    description: str
    severity: ActionSeverity
    details: dict

class ConfirmationGate:
    """
    Gate sensitive actions through user confirmation.
    """
# Action type -> severity mapping
ACTION_SEVERITY = {
# LOW - auto-approve
"navigate": ActionSeverity.LOW,
"scroll": ActionSeverity.LOW,
"read": ActionSeverity.LOW,
"screenshot": ActionSeverity.LOW,
# MEDIUM - log and maybe confirm
"click": ActionSeverity.MEDIUM,
"type": ActionSeverity.MEDIUM,
"search": ActionSeverity.MEDIUM,
# HIGH - always confirm
"download": ActionSeverity.HIGH,
"submit_form": ActionSeverity.HIGH,
"login": ActionSeverity.HIGH,
"file_write": ActionSeverity.HIGH,
# CRITICAL - confirm with full review
"purchase": ActionSeverity.CRITICAL,
"enter_password": ActionSeverity.CRITICAL,
"enter_credit_card": ActionSeverity.CRITICAL,
"send_money": ActionSeverity.CRITICAL,
"delete": ActionSeverity.CRITICAL,
}
def __init__(
self,
confirm_callback: Callable[[SensitiveAction], bool] = None,
auto_confirm_low: bool = True,
auto_confirm_medium: bool = False
):
self.confirm_callback = confirm_callback or self._default_confirm
self.auto_confirm_low = auto_confirm_low
self.auto_confirm_medium = auto_confirm_medium
self.action_log = []
def _default_confirm(self, action: SensitiveAction) -> bool:
"""Default confirmation via CLI prompt."""
print(f"\n{'='*60}")
print(f"ACTION CONFIRMATION REQUIRED")
print(f"{'='*60}")
print(f"Type: {action.action_type}")
print(f"Severity: {action.severity.value.upper()}")
print(f"Description: {action.description}")
print(f"Details: {action.details}")
print(f"{'='*60}")
while True:
response = input("Allow this action? [y/n]: ").lower().strip()
if response in ['y', 'yes']:
return True
elif response in ['n', 'no']:
return False
def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
"""Classify action severity, considering context."""
base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)
# Escalate based on context
if context.get("involves_credentials"):
return ActionSeverity.CRITICAL
if context.get("involves_money"):
return ActionSeverity.CRITICAL
if context.get("irreversible"):
                # max() over the enum's string values compares alphabetically; escalate explicitly
                if base_severity != ActionSeverity.CRITICAL:
                    return ActionSeverity.HIGH
return base_severity
def check_action(
self,
action_type: str,
description: str,
details: dict = None
) -> tuple[bool, str]:
"""
Check if action should proceed.
Returns (approved, reason).
"""
details = details or {}
severity = self.classify_action(action_type, details)
action = SensitiveAction(
action_type=action_type,
description=description,
severity=severity,
details=details
)
# Log all actions
self.action_log.append({
"action": action,
"timestamp": __import__('datetime').datetime.now().isoformat()
})
# Auto-approve low severity
if severity == ActionSeverity.LOW and self.auto_confirm_low:
return True, "auto-approved (low severity)"
# Maybe auto-approve medium
if severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
return True, "auto-approved (medium severity)"
# Request confirmation
approved = self.confirm_callback(action)
if approved:
return True, "user approved"
else:
return False, "user rejected"class ConfirmedComputerUseAgent: """ Computer use agent with confirmation gates. """
def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
self.agent = base_agent
self.gate = confirmation_gate
def execute_action(self, action: dict) -> dict:
"""Execute action with confirmation check."""
action_type = action.get("type", "unknown")
# Build description
if action_type == "click":
desc = f"Click at ({action.get('x')}, {action.get('y')})"
elif action_type == "type":
text = action.get('text', '')
# Mask if looks like password
if self._looks_sensitive(text):
desc = f"Type sensitive text ({len(text)} chars)"
else:
desc = f"Type: {text[:50]}..."
else:
desc = f"Execute: {action_type}"
# Context for severity classification
context = {
"involves_credentials": self._looks_sensitive(action.get("text", "")),
"involves_money": self._mentions_money(action),
}
# Check with gate
approved, reason = self.gate.check_action(
action_type, desc, context
)
if not approved:
return {
"success": False,
"error": f"Action blocked: {reason}",
"action": action_type
}
# Execute if approved
return self.agent.execute_action(action)
def _looks_sensitive(self, text: str) -> bool:
"""Check if text looks like sensitive data."""
if not text:
return False
# Common patterns
patterns = [
r'\b\d{16}\b', # Credit card
r'\b\d{3,4}\b.*\b\d{3,4}\b', # CVV-like
r'password',
r'secret',
r'api.?key',
r'token'
]
import re
return any(re.search(p, text.lower()) for p in patterns)
def _mentions_money(self, action: dict) -> bool:
"""Check if action involves money."""
text = str(action)
money_patterns = [
r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
r'credit', r'debit', r'invoice', r'payment'
]
import re
        return any(re.search(p, text.lower()) for p in money_patterns)

gate = ConfirmationGate(
    auto_confirm_low=True,
    auto_confirm_medium=False  # Confirm clicks, typing
)
agent = ConfirmedComputerUseAgent(base_agent, gate)
result = agent.execute_action({"type": "click", "x": 500, "y": 300})
All computer use agent actions should be logged, for audit trails, debugging failed automations, security review, and incident forensics.
Log format should capture: timestamp, action type and parameters (with sensitive values redacted), success or failure, before/after screenshots, the model's reasoning, and duration.
When to use: Production computer use deployments, Debugging automation failures, Security-sensitive environments
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Any
import json
import os

@dataclass
class ActionLogEntry:
    """Single action log entry."""
    timestamp: datetime
    action_type: str
    parameters: dict
    success: bool
    error: Optional[str] = None
    screenshot_before: Optional[str] = None  # Path to screenshot
    screenshot_after: Optional[str] = None
    model_reasoning: Optional[str] = None
    duration_ms: Optional[int] = None
def to_dict(self) -> dict:
return {
"timestamp": self.timestamp.isoformat(),
"action_type": self.action_type,
"parameters": self._sanitize_params(self.parameters),
"success": self.success,
"error": self.error,
"screenshot_before": self.screenshot_before,
"screenshot_after": self.screenshot_after,
"model_reasoning": self.model_reasoning,
"duration_ms": self.duration_ms
}
def _sanitize_params(self, params: dict) -> dict:
"""Remove sensitive data from params."""
sanitized = {}
sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']
for k, v in params.items():
if any(s in k.lower() for s in sensitive_keys):
sanitized[k] = "[REDACTED]"
elif isinstance(v, str) and len(v) > 100:
sanitized[k] = v[:100] + "...[truncated]"
else:
sanitized[k] = v
        return sanitized

@dataclass
class TaskSession:
    """A complete task execution session."""
    session_id: str
    task: str
    start_time: datetime
    end_time: Optional[datetime] = None
    actions: list[ActionLogEntry] = field(default_factory=list)
    success: bool = False
    final_result: Optional[str] = None
class ActionLogger:
    """
    Comprehensive action logging for computer use agents.
    """
def __init__(self, log_dir: str = "./agent_logs"):
self.log_dir = log_dir
self.screenshot_dir = os.path.join(log_dir, "screenshots")
os.makedirs(self.screenshot_dir, exist_ok=True)
self.current_session: Optional[TaskSession] = None
def start_session(self, task: str) -> str:
"""Start a new task session."""
import uuid
session_id = str(uuid.uuid4())[:8]
self.current_session = TaskSession(
session_id=session_id,
task=task,
start_time=datetime.now()
)
return session_id
def log_action(
self,
action_type: str,
parameters: dict,
success: bool,
error: Optional[str] = None,
screenshot_before: bytes = None,
screenshot_after: bytes = None,
model_reasoning: str = None,
duration_ms: int = None
):
"""Log a single action."""
if not self.current_session:
raise RuntimeError("No active session")
# Save screenshots if provided
screenshot_paths = {}
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
if screenshot_before:
path = os.path.join(
self.screenshot_dir,
f"{self.current_session.session_id}_{timestamp_str}_before.png"
)
with open(path, "wb") as f:
f.write(screenshot_before)
screenshot_paths["before"] = path
if screenshot_after:
path = os.path.join(
self.screenshot_dir,
f"{self.current_session.session_id}_{timestamp_str}_after.png"
)
with open(path, "wb") as f:
f.write(screenshot_after)
screenshot_paths["after"] = path
# Create log entry
entry = ActionLogEntry(
timestamp=datetime.now(),
action_type=action_type,
parameters=parameters,
success=success,
error=error,
screenshot_before=screenshot_paths.get("before"),
screenshot_after=screenshot_paths.get("after"),
model_reasoning=model_reasoning,
duration_ms=duration_ms
)
self.current_session.actions.append(entry)
# Also append to running log file
self._append_to_log(entry)
def _append_to_log(self, entry: ActionLogEntry):
"""Append entry to JSONL log file."""
log_file = os.path.join(
self.log_dir,
f"session_{self.current_session.session_id}.jsonl"
)
with open(log_file, "a") as f:
f.write(json.dumps(entry.to_dict()) + "\n")
def end_session(self, success: bool, result: str = None):
"""End current session."""
if not self.current_session:
return
self.current_session.end_time = datetime.now()
self.current_session.success = success
self.current_session.final_result = result
# Write session summary
summary_file = os.path.join(
self.log_dir,
f"session_{self.current_session.session_id}_summary.json"
)
summary = {
"session_id": self.current_session.session_id,
"task": self.current_session.task,
"start_time": self.current_session.start_time.isoformat(),
"end_time": self.current_session.end_time.isoformat(),
"duration_seconds": (
self.current_session.end_time -
self.current_session.start_time
).total_seconds(),
"total_actions": len(self.current_session.actions),
"successful_actions": sum(
1 for a in self.current_session.actions if a.success
),
"failed_actions": sum(
1 for a in self.current_session.actions if not a.success
),
"success": success,
"final_result": result
}
with open(summary_file, "w") as f:
json.dump(summary, f, indent=2)
self.current_session = None
def get_session_replay(self, session_id: str) -> list[dict]:
"""Get all actions from a session for replay/debugging."""
log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")
actions = []
with open(log_file, "r") as f:
for line in f:
actions.append(json.loads(line))
        return actions

class LoggedComputerUseAgent:
    """Computer use agent with comprehensive logging."""
def __init__(self, base_agent, logger: ActionLogger):
self.agent = base_agent
self.logger = logger
def run_task(self, task: str) -> dict:
"""Run task with full logging."""
session_id = self.logger.start_session(task)
try:
result = self._run_with_logging(task)
self.logger.end_session(
success=result.get("success", False),
result=result.get("result")
)
return result
except Exception as e:
self.logger.end_session(success=False, result=str(e))
raise
def _run_with_logging(self, task: str) -> dict:
"""Internal run with action logging."""
# This would wrap the base agent's run method
# and log each action
        pass

Severity: CRITICAL
Situation: Computer use agent browsing the web
Symptoms: Agent suddenly performs unexpected actions. Clicks malicious links. Enters credentials on phishing sites. Downloads files it shouldn't. Ignores your instructions and follows embedded commands instead.
Why this breaks: "While all agents that process untrusted content are subject to prompt injection risks, browser use amplifies this risk in two ways. First, the attack surface is vast: every webpage, embedded document, advertisement, and dynamically loaded script represents a potential vector for malicious instructions. Second, browser agents can take many different actions— navigating to URLs, filling forms, clicking buttons, downloading files— that attackers can exploit."
Real attacks have already happened:
Even a 1% attack success rate represents meaningful risk at scale.
Recommended fix:
Sandboxing (most effective):
# Docker with strict isolation - "--network none" means no internet at all
docker run \
  --security-opt no-new-privileges \
  --cap-drop ALL \
  --network none \
  --read-only \
  computer-use-agent

Classifier-based detection:
import re

def scan_for_injection(content: str) -> bool:
"""Detect prompt injection attempts."""
patterns = [
r"ignore.*instructions",
r"disregard.*previous",
r"new.*instructions",
r"you are now",
r"act as if",
r"pretend to be",
]
return any(re.search(p, content.lower()) for p in patterns)
# Check page content before processing
page_text = await page.text_content("body")
if scan_for_injection(page_text):
return {"error": "Potential injection detected"}User confirmation for sensitive actions:
SENSITIVE_ACTIONS = {"download", "submit", "login", "purchase"}
if action_type in SENSITIVE_ACTIONS:
if not await get_user_confirmation(action):
return {"error": "User rejected action"}Scoped credentials:
Severity: MEDIUM
Situation: Agent clicking on UI elements
Symptoms: Agent's clicks are detectable as non-human. Websites may block or CAPTCHA the agent. Anti-bot systems flag the interaction.
Why this breaks: "When a vision model identifies a button, it calculates the center. Click coordinates land at mathematically precise positions—often exact element centers or grid-aligned pixel values. Humans don't click centers; their click distributions follow a Gaussian pattern around targets."
The screenshot loop also creates detectable patterns: "Predictable pauses. Vision agents are completely still during their 'thinking' phase. The pattern looks like: Action → Complete stillness (1-5 seconds) → Action → Complete stillness → Action."
Sophisticated anti-bot systems detect: mathematically precise click coordinates, the rigid action/stillness rhythm of the screenshot loop, straight-line mouse movement, and uniform inter-action timing.
Recommended fix:
import random
import time
import pyautogui
def humanized_click(x: int, y: int) -> tuple[int, int]:
"""Add human-like variance to click coordinates."""
# Gaussian distribution around target
# Humans typically land within ~10px of target
x_offset = int(random.gauss(0, 5))
y_offset = int(random.gauss(0, 5))
return (x + x_offset, y + y_offset)
def humanized_delay():
"""Add human-like delay between actions."""
# Humans have variable reaction times
base_delay = random.uniform(0.3, 0.8)
# Occasionally longer pauses (reading, thinking)
if random.random() < 0.2:
base_delay += random.uniform(0.5, 2.0)
time.sleep(base_delay)
def humanized_movement(from_pos: tuple, to_pos: tuple):
"""Move mouse in curved path like human."""
# Bezier curve or similar
# Humans don't move in straight lines
steps = random.randint(10, 20)
for i in range(steps):
t = i / steps
# Simple curve approximation
x = from_pos[0] + (to_pos[0] - from_pos[0]) * t
y = from_pos[1] + (to_pos[1] - from_pos[1]) * t
# Add wobble
x += random.gauss(0, 2)
y += random.gauss(0, 2)
pyautogui.moveTo(int(x), int(y))
        time.sleep(0.01)

USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120...",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/...",
# ... more realistic agents
]
await page.set_extra_http_headers({
"User-Agent": random.choice(USER_AGENTS)
})

Severity: HIGH
Situation: Agent interacting with complex UI elements
Symptoms: Agent fails to select dropdown options. Scroll doesn't work as expected. Drag and drop completely fails. Hover menus disappear before clicking.
Why this breaks: "Computer Use currently struggles with certain interface interactions, particularly scrolling, dragging, and zooming operations. Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate."
Why these are hard:
Vision models see pixels, not DOM structure. They don't "know" that a dropdown is a dropdown - they have to infer from visual cues.
Recommended fix:
# Instead of clicking dropdown, use keyboard
async def select_dropdown_option(page, dropdown_selector, option_text):
# Focus the dropdown
await page.click(dropdown_selector)
await asyncio.sleep(0.3)
# Use keyboard to find option
await page.keyboard.type(option_text[:3]) # Type first letters
await asyncio.sleep(0.2)
await page.keyboard.press("Enter")# Instead of drag-and-drop
async def reliable_drag(page, source, target):
# Step 1: Click and hold
await page.mouse.move(source["x"], source["y"])
await page.mouse.down()
await asyncio.sleep(0.2)
# Step 2: Move in steps
steps = 10
for i in range(steps):
x = source["x"] + (target["x"] - source["x"]) * i / steps
y = source["y"] + (target["y"] - source["y"]) * i / steps
await page.mouse.move(x, y)
await asyncio.sleep(0.05)
# Step 3: Release
await page.mouse.move(target["x"], target["y"])
await asyncio.sleep(0.1)
    await page.mouse.up()

# If vision fails, try direct DOM manipulation
async def robust_select(page, select_selector, value):
try:
# Try vision approach first
await vision_agent.select(select_selector, value)
except Exception:
# Fall back to direct DOM
        await page.select_option(select_selector, value=value)

async def verified_scroll(page, direction):
# Get current scroll position
before = await page.evaluate("window.scrollY")
# Attempt scroll
await page.mouse.wheel(0, 500 if direction == "down" else -500)
await asyncio.sleep(0.3)
# Verify it worked
after = await page.evaluate("window.scrollY")
if before == after:
# Try alternative method
await page.keyboard.press("PageDown" if direction == "down" else "PageUp")Severity: MEDIUM
Situation: Automating any computer task
Symptoms: Task that takes human 1 minute takes agent 3-5 minutes. Users complain about speed. Timeouts occur.
Why this breaks: "The technology can be slow compared to human operators, often requiring multiple screenshots and analysis cycles."
Why so slow: every step requires a screenshot capture, an image upload, model inference (typically 1-5 seconds), and action execution - roughly 2-7 seconds per action.
A task requiring 20 actions takes 40-140 seconds minimum. Humans do the same actions in 20-30 seconds.
Recommended fix:
Computer use is for tasks with no API or structured alternative; when you must use it, optimize aggressively:
# 1. Reduce screenshot resolution
SCREEN_SIZE = (1280, 800) # Not 4K
# 2. Batch similar actions
# Instead of: type "hello", wait, type " world"
await page.type("hello world")
# 3. Parallelize independent tasks
# Run multiple sandboxed agents concurrently
# 4. Cache repeated computations
# If same screenshot, reuse analysis
# 5. Use smaller models for simple decisions
simple_model = "claude-haiku-..." # For "is task done?"
complex_model = "claude-sonnet-..." # For complex reasoning# Estimate task duration
def estimate_duration(task_complexity: str) -> int:
"""Estimate task duration in seconds."""
estimates = {
"simple": 30, # Single page, few actions
"medium": 120, # Multi-page, moderate actions
"complex": 300, # Many pages, complex interactions
}
return estimates.get(task_complexity, 120)
# Inform users
estimated = estimate_duration("medium")
print(f"Estimated completion: {estimated // 60}m {estimated % 60}s")Severity: HIGH
Situation: Long-running computer use tasks
Symptoms: Agent forgets earlier steps. Starts repeating actions. Errors increase as task progresses. Costs explode.
Why this breaks: Each screenshot is ~1500-3000 tokens. A task with 30 screenshots uses 45,000-90,000 tokens just for images - before any text.
Claude's context window is finite. When it fills, older messages must be dropped or summarized - and with them, the agent's memory of earlier steps.
"Getting agents to make consistent progress across multiple context windows remains an open problem. The core challenge is that they must work in discrete sessions, and each new session begins with no memory of what came before." - Anthropic engineering blog
Recommended fix:
class ContextManager:
"""Manage context window usage for computer use."""
MAX_SCREENSHOTS = 10 # Keep only recent screenshots
MAX_TOKENS = 100000
def __init__(self):
self.messages = []
self.screenshot_count = 0
def add_screenshot(self, screenshot_b64: str, description: str):
"""Add screenshot with automatic pruning."""
self.screenshot_count += 1
# Keep only recent screenshots
if self.screenshot_count > self.MAX_SCREENSHOTS:
self._prune_old_screenshots()
# Store with description for context
self.messages.append({
"role": "user",
"content": [
{"type": "text", "text": description},
{"type": "image", "source": {...}}
]
})
def _prune_old_screenshots(self):
"""Remove old screenshots, keep text summaries."""
new_messages = []
screenshots_kept = 0
for msg in reversed(self.messages):
            if self._has_image(msg):  # assumed helper: detects image blocks in a message
if screenshots_kept < self.MAX_SCREENSHOTS:
new_messages.insert(0, msg)
screenshots_kept += 1
else:
# Convert to text summary
summary = self._summarize_screenshot(msg)
new_messages.insert(0, {
"role": msg["role"],
"content": summary
})
else:
new_messages.insert(0, msg)
self.messages = new_messages
def _summarize_screenshot(self, msg) -> str:
"""Summarize screenshot to text."""
# Extract any text description
for content in msg.get("content", []):
if content.get("type") == "text":
return f"[Previous screenshot: {content['text']}]"
return "[Previous screenshot - details pruned]"
def add_checkpoint(self):
"""Create a checkpoint summary."""
        summary = self._create_progress_summary()  # assumed helper: condenses actions so far into text
self.messages.append({
"role": "user",
"content": f"CHECKPOINT: {summary}"
        })

async def run_with_checkpoints(task: str, checkpoint_every: int = 10):
"""Run task with periodic checkpoints."""
context = ContextManager()
step = 0
while not task_complete:
step += 1
# Take action...
if step % checkpoint_every == 0:
# Create checkpoint
context.add_checkpoint()
# Optional: persist to disk
            save_checkpoint(context, step)

# Instead of one 50-step task:
subtasks = [
"Navigate to the website and login",
"Find the settings page",
"Update the email address to ...",
"Save and verify the change"
]
for subtask in subtasks:
result = await agent.run(subtask)
if not result["success"]:
handle_error(subtask, result)
        break

Severity: HIGH
Situation: Running computer use at scale
Symptoms: API bill is 10x higher than expected. Single task costs $5+ instead of $0.50. Monthly costs reach thousands of dollars quickly.
Why this breaks: vision tokens are expensive - each screenshot costs roughly 1,500-3,000 tokens depending on resolution.
But it compounds: a complex task of 50 turns with an average of 25 images in context per turn means the model processes 1,250 image payloads. Add text tokens and a single task can easily hit $5-10.
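A back-of-envelope check of that figure, assuming Anthropic's roughly (width x height) / 750 image-token estimate and about $3 per million input tokens (both assumptions - check current pricing):

# Illustrative cost estimate for a long computer use task (assumed pricing)
TOKENS_PER_IMAGE = (1280 * 800) / 750    # ~1,365 tokens per screenshot
IMAGES_PROCESSED = 50 * 25               # 50 turns x ~25 images in context each
INPUT_COST_PER_M = 3.00                  # USD per million input tokens (assumed)

image_tokens = IMAGES_PROCESSED * TOKENS_PER_IMAGE             # ~1.7M tokens
cost = image_tokens / 1_000_000 * INPUT_COST_PER_M
print(f"~${cost:.2f} in image tokens alone, before any text")  # ~$5.12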
Recommended fix:
class CostTracker:
"""Track and limit computer use costs."""
# Anthropic pricing (approximate)
INPUT_COST_PER_1K = 0.003 # Text
OUTPUT_COST_PER_1K = 0.015
IMAGE_COST_PER_1K = 0.01 # Roughly
def __init__(self, max_cost_per_task: float = 1.0):
self.max_cost = max_cost_per_task
self.current_cost = 0.0
self.total_tokens = 0
def add_turn(
self,
input_tokens: int,
output_tokens: int,
image_tokens: int
):
"""Track cost of a single turn."""
cost = (
input_tokens / 1000 * self.INPUT_COST_PER_1K +
output_tokens / 1000 * self.OUTPUT_COST_PER_1K +
image_tokens / 1000 * self.IMAGE_COST_PER_1K
)
self.current_cost += cost
self.total_tokens += input_tokens + output_tokens + image_tokens
if self.current_cost > self.max_cost:
raise CostLimitExceeded(
f"Cost limit exceeded: ${self.current_cost:.2f} > ${self.max_cost:.2f}"
)
return cost
class CostLimitExceeded(Exception):
pass
# Usage
tracker = CostTracker(max_cost_per_task=2.0)
try:
for turn in turns:
tracker.add_turn(turn.input, turn.output, turn.images)
except CostLimitExceeded:
print("Task aborted due to cost limit")# 1. Lower resolution
SCREEN_SIZE = (1024, 768) # Smaller = fewer tokens
# 2. JPEG instead of PNG (when quality ok)
screenshot.save(buffer, format="JPEG", quality=70)
# 3. Crop to relevant region
def crop_relevant(screenshot: Image, focus_area: tuple):
"""Crop to area of interest."""
return screenshot.crop(focus_area)
# 4. Don't include screenshot every turn
if not needs_visual_update:
# Text-only turn
messages.append({"role": "user", "content": "Continue..."})async def tiered_model_selection(task_complexity: str):
"""Use appropriate model for task."""
if task_complexity == "simple":
return "claude-haiku-..." # Cheapest
elif task_complexity == "medium":
return "claude-sonnet-4-20250514" # Balanced
else:
return "claude-opus-4-5-..." # Best but expensiveSeverity: CRITICAL
Situation: Testing or deploying computer use
Symptoms: Agent deletes important files. Sends emails from your account. Posts on social media. Accesses sensitive documents.
Why this breaks: computer use agents make mistakes. They can delete the wrong files, send emails or messages on your behalf, post to social media, change system settings, and open sensitive documents.
Without sandboxing, these mistakes happen on your real system. There's no undo for "agent sent email to all contacts" or "agent deleted project folder."
"Autonomous agents that can access external systems and APIs introduce new security risks. They may be vulnerable to prompt injection attacks, unauthorized access to sensitive data, or manipulation by malicious actors."
Recommended fix:
# Minimum viable sandbox: Docker with restrictions
docker run -it --rm \
--security-opt no-new-privileges \
--cap-drop ALL \
--network none \
--read-only \
--tmpfs /tmp \
--memory 2g \
--cpus 1 \
  computer-use-sandbox

# Defense 1: Docker isolation
# Defense 2: Non-root user
# Defense 3: Network restrictions
# Defense 4: Filesystem restrictions
# Defense 5: Resource limits
# Defense 6: Action confirmation
# Defense 7: Action logging
from dataclasses import dataclass, field

@dataclass
class SandboxConfig:
docker_image: str = "computer-use-sandbox:latest"
network: str = "none" # or specific allowlist
readonly_root: bool = True
max_memory_mb: int = 2048
max_cpu: float = 1.0
max_runtime_seconds: int = 300
require_confirmation: list = field(default_factory=lambda: [
"download", "submit", "login", "delete"
])
    log_all_actions: bool = True

class SandboxedTestRunner:
    """Run tests in throwaway containers."""
"""Run tests in throwaway containers."""
async def run_test(self, test_task: str) -> dict:
# Spin up fresh container
container_id = await self.create_container()
try:
# Run task
result = await self.execute_in_container(container_id, test_task)
# Capture state for verification
state = await self.capture_container_state(container_id)
return {
"result": result,
"final_state": state,
"logs": await self.get_logs(container_id)
}
finally:
# Always destroy container
            await self.destroy_container(container_id)

Severity: ERROR
Computer use agents MUST run in sandboxed environments
Message: Computer use without sandboxing detected. Use Docker containers with restrictions.
Severity: ERROR
Sandboxed agents should have restricted network access
Message: Sandbox has full network access. Use --network=none or specific allowlist.
Severity: ERROR
Container agents should run as non-root user
Message: Container running as root. Add --user flag or USER directive in Dockerfile.
Severity: WARNING
Containers should drop unnecessary capabilities
Message: Container has full capabilities. Add --cap-drop ALL.
Severity: WARNING
Containers should use seccomp profiles for syscall filtering
Message: No security options set. Consider --security-opt seccomp:profile.json
Severity: WARNING
Computer use loops should have maximum step limits
Message: Infinite loop risk. Add max_steps limit (recommended: 50).
Severity: WARNING
Computer use should have timeout limits
Message: No timeout on execution. Add timeout (recommended: 5-10 minutes).
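A minimal way to enforce this, assuming an async agent entry point (illustrative, not part of the skill):

import asyncio

async def run_with_timeout(agent, task: str, timeout_s: float = 300) -> dict:
    """Abort the agent loop once it exceeds a wall-clock budget."""
    try:
        return await asyncio.wait_for(agent.run(task), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"success": False, "error": f"Timed out after {timeout_s}s"}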
Severity: WARNING
Containers should have memory limits to prevent DoS
Message: No memory limit on container. Add --memory 2g or similar.
Severity: WARNING
Computer use should track API costs
Message: No cost tracking. Monitor token usage to prevent bill surprises.
Severity: INFO
Consider adding cost limits per task
Message: Consider adding max_cost_per_task to prevent expensive runaway tasks.