
Claude Computer Use Tool: The Complete Developer Guide to AI Desktop Automation

By the Save Team

Tags: claude, ai, computer-use, automation, api, developer, anthropic

What if you could tell an AI to “open Firefox, navigate to a website, fill out the form, and save the result” — and it actually did it? Not through a brittle Selenium script. Not through a custom API integration. Just… by looking at the screen and using a mouse and keyboard like a human would.

That’s exactly what Claude’s computer use tool does.

What Is Computer Use?

Computer use is a beta API feature that lets Claude interact with desktop environments through:

  • Screenshot capture — Claude sees what’s on screen
  • Mouse control — clicking, dragging, scrolling
  • Keyboard input — typing text, pressing shortcuts
  • Desktop automation — interacting with any application

The key word is any. Unlike traditional automation (Selenium for browsers, AppleScript for macOS), Claude doesn’t need special APIs or element selectors. It looks at pixels on a screen and decides what to click. Just like you do.

How It Works (The Agent Loop)

Computer use follows a simple cycle:

  1. You send Claude a task — “Save a picture of a cat to my desktop”
  2. Claude requests a tool action — “Take a screenshot”
  3. Your app executes it — captures the screen, returns the image
  4. Claude analyzes and requests the next action — “Click at coordinates (500, 300)”
  5. Repeat until the task is done

This cycle is called the agent loop. Claude keeps requesting actions (screenshot, click, type, scroll) and your application keeps executing them, until Claude determines the task is complete.

Here’s the minimal API call to get started:

import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20251124",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
        {"type": "text_editor_20250728", "name": "str_replace_based_edit_tool"},
        {"type": "bash_20250124", "name": "bash"},
    ],
    messages=[{
        "role": "user",
        "content": "Save a picture of a cat to my desktop."
    }],
    betas=["computer-use-2025-11-24"],
)

The beta header "computer-use-2025-11-24" is required. The three tools (computer, text editor, bash) work together to give Claude full control over the environment.
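When Claude wants to act, the response's `stop_reason` is `"tool_use"` and the content includes one or more `tool_use` blocks. A minimal sketch of pulling those out (plain dicts stand in for the SDK's typed content blocks, so the logic is easy to follow):

```python
# Sketch: finding the tool actions Claude requested. The real SDK returns
# typed objects; dicts are used here to illustrate the shape.

def extract_tool_calls(content_blocks):
    """Return the tool_use blocks Claude emitted, in order."""
    return [b for b in content_blocks if b.get("type") == "tool_use"]

# Example of what Claude sends back when it wants a screenshot:
content = [
    {"type": "text", "text": "I'll take a screenshot first."},
    {"type": "tool_use", "id": "toolu_01", "name": "computer",
     "input": {"action": "screenshot"}},
]

calls = extract_tool_calls(content)
```

Each `tool_use` block carries an `id` you must echo back in the matching `tool_result`, which is what ties results to requests in the agent loop.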

Available Actions

The computer use tool supports a rich set of interactions:

Basic Actions

  • screenshot — capture the current display
  • left_click — click at [x, y] coordinates
  • type — type a text string
  • key — press a key or combo (e.g., ctrl+s, alt+tab)
  • mouse_move — move the cursor

Enhanced Actions (Claude 4.x Models)

  • scroll — scroll in any direction with amount control
  • left_click_drag — click and drag between coordinates
  • right_click, middle_click — additional mouse buttons
  • double_click, triple_click — multi-clicks
  • hold_key — hold a key for a duration
  • wait — pause between actions

Newest Addition: Zoom

Available on Claude Opus 4.6, Sonnet 4.6, and Opus 4.5:

  • zoom — inspect a specific screen region at full resolution

This is particularly useful when Claude needs to read small text or identify fine UI details.
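To make the action list concrete, here are illustrative tool inputs following the documented `action` + parameters pattern. Treat the exact field shapes as indicative; check the current API reference for your model version:

```python
# Illustrative computer-tool inputs. Field names follow the documented
# "action" plus parameters pattern; exact schemas may vary by tool version.
example_actions = [
    {"action": "screenshot"},
    {"action": "left_click", "coordinate": [500, 300]},
    {"action": "type", "text": "hello world"},
    {"action": "key", "text": "ctrl+s"},
    {"action": "scroll", "coordinate": [512, 400],
     "scroll_direction": "down", "scroll_amount": 3},
]

# A dispatcher in your app only needs the action name to route the request:
action_names = [a["action"] for a in example_actions]
```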

The Computing Environment

Claude doesn’t directly connect to your computer. You need to provide a sandboxed environment — typically a Docker container running:

  • Virtual display — Xvfb (X Virtual Framebuffer) renders the desktop
  • Desktop environment — a lightweight window manager like Mutter
  • Applications — Firefox, LibreOffice, file managers, etc.
  • Tool implementations — code that translates Claude’s requests into actual mouse/keyboard operations

Anthropic provides a reference implementation with all of this pre-configured in Docker. It’s the fastest way to get started.
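Launching the quickstart image looks roughly like this (image name, mount path, and port mappings follow the anthropic-quickstarts README; verify against the current repo before relying on them):

```shell
# Run the reference computer-use environment. Ports: 5900 (VNC),
# 8501 (Streamlit UI), 6080 (noVNC), 8080 (combined interface).
export ANTHROPIC_API_KEY=your-key-here

docker run \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $HOME/.anthropic:/home/computeruse/.anthropic \
  -p 5900:5900 -p 8501:8501 -p 6080:6080 -p 8080:8080 \
  -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
```

Once the container is up, you get a browser-viewable virtual desktop and a chat UI wired into the agent loop.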

Building the Agent Loop

Here’s a simplified agent loop that handles the back-and-forth:

def agent_loop(task: str, max_iterations: int = 10):
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]

    tools = [
        {
            "type": "computer_20251124",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        },
        {"type": "text_editor_20250728", "name": "str_replace_based_edit_tool"},
        {"type": "bash_20250124", "name": "bash"},
    ]

    for _ in range(max_iterations):
        response = client.beta.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            messages=messages,
            tools=tools,
            betas=["computer-use-2025-11-24"],
        )

        messages.append({"role": "assistant", "content": response.content})

        # Extract tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        if not tool_results:
            return messages  # No tool calls means the task is complete

        messages.append({"role": "user", "content": tool_results})

    return messages  # Hit the iteration limit without finishing

The execute_tool function is where you wire up the actual screen capture, mouse clicks, and keyboard input to your computing environment.
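One way to structure `execute_tool` is a dispatcher keyed on the action name. In this sketch the handlers are stubs that record what they would do; in a real environment they'd call into an input library such as xdotool or pyautogui (those libraries are suggestions, not requirements):

```python
# Sketch of an execute_tool dispatcher. Handlers are stubs returning
# result payloads; wire them to real screen/input libraries in your
# own environment.
actions_log = []

def do_screenshot(args):
    actions_log.append("screenshot")
    # Real impl: capture the display, base64-encode, return an image block.
    return [{"type": "text", "text": "screenshot captured"}]

def do_left_click(args):
    x, y = args["coordinate"]
    actions_log.append(f"click({x},{y})")
    return [{"type": "text", "text": f"clicked at ({x}, {y})"}]

def do_type(args):
    actions_log.append(f"type({args['text']})")
    return [{"type": "text", "text": "typed"}]

HANDLERS = {
    "screenshot": do_screenshot,
    "left_click": do_left_click,
    "type": do_type,
}

def execute_tool(name, tool_input):
    if name != "computer":
        return [{"type": "text", "text": f"unsupported tool: {name}"}]
    handler = HANDLERS.get(tool_input.get("action"))
    if handler is None:
        return [{"type": "text", "text": f"unknown action: {tool_input.get('action')}"}]
    return handler(tool_input)
```

Returning a descriptive error payload for unknown actions (rather than raising) lets Claude see the failure and adjust its next request.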

Coordinate Scaling: The Gotcha

The API constrains images to a maximum of 1568px on the longest edge. If your display is larger (say, 1512x982), screenshots get downsampled — but Claude returns coordinates based on the smaller image.

You must scale coordinates back up:

import math

def get_scale_factor(width, height):
    long_edge = max(width, height)
    total_pixels = width * height
    long_edge_scale = 1568 / long_edge
    total_pixels_scale = math.sqrt(1_150_000 / total_pixels)
    return min(1.0, long_edge_scale, total_pixels_scale)

scale = get_scale_factor(1512, 982)

# When Claude says "click at (450, 300)", scale it up:
def execute_click(x, y):
    screen_x = x / scale
    screen_y = y / scale
    perform_click(screen_x, screen_y)

Skipping this step means Claude’s clicks will miss their targets. This is the single most common implementation bug.
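Plugging the 1512x982 display from above into the helper makes the effect concrete (the function is redefined here so the example is self-contained; numbers are rounded):

```python
import math

def get_scale_factor(width, height):
    long_edge_scale = 1568 / max(width, height)
    total_pixels_scale = math.sqrt(1_150_000 / (width * height))
    return min(1.0, long_edge_scale, total_pixels_scale)

# For 1512x982, the ~1.15M-pixel budget dominates, giving scale ~0.88,
# so Claude sees a roughly 1331x864 screenshot.
scale = get_scale_factor(1512, 982)

# Claude's click at (450, 300) must map back to real screen pixels:
screen_x = round(450 / scale)   # ~511
screen_y = round(300 / scale)   # ~341
```

A 60-pixel miss like this is easily enough to land on the wrong button, which is why unscaled clicks look like Claude "can't aim".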

Prompting Tips for Better Results

Computer use works best with clear, structured prompts:

  1. Be specific. “Open Firefox, go to example.com, and click the Login button” works better than “log in to the site.”

  2. Ask Claude to verify. Add this to your prompt: “After each step, take a screenshot and evaluate if you achieved the right outcome. Only move on when confirmed.”

  3. Use keyboard shortcuts. Dropdowns and scrollbars can be tricky to click. Prompt Claude to use Tab, Enter, and arrow keys instead.

  4. Provide examples. For repeatable tasks, include example screenshots and expected tool calls in your prompt.

  5. Use XML tags for credentials. If Claude needs to log in, pass credentials in <robot_credentials> tags. But be careful — prompt injection risks are higher when Claude is interacting with untrusted content.
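Tips 2 and 5 can be combined when composing the task prompt. A minimal sketch, where the helper name, wording, and tag contents are all illustrative:

```python
# Sketch: composing a task prompt with a verification instruction and
# credentials wrapped in XML tags. Names and wording are illustrative.
def build_task_prompt(task, username, password):
    return (
        f"{task}\n\n"
        "After each step, take a screenshot and evaluate whether you "
        "achieved the right outcome. Only move on once confirmed.\n\n"
        "<robot_credentials>\n"
        f"username: {username}\n"
        f"password: {password}\n"
        "</robot_credentials>"
    )

prompt = build_task_prompt("Log in to the staging dashboard.", "qa-bot", "s3cret")
```

Remember that anything in the prompt, credentials included, becomes part of the conversation Claude can repeat, so use throwaway or scoped accounts.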

Security: Take It Seriously

Computer use has unique security risks:

  • Prompt injection through screen content. Claude reads everything on screen. A malicious webpage could display instructions that override your prompt.
  • Autonomous actions. Claude might click links, accept dialogs, or navigate away from where you intended.
  • Credential exposure. If Claude can see passwords or tokens on screen, they become part of the conversation.

Anthropic has built-in classifiers that flag potential prompt injections in screenshots. But the best defense is isolation:

  • Run in a dedicated VM or Docker container with minimal privileges
  • Don’t give access to sensitive accounts without oversight
  • Limit internet access to an allowlist of domains
  • Require human confirmation for consequential actions (purchases, account creation, etc.)
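The last point can be enforced in code with a guard layer between the agent loop and tool execution. A minimal sketch, where the action names, `execute`, and `confirm` callbacks are assumptions to adapt to your app:

```python
# Sketch of a guardrail: consequential actions need explicit human
# approval before execution. Action names here are higher-level intents
# your app assigns, not raw computer-tool actions.
CONSEQUENTIAL = {"purchase", "submit_form", "delete", "send_email"}

def guarded_execute(action_name, tool_input, execute, confirm):
    """Run execute() only if the action is safe or a human approves it."""
    if action_name in CONSEQUENTIAL and not confirm(action_name, tool_input):
        return {"status": "blocked", "reason": f"{action_name} not confirmed"}
    return execute(tool_input)

# Example with a confirm callback that always denies:
blocked = guarded_execute(
    "purchase", {"item": "laptop"},
    execute=lambda inp: {"status": "ok"},
    confirm=lambda name, inp: False,
)
```

Returning a "blocked" result back into the conversation, rather than silently dropping the call, lets Claude explain itself or pick a safer path.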

What To Build With It

Computer use is best for tasks where speed isn’t critical but automation is valuable:

  • Automated testing — test any desktop application, not just web apps
  • Data collection — navigate websites and extract information
  • Legacy system integration — automate workflows in apps that have no API
  • Form filling — populate web forms across multiple sites
  • Research workflows — search, read, and compile information from the web
  • QA & monitoring — verify that UIs render correctly

For research and data collection workflows, tools like Save complement computer use well — once Claude navigates to a page, converting it to clean Markdown gives you structured, AI-ready content instead of raw screenshots.

Current Limitations

Be aware of these beta limitations:

  • Latency. Each action requires an API call, screenshot capture, and response. It’s slower than a human clicking around.
  • Vision accuracy. Claude can misread small text or misidentify UI elements. The new zoom action helps, but it’s not perfect.
  • Scrolling. Improved significantly in recent versions, but complex scroll interactions can still be unreliable.
  • Spreadsheets. Cell selection is tricky. Use keyboard navigation when possible.
  • No account creation on social platforms. Claude intentionally won’t create accounts or impersonate humans on social media.

Pricing

Computer use follows standard tool use pricing:

  • System prompt overhead: 466-499 tokens
  • Tool definition: 735 tokens per tool (for Claude 4.x)
  • Screenshots: billed as vision tokens (varies by resolution)
  • Each API call in the agent loop is a separate billable request

For a typical 10-step task, expect to use 15,000-50,000 tokens depending on screenshot sizes and response complexity.
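A back-of-envelope estimate shows where those tokens go. The `(width * height) / 750` image-token rule of thumb comes from Anthropic's vision documentation; the per-step text allowance below is a rough assumption, not a measured figure:

```python
# Rough token budget for a 10-step computer-use task.
def screenshot_tokens(width, height):
    # Anthropic's rule of thumb for image token cost.
    return (width * height) / 750

per_shot = screenshot_tokens(1024, 768)   # ~1049 tokens per screenshot

steps = 10
system_overhead = 500      # upper end of the quoted range
tool_defs = 3 * 735        # three tools at ~735 tokens each
per_step_text = 500        # rough allowance for Claude's reasoning text

total = system_overhead + tool_defs + steps * (per_shot + per_step_text)
# Lands around 18k tokens, near the low end of the 15k-50k range above.
```

Screenshots dominate the bill, which is one reason to keep the virtual display small (1024x768 rather than your native resolution).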

Getting Started

  1. Try the reference implementation. Clone anthropic-quickstarts, run the Docker container, and experiment.
  2. Start with simple tasks. “Open a text editor, type Hello World, save the file.” Get the agent loop working before attempting complex workflows.
  3. Add guardrails. Set iteration limits. Validate coordinates. Log every action. Add human confirmation for anything irreversible.
  4. Optimize your prompts. The better your instructions, the fewer iterations Claude needs — and the lower your token costs.

Computer use represents a fundamental shift in what’s possible with AI APIs. Instead of building custom integrations for every application, you can give Claude the same interface humans use — a screen, a mouse, and a keyboard — and let it figure out the rest.

The future of automation isn’t more APIs. It’s AI that can use the interfaces we already have.