Giving Claude Control of My Desktop
At 2am I decided to give Claude Code the ability to control my mouse, keyboard, and screen. The idea: an AI that can not only write code but actually use the computer — click buttons, switch apps, type into chat windows, navigate browsers.
Here’s how it went.
The Setup (Smooth)
Found kimaki/usecomputer — a TypeScript CLI that wraps macOS Quartz/CoreGraphics APIs via a Zig native addon. One npm install -g usecomputer and I had a CLI that could screenshot, click, type, scroll, and drag.
The CLI alone isn’t useful to Claude though. Claude Code needs MCP tools. So I built a wrapper:
```javascript
// ~320 lines of server.mjs
// Wraps every usecomputer command as an MCP tool
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new McpServer({ name: "usecomputer", version: "1.0.0" });

server.tool("screenshot", ..., async () => {
  // Take screenshot, return base64 image + coordMap
});
server.tool("click", ..., async ({ x, y, coord_map }) => {
  // Click with coordinate translation
});
// 13 tools total

await server.connect(new StdioServerTransport());
```
Dropped it in .mcp.json, restarted Claude Code, and suddenly I had 13 new tools: screenshot, click, double_click, type_text, press_key, scroll, hover, drag, mouse_position, clipboard_get, clipboard_set, window_list, display_list.
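The .mcp.json entry itself is tiny; a sketch of a stdio server entry, using the paths from my machine:

```json
{
  "mcpServers": {
    "usecomputer": {
      "command": "node",
      "args": ["/Users/wei/Local_Dev/projects/usecomputer-mcp/server.mjs"]
    }
  }
}
```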
Total setup time: maybe 10 minutes.
The coordMap System (Clever)
The screenshot tool returns an image and a coordMap string like 0,0,1710,1107,1568,1015. This maps screenshot pixel coordinates back to real screen coordinates, since the screenshot gets auto-scaled for AI vision (max 1568px longest edge on my Retina display).
When I tell Claude “click that button,” it takes a screenshot, visually identifies the button coordinates in the image, then passes the coordMap to the click tool so the coordinates translate correctly. It works remarkably well.
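The translation itself is just a linear rescale plus the display offset. A minimal sketch of the math, assuming the coordMap fields are offsetX, offsetY, screenW, screenH, imageW, imageH (the function name is mine, not the server's):

```javascript
// coordMap format (assumed): "offsetX,offsetY,screenW,screenH,imageW,imageH"
// e.g. "0,0,1710,1107,1568,1015" — screenshot pixels → logical screen points
function toScreen(imgX, imgY, coordMap) {
  const [offX, offY, screenW, screenH, imgW, imgH] = coordMap.split(",").map(Number);
  return {
    x: offX + Math.round(imgX * (screenW / imgW)),
    y: offY + Math.round(imgY * (screenH / imgH)),
  };
}
```

So a click at the bottom-right of the 1568x1015 screenshot lands at (1710, 1107) on the real display, whatever the scale factor.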
What Worked
Controlling other Claude Code sessions. This was the killer use case. I run multiple Claude Code instances in cmux (terminal multiplexer). Claude in session A could screenshot the cmux window, click on session B’s tab, type an instruction into its input, and hit Enter. Orchestrating AI agents from another AI agent.
Reading screen state. Claude could take a screenshot, understand what app was open, read text on screen, identify UI elements, and make decisions. “I see Signal is showing the Actually Intellectual Squad group chat” — just from a screenshot.
Creating institutional knowledge. I turned the whole thing into a /usecomputer skill — a SKILL.md file that teaches any future Claude Code session how to use these tools, including timing delays, common patterns, and gotchas. Self-documenting automation.
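A skill is just a markdown file with YAML frontmatter plus instructions; a sketch of the shape (the body here is illustrative, condensed from the gotchas described later in this post):

```markdown
---
name: usecomputer
description: Control the mouse, keyboard, and screen via the usecomputer MCP tools
---

## Timing
- Sleep 0.5s after every click or keystroke; 2s after submitting.

## Gotchas
- NEVER press Escape when cmux could receive it (it interrupts Claude Code).
- Switch apps with `open -a <App>` from Bash; cmd+tab via CGEvents is unreliable.
```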
Display awareness. display_list returns exact dimensions (1710x1107 logical, 2880x1864 Retina), window_list shows every open window with position and size. Claude can reason about the screen layout.
What Absolutely Didn’t Work
Focus management. This was the showstopper. cmux (where Claude Code runs) captures ALL keystrokes when it’s the foreground app. So when Claude tried to type a message into Signal, the keystrokes went to cmux instead. Every. Time.
The fix should be simple: switch apps first, then type. But cmd+tab via CGEvents is unreliable — the app switcher would appear and just… stay there. Had to fall back to open -a Signal via shell, which works but feels like duct tape.
Electron apps. Signal is an Electron app. Clicking on its message input field via CGEvents didn’t reliably give it focus. The click would register (I could see the cursor move) but the input field wouldn’t activate. Typed characters would either go nowhere or trigger keyboard shortcuts in the wrong context.
The Escape Key Incident. I pressed Escape trying to close Signal’s image viewer. But cmux was technically still processing the keystroke — and Escape in Claude Code means “interrupt the current operation.” Killed a running session. Had to add this to the skill in big bold letters: NEVER press Escape when cmux could receive it.
Image viewer hijacking. Signal’s image viewer is aggressive. If there’s a recent image in the chat, certain click coordinates or keyboard events would open the full-screen viewer instead of focusing the compose box. Once you’re in the viewer, typing does nothing useful.
Clipboard workaround. usecomputer clipboard set isn’t supported on macOS. Had to use echo -n "text" | pbcopy via Bash, then cmd+v to paste. Which requires the right app to be focused. Which brings us back to the focus problem.
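In Node, that workaround is a few lines of child_process. pipeTo is a hypothetical helper, not part of usecomputer; on macOS you'd point it at pbcopy:

```javascript
import { execFile } from "node:child_process";

// Pipe `text` into a command's stdin and resolve with its stdout.
// On macOS: await pipeTo("pbcopy", "text to paste"), then synthesize cmd+v.
function pipeTo(command, text) {
  return new Promise((resolve, reject) => {
    const child = execFile(command, (err, stdout) =>
      err ? reject(err) : resolve(stdout));
    child.stdin.end(text);
  });
}
```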
The timing tax. Every action needs a 500ms delay after it. Click, sleep 0.5. Type, sleep 0.5. Switch tab, sleep 0.7. Submit, sleep 2. Without delays, keystrokes get swallowed or misrouted. I baked 500ms delays into the MCP server’s press_key and type_text handlers, but it still makes the whole flow feel fragile.
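Baking the delay in looks something like this sketch (withSettle is a name I'm inventing; the real handlers may differ):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an action, then wait for the OS to route the synthetic event
// before the next one fires. Default settle time: 500ms.
async function withSettle(action, settleMs = 500) {
  const result = await action();
  await sleep(settleMs);
  return result;
}

// e.g. await withSettle(() => click(x, y)); await withSettle(() => typeText("hi"));
```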
The Verdict
Great for dev workflows. Switching cmux tabs, typing commands into other Claude sessions, reading build output, navigating terminal UIs — all reliable. The screenshot + coordMap + click pipeline works well for text-heavy, predictable interfaces.
Unreliable for GUI apps. Anything with complex focus models (Electron, native macOS apps with modal dialogs, image viewers) is a coin flip. CGEvents are low-level enough to move the mouse and press keys, but they can’t guarantee which window or element actually receives the input.
The vision is right, the plumbing isn’t there yet. What I really want is accessibility-tree-aware automation — not “click at pixel (730, 866)” but “focus Signal’s compose input and type.” That’s what Shortcuts/AppleScript can do for some apps, but not consistently across Electron + native + web.
For now, I’ll keep the MCP server wired up for dev-on-dev orchestration (Claude controlling other Claude sessions is genuinely useful) and wait for the GUI automation story to mature.
The Communication Problem (Solved)
Here’s something I didn’t anticipate: when Claude is controlling your screen, you can’t see its text output. It’s buried under whatever app it’s puppeting. So how does it talk to you?
I ended up building four communication channels into the skill, each for a different scenario:
say (text-to-speech). One line: say "Hey Wei, I need your help". macOS speaks it out loud. Works when I’m nearby but not looking at the screen. No clipboard, no Spotlight, no shortcuts — just the built-in say command.
TextEdit alert. When I’m at the computer but Claude has the screen, it opens TextEdit and types a message in large font. The “shout at the human” channel.
iMessage via AppleScript. This one blew my mind. Claude can send me an iMessage:
```applescript
tell application "Messages"
    set targetService to 1st account whose service type = iMessage
    set targetBuddy to participant "+1XXXXXXXXXX" of targetService
    send "Your message here" to targetBuddy
end tell
```
Push notification to my phone and Apple Watch. I was in the kitchen making food when the message popped up: “Mac parity audit is done. Found 2 crash bugs, zero SyncEngine code, 14 missing fields on Book model.” An AI just… texted me its findings.
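To send that AppleScript from a script rather than by hand, it can be passed to osascript -e. buildIMessageArgs is a hypothetical helper; the recipient here is a placeholder:

```javascript
import { execFile } from "node:child_process";

// Build the argv for `osascript` to send an iMessage via Messages.app.
function buildIMessageArgs(recipient, body) {
  const script = [
    'tell application "Messages"',
    "  set targetService to 1st account whose service type = iMessage",
    `  set targetBuddy to participant "${recipient}" of targetService`,
    `  send "${body.replace(/"/g, '\\"')}" to targetBuddy`,
    "end tell",
  ].join("\n");
  return ["-e", script];
}

// On macOS: execFile("osascript", buildIMessageArgs("+1XXXXXXXXXX", "Audit done"));
```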
Screenshot. The silent channel. Claude takes a screenshot to understand what’s on screen without interrupting anything.
The Real Power: Agent Orchestration
The usecomputer MCP was a fun experiment. But the real payoff came when I combined it with Claude Code’s agent spawning.
I told Claude: “read my recent git history across all projects, tell me what to work on, and I’ll be making food so figure out how to reach me.”
It did this:
- Scanned git logs across 7 projects
- Identified that Voxlight iOS hadn’t been touched in 15 days with a known SyncEngine bug
- Used say to speak its recommendations out loud while I was in the kitchen
- Spawned two background agents in parallel:
- One to investigate the iOS SyncEngine/highlighting bug
- One to audit the Mac reader against iOS for feature parity
- Went quiet while both agents worked for ~10 minutes
- Texted me the results via iMessage when each agent finished
The iOS agent read through 89 Swift source files and found the root cause: three different chapter indexing systems (spine index, TextChapter index, Mac Processor index) were colliding. The SyncEngine was filtering alignment words by spine index 3, but all the words were tagged with TextChapter index 0. Zero matches, zero highlights. It fixed it across 4 files.
The Mac agent compared 35 Mac files against 89 iOS files and produced a full parity report: 2 crash bugs (Swift 6 concurrency violations iOS already fixed), zero SyncEngine code behind the “Synced” reading mode button, 14 missing fields on the Book model, 35+ missing files total.
All while I was making food. The AI texted me its findings like a coworker on Slack.
What I Actually Learned
The usecomputer MCP itself is a mixed bag — great for dev workflows, unreliable for GUI apps. But it unlocked something bigger: Claude as an autonomous operations layer.
The pattern that emerged:
- Claude reads project state (git, code, configs)
- Claude decides what needs attention
- Claude spawns specialized agents to do deep work
- Claude reaches you through whatever channel works (voice, text, iMessage)
- You review findings and decide next steps
It’s not about controlling a mouse. It’s about giving AI enough surface area to be genuinely useful when you’re not actively typing prompts.
The Sprint Board Updates Itself Now
The last thing I built tonight was an /end-of-day skill. My homepage at bythewei.dev has a sprint board — a cork wall with sticky notes organized into columns (Shipped Code, Tests & Quality, Blockers, Docs, Marketing). Until now I updated it manually.
The end-of-day skill automates the whole thing:
- Scans git logs across all 9 project directories
- Reads the current sprint.json (the data file behind the sprint board)
- Archives every existing sticky into the "done pile" (with the date it was archived)
- Creates new stickies from today’s commits — green for shipped, orange for lessons, red for blockers
- Refreshes the stats bar (commits, projects touched, agents spawned, etc.)
- Updates the scope creep tracker
- Commits and pushes to main — Vercel auto-deploys
The sprint board becomes a living artifact. Each /end-of-day run creates a snapshot of what happened, archives the previous state, and publishes the new one. The “done pile” accumulates like archaeological layers — click “+N in the pile” to see previous sprints.
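I haven't shown sprint.json's real schema; a hedged sketch of the kind of shape the skill reads and writes (all field names and values illustrative):

```json
{
  "stats": { "commits": 23, "projectsTouched": 5, "agentsSpawned": 2 },
  "scopeCreep": 0,
  "stickies": [
    { "column": "Shipped Code", "color": "green", "text": "usecomputer MCP server" }
  ],
  "donePile": [
    { "archivedOn": "2026-02-28", "stickies": [] }
  ]
}
```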
Tonight’s first run archived 16 stickies from the Feb 28 sprint and replaced them with today’s work: the MCP server, Fireflies XR, the SyncEngine root cause, the iMessage communication channel, the Mac parity audit. The board went from three weeks stale to current in one command.
The scope creep counter reset to zero. Again. Because “just test mouse control” turned into an MCP server, four communication channels, two blog posts, two agent audits, a found root cause, and a self-updating sprint board. At 4am.
The Stack
```
usecomputer CLI (npm) → Zig N-API addon → macOS CoreGraphics
        ↑
MCP server (server.mjs, 320 lines) → Claude Code tools
        ↑
/usecomputer skill (SKILL.md) → any Claude session can use it
        ↑
Communication: say | TextEdit | iMessage | screenshot
        ↑
Agent orchestration: spawn background agents → report via iMessage
        ↑
/end-of-day skill → archive stickies, create new ones, push to Vercel
```
Config: ~/Local_Dev/.mcp.json
Server: ~/Local_Dev/projects/usecomputer-mcp/server.mjs
Skill: ~/.claude/skills/usecomputer/SKILL.md
Sprint data: ~/Local_Dev/projects/bythewei/src/data/sprint.json