Thirteen Win32 actions over a private Tailscale tunnel — so AI agents can operate a Windows desktop.
Hermes Agent
Cloud AI agents have no native way to interact with a Windows desktop. Tasks that require a GUI, such as filling in an enterprise portal, reading a locally installed dashboard, or clicking through a Windows-only workflow, are out of reach for server-side agents. The macOS computer-use driver existed; Windows had no equivalent.
Exposing a local Windows machine to a remote EC2 server is a networking problem with no clean conventional solution. Port forwarding requires router access and a static IP. SSH tunnels are fragile. Any approach that opened a port directly on the public internet was a non-starter. The connection layer needed to be encrypted, authenticated, and capable of NAT traversal with zero router configuration.
AI computer-use is only as good as its targeting precision. A screenshot plus a best-guess pixel coordinate fails on dense, dynamically-laid-out UIs. The agent needed to see interactive elements by name and role — not just pixels — so it could click a button by label rather than by coordinate, and handle layouts that change between runs.
- AI agent runs on EC2 server
- Cannot interact with Windows desktop at all
- GUI tasks completely out of reach
- No secure tunnel to the local machine
- Only API-accessible software automatable
- 13 desktop actions as tool calls
- Tailscale tunnel: encrypted, zero public ports
- Click by label, not pixel coordinate
Any Hermes agent conversation can now reach a Windows desktop. The agent issues a tool call (screenshot, click, type, scroll, focus window) and the action executes on the target machine within milliseconds, routed through a Tailscale tunnel that requires no router configuration or port forwarding and exposes zero public ports. The Windows machine is completely unreachable from the public internet; only enrolled Tailscale peers can connect.
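The "only enrolled Tailscale peers can connect" guarantee comes from Tailscale itself, but a desktop-side service could add a defensive check of its own. The sketch below is an illustration under stated assumptions, not the project's actual code: it assumes the service can see the caller's IP address, and it relies on Tailscale assigning tailnet IPv4 addresses from the 100.64.0.0/10 range by default. The name `is_tailnet_address` is hypothetical.

```python
import ipaddress

# Tailscale hands out tailnet IPv4 addresses from this range by default.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_address(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside the tailnet range.

    Hypothetical helper: malformed input and IPv6 addresses are rejected.
    """
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    # Membership is always False when the IP versions differ.
    return ip in TAILNET_V4


if __name__ == "__main__":
    for candidate in ("100.101.102.103", "192.168.1.10", "8.8.8.8"):
        verdict = "accept" if is_tailnet_address(candidate) else "reject"
        print(candidate, verdict)
```

Because 100.64.0.0/10 is shared carrier-grade NAT address space, a check like this is defence in depth rather than authentication; peer identity is still established by the tunnel.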
The agent sees what a user would see. Every screenshot request can return a full list of interactive UI elements (their names, roles, and on-screen positions), so the agent clicks a button by its label rather than hunting for pixel coordinates. Non-ASCII text is typed correctly. Windows that are minimised or hidden behind other windows can still be captured. Drags are animated across multiple intermediate steps to prevent frame-skip failures.
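Two of the ideas in this paragraph, clicking by label and multi-step drags, can be sketched in a few lines. This is a minimal sketch and not the project's implementation: the `UIElement` shape, the `find_by_label` and `drag_path` names, and the default step count are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """Hypothetical record for one interactive element in a screenshot."""
    name: str
    role: str
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple:
        # Click target: the midpoint of the bounding box.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def find_by_label(elements, label, role=None):
    """Return the single element whose name matches label (case-insensitive)."""
    wanted = label.casefold()
    matches = [
        e for e in elements
        if e.name.casefold() == wanted and (role is None or e.role == role)
    ]
    if len(matches) != 1:
        # Refuse to guess when the label is missing or ambiguous.
        raise LookupError(f"expected one match for {label!r}, found {len(matches)}")
    return matches[0]


def drag_path(start, end, steps=10):
    """Linearly interpolate intermediate pointer positions for a drag."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]


if __name__ == "__main__":
    elements = [
        UIElement("Submit", "Button", 100, 200, 180, 230),
        UIElement("Cancel", "Button", 200, 200, 280, 230),
    ]
    print(find_by_label(elements, "submit").center())
    print(drag_path((0, 0), (100, 50), steps=5))
```

Resolving the label at call time means the click still lands when the layout shifts between runs, since the coordinates come from the current element list and not from a stored position.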
Thirteen actions are available as first-class tools in any Hermes conversation: capture, screenshot, click, double-click, right-click, type, key, scroll, drag, list apps, list windows, focus app, and set-of-marks element targeting. The agent can open applications, navigate interfaces, fill forms, and extract information from Windows desktop software that has no public API, which extends automation well beyond API-accessible software.
- Full-screen screenshot
- UI element tree (14 types)
- Element names & bounding boxes
- Works on hidden windows
- Click, double-click, right-click
- Type text (Unicode-safe)
- Key combos (Ctrl+C, Alt+Tab)
- Scroll and drag-and-drop
- List running applications
- Focus any window by title
- Set-of-Marks element targeting
- Tailscale-only — zero public ports
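As a rough illustration of how the thirteen actions named earlier could be exposed as a closed set of tool calls, the sketch below validates an incoming call before anything touches the desktop. The identifier spellings (for example `double_click`), the `{"action": ..., "args": ...}` payload shape, and the `parse_tool_call` name are assumptions; the source does not specify the wire format.

```python
from enum import Enum


class Action(str, Enum):
    """The thirteen actions, with hypothetical snake_case identifiers."""
    CAPTURE = "capture"
    SCREENSHOT = "screenshot"
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    RIGHT_CLICK = "right_click"
    TYPE = "type"
    KEY = "key"
    SCROLL = "scroll"
    DRAG = "drag"
    LIST_APPS = "list_apps"
    LIST_WINDOWS = "list_windows"
    FOCUS_APP = "focus_app"
    SET_OF_MARKS = "set_of_marks"


def parse_tool_call(call: dict):
    """Validate an assumed {"action": ..., "args": {...}} payload."""
    try:
        action = Action(call["action"])
    except (KeyError, ValueError) as exc:
        raise ValueError(f"unsupported action: {call.get('action')!r}") from exc
    args = call.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be an object")
    return action, args


if __name__ == "__main__":
    action, args = parse_tool_call({"action": "click", "args": {"label": "Submit"}})
    print(action.value, args)
```

Rejecting unknown action names up front keeps the desktop-side surface limited to the published set.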