Published May 10, 2026 | Version 0.1.1-bilingual
Preprint
Open
macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents
Authors/Creators
Description
The first publicly released macOS-native computer-use benchmark. It comprises 369 task slots across 15 categories (Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app), an agent-agnostic Go runner, dual scoring (IMPLEMENTED + STRICT), and per-task PID-snapshot isolation. First reference run: kinclaw v1.15.0 + Kimi-K2.5 = 67.3% IMPLEMENTED. The paper documents the full 49.3% -> 62% -> 67.3% debugging trajectory as a methodology contribution.
Note (2026-05-09): This version bundles the English and Chinese (中文) texts in a single PDF (English first, then Chinese), generated directly from the canonical Markdown source files.
v0.1.1 (2026-05-10): Adds §6.7 "The platform ceiling — separating agent capability from environmental limits." Notes category: 21/31 → 31/31 fully implemented. 5 eval bug fixes. New tools/reference_verifier.sh runs every Notes task with canonical osascript/shell solutions in ~100s — establishes the platform ceiling (21/31 = 67.7%) as upper bound on any agent's score, decomposing the 10 unreachable tasks into 4 platform-locked categories.
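The platform-ceiling arithmetic in the changelog can be checked with a short sketch. The task counts (31 implemented Notes tasks, 21 reachable by the reference solutions) come from the note above. The helper names `rate` and `pct` and the agent result of 18 passed tasks are hypothetical, chosen only for illustration. Dividing an agent's passes by the number of reachable tasks is one plausible way to separate agent capability from environmental limits, not necessarily the normalization the paper uses.

```python
from fractions import Fraction


def rate(passed: int, total: int) -> Fraction:
    """Exact pass rate passed/total, validating the counts."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError(f"invalid counts: {passed}/{total}")
    return Fraction(passed, total)


def pct(x: Fraction) -> float:
    """Render a fraction as a percentage rounded to one decimal place."""
    return round(float(x) * 100, 1)


NOTES_TASKS = 31      # Notes tasks implemented in v0.1.1 (from the changelog)
NOTES_REACHABLE = 21  # tasks the reference solutions complete (from the changelog)

# Upper bound on any agent's Notes score under the current platform limits.
ceiling = rate(NOTES_REACHABLE, NOTES_TASKS)

# Hypothetical agent result, for illustration only; not a reported number.
agent_passed = 18

raw = rate(agent_passed, NOTES_TASKS)             # score over all Notes tasks
vs_ceiling = rate(agent_passed, NOTES_REACHABLE)  # score over reachable tasks only

print(f"platform ceiling: {pct(ceiling)}%")
print(f"agent raw score: {pct(raw)}%")
print(f"agent score relative to ceiling: {pct(vs_ceiling)}%")
```

Reporting both numbers keeps the raw score comparable across benchmark versions while showing how much of the remaining gap an agent could close at all.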
Files
| Name | Size |
|---|---|
| macbench.pdf (md5:3bc64bde1401f670a36d2d5b9611cbb5) | 1.5 MB |
Additional details
Related works
- Is supplemented by
  - Software: https://github.com/LocalKinAI/macbench
  - Software: https://github.com/LocalKinAI/kinclaw