This is the public benchmark layer for PM Agent Benchmark.
The goal is not to argue that one model is “smart”. The goal is to make PM agent quality visible and comparable.
It is the benchmark layer of the broader PM Operating System.
Can the agent choose the right command or skill for the task stage?
Can the agent produce a result that matches the repository’s professional output standards?
High-value PM work where shallow prompting often fails:
Every public run should publish:
| Date | Platform | Model | Adapter | Cases | Routing | Output | Total | Notes |
|---|---|---|---|---|---|---|---|---|
| [date] | [platform] | [model] | [adapter] | [count] | [0-3] | [0-7] | [0-10] | [key failure pattern] |
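
As an illustration of the row format above, here is a minimal Python sketch of how one run could be validated and rendered as a leaderboard row. It is not part of the benchmark tooling: the `RunResult` class, its field names, and `to_markdown_row` are hypothetical, and it assumes that Total is simply Routing (0-3) plus Output (0-7), which the column ranges suggest but the text does not state.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One leaderboard row (hypothetical structure mirroring the table columns)."""
    date: str
    platform: str
    model: str
    adapter: str
    cases: int
    routing: float  # expected range 0-3
    output: float   # expected range 0-7
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject scores outside the ranges shown in the table template.
        if not 0 <= self.routing <= 3:
            raise ValueError("routing must be between 0 and 3")
        if not 0 <= self.output <= 7:
            raise ValueError("output must be between 0 and 7")

    @property
    def total(self) -> float:
        # Assumption: Total (0-10) is the sum of Routing and Output.
        return self.routing + self.output

    def to_markdown_row(self) -> str:
        # Render the cells in the same order as the table header.
        cells = [
            self.date, self.platform, self.model, self.adapter,
            str(self.cases), f"{self.routing:g}", f"{self.output:g}",
            f"{self.total:g}", self.notes,
        ]
        return "| " + " | ".join(cells) + " |"
```

A call such as `RunResult("[date]", "[platform]", "[model]", "[adapter]", 5, 2, 5, "[key failure pattern]").to_markdown_row()` would produce a row ready to append under the header; the numeric values here are placeholders, not published results.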
The first visible baseline was published on 2026-04-18.
It is a self-run baseline from the current Codex App session, not an independent external benchmark.
That is acceptable as a starting point because it includes:
If this repository becomes known only as a content library, it will stay interchangeable.
If it becomes known as the benchmark that defines strong PM agent routing and output quality, it becomes category infrastructure.