This page is the public scoreboard for PM Agent Benchmark runs.
It should stay small and credible.
The first visible baseline has now been published.
It is a self-run baseline, not an independent external benchmark.
That is acceptable for a first public seed, as long as the limitation stays explicit.
This page becomes materially more persuasive in this order:
Right now this page is still at step 1. That is enough to start the category. It is not enough to claim benchmark leadership yet.
| Date | Platform | Model | Adapter | Cases | Routing | Output | Total | Notes |
|---|---|---|---|---|---|---|---|---|
| 2026-04-18 | Codex App | GPT-5 Codex runtime | AGENTS.md + agent/ | 4 | 12 / 12 | 25 / 28 | 37 / 40 | self-run baseline; sparse-context can still structure too early |
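
Before adding a row, it can help to check that its score cells agree with each other. The sketch below assumes that Total is the sum of Routing and Output, for both points earned and points possible. The published row is consistent with that reading (12 + 25 = 37 and 12 + 28 = 40), but the scoring rule is not stated on this page, so treat it as an assumption. The function names `parse_score` and `total_is_consistent` are hypothetical helpers, not part of the benchmark pack.

```python
def parse_score(cell: str) -> tuple[int, int]:
    """Parse a score cell such as '25 / 28' into (earned, possible)."""
    earned, possible = (int(part.strip()) for part in cell.split("/"))
    return earned, possible


def total_is_consistent(routing: str, output: str, total: str) -> bool:
    """Return True if Total equals Routing + Output.

    Assumption: Total is a plain sum of the two sub-scores, for both
    the earned and the possible points.
    """
    r_earned, r_possible = parse_score(routing)
    o_earned, o_possible = parse_score(output)
    return (r_earned + o_earned, r_possible + o_possible) == parse_score(total)


# Score cells copied from the published baseline row.
row = {"Routing": "12 / 12", "Output": "25 / 28", "Total": "37 / 40"}
print(total_is_consistent(row["Routing"], row["Output"], row["Total"]))
```

A check like this only catches arithmetic slips in a submitted row. It says nothing about whether the run itself was scored fairly.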
Use the pack in First Public Run.
The first run should cover:
Do not compare totals alone.
Always compare:
The next useful row is not “another self-congratulatory score.”
It is one of these: