This page defines the first public benchmark run for the repository.
Do not expand this run too early. The goal is not volume. The goal is a clean first proof point.
The first public run should prove 3 things:
Use these 4 cases only:
Why these 4:
- 01 tests AI feature workflow routing
- 05 tests sparse-context discipline
- 07 tests PRD decomposition instead of shallow document dumping
- 08 tests commercialization reasoning where generic AI often fails

Every first-run publication must include:
Use one summary table:
| Date | Platform | Model | Adapter | Cases | Routing | Output | Total | Main Failure Pattern |
|---|---|---|---|---|---|---|---|---|
| [date] | [platform] | [model] | [adapter] | 4 | [0-12] | [0-28] | [0-40] | [short phrase] |
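
A minimal sketch of how one summary row could be assembled from run-level totals, assuming Python and a hypothetical helper named `summary_row` (neither is part of the repository). The range checks use the 0-12 and 0-28 bounds from the table template above, and the total is derived rather than typed in so it cannot drift from its parts.

```python
ROUTING_RUN_MAX = 12  # upper bound of the Routing column in the template
OUTPUT_RUN_MAX = 28   # upper bound of the Output column in the template


def summary_row(date, platform, model, adapter, routing, output,
                main_failure, cases=4):
    """Render one Markdown row for the summary table (hypothetical helper)."""
    if not 0 <= routing <= ROUTING_RUN_MAX:
        raise ValueError(f"routing must be 0-{ROUTING_RUN_MAX}, got {routing}")
    if not 0 <= output <= OUTPUT_RUN_MAX:
        raise ValueError(f"output must be 0-{OUTPUT_RUN_MAX}, got {output}")
    total = routing + output  # derived, never entered by hand
    cells = [date, platform, model, adapter,
             str(cases), str(routing), str(output), str(total), main_failure]
    return "| " + " | ".join(cells) + " |"


# Placeholder text fields and illustrative numbers only, not a real result.
print(summary_row("[date]", "[platform]", "[model]", "[adapter]",
                  routing=9, output=20, main_failure="[short phrase]"))
```

Keeping the routing and output totals as separate arguments mirrors the separate Routing and Output columns in the table.
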
For each case, publish:
- Routing at 0-3 per case
- Output at 0-7 per case
- Total out of 40

Do not compress route and output into one impression score.
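
The scoring arithmetic can be sketched as follows, assuming Python; `CaseScore` and `run_totals` are hypothetical names used only for illustration. With 4 cases, the per-case ceilings of 3 and 7 give run-level ceilings of 12 and 28, which sum to 40. The sketch keeps routing and output as separate fields so they are never merged into a single impression score.

```python
from dataclasses import dataclass

ROUTING_MAX = 3  # per-case routing ceiling from the rubric
OUTPUT_MAX = 7   # per-case output ceiling from the rubric


@dataclass(frozen=True)
class CaseScore:
    """Score for one benchmark case (hypothetical structure)."""
    case_id: str  # e.g. "01"
    routing: int  # 0-3
    output: int   # 0-7

    def __post_init__(self) -> None:
        # Reject out-of-range values instead of silently clamping them.
        if not 0 <= self.routing <= ROUTING_MAX:
            raise ValueError(f"{self.case_id}: routing must be 0-{ROUTING_MAX}")
        if not 0 <= self.output <= OUTPUT_MAX:
            raise ValueError(f"{self.case_id}: output must be 0-{OUTPUT_MAX}")


def run_totals(scores):
    """Sum routing and output separately, then derive the overall total."""
    routing = sum(s.routing for s in scores)
    output = sum(s.output for s in scores)
    return {"routing": routing, "output": output, "total": routing + output}


# Illustrative numbers only, not a real run.
scores = [
    CaseScore("01", routing=3, output=6),
    CaseScore("05", routing=2, output=5),
    CaseScore("07", routing=3, output=7),
    CaseScore("08", routing=1, output=4),
]
print(run_totals(scores))
```
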
Do not:
When the first real run is published, create:
evals/results/

If you have not run the cases yet, say so.
An honest empty leaderboard is stronger than a fake benchmark.