The first public benchmark baseline for this repository is now live.
Most important caveat first:
This is not an independent third-party evaluation.
It is a self-run baseline from the current Codex App session.
I still decided to publish it.
Because benchmark work becomes useless when it does one of two things:
This run used 4 fixed cases:
Current result:
12 / 12
25 / 28
37 / 40

The interesting part is not the score.
The interesting part is the failure pattern.
This run exposed two clear weaknesses:
That is why benchmarking matters.
Not because it helps claim “the model is strong”. Because it makes the next fix obvious.
If you want to inspect the baseline directly:
evals/results/2026-04-18-codex-app-gpt-5-self-run.md
docs/benchmarks/LEADERBOARD.md
docs/benchmarks/FIRST_PUBLIC_RUN.md

The next step is not adding more skills.
The next step is turning this into a benchmark others can compare against.