AI coding ability is hard to evaluate honestly. Public benchmarks like HumanEval and SWE-bench are gameable — models train on the test sets and rankings stop reflecting real-world capability. Vibes-based "I tried the new model" reviews are noisy and dominated by influencer dynamics.
The deeper problem is judge calibration. Ask one AI to score another AI's output and the scores come back inflated. In V1, Gemini Flash inflated the consensus score by roughly 2.4 points across the board, a bias that stayed invisible until I cross-validated against two other judges. It isn't favoritism: Flash inflates everyone, including its own competitors.
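To make that calibration check concrete, here's a minimal sketch of the kind of cross-validation that surfaces a judge's systematic offset: compare each judge's scores to the leave-one-out consensus of the other judges on the same submissions. The judge names and scores below are made up for illustration; they are not the actual V1 data.

```python
from statistics import mean

# Hypothetical per-judge scores for the same set of submissions (0-10 scale).
# These numbers are illustrative, not the real V1 results.
scores = {
    "gemini-flash": [8.9, 9.1, 8.7, 9.4, 8.8],
    "judge-b":      [6.5, 6.8, 6.3, 7.0, 6.6],
    "judge-c":      [6.7, 6.4, 6.6, 6.9, 6.5],
}

def judge_bias(scores: dict[str, list[float]]) -> dict[str, float]:
    """For each judge, return its mean offset from the leave-one-out
    consensus (the average of the other judges on the same submissions)."""
    bias = {}
    for judge, own in scores.items():
        others = [s for name, s in scores.items() if name != judge]
        consensus = [mean(vals) for vals in zip(*others)]
        bias[judge] = mean(o - c for o, c in zip(own, consensus))
    return bias

for judge, delta in judge_bias(scores).items():
    print(f"{judge}: {delta:+.2f} vs. leave-one-out consensus")
```

Comparing each judge against the consensus of the *others*, rather than against an overall average it participates in, keeps an inflated judge from partially masking its own offset.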
I wanted to know: can today's frontier models actually build something that runs, that someone can use, with no help from a human? And how do you measure the answer in a way that's genuinely hard to fake?