Benchmarks · The AGI Scientist · June 9, 2026 · 8 min read

Beyond accuracy: measuring scientific reasoning

Why current evals miss the point, and an open protocol that doesn't.

Most benchmarks ask a model for a final answer and grade the string. That tells you whether the model arrived — it says almost nothing about whether it reasoned. For scientific work, the reasoning is the product.

The problem with accuracy

A single-number score can hide the difference between two very different models:

The model that guessed and got lucky.
The model that reasoned soundly but slipped on arithmetic.

Optimize for the score and you reward the first as much as the second. Worse, you quietly select for benchmark-shaped behavior that doesn't survive contact with a real problem.
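
To make this concrete, here is a minimal sketch of endpoint-only grading. The `Attempt` record, its `sound_reasoning` flag, the `accuracy` function, and all of the data are made up for illustration; none of them come from an existing benchmark.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One attempt at one task (hypothetical record for illustration)."""
    final_answer: str
    reference: str
    sound_reasoning: bool  # visible to us here, never read by the grader


def accuracy(attempts: list[Attempt]) -> float:
    """Endpoint-only grading: fraction of final strings that match the reference."""
    correct = sum(a.final_answer.strip() == a.reference.strip() for a in attempts)
    return correct / len(attempts)


# Made-up data: a guesser that happens to be right half the time...
guesser = [
    Attempt("42", "42", sound_reasoning=False),
    Attempt("7", "9", sound_reasoning=False),
    Attempt("3.5", "3.5", sound_reasoning=False),
    Attempt("12", "10", sound_reasoning=False),
]
# ...and a careful reasoner that slips on arithmetic half the time.
reasoner = [
    Attempt("42", "42", sound_reasoning=True),
    Attempt("8", "9", sound_reasoning=True),
    Attempt("3.5", "3.5", sound_reasoning=True),
    Attempt("11", "10", sound_reasoning=True),
]

guesser_score = accuracy(guesser)
reasoner_score = accuracy(reasoner)
print(guesser_score, reasoner_score)  # 0.5 0.5
```

Both models land on the same number because the grader never looks at anything but the final string. Whatever distinguishes them is lost before the score is computed.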

An open protocol

Our proposal grades the trajectory, not just the endpoint. Each task ships with:

a checkable hypothesis the model must state before acting,
intermediate artifacts (a plan, an experiment, a result) scored independently,
and a reproduction step so a third party can confirm the claimed result.
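
One way the three pieces above could fit together is sketched below. This is an illustration under stated assumptions, not the protocol's actual grader: the `Trajectory` schema, the `grade_trajectory` function, the toy checks, and the toy task are all hypothetical. The sketch also simplifies in two ways: it only checks that a hypothesis was stated, not that it is checkable, and it does not enforce that the hypothesis came before the model acted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Trajectory:
    """What a model submits for one task (hypothetical schema)."""
    hypothesis: Optional[str]  # should be stated before acting
    plan: str
    experiment: str
    result: str


# Each check is a plain function returning a score in [0, 1].
Check = Callable[[Trajectory], float]


def grade_trajectory(
    traj: Trajectory,
    artifact_checks: Dict[str, Check],
    reproduce: Callable[[Trajectory], bool],
) -> Dict[str, float]:
    """Score each artifact independently and report the parts, not one number."""
    stated = bool(traj.hypothesis and traj.hypothesis.strip())
    report: Dict[str, float] = {"hypothesis_stated": 1.0 if stated else 0.0}
    for name, check in artifact_checks.items():
        report[name] = check(traj)
    report["reproduced"] = 1.0 if reproduce(traj) else 0.0
    return report


# Toy task (assumption): report the mean of three measurements.
data = [2.0, 4.0, 9.0]

checks: Dict[str, Check] = {
    "plan": lambda t: 1.0 if "mean" in t.plan.lower() else 0.0,
    "experiment": lambda t: 1.0 if t.experiment.strip() else 0.0,
    "result": lambda t: 1.0 if t.result.strip() == "5.0" else 0.0,
}


def reproduce(t: Trajectory) -> bool:
    """Third-party step: recompute the mean and compare with the claimed result."""
    try:
        return abs(float(t.result) - sum(data) / len(data)) < 1e-9
    except ValueError:
        return False


slipped = Trajectory(
    hypothesis="The sample mean is below 6.",
    plan="Compute the mean of the three measurements.",
    experiment="sum(data) / len(data)",
    result="4.0",  # sound plan, arithmetic slip
)
report = grade_trajectory(slipped, checks, reproduce)
print(report)
```

For this trajectory the report gives full marks for the hypothesis, plan, and experiment, and zero for `result` and `reproduced`. The arithmetic slip is visible as a slip, instead of being folded into a single failing score.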

Everything is open and versioned. The grader is code you can read, and the rubric is the same one we apply to ourselves.

Where it goes next

We don't think this is the final word — it's a starting point we'd like the community to challenge. Submit a task, attack the rubric, or port it to your own domain. A benchmark is only as honest as the people willing to break it.