The Origin Story of Rohe Nordberg Review

Long before we started thinking about Large Language Models, we began studying what we called the “1000x statistics problem.” It is arguably the most important problem in statistics, and one that does not fit nicely into our standard (mathematical) toolbox. The problem is simple to state: In medicine, statistical errors appear in thousands of research papers each year. The problem is widely known, yet it has seemed intractable. The sheer scale means that any real solution likely requires systematic methods; individual statistical brilliance, no matter how deep, cannot keep pace with the volume of papers being published.

In 2020, we started simply: taking walks around Orton Park during the pandemic, wondering how to systematically understand the statistics in research papers. Auden casually mentioned they would “just go scroll through PubMed for a good time,” and we knew we were onto something. Not a traditional research path, but one worth exploring. We started a reading group. Paper by paper, we began cataloging statistical problems, looking for patterns, trying to understand how these issues cluster together. What started as a collection of seemingly one-off errors began to reveal systematic patterns: categories of problems that appeared again and again in the literature.

These insights became the foundation for a new undergraduate course on “listening to statistics.” We realized that scientific communication has always focused on how to write papers and present results (how to “talk” science) but rarely on how to systematically read and listen. Through teaching, we discovered something fascinating: people just needed good prompts to excel at statistical reading comprehension.

Then came a stroke of serendipitous timing. Just as we were deepening our understanding of statistical reading comprehension, Large Language Models (LLMs) had quietly evolved to become remarkably useful. We have identified multiple insights that make our system work, but perhaps the most important is this: LLMs are poor judges but excellent readers. In this way, LLMs behave just like our best undergraduate students: given clear instructions, they excel at listening to the statistical story in scientific papers.

We believe that just as the last century of developments in information technology supported the scaling of systematic statistical computations, LLMs can now support the scaling of systematic statistical reading comprehension.

What started as joyful wandering has led to something precise and powerful: a system in which LLMs evaluate the statistical content of manuscripts, often exceeding the performance we see (and thus expect) from human reviewers. This has prompted us to develop novel statistical methodologies to identify what we call “super-human” performance: cases where the system makes better evaluations than the available “gold standard” labels.
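Detecting performance better than the gold standard can sound paradoxical, so here is a minimal sketch of why it is statistically possible in principle. This toy model is illustrative only, not our actual methodology: it assumes binary evaluations, two independent replicates of a noisy gold standard, and errors that are independent across labelers. Under those assumptions, the gold standard’s own error rate is identifiable from how often its replicates disagree with each other, which in turn lets us estimate a rater’s accuracy even when it exceeds the gold standard’s.

```python
import numpy as np

# Toy model: binary evaluations with a noisy gold standard.
# All names and numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 20_000
truth = rng.integers(0, 2, n)        # latent correct evaluation (never observed)

def noisy_copy(labels, accuracy):
    """Flip each label independently with probability 1 - accuracy."""
    flips = rng.random(labels.shape) > accuracy
    return np.where(flips, 1 - labels, labels)

gold_1 = noisy_copy(truth, 0.85)     # gold-standard label, replicate 1
gold_2 = noisy_copy(truth, 0.85)     # independent replicate 2
rater = noisy_copy(truth, 0.95)      # a "super-human" rater, better than gold

# Two independent copies with accuracy a agree with probability a^2 + (1-a)^2,
# so replicate agreement g identifies a: a = 1/2 + sqrt(2g - 1)/2.
g = np.mean(gold_1 == gold_2)
a_hat = 0.5 + 0.5 * np.sqrt(max(2 * g - 1, 0.0))

# A rater with accuracy r agrees with gold with probability r*a + (1-r)*(1-a);
# solving for r can yield r_hat > a_hat, i.e. detectably better than gold.
agree = np.mean(rater == gold_1)
r_hat = (agree - (1 - a_hat)) / (2 * a_hat - 1)

print(f"estimated gold accuracy:  {a_hat:.3f}")   # close to 0.85
print(f"estimated rater accuracy: {r_hat:.3f}")   # close to 0.95 > gold
```

The punchline of the simulation: a reader that frequently disagrees with a noisy gold standard is not necessarily wrong; with replicated labels, the gold standard’s own error rate becomes estimable, and the disagreements can be attributed to the right side.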