
The AI research experimentation problem
If you’re reading this, you probably follow AI research. If you follow AI research, you’ve probably seen debates about timelines to AGI. Perhaps you started building an underground bunker after Daniel Kokotajlo and Scott Alexander warned that AI agents could topple governments and annihilate humanity by 20301.
Most people who believe superintelligence is imminent anticipate an “intelligence explosion” – a dramatic increase in AI progress once AI can automate AI research. Specifically, this scenario might unfold if advanced reasoning models can autonomously generate and select research ideas and implement these in code.
I believe we’ll soon train AI agents to generate diverse research ideas, select the most promising to test2, and implement those ideas in code. But I don’t think these capabilities (alone) will meaningfully accelerate progress towards AGI.
To understand why, we need to examine what actually bottlenecks AI research today. When I ask AI researchers “what keeps you up at night?,” answers range from “AI developing biochemical weapons” to “problematic data causing training divergence.” Alas, both are valid concerns.
From these conversations, one thing is clear: the bottleneck is not a lack of ideas. Researchers are overflowing with ideas, and thousands remain untested. As more people pursue careers in AI research, the backlog only grows - and it isn’t that hard to implement many of these ideas in code3.
Compute is a critical bottleneck: we simply don’t have enough GPUs/TPUs to power AI experiments. But many organizations are already focused on the current and looming data center capacity crunch.
The overlooked bottleneck in AI research is designing rigorous experiments, running these experiments, and analyzing the results.
If we want to accelerate progress, we don’t need more ideas or more code. We need to run better experiments faster4. So, how do we get there? From talking with researchers and engineers, three problems stand out:
- Better experiment design (including stronger evaluation methods)
- Better approaches to run experiments (detecting/debugging issues in training code; managing/tracking experiments end-to-end)
- Better experiment analysis (diagnosing failures; extrapolating outcomes at scale from smaller runs)
When researchers can design experiments that carefully test hypotheses, run them without constant instability, rigorously assess impact and understand why ideas fail, and extrapolate from small runs, we’ll unlock a new phase of acceleration. Not because we automated AI research, but because we empowered thousands of humans to move faster. The first intelligence explosion may be human.
These problems are genuinely hard5. As a VC/techno-optimist, I believe better tools can help, but tools alone won’t fix experiment design, execution, and analysis. Many challenges arise due to complex organizational dynamics and incentives: leadership must weigh competitive advantage against open science; managers fear rigorous experimentation slows iteration; ICs are rewarded for SOTA performance, not reproducibility. In fast-paced markets, teams ship models and software when they’re merely good enough. These pressures mount as the genAI market grows more crowded, progress outpaces anyone’s ability to track it, and media coverage stokes hype and controversy.
Solving these issues will require deep collaboration among researchers, statisticians, engineers, and others - many of whom have never faced challenges this hard or worked together before. I can highlight clever opportunities to sprinkle AI fairy dust on interesting problems, but the real work lies in building the right culture, incentives, and processes. Still, I believe focused research organizations committed to transparent science and careful experimentation can make meaningful progress. We can figure this out.
Part 1: Designing better experiments
Too many AI research experiments lack rigor
In 2013, DeepMind researchers presented a Q-learning variant that could learn to play Atari 2600 games directly from raw pixels. It seemed to outperform human players, sparking excitement and becoming one of the most-cited ML papers of the decade. However, in 2018, Peter Henderson and colleagues from McGill and Microsoft demonstrated that much of the reported superiority came from flawed evaluation protocols. The original study relied on one or a few random seeds and compared AI to a small pool of non-expert humans playing under less favorable conditions.
This was not an isolated incident. Even in a community that values reproducibility, several breakthroughs have been retracted or reframed after later studies revealed flaws in the original experiment design.
This is not a critique of individual researchers. Competitive pressures, tight timelines, and genuine excitement about high-impact missions push everyone to move fast. But progress in AI hinges on good science, and good science hinges on good experiments. Experiments show when new research ideas actually work. We need well-designed experiments so researchers can develop better algorithms based on empirical evidence - not just vibes.
If better experiment design is so essential, why is it so hard for research labs to prioritize it?
First, it’s simply hard to do well. Designing internally valid AI experiments is difficult because researchers need to control many variables (e.g. data distributions, sampling methodologies, training protocols, hyperparameter configurations) that can subtly affect outcomes. Researchers must exercise tight control over all these variables, even under intense pressure to ship results. This might mean rejecting a new dataset that could impact scaling predictions or shelving an architectural change that could destabilize model behavior. The high dimensionality and stochastic nature of genAI systems make it harder to spot biases and minimize random errors. And in practice, many improvements to experiment design only become apparent after a flawed experiment has already run.
These problems intensify when large research teams evaluate multiple ideas simultaneously, embedding several hypotheses within a single training run to save compute. While resource-efficient, this multiplexing strategy limits the ability to attribute observed performance gains to specific interventions, undermining causal interpretability. Researchers must balance designing experiments that preserve causal interpretability with extracting the maximum possible signal per unit time and per unit of compute. This balance is really tough to achieve in stochastic domains, where multiple runs are often needed to reliably separate signal from noise.
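To make the attribution trade-off concrete, here is a minimal sketch of a full factorial layout (the intervention names are purely illustrative): 2^k runs for k interventions instead of one multiplexed run, but every main effect and interaction stays attributable.

```python
from itertools import product

# Hypothetical interventions being tested together. A full factorial design costs
# 2^k runs for k interventions, but keeps each main effect (and their interaction)
# attributable, unlike stacking every change into a single multiplexed run.
interventions = {"new_data_mix": (False, True), "longer_warmup": (False, True)}

runs = [dict(zip(interventions, combo)) for combo in product(*interventions.values())]
for i, cfg in enumerate(runs):
    print(f"run {i}: {cfg}")
```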
AI 🤝 Statistics (How statistical approaches can improve experiment design)
Recognizing these pitfalls, the AI research community is beginning to integrate principles from classical statistics to strengthen the internal validity of pre- and post-training studies.
For example, Anthropic’s recent paper “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations” shows how common evaluation practices like reporting single-run metrics without confidence intervals are statistically fragile and can lead to overconfident or misleading conclusions. The author, Evan Miller (who maintains a killer blog on A/B testing), recommends reporting uncertainty via confidence intervals, using hypothesis tests to compare models, and explicitly planning for statistical power so experiments can reliably detect meaningful effects.
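As a rough illustration of those recommendations (with simulated per-question scores standing in for real eval output), the sketch below reports a confidence interval for one model’s accuracy and uses a paired test to compare two models on the same questions:

```python
import numpy as np
from scipy import stats

# Simulated per-question correctness (1 = correct) for two models on the SAME items.
# In practice these arrays come straight out of your eval harness.
rng = np.random.default_rng(0)
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.75, size=500)

# 95% confidence interval for model A's accuracy (normal approximation).
acc = model_a.mean()
se = np.sqrt(acc * (1 - acc) / len(model_a))
print(f"Model A accuracy: {acc:.3f} +/- {1.96 * se:.3f}")

# Paired comparison: differencing per-question scores removes question-level noise,
# so a paired test is more powerful than comparing two independent accuracies.
diff = model_b - model_a
t_stat, p_value = stats.ttest_1samp(diff, 0.0)
print(f"Mean improvement: {diff.mean():.3f}, p = {p_value:.3f}")
```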
Many labs are now recruiting statisticians to rigorously vet experimental protocols, ensuring experiments are properly randomized, appropriately powered, and robust to alternative explanations. If done well, this could improve the credibility of reported findings.
But alas, the world needs more statisticians, and more students are learning to prompt models than to run t-tests. In the future, AI agents could enforce experimental rigor. They could review proposed experiment designs for common errors like improper randomization or underpowered tests, or generate experiment protocols that human statisticians refine. A formal grammar of experiment design (e.g. a DSL like PLanet) might enable humans and AI systems to compose, analyze, and critique experiments more systematically while also making complex design choices explicit to improve reproducibility.
Even then, a core challenge is designing experiments that maximize information gain for a fixed compute and time budget. The best researchers craft studies that probe several hypotheses in a single run and efficiently rule out many possibilities so they can converge on the most promising directions with fewer iterations. Better tools, methodologies, and shared design patterns could help all researchers apply these strategies effectively.
But tooling alone isn’t enough. Without major organizational changes that elevate experiment design to a first-class priority, even the best frameworks risk being sidelined. And statistical rigor, while essential, can backfire when it’s treated as an end in itself. Just as P-hacking and cherry-picking plague other fields, AI researchers with enough compute can run so many random seeds and/or hyperparameter configurations that one eventually produces a statistically significant result even when the effect size is negligible. Avoiding this trap requires norms that balance impact, reproducibility, and statistical soundness. Combined with strong tooling and a commitment to careful experiment design, these norms can provide a clear path to scaling high-quality empirical science.
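The seed-hacking trap is easy to see in simulation. In the toy sketch below (all numbers invented), the “new method” has zero true effect, yet picking the best of many seeds and testing only that run yields spurious “significant” wins far more often than the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_seeds, n_items = 1_000, 20, 200

false_positives = 0
for _ in range(n_trials):
    # Baseline and "new method" share the same true accuracy (0.7): no real effect.
    baseline = rng.binomial(n_items, 0.7) / n_items
    best_new = max(rng.binomial(n_items, 0.7, size=n_seeds) / n_items)
    # Naive two-proportion z-test applied only to the cherry-picked best seed.
    se = np.sqrt(2 * 0.7 * 0.3 / n_items)
    if (best_new - baseline) / se > 1.96:
        false_positives += 1

print(f"False positive rate: {false_positives / n_trials:.1%}")  # well above 5%
```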
Lost in evaluation: Building evals is hard
Almost every researcher and practitioner I’ve spoken to cites model evaluation (evals) as their most pressing challenge. Unfortunately, crafting meaningful evals isn’t just about choosing a few benchmarks and applying statistical rigor to analyze the results…
When running A/B tests, one of the hardest problems is picking the right metrics. Similarly, when running AI experiments, one of the hardest problems is selecting and/or designing the right benchmarks. First, researchers must decide what they’re optimizing for: “general intelligence,” specific capabilities (e.g. mathematical reasoning, coding), user experience, revenue growth, or something else. Picking and prioritizing goals is contentious and complex (whoever invents an AI to solve this deserves a Nobel Peace Prize). And once a metric becomes the target, Goodhart’s law warns it may get gamed until it no longer reflects the real objective.
Even with clear goals, it’s still hard to get the right input-output pairs. A poorly chosen set might miss critical weaknesses or measure something orthogonal to the real objective. Most scientists think good benchmarks should represent the kinds of tasks and questions humans actually encounter. For example, SWE-bench captures real bugs reported in popular GitHub repos6. To keep benchmarks relevant, some teams adopt practices that resemble software regression testing by logging model failures in production and folding them into their evaluation suites.
Benchmarks should reflect the way humans and agents interact, but most ignore asynchronous, iterative exchanges. As Shunyu Yao points out in The Second Half, “But in reality, an agent has to engage with a human throughout the task — you don’t just text customer service a super long message, wait for 10 minutes, then expect a detailed response to settle everything.” Many evals also assume human preferences are fixed when they often shift through repeated interaction with an AI. Well-designed benchmarks should capture these evolving preferences.
Good benchmarks must be easy to evaluate automatically so researchers can iterate fast without costly human annotation. As models improve at assessing outputs, practitioners aren’t limited to closed formats like multiple-choice. But models remain fragile - sensitive to prompt formatting or slight changes in task framing - so benchmark design should carefully guard against such fragility.
Contamination is another major challenge. This is Data Science 101: keep your training and test sets separate to fairly assess generalization. But in practice, separation is hard. Once published, benchmark questions circulate online. They’re translated into Japanese on a subreddit or embedded in obscure tutorials. Before long, someone trains a model on that data without knowing they’ve effectively included test questions in their training set. The problem gets gnarlier because the massive training datasets often contain benchmark-related artifacts (metadata, label distributions, or contextual clues) that can subtly shape the models’ behavior and inflate performance on downstream evals.
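A common first line of defense is a simple n-gram overlap scan like the sketch below (the function names are mine, and 13-gram matching is just a typical choice); as the next section argues, this only catches near-verbatim copies and misses paraphrases and translations.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a document, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(train_docs, benchmark_items, n: int = 13):
    """Return indices of training documents sharing any n-gram with a benchmark item."""
    bench_grams = set()
    for item in benchmark_items:
        bench_grams |= ngrams(item, n)
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & bench_grams]

# Hypothetical usage, where both arguments are lists of strings:
# flagged = flag_contaminated(train_docs, benchmark_items)
```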
Bye, bye benchmarks?
Unfortunately, fixing this problem isn’t as simple as scanning datasets for exact matches. Scrubbing is practically impossible given the countless permutations, paraphrases, and translations of benchmark items. Some researchers are watermarking benchmark datasets (e.g. with canary strings) so that training pipelines can recognize and filter them out.
More radically, there’s growing interest in moving beyond static benchmarks. Open-ended systems like LMSys Chatbot Arena rely upon humans who submit prompts, but Elo ratings used to aggregate model scores across prompts may reinforce biases since they’re sensitive to redundancy. Some are proposing novel ratings methods that are invariant to redundancy.
Others have proposed using LLMs or multi-agent systems to expand and refine benchmark datasets by generating new, harder questions from the original contexts. Benchmark-free methods like TreeEval use LLMs to host irreproducible evaluations immune to data leakage and better able to discriminate between models with superficially similar performance. Automated Capability Discovery takes this further, treating one model as a “scientist” tasked with systematically proposing new, open-ended challenges for a “subject” model and evaluating its performance. These techniques apply constant, adaptive pressure that exposes brittleness and weaknesses that fixed benchmarks miss.
By turning evaluation into a living process, we can ensure that our measures of progress remain honest signals of genuine capability and that improvements reflect real advances, not artifacts of the benchmark itself.
Part 2: Running better experiments
Mid-training is in. But training is mid.
Imagine you are a scientist running genetic studies. You use a centrifuge to isolate nucleic acids from cell debris (a basic step in molecular biology). Now, imagine that centrifuge breaks 30% of the time. You’d lose samples, ruin experiments, and undermine your ability to generate valid results. If you nevertheless refused to fix it, most people would agree: you shouldn’t be running a lab.
And yet, this is the norm in AI research. Researchers routinely tell me that 30-80% of training runs fail - not because the scientific ideas are bad, but because training (and the software to run it) is brittle. Training failures have become the rule rather than the exception, driven by:
- The sheer complexity of the hardware stack, the scale of the data involved, and the prolonged duration of training runs.
- Training software that is often a patchwork of loosely coupled libraries, custom scripts, and rapidly evolving frameworks.
- A training apparatus - spanning algorithms, infrastructure, and tooling - that evolves so quickly that breakthroughs at any layer can render a system obsolete.
Balancing the stability needed for reliable experimentation with the need to implement the latest advances is a constant challenge.
Failures usually arise due to bugs in the training code, subtle data issues, and/or infrastructure problems. Silent errors are common: tokenizer mismatches, partial checkpoint restores, corrupted or duplicated data. Hardware defects can silently pollute training outputs without triggering explicit failure signals. A single faulty GPU may throttle communication across thousands of otherwise healthy devices. GPUs may crash mid-run. Training often continues, but quality perturbations emerge, and it’s hard to tell if they’re caused by a code bug, bad hardware, or statistical noise.
This is bad in any context, but worse when teams don’t know anything broke until it’s way too late (when millions have been incinerated). Stochastic gradient descent can keep converging in the presence of bugs, adapting to whatever signals it’s given, but producing degraded results. To make matters worse, evaluation is often deferred until the end of a long training run, treated as a final box to check rather than a feedback mechanism to be applied throughout training7. By the time a researcher runs an eval that detects failures, the trail has gone cold. No one knows whether the root cause was in the data, the model logic, the optimizer settings, or the compute stack.
The observability gap makes things even more dire. A study of 110 open-source ML projects found that logging was less prevalent in ML applications than in traditional software applications. While most projects contained at least one logging statement, many omitted logs from critical stages like data loading and pre-processing. Without these important metrics, it can be much harder to catch subtle failures.
Other projects suffer from too much logging: thousands of lines, duplicated across ranks, which makes it nearly impossible to spot real issues. Worse, most of the output is meaningless - lines like “loading tensorflow.so” or vague warnings that suggest something might be wrong but don’t demand action. This noise desensitizes developers, who eventually stop reading logs and focus only on the loss plot. While many researchers want to address this problem, competing priorities and the high opportunity cost of implementing good instrumentation often push it down the queue.
In any other domain, you'd fix the centrifuge8. In AI, we just spin it again.
Waiting to eval: Running evals is hard too
I’ve already covered how constructing evals is tough, but so is running them. Most researchers run evals late in the training process…because evals are slow. The predictable result: they discover failures days or weeks later, after wasting time and compute on flawed ideas, misconfigured data, or buggy code. Running evals during training could catch issues earlier9, but this rarely happens because the current toolchain makes it difficult.
For example, one of the most widely adopted eval frameworks, LM Evaluation Harness, is quite slow. Running a full sweep of benchmarks can take hours, so teams wait until the end of training, saving wall-clock time but increasing the overall cost of failure.
This is solvable. With modest engineering effort, we could make evaluation pipelines zippier, thereby tightening feedback loops and surfacing issues while they’re still actionable. Consider this a call to action: making evals faster and easier to integrate into training workflows is one high-impact way to accelerate AI progress10.
But the harness’s performance isn’t the only problem. Evaluations lag when developers use bloated or poorly curated test sets. Teams should start with a small, meaningful subset and expand as the model improves. Full-suite evals aren’t necessary early on when the model can’t predict anything, but they become useful quickly. Early evaluations may yield poor results (that’s expected!), but they offer a valuable signal if you know how to separate acceptable underperformance from genuine failure.
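A lightweight version of this can be as simple as the sketch below: score a fixed random subset of the suite every few thousand steps so feedback arrives during training rather than at the end. `model.score(item)` is a hypothetical interface standing in for whatever your harness exposes.

```python
import random

def should_eval(step: int, every: int = 2_000) -> bool:
    """Run a cheap eval every `every` optimizer steps."""
    return step > 0 and step % every == 0

def quick_eval(model, eval_items, k: int = 200, seed: int = 0) -> float:
    """Score a fixed random subset of the eval suite (same subset every time,
    thanks to the fixed seed, so the curve is comparable across checkpoints)."""
    subset = random.Random(seed).sample(eval_items, k=min(k, len(eval_items)))
    return sum(model.score(item) for item in subset) / len(subset)

# Inside the training loop (sketch):
# if should_eval(step):
#     print(f"step={step} quick_eval_acc={quick_eval(model, eval_items):.3f}")
```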
Most critically, teams should plan and allocate sufficient resources for evaluation from the start. Speed depends on resourcing: if you want fast evals, you need to budget for them. Too often, training gets all the compute, while evaluation limps along with whatever is left. The real problem isn’t just slow tooling; it’s that evaluation isn’t treated as a first-class part of the training pipeline.
Fast evals matter more than they seem. If you want to ship really good software, you need a toolchain that lets you test and verify code quickly. Fast feedback cycles are core to writing good code faster. The same dictum applies to building models - we’re just not doing the requisite work to optimize these evals.
When AI Bugs Bite
Bugs in software are common, but bugs in training code may be more frequent and more catastrophic. Training pipelines are notoriously complex, spanning tokenization, data loading, distributed training, checkpointing, optimizer configuration, and more. Many components are low-level and custom or lack robust abstractions, thereby increasing the likelihood that subtle bugs occur as engineers write critical pieces from scratch.
But bugs in training code aren’t like bugs in traditional software. Software developers can catch bugs with unit, integration, and systems tests or with debugging tools and static analysis. GPU code is a different beast: no tidy for-loops, if-statements, or clean, modular structures - just dense, vectorized JAX or PyTorch with masking tricks to juice performance. Many bugs are numerical: subtle mismatches between forward and backward computations that can stem from framework internals rather than the developer’s code. Reading it is hard, running it is hard, and rerunning it is expensive. Developers may lack direct GPU access for testing, so they mock runs on other hardware. And to improve readability, teams often strip out debugging code.
Unlike with traditional software, most training bugs fail silently. A subtle mistake in the attention mask or an incorrect positional embedding won’t crash the system; it just reduces model quality. The loss might still decrease, but the model takes a big quality hit. Worse, it may learn despite the bugs, but adapt in ways that make it brittle; it will only continue to perform well on systems with the same bugs.
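A concrete example: a causal attention mask that leaks future tokens will not crash anything, but a cheap assertion on a tiny batch can catch it before a multi-week run does. Below is one such guard in plain PyTorch (it assumes you can get at the attention scores, pre- or post-softmax, for a small test input):

```python
import torch

def check_causal_mask(attn: torch.Tensor) -> None:
    """Assert that no query position attends to future key positions.
    `attn` has shape (..., T, T): pre-softmax scores should be -inf at future
    positions, post-softmax weights should be exactly 0 there."""
    T = attn.shape[-1]
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    leaked = attn[..., future]
    ok = bool(torch.isneginf(leaked).all()) or bool((leaked == 0).all())
    assert ok, "future positions receive attention: check the causal mask"
```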
When research engineers do catch bugs (perhaps after days or weeks of training), they must roll back to the last known-good state and isolate the failure, usually by reproducing it at a smaller scale. However, some issues only emerge at large scale, and many arise from multiple interacting factors.
Diagnosing failures at scale is especially challenging. Training LLMs can involve thousands of nodes, each with multiple layers (e.g. hardware accelerators, frameworks, algorithms). Dependencies across these layers produce noisy, entangled failure signals, and faults often propagate, obscuring their origin. Cluster-level communication strategies like data, pipeline, and tensor parallelism add coordination challenges. Failures might originate from subtle timing issues or misaligned messages that only appear under certain loads. Pinpointing bugs within this dense graph of interdependencies is often slow and labor-intensive.
Nets not gnats: Finding and eliminating those pesky bugs
Some engineers are developing systematic methodologies for debugging training runs by borrowing from traditional software debugging. For example, Stanislav Bekman’s Art of Debugging proposes several practical principles: keep the debug loop tight by shrinking the data and reducing startup time; reproduce bugs on the smallest possible model using synthetic or controlled inputs; force sync execution when possible; and prioritize atomic debug cycles. These aren’t silver bullets, but when applied systematically, they make large-scale debugging far more tractable.
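In config form, those principles often reduce to a “shrink everything” override like the sketch below (the field names are illustrative and not tied to any particular framework):

```python
def debug_overrides(cfg: dict) -> dict:
    """Shrink the run until the bug reproduces in seconds, not hours."""
    return {
        **cfg,
        "n_layers": 2, "d_model": 64, "n_heads": 2,       # smallest possible model
        "train_samples": 256, "seq_len": 128,             # tiny, controlled data
        "world_size": 1,                                  # drop parallelism from the loop
        "compile": False, "async_checkpointing": False,   # force simple, sync execution
        "max_steps": 20,                                  # atomic debug cycles
    }
```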
Meanwhile, other teams are developing tools to proactively detect silent failures during training. TRAINCHECK continuously validates training runs against training invariants (rules that must hold throughout training), catching issues that stem from a wide range of root causes. The invariants are inferred from traces collected by instrumenting a given training program and are checked automatically.
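You do not need a full framework to start. A handful of hand-written invariants, checked every step, already catches many silent failures; the sketch below is my own illustration of the idea, not TRAINCHECK’s API:

```python
import math

def check_invariants(step, loss, grad_norm, lr, loss_history, window=200):
    """Return a list of violated training invariants at this step (empty = healthy)."""
    problems = []
    if not math.isfinite(loss):
        problems.append(f"step {step}: non-finite loss {loss}")
    if grad_norm > 100.0:                      # threshold is model-specific
        problems.append(f"step {step}: gradient norm spike {grad_norm:.1f}")
    if lr <= 0:
        problems.append(f"step {step}: non-positive learning rate {lr}")
    recent = loss_history[-window:]
    if len(recent) == window and recent[-1] > 1.5 * min(recent):
        problems.append(f"step {step}: loss regressed far above its recent minimum")
    return problems
```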
While we see some potential to use genAI to debug training code, full automation is unlikely. Few relevant datasets exist to train such models since detailed logs from large training runs are rare and most developers don’t document failures (which also limits collective learning). Still, better tools could help engineers isolate and reproduce bugs. Once that’s done, AI might be able to help identify and remediate them.
The almighty log
Most software engineers know that writing code is easier than maintaining it at scale. That’s why they respect logs. Logs are how you debug systems, catch regressions, and make sense of failure. But in AI, logging is often treated as a nuisance.
It shows. Training logs can be sparse, unstructured, or spread across many systems. Failures are common, but diagnosing them is tough with no signal. Profilers like PyTorch’s offer deep visibility, but the overhead makes them unusable for long jobs. Teams must choose: log heavily to stay safe and slow training down, or log sparingly and fly blind. Most pick speed and hope nothing breaks. When it does, they’re stuck. Distributed training only makes things worse. Failures propagate across thousands of GPUs and parallelism layers. A single bad log line with no context can stall debugging for days.
Hindsight logging is a promising idea: training runs execute with minimal instrumentation but save periodic checkpoints. If something goes wrong, developers can replay from a checkpoint and inject logging after the fact. This balances performance and observability, but it introduces nontrivial challenges in data management and code complexity. Poorly implemented, hindsight logging can create technical debt and redundant code paths. Frameworks like FLOR make it easier, but this workflow is far from standard.
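In pseudocode, the workflow looks roughly like the sketch below. `load_checkpoint`, `train_step`, and the probes are hypothetical stand-ins for your own training harness; real frameworks such as FLOR also handle replay determinism and data management.

```python
def replay_with_probes(checkpoint_path, load_checkpoint, train_step, probes, n_steps=100):
    """Hindsight logging sketch: resume from a periodic checkpoint and re-run a short
    window of training with extra instrumentation injected after the fact, instead of
    paying for heavy logging during the original multi-week run."""
    state = load_checkpoint(checkpoint_path)
    records = []
    for _ in range(n_steps):
        state, metrics = train_step(state)
        records.append({name: probe(state, metrics) for name, probe in probes.items()})
    return records

# Hypothetical usage:
# records = replay_with_probes("ckpt_120000.pt", load_checkpoint, train_step,
#                              probes={"grad_norm": lambda s, m: m["grad_norm"]})
```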
Some labs are investing in better logging infrastructure. Meta continuously ingests and indexes high-volume training logs, attaching metadata like iteration numbers and rank IDs to every log line. Systems like XPUTimer monitor key GPU-level operations at runtime, avoiding the cost of full tracing while still surfacing low-level anomalies. Research systems like L4 automate diagnostics by mining logs for spatial, temporal, and cross-job failure patterns, enabling faster localization of faults without rerunning massive jobs.
But most teams aren’t there yet. And that’s a problem. Logging isn’t just for postmortems (which are rarely written) – it’s foundational to reproducibility, experiment tracking, and understanding what your model learned. Without reproducibility, a training run is doomed if it needs to be rolled back or remediated; you can’t fix what you can’t faithfully recreate. The ability to rerun an experiment exactly (same code, same config, same data) turns a mysterious failure into a solvable problem. As experiments grow more complex and expensive, observability must get better too. We need logs that are structured, queryable, and rich with context…not just scattered print statements and a few TensorBoard scalars.
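As a small example of what “structured and rich with context” can mean in practice, the sketch below builds a rank-aware logger with Python’s standard logging module (RANK is the environment variable set by torchrun and most launchers): full logs on rank 0, warnings-and-above elsewhere, and every line stamped with its rank.

```python
import logging
import os

def setup_training_logger(name: str = "train") -> logging.Logger:
    """Rank-aware logging: verbose on rank 0, quiet elsewhere, so thousands of
    workers don't each emit the same boilerplate lines."""
    rank = int(os.environ.get("RANK", "0"))
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(f"%(asctime)s rank={rank} %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)
    return logger

logger = setup_training_logger()
logger.info("step=%d loss=%.4f grad_norm=%.2f", 1200, 2.314, 1.07)
```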
Part 3: Analyzing experiments better
So let’s say you expressed a compelling hypothesis, designed a good experiment, and executed a training run without any hiccups. Now you need to analyze it. Easy, right? Nope. Analyzing experiments has never been easy.
Even after defining a hypothesis and executing a well-designed training experiment, researchers still may not know if their idea was flawed, the implementation was buggy, or the experiment infrastructure was not robust. Most experiments implicitly test three things:
- The research idea;
- Its implementation;
- The experimental setup and infrastructure.
When results don’t align with expectations, it’s hard to tell if the model failed to learn what it should, the wrong thing was measured, or the right thing was measured incorrectly. Proper randomization and statistical rigor help. So do diagnostics for interaction or outlier effects. But in practice, researchers still struggle to isolate causes and interpret effects when training large models. We need better post hoc tools to determine what changed, why, and whether those changes are meaningful or incidental.
Advances in mechanistic interpretability may be critical to analyzing experiments. Techniques like activation patching, attribution patching, and probing can isolate the specific neurons or layers responsible for certain behaviors. They could show if/how a training intervention affects internal representations and if those changes alter outputs11. Recent work from Anthropic introduces circuit tracing to identify interpretable features and map how they causally interact across layers to produce an output. By tracing which components (e.g. combination of neurons, attention heads) influence or pass information through a circuit, researchers can apply this method to form testable hypotheses about internal model mechanisms (e.g. how Claude 3.5 performs multistep reasoning to answer geography questions or proactively compose rhymes) and validate them through targeted perturbations.
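As a concrete illustration, a bare-bones version of activation patching takes only a few lines of generic PyTorch hooks (real interpretability work uses richer tooling, and this sketch assumes the hooked module returns a plain tensor):

```python
import torch

def patch_activation(model, layer, clean_input, corrupted_input):
    """Run the model on a corrupted input, but overwrite one layer's activation with
    the value it produced on a clean input. If the output recovers, that layer is
    causally implicated in the behavior under study."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]  # returning a value from a forward hook replaces the output

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_input)
    handle.remove()

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_output = model(corrupted_input)
    handle.remove()
    return patched_output
```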
Influence functions12, which quantify the impact of individual training data points on a model’s predictions, can also guide data research. They can detect which training examples become more important after a specific intervention and whether shifts in output are driven by intended changes or by interactions with unrelated training data. What’s more, influence functions can help identify contaminated, mislabeled, or harmful training examples that disproportionately affect predictions.
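For reference, the classical estimator (in the style of Koh and Liang) for the effect of upweighting a training point z on the loss at a test point z_test is:

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
    H_{\hat\theta}^{-1}
    \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

In practice the Hessian inverse is never formed explicitly; large-scale work approximates it with iterative or Kronecker-factored methods, which is part of why these estimates get shaky at scale (see footnote 12).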
While these techniques extract richer insights from individual experiments, the next challenge is to identify meaningful results from far more experiments. As automation accelerates hypothesis generation, experiment design, and execution, experiment volume will surge. It will become too onerous for humans to determine which findings matter, which are spurious, and which deserve follow-up. Researchers will need access to the most relevant subsets of data, drawn from an ever-expanding stream of results. LLM-driven tools are well-positioned to address this, sifting, ranking, and contextualizing findings far more effectively than humans could unaided.
These tools don't replace good experimental design, but they complement it, offering concrete, model-internal signals that help distinguish between a broken idea, a broken setup, or a misleading success.
Extrapolation is hard
Doing good science may require us to run more experiments. But more experiments mean more time and money. We can’t run everything at full scale, so teams develop scaling methodologies that let them train smaller models13 and predict how those results might reproduce at larger scales. This is common in biology - we test on cells before mice, on mice before humans. But unlike biology, where extrapolation has a more mature empirical foundation, we lack methods for predicting how results from LLM experiments will scale - especially when assessing model capabilities.
That’s a problem. To iterate faster and avoid wasting millions of GPU hours, we need to know when and how results from small models transfer. Unfortunately, this is a wide-open question. Scaling laws show that loss curves often follow smooth, predictable patterns. But capabilities don’t. Some emerge abruptly (like in-context learning), others disappear as models get bigger, and others wobble unpredictably.
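For loss, the standard recipe is to fit a saturating power law to a sweep of small runs and extrapolate; a minimal sketch with made-up numbers is below. The hard part, as the rest of this section argues, is that no comparably reliable recipe exists for capabilities.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, alpha, l_inf):
    """Saturating power law L(N) = a * N^(-alpha) + L_inf, the usual functional form."""
    return a * np.power(n_params, -alpha) + l_inf

# Hypothetical (parameter count, final loss) pairs from a sweep of small runs.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([3.9, 3.5, 3.1, 2.8, 2.6])

(a, alpha, l_inf), _ = curve_fit(scaling_law, sizes, losses, p0=[10.0, 0.1, 2.0], maxfev=10_000)
print(f"Fitted exponent alpha = {alpha:.2f}")
print(f"Predicted loss at 7e9 params: {scaling_law(7e9, a, alpha, l_inf):.2f}")
```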
In theory, we have some tools. Neural Tangent Kernels (NTKs) approximate training dynamics in the infinite-width limit, where models behave like linear function approximators with fixed features. This makes them useful for understanding early learning behavior. However, NTKs assume the model’s features don’t evolve (only the output layer adjusts), so they fail to capture the rich, nonlinear dynamics and representation learning that occur in SOTA LLMs. While useful as a baseline, NTKs cannot explain how or why complex behaviors arise at scale.
Maximal Update Parameterization (μP) helps align learning dynamics across model widths, making small- and large-scale models behave more comparably during training. That’s great for studying learning rates, optimizer schedules or loss curves. But it doesn’t tell you if a new attention routing trick will unlock reasoning at 7B the same way it did at 70M. μP preserves update magnitudes (not emergent behavior), and doesn’t account for nonlinear effects introduced by scaling depth, context length, or data diversity (though recent work is exploring these).
What we need is an actual science of extrapolation: frameworks, metrics, and experiments that tell us when behaviors transfer across scale, and why. What internal signals predict emergent capabilities? What inductive bias becomes more or less important as models grow? How can we separate the effects of scale from the effects of dataset size, model depth, or optimization regime? Extrapolation is very hard - so right now, we don’t have good answers. However, if we’re serious about accelerating progress, this might be one of the highest-leverage problems we can work on.
An ode to science
AI progress in the last decade has been remarkable, aided by an influx of scientists from fields like physics, neuroscience, and biology who brought the rigor and good habits of their disciplines. But doing science well takes more than individual skill; it requires organizations to adapt their structure and processes and foster real collaboration. They may need to trade off iteration speed for scientific rigor. They may need to give up some secrecy to support reproducibility. These tradeoffs are tough, but worth it. If we want to move faster, we need to treat AI research like real science. That means better experimental design, more rigorous evaluation, and tools that help us understand what models are actually learning. It means logging what matters, spotting failures early, and analyzing results with the same statistical discipline you’d expect in any credible empirical field.
AI and better tooling can help - not by replacing researchers, but by giving them sharper instruments to catch bugs, flag confounders, generate clean protocols, and accelerate the path from idea to evidence. Smarter experiments won’t just prevent wasted compute, they will unlock faster insight and more confident progress. But real progress will require organizations to change how they work. They will need to incentivize teams to focus on deep, hard, technical problems that require interdisciplinary collaboration.
We talk a lot about scaling models. Maybe it's time we scaled good science.
1While I make quippy comments, I believe we are significantly underinvesting in AI safety.
2Researchers have already made meaningful progress towards generating novel research ideas.
3Developing a framework that enables researchers to implement ideas and tune this implementation at different levels of abstraction is much more challenging.
4Most of the content in this post has clear applications to pre-training experiments. However, many of the same problems will arise as we start to scale post-training.
5Understatement of the century?
6Some practitioners have observed that SWE-bench is heavily weighted toward issues from Django, which may limit its representativeness. Expanding it to include a broader set of open-source projects could yield a more balanced and informative benchmark.
7To implement evaluation mid-training effectively, teams should use separate validation and test sets. Unfortunately, some evaluations still omit this separation.
8The issues which we’ve highlighted are, admittedly, hard to fix. However, AI engineering teams can make tangible improvements by systematically reducing defects across the model development and deployment pipeline.
9While it is true that evals aren’t useful very early in training (when the model can’t predict anything), that changes quickly. By a few percent in, evals start to reveal real issues.
10Carefully choosing which examples to include in evals is critical. The temptation to keep adding new ones drains compute and slows experimentation. Discipline in selection ensures evals stay focused and cost‑effective.
11Notably, these methods are no substitute for robust evaluation (i.e. comparing model performance with and without the intervention) to confirm that changes correlate with desired behavior.
12Ironically, studies suggest that influence functions are unstable and unreliable at scale.
13Unfortunately, the smaller the scale, the greater the uncertainty.