One(ish) year later: The Agent-First Developer Toolchain

When I wrote The Agent-First Developer Toolchain in April 2025, I proclaimed:

“…despite their raw capabilities, today’s LLMs haven’t drastically transformed software engineering; instead, they’ve been largely bolted on to yesterday’s tools and workflows, resulting in a shift that’s been more evolutionary in nature than revolutionary.”

Sitting here today, just 10 months later, that statement feels quaint…if not dumb.

What changed wasn’t hype, but capability.

The release of Claude Opus 4.5 (and shortly thereafter 4.6) and Codex 5.3 placed us on an entirely new S curve. These models did much more than improve autocomplete; they began writing reliable, multi-file, multi-service systems.

Suddenly, some of the most talented engineers I knew weren’t using agents for a small fraction of their work, but were delegating up to 95% of code authorship. The agent no longer assisted the workflow; it owned it. And with that shift, we’ve begun to see the SDLC evolve in profound ways.

The rise of the software factory

Today, the biggest emergent pattern is end-to-end “software factories” instead of “better copilots.”

Developers are spinning up orchestrated swarms of agents to plan, write, test, refactor, and deploy in parallel. Instead of one IDE session, you now have fleets of agents.

Projects like Gas Town and Ralph make this explicit: agents coordinating other agents, decomposing problems into task graphs, operating like automated dev teams. The unit of work is a team of agents rather than a file.
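To make "decomposing problems into task graphs" concrete, here is a minimal sketch in Python. It is not how Gas Town or Ralph work internally; the task names, the `run_agent` stub, and the `run_task_graph` helper are all hypothetical. The only idea it illustrates is that independent tasks can be handed to agents in parallel "waves" once their dependencies are done.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
TASK_GRAPH = {
    "plan": set(),
    "write_api": {"plan"},
    "write_ui": {"plan"},
    "test": {"write_api", "write_ui"},
    "deploy": {"test"},
}


def run_agent(task: str) -> str:
    """Stand-in for dispatching a task to an agent; a real harness would call a model."""
    return f"{task}: done"


def run_task_graph(graph: dict[str, set[str]], max_workers: int = 4) -> list[list[str]]:
    """Run tasks in dependency order and return the waves that ran in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every task whose dependencies are finished is ready right now.
            ready = sorted(sorter.get_ready())
            # Consuming the iterator blocks until the whole wave has finished.
            list(pool.map(run_agent, ready))
            sorter.done(*ready)
            waves.append(ready)
    return waves
```

Calling `run_task_graph(TASK_GRAPH)` runs `write_api` and `write_ui` side by side once `plan` completes, which is the "team of agents" unit of work in miniature.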

This is also why I and many others believe the IDE, as we knew it, is becoming an anachronism.

IDEs have always been organized around panes and menus, but agent-native workflows need terminals, logs, and bash. The center of gravity has shifted to the CLI and TUI, environments that give more direct access to the “metal” than text editors.

What’s still missing is a true “Agent Mission Control,” a unified orchestration layer to visualize concurrent agent work, inspect reasoning chains, manage memory, and debug failures across work trees.

Who's building an IDE for reviewing code instead of writing code?

Don't only show me diffs. Show me before/after UIs, terminal output, benchmarks, historic trends, playgrounds, demos, test results etc.

Someone stop me from building this myself.
— Johannes Schickling (@schickling) January 26, 2026

The IDE solved ergonomics for human developers. We haven’t yet solved ergonomics for humans coordinating teams of machine developers.

Managing context and agent harnesses

Despite dramatic gains in model capability, context, both in the human and LLM sense, continues to be a problem.

It's interesting how, for all of the huge model improvements we've seen over the past two year, the one thing that hasn't improved much at all is context length

We've been stuck in the 200,000 up to 1m range for quite a long time now
— Simon Willison (@simonw) February 3, 2026

Models are far better at reasoning, but they are still bounded by context windows. This has forced teams to build bespoke harnesses featuring context pipelines, guardrails, and automated evaluation loops to keep agents grounded with fresh state.
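As a rough illustration of what one stage of a context pipeline might do, the sketch below trims history to fit a fixed budget. Everything here is an assumption for illustration: `ContextItem`, `build_context`, and the word-count `estimate_tokens` are hypothetical, and a real harness would use the model's actual tokenizer and far richer selection logic.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    pinned: bool = False  # pinned items (e.g. the task spec) are always kept


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep every pinned item, then fill what is left of the budget with the newest items."""
    keep = {i for i, item in enumerate(items) if item.pinned}
    remaining = budget - sum(estimate_tokens(items[i].text) for i in keep)
    # Walk from newest to oldest so the freshest state survives.
    for idx in range(len(items) - 1, -1, -1):
        if idx in keep:
            continue
        cost = estimate_tokens(items[idx].text)
        if cost > remaining:
            break  # stop at the first miss so the kept history stays contiguous
        keep.add(idx)
        remaining -= cost
    # Return the survivors in their original order.
    return [items[i] for i in sorted(keep)]
```

The design choice worth noting is the split between pinned and unpinned items: the task definition never falls out of the window, while stale logs are the first thing to go.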

Correspondingly, a fertile area of research has emerged around context management, with new approaches like RLMs showing great promise. Today, however, most high-performing teams still build custom scaffolding. I wrote about Hightouch’s innovative work in this space a few months ago.

At the same time, we’re hitting what my friend Nick Schrock (@schrockn) aptly calls a complexity crisis: we can now build systems 50–100x faster than we can understand them.

The throughput bottleneck has shifted to review, comprehension, and architectural coherence.

While agents can generate, humans must still validate…at least for a little while longer…and it’s this tension that defines this particular moment in time.

Version control: “Git’s blind spot” and the rise of JJ

Agent-driven development is exploding both commit volume and velocity. Commits are becoming larger and more frequent, and that, in turn, is overwhelming our systems.

Hyperscalers, who were already straining under massive monorepos, are feeling acute pain, but this problem won’t be confined to software giants. As agents become default contributors, every repo starts to look like a monorepo under extreme duress.

Rumor is FAANG style co’s are refactoring their monorepos to scale in preparation for infinite agent code
— Samswara (@samswoora) January 31, 2026

Git, designed for human-paced collaboration, is buckling under machine-scale iteration.

My colleague, the sagacious @dbeyer123, describes this as Git’s blind spot: Git tracks text changes, not semantic intent. That worked when humans authored code incrementally, but it breaks when agents can rewrite entire subsystems in one pass.

This is why Jujutsu (JJ) feels like the future.

Almost two months since I tweeted this and I've used jujutsu exclusively the entire time. I want to write something longer form but the tweet form: jj is fantastic and I can't see myself going back, only one exception is I drop down to `git` for bisect still. That's it. https://t.co/VASWU83cUX
— Mitchell Hashimoto (@mitchellh) October 14, 2024

It’s designed to operate at Google scale with low-latency history operations, which becomes critical when commit graphs explode under agent churn. More importantly, its architecture is agent-friendly:

“Undo” is a first-class primitive.
Rebases are cheap and automatic.
History is malleable.
Conflict resolution is structured: conflicts are recorded in commits rather than halting the operation, so they never block on immediate human intervention.
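To show why a first-class undo matters for agents, here is a toy model in Python. It is explicitly not how JJ is implemented; `ToyRepo` and its methods are hypothetical, and the snapshot-per-operation design is a simplification chosen only to show that an agent can try a change and cheaply back out of it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model: every operation records a full snapshot of the repo state."""

    # Start with a single empty snapshot so there is always a current state.
    history: list[dict[str, str]] = field(default_factory=lambda: [{}])

    @property
    def state(self) -> dict[str, str]:
        """The current state is simply the most recent snapshot."""
        return self.history[-1]

    def apply(self, path: str, content: str) -> None:
        """Record a new snapshot with one file changed; older snapshots are untouched."""
        self.history.append({**self.state, path: content})

    def undo(self) -> None:
        """Drop the latest operation, restoring the previous snapshot."""
        if len(self.history) > 1:
            self.history.pop()
```

An agent that writes `a.py`, sees its tests fail, and calls `undo()` is back at the previous snapshot without any manual cleanup, which is the property the list above is pointing at.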

In an agent-first world the work tree is the system of record, history is fluid, and the VCS must optimize for parallel, machine-scale iteration.

Git was built for humans typing lines, whereas JJ feels like it's built for systems generating them.

CI becomes a digital twin universe

CI pipelines are also transforming, if not going the way of the dodo.

The old model was sequential:

Write → Test → Stage → Prod

The new model is experimental:

Spin up → Simulate → Mutate → Observe → Kill → Repeat

Unlike humans, agents don’t (and should not) wait for a linear test and review pipeline. They need shallow replicas of production environments in which they can run experiments autonomously: simulating traffic, replaying logs, and modeling failure states.
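The spin up, simulate, mutate, observe, kill, repeat cycle can be sketched in a few lines of Python. This is a deliberately tiny, deterministic toy: `Replica`, the replayed latency list, and the timeout being tuned are all hypothetical stand-ins, not any vendor's sandbox API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for a shallow, disposable copy of production."""

    timeout_ms: int
    alive: bool = True


# Hypothetical request latencies replayed from production logs.
REPLAYED_LATENCIES_MS = [120, 80, 450, 300, 95, 700]


def spin_up(timeout_ms: int) -> Replica:
    return Replica(timeout_ms=timeout_ms)


def simulate(replica: Replica, latencies_ms: list[int]) -> int:
    """Replay traffic against the replica; return how many requests would time out."""
    return sum(latency > replica.timeout_ms for latency in latencies_ms)


def kill(replica: Replica) -> None:
    replica.alive = False


def experiment_loop(candidate_timeouts_ms: list[int], latencies_ms: list[int]) -> tuple[int, int]:
    """Try each candidate config in a fresh replica; return (best_timeout_ms, failures)."""
    observations = []
    for timeout_ms in candidate_timeouts_ms:         # mutate: one change per iteration
        replica = spin_up(timeout_ms)                # spin up
        failures = simulate(replica, latencies_ms)   # simulate
        observations.append((timeout_ms, failures))  # observe
        kill(replica)                                # kill, then repeat
    # Prefer the fewest failures; break ties with the smaller timeout.
    return min(observations, key=lambda obs: (obs[1], obs[0]))
```

For example, `experiment_loop([100, 200, 500, 1000, 2000], REPLAYED_LATENCIES_MS)` evaluates five disposable replicas and reports the cheapest configuration with no timeouts. No replica outlives its experiment, which is what separates this from a long-lived staging environment.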

Simon Willison’s (@simonw) discussion of StrongDM’s “digital twin universe” captures this perfectly: instead of brittle mocks, we construct production-like universes where agents can test ideas safely and repeatedly.

In this world, CI stops being a gate and becomes a sandboxed multiverse. Unsurprisingly, sandboxes are all the rage now on X. Our portfolio company Modal was early here, and a brief look at traffic to their Sandbox docs shows you everything you need to know.

A peek into the future: programming languages, runtimes, and infrastructure

If the toolchain is shifting to favor model proclivities, we should ask an uncomfortable question: why are we still programming in languages for humans?

I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest… https://t.co/GSHyE1DNxp
— Andrej Karpathy (@karpathy) February 16, 2026

Programming languages were designed around human cognitive limitations, prioritizing readability and ergonomics. But if agents are doing most of the authoring, optimizing for deterministic reasoning and verifiability matters more than syntactic sugar.

This opens the door to:

Languages designed for machine-to-machine collaboration.
Stronger type systems.
Built-in specification layers.
Constraint-first design.

And ultimately, formal verification.

As my colleague Arjun Narayan describes, formal methods have historically been too cumbersome and expensive for mainstream use. But models excel at generating proofs, invariants, and constraint-bound logic. The final frontier of model-driven development is code that is correct by construction.

Finally, there is another massive looming shift. Agents will not just write more software. They will require us to run more software. This means exponentially more services, more simulations, more ephemeral environments, and more highly-parallel, real-time workloads.

This strains infrastructure in ways we haven’t fully internalized. The next wave of infrastructure must support and enable:

Massive concurrency.
Ultra-low latency.
Real-time elasticity.

Our portfolio company Temporal is establishing itself as the default execution engine for agentic applications, providing durability and scalability out of the box. Meanwhile, projects like @realcalebwin’s A1 and Blast point toward the future, rethinking how execution environments scale under heavy parallel, agent-driven workloads.

New compute platforms like exe.dev are rethinking the contract between applications and infrastructure altogether, prioritizing smaller, burstier workloads and agent ergonomics.

We’ve only begun to scratch the surface of the new primitives that will need to be developed to support this new generation of applications.

A year ago, we said the transformation would be foundational. That was right. What we underestimated was the speed at which this wave would wash over our industry.

We’re no longer merely bolting AI onto yesterday’s workflows. We’re actively redesigning the SDLC around machine-driven software development.

Authors

Lenny Pruss

Editors

Justin Gage

Acknowledgments