Part 5A: What the Research Actually Says About AI Replacing Engineers

Series: We built a pipeline with tens of thousands of lines of code. Why agents could not do it.
Previous: Part 4 — An Honest Comparison.
Every time something moves in AI — a new model, a new agent framework, a new benchmark result — countless posts on Twitter and public feeds announce that “everything has changed.” The research says something more interesting, and more useful.
The previous essay compared pipelines and agents: costs separated by orders of magnitude, failure rates compounded exponentially, and the probability of 100 consecutive decisions all being correct fell as low as 0.6%. But what do researchers — not venture capitalists, not social media — actually say?
Everyone has an opinion. Very few people have data.
Every time the AI industry stirs — a model release, an agent framework, a benchmark climbing the leaderboard — the posts arrive on schedule: “The big one is coming.” “The world has fundamentally changed.” “The outlook for engineers is not good.” Each one sounds more apocalyptic than the last, as if every product launch were humanity’s final day at work.
It feels like the weather forecast: every week brings a “once-in-a-century storm.” After hearing enough of them, you start leaving home in shorts.
The venture-capital world says agents will take over every workflow.
Some developers are equally certain that their jobs are untouchable.
The noise is loud. The data is quiet. And the story told by the data pleases neither side, because it supports neither hype nor denial. Much like this series.
Let us look at what happened when people actually measured the claims.
METR’s Time Horizon: The Reliability Cliff
METR — Model Evaluation & Threat Research — tracks what may be the most important measure of agent capability: at a given success threshold, how complex a task can a model complete autonomously?
They express complexity as a “time horizon”: how long the task would take a human expert. Then they ask how long a task the best models can handle at a 50% or 80% success rate.
METR’s v1.1 results, updated in May 2026, tell a story of rapid progress coexisting with a hard bottleneck. Here are selected point estimates for the 50% time horizon:
| Model | 50% time horizon |
|---|---|
| GPT-2 (2019) | About 3 seconds |
| GPT-4 (2023) | About 4 minutes |
| Claude 3.5 Sonnet, October 2024 | About 21 minutes |
| o3 (2025) | About 2 hours |
| GPT-5 (2025) | About 3 hours 23 minutes |
| Claude Opus 4.5 (2025) | About 4 hours 53 minutes |
| Claude Opus 4.6 (2026) | About 12 hours |
From 3 seconds to roughly 12 hours: an increase of more than ten thousand times in seven years. If your child grew at that rate, beginning at 50 centimeters, the result would quickly stop being a parenting problem.
The curve is not linear. It is exponential. METR v1.1 estimates an overall doubling time of about 6.2 months, accelerating to about 4.2 months when measured from 2023 onward. Anyone building production systems should take this trend seriously. The capability curve is steeper than most people imagine.
But the next set of numbers changes the discussion.
Those headline time horizons use a 50% success threshold. The model has a coin-flip chance of completing the task.
At an 80% threshold — closer to the reliability a production environment actually needs — the numbers collapse:
| Model | 50% time horizon | 80% time horizon |
|---|---|---|
| Claude Opus 4.5 | About 4 hours 53 minutes | About 49 minutes |
| Claude Opus 4.6 | About 12 hours | About 1 hour 10 minutes |
Read that again. Claude Opus 4.6 can handle a roughly twelve-hour task if you accept that it will fail half the time. Require an 80% success rate, and you are back to an hour-long task.
Imagine a car capable of 300 kilometers per hour whose brakes work only half the time. Would you drive it at 300? Probably not. You would stay near the speed at which you were confident it could stop. The gap between “the demo completed” and “production can depend on it” is not a small adjustment. It is an order of magnitude.
That is the reliability cliff.
METR’s own caveats matter here. The uncertainty intervals around individual model estimates are wide, especially for long tasks. The benchmark also concentrates on well-specified, automatically scored software engineering, machine learning, and cybersecurity tasks with relatively clean context. A twelve-hour time horizon does not mean twelve hours of an engineer’s real job can simply be replaced.
Even with those boundaries, the reliability cliff maps directly onto Part 4. Running a pipeline for 5,000 companies is not one hour-long task that needs to succeed once. It is thousands of tasks that all need to run reliably, with failure rates multiplying across steps. If each step succeeds 80% of the time, a ten-step workflow completes correctly only 10.7% of the time. At 50% per step, the final probability is roughly 0.1%.
It is like passing through ten security gates, each with an 80% chance of letting you through. That sounds acceptable until only one person in ten reaches the other side. Everyone else is stuck at a different gate, luggage scattered across the floor while security calls for assistance.
Our pipeline avoids this problem not because it is more intelligent, but because 90% of its steps are deterministic code. When inputs and external systems behave within known boundaries, those steps do not reconsider their behavior on every run. The LLM handles the classification step, where probabilistic output is acceptable, can be verified, and can be constrained by later mechanisms.
METR’s data explains precisely why this architecture works: confine the unreliable component to the part where uncertainty does not cause the whole system to lose control.
This is not merely a statistical curiosity. An agent that processes 80% of companies correctly but silently fails on the other 20% may be worse than useless for institutional work, because you do not know which 20% failed. The point of a production system is to succeed verifiably or fail explicitly: graceful degradation, not silent corruption.
The gap between 50% and 80% is the difference between an excellent demo and “I can entrust critical business data to this.”
Two Anthropic Studies Both Say: Not Yet
Anthropic published two related labor-market studies across 2025 and 2026. They point in the same direction while measuring different things.
Study One: Mapping Economic Tasks
The first study used millions of real Claude conversations to map how AI was actually being used across the economy. Its central findings:
AI remains far below its theoretical ceiling. Even computer programmers — among the occupations with the highest measured AI coverage, at roughly 75% of tasks — still had about a quarter of their tasks untouched. Most occupations were below one-third.
57% augmentation, 43% automation. The dominant pattern in professional AI use was collaboration between people and AI, not AI autonomously executing complete tasks. The “agent replaces the person” pattern was the minority, not the norm.
In other words, most people use AI as the prep cook chopping ingredients, not as the person left alone to run the entire kitchen.
Study Two: A New Measure and Early Employment Evidence
The second study, Labor Market Impacts of AI: A New Measure and Early Evidence, was published on March 5, 2026. It introduced “observed exposure,” combining theoretical LLM capability with real usage data, then compared that measure with employment outcomes.
The headline result: from late 2022 through the study period, unemployment did not systematically increase in highly exposed occupations. The estimated employment effect was statistically indistinguishable from zero.
That does not mean AI has had no effect. The paper explicitly warns that enterprise adoption may lag, public usage data does not capture every internal deployment, and the available evidence remains an early signal.
For a general reader, the more important message is structural: higher exposure is associated with pressure on employment prospects, but not with a single uniform outcome.
On average, more exposed occupations have weaker projected employment growth. Individual occupations still diverge. Some roles look more directly substitutable and may feel pressure earlier. Others use AI to amplify output and may retain positive growth even with substantial exposure.
That is the heart of the leverage gap: high exposure does not imply one shared destiny. The question is not only whether you use AI. It is whether AI replaces your core value or amplifies it.
One terminology point matters. In the US Bureau of Labor Statistics occupational system, “computer programmers” and “software developers” are separate occupations. Programmers appear highly exposed in the research; software developers are a different category, and the two results should not be blended together.
Even within software development, the variance is large. Standardized implementation work performed against a fixed specification is more exposed. Work depending on domain context, architectural trade-offs, and expert judgment is less exposed.
There is another early signal worth watching, but it needs careful language. Within highly exposed occupations, the employment entry rate for workers aged 22–25 was roughly 14% below its pre-ChatGPT trend, while older workers did not show the same clear change. The paper stresses that the estimate is imprecise, may reflect different pre-existing trends, and cannot establish that AI caused the decline.
Still, the possibility deserves attention: AI may not have replaced experienced engineers, but it may be reducing the opportunities through which new engineers become experienced. Part 6 will return to this.
Taken together, the two studies offer one key insight: autonomous task execution — the mode promised by agents — is still not the dominant pattern of professional AI use today. Augmentation is.
That puts the pipeline discussed in this series in an interesting position. The engineer designing the architecture is augmentation. The LLM performing classification inside that architecture is automation applied to the one task where it fits.
More interestingly, high exposure may mean not only higher substitution risk but stronger augmentation leverage. The important work is the architecture around the AI, not the AI alone.
That is what the data says. But which frameworks explain why the numbers look this way? Karpathy’s Software 1.0/2.0/3.0, the SWE-CI long-term maintenance benchmark, and the long-tail problem all point in the same direction.
Next: Part 5B — What the Research Actually Says: The Frameworks.
Series Map
| Part | Core idea |
|---|---|
| 00 — Introduction | Why this series exists |
| 01 — The Impossible Task | Where everything started |
| 02 — Where 7,400+ Lines Came From | How the pipeline snowballed |
| 03A — Brain and Body | LLM = 10% brain, code = 90% body |
| 03B — Six Simple-Looking Problems | Edge cases that break agents |
| 04 — An Honest Comparison | Pipeline vs. agents, by the numbers |
| 05A — What the Research Says: Data | You are here |
| 05B — What the Research Says: Frameworks | The “why” behind the data |
| 06 — The Leverage Gap | Who benefits from AI, and who does not |
| 07 — Context Accumulation | What agents never learn |
| 08A — The Delegation Problem | Why you cannot simply throw it over the wall |
| 08B — The Autonomy Spectrum | Finding the right level |
| 09 — The Other Extreme | When skepticism becomes paralysis |
| 10 — Two Rooms | Demo enthusiasts vs. domain skeptics |
| 11 — The Evidence | The pipeline as evidence |
| Bonus — The Counterargument | AI argues against the entire series |
| Bonus — Standing in the Middle | The thought that arrives in the middle of the night |
References
- METR. “Measuring AI Ability to Complete Long Tasks.” v1.1, updated May 2026.
- Handa, K. et al. “Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations.” arXiv:2503.04761, 2025.
- Massenkoff, M. & McCrory, P. “Labor Market Impacts of AI: A New Measure and Early Evidence.” Anthropic, 2026.