SB
2026-04-20 · 37 min read

After the fine-tune: what agent scaffolding does that training cannot

A training run produces capability. The scaffold produces the system. On identical weights, scaffold choices swing task success by ten to twenty points. Here is the case study, the evidence, and the math.

Last weekend I was in Lausanne, walking the lakefront between Ouchy and the Château de Chillon. The castle has stood on that rock for nearly a thousand years, and the thing that strikes you up close is how much of its survival is owed to the scaffolding around it: the curtain walls, the defensive towers added in stages across five centuries, the adaptations that made a medieval fortress useful long after trebuchets stopped being the relevant threat. I took the train back to Bern that evening thinking about the model I had been training, and the gap between what a good core structure gets you and what it takes to keep that structure useful in the real world. The piece that follows is what came out of that train ride.

Château de Chillon, Lake Geneva

During that weekend, I trained a LoRA adapter on Qwen3.5-4B to produce structured tool calls for a twenty-tool agent system, published it openly, and ran it through a seventy-entry behavioral evaluation. Between the first training round and the second, the training loss stayed essentially flat, moving from 0.0659 to 0.0663. Token accuracy on the validation set stayed identical at 98.10 percent. The behavior the second round was specifically designed to improve moved by seventy-six percentage points.

That outcome is worth examining carefully. It tells you something about the limits of what a training run can see on its own, and it points at where the rest of the work actually happens once a trained model is deployed inside an agent system. This article walks through the case study first, then zooms out to what the literature and the benchmarks say about the layer between a fine-tuned model and a working agent. That layer, often called agent scaffolding or the agent harness, turns out to have more leverage on production outcomes than most training runs do. The empirical data on that claim is surprisingly consistent across benchmarks.

The two rounds

The base model was Qwen3.5-4B in bf16. The LoRA configuration used rank 128 with alpha 256, giving roughly 84.9 million trainable parameters, about 1.98 percent of the total. Training ran for two epochs on a single H100 80GB, using bf16 with Flash Attention 2, sequence length 2048, effective batch size 64, peak learning rate 2e-4 with cosine decay. Wall time was four hours. Cost was about fifteen dollars.

The training mix was the Salesforce xLAM function-calling dataset at 58,716 examples, supplemented by a curriculum of 630 synthetic examples designed to teach three specific behaviors. The curriculum was upweighted three times to 1,890 examples, bringing its share of the total training set to 3.1 percent. The three behaviors were: structured tool-call emission with valid JSON schemas; governance compliance (always call a confirmation tool before any destructive action); and direct-answer discrimination (recognize settled-knowledge questions and answer without calling tools at all).

Round one produced a model that was strong on the first two behaviors and weak on the third. On the seventy-entry behavioral evaluation, the "Should Not Call" family (Family F) passed one entry out of eight, a twelve percent pass rate. Governance compliance on entries that did require tool calls was ninety-nine percent. Two of seventy outputs contained unparseable tool-call JSON.

Round two used the same base and training configuration, with the curriculum adjusted: four hundred additional "Should Not Call" examples, one hundred and fifty parallel-search examples, and eighty threading-order examples. Same epochs, same hyperparameters. The training curves were monotonic and showed no signs of overfitting, with the train/eval loss gap staying below 0.001 throughout.

Final numbers: training loss moved from 0.0659 to 0.0663, a rounding error in either direction. Token accuracy stayed at 98.10 percent. On the behavioral evaluation, Family F moved from one of eight to seven of eight. Governance compliance reached one hundred percent. Unparseable outputs dropped to zero of seventy. The full model card and evaluation artifacts are public on HuggingFace.

What each metric could see

The training loss is a cross-entropy objective averaged across every token in the training batch. It aggregates the model's fit to the entire training distribution. Token accuracy is the per-token argmax match against the reference output. Both are averages, weighted by the frequency of each token class.

Family F entries represent roughly ten percent of the behavioral evaluation set, and the behavior it probes is exercised by only a tiny slice of the training distribution. A seventy-six-point swing on Family F, expressed as a change in the average training loss, is small enough to disappear into noise at the fourth decimal place. That is what happened.

The training harness performed exactly as designed. It told the training loop that round two was fitting the data at about the same quality as round one. A cross-entropy average is built to summarize a distribution, and a summary cannot reveal a localized movement within its own averaging.

The seventy-entry behavioral evaluation caught the movement because it did not aggregate in the same way. It scored each family of behaviors separately, and within each family it scored each entry on a binary pass or fail. Family F had eight entries, each explicitly designed to test whether the model would refrain from calling tools when the user's query was settled professional knowledge. When seven of eight flipped from fail to pass, the family score moved from twelve percent to eighty-eight percent and stayed visible.
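The family-wise scoring described above can be sketched in a few lines. The entry shape and the field names here are illustrative assumptions, not the actual evaluation format:

```javascript
// Score a behavioral evaluation by family: each entry is a binary
// pass/fail, and each family reports its own pass rate rather than
// being folded into one aggregate number.
// The { family, passed } entry shape is illustrative.
function scoreByFamily(entries) {
  const families = {};
  for (const { family, passed } of entries) {
    if (!families[family]) families[family] = { passed: 0, total: 0 };
    families[family].total += 1;
    if (passed) families[family].passed += 1;
  }
  const report = {};
  for (const [name, { passed, total }] of Object.entries(families)) {
    report[name] = { passed, total, rate: passed / total };
  }
  return report;
}
```

A Family F score of one of eight surfaces as a 12.5 percent rate no matter how well the other families do, which is exactly the visibility the aggregate loss lacks.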

The observation here is structural. Training loss can tell you whether the model is fitting its data. A behavioral evaluation with well-chosen families can tell you whether specific capabilities are present or absent. These are different questions, and they require different instruments. A training run, on its own, cannot tell you whether an agent will behave correctly in production, because behavior in production is not a cross-entropy average over a token stream.

This is where the training harness stops doing useful work, and where the next layer of engineering begins.

The training harness as a starting line

A fair amount of academic literature on language model evaluation still treats the evaluation harness as the terminal step. Xia and colleagues at CSIRO's Data61 surveyed this recently in their 2024 paper on evaluation-driven development for LLM agents, and observed that 97.76 percent of academic evaluation sources still rely on static, predefined benchmarks. They argue that this static approach cannot capture the adaptive, emergent behaviors that define agents in real-world deployments.

Benchmarks like MMLU, HumanEval, MT-Bench, and the xLAM function-calling dataset are designed to answer a specific question: given a clean input and a curated reference output, does the model produce approximately the right thing? That question is the right one to ask during training. It measures whether the training signal has shaped the model's weights in a useful direction, and it gives you a comparable score across runs.

What static benchmarks cannot answer is the set of questions that start to matter once the model is deployed. How does the model behave when given a messy user input that does not resemble any training example? How does it chain tool calls when the output of one call informs the input of the next? What does it do when a tool returns an error? When the context window fills up? When another agent in the pipeline hands it a partially completed task? When the user changes the subject halfway through? When the correct action is to refuse and ask for clarification?

None of these questions can be answered by a better loss curve. They can only be answered by the runtime system that wraps the model, observes its outputs, repairs its mistakes, manages its context, and coordinates its interactions with the rest of the world. That runtime system is the agent scaffold.

The Qwen adapter discussed above, even at its best, produces a single generation. It emits tokens, which form text, some of which parses as tool calls. What happens to those tokens after they leave the model is a separate set of engineering problems, and it is where the majority of variance in production outcomes comes from.

What an agent scaffold actually contains

The term "agent scaffold" is used casually enough that it helps to be specific. In the SWE-bench evaluation literature, it has a concrete meaning. Anthropic's 2024 write-up of SWE-bench described scaffolding as "the software around the model" that is responsible for "generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt." That definition covers the core mechanics. In production agents, the list extends further.

At minimum, a production agent scaffold contains the following.

A system prompt and tool surface definition. The scaffold decides which tools the model can see on any given turn, how they are described, what their input schemas are, and how they are formatted in the context window. A model that was trained on a twenty-tool surface can be deployed with access to five, fifty, or two hundred, and its effective behavior will differ dramatically across those configurations. Tool naming, description length, example usage, and schema structure are all scaffold-level decisions that the model has no control over.
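A minimal sketch of that routing decision, assuming each tool carries a relevance tag and the conversation state exposes a topic (both of which are hypothetical fields for illustration):

```javascript
// Tool surface selection: choose which tool schemas to expose to the
// model on this turn. Rank tools by a crude relevance signal and cap
// the surface size. The `relevantTo` tagging is an assumption made
// for this sketch, not a property of any real tool schema.
function selectToolSurface(tools, conversationState, maxTools) {
  const scored = tools
    .map((t) => ({
      tool: t,
      score: t.relevantTo.includes(conversationState.topic) ? 1 : 0,
    }))
    .sort((a, b) => b.score - a.score); // stable sort keeps original order on ties
  return scored.slice(0, maxTools).map((s) => s.tool);
}
```

Even this toy version makes the trade-off concrete: the cap shrinks the context, but a misclassified topic hides the tool the model actually needs.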

An output parser. The scaffold takes the raw token stream the model produces and attempts to extract structured actions. For a tool-calling model, this usually means finding tool-call blocks, parsing their JSON, and validating arguments against the declared schema. When the model produces partially malformed output (a missing comma, a truncated field, a field with the wrong type), the parser decides what happens next.

A repair loop. When parsing fails or a tool call is invalid, the scaffold can retry, ask the model to correct its output, fall back to a simpler format, or surface the failure to the user. The design of the repair loop often determines whether a five percent parse-failure rate shows up as a five percent user-facing failure rate or closer to zero.
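A minimal repair loop might look like the following sketch. The `parse` and `regenerate` callbacks are placeholders for whatever parser and model-retry mechanism the scaffold actually uses:

```javascript
// Minimal repair loop: try to parse the model's raw output; on failure,
// feed the parse error back and regenerate, up to maxRetries times,
// before surfacing the failure. `parse` and `regenerate` stand in for
// the scaffold's real parser and model call.
function parseWithRepair(rawOutput, parse, regenerate, maxRetries = 2) {
  let output = rawOutput;
  for (let attempt = 0; ; attempt++) {
    try {
      return { ok: true, action: parse(output), attempts: attempt + 1 };
    } catch (err) {
      if (attempt === maxRetries) {
        return { ok: false, error: err.message, attempts: attempt + 1 };
      }
      // Hand the parse error back to the model and try again.
      output = regenerate(output, err.message);
    }
  }
}
```

If retry outcomes were independent, a five percent parse-failure rate would drop to roughly 0.05³ ≈ 0.01 percent after two repair attempts, which is the mechanism behind the "five percent to near zero" claim above.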

A context manager. As conversations extend, as tool outputs accumulate, and as the model approaches its context limit, the scaffold decides what to retain, what to summarize, and what to discard. Anthropic's 2025 work on context engineering reports a 39 percent performance improvement and an 84 percent reduction in token usage on long tasks when combining their memory tool with context editing, compared to running the same model without those components.

A state and trajectory manager. The scaffold tracks which tools have been called, with what arguments, returning what results, across the lifetime of a task. It decides which of these are relevant to the next turn and how they are presented to the model. For multi-step tasks, this becomes one of the more consequential engineering surfaces in the entire system.

An orchestration graph. When more than one agent or model is involved, the scaffold decides who does what: which agent handles routing, which handles specialist subtasks, which aggregates results, and what happens when any of them fails. Anthropic's "Building Effective Agents" guide enumerates five canonical patterns here: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. Each is a scaffold-level architectural choice.

Governance gates. Many production systems require confirmations before destructive actions, audit logging for regulated contexts, and human-in-the-loop approval on certain trajectories. These are not model behaviors. They are scaffold behaviors that the model participates in through specific tool calls.

Observability. Every tool call, every repair, every retry, every context edit produces a trace. The scaffold decides what is captured, what is surfaced for debugging, and what feeds back into the evaluation harness for the next training iteration.

Each of these components is a design surface. Each has a default, and each has better-than-default configurations that can meaningfully change the system's behavior. The Airbyte review of context engineering from early 2026 observes that "most agent failures come from missing, stale, or incorrectly scoped context, rather than from limited model capabilities. Prompting and model selection cannot fix these problems because they operate after data has already been retrieved."

That framing is consistent across the literature. The model is one component of the system. The scaffold is the other component. The system as a whole is what the user experiences.

The empirical weight of scaffolding choices

The clearest data on scaffold-versus-model impact comes from SWE-bench. The benchmark measures an agent's ability to resolve real software engineering tasks from open-source GitHub repositories, and it runs the model inside a scaffold that handles code navigation, file editing, test execution, and the iteration loop between them.

Because SWE-bench evaluates "the entire agent system" rather than the model alone, scaffold choice produces a very large spread on identical weights. OpenAI's 2024 technical post on SWE-bench Verified put hard numbers on this. GPT-4 on SWE-bench Lite produced scores ranging from 2.7 percent using an early RAG-based scaffold up to 28.3 percent using the CodeR scaffold. Same model, same weights, more than a ten-fold swing in task success. On the harder SWE-bench Verified set, GPT-4o moved from 16 percent on the original SWE-bench setup to 33.2 percent with the Agentless scaffold. The model did not change. The software around it did.

That pattern holds across newer models. In February 2026, Augment Code published a comparison of four coding agents on SWE-bench Pro, three of which (Auggie, Cursor, Claude Code) ran identical Claude Opus 4.5 weights. Auggie solved 51.80 percent of tasks. Cursor solved 15 fewer problems out of 731. Claude Code solved 17 fewer. The performance gap came from agent architecture rather than from the model, which was the same in all three systems.

The Scale SWE-Bench Pro leaderboard documents a similar phenomenon at the frontier. The best-performing models, GPT-5 and Claude Opus 4.1, score 23.3 percent and 23.1 percent respectively on Pro when run through Scale's SWE-Agent scaffold. These same models score over 70 percent on SWE-Bench Verified with better-tuned scaffolds on easier tasks. The scaffold, the task difficulty, and the model all contribute to the final number. At the top of the leaderboard the scaffold is often the dominant factor.

The benchmark designers are aware of this. The Vals.ai SWE-bench evaluation specifically uses a minimal bash-only harness, reasoning that "the benchmark's dual evaluation of both the agentic harness and the underlying foundation model" makes cross-model comparison difficult otherwise. They constrain the scaffold deliberately to isolate model capability from scaffold design, because a good scaffold can mask a weak model and a weak scaffold can hide a strong one.

The Digital Applied analysis of the 2026 SWE-bench leaderboard phrases the same observation in operational terms: "agent scaffolding around a model, such as iteration budget, tool availability, reflection loops, and repository navigation heuristics, can swing results by ten to twenty percentage points on identical underlying weights." Iteration budget alone (how many back-and-forth turns the agent gets before giving up) commonly adds several points when raised from five to twenty.

The same separation of concerns shows up in function-calling benchmarks. The Berkeley Function Calling Leaderboard, now in its fourth major version, splits its evaluation into sub-benchmarks that probe different capabilities: simple single-call accuracy, multi-call composition, parallel calls, multi-turn conversations, and the "relevance" tests that check whether the model abstains when no tool applies. The maintainers report that "top AIs ace the one-shot questions but still stumble when they must remember context, manage long conversations, or decide when not to act." Single-call accuracy is a model property. The rest depend on how the scaffold constructs and maintains the conversation.

None of this means the model does not matter. A better model, paired with a better scaffold, will always beat a worse model paired with the same scaffold. What the data shows is that the scaffold is often the dominant variance source at any given capability frontier, which has implications for where engineering effort produces the largest returns.

Orchestration compounds small errors

Agent systems almost always chain multiple steps, and the math of chaining is unforgiving. If a single step succeeds ninety-five percent of the time, and the task requires twenty independent steps, the overall success rate is 0.95 raised to the twentieth power, which is about thirty-six percent. More than half of the runs fail before completion. At ninety-nine percent per-step reliability, which is extremely optimistic for current tool-using LLMs, a twenty-step task succeeds about eighty-two percent of the time. One run in five still fails somewhere along the chain.

Those numbers come from a standard application of the reliability-in-series formula to LLM agent pipelines. They are why demos and production systems behave so differently. A demo optimizes the happy path with clean inputs and perfect conditions. Production hits every edge case, every rate limit, every stale cache, every malformed tool output, every ambiguous user instruction. Each of those triggers a step-level failure, and each step-level failure propagates through the rest of the chain.

The academic literature has started to catalog how these failures actually manifest. The Multi-Agent System Failure Taxonomy, or MAST, published at NeurIPS 2025 by researchers at UC Berkeley's Sky Computing Lab, is the first large-scale empirical study of why multi-agent LLM systems fail. The work analyzed 1,642 execution traces across seven popular multi-agent frameworks (MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, AG2) and identified fourteen distinct failure modes, clustered into three categories: system design issues, inter-agent misalignment, and task verification gaps.

The prevalence data is instructive. System design issues (specification ambiguity, role definition problems, coordination protocol gaps) accounted for the largest share of observed failures. Inter-agent misalignment (information withheld during handoffs, task derailment into irrelevant discussion, context loss between agents) came second. Task verification failures (generators acting as their own verifiers, insufficient output checking) came third. Across all three categories, the issues were rarely model-level. They were scaffolding-level and orchestration-level.

The production data matches the research. The τ-bench benchmark, published by researchers at Sierra and Princeton, evaluates agents on customer-service-style interactions with a simulated user and a real database, scoring success by comparing the final database state against the ground truth. On τ-bench, state-of-the-art function-calling models like GPT-4o succeed on under fifty percent of tasks in the retail domain. The pass^8 metric, which asks whether the agent can complete the same task correctly eight times in a row, drops below twenty-five percent for the best models on retail. Reliability, at the consistency level that production systems require, is significantly lower than single-shot accuracy suggests.

The market is absorbing this. In June 2025, Gartner forecast that more than forty percent of agentic AI projects would be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The analyst framing was blunt: "Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied. This can blind organizations to the real cost and complexity of deploying AI agents at scale, stalling projects from moving into production." Gartner estimated that of the thousands of vendors marketing "agentic AI," only about 130 offered what they considered genuine agentic capabilities.

The throughline connecting MAST, τ-bench, and the Gartner forecast is the same one the SWE-bench data surfaces. The gap between a model that can do a task once under ideal conditions and a system that does the task reliably under production conditions is filled almost entirely by scaffolding engineering. Compounding errors can be mitigated by smaller step granularity, better retry logic, explicit verification stages, and orchestration patterns that isolate failure domains. Each of those is a scaffold-level decision. None of them are produced by a better loss curve.

Anthropic's "Building Effective Agents" guide offers a practical framing of the available mitigations. Their five workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer) are essentially scaffolding templates, each with different trade-offs around reliability, cost, latency, and observability. Prompt chaining works when a task can be decomposed into a sequence of well-defined subtasks with verifiable intermediate outputs. Routing works when different inputs need different specialized handlers. Parallelization (sectioning or voting) works when independent subtasks can run simultaneously, or when multiple attempts improve confidence. Orchestrator-workers works when the decomposition is dynamic and determined at runtime. Evaluator-optimizer works when an output can be iteratively refined by a critic. The right choice depends on the task, the reliability requirements, and the cost budget. All of them are scaffolding-level architectural decisions made after the model is already trained.

Back to the case study

With that framing in hand, it is worth walking back through what it actually takes to deploy the Qwen3.5-4B adapter in a production setting. The trained adapter, at its final evaluation, calls tools correctly on 100 percent of the entries that require tool calls, abstains correctly on 88 percent of the entries that should receive a direct answer, and produces zero unparseable outputs across all seventy evaluation entries. Those numbers describe the capability that the training run produced.

Making that capability useful in production requires a scaffold that handles at least the following.

Tool surface exposure. The production system has to decide how to present the twenty-tool toolset to the model on each turn: whether to expose all twenty every time, or to route to a reduced subset based on the conversation state. Exposing all twenty bloats the context and may degrade the model's direct-answer discrimination, since the model was trained on the full surface and a trimmed surface is a different distribution. Exposing a reduced subset requires a routing decision made before the model runs, which is itself a scaffold design question with its own reliability characteristics.

Output parsing. The training encouraged the model to emit tool calls in a specific <tool_call>{...}</tool_call> format. The scaffold has to extract those blocks from the raw token stream, parse the JSON inside, validate the argument types against the tool's declared schema, and handle the cases where any of those steps fails. Zero unparseable outputs in evaluation does not guarantee zero unparseable outputs in production under novel inputs, especially as context length grows beyond the training sequence length of 2048.
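Extraction for that format can be sketched with a regular expression plus JSON parsing. This is a minimal sketch: a production parser would also handle truncated blocks and validate arguments against the tool's declared schema.

```javascript
// Extract <tool_call>{...}</tool_call> blocks from a raw token stream
// and parse the JSON inside each one. Malformed blocks are collected
// as errors rather than silently dropped, so the repair loop can act.
function extractToolCalls(text) {
  const calls = [];
  const errors = [];
  const re = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match;
  while ((match = re.exec(text)) !== null) {
    try {
      calls.push(JSON.parse(match[1]));
    } catch (err) {
      errors.push({ raw: match[1], error: err.message });
    }
  }
  return { calls, errors };
}
```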

Repair on partial failures. The one remaining behavioral failure on the evaluation is an entry where the model still over-calls on ISO 27001 certification timelines. That is a known class of input that the trained model gets wrong roughly twelve percent of the time within its family. A production scaffold can catch this in several ways: by detecting the unnecessary tool call before executing it (a cheap classifier on the first few tokens of output), by running a second pass with a more constrained tool surface, or by surfacing the result to the user with an option to reject the tool call. None of these are model changes. They are scaffold behaviors that compensate for the remaining failure mode without requiring another training run.

Context window management. The training context was 2048 tokens. A production conversation can exceed that in a few turns, especially once tool outputs (search results, document snippets, database records) start accumulating. The scaffold has to decide when to summarize, what to summarize, and what to drop. Context engineering decisions interact with the governance tool: if the "ask before write" confirmation tool is called on turn three and the approval comes on turn five, the scaffold has to retain that state when the conversation moves forward. A naive summarizer that drops the confirmation record is a scaffold-level defect with model-safety consequences.
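One way to avoid that defect is to pin governance records so they survive trimming. The message shape, the `pinned` flag, and the token counter here are assumptions made for the sketch, not any particular framework's API:

```javascript
// Context trimming that never drops pinned entries (for example,
// confirmation records from the governance tool). Oldest unpinned
// entries go first until the estimated token budget is met.
function trimContext(messages, tokenBudget, countTokens) {
  const total = (msgs) => msgs.reduce((sum, m) => sum + countTokens(m), 0);
  const kept = [...messages];
  // Walk from the oldest message forward, dropping unpinned ones.
  for (let i = 0; i < kept.length && total(kept) > tokenBudget; ) {
    if (!kept[i].pinned) kept.splice(i, 1);
    else i++;
  }
  return kept;
}
```

A summarizer slots into the same interface: replace the splice with "summarize and mark as pinned," and the invariant that governance records survive is preserved either way.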

Governance gate enforcement. The model was trained to call the confirmation tool before any destructive action, and it does so reliably on the evaluation set. The scaffold still has to enforce that the confirmation tool's response is honored. If the confirmation is rejected, the scaffold must not allow the subsequent tool call to execute. If the confirmation is granted, it must be logged for audit. The governance behavior is a collaboration between the model, which knows to ask, and the scaffold, which knows to enforce. A model that asks reliably paired with a scaffold that does not enforce is not a governed system.
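The enforcement half of that collaboration can be sketched as a gate sitting between the parser and the tool executor. The tool names, the approval-record shape, and the audit log are illustrative assumptions:

```javascript
// Governance gate: a destructive tool call executes only if a matching
// approval has been recorded, and every decision is appended to an
// audit log. Record shapes here are illustrative.
function executeWithGate(call, destructiveTools, approvals, execute, auditLog) {
  const destructive = destructiveTools.has(call.name);
  const approved = approvals.some((a) => a.tool === call.name && a.granted);
  if (destructive && !approved) {
    auditLog.push({ call, action: "blocked" });
    return { executed: false, reason: "missing or rejected confirmation" };
  }
  auditLog.push({ call, action: "executed" });
  return { executed: true, result: execute(call) };
}
```

The key design point is that the gate does not trust the model's behavior: even a model that asks for confirmation one hundred percent of the time is backed by a check that fails closed.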

Observability and the next training loop. Every tool call, every parse, every repair, every context edit produces a trace that the scaffold records. These traces feed two downstream systems. The first is production monitoring: did the user's task actually complete, and if not, where did it fail? The second is the next training iteration's evaluation set: what new failure families need targeting, what behaviors have degraded, what edge cases are coming up in production that were not represented in the original seventy-entry eval? A scaffold that does not record traces produces a system that cannot improve. The relationship between the scaffold and the training harness is therefore bidirectional. The scaffold generates the data that the next evaluation family is built from, and the evaluation family generates the training signal that moves the model's behavior on that family.

Each of these is an engineering decision with meaningful variance on outcomes. Based on the SWE-bench comparisons cited above, the total variance from scaffold design choices on a complex task is often larger than the variance between two adjacent fine-tuning rounds. The Qwen adapter's seventy-six-point gain on Family F was the result of a fifteen-dollar, four-hour training run. Reaching a similar order-of-magnitude improvement on end-to-end task success in production typically requires scaffold work that takes months, because scaffold iterations depend on production traces that can only be gathered at deployment scale.

Closing observations

Two conclusions are worth stating directly from the data, and worth leaving for the reader to consider in the context of their own systems.

The first is that a training run produces capability, which is a necessary condition for a working agent and is not a sufficient one. A well-designed training harness, one that scores behavioral families separately rather than only averaging cross-entropy, is what makes capability visible during training. The seventy-six-point gain on Family F in the Qwen case study would have been invisible to a training loop that only watched the loss curve. The behavioral evaluation harness was what let the training loop be useful. Teams that fine-tune without a behavioral eval are running training that cannot see its own outcomes.

The second is that once the capability is produced, the engineering surface moves. The SWE-bench literature shows consistently that scaffold choice produces ten-to-twenty point swings on identical model weights, sometimes considerably larger. The MAST taxonomy locates the majority of observed multi-agent failures at the system design and coordination level rather than at the model level. The τ-bench reliability numbers and the Gartner forecast both suggest that production-grade agent work is bottlenecked by the layer above the model rather than by the model itself. Context engineering, orchestration patterns, repair loops, and state management are where most of the remaining variance lives once a capable model is in hand.

For teams building agent systems, the practical implication is that training and scaffolding are two separate engineering disciplines with different skill requirements, different iteration speeds, and different returns on effort. A fifteen-dollar training run can move a specific behavioral family by seventy-six points when the training data is well-targeted. A scaffold redesign can move end-to-end task success by as much again, on the same model weights. The questions worth asking at the start of an agent project are which of the two is currently the bottleneck, and what instrument is required to measure it.

The data suggests those are separate questions, with separate answers, asked at separate points in the lifecycle. Training tells you what the model can do. The scaffold tells you what the system actually does. Both matter. The difference is where the next hour of engineering time should go.


Notes and sources

Case study. Full training artifacts, evaluation set, and model card: huggingface.co/Bharambe-NL/mimir-qwen3-4b-lora-v2. Training code and seed curriculum: github.com/Bharambe-NL/mimir-training.

SWE-bench scaffold variance. OpenAI, Introducing SWE-bench Verified, August 2024. Anthropic, Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet, October 2024. Augment Code, Auggie tops SWE-Bench Pro, February 2026. Scale AI, SWE-Bench Pro Leaderboard, 2026. Digital Applied, SWE-Bench Live Leaderboard Q2 2026: Complete Deep Analysis, April 2026. Vals.ai SWE-bench methodology.

Multi-agent failure taxonomy. Cemri et al., Why Do Multi-Agent LLM Systems Fail?, NeurIPS 2025 Datasets and Benchmarks Track (spotlight), arXiv:2503.13657. MAST dataset of 1,642 annotated execution traces across seven frameworks.

τ-bench. Yao et al., τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, ICLR 2025, arXiv:2406.12045. τ²-bench dual-control extension: Barres et al., arXiv:2506.07982.

Function calling benchmark. Patil et al., The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models, ICML 2025, PMLR v267.

Evaluation-driven development. Xia et al., Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture, arXiv:2411.13768, v3 (November 2025). CSIRO Data61.

Orchestration patterns. Anthropic Engineering, Building Effective Agents, December 2024. Anthropic, How we built our multi-agent research system, 2025.

Context engineering. Anthropic, Effective Context Engineering for AI Agents, September 2025. Airbyte, What Is Context Engineering?, 2026.

Market forecast. Gartner, Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027, press release, 25 June 2025, Sydney.


Appendix: the math behind the reliability simulator

A companion to the interactive widget above. This appendix documents the formulas, assumptions, and calibration process behind every number the simulator produces, so readers can verify the arithmetic, challenge the modeling choices, and adapt the approach to their own systems.

A.1 The core model

The simulator treats an agent pipeline as a chain of independent steps, each with its own per-step reliability. The end-to-end success of the pipeline is the probability that every step succeeds in sequence.

End-to-end success. Let r denote the per-step reliability (the probability that a single step produces a correct, usable output) and n denote the number of steps in the pipeline. Assuming step-level failures are independent, the probability that the entire pipeline succeeds is:

P_end-to-end(r, n) = r^n

In code:

function endToEndSuccess(r, n) {
  return Math.pow(r, n);
}

This is the standard reliability-in-series formula from engineering. It applies to any system where every component must succeed for the overall system to succeed, and the components fail independently. The formula has been used to reason about everything from mechanical assemblies to distributed software systems long before LLM agents existed.

Pass^k. The pass^k metric, introduced in the τ-bench benchmark paper, measures whether an agent can complete the same task correctly k times in a row. It answers the question: given that a single run succeeds with probability P_end-to-end, what is the probability that k independent runs all succeed?

P_pass^k(r, n, k) = (P_end-to-end)^k = r^(n·k)

The simulator fixes k = 8, which is the default τ-bench setting. In code:

function passK(r, n, k) {
  return Math.pow(endToEndSuccess(r, n), k);
}

Pass^8 decays much faster than single-shot success because it compounds an already-compounded quantity. A pipeline that succeeds 50% of the time on any given run succeeds 8 times in a row only 0.5^8 ≈ 0.39% of the time. This is why τ-bench reports pass^8 below 25% for every model the benchmark tested on the retail domain, even though some of those models reach 50% pass@1.
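
Concretely, with numbers close to the τ-bench retail preset documented in A.3:

```javascript
// Single-run success vs. eight-in-a-row at r = 0.92 over n = 8 steps.
const single = Math.pow(0.92, 8);        // ≈ 0.513, roughly the pass@1 regime
const eightInARow = Math.pow(single, 8); // ≈ 0.0048, under half a percent
```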

Inverse calculation: reliability needed for a target end-to-end. When the simulator reports "to reach 50% end-to-end at the same step count, per-step reliability would have to climb to X%", it inverts the core formula. Given a target end-to-end success P_target and a fixed step count n, the required per-step reliability is:

r_required = P_target^(1/n)

In code:

function reliabilityNeededFor(targetE2E, n) {
  return Math.pow(targetE2E, 1 / n);
}

This inverse is what makes the tradeoff between pipeline length and reliability visible. Reaching 50% end-to-end success at 20 steps requires per-step reliability of 0.5^(1/20) ≈ 96.6%. Reaching the same 50% at 40 steps requires 0.5^(1/40) ≈ 98.3%. Doubling pipeline length adds roughly 1.7 percentage points to the required per-step reliability, which is a steep climb once the model is already above 95%.
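
The two checks in the paragraph above, reproduced directly:

```javascript
const at20 = Math.pow(0.5, 1 / 20);      // ≈ 0.966: 50% end-to-end at 20 steps
const at40 = Math.pow(0.5, 1 / 40);      // ≈ 0.983: same target at 40 steps
const extraPoints = (at40 - at20) * 100; // ≈ 1.7 percentage points of added demand
```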

A.2 Why the independence assumption is load-bearing, and where it breaks

The formula r^n assumes that step-level failures are statistically independent. In agent systems, this assumption is partially true and partially false, and understanding where it breaks matters for interpreting the simulator's outputs.

Where independence approximately holds. Independence is a reasonable approximation when each step is a distinct tool call with different inputs, different tool schemas, and different model reasoning requirements; when the scaffold resets conversational context between subtasks or summarizes intermediate results; and when failures are driven by random sampling from the model's output distribution rather than by systematic input patterns. In these cases, the failures at step i and step j are caused by different distributional edges of the model, and treating them as independent Bernoulli trials produces numbers close to what empirical benchmarks report.

Where independence breaks. Independence breaks, in both directions, in several common situations.

Correlated failures (worse than the model suggests). If step 1 produces a subtly wrong output that passes the parser but is factually incorrect, steps 2 through n all execute on contaminated input. The downstream steps might individually be 95% reliable on clean input, but their effective reliability drops to near zero once the upstream error has propagated. The chain's overall success is lower than r^n.
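
The gap between what parsers observe and what the task needs can be made concrete with a toy correction. The 2% silent-corruption rate below is an illustrative assumption, not a measured figure:

```javascript
// If a fraction pSilent of "passing" steps are silently wrong, the per-step
// rate a parser measures is still r, but true end-to-end success is lower.
function observedVsTrue(r, n, pSilent) {
  const observed = Math.pow(r, n);               // what r^n predicts from parser-level r
  const actual = Math.pow(r * (1 - pSilent), n); // success with no silent contamination
  return { observed, actual };
}

const { observed, actual } = observedVsTrue(0.95, 12, 0.02);
// observed ≈ 0.540, actual ≈ 0.424: the chain underperforms what r^n suggests
```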

Recoverable failures (better than the model suggests). If the scaffold detects a failure at step 3 and retries, the pipeline may eventually succeed even though the model failed once. With a well-designed repair loop, the effective per-step reliability (after retries) is higher than the model's single-attempt reliability. The chain's overall success is higher than r^n would predict from the raw model rate.

State-dependent failures (the model changes across steps). The model's per-step reliability is not a fixed constant. It varies with context length, with the specific tool being called, with the number of prior tool calls in the conversation, and with the complexity of the current subtask. A model that is 98% reliable on the first tool call can drop to 85% on the fifteenth tool call once the context is crowded and the task has drifted. The formula r^n assumes a fixed r, and the real behavior is r_1 · r_2 · ... · r_n where each r_i varies.
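
To see how much a drifting r_i matters, the fixed r can be replaced with a per-step schedule and the product taken directly. The geometric decay below is an illustrative assumption, not a measured curve:

```javascript
// Illustrative only: per-step reliability that erodes as context grows.
// The 0.99 decay factor per step is an assumption for demonstration.
function chainSuccess(rStart, n, decayPerStep) {
  let p = 1;
  let r = rStart;
  for (let i = 0; i < n; i++) {
    p *= r;            // this step succeeds with the current reliability
    r *= decayPerStep; // the next step is slightly less reliable
  }
  return p;
}

const fixedR = Math.pow(0.98, 15);             // constant-r model: ≈ 0.739
const drifting = chainSuccess(0.98, 15, 0.99); // eroding r: ≈ 0.257
```

A chain that starts at the same 98% but erodes 1% per step ends up far below what the constant-r formula predicts, which is why a single headline r can be misleading for long pipelines.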

How to read the simulator given these caveats. The simulator treats r as the effective per-step reliability after the scaffold's retry and repair mechanisms have done their work. That is why a modern coding agent at 95.7% per-step is plausible even though the underlying model might be closer to 85% reliable on any single attempt. The scaffold recovers the remaining ten percentage points through retries, reflection loops, and parser repair. This framing is what lets the simulator be useful as an intuition tool rather than a predictive model. It will not tell you your system's exact success rate, but it will show you how sensitively that rate depends on pipeline length and per-step quality, and where on the curve your system sits relative to published benchmarks.

A.3 Preset calibration

Every preset in the simulator was calibrated so that r^n produces the cited benchmark number. The process was: identify the benchmark and the reported end-to-end success rate, estimate a realistic step count for that benchmark's task profile, and solve for the per-step reliability that would produce the reported end-to-end. Formally, given a target end-to-end P and a step count n, the calibrated per-step reliability is r = P^(1/n). The step count is the modeling choice; reasonable people can pick different step counts for the same benchmark, and the calibrated r will shift accordingly.
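
The procedure is two lines of arithmetic; a small helper (the name `calibratePreset` is illustrative, not part of the widget's code) makes the round trip explicit:

```javascript
// Solve r = P^(1/n) for a reported end-to-end rate P and an assumed step
// count n, then verify that the round trip reproduces P.
function calibratePreset(targetE2E, n) {
  const r = Math.pow(targetE2E, 1 / n); // calibrated per-step reliability
  const check = Math.pow(r, n);         // should equal targetE2E up to rounding
  return { r, check };
}

const preset = calibratePreset(0.027, 12); // the GPT-4 early-RAG preset below
// preset.r ≈ 0.740, preset.check ≈ 0.027
```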

GPT-4 on SWE-bench Lite, early RAG scaffold. End-to-end target 2.7% (OpenAI technical post on SWE-bench Verified, reporting GPT-4 on SWE-bench Lite with an early RAG-based scaffold). Step count n = 12, a typical length for a code navigation loop on SWE-bench Lite: identify file, read, search for usage, propose edit, apply, test, repeat. Calculated r = 0.027^(1/12) ≈ 0.740. At 74% per-step across 12 steps, the pipeline produces 0.74^12 ≈ 2.70% end-to-end, matching the reported 2.7%. The same model on the CodeR scaffold reached 28.3%, which corresponds to per-step reliability around 0.900 at the same 12-step count. The ten-fold swing on identical model weights is what makes the SWE-bench literature so useful as a scaffold-versus-model lever demonstration.

Claude Opus 4.5 on SWE-bench Pro (Auggie). End-to-end target 51.8% (Augment Code's February 2026 report on SWE-bench Pro, Auggie scaffold). Step count n = 15, reflecting longer multi-file edit trajectories on SWE-bench Pro vs SWE-bench Lite. Calculated r = 0.518^(1/15) ≈ 0.957. At 95.7% per-step across 15 steps, the pipeline produces 0.957^15 ≈ 51.7% end-to-end. Cursor on identical Claude Opus 4.5 weights resolved 48.0% of the 731 problems (r ≈ 0.952 at the same step count), and Claude Code resolved 47.7% (r ≈ 0.952). The per-step reliability delta between the top scaffold and its competitors is about half a percentage point, yet it compounds to a roughly four-point task-success gap over 15 steps. This is the exact phenomenon the article describes: small scaffold advantages compound.

τ-bench retail (GPT-4o). End-to-end target ~50% (Yao et al., τ-bench paper, reporting GPT-4o below 50% pass@1 on retail domain). Step count n = 8, the average tool-call sequence length for τ-bench retail tasks (cancellations, modifications, multi-step lookups). Calculated r = 0.50^(1/8) ≈ 0.917; the simulator uses 0.920. At 92.0% per-step across 8 steps, the pipeline produces 0.92^8 ≈ 51.3% end-to-end. Pass^8 at this configuration is 0.513^8 ≈ 0.48%. The τ-bench paper reports pass^8 below 25% on retail for every model tested, which is a looser bound that accommodates the stronger models. The 0.48% here is well within that bound and reflects what an average-performing model produces on a full pass^8 evaluation.

4B adapter governance workflow (case study). Per-step reliability 99.9% (Qwen3.5-4B LoRA v2 behavioral eval: 100% governance compliance on 57 entries requiring tool calls, rounded conservatively to 99.9% for any single governance gate). Step count n = 5, a representative governance workflow: draft, confirm, approve, execute, log. End-to-end 0.999^5 ≈ 99.50%. This preset demonstrates the scaffold-level point in the article's closing section. Even at 99.9% per-step, a 5-step governed workflow drops to 99.5% end-to-end. The remaining half a percent of failures requires scaffold-level enforcement (idempotency checks, audit logging, human-in-the-loop fallback) rather than additional model training, because closing that gap through training alone would require moving the model from 99.9% to effectively 100%, which is not a meaningful target at current evaluation set sizes.

MAST ChatDev on ProgramDev. End-to-end target 33.3% (Cemri et al., MAST paper, NeurIPS 2025, reporting ChatDev correctness on ProgramDev benchmark). Step count n = 12, a typical multi-agent conversation length in ChatDev: CEO, CTO, programmer, reviewer, tester, with iteration. Calculated r = 0.333^(1/12) ≈ 0.912. At 91.2% per-step across 12 steps, the pipeline produces 0.912^12 ≈ 33.1% end-to-end. The MAST taxonomy attributes most of these failures to system design and coordination problems (specification ambiguity, role definition gaps, inter-agent information loss) rather than to the underlying model's capability. In this framework, that is equivalent to saying the effective per-step reliability is capped around 91% by the orchestration, independent of what the model could theoretically achieve with better context and clearer instructions.

Skilled human operator reference line. Per-step reliability 99%, an informal reference point for a skilled human executing a routine workflow. Step count n = 20, matching the article's default example. End-to-end 0.99^20 ≈ 81.79%. The human baseline is a reference line rather than a benchmark citation. It exists to anchor the reader: even an extremely reliable human operator fails on roughly one in five attempts across a 20-step workflow. Reliability at scale is a system property, and the humans we rely on for high-stakes processes (pilots, surgeons, financial controllers) succeed at their jobs through scaffolding (checklists, cross-checks, handoff protocols, audit trails) rather than through individual-step perfection.
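
As a cross-check, the six presets above can be recomputed in one pass from their (r, n) pairs:

```javascript
// (r, n) pairs as documented in this appendix; comments give the expected
// end-to-end rate each pair should reproduce.
const presets = [
  { name: "GPT-4, early RAG scaffold",      r: 0.740, n: 12 }, // ≈ 2.7%
  { name: "Opus 4.5 on SWE-bench Pro",      r: 0.957, n: 15 }, // ≈ 51.7%
  { name: "τ-bench retail (GPT-4o)",        r: 0.920, n: 8  }, // ≈ 51.3%
  { name: "4B adapter governance workflow", r: 0.999, n: 5  }, // ≈ 99.5%
  { name: "MAST ChatDev on ProgramDev",     r: 0.912, n: 12 }, // ≈ 33.1%
  { name: "Skilled human operator",         r: 0.990, n: 20 }, // ≈ 81.8%
];

for (const p of presets) {
  console.log(`${p.name}: ${(Math.pow(p.r, p.n) * 100).toFixed(1)}%`);
}
```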

A.4 What the simulator does not model

Several important effects are deliberately out of scope. They are listed here to pre-empt confusion about what the numbers mean.

Cost. The simulator does not model the token cost or latency of a pipeline. A 30-step pipeline at 99% per-step may succeed 74% of the time end-to-end, but it also costs 30 times as many tokens as a 1-step pipeline and takes 30 times as long to run. Real architectural decisions trade off success rate against cost and latency, and the simulator only shows one axis of that tradeoff.

Partial success. The model assumes binary per-step outcomes: each step either succeeds or fails. Real systems often produce partial successes, where an output is correct in some respects and wrong in others. Whether a partial success counts as a success depends on the downstream step's tolerance, which is itself a scaffold design choice.

Human-in-the-loop recovery. The simulator treats the pipeline as fully automated. In production, humans often intervene when an agent gets stuck, which dramatically shifts the effective end-to-end reliability. Modeling human-in-the-loop properly requires a different framework, because the per-intervention cost and the escalation thresholds matter.

Retry with different sampling. Many production scaffolds retry failed steps by sampling the model again with a different temperature or a slightly modified prompt. If the retries are independent, k retries at per-step reliability r give effective reliability 1 - (1 - r)^k, which can be substantially higher than r. The simulator does not apply this correction, because doing so would require the user to also specify the retry budget, which adds a third slider and dilutes the core intuition.
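
For readers who want to apply that correction by hand, a sketch:

```javascript
// Effective per-step reliability after up to k retry attempts, assuming the
// retries are genuinely independent (resampling only approximates this).
function withRetries(r, k) {
  return 1 - Math.pow(1 - r, k);
}

const raw = 0.85;                      // single-attempt reliability (example value)
const effective = withRetries(raw, 3); // ≈ 0.9966: three tries rarely all fail
```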

Orchestrator-level failure modes. The MAST taxonomy documents 14 failure modes, some of which (like role definition gaps or inter-agent misalignment) are not captured by any per-step reliability number. They are properties of the orchestration graph itself, and they show up as "pipeline looks correct step-by-step but produces the wrong answer overall." Modeling these requires graph-level analysis, which the simulator does not attempt.

A.5 How to adapt this for your own system

The simulator was built to produce intuition about published benchmarks. To make it predictive for a specific production system, a team would need four measurements.

Per-step reliability from production traces. Instrument the scaffold to record every tool call, its input, its output, and whether the output was correct, then aggregate across traces to estimate the effective per-step reliability after retries.

Actual step counts from production traces. The representative step count for a production task is rarely what the scaffold designer assumes; real user queries produce longer or shorter chains depending on task complexity.

Correlation between step failures. If errors at step 3 predict errors at step 5, the independence assumption breaks and the r^n formula overestimates end-to-end success. A team with good observability can estimate this correlation directly from their trace data.

The implied per-step reliability. Re-derive the per-step reliability that the observed end-to-end success implies under the independence assumption. The gap between that number and the directly measured per-step reliability is a diagnostic for how much failure correlation is present.

None of this is in the simulator, because the simulator is a teaching tool, not a diagnostic one. Its purpose is to let a reader adjust two parameters and see the curve bend. The diagnostic version serves a different purpose, and it would belong in an internal engineering tool rather than in a web widget attached to an article.
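
A minimal sketch of that last diagnostic, assuming per-step correctness flags are available from traces. The record shape here is hypothetical:

```javascript
// Hypothetical trace shape: each trace is an array of booleans, one per
// step, true if that step's output was judged correct.
function diagnose(traces) {
  let steps = 0, stepSuccesses = 0, fullSuccesses = 0;
  for (const trace of traces) {
    steps += trace.length;
    stepSuccesses += trace.filter(Boolean).length;
    if (trace.every(Boolean)) fullSuccesses += 1;
  }
  const measuredR = stepSuccesses / steps;   // directly measured per-step rate
  const e2e = fullSuccesses / traces.length; // observed end-to-end success
  const avgN = steps / traces.length;        // mean pipeline length
  const impliedR = Math.pow(e2e, 1 / avgN);  // r implied if steps were independent
  return { measuredR, impliedR, e2e, avgN };
}
```

When impliedR sits noticeably above measuredR, failures are clustering into a minority of traces, which is the signature of correlated, contamination-style errors rather than independent per-step noise.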