A Financial Contagion agent assesses liquidity stress at 78%. A Macroeconomic agent assesses it at 42%. Most systems would average these to 60% and move on.
That average is worse than either individual assessment. It has destroyed the single most important fact the system produced: two well-reasoned analytical frameworks, operating on the same evidence, reached fundamentally different conclusions. The 36-point spread between 78% and 42% carries genuine signal: the clearest possible indication that real uncertainty exists, and that the uncertainty has a specific, diagnosable structure.
The Contagion agent sees a sovereign-bank nexus creating a correlation spiral. Bond spreads widen, collateral values drop, funding costs spike, and the feedback loop accelerates. The Macro agent sees an ECB backstop that capped precisely this kind of contagion within 72 hours during the 2012 sovereign debt crisis. Both are reasoning correctly from their frameworks. The question is whether the 2012 precedent still holds in 2026, with different political constraints on ECB intervention and a different composition of sovereign debt holdings.
That question, not 60%, is what the decision-maker needs.
The consensus instinct
Averaging feels responsible. It feels balanced. It strips away the extremes and produces a moderate, defensible number that can be dashboarded, compared to thresholds, and reported upward without controversy.
This instinct is so deeply embedded in quantitative systems that most practitioners don't recognise it as a design choice. Ensemble methods in machine learning (bagging, boosting, random forests) are built on the principle that aggregating many weak predictions produces a strong one. And for prediction tasks where the goal is to reduce variance around a single ground truth, this works beautifully.
Risk assessment operates in a fundamentally different regime. It is an analytical task where multiple valid interpretations of the same evidence exist simultaneously. A geopolitical analyst, a financial contagion specialist, and a cyber threat researcher will reach different conclusions about the same scenario, not because some of them are wrong, but because they are applying different causal models to the same observations. The disagreement between their models contains structural information about the problem that no individual model captures alone.
Averaging discards that structural information. It treats disagreement as error to be smoothed away, when disagreement is actually the system's most honest statement about what it does and does not know.
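The arithmetic is trivial, which is part of the danger. A minimal sketch (agent names borrowed from the examples later in this piece, values from the opening scenario; this is illustrative, not the actual implementation):

```python
# Two agents assess the same dimension. Averaging collapses the 36-point
# spread into a moderate-looking number; preserving the spread keeps the signal.
from statistics import mean

assessments = {"FinStress": 78, "GeoRisk": 42}  # liquidity stress, in %

avg = mean(assessments.values())                                # 60.0
spread = max(assessments.values()) - min(assessments.values())  # 36

print(f"average: {avg}")    # the defensible, dashboard-ready number
print(f"spread:  {spread}") # the actual signal: real, structured uncertainty
```

Both numbers cost one line to compute. Only one of them survives in most pipelines.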
The Gaussian copula: when everyone agreed and everyone was wrong
The canonical example of consensus-as-failure in financial risk comes from the models that priced collateralised debt obligations before 2008.
In 2000, David X. Li published a framework for modelling default correlation using a Gaussian copula, a mathematical function that captured the likelihood of multiple borrowers defaulting simultaneously through a single correlation parameter. The elegance was seductive: a complex, high-dimensional dependency structure reduced to one number. Rating agencies adopted it. Banks adopted it. The entire structured credit market converged on the same model, the same parameter, and the same conclusion: senior CDO tranches were safe.
The model had a structural blind spot that any genuinely independent analytical perspective would have flagged. The Gaussian copula cannot model tail dependence: the tendency for extreme events to cluster. It systematically underestimates the probability that a large number of defaults happen simultaneously. This is precisely the scenario that matters for senior tranches, which are impaired only when losses cascade beyond the junior and mezzanine layers. Li himself acknowledged the limitation in 2005, noting that very few people understood the essence of the model. The market used it anyway, because the model produced a number and the number could be agreed upon.
Plenty of researchers in quantitative finance understood tail dependence. They had been writing about it for years. What they lacked was structural: a mechanism for a dissenting analytical framework to surface its disagreement in a way that could reach decision-makers. Every institution was using the same model, calibrated to the same implied correlation parameters, producing the same risk assessment. The quant who worried about tail risk and the quant who trusted the copula were not operating in a system designed to preserve and present their disagreement. They were operating in a system designed to produce a single price.
When US housing markets turned and defaults began to correlate far beyond what the model predicted, the losses did not stay in the equity tranches where the model said they would. They cascaded upward through mezzanine into senior tranches that were supposed to be safe. The model's consensus was catastrophically wrong, in exactly the direction that an independent tail-risk analysis would have flagged.
What Roach would have seen
Imagine running the pre-2008 CDO correlation question through a multi-agent system designed for structured disagreement rather than consensus.
Fourteen agents, operating in full isolation, assess the same portfolio. Each applies a distinct analytical framework to the question: how correlated are the defaults in this pool under stress?
FinStress (Financial Contagion agent) models the feedback loop: falling house prices reduce homeowner equity, which increases strategic default incentives, which increases losses on mortgage-backed securities, which tightens credit conditions, which further depresses house prices. It identifies a nonlinear amplification mechanism that the Gaussian copula's single correlation parameter cannot capture. Its assessment: default correlation under stress is significantly higher than implied by the copula, with estimated tail correlation of 0.6 to 0.8 versus the model's implied 0.2 to 0.3. Confidence: Medium. The mechanism is clear but the magnitude depends on the speed and depth of the housing correction.
GeoRisk (Macroeconomic agent) analyses the structural conditions: a decade of loose monetary policy, historically low default rates used to calibrate the model, and a housing market that has not experienced a nationwide decline since the Great Depression. It flags that the calibration data (the historical default correlations feeding the copula) comes from a benign period and may not represent stressed conditions. Its assessment: model parameters are calibrated to a regime that is unlikely to persist. Confidence: Medium.
RegGuard (Regulatory Compliance agent) notes that the rating agencies' use of the Gaussian copula for tranche ratings creates a regulatory feedback loop: capital requirements are based on ratings, ratings are based on the copula, and the copula's implied correlation is derived from market spreads that themselves reflect the ratings. The circularity means the system cannot self-correct. Its assessment: systemic risk is being underpriced because the regulatory framework and the pricing model share the same blind spot. Confidence: High.
CyberWatch and OpResilience mark this dimension as outside their analytical frameworks. That itself is useful information. It identifies which agents have nothing to contribute and prevents dilution of the assessments that matter.
Now compute the disagreement structure. Three agents with relevant frameworks have assessed the question. Two identify mechanisms through which default correlation under stress would be materially higher than the model assumes. One identifies a structural feedback loop that prevents the system from self-correcting. None of them support the prevailing market consensus.
The spread between the copula's implied correlation (~0.2) and the agents' stress estimates (0.6 to 0.8) would have been the single most important output. A measured gap between what the market model assumes and what independent analytical frameworks predict under stress. No average. No compromise. Just the gap itself, with the reasoning chains that produced each side.
A decision-maker receiving this output would not have known that 2008 was coming. But they would have known that the consensus was fragile: that it depended on a specific correlation regime persisting, that the model could not capture the tail dynamics that mattered most, and that the regulatory framework was circular. That is enough to hedge. That is enough to reduce exposure. That is enough to survive.
The actual market produced none of this signal, because it was architecturally incapable of preserving disagreement.
Why LLMs make this worse
Human analysts at least bring genuinely independent training and experience to a problem. Two LLM-based agents running the same base model share the same training distribution, the same biases, the same failure modes. Their "disagreement" is often superficial: differences in sampling rather than differences in reasoning.
This means that when LLM-based agents do agree, their agreement may reflect shared training bias rather than independent confirmation. And when they disagree, the disagreement may reflect sampling variance rather than genuine analytical tension. Distinguishing between informative disagreement and noise requires agents that are architecturally differentiated, not just prompted differently.
The multi-agent debate literature confirms a deeper problem. Wu et al. (2025) conducted a controlled study of LLM debate using logic puzzles with verifiable ground truth and found that majority opinion actively suppresses independent correction, with weak agents rarely overturning initial majorities regardless of argument quality. Choi et al. (2025), in work presented as a NeurIPS spotlight, went further: they proved that under homogeneous agents with unweighted belief updates, debate dynamics form a martingale over belief trajectories, meaning debate alone does not improve expected correctness beyond simple majority voting. Cui et al. (2025) proposed anti-conformity mechanisms specifically to counteract this majority pressure.
The implication is stark. If you let LLM agents see each other's work, the first agent to publish an assessment anchors every subsequent agent. The majority doesn't just influence the minority. It absorbs it. This is the LLM equivalent of every bank calibrating to the same Gaussian copula: correlated failure dressed up as independent confirmation.
The architectural response is isolation. If you want genuine analytical diversity from a multi-agent system, you cannot let agents see each other's work. Isolation serves as the mechanism that preserves the diversity you paid for.
Isolation produces diversity. Diversity produces signal.
In Roach, fourteen specialised agents assess every scenario in full parallel with zero access to each other's outputs. Each agent receives the same scenario brief and the same entity profile. Each produces a structured assessment using the same output schema. But each operates within a distinct analytical framework: geopolitical conflict (GP), macroeconomic and monetary policy (ME), supply chain and outsourcing (SC), cyber threat and information warfare (CT), regulatory compliance (RC), financial contagion (FC), AI developments risk (AI), liquidity stress (LS), operational resilience (OR), emerging markets and FX (EM), climate and ESG risk (CL), systemic risk (SR), adversarial red team challenge (RT), and strategic scenario planning (ST).
The isolation boundary is strict:
- Same input. Same output schema. No shared memory.
- No message passing between agents during assessment.
- No access to intermediate reasoning or chain-of-thought from other agents.
- Each agent runs in its own execution context with its own system prompt.
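The boundary above can be sketched in a few lines. This is a hypothetical skeleton, not the Roach implementation: `run_agent`, the prompt texts, and the return shape are all illustrative assumptions.

```python
# Sketch of the isolation boundary: agents run in parallel, each with its
# own system prompt, and never see another agent's output.
from concurrent.futures import ThreadPoolExecutor

SYSTEM_PROMPTS = {
    "GeoRisk": "Reason through escalation dynamics, alliances, sanctions...",
    "FinStress": "Reason through liquidity spirals, collateral chains...",
    # ...one distinct prompt per agent: this is where framework diversity lives
}

def run_agent(name: str, scenario: str) -> dict:
    # In a real system this would call an LLM with SYSTEM_PROMPTS[name].
    # Crucially, the agent receives only the scenario brief, never
    # another agent's assessment or chain-of-thought.
    return {"agent": name, "assessment": f"<{name} view of scenario>"}

def assess_in_isolation(scenario: str) -> list[dict]:
    # No shared memory, no message passing; results are collected only
    # after every agent has finished.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_agent, n, scenario) for n in SYSTEM_PROMPTS]
        return [f.result() for f in futures]
```

The structural point is that aggregation happens after the barrier, never across it.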
The system prompt is where framework diversity lives. The GeoRisk agent is prompted to reason through escalation dynamics, alliance structures, and sanctions propagation pathways. The FinStress agent is prompted to reason through liquidity spirals, collateral chains, and counterparty exposure. They will disagree not because they sampled differently from the same distribution, but because their analytical lenses make different aspects of the scenario visible.
This goes beyond prompt engineering in the usual sense. It is the design of an analytical institution where each member has a defined role, a defined expertise boundary, and a defined blind spot. The blind spots serve a purpose: they are the reason the ensemble produces more insight than any individual agent.
Disagreement as a structured output
When all fourteen agents complete their assessments in isolation, the results are collected, not merged. The system computes per-dimension spread metrics across all agents. Three categories emerge:
Consensus dimensions (spread below 10 points). Most agents agree. This is a high-confidence finding, not because it is necessarily correct, but because the analytical diversity of the ensemble has not found a reason to disagree. Example: all fourteen agents assess that payment processing dependency on Equens/Worldline creates operational concentration risk. This is structurally visible from any analytical perspective.
Contested dimensions (spread above 25 points). Agents disagree significantly. This is where the real intelligence lives. The system surfaces both the scores and the reasoning chains: why does the Contagion agent see 78%? What specific mechanism drives its assessment? Why does the Macro agent see 42%? What precedent or structural argument supports its lower estimate? The decision-maker sees the debate, not the average.
Sparse dimensions (fewer than three agents assessed). Most agents marked this dimension as outside their analytical framework. This is a coverage gap, a signal that the scenario touches a domain that the current agent pool does not adequately cover. Sparse dimensions deserve attention precisely because they represent blind spots made visible by the output schema's explicit "what I cannot assess" field.
This three-category structure constitutes the system's actual output. A map of what is known, what is contested, and what is not covered. That map, rather than any single number or dashboard-friendly risk score, is what reaches the decision-maker.
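The classification itself is a small function. A sketch using the thresholds stated above (spread below 10, spread above 25, fewer than three assessors); the "moderate" label for the 10-to-25 band and all example data are my own illustrative assumptions:

```python
def classify(scores: list[float]) -> str:
    """Classify one risk dimension from the agents that chose to assess it."""
    if len(scores) < 3:
        return "sparse"      # coverage gap: most agents opted out
    spread = max(scores) - min(scores)
    if spread < 10:
        return "consensus"   # diversity found no reason to disagree
    if spread > 25:
        return "contested"   # surface the scores AND the reasoning chains
    return "moderate"        # in-between band (assumed label, not from the text)

dimensions = {
    "payment_concentration": [71, 74, 69, 73, 70],  # near-agreement
    "liquidity_stress": [78, 42, 55],               # the 36-point gap
    "quantum_crypto_risk": [30],                    # only one agent assessed
}
labels = {dim: classify(s) for dim, s in dimensions.items()}
# → consensus, contested, sparse respectively
```

Note that the function takes only the scores of agents that opted in; opting out is recorded upstream via the schema's "what I cannot assess" field.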
The adversarial layer
After the assessment agents complete in isolation and the disagreement structure is computed, the Adversarial Red Team (RT) activates. Unlike the twelve domain-specialist agents, the Red Team does not assess the scenario independently. It attacks the other agents' assessments.
Its job is to find the assumptions that no agent examined. The Macro agent's 42% assessment relies on the 2012 Draghi precedent. The Red Team asks: what if the ECB's political constraints have changed since 2012? What if the backstop doesn't arrive within 72 hours this time? What if the sovereign debt composition (more concentrated in peripheral economies, with higher rates) means the transmission dynamics are structurally different?
The Red Team needs specificity, not correctness. Each challenge names the assumption being tested, the evidence that supports it, and the evidence that undermines it. The goal is to stress-test the reasoning chains that produced the existing estimates, not to produce a better single number.
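That specificity requirement can be enforced structurally. A sketch of a challenge record, assuming hypothetical field names (this is not the actual Roach schema), populated with the Draghi-precedent example from above:

```python
# A red-team challenge must name the assumption under test and evidence
# on both sides; vague objections fail validation.
from dataclasses import dataclass, field

@dataclass
class RedTeamChallenge:
    target_agent: str                  # whose assessment is being attacked
    assumption: str                    # the specific assumption under test
    supporting_evidence: list[str] = field(default_factory=list)
    undermining_evidence: list[str] = field(default_factory=list)

    def is_specific(self) -> bool:
        # "This could be wrong" is not a challenge; evidence is required
        # both for and against the assumption.
        return bool(self.assumption
                    and self.supporting_evidence
                    and self.undermining_evidence)

challenge = RedTeamChallenge(
    target_agent="GeoRisk",
    assumption="The 2012 ECB backstop precedent still holds in 2026",
    supporting_evidence=["ECB mandate and toolkit substantially unchanged"],
    undermining_evidence=["Changed political constraints on intervention",
                          "Different composition of sovereign debt holdings"],
)
```

A challenge that cannot cite evidence on both sides is rejected before it ever reaches the contested dimension it targets.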
If the Red Team's challenge is strong, if it demonstrates that the Draghi precedent genuinely may not hold, then the contested dimension stays contested and the decision-maker is warned. If the challenge is weak, if the ECB's mandate and toolkit are substantially unchanged, then the Macro agent's reasoning is strengthened by having survived adversarial scrutiny.
Either outcome is valuable. Neither requires averaging.
When averaging is appropriate
Averaging has its place. It becomes destructive only in a specific context: when agents represent genuinely different analytical perspectives and their disagreement carries structural information about the problem.
When agents share the same analytical framework and you are simply reducing sampling noise, averaging is exactly right. A classic ensemble of five identical models with different random seeds should be averaged. The variance between them is noise, not signal. There is no structural interpretation of why one random seed produced 0.63 and another produced 0.67.
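The seed-variance case is worth making concrete, because the arithmetic guarantees the result. A small simulation (the true value, noise level, and ensemble size are illustrative):

```python
# Five "identical models, different seeds": the same estimate plus noise.
# The ensemble mean cannot be further from the truth than the worst single
# estimate, and there is no structure to lose by averaging.
import random

random.seed(0)
true_value = 0.65
estimates = [true_value + random.gauss(0, 0.03) for _ in range(5)]
ensemble = sum(estimates) / len(estimates)

worst_single = max(abs(e - true_value) for e in estimates)
ensemble_err = abs(ensemble - true_value)
# ensemble_err <= worst_single holds by construction: the mean of the
# errors is bounded by the largest error.
```

Run the same simulation with the 78/42 framework disagreement and the averaging step still executes, but the premise fails: there is no single `true_value` that both agents are noisily estimating.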
The test is simple. If two agents disagree, can a domain expert explain why they disagree by pointing to different frameworks, assumptions, or evidence? If yes, the disagreement is informative and averaging destroys information. If no, the disagreement is noise and averaging reduces it.
This test has a design implication: if you want informative disagreement, you must design for framework diversity. Two agents prompted to "analyse this scenario" with different temperatures will produce noise. Two agents prompted to reason through different causal models (one through financial contagion pathways, one through geopolitical escalation dynamics) will produce signal. The quality of your disagreement is a direct function of the diversity of your analytical frameworks, not the number of your agents.
The decision-maker's actual need
Risk committees do not need a single number. They need answers to specific questions: Where are we most exposed? What are we most uncertain about? What would change the assessment most? What are we not seeing?
A system that averages fourteen agents into a single score can answer the first question, poorly. It cannot answer the other three at all.
A system that preserves disagreement, surfaces reasoning chains, and makes coverage gaps explicit can answer all four. The contested dimensions tell the committee where uncertainty is highest. The sensitivity analysis (which input parameters drive the most outcome variance) tells them what would change the assessment. The sparse dimensions tell them what the system cannot see.
This is more complex than a single number. It requires decision-makers who can work with structured uncertainty rather than false precision. But that is what DORA, DNB, and every modern resilience framework actually demand: not that institutions produce confident assessments, but that they understand and can articulate the boundaries of their knowledge.
The entire structured credit market agreed on a single correlation parameter and called it risk management. The 36-point spread between 78% and 42% deserves preservation, not resolution. It is the answer.
References
- Wu, H. et al. (2025). Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning. arxiv.org/abs/2511.07784. Process-level analysis showing that majority pressure suppresses independent correction in LLM debate, with minority agents rarely overturning incorrect majorities.
- Choi, H.K. et al. (2025). Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? NeurIPS 2025 Spotlight. arxiv.org/abs/2508.17536. Proves that homogeneous multi-agent debate forms a martingale over belief trajectories, and that debate alone does not improve expected correctness beyond majority voting.
- Cui, Y. et al. (2025). Free-MAD: Consensus-Free Multi-Agent Debate. arxiv.org/abs/2509.11035. Introduces anti-conformity mechanisms that enable agents to resist excessive majority influence, improving reasoning accuracy in single-round debate.