Single-shot success
35.8%
0.95020. Most attempts fail somewhere in the chain.
Pass⁸ (8 runs in a row)
0.02%
Probability the same task succeeds 8 times in a row. τ-bench's passk metric.
Success rate across pipeline length
At the current per-step reliability. Dashed line shows pass⁸.
Single attempt
Pass⁸
At 95% per-step reliability, a 20-step
pipeline succeeds 35.8% of the time. End-to-end success and
consistent reliability measure different things.