Results

The most useful current report is:

benchmark/results/guarded-mid-vs-balanced-combined/report.md

It runs the guarded-mid policy against balanced and guarded-late variants on a combined suite:

  • 1000 long-run user turns;
  • validation scale 2;
  • 1188/1188 completed simulated turns for every listed guarded-mid row;
  • threshold 0.8, target 0.3;
  • seven model pricing profiles.

| Model profile | Guarded-mid result | Versus balanced |
| --- | --- | --- |
| openai-gpt-5.4 | USD 1199.144484, 96.6% cache hit, 1188/1188 turns | 22.6% lower simulated cost and 23.7% fewer expected tokens. |
| openai-gpt-5.5 | USD 2398.288967, 96.6% cache hit, 1188/1188 turns | 22.6% lower simulated cost and 23.7% fewer expected tokens. |
| openai-gpt-5.4-mini | USD 195.798758, 87.5% cache hit, 1188/1188 turns | 8.1% lower simulated cost and 30.8% fewer expected tokens. |
| kimi-k2.6 | USD 302.727538, 87.3% cache hit, 1188/1188 turns | 14.5% lower simulated cost and 30.6% fewer expected tokens. |
| deepseek-chat | USD 74.952810, 91.3% cache hit, 1188/1188 turns | Same as balanced because the 128K prompt profile falls back to the 80% guard. |
| deepseek-v4-flash | USD 29.190571, 97.0% cache hit, 1188/1188 turns | 16.9% lower simulated cost and 21.3% fewer expected tokens. |
| deepseek-v4-pro-promo | USD 71.593043, 97.0% cache hit, 1188/1188 turns | 15.6% lower simulated cost and 21.3% fewer expected tokens. |
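
The "Versus balanced" percentages are relative reductions. As a reading aid, the sketch below inverts a reported reduction to estimate the balanced cost it implies. The function name `implied_baseline` is illustrative and not part of the benchmark code. Because the reported percentages are rounded to one decimal place, the result is only an approximation, not a figure taken from the report.

```python
def implied_baseline(guarded_value: float, reduction_pct: float) -> float:
    """Estimate the baseline from a value that is reduction_pct percent lower.

    Hypothetical helper for reading the table; not part of the benchmark.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return guarded_value / (1 - reduction_pct / 100)


# openai-gpt-5.4 row: USD 1199.144484 at 22.6% lower simulated cost.
approx_balanced = implied_baseline(1199.144484, 22.6)
print(f"approximate balanced cost: USD {approx_balanced:.2f}")
```

For exact balanced figures, read them from the report itself instead of back-computing them from rounded percentages.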

Across the guarded-mid rows, the benchmark reports roughly 108M raw tool tokens saved by structured tool projection while keeping recovery handles available.

DuckAgent does not blindly summarize every tool result as soon as it appears. It keeps the active loop raw until prompt pressure appears, then switches to recoverable summaries with exact-evidence budgets and handles. In the current combined report, this policy is the best-looking default: every simulated turn completes, and long-context cost drops on large-window and mid-window models.
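
The "raw until pressure, then summarize" behavior can be sketched as a small planning function. This is a minimal illustration under stated assumptions, not DuckAgent's implementation. It assumes that `threshold` (0.8) is the fraction of the context window at which compaction starts, and that `target` (0.3) is the fraction that compaction tries to get back under. `GuardConfig`, `plan_compaction`, and `summary_tokens` are hypothetical names, and summarizing oldest results first is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class GuardConfig:
    context_window: int     # assumed: model prompt limit, in tokens
    threshold: float = 0.8  # assumed: start compacting above this fraction
    target: float = 0.3     # assumed: compact until at or below this fraction


def plan_compaction(
    tool_result_tokens: list[int],
    fixed_tokens: int,
    cfg: GuardConfig,
    summary_tokens: int = 200,
) -> list[int]:
    """Return indices of tool results to replace with summaries, oldest first.

    Results stay raw while total usage is at or below the threshold. Once the
    threshold is crossed, results are summarized until usage reaches the target.
    """
    used = fixed_tokens + sum(tool_result_tokens)
    if used <= cfg.threshold * cfg.context_window:
        return []  # no prompt pressure yet: keep everything raw

    to_summarize: list[int] = []
    for index, size in enumerate(tool_result_tokens):
        if used <= cfg.target * cfg.context_window:
            break
        if size > summary_tokens:  # summarizing only helps if it shrinks the result
            used -= size - summary_tokens
            to_summarize.append(index)
    return to_summarize


cfg = GuardConfig(context_window=1000)
print(plan_compaction([100, 100], fixed_tokens=100, cfg=cfg))
print(plan_compaction([500, 300, 50], fixed_tokens=50, cfg=cfg, summary_tokens=50))
```

The gap between the two fractions means that once compaction runs, it frees enough room that it should not run again on the very next turn. The recovery handles and exact-evidence budgets mentioned above are left out of this sketch.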

These numbers come from an offline simulator. They are useful for policy selection and regression checks, not exact billing promises. Pricing profiles live under benchmark/pricing/ and should be reviewed whenever provider prices or model limits change.