Benchmark

DuckAgent includes an offline context-policy benchmark under benchmark/. It exists to answer one practical question:

How should a direct local agent keep enough evidence for long work without paying to resend every raw tool result forever?

The benchmark simulates long-running Agent Loops, tool output, prompt-cache reuse, model limits, pricing profiles, and expected recovery reads when compressed evidence needs to be revisited. It does not call an LLM and it does not mutate real session files.

The current runtime uses the guarded recoverable context policy.

| Runtime behavior | Value |
| --- | --- |
| Active policy in source | ContextProjectionPolicy::guarded_mid |
| Benchmark family | recoverable_decay_guarded_mid |
| Prompt principle | Cache-Friendly Prompt |
| Active loop pressure | 80% below a 200K prompt window, 85% at 200K+ |
| Active exact evidence budget | 18K tokens |
| Completed-loop exact evidence budget | 2K tokens |
| Tool preview budget | 220 tokens |
| Recovery model | Keep path, offset, limit, cursor, process id, and query handles so exact details can be fetched again. |
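
The pressure row in the table can be read as a simple threshold rule. The sketch below is an illustration only, not DuckAgent's implementation. It assumes that "200K" means 200,000 tokens and that pressure starts once the prompt reaches the stated percentage of the prompt window. The names `pressure_threshold` and `under_pressure` are hypothetical.

```python
# Illustrative sketch only; names and the exact comparison are assumptions.
LARGE_WINDOW_TOKENS = 200_000   # assumed meaning of "200K"
PRESSURE_PERCENT_SMALL = 80     # windows below 200K
PRESSURE_PERCENT_LARGE = 85     # windows at 200K or above


def pressure_threshold(prompt_window_tokens: int) -> int:
    """Return the prompt size (in tokens) at which pressure is assumed to start."""
    if prompt_window_tokens >= LARGE_WINDOW_TOKENS:
        percent = PRESSURE_PERCENT_LARGE
    else:
        percent = PRESSURE_PERCENT_SMALL
    # Integer arithmetic avoids floating-point rounding surprises.
    return prompt_window_tokens * percent // 100


def under_pressure(prompt_tokens: int, prompt_window_tokens: int) -> bool:
    """True once the prompt has reached the assumed pressure threshold."""
    return prompt_tokens >= pressure_threshold(prompt_window_tokens)
```

Under these assumptions, a 128,000-token window comes under pressure at 102,400 prompt tokens, and a 200,000-token window at 170,000.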

In generated benchmark reports and in the benchmark source, this policy appears as duckagent_recoverable_decay_guarded_mid.

The benchmarked policy keeps the active Agent Loop rich while it is still safe to do so. Once prompt pressure appears, it projects tool output into a smaller model-visible shape:

  • read_file keeps exact evidence while budget remains, then falls back to path and range handles.
  • process output keeps status, cursor, and preview fields so logs can be resumed.
  • completed loops become recoverable summaries instead of raw tool transcripts.
  • the session and prompt design remain append-only and cache-friendly; projection happens before model send.

That gives the model a small prompt, stable recovery handles, and enough recent evidence to avoid the worst read-compact-read loop.
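
The read_file behavior described above can be sketched as a budgeted projection. This is a minimal illustration, not the runtime's or the simulator's code: the `ReadFileResult` type, the `project_read_results` function, the dictionary shape, and the greedy in-order spending of the budget are all assumptions. Only the 18K-token active budget and the path, offset, and limit handles come from this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ReadFileResult:
    """Hypothetical record of one read_file tool result."""
    path: str
    offset: int
    limit: int
    text: str
    tokens: int  # token count as estimated by the caller


def project_read_results(
    results: list[ReadFileResult],
    exact_budget_tokens: int = 18_000,  # active exact evidence budget
) -> list[dict[str, Any]]:
    """Project results into a model-visible shape once pressure has appeared.

    Exact text is kept while the budget lasts; otherwise only the
    recovery handle (path, offset, limit) is kept so the range can be re-read.
    """
    remaining = exact_budget_tokens
    projected = []
    for result in results:
        handle = {
            "tool": "read_file",
            "path": result.path,
            "offset": result.offset,
            "limit": result.limit,
        }
        if result.tokens <= remaining:
            remaining -= result.tokens
            projected.append({**handle, "exact": True, "text": result.text})
        else:
            # Budget exhausted for this result: keep only the handle.
            projected.append({**handle, "exact": False})
    return projected
```

Because the handle is kept in both branches, a result that loses its exact text can still be fetched again with the same path and range.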

| Report | Use it for |
| --- | --- |
| benchmark/results/guarded-mid-vs-balanced-combined/report.md | Best current summary for the selected guarded-mid policy versus balanced and guarded-late variants. |
| benchmark/results/matrix/report.md | Wider policy and model matrix across long-run workloads. |
| benchmark/results/recoverable-sweep/report.md | Budget sweep around recoverable-decay policy variants. |
| benchmark/context_policy_benchmark.py | Simulator and policy definitions. |

To understand the runtime behavior, start with Current Context Policy. Read Cache-Friendly Prompt for the prompt-cache principle, then Results for the numbers worth quoting.