Benchmark
DuckAgent includes an offline context-policy benchmark under benchmark/. It exists to answer one practical question:
How should a direct local agent keep enough evidence for long work without paying to resend every raw tool result forever?
The benchmark simulates long-running Agent Loops, tool output, prompt-cache reuse, model limits, pricing profiles, and expected recovery reads when compressed evidence needs to be revisited. It does not call an LLM and it does not mutate real session files.
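To make the question concrete, the following is a minimal token-counting sketch, not the benchmark's own model. It compares resending every raw tool result on every turn with keeping only the newest result raw and shrinking older ones to a small projected form. The function names and the numbers used are hypothetical, and the sketch deliberately ignores prompt-cache discounts, pricing profiles, and recovery reads, all of which the real simulator accounts for.

```python
def resend_everything(turns: int, tokens_per_result: int) -> int:
    """Total input tokens when turn k resends all k raw results so far."""
    return sum(k * tokens_per_result for k in range(1, turns + 1))


def resend_projected(turns: int, tokens_per_result: int, projected_tokens: int) -> int:
    """Total input tokens when only the newest result stays raw.

    Each of the k - 1 older results is assumed to shrink to
    `projected_tokens` (a hypothetical size for a handle plus preview).
    """
    return sum(
        tokens_per_result + (k - 1) * projected_tokens
        for k in range(1, turns + 1)
    )


if __name__ == "__main__":
    # Hypothetical workload: 50 turns, 2,000 tokens per raw tool result.
    print(resend_everything(50, 2_000))      # 2550000
    print(resend_projected(50, 2_000, 200))  # 345000
```

The raw strategy grows quadratically with the number of turns, which is the cost the benchmark is designed to quantify against the expected price of re-reading compressed evidence.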
Current runtime policy
The current runtime uses the guarded recoverable context policy.
| Runtime behavior | Value |
|---|---|
| Active policy in source | ContextProjectionPolicy::guarded_mid |
| Benchmark family | recoverable_decay_guarded_mid |
| Prompt principle | Cache-Friendly Prompt |
| Active loop pressure | 80% of the prompt window for windows below 200K tokens; 85% for windows of 200K tokens and above |
| Active exact evidence budget | 18K tokens |
| Completed-loop exact evidence budget | 2K tokens |
| Tool preview budget | 220 tokens |
| Recovery model | Keep path, offset, limit, cursor, process id, and query handles so exact details can be fetched again. |
In generated benchmark reports and benchmark source this policy appears as duckagent_recoverable_decay_guarded_mid.
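As a rough illustration of how the values in the table fit together, here is a small sketch. It is not the runtime's implementation: the class and function names are hypothetical, "200K" is assumed to mean 200,000 tokens, and the pressure percentages are assumed to be fractions of the prompt window that trigger projection once reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardedMidBudgets:
    """Hypothetical container; the defaults are copied from the table above."""
    active_exact_evidence_tokens: int = 18_000
    completed_loop_exact_evidence_tokens: int = 2_000
    tool_preview_tokens: int = 220


def pressure_percent(prompt_window_tokens: int) -> int:
    """Return the pressure threshold as a whole percentage of the window.

    Assumption: 80% below a 200,000-token window, 85% at or above it.
    """
    return 85 if prompt_window_tokens >= 200_000 else 80


def under_pressure(prompt_tokens: int, prompt_window_tokens: int) -> bool:
    """True once the prompt reaches the threshold (integer math, no floats)."""
    return prompt_tokens * 100 >= pressure_percent(prompt_window_tokens) * prompt_window_tokens


if __name__ == "__main__":
    print(under_pressure(100_000, 128_000))  # False: threshold is 102,400
    print(under_pressure(165_000, 200_000))  # False: threshold is 170,000
    print(under_pressure(170_000, 200_000))  # True
```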
Why it matters
The benchmarked policy keeps the active Agent Loop rich while it is still safe to do so. Once prompt pressure appears, it projects tool output into a smaller model-visible shape:
- `read_file` keeps exact evidence while budget remains, then falls back to path and range handles.
- process output keeps status, cursor, and preview fields so logs can be resumed.
- completed loops become recoverable summaries instead of raw tool transcripts.
- the session and prompt design remain append-only and cache-friendly; projection happens before model send.
That gives the model a small prompt, stable recovery handles, and enough recent evidence to avoid the worst read-compact-read loop.
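The projection described in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not DuckAgent's code: the record types, field names, and function names are hypothetical, token counts are estimated at about 4 characters per token instead of using a real tokenizer, and the process preview is assumed to be taken from the start of the log. Both functions build a new model-visible dictionary and leave the stored record untouched, mirroring the append-only session design.

```python
from dataclasses import dataclass

PREVIEW_BUDGET_TOKENS = 220  # tool preview budget from the policy table


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: assume about 4 characters per token."""
    return (len(text) + 3) // 4


@dataclass(frozen=True)
class ReadFileResult:  # hypothetical record shape
    path: str
    offset: int
    limit: int
    content: str


@dataclass(frozen=True)
class ProcessOutput:  # hypothetical record shape
    process_id: int
    status: str
    cursor: int
    log: str


def project_read_file(result: ReadFileResult, budget_left: int) -> tuple[dict, int]:
    """Keep exact content while budget remains, else keep only the handles."""
    handles = {
        "tool": "read_file",
        "path": result.path,
        "offset": result.offset,
        "limit": result.limit,
    }
    cost = estimate_tokens(result.content)
    if cost <= budget_left:
        return {**handles, "content": result.content}, budget_left - cost
    # Over budget: the path and range are enough to read the file again later.
    return {**handles, "content": None}, budget_left


def project_process_output(output: ProcessOutput) -> dict:
    """Keep status, cursor, and a bounded preview so the log can be resumed."""
    max_chars = PREVIEW_BUDGET_TOKENS * 4  # same 4-chars-per-token assumption
    return {
        "tool": "process",
        "process_id": output.process_id,
        "status": output.status,
        "cursor": output.cursor,
        "preview": output.log[:max_chars],
    }
```

Because the handles survive in both branches, a later turn can request the same path, offset, and limit again when the omitted content turns out to matter.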
Reports
| Report | Use it for |
|---|---|
| benchmark/results/guarded-mid-vs-balanced-combined/report.md | Best current summary for the selected guarded-mid policy versus balanced and guarded-late variants. |
| benchmark/results/matrix/report.md | Wider policy and model matrix across long-run workloads. |
| benchmark/results/recoverable-sweep/report.md | Budget sweep around recoverable-decay policy variants. |
| benchmark/context_policy_benchmark.py | Simulator and policy definitions. |
Start with Current Context Policy if you want to understand the runtime behavior, read Cache-Friendly Prompt for the prompt-cache principle, then read Results for the numbers worth quoting.