Benchmark accuracy — recall@5 confirmed on the full 500-question run
A documentation-accuracy release. No code or tool-contract changes — purely making sure a published number reproduces on the full set.
We re-ran our own number before you could
The recency reranker’s recall@5 introduced in 1.2.2 was first measured on a 100-question sample. Before launch we re-ran it on the complete 500-question LongMemEval longmemeval_s set and made that the published figure:
- 89.2% recall@5: keyless default (hybrid), 446/500.
- 91.6% recall@5: with the recency reranker (temporal), 458/500.
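As a quick sanity check, the percentages follow directly from the hit counts. The sketch below only redoes that arithmetic; the function name and dictionary are illustrative and are not part of the eval harness.

```python
# Hit counts and question total as published in this release note.
TOTAL_QUESTIONS = 500
HITS = {"hybrid": 446, "temporal": 458}


def recall_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)


for mode, hits in HITS.items():
    pct = recall_percent(hits, TOTAL_QUESTIONS)
    print(f"{mode}: {pct}% recall@5 ({hits}/{TOTAL_QUESTIONS})")
# hybrid: 89.2% recall@5 (446/500)
# temporal: 91.6% recall@5 (458/500)
```

The gap between the two modes is 12 questions, or 2.4 percentage points.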
The default actually scored higher on the full set than the sample showed, and the reranker held at about 92% (91.6%). An independent end-to-end re-run reproduced both figures exactly, with zero drift, because retrieval is deterministic. Reproduce it yourself:
python -m evals.longmemeval.run --subset longmemeval_s -n 500 --no-qa

Compare the hybrid and temporal rows (ev_at_5). Every number we publish has to reproduce on the full set, so when a sample and a full run disagree, the full run wins and we say so.
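One way to see why a 100-question sample and the 500-question run can disagree is a rough sampling-error estimate. The sketch below assumes the sample behaves like a random draw of independent questions, which these notes do not state, and uses the standard binomial standard error of a proportion. It says nothing about run-to-run drift, which is zero here because retrieval is deterministic.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent questions."""
    return math.sqrt(p * (1 - p) / n)


# Published hybrid recall@5 on the full set (446/500).
p = 446 / 500

for n in (100, 500):
    se = binomial_standard_error(p, n)
    print(f"n={n}: +/- {100 * se:.1f} points (one standard error)")
# n=100: +/- 3.1 points (one standard error)
# n=500: +/- 1.4 points (one standard error)
```

Under that assumption, a swing of a few points between a 100-question sample and the full set is unsurprising, which is why the full run is the figure to publish.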
Full details in the repository changelog. No breaking changes to any brain_* tool contract.