v1.2.3 (2026-06-16)

Benchmark accuracy: recall@5 confirmed on the full 500-question run

A documentation-accuracy release. There are no code or tool-contract changes; its only purpose is to make sure a published number reproduces on the full set.

We re-ran our own number before you could

The recency reranker’s recall@5 introduced in 1.2.2 was first measured on a 100-question sample. Before launch we re-ran it on the complete 500-question LongMemEval longmemeval_s set and made that the published figure:

89.2% recall@5, keyless default (hybrid), 446/500.
91.6% recall@5, with the recency reranker (temporal), 458/500.
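The percentages follow directly from the hit counts. As a quick sanity check, here is a minimal sketch that recomputes them from the figures above; the variable and function names are illustrative and not part of the eval harness.

```python
# Hit counts and question total as published in this release note.
TOTAL_QUESTIONS = 500
HITS = {"hybrid": 446, "temporal": 458}


def recall_pct(hits: int, total: int) -> float:
    """Return recall as a percentage, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


# Recompute the published percentage for each retrieval mode.
percentages = {mode: recall_pct(h, TOTAL_QUESTIONS) for mode, h in HITS.items()}

# Difference between the reranker and the default, in questions and in points.
extra_hits = HITS["temporal"] - HITS["hybrid"]
gain_points = round(percentages["temporal"] - percentages["hybrid"], 1)

for mode, pct in percentages.items():
    print(f"{mode}: {pct}% recall@5 ({HITS[mode]}/{TOTAL_QUESTIONS})")
print(f"reranker gain: {extra_hits} questions, {gain_points} points")
```

On the full set the reranker accounts for 12 additional questions, a 2.4-point difference over the keyless default.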

The default scored higher on the full set than the sample showed, and the reranker held at 91.6%. An independent end-to-end re-run reproduced both figures exactly: zero drift, because retrieval is deterministic. Reproduce it yourself:

python -m evals.longmemeval.run --subset longmemeval_s -n 500 --no-qa

Compare the hybrid and temporal rows (ev_at_5). Every number we publish has to reproduce on the full set, so when a sample and a full run disagree, the full run wins and we say so.
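For readers less familiar with the metric, the sketch below shows one common way a recall@5-style figure is computed. It assumes a question counts as a hit when at least one gold evidence item appears among the top 5 retrieved items; that definition, the function names, and the toy data are assumptions made for illustration, not the harness's actual implementation of ev_at_5.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> bool:
    """True if any gold evidence id appears in the top-k retrieved ids."""
    return any(item_id in gold_ids for item_id in ranked_ids[:k])


def recall_at_k(results: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one gold item in the top k."""
    if not results:
        raise ValueError("no results to score")
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in results)
    return hits / len(results)


# Toy data, invented for illustration: (ranked retrieval ids, gold evidence ids).
toy_results = [
    (["s3", "s9", "s1", "s7", "s2", "s4"], {"s1"}),  # gold at rank 3: hit
    (["s5", "s6", "s8", "s0", "s2", "s4"], {"s4"}),  # gold at rank 6: miss
    (["s2", "s4", "s6", "s8", "s0"], {"s0", "s9"}),  # gold at rank 5: hit
    (["s1", "s2"], {"s7"}),                          # gold never retrieved: miss
]

toy_score = recall_at_k(toy_results, k=5)
print(f"toy recall@5: {toy_score:.1%}")
```

Because a score of this kind depends only on the ranked lists and the gold labels, a deterministic retriever yields the same value on every run, which is why a re-run can be expected to match exactly.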

Full details in the repository changelog. No breaking changes to any brain_* tool contract.