v1.2.3 (2026-06-16)

Benchmark accuracy — recall@5 confirmed on the full 500-question run

A documentation-accuracy release. No code or tool-contract changes — purely making sure a published number reproduces on the full set.

We re-ran our own number before you could

The recency reranker’s recall@5, introduced in 1.2.2, was first measured on a 100-question sample. Before launch we re-ran it on the complete 500-question LongMemEval longmemeval_s set and made that the published figure.

The default actually scored higher on the full set than the sample showed, and the reranker held at ~92%. An independent end-to-end re-run reproduced the figures exactly, with zero drift, because retrieval is deterministic. Reproduce it yourself:

python -m evals.longmemeval.run --subset longmemeval_s -n 500 --no-qa

Compare the hybrid and temporal rows (ev_at_5). Every number we publish has to reproduce on the full set — so when a sample and a full run disagree, the full run wins and we say so.
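For readers who want to sanity-check what a recall@5 figure and a "zero drift" claim mean in practice, here is a minimal sketch. It assumes recall@k is the fraction of questions with at least one gold evidence item in the top k results. The harness's exact definition of ev_at_5 is not stated here, so treat that as an assumption. The function names, the input shapes, and the toy data are all hypothetical and are not part of the evals.longmemeval package.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                gold: Dict[str, Set[str]],
                k: int = 5) -> float:
    """Fraction of questions with at least one gold id in the top k.

    Assumed definition (hit rate); the real ev_at_5 metric may differ.
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(
        1
        for qid, evidence in gold.items()
        if evidence & set(ranked.get(qid, [])[:k])
    )
    return hits / len(gold)


def drifted(run_a: Dict[str, List[str]],
            run_b: Dict[str, List[str]],
            k: int = 5) -> List[str]:
    """Question ids whose top-k lists differ between two runs."""
    qids = set(run_a) | set(run_b)
    return sorted(
        q for q in qids if run_a.get(q, [])[:k] != run_b.get(q, [])[:k]
    )


# Toy data, invented for illustration only.
gold = {"q1": {"s3"}, "q2": {"s7", "s9"}, "q3": {"s1"}}
run_a = {"q1": ["s3", "s4"], "q2": ["s2", "s9"], "q3": ["s5", "s6"]}
run_b = {qid: list(ids) for qid, ids in run_a.items()}  # identical re-run

print(recall_at_k(run_a, gold, k=5))
print(drifted(run_a, run_b, k=5))
```

With deterministic retrieval, two runs over the same inputs should give an empty drift list and therefore identical recall, which is the property the re-run above relies on.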

Full details in the repository changelog. No breaking changes to any brain_* tool contract.

v1.2.3 Changelog — Myco Brain