Memanto Benchmarks & Evaluation

This repository contains the official evaluation suite, benchmark results, and an interactive visualization tool for Memanto, the Information-Theoretic agent memory system.

Our goal with this repository is transparency and reproducibility. Rather than just publishing static accuracy tables, we provide the complete pipelines and an interactive web interface so you can explore the data, test the system yourself, and verify the results.

This repository includes:

Industry Benchmarks: Official LoCoMo and LongMemEval results comparing Memanto to other leading memory frameworks.
Evaluation Pipelines: Python scripts to locally reproduce the ingestion and LLM-as-a-Judge evaluation processes.
Interactive Visualizer: A local web application to deeply explore the datasets, construct your own queries, and run real-time evaluations.

Memory Performance & Accuracy

Memanto achieves near state-of-the-art performance on both the LongMemEval and LoCoMo benchmarks using Moorcheh's Information-Theoretic Scoring (ITS) engine: in the tables below it ranks second overall on each (89.8% and 87.08%), behind only Hindsight (Gemini-3). It outperforms the vast majority of hybrid knowledge-graph architectures without requiring an LLM during the ingestion phase.

LongMemEval Results

| Method | Single-session User | Single-session Assistant | Single-session Preference | Knowledge Update | Temporal Reasoning | Multi-session | Overall |
|---|---|---|---|---|---|---|---|
| Hindsight (Gemini-3) | 97.1% | 96.4% | 80.0% | 94.9% | 91.0% | 87.2% | 91.4% |
| Memanto (Gemini-3) | 95.7% | 100.0% | 93.3% | 93.6% | 88.0% | 81.2% | 89.8% |
| Hindsight (OSS-120B) | 100.0% | 98.2% | 86.7% | 92.3% | 85.7% | 81.2% | 89.0% |
| Supermemory (Gemini-3) | 98.6% | 98.2% | 70.0% | 89.7% | 82.0% | 76.7% | 85.2% |
| Supermemory (GPT-5) | 97.1% | 100.0% | 76.7% | 87.2% | 81.2% | 75.2% | 84.6% |
| Hindsight (OSS-20B) | 95.7% | 94.6% | 66.7% | 84.6% | 79.7% | 79.7% | 83.6% |
| Supermemory (GPT-4o) | 97.1% | 96.4% | 70.0% | 88.5% | 76.7% | 71.4% | 81.6% |
| Zep (GPT-4o) | 92.9% | 80.4% | 56.7% | 83.3% | 62.4% | 57.9% | 71.2% |
| Full-context (GPT-4o) | 81.4% | 94.6% | 20.0% | 78.2% | 45.1% | 44.3% | 60.2% |
| Full-context (OSS-20B) | 38.6% | 80.4% | 20.0% | 60.3% | 31.6% | 21.1% | 39.0% |
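
The Overall column is not the plain mean of the six category columns; it appears to be weighted by the number of questions in each category. The sketch below checks this for the Memanto (Gemini-3) row. The per-category question counts are an assumption based on the public LongMemEval question set (500 questions) and are not stated in this repository, so treat them as illustrative.

```python
# Per-category accuracy (%) for Memanto (Gemini-3), copied from the LongMemEval table.
memanto_scores = {
    "single-session-user": 95.7,
    "single-session-assistant": 100.0,
    "single-session-preference": 93.3,
    "knowledge-update": 93.6,
    "temporal-reasoning": 88.0,
    "multi-session": 81.2,
}

# ASSUMPTION: number of questions per category in LongMemEval (not from this README).
category_sizes = {
    "single-session-user": 70,
    "single-session-assistant": 56,
    "single-session-preference": 30,
    "knowledge-update": 78,
    "temporal-reasoning": 133,
    "multi-session": 133,
}


def weighted_overall(scores, sizes):
    """Question-weighted accuracy: each category counts in proportion to its size."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in scores) / total


def simple_mean(scores):
    """Unweighted mean of the category accuracies, for comparison."""
    return sum(scores.values()) / len(scores)


overall = weighted_overall(memanto_scores, category_sizes)
print(f"weighted overall: {overall:.1f}%")
print(f"unweighted mean:  {simple_mean(memanto_scores):.1f}%")
```

Under these assumed counts the weighted figure matches the 89.8% reported in the table, while the unweighted mean comes out roughly two points higher. The two large categories (Temporal Reasoning and Multi-session) therefore dominate the Overall score.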

LoCoMo Results

| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| Hindsight (Gemini-3) | 86.17% | 70.83% | 95.12% | 83.80% | 89.61% |
| Memanto (Gemini-3) | 78.72% | 70.83% | 92.39% | 85.36% | 87.08% |
| Hindsight (OSS-120B) | 76.79% | 62.50% | 93.68% | 79.44% | 85.67% |
| Hindsight (OSS-20B) | 74.11% | 64.58% | 90.96% | 76.32% | 83.18% |
| Memobase (v0.0.37) | 70.92% | 46.88% | 77.17% | 85.05% | 75.78% |
| Zep | 74.11% | 66.04% | 67.71% | 79.79% | 75.14% |
| Mem0-Graph | 65.71% | 47.19% | 75.71% | 58.13% | 68.44% |
| Mem0 | 67.13% | 51.15% | 72.93% | 55.51% | 66.88% |
| LangMem | 62.23% | 47.92% | 71.12% | 23.43% | 58.10% |
| OpenAI | 63.79% | 42.92% | 62.29% | 21.71% | 52.90% |

Note on Benchmark Validity

While Memanto scores near the top of both LongMemEval and LoCoMo, we share the broader industry sentiment that the current generation of agentic memory benchmarks is deeply flawed and should not be used as the sole indicator of a memory system's quality or production readiness.

As detailed in the original Memanto paper (and mirrored by other researchers), these datasets suffer from:

Missing Ground Truth: Several categories have missing answers, incorrect speaker attributions, and subjective edge cases.
Ambiguous Questions: Questions often lack the specificity needed for a single correct answer, causing LLM judges to unfairly penalize semantically correct responses.
Insufficient Challenge (Saturated Baselines): The conversations in LoCoMo, for example, are too short (16k-26k tokens) to genuinely stress-test retrieval. They fit easily within the standard context windows of modern foundation models, so they can be answered without a memory system at all. As a result, leading architectures are quickly reaching the practical accuracy ceilings of both LoCoMo and LongMemEval, creating an illusion of parity that breaks down at true long-horizon scales.
Poor Context Simulation: LongMemEval struggles to accurately simulate organic, long-horizon temporal drift, often relying on contrived scenarios rather than authentic state changes.

Quick Start: Interactive Visualizer

You can access the interactive visualizer at https://memanto.ai/evaluation.

We provide a web application that allows you to interactively explore and evaluate the datasets yourself.

Instead of just trusting our static scores, the visualizer lets you:

Dive into the Data: Browse individual questions from the LongMemEval and LoCoMo datasets.
Inspect the Reasoning: See exactly which memories Memanto retrieved and read the LLM judge's reasoning for why an answer passed or failed.
Bring Your Own Keys: Plug in your own LLM provider keys (e.g., OpenAI, Google) to evaluate different inference models. (Note: Do NOT supply a MOORCHEH_API_KEY to the visualizer, as it needs to run against the pre-indexed public dataset.)
Choose Your Models: Select which models you want to use for inference and which to use for judging.
Construct Custom Scenarios: Write your own complex, multi-hop questions against the dataset history to see how the architecture responds in real time.
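
To make the pass/fail judging concrete, here is a minimal sketch of how per-question judge verdicts could be rolled up into the per-category and overall accuracies shown in the results tables. The record layout (`category`, `passed`) and the function name `summarize` are hypothetical; the actual output format of the evaluator in this repository may differ.

```python
from collections import defaultdict


def summarize(verdicts):
    """Return (per_category_accuracy, overall_accuracy), both as percentages."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for verdict in verdicts:
        total[verdict["category"]] += 1
        passed[verdict["category"]] += int(verdict["passed"])

    per_category = {name: 100.0 * passed[name] / total[name] for name in total}
    question_count = sum(total.values())
    # Overall accuracy is computed per question, not as a mean of category scores.
    overall = 100.0 * sum(passed.values()) / question_count if question_count else 0.0
    return per_category, overall


# Hypothetical verdicts, for illustration only (not real benchmark data).
sample = [
    {"category": "temporal", "passed": True},
    {"category": "temporal", "passed": False},
    {"category": "multi-hop", "passed": True},
]

per_category, overall = summarize(sample)
print(per_category, round(overall, 2))
```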

Reproducing the Benchmarks (CLI)

If you prefer to run the full end-to-end evaluation pipeline via the command line to reproduce our static scores, you can use the included Python scripts.

Ensure you have the following environment variables set before starting:

MOORCHEH_API_KEY
GEMINI_API_KEY (or the respective key for your chosen LLM evaluator)
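
A quick pre-flight check can catch a missing key before a long ingestion run starts. The sketch below is not part of the repository; the helper name `missing_env_vars` is hypothetical, and `GEMINI_API_KEY` should be swapped for whichever evaluator key you actually use.

```python
import os

# Variable names taken from the list above; adjust the second one for your LLM provider.
REQUIRED_VARS = ("MOORCHEH_API_KEY", "GEMINI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```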

To run a benchmark, execute the ingestion script, then the evaluator, then the report generator. Replace <benchmark> with either locomo or longmem:

Ingest Data (populates the memory store):
```
python <benchmark>/ingestor.py
```
Run Evaluation (generates results):
```
python <benchmark>/evaluator.py
```
Generate Report (compiles the final evaluation report):
```
python <benchmark>/generate_report.py
```

(Note: Running the full LongMemEval benchmark pipeline requires a Moorcheh Pro subscription due to the dataset size.)
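
If you would rather run the three stages in one go, they can be chained from a small driver script. This is a sketch under the assumption that the scripts take no required arguments and are run from the repository root, as in the commands above; `pipeline_commands` and `run_pipeline` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

# Script names and benchmark folders as given in the commands above.
STAGES = ("ingestor.py", "evaluator.py", "generate_report.py")
BENCHMARKS = ("locomo", "longmem")


def pipeline_commands(benchmark):
    """Build the three commands for one benchmark, in execution order."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"benchmark must be one of {BENCHMARKS}, got {benchmark!r}")
    # sys.executable keeps every stage on the same Python interpreter.
    return [[sys.executable, f"{benchmark}/{stage}"] for stage in STAGES]


def run_pipeline(benchmark):
    """Run ingestion, evaluation, and report generation, stopping at the first failure."""
    for command in pipeline_commands(benchmark):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if a stage exits with a non-zero status.
        subprocess.run(command, check=True)


# Example (run from the repository root):
# run_pipeline("locomo")
```

Stopping at the first failing stage avoids evaluating against a partially populated memory store.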

