Looking for the managed version? Extremis Cloud →
Extremis · MIT-licensed memory library

Memory that gets smarter the more your agent uses it.

Layered, learning memory for AI agents in two lines of config. Self-host the OSS library, or skip the infrastructure and use the managed version. Same engine, your call.

✓ RL-scored retrieval  ✓ Knowledge graph  ✓ MCP-native (9 tools)  ✓ Hallucination detection
agent.py
# pip install extremis
from extremis import Extremis

mem = Extremis()

mem.remember("User is building a WhatsApp AI")
hits = mem.recall("what is the user building?")

for r in hits:
    print(r.memory.content)
    print(r.reason)  # similarity 0.87 · score +2.0 · used 5× · 3d

How it works

Three primitives. Everything else is consolidation that runs in the background.

1. mem.remember()

append to fsync'd log + episodic store

mem.remember(
  "user wants the SLA in writing",
  conversation_id="c1",
)
2. mem.recall()

ranked by cosine × RL score × recency

hits = mem.recall("SLA")
# returns ranked results,
# each with .reason and .verification
3. mem.reinforce()

asymmetric 1.5× weight on negative signals

mem.report_outcome(
  [h.memory.id for h in hits[:2]],
  success=True,
)
Plus a nightly dream pass. A consolidation worker reads the day's episodes and distils durable facts into the semantic layer. On self-host you run it; on Cloud it runs for you.
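To make the ranking and reinforcement rules above concrete, here is a minimal sketch of how "cosine × RL score × recency" and the 1.5× asymmetric penalty could fit together. This is not the library's implementation: `Candidate`, `rank_value`, `rank`, `reinforce`, the logistic squash, and the 30-day half-life are all hypothetical choices made for illustration. Only the three-factor product and the 1.5× weight come from the text.

```python
import math
from dataclasses import dataclass

# From the text above: negative outcomes weigh 1.5x as much as positive ones.
NEGATIVE_WEIGHT = 1.5


@dataclass
class Candidate:
    content: str
    cosine: float    # query similarity in [0, 1]
    rl_score: float  # running reinforcement score, may be negative
    age_days: float


def rank_value(c: Candidate, half_life_days: float = 30.0) -> float:
    # Assumption: squash the RL score into (0, 1) with a logistic so a
    # negative score shrinks the product instead of flipping its sign.
    rl_factor = 1.0 / (1.0 + math.exp(-c.rl_score))
    # Assumption: recency halves every `half_life_days`.
    recency = 0.5 ** (c.age_days / half_life_days)
    return c.cosine * rl_factor * recency


def rank(candidates: list[Candidate]) -> list[Candidate]:
    # Highest combined value first.
    return sorted(candidates, key=rank_value, reverse=True)


def reinforce(score: float, success: bool, step: float = 1.0) -> float:
    # A success adds one step; a failure subtracts 1.5 steps.
    return score + step if success else score - NEGATIVE_WEIGHT * step
```

With these assumptions, one failure outweighs one success, so a memory that helps and hurts equally often drifts down the ranking over time.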

Every recall explains itself

Debuggable by default.

No black box. Every result carries a one-line reason, the same string the library returns whether you self-host or use Cloud. You see exactly why a memory surfaced, in plain English.

example reason strings

  • similarity 0.87 · score +2.0 · used 5× · 3d old
  • identity layer (×2 weight) · matched user's prior preference
  • downranked: judge flagged unverified at write time
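For a feel of how the first string above could be assembled from the underlying numbers, here is a small sketch. `format_reason` and its parameters are hypothetical names for illustration. The library's actual formatting code is not shown in this page.

```python
def format_reason(similarity: float, score: float, uses: int, age_days: int) -> str:
    # Each part mirrors one field of the example reason string.
    parts = [
        f"similarity {similarity:.2f}",
        f"score {score:+.1f}",  # always show the sign, as in "+2.0"
        f"used {uses}×",
        f"{age_days}d old",
    ]
    return " · ".join(parts)
```

Keeping the string a pure function of the ranking inputs is one way to guarantee the explanation never drifts from what the ranker actually used.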

Vs. the alternatives

What sets Extremis apart.

Feature | Extremis | Mem0 | Letta | Raw RAG
Layered memory (identity/semantic/episodic/procedural) | ✓ | - | - | -
RL-scored retrieval (1.5× asymmetric on negatives) | ✓ | - | - | -
Per-recall reason strings | ✓ | - | - | -
Knowledge graph built in | ✓ | - | partial | -
Hallucination detection bundled | ✓ | - | - | -
Self-hostable (MIT) | ✓ | - | - | ✓
Managed option available | ✓ | ✓ | ✓ | -
MCP server (9 tools) | ✓ | - | - | -

Benchmarks

LongMemEval-S · 500 QA instances · ~53 sessions each.

Hosted Extremis is the identical engine: same numbers, fully managed. Methodology: see the reproducible benchmark run. QA accuracy is downstream-model-dependent; stronger answerers raise it significantly.

94.4%

Retrieval R@5

top-5 includes the answer session

38.8%

QA Accuracy

claude-haiku-4-5 as answerer

~35ms

p50 recall latency

local model · MPS · varies in prod
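For readers unfamiliar with the retrieval metric, this sketch shows one common way to compute R@5. It assumes a question counts as a hit when any of its answer sessions appears in the top 5 retrieved sessions. The benchmark's own scoring script may define it differently, and `recall_at_k` is a hypothetical helper, not part of the library.

```python
def recall_at_k(ranked_sessions: list[list[str]],
                answer_sessions: list[list[str]],
                k: int = 5) -> float:
    # ranked_sessions[i]: session ids retrieved for question i, best first.
    # answer_sessions[i]: session ids that contain the answer to question i.
    if not ranked_sessions or len(ranked_sessions) != len(answer_sessions):
        raise ValueError("need one non-empty ranking per question")
    hits = 0
    for ranked, answers in zip(ranked_sessions, answer_sessions):
        # Assumption: one matching session in the top k is enough for a hit.
        if set(ranked[:k]) & set(answers):
            hits += 1
    return hits / len(ranked_sessions)
```

Under this definition, 94.4% on 500 questions corresponds to 472 questions with an answer session in the top 5.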

Hallucination detection

Wrong memories are flagged, not stored quietly.

A two-tier verifier runs at write time: a fast NLI model first, then an LLM judge for grey-zone scores. Failing memories aren't silently dropped; they're tagged unverified and downranked at recall time. Every recall returns a verification trace you can inspect.

  • On self-host: configure your own thresholds, pick the NLI model, point at any judge LLM.
  • On Cloud: dashboard surfaces flagged memories as a triage queue and renders the trace tree with red rows on failures.
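The two-tier flow described above can be sketched as a small function: a cheap NLI score decides the clear cases, and only grey-zone scores go to the judge. Everything named here is an assumption for illustration: `verify`, `Verification`, the `accept`, `reject`, and `judge_pass` thresholds, and the scorer callables are hypothetical and do not reflect the library's actual API or defaults.

```python
from dataclasses import dataclass, field
from typing import Callable

# A scorer takes (claim, source) and returns a groundedness score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass
class Verification:
    status: str  # "verified" or "unverified"
    trace: list = field(default_factory=list)  # (step name, score) pairs


def verify(claim: str, source: str, nli: Scorer, judge: Scorer,
           accept: float = 0.8, reject: float = 0.2,
           judge_pass: float = 0.5) -> Verification:
    trace = []
    nli_score = nli(claim, source)
    trace.append(("verifier.nli", nli_score))
    # Clear cases are settled by the fast model alone.
    if nli_score >= accept:
        return Verification("verified", trace)
    if nli_score <= reject:
        return Verification("unverified", trace)
    # Grey zone: escalate to the slower LLM judge.
    judge_score = judge(claim, source)
    trace.append(("verifier.judge", judge_score))
    status = "verified" if judge_score >= judge_pass else "unverified"
    # The memory is tagged rather than dropped; the caller stores the status.
    return Verification(status, trace)
```

Passing the scorers in as callables keeps the sketch model-agnostic, which mirrors the self-host option of picking your own NLI model and judge LLM.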

example: contradicted recall

extremis.recall               124ms ❌
  embedder.embed               10ms ✓
  retrieve.hybrid              11ms ✓ (semantic + BM25)
  verifier.nli                 14ms ❌ grounded 0.42
  verifier.judge               47ms ❌ grounded 0.18

why it failed:
  sources self-correct from 99.95% to 99.9%;
  extracted memory captured the pre-correction value.

what to try:
  mem.remember_now(layer="semantic", confidence=0.95)
  to pin the corrected fact.

Privacy

MIT-licensed. No lock-in. Cloud is optional.

The library is open-source under MIT. Self-host on SQLite locally or any of five production backends. If Cloud isn't for you (or shuts down tomorrow), point HostedClient.base_url at your own deployment and nothing else changes. Cloud is a convenience, not a dependency.

Pick the version that fits your stack.

Both run the same engine. Self-host gives you full control; Cloud gives you back the afternoon.