Sergey Orsik.dev

2026-05-10

How I Designed a Memory System for AI Chat

Tiered memory, retrieval paths, and the consistency tradeoffs that actually matter.

The problem

Users don't experience "RAG" — they experience continuity. If the assistant forgets what you said five messages ago, the product feels broken even when latency is perfect.

Design principles

  1. Hot path stays hot: session state lives in Redis and is served from memory on every read.
  2. Semantic search tolerates staleness: vector retrieval can lag slightly behind writes; session state cannot.
  3. Summaries are lossy by design: compress threads into facts, not transcripts.

Three tiers

Tier       Store               Latency target  Consistency
Session    Redis               < 10 ms         Strong
Semantic   pgvector            < 80 ms         Eventual
Long-term  Postgres summaries  Background      Eventual
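
One way to keep these targets from drifting is to encode the tiers as data that tests and dashboards can read. The following is a minimal sketch: the names `Tier`, `TIERS`, and `over_budget` are hypothetical, and only the stores, latency targets, and consistency levels come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    store: str
    latency_target_ms: Optional[float]  # None: background work, no read-path target
    consistency: str  # "strong" or "eventual"


# Values mirror the tier table; the keys and field names are illustrative.
TIERS = {
    "session": Tier("session", "Redis", 10.0, "strong"),
    "semantic": Tier("semantic", "pgvector", 80.0, "eventual"),
    "long_term": Tier("long_term", "Postgres summaries", None, "eventual"),
}


def over_budget(tier_name: str, observed_ms: float) -> bool:
    """Return True when an observed read latency misses the tier's target."""
    target = TIERS[tier_name].latency_target_ms
    return target is not None and observed_ms >= target
```

A check like `over_budget("session", 12.0)` could feed an alert, while the long-term tier never trips it because it has no read-path target.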

Retrieval flow

On each turn:

  1. Load session messages from Redis.
  2. Embed latest user message; query vector tier with userId filter.
  3. Merge the vector results with keyword hits using reciprocal rank fusion (RRF).
  4. Inject the top-k into the prompt without exceeding the token budget; when over budget, trim by recency score.
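
Steps 3 and 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: `rrf_merge`, `count_tokens`, and `trim_to_budget` are hypothetical names, the constant `k = 60` is a commonly used RRF default rather than a value from this post, the whitespace tokenizer stands in for a real one, and "trim by recency score" is read here as "drop the least recent chunk first".

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): whitespace-separated words."""
    return len(text.split())


def trim_to_budget(
    ids: Sequence[str],
    texts: Dict[str, str],
    recency: Dict[str, float],
    budget: int,
) -> List[str]:
    """Drop the lowest-recency chunks until the total token count fits."""
    kept = list(ids)
    total = sum(count_tokens(texts[i]) for i in kept)
    while kept and total > budget:
        victim = min(kept, key=lambda i: recency[i])
        kept.remove(victim)
        total -= count_tokens(texts[victim])
    return kept  # survivors keep their fused order


# Usage with made-up memory IDs and scores.
vector_hits = ["m3", "m1", "m7"]
keyword_hits = ["m1", "m9", "m3"]
top_k = rrf_merge([vector_hits, keyword_hits])[:3]

texts = {
    "m1": "prefers dark mode",
    "m3": "is planning a trip to Lisbon in June",
    "m9": "works in UTC+2",
}
recency = {"m1": 0.2, "m3": 0.9, "m9": 0.5}
prompt_ids = trim_to_budget(top_k, texts, recency, budget=12)
```

RRF only needs ranks, not comparable scores, which is why it works for mixing cosine distances with keyword relevance.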

The biggest win wasn't a better embedding model. It was keeping the response from blocking on tier-3 consolidation, the background work that writes long-term summaries.

What I'd do differently

  • Start with explicit memory "slots" (preferences, goals) instead of only free-form chunks.
  • Ship observability for recall@k per session type from day one.
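
Both ideas are small in code. The sketch below is hypothetical throughout: `MemorySlots`, `recall_at_k`, and `recall_by_session_type` are illustrative names, the session types in the usage note are made up, and the recall metric assumes labeled samples that say which memories were actually relevant for a turn.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class MemorySlots:
    """Explicit, typed memory kept alongside free-form chunks."""
    preferences: Dict[str, str] = field(default_factory=dict)
    goals: List[str] = field(default_factory=list)


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


Sample = Tuple[str, List[str], Set[str]]  # (session_type, retrieved, relevant)


def recall_by_session_type(samples: Iterable[Sample], k: int) -> Dict[str, float]:
    """Average recall@k for each session type."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for session_type, retrieved, relevant in samples:
        buckets[session_type].append(recall_at_k(retrieved, relevant, k))
    return {t: sum(v) / len(v) for t, v in buckets.items()}
```

For example, feeding it samples tagged "support" and "coding" yields one averaged recall@k per tag, which is enough to spot a session type where retrieval quietly underperforms.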