How I Designed a Memory System for AI Chat

Tiered memory, retrieval paths, and the consistency tradeoffs that actually matter.

The problem

Users don't experience "RAG" — they experience continuity. If the assistant forgets what you said five messages ago, the product feels broken even when latency is perfect.

Design principles

Hot path stays hot: session state lives in Redis, so reads on the request path are always served from memory.
Semantic search is async-friendly: vector retrieval can be slightly stale; session state cannot.
Summaries are lossy by design: compress threads into facts, not transcripts.

Three tiers

Tier	Store	Latency target	Consistency
Session	Redis	< 10 ms	Strong
Semantic	pgvector	< 80 ms	Eventual
Long-term	Postgres summaries	Background	Eventual
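
To make the table concrete, here is a minimal sketch of the three tiers as plain configuration. The class and field names (Tier, latency_budget_ms, on_request_path) are illustrative assumptions, not names from the real system; only the stores, latency targets, and consistency levels come from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Consistency(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"


@dataclass(frozen=True)
class Tier:
    """One row of the tier table. All names here are hypothetical."""
    name: str
    store: str
    latency_budget_ms: Optional[int]  # None means background work with no per-turn budget
    consistency: Consistency

    @property
    def on_request_path(self) -> bool:
        # A tier with a latency budget is read while the user is waiting.
        return self.latency_budget_ms is not None


TIERS: List[Tier] = [
    Tier("session", "redis", 10, Consistency.STRONG),
    Tier("semantic", "pgvector", 80, Consistency.EVENTUAL),
    Tier("long_term", "postgres_summaries", None, Consistency.EVENTUAL),
]


def request_path_budget_ms(tiers: List[Tier]) -> int:
    """Upper bound on memory latency per turn if the tiers are read one after another."""
    return sum(t.latency_budget_ms for t in tiers if t.on_request_path)
```

Under these assumptions the two request-path tiers add up to at most 90 ms when read sequentially, and the long-term tier never counts against the turn.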

Retrieval flow

On each turn:

Load session messages from Redis.
Embed the latest user message; query the vector tier with a userId filter.
Merge the vector results with keyword hits using reciprocal rank fusion.
Inject the top-k results into the prompt; never exceed the token budget. When over budget, trim by recency score.
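
The merge and trim steps can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: the function names, the sample IDs, token counts, and recency scores are all made up, k = 60 is the conventional RRF constant rather than a value from this system, and "trim by recency score" is read here as "drop the least recent chunk first until the prompt fits".

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def trim_to_budget(
    chunk_ids: Sequence[str],
    token_counts: Dict[str, int],
    recency: Dict[str, float],
    budget: int,
) -> List[str]:
    """Drop the lowest-recency chunks until the total fits the token budget."""
    kept = list(chunk_ids)
    total = sum(token_counts[c] for c in kept)
    while kept and total > budget:
        stalest = min(kept, key=lambda c: recency[c])
        kept.remove(stalest)
        total -= token_counts[stalest]
    return kept  # survivors keep their fused order


# Hypothetical data for one turn.
vector_hits = ["m3", "m1", "m7"]
keyword_hits = ["m1", "m9", "m3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])

top_k = fused[:3]
token_counts = {"m1": 120, "m3": 200, "m9": 150, "m7": 90}
recency = {"m1": 0.9, "m3": 0.2, "m9": 0.6, "m7": 0.4}
injected = trim_to_budget(top_k, token_counts, recency, budget=300)
```

With this sample data, m1 ranks first because it sits near the top of both lists, and m3 is dropped at the trim step because it is the stalest chunk in an over-budget top 3.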

The biggest win wasn't a better embedding model. It was not blocking the response on tier-3 (long-term summary) consolidation.

What I'd do differently

Start with explicit memory "slots" (preferences, goals) instead of only free-form chunks.
Ship observability for recall@k per session type from day one.
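
Both ideas are small enough to sketch. The slot fields follow the two examples named above (preferences, goals); everything else, including the class name, the sample data, and the assumption that each logged turn comes with a labeled set of relevant memory IDs, is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set, Tuple


@dataclass
class MemorySlots:
    """Explicit, typed memory read by key, alongside free-form chunks."""
    preferences: Dict[str, str] = field(default_factory=dict)
    goals: List[str] = field(default_factory=list)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


Sample = Tuple[str, Sequence[str], Set[str]]  # (session_type, retrieved, relevant)


def mean_recall_by_session_type(samples: Sequence[Sample], k: int) -> Dict[str, float]:
    """Average recall@k per session type, ready to export as a metric."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for session_type, retrieved, relevant in samples:
        buckets[session_type].append(recall_at_k(retrieved, relevant, k))
    return {t: sum(v) / len(v) for t, v in buckets.items()}


# Hypothetical usage.
slots = MemorySlots(preferences={"tone": "concise"}, goals=["ship v2"])
samples: List[Sample] = [
    ("support", ["m1", "m3", "m9", "m7"], {"m3", "m7"}),
    ("support", ["a", "b"], {"a"}),
    ("coding", ["x", "y", "z"], {"q"}),
]
report = mean_recall_by_session_type(samples, k=3)
```

Slots need no retrieval at all, which is the point; recall@k then measures how well the free-form path does for whatever is left.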