The problem
Users don't experience "RAG" — they experience continuity. If the assistant forgets what you said five messages ago, the product feels broken even when latency is perfect.
Design principles
- Hot path stays hot — session state in Redis, always in-memory on the read path.
- Semantic search is async-friendly — vector retrieval can be slightly stale; session cannot.
- Summaries are lossy by design — compress threads into facts, not transcripts.
Three tiers
| Tier | Store | Latency target | Consistency |
|---|---|---|---|
| Session | Redis | < 10ms | Strong |
| Semantic | pgvector | < 80ms | Eventual |
| Long-term | Postgres summaries | Background | Eventual |
Retrieval flow
On each turn:
- Load session messages from Redis.
- Embed latest user message; query vector tier with `userId` filter.
- Merge results with reciprocal rank fusion against keyword hits.
- Inject top-k into prompt; never exceed token budget — trim by recency score.
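The merge and trim steps above can be sketched roughly as follows. This is a minimal illustration, not the production code: the `Chunk` type, the `recency` field, `count_tokens`, and `fit_to_budget` are hypothetical names, the whitespace token counter stands in for a real tokenizer, and "trim by recency score" is read here as "drop the least recent chunk first".

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    recency: float  # assumed recency score; higher means more recent


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids. k=60 is a commonly used default."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())


def fit_to_budget(chunks: list[Chunk], top_k: int, token_budget: int) -> list[Chunk]:
    """Take the top_k fused chunks, then drop the least recent until they fit."""
    selected = chunks[:top_k]
    while selected and sum(count_tokens(c.text) for c in selected) > token_budget:
        oldest = min(selected, key=lambda c: c.recency)
        selected = [c for c in selected if c is not oldest]
    return selected


# Example: fuse vector hits with keyword hits, then enforce the budget.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused_ids = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Trimming after fusion keeps the ranking logic independent of the budget: fusion decides what is relevant, and the budget step only decides what still fits.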
The biggest win wasn't a better embedding model. It was not blocking the response on tier-3 consolidation: long-term summaries are built in the background, off the response path.
What I'd do differently
- Start with explicit memory "slots" (preferences, goals) instead of only free-form chunks.
- Ship observability for recall@k per session type from day one.
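For the observability point, recall@k per session type could be computed from logged retrieval events along these lines. This is a sketch under assumptions: the `RetrievalEvent` shape and the session type labels are hypothetical, and it presumes some source of relevance labels (for example, manually annotated sessions) that the post does not describe.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RetrievalEvent:
    session_type: str       # hypothetical label, e.g. "support" or "coding"
    retrieved: list[str]    # chunk ids in ranked order
    relevant: frozenset[str]  # chunk ids judged relevant for this turn


def recall_at_k(retrieved: list[str], relevant: frozenset[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def recall_by_session_type(events: list[RetrievalEvent], k: int) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for event in events:
        if not event.relevant:
            continue  # recall is undefined when nothing is relevant
        buckets[event.session_type].append(
            recall_at_k(event.retrieved, event.relevant, k)
        )
    return {stype: mean(values) for stype, values in buckets.items()}


events = [
    RetrievalEvent("support", ["a", "b", "c"], frozenset({"a", "c"})),
    RetrievalEvent("support", ["x", "y"], frozenset({"x"})),
    RetrievalEvent("coding", ["m", "n", "o"], frozenset({"o"})),
]
report = recall_by_session_type(events, k=2)
```

Breaking the metric out by session type makes it visible when one kind of conversation retrieves well while another quietly degrades.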