Boring connection layer
WebSocket servers should do one thing: hold connections and forward frames. Business logic belongs upstream in consumers that write to pub/sub channels.
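As a rough illustration of how little the connection layer needs to know, here is a minimal sketch. `ConnectionRegistry` and the `Send` callable are hypothetical names, and the callable stands in for a real socket's send method. Frames pass through unchanged, with no parsing and no business rules.

```python
from collections import defaultdict
from typing import Callable, Dict

# Hypothetical stand-in for a socket's send method: takes raw bytes.
Send = Callable[[bytes], None]


class ConnectionRegistry:
    """Tracks which connections listen on which channel; holds no business logic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Dict[int, Send]] = defaultdict(dict)

    def attach(self, channel: str, conn_id: int, send: Send) -> None:
        self._subscribers[channel][conn_id] = send

    def detach(self, channel: str, conn_id: int) -> None:
        self._subscribers[channel].pop(conn_id, None)
        if not self._subscribers[channel]:
            del self._subscribers[channel]  # drop empty channels

    def forward(self, channel: str, frame: bytes) -> int:
        """Pass a frame through unchanged; return how many connections got it."""
        targets = list(self._subscribers.get(channel, {}).values())
        for send in targets:
            send(frame)
        return len(targets)
```

Anything that decides what a frame means would live upstream, in the consumers that publish to the channel.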
Write path vs read path
Client → API → Kafka → Consumer → Redis PUBLISH → WS pod → Client
Kafka gives you replay and backpressure. Redis gives you fan-out speed. Don't skip the durable log: Redis pub/sub does not retain messages, so you'll need the log for reconnect snapshots.
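The split between the durable log and the fan-out layer can be sketched with in-memory stand-ins. Nothing below talks to real Kafka or Redis. `DurableLog`, `FanOut`, `consume`, and `replay` are hypothetical names, and the offset-based replay is one assumed way to serve a reconnecting client.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[int, bytes], None]  # (offset, payload)


class DurableLog:
    """In-memory stand-in for a Kafka partition: append-only, addressed by offset."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, bytes]] = []

    def append(self, channel: str, payload: bytes) -> int:
        self._records.append((channel, payload))
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int) -> List[Tuple[int, str, bytes]]:
        return [(i, ch, p) for i, (ch, p) in enumerate(self._records) if i >= offset]


class FanOut:
    """In-memory stand-in for Redis pub/sub: delivers only to current subscribers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._handlers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, offset: int, payload: bytes) -> int:
        handlers = self._handlers.get(channel, [])
        for handler in handlers:
            handler(offset, payload)
        return len(handlers)


def consume(log: DurableLog, fanout: FanOut, next_offset: int) -> int:
    """Consumer step: publish every record at or after next_offset; return the new offset."""
    for offset, channel, payload in log.read_from(next_offset):
        fanout.publish(channel, offset, payload)
        next_offset = offset + 1
    return next_offset


def replay(log: DurableLog, channel: str, last_seen_offset: int) -> List[Tuple[int, bytes]]:
    """Rebuild what a reconnecting client missed on one channel from the log."""
    return [(o, p) for o, ch, p in log.read_from(last_seen_offset + 1) if ch == channel]
```

A client that reconnects with its last seen offset gets the gap from `replay`, which is only possible because the log kept the records that the fan-out layer had already delivered and forgotten.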
Partitioning strategy
- Channel key = `roomId` or `userId`, depending on isolation needs.
- Edge pods subscribe to a hash slot subset to limit Redis fan-in.
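One way to picture the two points above is the sketch below. The slot count, the `room:`/`user:` key prefixes, and the modulo assignment of slots to pods are all assumptions for illustration. CRC32 is used only as a convenient stable hash and is not the hash that Redis Cluster itself uses for its slots.

```python
import zlib
from typing import Optional, Set

NUM_SLOTS = 64  # hypothetical slot count, chosen for the example


def channel_key(room_id: Optional[str] = None, user_id: Optional[str] = None) -> str:
    """Build the channel key from exactly one of room_id or user_id."""
    if (room_id is None) == (user_id is None):
        raise ValueError("pass exactly one of room_id or user_id")
    return f"room:{room_id}" if room_id is not None else f"user:{user_id}"


def slot_for(channel: str, num_slots: int = NUM_SLOTS) -> int:
    """Map a channel to a slot with a hash that is stable across processes."""
    return zlib.crc32(channel.encode("utf-8")) % num_slots


def slots_for_pod(pod_index: int, pod_count: int, num_slots: int = NUM_SLOTS) -> Set[int]:
    """Slots a given edge pod subscribes to (assumed round-robin assignment)."""
    return {s for s in range(num_slots) if s % pod_count == pod_index}


def pod_for(channel: str, pod_count: int, num_slots: int = NUM_SLOTS) -> int:
    """Pod that owns a channel's slot, so clients can be routed to it."""
    return slot_for(channel, num_slots) % pod_count
```

With this layout each pod listens to roughly `1 / pod_count` of the traffic, at the cost of having to route each client to the pod that owns its channel's slot.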
Deploy without dropping messages
- Mark pod as draining (stop new connections).
- Send `GOAWAY` with a reconnect hint to existing clients.
- Wait for connection count → 0 or timeout.
- Kill pod.
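The drain sequence above might look roughly like this with `asyncio`. `DrainablePod` is a hypothetical class. The WebSocket protocol has no `GOAWAY` frame of its own, so the sketch assumes it is an application-level message, and the `reconnect_after_ms` field is an invented shape for the reconnect hint.

```python
import asyncio
from typing import Awaitable, Callable, Dict

SendJson = Callable[[dict], Awaitable[None]]  # hypothetical per-connection sender


class DrainablePod:
    def __init__(self) -> None:
        self.draining = False
        self.connections: Dict[int, SendJson] = {}

    def accept(self, conn_id: int, send: SendJson) -> bool:
        """Refuse new connections once draining has started."""
        if self.draining:
            return False
        self.connections[conn_id] = send
        return True

    def close(self, conn_id: int) -> None:
        self.connections.pop(conn_id, None)

    async def drain(self, reconnect_after_ms: int, timeout_s: float,
                    poll_s: float = 0.05) -> int:
        # Step 1: stop accepting new connections.
        self.draining = True
        # Step 2: tell existing clients to go elsewhere (assumed message shape).
        notice = {"type": "GOAWAY", "reconnect_after_ms": reconnect_after_ms}
        senders = list(self.connections.values())
        await asyncio.gather(*(send(notice) for send in senders))
        # Step 3: wait until every client has left or the timeout expires.
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout_s
        while self.connections and loop.time() < deadline:
            await asyncio.sleep(poll_s)
        # Step 4 belongs to the caller: terminate the pod. The return value is
        # the number of connections that will be cut off when that happens.
        return len(self.connections)
```

A non-zero return value means some clients ignored the hint, which is the case the timeout exists for.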
Metrics that matter
- Publish lag (Kafka → Redis)
- Per-pod connection count
- Reconnect rate after deploy
If the reconnect rate spikes, your drain window is too aggressive.
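As a sketch of how these numbers could be derived and compared, the helpers below compute publish lag from two timestamps and flag a reconnect spike against a pre-deploy baseline. The `PodSample` shape, the `spike_factor` of 3, and the `min_rate` floor are all assumptions to tune, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    connections: int   # per-pod connection count at sample time
    reconnects: int    # reconnects observed during the window
    window_s: float    # length of the sampling window in seconds


def publish_lag_ms(kafka_ts_ms: int, redis_publish_ts_ms: int) -> int:
    """Time between a record landing in Kafka and its Redis PUBLISH."""
    return max(0, redis_publish_ts_ms - kafka_ts_ms)


def reconnect_rate(sample: PodSample) -> float:
    """Reconnects per second over the sampling window."""
    return sample.reconnects / sample.window_s


def drain_too_aggressive(baseline: PodSample, after_deploy: PodSample,
                         spike_factor: float = 3.0, min_rate: float = 0.1) -> bool:
    """Flag a deploy whose reconnect rate far exceeds the baseline.

    min_rate keeps a near-zero baseline from turning every reconnect into a spike.
    """
    threshold = spike_factor * max(reconnect_rate(baseline), min_rate)
    return reconnect_rate(after_deploy) > threshold
```

For example, a baseline of 6 reconnects per minute against 120 per minute after a deploy would be flagged under these defaults.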