
Game Server Architecture: Reliability for Slots and Tables

The 47‑second wobble

It was 19:03 on a Friday. Spins looked fine on the dashboard. P95 sat at 210 ms. But one live blackjack table froze for 47 seconds. Eight players hit “Stand” and saw nothing. Chat went silent. The dealer kept talking, but bets did not lock. After the wobble, two re-buys failed. Support lit up.

We had CPU to spare and green pods. Yet the table stream fell behind. A slow quorum write and a backoff loop in the fan‑out path lined up at the worst time. Slots hid the pain because each spin was a short, self‑contained round. The table could not hide. Money and trust were at risk. The lesson was simple: not all games fail the same way.

What matters when money is on the wire

“Availability” is not one number. A slot spin can retry if the server does not confirm in 250 ms. A live table cannot reorder steps. A dealer action must lock in sequence and show up to every seat in near real time. A wallet must be right even if the UI is late. A regulator may ask you to prove each move in a dispute, months later.

Players also notice more than uptime. They sense jitter, rollbacks, and odd round ends. Many compare studios, RTPs, and bugs before they play. They read reviews that score stability and fairness. If you build or run games, assume users do their homework. Independent review hubs for online slots software are a good place to see how people judge slot engines and vendors. Learn what they value. Then bake that into your design and your SLOs.

A map of the moving parts, minus buzzwords

Most stacks share a core set: session and auth, wallet and ledger, RNG, game logic, state cache and DB, message bus, real‑time fan‑out, storage for logs and replays, and an observability suite. On the edge you steer traffic across regions, often with DNS or Anycast. You may use sticky sessions for WebSockets, but those can trap users on a bad node. Latency budgets must account for TLS, network hop, queue wait, and storage. Read and write paths are not the same. Learn them both end to end. See how Anycast routing shifts users to the closest healthy site.

Keep one “source of truth” for money. Cache is a helper, not a bank. For slots, aim for small, stateless actions with clear ids you can retry. For tables, plan for strict order and shared state. That means a stronger store, careful locks, and fan‑out that does not stall when one client is slow.

Slots and tables fail in different ways

Slots feel bursty. A spin is short. The server can accept a spin id, pull an RNG value, compute the win, write a result, and send the outcome. If the client drops, the server can return the same result on resume. Good idempotency keys make this safe. You can even queue a spin if the ledger is slow, then settle when the wallet is free.
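Here is a minimal sketch of that retry-safe spin flow in Python. The names `SpinService` and `SpinResult`, the toy paytable, and the in-memory dict are all assumptions for illustration; a real service would keep results in a durable store and use a certified RNG.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinResult:
    spin_id: str
    reel_stop: int
    win: int


class SpinService:
    """In-memory stand-in for a spin handler backed by a durable result store."""

    def __init__(self, rng: random.Random) -> None:
        self._rng = rng  # stand-in only; not a certified RNG
        self._results = {}  # spin_id -> stored SpinResult

    def spin(self, spin_id: str, bet: int) -> SpinResult:
        # A retry or resume with a known id returns the stored outcome
        # and never draws from the RNG a second time.
        stored = self._results.get(spin_id)
        if stored is not None:
            return stored
        reel_stop = self._rng.randrange(10)
        win = bet * 5 if reel_stop == 7 else 0  # toy paytable (hypothetical)
        result = SpinResult(spin_id, reel_stop, win)
        self._results[spin_id] = result
        return result
```

Calling `spin("s1", 10)` twice returns the same object both times. A production version would also check that a repeated id carries the same bet and reject it if not.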

Tables are the opposite. They are shared. All seats must see the same thing in the same order. Bets must close before the dealer acts. The server must fan out updates to many clients at once. Jitter shows up as “lag” to all players. One slow seat must not block the table, and that seat must not see skips or jumps either. You need replay buffers and backpressure controls.
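One way to get both properties is a bounded queue per seat. The sketch below is a simplified model, not a network server: `Table`, `SeatFeed`, `max_pending`, and the snapshot flag are hypothetical names, and real code would send over sockets. A seat that falls too far behind loses its whole backlog and is flagged for a full snapshot, so it never sees a stream with a hole in it.

```python
from collections import deque


class SeatFeed:
    """Bounded buffer of (seq, event) pairs for one seat."""

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self.pending = deque()
        self.needs_snapshot = False

    def offer(self, seq: int, event: str) -> None:
        if self.needs_snapshot:
            return  # already flagged for resync; queue nothing more
        if len(self.pending) >= self.max_pending:
            # Too far behind: drop the whole backlog and flag a resync,
            # so the seat never receives a stream with a hole in it.
            self.pending.clear()
            self.needs_snapshot = True
            return
        self.pending.append((seq, event))


class Table:
    def __init__(self, max_pending: int = 3) -> None:
        self.max_pending = max_pending
        self.seq = 0
        self.seats = {}  # seat_id -> SeatFeed

    def join(self, seat_id: str) -> None:
        self.seats[seat_id] = SeatFeed(self.max_pending)

    def publish(self, event: str) -> int:
        # Only appends to memory, so a slow socket cannot delay other seats.
        self.seq += 1
        for feed in self.seats.values():
            feed.offer(self.seq, event)
        return self.seq

    def drain(self, seat_id: str) -> list:
        """Events to send now; call when the seat's socket is writable."""
        feed = self.seats[seat_id]
        out = list(feed.pending)
        feed.pending.clear()
        return out

    def resync(self, seat_id: str) -> int:
        """Clear the flag after sending a full snapshot; returns the seq it covers."""
        self.seats[seat_id].needs_snapshot = False
        return self.seq
```

The design choice here is to prefer a clean resync over partial delivery. Dropping only the oldest frames can work for video, but for game state it would show the seat a jump.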

State order is key. You will dedupe messages. You may use sequence numbers. Some teams use streams for this. If you go that way, study exactly-once semantics and know how they behave on failover. Tables also carry chat, tips, and side bets. These add more fan‑out and raise the chance of head‑of‑line blocking.
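Dedupe plus sequence numbers can be sketched in a few lines. This is a generic consumer-side pattern, not the API of any stream product, and `OrderedApplier` is a made-up name. It assumes the producer stamps each table event with a gapless sequence number starting at 1.

```python
class OrderedApplier:
    """Applies events strictly in sequence order.

    Duplicates are dropped; events that arrive early are held until
    the gap before them is filled.
    """

    def __init__(self) -> None:
        self.next_seq = 1
        self.held = {}     # seq -> event, waiting for earlier events
        self.applied = []  # events in the order they took effect

    def receive(self, seq: int, event: str) -> None:
        if seq < self.next_seq or seq in self.held:
            return  # duplicate delivery; already applied or already held
        self.held[seq] = event
        # Apply every event that is now contiguous with what we have.
        while self.next_seq in self.held:
            self.applied.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

Feed it events 1, 3, 3, 2, 1 and it applies 1, 2, 3 once each. Real code would also bound `held` and ask for a resend when a gap lasts too long.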

In short: slots reward simple, idempotent calls and fast writes. Tables demand strict order, low jitter, and good fan‑out. Treat them as different SLOs and even different runbooks.

The numbers you must own

Set clear budgets per path. If you do not, you will chase noise. Put a number on round‑trip, on queue wait, on DB write, and on fan‑out. Track them as SLOs. Here are sane starting points from real teams. Tune them for your stack and regions.

Targets below assume a healthy network. They are P95s. Tune P99s once you get P95s stable.

| Component | P95 target | Common failure modes | Player impact | Blast radius | Signals to watch | Mitigations | RTO / RPO | Slots vs. tables |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNG microservice | ≤ 5 ms per call in‑region | Entropy source stall; bad seed; slow pool | Spin lag; trust risk if bias found | All games using this RNG | Trace span for RNG; health of entropy; chi‑square smoke tests | Local cache of draws; dual RNG path; circuit breaker | RTO: 5 min; RPO: zero (no lost draws) | Slots: small bursts; Tables: less frequent but must be in order |
| Game state store (cache + DB) | Cache hit ≤ 1 ms; DB write ≤ 15 ms | Hot key; eviction; leader failover | Spin retry; table freeze or rollback | Game family or all tables on shard | Cache hit ratio; DB commit time; failover count | Sharding; hot key split; fast failover runbook | RTO: 10–15 min; RPO: seconds with sync replicas | Tables need quorum writes; slots accept async ack with idempotency |
| Messaging bus / stream | Produce ≤ 5 ms; consume ≤ 10 ms | Partition skew; consumer lag; rebalances | Late updates; out‑of‑order state | Topic‑wide; often cross‑service | Lag metrics; rebalancing rate; duplicate rate | Key by table; consumer pinning; backpressure | RTO: 15 min; RPO: zero with durable log | Tables rely on order; slots can ignore late bus events |
| Wallet / ledger | Credit / debit ≤ 25 ms | Lock contention; double spend guard | Balance wrong or slow; disputes | All players touching wallet | Lock wait time; conflict rate; reconciliation diffs | Short tx; idempotent ops; async notifications | RTO: 15–30 min; RPO: zero (authoritative) | Slots can queue settle; tables must block until funds lock |
| Session / auth | Login ≤ 120 ms; token check ≤ 5 ms | Token drift; cache miss storm | Kicks; ghost seats; failed rejoin | Large; all users | 401/403 rate; token TTL errors | Edge cache; staged rollouts; key rotation plan | RTO: 10 min; RPO: n/a | Both suffer; tables need smooth rejoin to keep order |
| Real‑time fan‑out (WebSocket / SFU) | Push ≤ 50 ms to last seat | Slow clients; head‑of‑line block | Lag; missed tells; late closes | Per table; can stack up fast | Per‑seat send time; drop rate; buffer depth | Per‑client queues; drop old frames; adaptive bitrate | RTO: 5–10 min; RPO: n/a | Critical for tables; slots care less |
| Content / CDN | Asset fetch ≤ 50 ms edge hit | Stale assets; purge lag | UI bugs; mismatch with server | Wide; all clients | Cache hit; 4xx/5xx rate; version skew | Immutable assets; version pins; blue/green | RTO: 15 min; RPO: n/a | Impacts both; slow UI makes lag feel worse |
| Cross‑region failover (control plane) | Route flip ≤ 60 s; cold start ≤ 5 min | Split brain; partial cut | Disconnects; stuck tables; duplicates | Regional; can be global | Health by slice; error budget burn; async lag | Runbooks; health‑based GSLB; pre‑warmed pools | RTO: 5–15 min; RPO: seconds to zero | Tables need consistent clocks; slots tolerate brief read‑only |

Note: For end‑to‑end spans and service maps, see OpenTelemetry traces. Trace the user journey, not just pods.
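Once you have span durations, checking them against the P95 targets above is simple arithmetic. The sketch below uses the nearest-rank percentile and four of the targets listed earlier. The path names in `BUDGET_MS` and the function names are assumptions; your tracing backend will have its own labels and may compute percentiles a little differently.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


# P95 targets in milliseconds, taken from the targets above.
# The path names are hypothetical labels.
BUDGET_MS = {"rng_call": 5, "db_write": 15, "wallet_debit": 25, "fanout_push": 50}


def over_budget(spans_ms: dict) -> dict:
    """Return {path: observed P95} for each path that exceeds its target."""
    breaches = {}
    for path, samples in spans_ms.items():
        p95 = percentile(samples, 95)
        if p95 > BUDGET_MS[path]:
            breaches[path] = p95
    return breaches
```

With 20 samples, one slow write does not move the P95, but two do. That is why a steady trickle of slow writes matters more than a single spike.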

RNG, determinism, and compliance realities

RNG is your heart. In most stacks it is a PRNG with strong seeds. Entropy must be sound. See NIST SP 800-90B for guidance on sources and tests. Log seeds in a safe way so you can replay rounds for audits. Do not expose them. Keep a secure replay tool. Use HMAC to prove a round has not changed.
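A replay check with HMAC might look like the sketch below. Everything here is illustrative: `play_round` is a toy game, Python's `random.Random` is only a deterministic stand-in and is not a certified or cryptographically secure RNG, and key handling is left out. The `hmac`, `hashlib`, and `json` calls are from the standard library.

```python
import hashlib
import hmac
import json
import random


def play_round(seed: int, bet: int) -> dict:
    """Deterministic toy round: the same seed and bet always give the same outcome."""
    rng = random.Random(seed)  # stand-in only; not a certified or secure RNG
    stops = [rng.randrange(10) for _ in range(3)]
    win = bet * 10 if len(set(stops)) == 1 else 0  # toy rule: three of a kind pays
    return {"bet": bet, "stops": stops, "win": win}


def seal(record: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the round record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_replay(seed: int, bet: int, stored_tag: str, key: bytes) -> bool:
    """Re-run the round and compare tags in constant time."""
    return hmac.compare_digest(seal(play_round(seed, bet), key), stored_tag)
```

Store the tag from `seal` when the round settles. Later, `verify_replay` returns True only if the replayed round matches the stored record exactly. The canonical JSON encoding matters: without sorted keys and fixed separators, two equal records could produce different tags.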

Labs will check your methods. They will test bias and the way you store and show results. Read the GLI-19 standards so you match what they need. Keep a clean chain for each spin and each table action. Be able to show it years later. Treat logs as evidence. Use write‑once storage for key events.

Determinism also helps support. If a player says “My spin was wrong,” you can replay the exact steps with the same input and show the same output. That builds trust and cuts refunds.

State is where outages hide

Caches can lie. A 98% hit rate can hide hot keys. One slot theme can set off skew. A table with 10k watchers can crush a shard. Watch not just hit rate, but top keys, TTL churn, and write amplification. Use short, clear keys. Set sane TTLs. Avoid “thundering herds.”
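To see how a healthy hit rate can hide a hot key, count traffic per key next to the overall ratio. The sketch below assumes you can sample cache accesses as `(key, was_hit)` pairs; `key_report` and the key names in the usage note are made up.

```python
from collections import Counter


def key_report(accesses: list, top_n: int = 3) -> tuple:
    """accesses is a list of (key, was_hit) pairs.

    Returns (overall hit ratio, [(key, share of all traffic), ...])
    with the busiest keys first.
    """
    total = len(accesses)
    if total == 0:
        raise ValueError("no accesses sampled")
    hits = sum(1 for _, was_hit in accesses if was_hit)
    counts = Counter(key for key, _ in accesses)
    top = [(key, n / total) for key, n in counts.most_common(top_n)]
    return hits / total, top
```

For a sample with 90 hits on `slot:mega`, 8 hits on `slot:a`, and 2 misses on `slot:b`, the hit ratio is 0.98, yet one key carries 90% of the traffic. The ratio alone would never show that.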

Make each spin idempotent. That means the same call with the same id has the same effect, even if it runs twice. Use server‑made ids. Keep steps small and atomic. When a round touches the cache in several steps, group those steps in a transaction. The docs for Redis transactions explain common patterns.

For the source of truth, pick a store you can run under stress. Use a leader and followers with clear failover. Keep quorum sizes modest. Test partitions. Read the guide on PostgreSQL high availability if you run Postgres. It shows how to build safe failover and avoid split brain.

Platform choices you cannot kick down the road

Cross‑region is a hard call. Tables want strong order. That pushes you toward global clocks and quorum writes. But these choices add latency. For many teams, a single write region with fast read replicas and a hot standby is a good start. If you need global, study Google Spanner multi-region replication. It shows what global consensus costs and how to place replicas to keep latency in check.

Also pick your id space early. Use ULIDs or time‑sorted ids for better cache and DB scans. Decide how to shard tables. Many route by table id so that all actions land on one partition. This reduces cross‑shard hops.
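Both ideas fit in a short sketch. `partition_for` routes by table id with a stable hash, and `time_sorted_id` builds a simple time-ordered id. Note the assumptions: this id format is a simplified stand-in, not a real ULID encoding, and both function names are hypothetical.

```python
import hashlib
import secrets


def partition_for(table_id: str, partitions: int) -> int:
    """Stable routing: the same table id always maps to the same partition.

    Uses SHA-256 rather than the built-in hash(), which is salted per
    process for strings and so is not stable across restarts.
    """
    digest = hashlib.sha256(table_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def time_sorted_id(now_ms: int) -> str:
    """Millisecond timestamp as 12 fixed-width hex digits, then 64 random bits.

    Ids sort as strings by creation time as long as now_ms fits in
    48 bits. Not a real ULID; a simplified stand-in.
    """
    return f"{now_ms:012x}-{secrets.token_hex(8)}"
```

One caveat with plain modulo routing: changing the partition count remaps most tables. If you expect to reshard, plan for that before launch.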

Observability that sees players, not just pods

Watch what the player feels. Define SLOs for a spin (request to settle), for a bet (lock to confirm), and for a live action (tap to fan‑out to last seat). Track the four golden signals: latency, traffic, errors, and saturation. For time series and simple alerts, start with the Prometheus overview. Keep label cardinality low so your metrics store does not explode.

Use traces to tie all steps. Sample at a higher rate for money paths. Alert on user‑journey spans that miss SLO, not on CPU. Read the Google SRE principles on SLOs, error budgets, and alerting. Add synthetic seats to live tables that click and bet like humans. Rehearse incidents. Record replays to debug after the fact.
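Error budget math is worth seeing once. The sketch below assumes you count user journeys per window and mark each one good or bad, where bad means it errored or missed its latency SLO. The function name and the example numbers are illustrative.

```python
def error_budget_burn(slo_target: float, total: int, bad: int) -> float:
    """Fraction of the error budget used in a window; 1.0 means fully spent.

    slo_target is the share of journeys that must be good, e.g. 0.99.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0
    allowed_bad = (1.0 - slo_target) * total
    return bad / allowed_bad
```

With a 99% target and 10,000 spins, the budget is 100 bad spins. Fifty bad spins means half the budget is gone. A result above 1.0 is the signal to pause rollouts.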

Security and rules that shape your design

Wallet paths may fall under PCI. Even if you do not store card data, you may touch tokens and payment flows. Study the PCI DSS overview. Keep the card zone small. Segment the network. Use HSMs for keys. Rotate secrets. Log access to wallets and ledger writes.

Remote rules differ by market. Many ask for strict logging, fair RNG, and clear recovery plans. For the UK, read the UKGC Remote Technical Standards. Map each rule to a test. Tie it to a log field. Keep proofs handy for audits.

The five‑minute DR playbook (and what it really costs)

Disaster recovery is not a slide. It is a drill. Pick your top five failure modes. Run a tabletop and a real failover for each one. Time the steps. Log who did what. Count what broke. Decide if you want N+1 or N+2. Price both. Read the AWS Well-Architected resilience guide for trade‑offs.

Good DR keeps state safe first. Backups are not enough. You need restores that pass checks and replays that prove money is right. Keep one command to fail to a hot site. Keep one to fail back. Keep data change windows small so RPO is near zero. Train new hires on this in month one.
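A restore check that proves money is right can start as a ledger replay. The sketch below rebuilds balances from ledger entries and compares them with the balances in the restored store. The entry shape `(entry_id, player, delta)` with amounts in minor units is an assumption, as are the function names.

```python
from collections import defaultdict


def replay_balances(entries: list) -> dict:
    """Rebuild balances from (entry_id, player, delta) ledger entries.

    Amounts are integers in minor units. Each entry id is applied at
    most once, so a duplicated entry cannot double count.
    """
    seen = set()
    balances = defaultdict(int)
    for entry_id, player, delta in entries:
        if entry_id in seen:
            continue
        seen.add(entry_id)
        balances[player] += delta
    return dict(balances)


def reconcile(entries: list, restored: dict) -> dict:
    """Return {player: (expected, restored)} for every balance that differs."""
    expected = replay_balances(entries)
    diffs = {}
    for player in expected.keys() | restored.keys():
        want = expected.get(player, 0)
        got = restored.get(player, 0)
        if want != got:
            diffs[player] = (want, got)
    return diffs
```

An empty result means the restored balances match the ledger. Anything else is a list of accounts to fix before you open the tables again.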

Build vs buy: when to own the pipes

Some parts are fine to rent: CDN, log storage, managed DB, even a managed stream. But know your exit paths. Export formats, topic schemas, and ledger snapshots must be portable. For your core run‑time, learn the platform you use. If you run K8s, read the Kubernetes architectural patterns. Know where it fails and how to spread failure domains.

Real‑time websockets at scale sound easy. They are not. If you buy, test under chaos and churn. If you build, plan for rolling restarts, key rotation, and slow clients. In both cases, measure tail latencies and drop rates.

Monday‑morning changes you can ship

  • Add idempotency keys to all spin and bet calls. Reject duplicates with a clear code.
  • Set tight P95 budgets: ≤ 250 ms spin (server side), ≤ 350 ms dealer action commit (to last seat).
  • Pin table updates by table id to one partition. Add backpressure on slow seats.
  • Expose a “late settle” flag in the client UI so retries feel safe, not scary.
  • Track error budgets per game type. Pause feature rollouts when you burn them.
  • Add a chaos test that kills your cache leader at peak. Watch for hot key skew.
  • Put a “replay this round” button in your back office. Require HMAC checks.
  • Pre‑warm a small pool in your failover region. Drill failover for one table daily.

If you want to see how users judge stability and RTP in the wild, scan a few independent review hubs. Note what they praise and what they warn about. Then fix those gaps in your stack first.

FAQ

What is an acceptable P95 for a live blackjack action?

Good targets: ≤ 350 ms at P95 from player tap to commit and fan‑out to the last seat in the same region. Split that as ≤ 200 ms to commit and ≤ 150 ms to push. Keep P99 visible too.

Do I need global transactions for slots?

Most do not. Slots work well with a single write region, idempotent spin ids, and async settle if the wallet is slow. Keep a hot standby in another region and drill the cutover.

How do I test RNG determinism without leaking seeds?

Use a sealed replay tool. Feed it the same seed and round input. Compare the output and an HMAC of the result. Store seeds and HMACs in a vault with strict access control. Never expose seeds to client apps.

Common traps to avoid

  • Sticky sessions that trap users on a sick node.
  • Fan‑out that treats all seats as equal, so one slow seat stalls the table.
  • Cache that becomes a bank.
  • Streams with mixed keys that break order.
  • Failovers you only tried in slides.
  • Links that go stale in audits.

Closing thoughts

Reliability for games is not one knob. It is a set of small, sharp choices. Slots want fast, safe retries. Tables need order and smooth fan‑out. Wallets must be right every time. Trace what the player feels, not just what pods do. Drill failovers until they are boring. Keep proofs for audits. If you do these things, your 47‑second wobble will turn into a two‑second blip that few will notice.

Disclaimer: This article is technical guidance, not legal advice. Always check your design and logs with your regulator and your testing lab.

About the author

Written by an engineer who has built and run real‑money game platforms for 10+ years. Led PCI‑DSS projects, RNG audits, and cross‑region DR drills across EU and NA. Enjoys clean runbooks and short pager shifts.