It was 19:03 on a Friday. Spins looked fine on the dashboard. P95 sat at 210 ms. But one live blackjack table froze for 47 seconds. Eight players hit “Stand” and saw nothing. Chat went silent. The dealer kept talking, but bets did not lock. After the wobble, two re-buys failed. Support lit up.
We had CPU to spare and green pods. Yet the table stream fell behind. A slow quorum write and a backoff loop in the fan‑out path lined up at the worst time. Slots hid the pain because each spin was a short, self‑contained round. The table could not hide. Money and trust were at risk. The lesson was simple: not all games fail the same way.
“Availability” is not one number. A slot spin can retry if the server does not confirm in 250 ms. A live table cannot reorder steps. A dealer action must lock in sequence and show up to every seat in near real time. A wallet must be right even if the UI is late. A regulator may ask you to prove each move in a dispute, months later.
Players also notice more than uptime. They sense jitter, rollbacks, and odd round ends. Many compare studios, RTPs, and bugs before they play. They read reviews that score stability and fairness. If you build or run games, assume users do their homework. One place to see how people judge slot engines and vendors: Review online slots software. Learn what they value. Then bake that into your design and your SLOs.
Most stacks share a core set: session and auth, wallet and ledger, RNG, game logic, state cache and DB, message bus, real‑time fan‑out, storage for logs and replays, and an observability suite. On the edge you steer traffic across regions, often with DNS or Anycast. You may use sticky sessions for WebSockets, but those can trap users on a bad node. Latency budgets must account for TLS, network hop, queue wait, and storage. Read and write paths are not the same. Learn them both end to end. See how Anycast routing shifts users to the closest healthy site.
Keep one “source of truth” for money. Cache is a helper, not a bank. For slots, aim for small, stateless actions with clear ids you can retry. For tables, plan for strict order and shared state. That means a stronger store, careful locks, and fan‑out that does not stall when one client is slow.
Slots feel bursty. A spin is short. The server can accept a spin id, pull an RNG value, compute the win, write a result, and send the outcome. If the client drops, the server can return the same result on resume. Good idempotency keys make this safe. You can even queue a spin if the ledger is slow, then settle when the wallet is free.
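The spin flow above can be sketched in a few lines. This is a minimal sketch, not a production design: `SpinService`, `SpinResult`, the toy paytable, and the in-memory dict are all hypothetical, and `random.Random` only stands in for a certified RNG.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinResult:
    spin_id: str
    reel_stop: int
    win: int


class SpinService:
    """Toy spin handler; the dict stands in for a durable result store."""

    def __init__(self, rng: random.Random) -> None:
        self._rng = rng  # assumption: stand-in for the real, certified RNG
        self._results: dict[str, SpinResult] = {}

    def spin(self, spin_id: str, bet: int) -> SpinResult:
        # A repeated spin_id returns the stored outcome: no new draw, no new win.
        stored = self._results.get(spin_id)
        if stored is not None:
            return stored
        reel_stop = self._rng.randrange(10)
        win = bet * 5 if reel_stop == 7 else 0  # hypothetical paytable
        result = SpinResult(spin_id, reel_stop, win)
        self._results[spin_id] = result
        return result
```

A client that drops and resumes with the same `spin_id` gets the first outcome back. In a real stack the lookup and the write would need to be one atomic step in the store.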
Tables are the opposite. They are shared. All seats must see the same thing in the same order. Bets must close before the dealer acts. The server must fan out updates to many clients at once. Jitter shows up as “lag” to all players. One slow seat must not block the table, but that seat must also never see skips or jumps. You need replay buffers and backpressure controls.
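One way to keep a slow seat from stalling the table is a bounded queue per seat. The sketch below assumes that a seat which overflows its queue is flagged for a resync from a snapshot or replay buffer, so it never renders a gap. `SeatQueue`, `Table`, and `needs_resync` are hypothetical names.

```python
from collections import deque


class SeatQueue:
    """Bounded per-seat outbox; a slow seat only fills its own queue."""

    def __init__(self, max_frames: int) -> None:
        self.frames: deque = deque(maxlen=max_frames)
        self.needs_resync = False

    def push(self, seq: int, frame: str) -> None:
        if len(self.frames) == self.frames.maxlen:
            # The append below evicts the oldest frame, so flag the gap.
            self.needs_resync = True
        self.frames.append((seq, frame))


class Table:
    def __init__(self, seats: list, max_frames: int = 3) -> None:
        self.queues = {seat: SeatQueue(max_frames) for seat in seats}
        self.seq = 0

    def broadcast(self, frame: str) -> None:
        self.seq += 1
        for queue in self.queues.values():
            queue.push(self.seq, frame)  # never waits on a slow seat

    def drain(self, seat: str) -> list:
        """Hand the seat everything queued for it and empty the queue."""
        queue = self.queues[seat]
        out = list(queue.frames)
        queue.frames.clear()
        return out
```

The sender loop does constant work per seat, so one stuck client costs memory up to `max_frames`, not table time.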
State order is key. You will dedupe messages. You may use sequence numbers. Some teams use streams for this. If you go that way, study exactly-once semantics. Know how it behaves on failover. Tables also carry chat, tips, and side bets. These add more fan‑out and raise the chance of head‑of‑line blocking.
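Sequence numbers and dedupe fit in a small consumer. This is a sketch under simple assumptions: sequences start at 1 per table, and `OrderedApplier` is a hypothetical name. It does not give exactly-once delivery; it only makes redelivery and reordering harmless on the consumer side.

```python
class OrderedApplier:
    """Applies table events strictly in sequence order, ignoring duplicates."""

    def __init__(self) -> None:
        self.next_seq = 1
        self.pending: dict = {}   # out-of-order events waiting for a gap to fill
        self.applied: list = []   # events applied so far, in order

    def receive(self, seq: int, event: str) -> None:
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate delivery: already applied or already buffered
        self.pending[seq] = event
        # Apply every event that is now contiguous with what was applied.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

A real consumer would also bound `pending` and ask for a replay when a gap stays open too long.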
In short: slots reward simple, idempotent calls and fast writes. Tables demand strict order, low jitter, and good fan‑out. Treat them as different SLOs and even different runbooks.
Set clear budgets per path. If you do not, you will chase noise. Put a number on round‑trip, on queue wait, on DB write, and on fan‑out. Track them as SLOs. Here are sane starting points from real teams. Tune them for your stack and regions.
Targets below assume a healthy network. They are P95s. Tune P99s once you get P95s stable.
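| Component | Latency target (P95) | Typical failure modes | Player impact | Blast radius | Signals to watch | Mitigations | RTO / RPO | Slots vs. tables |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNG microservice | ≤ 5 ms per call in‑region | Entropy source stall; bad seed; slow pool | Spin lag; trust risk if bias found | All games using this RNG | Trace span for RNG; health of entropy; chi‑square smoke tests | Local cache of draws; dual RNG path; circuit breaker | RTO: 5 min; RPO: zero (no lost draws) | Slots: small bursts; Tables: less frequent but must be in order |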
| Game state store (cache + DB) | Cache hit ≤ 1 ms; DB write ≤ 15 ms | Hot key; eviction; leader failover | Spin retry; table freeze or rollback | Game family or all tables on shard | Cache hit ratio; DB commit time; failover count | Sharding; hot key split; fast failover runbook | RTO: 10–15 min; RPO: seconds with sync replicas | Tables need quorum writes; slots accept async ack with idempotency |
| Messaging bus / stream | Produce ≤ 5 ms; consume ≤ 10 ms | Partition skew; consumer lag; rebalances | Late updates; out‑of‑order state | Topic‑wide; often cross‑service | Lag metrics; rebalancing rate; duplicate rate | Key by table; consumer pinning; backpressure | RTO: 15 min; RPO: zero with durable log | Tables rely on order; slots can ignore late bus events |
| Wallet / ledger | Credit / debit ≤ 25 ms | Lock contention; double spend guard | Balance wrong or slow; disputes | All players touching wallet | Lock wait time; conflict rate; reconciliation diffs | Short tx; idempotent ops; async notifications | RTO: 15–30 min; RPO: zero (authoritative) | Slots can queue settle; tables must block until funds lock |
| Session / auth | Login ≤ 120 ms; token check ≤ 5 ms | Token drift; cache miss storm | Kicks; ghost seats; failed rejoin | Large; all users | 401/403 rate; token TTL errors | Edge cache; staged rollouts; key rotation plan | RTO: 10 min; RPO: n/a | Both suffer; tables need smooth rejoin to keep order |
| Real‑time fan‑out (WebSocket / SFU) | Push ≤ 50 ms to last seat | Slow clients; head‑of‑line block | Lag; missed tells; late closes | Per table; can stack up fast | Per‑seat send time; drop rate; buffer depth | Per‑client queues; drop old frames; adaptive bitrate | RTO: 5–10 min; RPO: n/a | Critical for tables; slots care less |
| Content / CDN | Asset fetch ≤ 50 ms edge hit | Stale assets; purge lag | UI bugs; mismatch with server | Wide; all clients | Cache hit; 4xx/5xx rate; version skew | Immutable assets; version pins; blue/green | RTO: 15 min; RPO: n/a | Impacts both; slow UI makes lag feel worse |
| Cross‑region failover (control plane) | Route flip ≤ 60 s; cold start ≤ 5 min | Split brain; partial cut | Disconnects; stuck tables; duplicates | Regional; can be global | Health by slice; error budget burn; async lag | Runbooks; health‑based GSLB; pre‑warmed pools | RTO: 5–15 min; RPO: seconds to zero | Tables need consistent clocks; slots tolerate brief read‑only |
Note: For end‑to‑end spans and service maps, see OpenTelemetry traces. Trace the user journey, not just pods.
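To turn the table into a check, compare an observed P95 per path against its target. The sketch below copies four targets from the table. The path names and the nearest-rank percentile method are assumptions; your metrics store may compute percentiles differently.

```python
# P95 targets in milliseconds, copied from the table above.
BUDGET_MS = {"rng_call": 5, "db_write": 15, "wallet_tx": 25, "fanout_push": 50}


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def over_budget(samples_by_path: dict) -> dict:
    """Return {path: observed P95} for every path that misses its target."""
    misses = {}
    for path, samples in samples_by_path.items():
        if not samples:
            continue  # no data is a separate alert, not a latency miss
        observed = p95(samples)
        if observed > BUDGET_MS[path]:
            misses[path] = observed
    return misses
```

With 18 DB writes at 10 ms and 2 at 40 ms, the mean looks fine but the P95 is 40 ms, which misses the 15 ms target.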
RNG is your heart. In most stacks it is a PRNG with strong seeds. Entropy must be sound. See NIST SP 800-90B for guidance on sources and tests. Log seeds in a safe way so you can replay rounds for audits. Do not expose them. Keep a secure replay tool. Use HMAC to prove a round has not changed.
Labs will check your methods. They will test bias and the way you store and show results. Read the GLI-19 standards so you match what they need. Keep a clean chain for each spin and each table action. Be able to show it years later. Treat logs as evidence. Use write‑once storage for key events.
Determinism also helps support. If a player says “My spin was wrong,” you can replay the exact steps with the same input and show the same output. That builds trust and cuts refunds.
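Replay plus an HMAC can be sketched with the standard library. Assumptions are loud here: `play_round` is a toy game, `random.Random` stands in for the certified RNG and is not fit for real money, and the key would live in a vault or HSM, not in code.

```python
import hashlib
import hmac
import json
import random


def play_round(seed: int, bet: int) -> dict:
    """Toy deterministic round: same seed and bet always give the same result."""
    rng = random.Random(seed)  # assumption: stand-in for the certified RNG
    reels = [rng.randrange(10) for _ in range(3)]
    win = bet * 10 if len(set(reels)) == 1 else 0  # hypothetical paytable
    return {"bet": bet, "reels": reels, "win": win}


def seal(result: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the result."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_replay(seed: int, bet: int, stored_tag: str, key: bytes) -> bool:
    """Re-run the round from its logged inputs and compare tags in constant time."""
    return hmac.compare_digest(seal(play_round(seed, bet), key), stored_tag)
```

The tag is stored with the round at play time. In a dispute, support replays from the logged seed and input, and a matching tag shows the stored result was not altered.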
Caches can lie. A 98% hit rate can hide hot keys. One slot theme can set off skew. A table with 10k watchers can crush a shard. Watch not just hit rate, but top keys, TTL churn, and write amplification. Use short, clear keys. Set sane TTLs. Avoid “thundering herds.”
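Two of these checks are easy to sketch: the share of reads that go to the top keys, and TTL jitter so that keys written together do not expire together. Function names and the sample key names are hypothetical.

```python
import random
from collections import Counter


def top_key_share(accesses: list, n: int = 3) -> list:
    """Return the n most-read keys with their share of all reads."""
    counts = Counter(accesses)
    total = len(accesses)
    return [(key, count / total) for key, count in counts.most_common(n)]


def jittered_ttl(base_s: float, spread: float, rng: random.Random) -> float:
    """Vary a TTL by +/- spread (a fraction of base_s) to stagger expiries."""
    return base_s * (1.0 + rng.uniform(-spread, spread))
```

If one key takes 80% of reads, a high overall hit rate says little about the shard that owns that key.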
Make each spin idempotent. That means the same call with the same id has the same effect and returns the same result, even if it runs twice. Use server‑made ids. Keep small, atomic steps. When one step touches several cache keys, group the writes in a transaction. The docs for Redis transactions explain common patterns.
For the source of truth, pick a store you can run under stress. Use a leader and followers with clear failover. Keep quorum sizes modest. Test partitions. Read the guide on PostgreSQL high availability if you run Postgres. It shows how to build safe failover and avoid split brain.
Cross‑region is a hard call. Tables want strong order. That pushes you toward global clocks and quorum writes. But these choices add latency. For many teams, a single write region with fast read replicas and a hot standby is a good start. If you need global, study Google Spanner multi-region replication. It shows what global consensus costs and how to place replicas to keep latency in check.
Also pick your id space early. Use ULIDs or time‑sorted ids for better cache and DB scans. Decide how to shard tables. Many route by table id so that all actions land on one partition. This reduces cross‑shard hops.
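A time‑sorted id and a stable shard choice can be sketched as follows. The id follows the ULID layout (48‑bit millisecond timestamp, 80 random bits, Crockford base32), but this sketch does not guarantee ordering for ids minted within the same millisecond. `shard_for` and its hash choice are assumptions, not a recommendation.

```python
import hashlib
import os
import time
from typing import Optional

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_time_sorted_id(now_ms: Optional[int] = None) -> str:
    """ULID-style id: timestamp in the high bits, so ids sort by creation time."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 base32 characters cover the 128-bit value
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))


def shard_for(table_id: str, shard_count: int) -> int:
    """Stable shard choice: every action for one table lands on one partition."""
    digest = hashlib.sha256(table_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Plain modulo routing reshuffles most tables when `shard_count` changes, so resharding needs its own plan.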
Watch what the player feels. Define SLOs for a spin (request to settle), for a bet (lock to confirm), and for a live action (tap to fan‑out to last seat). Track four signals: latency, traffic, errors, and saturation. For time series and simple alerts, start with the Prometheus overview. Keep label cardinality low so your store does not explode.
Use traces to tie all steps. Sample at a higher rate for money paths. Alert on user‑journey spans that miss SLO, not on CPU. Read the Google SRE principles on SLOs, error budgets, and alerting. Add synthetic seats to live tables that click and bet like humans. Rehearse incidents. Record replays to debug after the fact.
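Alerting on user‑journey misses is often expressed as an error budget burn rate: the observed miss rate divided by the miss rate the SLO allows. A minimal sketch, where a "bad" journey is one that broke its latency SLO and the paging threshold is a hypothetical parameter you would tune:

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed miss rate divided by the miss rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the window; handle as a separate signal
    allowed = 1.0 - slo_target
    return (bad / total) / allowed


def should_page(bad: int, total: int, slo_target: float, threshold: float) -> bool:
    """Page when the budget burns faster than the chosen threshold."""
    return burn_rate(bad, total, slo_target) >= threshold
```

With a 99% target, 20 missed journeys out of 1,000 is a burn rate of about 2: the budget is being spent twice as fast as planned.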
Wallet paths may fall under PCI. Even if you do not store card data, you may touch tokens and payment flows. Study the PCI DSS overview. Keep the card zone small. Segment the network. Use HSMs for keys. Rotate secrets. Log access to wallets and ledger writes.
Remote rules differ by market. Many ask for strict logging, fair RNG, and clear recovery plans. For the UK, read the UKGC Remote Technical Standards. Map each rule to a test. Tie it to a log field. Keep proofs handy for audits.
Disaster recovery is not a slide. It is a drill. Pick your top five failure modes. Run a tabletop and a real failover for each one. Time the steps. Log who did what. Count what broke. Decide if you want N+1 or N+2. Price both. Read the AWS Well-Architected resilience guide for trade‑offs.
Good DR keeps state safe first. Backups are not enough. You need restores that pass checks and replays that prove money is right. Keep one command to fail to a hot site. Keep one to fail back. Keep data change windows small so RPO is near zero. Train new hires on this in month one.
Some parts are fine to rent: CDN, log storage, managed DB, even a managed stream. But know your exit paths. Export formats, topic schemas, and ledger snapshots must be portable. For your core runtime, learn the platform you use. If you run K8s, read the Kubernetes architectural patterns. Know where it fails and how to spread failure domains.
Real‑time WebSockets at scale sound easy. They are not. If you buy, test under chaos and churn. If you build, plan for rolling restarts, key rotation, and slow clients. In both cases, measure tail latencies and drop rates.
If you want to see how users judge stability and RTP in the wild, scan a few independent review hubs. Note what they praise and what they warn about. Then fix those gaps in your stack first.
Good targets for a live table action: ≤ 350 ms from player tap to commit and fan‑out to the last seat in the same region. Aim for ≤ 200 ms to commit, ≤ 150 ms to push. Keep P99 visible too.
Most slot games do not need multi‑region writes. Slots work well with a single write region, idempotent spin ids, and async settle if the wallet is slow. Keep a hot standby in another region and drill the cutover.
To verify a disputed round, use a sealed replay tool. Feed it the same seed and round input. Compare the output and an HMAC of the result. Store seeds and HMACs in a vault with strict access control. Never expose seeds to client apps.
Common pitfalls: Sticky sessions that trap users on a sick node. Fan‑out that treats all seats as equal, so one slow seat stalls the table. Cache that becomes a bank. Streams with mixed keys that break order. Failovers you only tried in slides. Links that go stale in audits.
Reliability for games is not one knob. It is a set of small, sharp choices. Slots want fast, safe retries. Tables need order and smooth fan‑out. Wallets must be right every time. Trace what the player feels, not just what pods do. Drill failovers until they are boring. Keep proofs for audits. If you do these things, your 47‑second wobble will turn into a two‑second blip that few will notice.
Disclaimer: This article is technical guidance, not legal advice. Always check your design and logs with your regulator and your testing lab.
Written by an engineer who has built and run real‑money game platforms for 10+ years. Led PCI‑DSS projects, RNG audits, and cross‑region DR drills across EU and NA. Enjoys clean runbooks and short pager shifts.