Updated: 2026-06-12 • Author: Alex Morozov, Principal SRE (10+ years scaling regulated wagering)
Note: This is a technical guide. It is not legal advice.
It is derby night. Odds refresh fast. Traffic jumps 15x in seconds. Users tap “Place bet” and wait. Some bets hang. The risk engine lags. The cache is hot. A queue grows. Your pager screams. You have one job: do not melt.
Teams that stay up at peak share a pattern. They plan for the edge, the queue, the state, and the shed. They know what must be strong and strict, and what may be soft and late by a bit. They keep a calm loop: limit, absorb, commit, confirm.
Latency harm is real here. Odds move. Time is tight. A wrong price for two seconds can burn money. Fair play means no ghost wins, no double spend, no lost tickets. Use a clear design lens, like the AWS Well‑Architected Framework, but adapt it to fast writes and strict audit.
Spikes are not smooth. A goal alert. A promo push. A top fight starts. You get a “thundering herd.” Many calls are writes, not reads. That flips the usual playbook for web shops. The edge must hold the herd, and your core must not lock up.
Rules are strict and checks are deep. You log all changes. You prove who did what and when. You answer for data life and money flow. This is not a “nice to have.” It is part of the product, same as the odds feed.
Trace the path. Client → edge → API gateway → compute → cache/DB → risk → ledger → reply. Each hop costs time. Each extra round trip is pain. You win by cutting hops and by keeping hot paths warm. Google’s SRE book is a great base on SLIs and latency math.
Small tricks add up: keep warm pools. Reuse TCP. Use idempotency keys for all bet writes. Pre‑compute common odds lines. Keep risk features in a fast cache near compute. Never fan out in sync on the hot path.
Stop floods at the edge. Use a CDN, WAF, and smart rate limits. Shape bot loads. Block bad ranges. Let the edge serve static and “safe to cache” widgets. Read the playbook from Cloudflare on DDoS.
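A rate limit at the edge is often a token bucket. Here is a minimal sketch, assuming one bucket per client or per IP. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are illustrative. Managed WAFs and gateways ship their own limiters with their own settings.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`; refill at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so a normal burst passes
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                 # caller should answer 429 and move on
```

Capacity sets how big a burst you absorb. Rate sets what you sustain. Pick both from the SLO of the endpoint behind the limiter, not from a guess.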
Bet writes must be safe to retry. Make every request idempotent with a client key. Put a thin queue in front of the write core to smooth bursts. Use timeouts and backoff. Reject too many in flight. Return fast, then confirm when done.
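The thin queue and the fast reject can be sketched as below. This is a single-process sketch under stated assumptions: `make_admitter`, `backoff_delay`, the 202/429 response shapes, and the one-second retry hint are all hypothetical. A real service would put a durable queue here.

```python
import queue
import random
import uuid
from typing import Callable


def make_admitter(max_pending: int):
    """Return (admit, pending). `pending` is the thin queue the write core drains."""
    pending: "queue.Queue[dict]" = queue.Queue(maxsize=max_pending)

    def admit(bet: dict) -> dict:
        ticket_id = str(uuid.uuid4())
        try:
            # Never block the request thread: a full queue means shed now.
            pending.put_nowait({"ticket_id": ticket_id, "bet": bet})
        except queue.Full:
            return {"status": 429, "retry_after_s": 1}
        # Accepted for processing; the client confirms later via the ticket.
        return {"status": 202, "ticket_id": ticket_id}

    return admit, pending


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Client-side retry delay: capped exponential backoff with full jitter."""
    return rng() * min(cap_s, base_s * (2 ** attempt))
```

The queue bound is the knob. Set it from how long a user will wait, times how fast the write core drains. Jitter on the client side keeps retries from landing in one wave.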
Odds fan out to many users. Use a log stream with consumer groups. Keep backpressure. Put slow work off the hot path. See the Apache Kafka docs for core ideas that hold across clouds.
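One way to keep backpressure on an odds stream is conflation: when a consumer falls behind, keep only the latest price per market and drop the stale ones. The sketch below is an illustration of that idea, not something Kafka does for you. `ConflatingBuffer` and its counters are hypothetical names, and it assumes a skipped intermediate price is acceptable for display feeds.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class ConflatingBuffer:
    """Bounded buffer that keeps one pending update per market (latest wins)."""

    def __init__(self, max_markets: int) -> None:
        self._max = max_markets
        self._latest: "OrderedDict[str, Any]" = OrderedDict()
        self.conflated = 0   # stale updates overwritten
        self.shed = 0        # updates refused because the buffer was full

    def offer(self, market_id: str, update: Any) -> bool:
        if market_id in self._latest:
            # Overwrite in place; the market keeps its position in line.
            self._latest[market_id] = update
            self.conflated += 1
            return True
        if len(self._latest) >= self._max:
            self.shed += 1   # backpressure signal for the producer
            return False
        self._latest[market_id] = update
        return True

    def drain(self, limit: int) -> List[Tuple[str, Any]]:
        out: List[Tuple[str, Any]] = []
        while self._latest and len(out) < limit:
            out.append(self._latest.popitem(last=False))  # oldest market first
        return out
```

Do not use this shape for bet or ledger events. Those must arrive whole and in order. Watch `conflated` and `shed` as lag signals.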
Odds are read a lot. Put hot keys in Redis. Watch for hot spots. Split keys by sport or league. Tune TTLs. Decide on write‑through vs write‑behind based on risk. The Redis docs on caching patterns show simple shapes that work.
Money and tickets live here. Keep ACID. Use strict isolation for the ledger. Use CQRS: writes in a clean OLTP store, reads in views fed by events. Start with Postgres unless you have proof you must not. The PostgreSQL docs on transactions explain levels and trade‑offs.
Be strict on balance, ledger, and bet state. No drift. No guess. Be okay with a small lag on odds feeds, leaderboards, and some UX hints. Use sagas for cross‑service steps. Use an outbox to ship events from the same commit as the write. Dedup in sinks. Keep all writes idempotent end to end.
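The outbox and the dedup step can be sketched with SQLite standing in for the OLTP store. The table names, the `bet-placed:` event ID format, and the `relay` and `DedupSink` helpers are assumptions for illustration. The same shape applies to Postgres with a real broker behind `publish`.

```python
import json
import sqlite3


def init_schema(db: sqlite3.Connection) -> None:
    db.executescript("""
        CREATE TABLE bets (
            bet_id TEXT PRIMARY KEY,
            user_id TEXT NOT NULL,
            stake_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT UNIQUE NOT NULL,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_bet(db: sqlite3.Connection, bet_id: str, user_id: str, stake_cents: int) -> None:
    payload = json.dumps({"bet_id": bet_id, "user_id": user_id, "stake_cents": stake_cents})
    with db:  # one transaction: the bet row and its event commit together or not at all
        db.execute("INSERT INTO bets VALUES (?, ?, ?)", (bet_id, user_id, stake_cents))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (f"bet-placed:{bet_id}", payload))


def relay(db: sqlite3.Connection, publish) -> int:
    """Ship unsent events in order. A crash after publish but before the
    UPDATE causes a resend, so delivery is at-least-once."""
    rows = db.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE sent = 0 ORDER BY seq").fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


class DedupSink:
    """Consumer side: ignore an event ID that was already applied."""

    def __init__(self) -> None:
        self.seen = set()
        self.events = []

    def __call__(self, event_id: str, payload: dict) -> None:
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.events.append(payload)
```

The primary key on `bet_id` makes a repeated write fail as a whole, so no second event leaks out. The sink still dedups, because the relay can send twice.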
When you must choose CAP trade‑offs, be clear. If the user sees an old price for a short time, it is fine, but the placed bet must bind to the true price at commit. Show that rule in code and in logs.
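The commit rule can be made explicit in a small function. This is a sketch under assumptions: prices carry a version number, and the policy of accepting a move in the user's favour while rejecting a worse price is a hypothetical house rule, not something the text above fixes. All names are illustrative.

```python
import logging
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

log = logging.getLogger("bet.commit")


@dataclass(frozen=True)
class Price:
    version: int
    odds: Decimal


@dataclass(frozen=True)
class Decision:
    accepted: bool
    bound_odds: Optional[Decimal]
    reason: str


def bind_price(quoted: Price, current: Price, accept_better: bool = True) -> Decision:
    """Decide at commit time which price, if any, the bet binds to.

    `quoted` is what the user saw; `current` is the price of record read
    inside the commit transaction. An accepted bet always binds to `current`.
    """
    if quoted.version == current.version:
        decision = Decision(True, current.odds, "price unchanged")
    elif accept_better and current.odds >= quoted.odds:
        decision = Decision(True, current.odds, "price moved in user's favour")
    else:
        decision = Decision(False, None, "price changed; re-quote required")
    # Log both versions so an audit can replay the decision.
    log.info("bind_price quoted_v=%s current_v=%s accepted=%s reason=%s",
             quoted.version, current.version, decision.accepted, decision.reason)
    return decision
```

Read `current` inside the same transaction that writes the bet. If you read it before, the rule is only a hint.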
Set SLOs per slice: bet place API, bet settle, login, odds feed. Tie alerts to user pain, not host CPU. Read the SLO chapter in the SRE guide. Keep an error budget and use it to pick work: ship fast or harden now.
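The error budget is simple arithmetic. A minimal sketch, assuming a request-based SLO over one window; the function name and the sample numbers are illustrative.

```python
from typing import Tuple


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> Tuple[float, float]:
    """Return (allowed_failures, fraction_of_budget_consumed) for one SLO window.

    slo_target is a ratio such as 0.999 for "99.9% of bet placements succeed".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        consumed = 0.0 if failed_requests == 0 else float("inf")
    else:
        consumed = failed_requests / allowed_failures
    return allowed_failures, consumed


# Example: a 99.9% SLO over 2,000,000 bet placements allows about 2,000
# failures; 500 failures so far means roughly a quarter of the budget is spent.
allowed, consumed = error_budget(0.999, 2_000_000, 500)
```

When the consumed fraction nears 1.0, stop shipping risk and harden. When it stays low, you have room to ship.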
Plan for region loss. Run active‑active when the law allows. Use latency‑aware routing. Drill failover often. Chaos is a skill. Learn the core ideas at Principles of Chaos Engineering. Keep runbooks. Name an incident lead at start. Fix the system, not the person.
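| Workload | Peak risk | Pattern | AWS | Google Cloud | Azure | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Realtime odds distribution | Thundering herd, slow fans | Kafka/Pub‑Sub, consumer groups, backpressure | MSK, Lambda for transforms | Pub/Sub, Dataflow | Event Hubs, Functions | Keep hot partitions small; monitor lag |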
| Bet placement API | Double spend, retry storms | Idempotency keys, FIFO queue, short timeouts | SQS FIFO, API Gateway | Pub/Sub with ordering, API Gateway | Service Bus Sessions, Front Door | Persist idempotency window (e.g., 2–5 min) |
| Risk engine scoring | Cold starts, feature fetch lag | Warm pools, feature cache near CPU | Lambda + Provisioned Concurrency | Cloud Functions min instances | Azure Functions pre‑warm | Pin hot models; async enrich |
| Session store | Hot keys, lock fights | Sharded Redis, local token cache | ElastiCache | Memorystore | Azure Cache for Redis | Hash slots and fallback tokens |
| Ledger / transactions | Lost updates, drift | ACID DB, CQRS, strict isolation | Aurora PostgreSQL | Cloud SQL PostgreSQL | Azure Database for PostgreSQL | Keep write path small and simple |
| DDoS / edge | Bursts, bot swarms | WAF, rate limits, bot score | CloudFront + AWS WAF/Shield | Cloud CDN + Cloud Armor | Front Door + Azure WAF | Block by ASNs; observe false hits |
| Observability | Blind spots, high MTTR | OpenTelemetry, RED/USE, SLO boards | Amazon Managed Grafana/Prometheus | Cloud Monitoring + Managed Prometheus | Azure Monitor + Managed Grafana | Trace hot path first; sample with care |
| DR / Resilience | Region outage, split brain | Active‑active, traffic steering | Route 53, Global Accelerator | Cloud DNS, Cloud Load Balancing | Traffic Manager, Front Door | Write quorum and conflict rules |
| Data governance | PII leak, law gaps | KMS, tokenization, data maps | AWS KMS, Macie | Cloud KMS, DLP | Key Vault, Purview | Track residency per user |
| Cost control | Runaway scale, idle waste | SLO‑driven autoscale, budgets | Budgets, Cost Explorer | Cost Management, Recommender | Cost Management + Advisor | Link costs to SLO per slice |
Lock secrets down. Rotate often. Use a KMS and, if you can, HSM for keys. Map to a known control set like NIST SP 800‑53. Store audit logs in an append‑only form. Limit who can touch prod. Use short‑lived tokens. Test backup restores, not just backups.
Build auth and app safety in from day one. Align with the OWASP Top 10. Use strong user auth. Use OpenID Connect if you need SSO. Keep rate limits on login and bet endpoints. Scrub PII in logs. Encrypt in flight and at rest. Note: you handle money, so follow PCI DSS for the card flow, or isolate that flow to a vendor.
Do not only cut cost. Tie cost to SLO. Buy headroom where a miss hurts trust. Use scale rules from SLO, not from raw CPU. Use mixed tiers (on‑demand for hot paths; spot for batch). Track unit cost per settled bet. The FinOps Framework can guide the org parts.
Collect three signals: metrics, logs, and traces. Standardize on OpenTelemetry (OTel). Add span links for a bet across edge, API, risk, and ledger. Start with the hot path and the slow path. See the OpenTelemetry docs. For time series and SLO boards, Prometheus is a safe bet in any cloud.
Test game days. Rehearse a big night with synthetic load. Try dark launches and canaries. Flip features with flags. Make safe defaults. Drill a cache‑off day and a broker‑slow day. You will learn fast what breaks and what bends.
Trust comes from clear rules and clear talk. Show status. Explain payout delays in plain words. Keep a fair help path. In some regions, study the UK Gambling Commission Remote Technical Standards and their spirit. Then bake that into tests and logs.
Listen to users, not only to graphs. Independent review hubs can show real pain points that logs miss. For example, guides to bästa mobil casinon (best mobile casinos) collect reports on payout speed, mobile UX, and trust. Read such notes with care. They can help you pick fixes that boost real trust, not just metrics.
Deploy in small steps. Keep feature flags. Roll back fast. Keep config in git. Use IaC for all infra. Treat runbooks as code. For K8s shops, review the Kubernetes production checklist and apply only what you need for the hot path first.
How do I stop double charges on retry? Use an idempotency key from the client. Store the key and result for a short time. On repeat, return the same result.
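A minimal in-memory sketch of that answer follows. `IdempotencyStore`, `run_once`, and the 300-second window are assumptions for illustration. This version is single-process and not safe against two identical requests racing each other. A real service would lean on shared storage with a unique constraint or an atomic set-if-absent.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Remember the result of each client key for `window_s` seconds."""

    def __init__(self, window_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        self._results: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, result)

    def run_once(self, key: str, action: Callable[[], Any]) -> Any:
        now = self._clock()
        cached = self._results.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]          # repeat inside the window: same answer, no new charge
        result = action()             # first time (or window expired): do the work
        self._results[key] = (now + self._window_s, result)
        return result
```

The key must come from the client and must stay the same across retries of one bet. A new tap on "Place bet" gets a new key.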
What if Kafka lags at peak? Shed load at producers. Add partitions with care. Slow consumers should be in their own group. Do not block the hot path on stream work.
Do I need strict serializability for the ledger? Often yes. If you run a weaker level, you must prove safety for your bet model. Test with chaos and high write rates.
Which metrics matter most? SLO per slice. For bets: p95 latency, error rate, dedup hits, queue depth, and DB lock waits. For odds: publish lag and subscriber lag.
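For a quick check of p95 and error rate from raw samples, a nearest-rank sketch is enough. The helper names are illustrative. Monitoring systems often estimate percentiles from histograms instead, so their numbers can differ a little from this exact method.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is a whole number from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0
```

Compute both per slice and per window. A p95 across all endpoints hides the one slice that hurts.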
Alex Morozov is a Principal SRE and former Head of Platform in iGaming. He has led edge, data, and SRE teams through top sports nights across EU and APAC. He writes field‑first guides for engineers who carry the pager.