Cloud Infrastructure for High-Traffic Betting Platforms

Updated: 2026-06-12 • Author: Alex Morozov, Principal SRE (10+ years scaling regulated wagering)

Note: This is a technical guide. It is not legal advice.

One minute to kickoff, graphs go vertical

It is derby night. Odds refresh fast. Traffic jumps 15x in seconds. Users tap “Place bet” and wait. Some bets hang. The risk engine lags. The cache is hot. A queue grows. Your pager screams. You have one job: do not melt.

Teams that stay up at peak share a pattern. They plan for the edge, the queue, the state, and the shed. They know what must be strong and strict, and what may be soft and late by a bit. They keep a calm loop: limit, absorb, commit, confirm.

How betting is not “just e‑commerce”

Latency does real harm here. Odds move. Time is tight. A wrong price for two seconds can burn money. Fair play means no ghost wins, no double spend, no lost tickets. Use a clear design lens, like the AWS Well‑Architected Framework, but adapt it to fast writes and strict audit.

Spikes are not smooth. A goal alert. A promo push. A top fight starts. You get a “thundering herd.” Many calls are writes, not reads. That flips the usual playbook for web shops. The edge must hold the herd, and your core must not lock up.

Rules are strict and checks are deep. You log all changes. You prove who did what and when. You answer for data life and money flow. This is not a “nice to have.” It is part of the product, same as the odds feed.

Latency on a napkin: where the millis go

Trace the path. Client → edge → API gateway → compute → cache/DB → risk → ledger → reply. Each hop costs time. Each extra round trip is pain. You win by cutting hops and by keeping hot paths warm. Google’s SRE book is a great base on SLIs and latency math.
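The napkin math can be written down as a tiny budget check. This is a sketch only: the hop names follow the path above, but every number and the 100 ms target are made-up placeholders, not measurements.

```python
# Hypothetical per-hop latency budgets in milliseconds. Placeholders, not measurements.
HOP_BUDGET_MS = {
    "edge": 10,
    "api_gateway": 5,
    "compute": 15,
    "cache_db": 10,
    "risk": 25,
    "ledger": 20,
    "reply": 5,
}
TARGET_MS = 100  # assumed end-to-end target for "Place bet"


def summarize(budget: dict, target_ms: int) -> dict:
    """Add up the hops, report headroom against the target, and name the slowest hop."""
    total = sum(budget.values())
    slowest = max(budget, key=budget.get)
    return {"total_ms": total, "headroom_ms": target_ms - total, "slowest_hop": slowest}


summary = summarize(HOP_BUDGET_MS, TARGET_MS)
print(summary)
```

Per-hop percentiles do not add up exactly, so treat the total as a rough planning number. The useful part is the habit: name each hop, give it a budget, and see which one eats the most.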

Small tricks add up: keep warm pools. Reuse TCP. Use idempotency keys for all bet writes. Pre‑compute common odds lines. Keep risk features in a fast cache near compute. Never fan out in sync on the hot path.

The building blocks, from edge to data

Edge and perimeter

Stop floods at the edge. Use a CDN, WAF, and smart rate limits. Shape bot loads. Block bad ranges. Let the edge serve static and “safe to cache” widgets. Read the playbook from Cloudflare on DDoS.
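A token bucket is one common shape for such rate limits. The sketch below is a single-process illustration, not how any specific CDN or WAF does it. The class name, the burst size, and the refill rate are assumptions. The clock is passed in so the behavior is easy to test.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Allows bursts up to `capacity`, refills at `rate` tokens per second."""
    capacity: float
    rate: float
    tokens: float = 0.0
    last: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit: reject or challenge at the edge


# Hypothetical limit: bursts of 5 requests, refilled at 2 requests per second.
bucket = TokenBucket(capacity=5, rate=2, tokens=5)
burst = [bucket.allow(now=0.0) for _ in range(7)]  # 7 calls in the same instant
later = bucket.allow(now=1.0)                      # one second later
print(burst, later)
```

In a real edge you would keep one bucket per client key (IP, token, or device) in a shared store.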

API layer

Bet writes must be safe to retry. Make every request idempotent with a client key. Put a thin queue in front of the write core to smooth bursts. Use timeouts and backoff. Reject too many in flight. Return fast, then confirm when done.
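The thin queue and the "reject too many in flight" rule can be sketched with a bounded buffer. This is a minimal in-process stand-in for a real broker. `WriteQueue` and the depth of 3 are hypothetical. The point is that a full queue answers "no" at once instead of making the user wait.

```python
import queue
from typing import Optional


class WriteQueue:
    """Thin bounded buffer in front of the write core. Full means shed, not wait."""

    def __init__(self, max_depth: int) -> None:
        self._q: "queue.Queue[dict]" = queue.Queue(maxsize=max_depth)

    def submit(self, bet: dict) -> bool:
        """Return True if accepted. False means: answer 429/503 right away."""
        try:
            self._q.put_nowait(bet)
            return True
        except queue.Full:
            return False

    def take(self) -> Optional[dict]:
        """Called by the write core. Returns None when the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def depth(self) -> int:
        return self._q.qsize()


write_queue = WriteQueue(max_depth=3)  # hypothetical depth
accepted = [write_queue.submit({"bet_id": i}) for i in range(5)]
first = write_queue.take()             # the write core drains one bet
accepted_after_drain = write_queue.submit({"bet_id": 5})
print(accepted, first, accepted_after_drain)
```

Queue depth is also the number to alert on. If it sits near the cap, the write core is the bottleneck.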

Streaming for odds and events

Odds fan out to many users. Use a log stream with consumer groups. Keep backpressure. Put slow work off the hot path. See the Apache Kafka docs for core ideas that hold across clouds.

Cache

Odds are read a lot. Put hot keys in Redis. Watch for hot spots. Split keys by sport or league. Tune TTLs. Decide on write‑through vs write‑behind based on risk. The Redis docs on caching patterns show simple shapes that work.
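Here is a small write-through sketch with a TTL, to make the terms concrete. It uses plain dicts as stand-ins for Redis and the system of record, so it shows the shape only. The class name, the key format, and the 2-second TTL are assumptions.

```python
from typing import Dict, Optional, Tuple


class WriteThroughCache:
    """Write-through: every write hits the store first, then the cache."""

    def __init__(self, store: Dict[str, str], ttl_seconds: float) -> None:
        self.store = store  # stand-in for the system of record
        self.ttl = ttl_seconds
        self._cache: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def put(self, key: str, value: str, now: float) -> None:
        self.store[key] = value
        self._cache[key] = (value, now + self.ttl)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._cache.get(key)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]
        # Expired or absent: read the store and refill the cache.
        self.misses += 1
        value = self.store.get(key)
        if value is not None:
            self._cache[key] = (value, now + self.ttl)
        return value


cache = WriteThroughCache(store={}, ttl_seconds=2.0)
key = "odds:football:epl:match42"      # hypothetical key, split by sport and league
cache.put(key, "2.10", now=0.0)
a = cache.get(key, now=1.0)            # inside the TTL: hit
b = cache.get(key, now=3.0)            # past the TTL: miss, refill from store
c = cache.get(key, now=3.5)            # hit again
print(a, b, c, cache.hits, cache.misses)
```

Write-behind would flip the order: cache first, store later. That is faster but can lose a write, which is why the choice depends on risk.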

Transactional core

Money and tickets live here. Keep ACID. Use strict isolation for the ledger. Use CQRS: writes in a clean OLTP store, reads in views fed by events. Start with Postgres unless you have proof you must not. The PostgreSQL docs on transactions explain levels and trade‑offs.

Consistency: where you can relax, where you must not

Be strict on balance, ledger, and bet state. No drift. No guess. Be okay with a small lag on odds feeds, leaderboards, and some UX hints. Use sagas for cross‑service steps. Use an outbox to ship events from the same commit as the write. Dedup in sinks. Keep all writes idempotent end to end.
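The outbox idea fits in a few lines. The sketch below uses SQLite in memory as a stand-in for the real OLTP store, only so it runs anywhere. Table names, columns, and the `bet.placed` topic are assumptions. What matters is that the bet row and the event row share one commit.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the write DB
conn.executescript("""
CREATE TABLE bets (bet_id TEXT PRIMARY KEY, user_id TEXT, stake_cents INTEGER);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, payload TEXT, published INTEGER DEFAULT 0
);
""")


def place_bet(conn: sqlite3.Connection, bet_id: str, user_id: str, stake_cents: int) -> None:
    # One transaction: commits on success, rolls back both inserts on any error.
    with conn:
        conn.execute("INSERT INTO bets VALUES (?, ?, ?)", (bet_id, user_id, stake_cents))
        payload = json.dumps({"bet_id": bet_id, "user_id": user_id, "stake_cents": stake_cents})
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)", ("bet.placed", payload))


def relay(conn: sqlite3.Connection, publish) -> int:
    """Ship unpublished events in order. Delivery is at-least-once, so sinks must dedup."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


place_bet(conn, "b1", "u1", 500)
try:
    place_bet(conn, "b1", "u1", 500)  # duplicate bet_id: rejected, nothing written
except sqlite3.IntegrityError:
    pass

sent = []
shipped = relay(conn, lambda topic, event: sent.append((topic, event["bet_id"])))
shipped_again = relay(conn, lambda topic, event: sent.append((topic, event["bet_id"])))
print(sent, shipped, shipped_again)
```

If the relay crashes between publish and the update, the event goes out twice. That is the reason for "dedup in sinks".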

When you must choose CAP trade‑offs, be clear. If the user sees an old price for a short time, it is fine, but the placed bet must bind to the true price at commit. Show that rule in code and in logs.
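One way to show that rule in code is to check the price the user saw against the price at commit. This is a sketch under assumptions: the names are hypothetical, and rejecting on mismatch is only one policy (you could also re-offer the new price). Either way, the ticket records the true price.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict


@dataclass(frozen=True)
class BetRequest:
    market_id: str
    seen_price: Decimal  # the price the client displayed, possibly stale
    stake: Decimal


class PriceChanged(Exception):
    """Raised when the displayed price no longer matches the price at commit."""


def commit_bet(req: BetRequest, current_prices: Dict[str, Decimal]) -> dict:
    true_price = current_prices[req.market_id]
    if true_price != req.seen_price:
        # Assumed policy: reject and let the client confirm the new price.
        raise PriceChanged(f"{req.market_id}: saw {req.seen_price}, now {true_price}")
    # The ticket binds to the true price, never to the client's copy.
    return {"market_id": req.market_id, "price": true_price, "payout": req.stake * true_price}


prices = {"match42:home": Decimal("2.10")}
ticket = commit_bet(BetRequest("match42:home", Decimal("2.10"), Decimal("10")), prices)

prices["match42:home"] = Decimal("1.95")  # odds moved before the next commit
try:
    commit_bet(BetRequest("match42:home", Decimal("2.10"), Decimal("10")), prices)
    stale_rejected = False
except PriceChanged:
    stale_rejected = True
print(ticket, stale_rejected)
```

Log both prices on every reject. That is the audit trail for "what did the user see, and what did we bind".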

Reliability you can count on

Set SLOs per slice: bet place API, bet settle, login, odds feed. Tie alerts to user pain, not host CPU. Read the SLO chapter in the SRE guide. Keep an error budget and use it to pick work: ship fast or harden now.
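The error budget is plain arithmetic. Here is a sketch with assumed numbers: a 99.9% SLO on the bet place API, one million requests in the window, and 250 failures. The function name and the `harden_now` flag are illustrative.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Turn an SLO and a request count into allowed, used, and remaining failures."""
    allowed = round((1.0 - slo) * total_requests)  # failures the SLO tolerates
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_consumed": consumed,
        "harden_now": consumed >= 1.0,  # assumed policy: budget spent means stop shipping
    }


# Hypothetical window for the bet place API.
report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

A quarter of the budget is gone in this example, so the team can still ship. Past 100%, the same numbers say harden first.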

Plan for region loss. Run active‑active when the law allows. Use latency‑aware routing. Drill failover often. Chaos is a skill. Learn the core ideas at Principles of Chaos Engineering. Keep runbooks. Name an incident lead at start. Fix the system, not the person.

Workload → Risks → Patterns → Managed Options

Workload	Risks	Patterns	AWS	GCP	Azure	Notes
Realtime odds distribution	Thundering herd, slow fans	Kafka/Pub‑Sub, consumer groups, backpressure	MSK, Lambda for transforms	Pub/Sub, Dataflow	Event Hubs, Functions	Keep hot partitions small; monitor lag
Bet placement API	Double spend, retry storms	Idempotency keys, FIFO queue, short timeouts	SQS FIFO, API Gateway	Pub/Sub with ordering, API Gateway	Service Bus Sessions, Front Door	Persist idempotency window (e.g., 2–5 min)
Risk engine scoring	Cold starts, feature fetch lag	Warm pools, feature cache near CPU	Lambda + Provisioned Concurrency	Cloud Functions min instances	Azure Functions pre‑warm	Pin hot models; async enrich
Session store	Hot keys, lock fights	Sharded Redis, local token cache	ElastiCache	Memorystore	Azure Cache for Redis	Hash slots and fallback tokens
Ledger / transactions	Lost updates, drift	ACID DB, CQRS, strict isolation	Aurora PostgreSQL	Cloud SQL PostgreSQL	Azure Database for PostgreSQL	Keep write path small and simple
DDoS / edge	Bursts, bot swarms	WAF, rate limits, bot score	CloudFront + AWS WAF/Shield	Cloud CDN + Cloud Armor	Front Door + Azure WAF	Block by ASNs; observe false hits
Observability	Blind spots, high MTTR	OpenTelemetry, RED/USE, SLO boards	Amazon Managed Grafana/Prometheus	Cloud Monitoring + Managed Prometheus	Azure Monitor + Managed Grafana	Trace hot path first; sample with care
DR / Resilience	Region outage, split brain	Active‑active, traffic steering	Route 53, Global Accelerator	Cloud DNS, Cloud Load Balancing	Traffic Manager, Front Door	Write quorum and conflict rules
Data governance	PII leak, law gaps	KMS, tokenization, data maps	AWS KMS, Macie	Cloud KMS, DLP	Key Vault, Purview	Track residency per user
Cost control	Runaway scale, idle waste	SLO‑driven autoscale, budgets	Budgets, Cost Explorer	Cost Management, Recommender	Cost Management + Advisor	Link costs to SLO per slice

Security and compliance without breaking prod

Lock secrets down. Rotate often. Use a KMS and, if you can, HSM for keys. Map to a known control set like NIST SP 800‑53. Store audit logs in an append‑only form. Limit who can touch prod. Use short‑lived tokens. Test backup restores, not just backups.

Build auth and app safety in from day one. Align with the OWASP Top 10. Use strong user auth. Use OpenID Connect if you need SSO. Keep rate limits on login and bet endpoints. Scrub PII in logs. Encrypt in flight and at rest. Note: you handle money, so follow PCI DSS for the card flow, or isolate that flow to a vendor.

Performance economics: spend where it pays

Do not only cut cost. Tie cost to SLO. Buy headroom where a miss hurts trust. Use scale rules from SLO, not from raw CPU. Use mixed tiers (on‑demand for hot paths; spot for batch). Track unit cost per settled bet. The FinOps Framework can guide the org parts.

Operate with eyes open

Measure three things: metrics, logs, and traces. Standardize on OTel. Add span links for a bet across edge, API, risk, and ledger. Start with the hot path and the slow path. See OpenTelemetry docs. For time series and SLO boards, Prometheus is a safe bet in any cloud.

Test game days. Rehearse a big night with synthetic load. Try dark launches and canaries. Flip features with flags. Make safe defaults. Drill a cache‑off day and a broker‑slow day. You will learn fast what breaks and what bends.

User trust and the wider ecosystem

Trust comes from clear rules and clear talk. Show status. Explain payout delays in plain words. Keep a fair help path. If you serve regulated markets such as Great Britain, study the UK Gambling Commission Remote Technical Standards and their spirit. Then bake that into tests and logs.

Listen to users, not only to graphs. Independent review hubs can show real pain points that logs miss. For example, guides to bästa mobil casinon (best mobile casinos) collect reports on payout speed, mobile UX, and trust. Read such notes with care. They can help you pick fixes that boost real trust, not just metrics.

From concept to a working rollout

Days 0–30

Define SLOs for bet place, settle, login, and odds feed.
Make all bet writes idempotent end to end.
Add a minimal queue in front of the write core.
Design your event schema. Add an outbox to the write DB.
Trace the hot path with OTel. Add span IDs to logs.
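The last item, IDs in logs, can be sketched with the standard library alone. This is not the OpenTelemetry SDK. It only shows the idea of carrying one ID per request into every log line. The logger name, the field name `trace_id`, and the handler function are assumptions.

```python
import contextvars
import io
import logging

# Holds the ID of the request being handled. "-" means no request in scope.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

log = logging.getLogger("bet-api")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False


def handle_bet(trace_id: str) -> None:
    token = trace_id_var.set(trace_id)
    try:
        log.info("bet accepted")
        log.info("ledger committed")
    finally:
        trace_id_var.reset(token)  # never leak an ID into the next request


handle_bet("abc123")
log.info("idle")
lines = stream.getvalue().splitlines()
print(lines)
```

With a real tracer, the ID would come from the active span instead of a function argument.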

Days 30–60

Edge hardening: WAF rules, rate limits, bot scores.
Cache plan: hot keys map, TTLs, write policy.
Streaming plan: topic map, partitions, consumer groups.
DB plan: strict isolation for ledger, read views via events.
Basic chaos drills: broker lag, cache miss storm.

Days 60–90

Active‑active or active‑passive by region with clear failover steps.
SLO dashboards, error budget policy live.
Cost and capacity reviews tied to SLO and unit cost per settled bet.
Run a full game day before a major event.
Post‑incident KPIs: time to detect, time to mitigate, user impact minutes.

Anti‑patterns to avoid

One big RDBMS for all reads and writes.
No outbox, events fired “on success” only.
No rate limits at the edge.
No warm capacity for peak minutes.
No load shedding when a backend is slow.
Zero idempotency for bet writes.
All odds fetch and risk enrich on the sync path.
Logs without trace IDs.

Ops craft: small practices that pay off

Deploy in small steps. Keep feature flags. Roll back fast. Keep config in git. Use IaC for all infra. Treat runbooks as code. For K8s shops, review the Kubernetes production checklist and apply only what you need for the hot path first.

If a cup final is tomorrow, check these in 30 minutes

Edge limits in place; WAF rules tested on a small canary.
API timeouts sane; retries with jitter; total cap per client.
Idempotency keys stored with a 5‑minute TTL; dedup tested.
Cache hit rate and hot key list on a big screen.
Queue depth alerts; broker lag alerts; clear runbook for both.
DB write IOPS headroom; lock wait time chart live.
Risk features pre‑warmed; model cold start removed for top sports.
Incident roles named; status page ready; comms template at hand.
DR switch test done this week; health checks by region look good.
On‑call rested; handoff notes clear.
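The "retries with jitter; total cap per client" item from the list above can be sketched like this. It uses the full-jitter variant of exponential backoff, which is one option among several. The base delay, the cap, and the budget size are assumptions, and the budget reset per time window is left out.

```python
import random
from typing import List


def backoff_delays(attempts: int, base_s: float, cap_s: float, rng: random.Random) -> List[float]:
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt)) for attempt in range(attempts)]


class RetryBudget:
    """Total retry cap per client, so one client cannot feed a retry storm."""

    def __init__(self, max_retries: int) -> None:
        self.remaining = max_retries

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False  # out of budget: fail fast instead of retrying
        self.remaining -= 1
        return True


# Hypothetical settings: 100 ms base, 1 s cap, 5 attempts, 2 retries per client.
delays = backoff_delays(attempts=5, base_s=0.1, cap_s=1.0, rng=random.Random(7))
ceilings = [0.1, 0.2, 0.4, 0.8, 1.0]
budget = RetryBudget(max_retries=2)
spends = [budget.try_spend() for _ in range(3)]
print([round(d, 3) for d in delays], spends)
```

Jitter spreads retries out so clients do not all come back in the same instant. The budget stops the ones that keep coming back anyway.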

FAQ (quick hits)

How do I stop double charges on retry? Use an idempotency key from the client. Store the key and result for a short time. On repeat, return the same result.
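A minimal sketch of that answer, in one process and with an injected clock. `IdempotencyStore` and `place` are hypothetical names, and the 300-second TTL is an assumption. A real system needs an atomic check-and-set in a shared store, such as a unique constraint in the database.

```python
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers the result for each client key for `ttl_seconds`."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._seen: Dict[str, Tuple[float, dict]] = {}  # key -> (stored_at, result)

    def run_once(self, key: str, now: float, action: Callable[[], dict]) -> Tuple[dict, bool]:
        """Return (result, replayed). `action` runs only on the first sighting of `key`."""
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1], True  # repeat: same result, no second charge
        result = action()
        self._seen[key] = (now, result)
        return result, False


placed = []  # stand-in for the ledger


def place() -> dict:
    placed.append("bet")
    return {"ticket_id": len(placed)}


store = IdempotencyStore(ttl_seconds=300)
first = store.run_once("client-key-1", now=0.0, action=place)
retry = store.run_once("client-key-1", now=1.0, action=place)  # client retried
print(first, retry, len(placed))
```

Count the replays. A rising dedup hit rate is an early sign of timeouts upstream.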

What if Kafka lags at peak? Shed load at producers. Add partitions with care. Slow consumers should be in their own group. Do not block the hot path on stream work.

Do I need strict serializable for the ledger? Often yes. If not, you must prove safety for your bet model. Test with chaos and high write rates.

Which metrics matter most? SLO per slice. For bets: p95 latency, error rate, dedup hits, queue depth, and DB lock waits. For odds: publish lag and subscriber lag.
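For reference, p95 by the nearest-rank method is short enough to write by hand. This is a sketch for small in-memory samples. The function name and the sample values are assumptions, and a metrics backend will usually estimate percentiles from histograms instead.

```python
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # hypothetical samples: 1 ms .. 100 ms
p95 = percentile(latencies_ms, 95)
p50_small = percentile([120, 80, 100, 90], 50)
print(p95, p50_small)
```

Averages hide the tail. The users who feel a slow bet live in the top 5%, so that is the number to put on the board.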

References worth your time

AWS Well‑Architected Framework
Cloudflare on DDoS
Google SRE book
Apache Kafka documentation
Redis docs
PostgreSQL docs
Kubernetes production checklist
SRE: Service Level Objectives
Principles of Chaos Engineering
NIST SP 800‑53
OWASP Top 10
PCI DSS
OpenTelemetry docs
Prometheus overview
UKGC Remote Technical Standards

Quick checklist before you ship

Edge: WAF + smart rate limits live and tested.
API: idempotent writes, strict timeouts, backoff with jitter.
Queue: depth alerts, dead letter policies, runbook for drain.
Cache: hot key map, TTLs by type, write policy chosen and tested.
DB: ACID for ledger, CQRS for reads, lock wait alerts.
Streams: topic map set, consumer groups clear, lag SLOs.
Observability: OTel end to end; SLO dashboards; trace IDs in logs.
Security: keys in KMS; least privilege; audit log immutable.
DR: region failover drill done; traffic steering tested.
Cost: unit cost per settled bet tracked; scale rules tied to SLO.

About the author

Alex Morozov is a Principal SRE and former Head of Platform in iGaming. He has led edge, data, and SRE teams through top sports nights across EU and APAC. He writes field‑first guides for engineers who carry the pager.