Ticketmaster

Introduction

Designing a system like Ticketmaster is deceptively challenging. Selling a ticket sounds simple — a user picks a seat, pays, and receives a confirmation. But Ticketmaster operates at the intersection of three engineering extremes that make it one of the most demanding system design problems:

Extreme Traffic Spikes — When Taylor Swift tickets go on sale, traffic can spike 100× over baseline within seconds. Unlike social media feeds where a slow page is annoying, a slow ticketing system means lost revenue and angry customers. The system must absorb millions of concurrent users hitting the same inventory simultaneously.
Zero Tolerance for Overselling — If two users buy the same seat, one gets turned away at the venue. This isn't just a bad experience — it's a legal and financial liability. Every seat reservation must be exactly-once: no double bookings, no phantom tickets, no race conditions.
Time-Pressured Transactions — Users expect to complete a purchase within seconds. But the flow involves inventory checks, temporary holds, payment processing with external providers, and confirmation — a multi-step distributed transaction with external dependencies and failure modes at every step.

The core tension is between consistency (no overselling) and availability (serving millions of concurrent users). Most distributed systems can relax one or the other. Ticketmaster can't — it needs strong consistency on seat inventory while handling massive, bursty read traffic.

The system manages ~10 million daily active users, with peaks of 100,000 concurrent users during hot on-sale events. Of these, roughly 20,000 are concurrently attempting to book, while the remaining 80,000 are browsing. The workload is heavily read-centric; the estimates below assume a 5:1 read/write ratio.

Ticketmaster system overview showing event browsing, seat selection, reservation, and booking flow

Functional Requirements

We extract verbs from the problem statement to identify core operations:

"searches for events" → READ operation (Event Discovery)
"views available seats" → READ operation (Seat Availability)
"reserves a seat" → WRITE operation (Temporary Hold)
"books the seat" via payment → WRITE operation (Purchase Confirmation)
"joins a waitlist" if the seat is taken → WRITE operation (Waitlist)

Each verb maps to a functional requirement. The requirements form a pipeline: search → view → reserve → pay → confirm.

Event Discovery — Users search for events by name, artist, venue, location, or date. The system returns matching events with basic details (name, date, venue, price range).
Seat Availability — For a selected event, display all seats with their current status (available, reserved, booked). Typically rendered as an interactive seat map with color-coded indicators.
Seat Reservation — When a user selects a seat, temporarily hold it for a limited time window (e.g., 2 minutes) while they complete payment. No other user can reserve this seat during the hold.
Booking Confirmation — Process payment through an external provider (Stripe, PayPal). On success, transition the seat from reserved to booked. On failure or timeout, release the seat.
Waitlist — If a seat is already reserved, the user joins a FIFO queue. If the reservation expires without payment, the next waitlisted user gets the opportunity to reserve.

Out of Scope

Event creation and venue management (admin side)
Dynamic pricing / auction-style bidding
Resale / secondary marketplace
Seating chart designer
Push notifications for upcoming events
Social features (sharing, group bookings)
Loyalty programs and promotional codes

Non-Functional Requirements

We extract adjectives and descriptive phrases to identify quality constraints:

"no double bookings" → Strong consistency on seat inventory; exactly-once reservation semantics
"millions of users" during on-sale events → High concurrency with extreme traffic spikes (100× baseline)
"first-come, first-served" → Strict ordering of reservation requests
"limited time window" for payment → Reservation expiry with automatic release
"real-time" seat status updates → Low-latency seat availability updates as seats are reserved/released

NFR	Target	Reasoning
Strong Consistency	Zero double-bookings across all race conditions	A sold seat must never be assigned to two users — legal and financial liability
High Availability	99.99% during on-sale events	Downtime during a Taylor Swift on-sale is catastrophic — millions of users, minutes of window
Low Latency	<200ms for seat availability; <500ms for reservation	Users won't wait; slow = lost sales and user exodus to competitors
Burst Scalability	100K → 10M+ concurrent users in seconds	On-sale traffic is not gradual — it's a step function at the announced time
Fairness	First-come, first-served ordering	Users who arrive first should get first access to seats
Reliability	Reservation expires if payment fails	Prevent seat hoarding — seats must be released back to inventory

Key insight: The core tension is consistency vs availability during peak traffic. The system resolves this by separating the read path (seat availability — eventually consistent, cached) from the write path (reservation — strongly consistent, serialized). Reads scale horizontally via caching; writes are serialized through an ordered queue to guarantee consistency.

Resource Estimation

Scale Assumptions

Parameter	Value
Daily active users	~10 million
Peak concurrent users	~100,000 (on-sale events)
Concurrent booking attempts	~20,000
Read:Write ratio	5:1
Average write requests per user per day	~5
Data per write request	~1 KB

Throughput

Read QPS (average):

$\frac{10{,}000{,}000 \text{ users} \times 25 \text{ reads/day}}{86{,}400 \text{ sec}} \approx 2{,}894 \text{ QPS (avg)}$

Write QPS (average):

$\frac{10{,}000{,}000 \times 5 \text{ writes/day}}{86{,}400} \approx 579 \text{ QPS (avg)}$

Peak write QPS (on-sale burst):

During a hot on-sale event, 20,000 users concurrently attempting to book in the first few seconds:

$\sim20{,}000 \text{ reservation requests in } \sim10 \text{ sec} = 2{,}000 \text{ writes/sec (peak)}$

Storage

Event + seat data:

Assume 50,000 events/year, average 10,000 seats per event:

$50{,}000 \times 10{,}000 \times 1 \text{ KB} = 500 \text{ GB/year}$

Booking + transaction records:

$\frac{10{,}000{,}000 \times 5}{\text{day}} \times 1 \text{ KB} \times 365 \approx 18.25 \text{ TB/year}$

Bandwidth

Inbound (writes):

$579 \text{ writes/sec} \times 1 \text{ KB} \approx 0.6 \text{ MB/sec}$

Outbound (reads with seat maps):

Seat map responses are larger (~50 KB with seat layout + status):

$2{,}894 \text{ reads/sec} \times 50 \text{ KB} \approx 145 \text{ MB/sec} \approx 1.16 \text{ Gbps}$

During on-sale peaks, read traffic can spike by up to 100× (the burst factor assumed in the requirements), so CDN and caching are critical to absorb this without overwhelming the backend.
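The estimates above can be reproduced with a few lines of arithmetic. The sketch below only restates the scale assumptions from the table; the variable names are illustrative, and decimal units (1 MB = 1,000 KB) are assumed throughout.

```python
# Back-of-envelope numbers derived from the scale assumptions above.
SECONDS_PER_DAY = 86_400
DAU = 10_000_000
WRITES_PER_USER_PER_DAY = 5
READ_WRITE_RATIO = 5            # 5:1, so 25 reads per user per day
WRITE_SIZE_KB = 1
SEAT_MAP_SIZE_KB = 50

reads_per_user = WRITES_PER_USER_PER_DAY * READ_WRITE_RATIO
read_qps = DAU * reads_per_user / SECONDS_PER_DAY
write_qps = DAU * WRITES_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_write_qps = 20_000 / 10    # 20,000 reservation requests in ~10 seconds

# Decimal units: 1 GB = 10**6 KB, 1 TB = 10**9 KB
seat_storage_gb = 50_000 * 10_000 * WRITE_SIZE_KB / 10**6
booking_storage_tb = DAU * WRITES_PER_USER_PER_DAY * WRITE_SIZE_KB * 365 / 10**9

inbound_mb_s = write_qps * WRITE_SIZE_KB / 1_000
outbound_mb_s = read_qps * SEAT_MAP_SIZE_KB / 1_000
outbound_gbps = outbound_mb_s * 8 / 1_000

print(f"read QPS (avg):   {read_qps:,.0f}")
print(f"write QPS (avg):  {write_qps:,.0f}")
print(f"write QPS (peak): {peak_write_qps:,.0f}")
print(f"seat storage:     {seat_storage_gb:,.0f} GB/year")
print(f"booking storage:  {booking_storage_tb:,.2f} TB/year")
print(f"outbound:         {outbound_mb_s:,.0f} MB/s ({outbound_gbps:.2f} Gbps)")
```

Changing a single assumption (for example the seat map size) and re-running shows which inputs dominate the bandwidth figure.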

API Design

We derive API endpoints from the functional requirements. The read path serves event browsing and seat availability; the write path handles reservations and bookings.

# ── Event Search (Read Path) ─────────────────────────────────
GET /events?query={text}&location={city}&date={date}&page={n}
→ 200 OK
{
  "events": [
    {
      "event_id": "evt_ts_2024",
      "title": "Taylor Swift | The Eras Tour",
      "venue": "SoFi Stadium, LA",
      "date": "2024-08-15T19:30:00Z",
      "price_range": { "min": 49, "max": 899, "currency": "USD" }
    }
  ],
  "total": 24,
  "page": 1
}

# ── Seat Availability ────────────────────────────────────────
GET /events/{event_id}/seats
→ 200 OK
{
  "event_id": "evt_ts_2024",
  "sections": [
    {
      "section": "A",
      "seats": [
        { "seat_id": "A-101", "status": "available", "price": 299 },
        { "seat_id": "A-102", "status": "reserved", "price": 299 },
        { "seat_id": "A-103", "status": "booked", "price": 299 }
      ]
    }
  ]
}

# ── Reserve a Seat (Write Path) ──────────────────────────────
POST /events/{event_id}/seats/{seat_id}/reserve
{ "user_id": "u_12345" }
→ 201 Created
{
  "reservation_id": "res_abc",
  "seat_id": "A-101",
  "status": "reserved",
  "expires_at": "2024-08-15T10:02:00Z",
  "payment_url": "https://pay.stripe.com/session/xyz"
}

# ── Confirm Booking (after payment webhook) ──────────────────
POST /bookings/{reservation_id}/confirm
{ "payment_id": "pay_xyz", "provider": "stripe" }
→ 200 OK
{
  "booking_id": "bk_789",
  "status": "confirmed",
  "seat_id": "A-101",
  "event": "Taylor Swift | The Eras Tour"
}

# ── Cancel Reservation ───────────────────────────────────────
POST /reservations/{reservation_id}/cancel
→ 200 OK  { "status": "cancelled", "seat_status": "available" }

# ── Join Waitlist ────────────────────────────────────────────
POST /events/{event_id}/seats/{seat_id}/waitlist
{ "user_id": "u_67890" }
→ 201 Created
{
  "position": 3,
  "estimated_wait": "unlikely"
}

WebSocket / SSE for Real-Time Updates

For live seat map updates during on-sale events, clients subscribe to a Server-Sent Events (SSE) stream:

Direction	Event Type	Payload
Server → Client	`seat_status_changed`	`{seat_id, new_status, section}`
Server → Client	`reservation_expiring`	`{reservation_id, seconds_left}`
Server → Client	`booking_confirmed`	`{booking_id, seat_id}`
Server → Client	`waitlist_offer`	`{seat_id, expires_at}`

SSE is preferred over WebSocket here because the communication is predominantly one-directional (server → client). The client only needs to know about seat status changes — it doesn't need to send frequent messages upstream.
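To make the wire format concrete, here is a minimal sketch of how one of the events in the table could be serialized for an SSE stream. The helper name `format_sse` is hypothetical; the `event:` and `data:` fields and the terminating blank line are part of the standard `text/event-stream` format.

```python
import json


def format_sse(event_type: str, payload: dict) -> str:
    """Serialize one event in the text/event-stream wire format."""
    data = json.dumps(payload, separators=(",", ":"))
    # A message is a set of "field: value" lines terminated by a blank line.
    return f"event: {event_type}\ndata: {data}\n\n"


message = format_sse(
    "seat_status_changed",
    {"seat_id": "A-101", "new_status": "reserved", "section": "A"},
)
print(message, end="")
```

On the client side, a browser `EventSource` would dispatch this as a `seat_status_changed` event whose data is the JSON string.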

Data Model

Five core entities drive the architecture, each owned by its respective microservice:

Event — Represents a scheduled event at a venue.

Field	Type	Description
`event_id`	string	Primary key
`title`	string	Event name
`venue_id`	string	FK to venue
`date`	timestamp	Event date/time
`on_sale_date`	timestamp	When tickets go on sale
`total_seats`	int	Total inventory count

Seat — Individual seat within an event/venue.

Field	Type	Description
`seat_id`	string	Primary key
`event_id`	string	FK to event
`section`	string	Section/area in venue
`row`	string	Row identifier
`number`	int	Seat number
`status`	enum	available, reserved, booked
`price`	decimal	Ticket price
`reserved_by`	string	User ID (null if available)
`reserved_until`	timestamp	Reservation expiry
`version`	int	For optimistic concurrency

Booking — Confirmed purchase linking user, event, and seat.

Field	Type	Description
`booking_id`	string	Primary key
`event_id`	string	FK to event
`seat_id`	string	FK to seat
`user_id`	string	FK to user
`status`	enum	pending, confirmed, cancelled
`created_at`	timestamp	When booking was created

Transaction — Payment record for a booking.

Field	Type	Description
`transaction_id`	string	Primary key
`booking_id`	string	FK to booking
`payment_status`	enum	pending, success, failed, refunded
`payment_provider_id`	string	External payment reference
`amount`	decimal	Payment amount

Waitlist — FIFO queue per seat for users waiting for availability.

Field	Type	Description
`waitlist_id`	string	Primary key
`event_id`	string	FK to event
`seat_id`	string	FK to seat
`user_id`	string	FK to user
`position`	int	Queue position
`created_at`	timestamp	When joined

Relationships: Event → Seat is 1:many. Seat → Booking is 1:1 (per event). Booking → Transaction is 1:1. Seat → Waitlist is 1:many.

Each microservice owns its database: Search Service owns Events, Booking Service owns Seats + Bookings + Waitlist, Payment Service owns Transactions. Cross-service interaction is via API calls, not shared databases.
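The Seat `status` field behaves like a small state machine. A minimal sketch of the transitions implied by the functional requirements (hold, confirm, release) is shown below; the `transition` helper and the `ALLOWED` set are illustrative names, not part of the design above.

```python
from enum import Enum


class SeatStatus(Enum):
    AVAILABLE = "available"
    RESERVED = "reserved"
    BOOKED = "booked"


# Transitions implied by the requirements: hold a seat, confirm it on
# payment success, or release it on payment failure or timeout.
ALLOWED = {
    (SeatStatus.AVAILABLE, SeatStatus.RESERVED),
    (SeatStatus.RESERVED, SeatStatus.BOOKED),
    (SeatStatus.RESERVED, SeatStatus.AVAILABLE),
}


def transition(current: SeatStatus, target: SeatStatus) -> SeatStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


status = transition(SeatStatus.AVAILABLE, SeatStatus.RESERVED)
status = transition(status, SeatStatus.BOOKED)
```

Rejecting anything outside this set (for example available straight to booked) gives the Booking Service a single place to enforce the reservation lifecycle.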

High-Level Design

We build the architecture progressively, starting from a naive single-service design and evolving through three iterations. Each step addresses specific failure modes revealed by the previous design.

Step 1: Naive Design — Direct Database Booking

The Starting Architecture

Start with the simplest possible approach: a single service that handles all operations — search, reservation, payment — with a single relational database.

The Flow

User searches for events → direct database query
User selects a seat → UPDATE seat SET status = 'reserved' WHERE seat_id = X
User pays → synchronous call to payment provider
On payment success → UPDATE seat SET status = 'booked'

Why This Breaks

This design fails under three simultaneous pressures:

Read contention: 100K concurrent users all querying the same event's seats, and every seat map request hits the database. If each user refreshes the seat map about once per second, that is on the order of 100K QPS of complex seat map queries, enough to exhaust the database connection pool.
Write races: 20,000 users trying to reserve the same popular seats. Without proper concurrency control, two users reserve the same seat → double booking.
Synchronous payment: The reservation holds a database connection while waiting for Stripe/PayPal to respond (1-3 seconds). At 2,000 concurrent payments, that's 2,000 database connections held open doing nothing.

Naive architecture: single service, single database, all operations synchronous

✅ Works for: Small venue with a few hundred seats and gradual ticket sales

❌ Fails at scale: Database is the bottleneck for both reads and writes. No caching, no queue, no separation of concerns. Double-booking possible without explicit locking. Synchronous payment blocks resources.

Step 2: Read/Write Path Separation — Caching + Booking Service

Separating Reads from Writes

The fundamental insight: reads and writes have completely different characteristics and should be handled by different infrastructure.

Reads (event search, seat availability): high volume, tolerant of slight staleness (a seat showing as 'available' for an extra second is acceptable)
Writes (reservation, booking): low volume but must be strictly consistent (no double-booking)

Read Path: Search Service + Cache

The Search Service handles all read operations. Event data and venue information change infrequently, so they are cached aggressively:

Event catalog → cached with long TTL (hours)
Seat availability → cached with short TTL (seconds) in a Seat Cache (Redis or Memcached)

When a user opens the seat map, the Search Service reads from the Seat Cache. The cache is updated when seats are reserved or released. This offloads 95%+ of read traffic from the database.
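The read path can be sketched as a cache lookup with a short TTL and a database fallback. The example below uses an in-memory dictionary as a stand-in for Redis or Memcached; `SeatCache`, `get_seat_map`, and `load_from_db` are hypothetical names, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class SeatCache:
    """Tiny in-memory stand-in for Redis/Memcached with a per-entry TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # event_id -> (expires_at, seat_map)

    def get(self, event_id):
        entry = self._entries.get(event_id)
        if entry and entry[0] > self.clock():
            return entry[1]
        return None  # missing or expired

    def put(self, event_id, seat_map):
        self._entries[event_id] = (self.clock() + self.ttl, seat_map)


def get_seat_map(event_id, cache, load_from_db):
    """Serve from the cache; fall back to the database on a miss."""
    seat_map = cache.get(event_id)
    if seat_map is None:
        seat_map = load_from_db(event_id)
        cache.put(event_id, seat_map)
    return seat_map
```

With a TTL of a few seconds, a burst of requests for the same event results in roughly one database read per TTL window instead of one per request.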

Read/write separation: Search Service with cache for reads, Booking Service for writes

Write Path: Booking Service

The Booking Service handles all reservation and booking operations. When a user reserves a seat:

Booking Service receives the reservation request
Attempts to mark the seat as reserved in the database
Updates the Seat Cache to reflect the new status
Returns reservation details + payment link to the user

The Problem: Concurrent Write Races

Even with read/write separation, 20,000 concurrent booking attempts can still cause race conditions. Two users simultaneously read a seat as available, both attempt to reserve it, and both succeed — double booking.
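The interleaving described above can be replayed deterministically. This sketch uses a plain dictionary in place of the database and runs both "check" steps before either "act" step, which is exactly the ordering that produces a double booking; all names are illustrative.

```python
seats = {"A-101": "available"}
confirmations = []


def read_status(seat_id):
    return seats[seat_id]


def write_reserved(seat_id, user_id):
    seats[seat_id] = "reserved"
    confirmations.append(user_id)


# Both requests run the "check" step before either runs the "act" step.
seen_by_u1 = read_status("A-101")
seen_by_u2 = read_status("A-101")

if seen_by_u1 == "available":
    write_reserved("A-101", "u_1")
if seen_by_u2 == "available":
    write_reserved("A-101", "u_2")

print(confirmations)  # both users were told the seat is theirs
```

The fix is to make the check and the write a single atomic step, which is what the later designs do.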

✅ NFRs addressed: Read scalability via caching (burst traffic absorbed), separation of concerns

❌ Still missing: No ordering of write requests — race conditions on concurrent reservations. Synchronous payment still blocking. No reservation expiry mechanism. No waitlist.

Step 3: Ordered Queue + Asynchronous Payment Pipeline

Serializing Writes with a Message Queue

The write race problem from Step 2 has a clean solution: serialize all reservation requests through an ordered message queue.

Instead of the Booking Service writing directly to the database, it pushes reservation requests into a message queue that preserves order (a Kafka partition, an SQS FIFO queue, or a RabbitMQ queue with a single consumer). A Queue Consumer processes requests one at a time in FIFO order.

The Reservation Flow

User clicks "Reserve" → Booking Service enqueues the request
Booking Service immediately returns to the user: "Your seat is being reserved..." (spinner on UI)
Queue Consumer picks up the request:
- Checks if the seat is still available in the database
- If yes: marks it as reserved, updates the cache, triggers payment flow
- If no: rejects the request, notifies the user
The queue ensures first-come, first-served ordering — the first request enqueued wins

This eliminates write races by design: only one consumer processes reservations for a given seat at a time. The queue also acts as a buffer during traffic spikes — it absorbs burst writes that would otherwise overwhelm the database.
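A minimal sketch of the consumer loop, assuming an in-memory `deque` in place of the real message queue and a dictionary in place of the seat table. The function name and the result tuples are illustrative.

```python
from collections import deque


def process_reservations(requests, seats):
    """Drain the queue in FIFO order; the first request for a seat wins."""
    queue = deque(requests)
    results = []
    while queue:
        user_id, seat_id = queue.popleft()
        # Check and update happen in one step of a single consumer,
        # so no other request can slip in between them.
        if seats.get(seat_id) == "available":
            seats[seat_id] = "reserved"
            results.append((user_id, seat_id, "reserved"))
        else:
            results.append((user_id, seat_id, "rejected"))
    return results


seats = {"A-101": "available", "A-102": "available"}
results = process_reservations(
    [("u_1", "A-101"), ("u_2", "A-101"), ("u_3", "A-102")], seats
)
```

Because only one request is handled at a time, the second request for A-101 observes the updated status and is rejected rather than double-booked.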

Asynchronous Payment

Payment is decoupled from the reservation. After a seat is reserved:

Booking Service sends a payment link to the user (via SSE/notification)
User completes payment through the external provider
Payment provider sends a webhook to our Payment Service
Payment Service records the transaction and notifies Booking Service
Booking Service transitions the seat from reserved to booked
A message is sent to the queue to update the Seat Cache

Architecture with ordered message queue and async payment pipeline

Why a Queue Instead of Database Locking?

Database-level locks (e.g., SELECT ... FOR UPDATE) also prevent double-booking, but they hold connections open under contention. With 20,000 concurrent requests for the same event, lock contention cascades into connection pool exhaustion and timeouts. The queue decouples admission from processing — the database sees a steady, manageable stream of writes regardless of how bursty the incoming traffic is.

The queue also provides natural fairness: requests are processed in arrival order. With database locking, the request that acquires the lock first wins — which may depend on network latency rather than user arrival time.

✅ NFRs addressed: No double-booking (serialized writes), first-come first-served ordering, async payment (no blocking resources), burst absorption

❌ Still missing: No reservation expiry — what if a user reserves but never pays? No notifications to users about reservation status. No waitlist. No protection against bots and unfair access.

Step 4: Complete Architecture — Reservation Expiry, Notifications, and Waitlist

Adding Expiry, Notifications, Waitlist, and Fair Access

The final architecture addresses the remaining gaps: reservation timeout, user notifications, waitlist management, and bot protection.

Reservation Expiry via Delayed Task Scheduler

When a seat is reserved, the system simultaneously schedules a delayed task (e.g., 2 minutes later). The Scheduler checks: is the seat now booked? If not, the reservation expired without payment — release it back to available.

Implementation options:

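Redis key with TTL: Set a key reservation:{id} with a 2-minute TTL. On expiry, a keyspace notification triggers the release logic. These notifications are delivered over Pub/Sub on a fire-and-forget basis, so a listener that is disconnected at that moment misses the event.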
Delayed message queue: Send a message with a 2-minute delivery delay. When the consumer receives it, check the seat status.
Scheduled task service: A cron-like service that polls for expired reservations every N seconds.

The delayed message queue approach is simplest and most reliable — it's idempotent (checking status before acting) and doesn't require polling.
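The expiry check can be sketched with a min-heap of due times standing in for the delayed queue. Time is passed in explicitly so the example is deterministic; `ExpiryScheduler` and the seat dictionary layout are assumptions made for illustration.

```python
import heapq


class ExpiryScheduler:
    def __init__(self):
        self._tasks = []  # min-heap of (due_time, seat_id, reservation_id)

    def schedule(self, due_time, seat_id, reservation_id):
        heapq.heappush(self._tasks, (due_time, seat_id, reservation_id))

    def run_due(self, now, seats):
        """Release every due reservation that is still unpaid."""
        released = []
        while self._tasks and self._tasks[0][0] <= now:
            _, seat_id, reservation_id = heapq.heappop(self._tasks)
            seat = seats[seat_id]
            # Idempotent check: act only if this exact hold is still pending.
            if seat["status"] == "reserved" and seat["reservation_id"] == reservation_id:
                seat.update(status="available", reservation_id=None)
                released.append(seat_id)
        return released


seats = {
    "A-101": {"status": "reserved", "reservation_id": "res_1"},  # never paid
    "A-102": {"status": "booked", "reservation_id": "res_2"},    # paid in time
}
scheduler = ExpiryScheduler()
scheduler.schedule(120, "A-101", "res_1")
scheduler.schedule(120, "A-102", "res_2")
released = scheduler.run_due(now=120, seats=seats)
```

Comparing the reservation ID as well as the status keeps a late task from releasing a newer hold on the same seat.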

Reservation expiry flow using delayed task scheduler

Notification Service

The Booking Service needs to notify users about:

Seat is being reserved — proceed to payment
Payment successful — booking confirmed
Reservation expired — seat released
Waitlist offer — it's your turn to reserve

This is done through a Notification Service that consumes events from the message queue and pushes updates to the user via Server-Sent Events (SSE). SSE is ideal here: communication is server-to-client only, and it works over standard HTTP — no WebSocket upgrade needed.

Notification Service consuming events from message queue and pushing to users via SSE

Waitlist Implementation

When a user tries to reserve an already-reserved seat, they can join a waitlist. The waitlist is a FIFO queue per seat (Redis list is ideal):

Join: LPUSH waitlist:{event_id}:{seat_id} {user_id}
Dequeue: RPOP waitlist:{event_id}:{seat_id} → returns the longest-waiting user

When a reservation expires (user didn't pay), the Scheduler releases the seat and immediately dequeues the next waitlisted user, giving them a fresh reservation window.

Fair Access and Bot Protection

During high-demand events, bots can monopolize the booking flow. Protections include:

Virtual waiting room: Before on-sale time, users enter a queue. At on-sale, users are admitted in random or first-come order at a controlled rate.
CAPTCHA/verification: Challenge users before allowing reservation to filter automated traffic.
Rate limiting: Per-user and per-IP request limits.
Lottery system: For extremely high-demand events, randomly select users who get the opportunity to purchase (used by some real-world ticketing systems).
Time-slot allocation: Spread buying activity across intervals to prevent simultaneous request tsunamis.

Complete Ticketmaster architecture with all components

NFR Scorecard

NFR	Target	How It's Met
Strong Consistency	Zero double-bookings	Write path serialized through ordered message queue; Queue Consumer processes one seat reservation at a time; database-level row lock (or an optimistic version check) as additional safeguard
High Availability	99.99% during on-sale events	Read path served entirely from cache (Redis + CDN); write path decoupled via queue — Booking Service stays responsive even if Consumer is slow
Low Latency	<200ms reads, <500ms reservation	Seat Cache hit ratio >95%; reservation is an async queue push (<50ms); payment is fully async
Burst Scalability	100K+ concurrent users	CDN absorbs static content; Seat Cache absorbs seat queries; message queue buffers write bursts; Search Service scales horizontally
Fairness	First-come, first-served	FIFO message queue guarantees ordering; virtual waiting room + CAPTCHA prevent bots; waitlist serves users in order
Reliability	No seat hoarding	Delayed task scheduler releases unpaid reservations after timeout; waitlist automatically offers to next user

Deep Dives

How do you prevent double-booking?

Preventing Double Bookings

Double-booking is the cardinal sin of a ticketing system. This deep dive explores three complementary approaches at different layers.

Layer 1: Application-Level Serialization (Queue)

The ordered message queue is the primary defense. All reservation requests for a given event are routed to the same queue partition (partitioned by event_id). The consumer processes them sequentially. When it encounters a request for a seat that's already reserved, it rejects the request — no race condition possible.

This works because the queue is ordered and the consumer is single-threaded per partition. For systems handling multiple events simultaneously, each event gets its own partition — parallelism across events, serialization within an event.
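One way to route requests is to hash the event ID onto a fixed number of partitions. This is a sketch under that assumption: the function name and the choice of CRC32 are illustrative, and with hashing several events may share a partition, although all requests for one event always land on the same one.

```python
import zlib


def partition_for(event_id: str, num_partitions: int) -> int:
    """Map an event to a partition with a hash that is stable across processes."""
    # Python's built-in hash() is salted per process for strings,
    # so a deterministic checksum is used instead.
    return zlib.crc32(event_id.encode("utf-8")) % num_partitions


partition = partition_for("evt_ts_2024", 8)
```

A mega event that needs a partition to itself would be assigned explicitly rather than by hash.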

Layer 2: Database-Level Pessimistic Locking

As a second line of defense (defense in depth), the Queue Consumer uses a pessimistic lock when updating the seat:

BEGIN TRANSACTION;

-- Acquire exclusive lock on the seat row
SELECT * FROM seats
WHERE seat_id = 'A-101' AND event_id = 'evt_ts_2024'
FOR UPDATE;

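-- Reserve only if still available. The status predicate makes this a no-op
-- (0 rows affected) when the seat is already taken; the Consumer checks the
-- affected-row count and rejects the request in that case.
UPDATE seats
SET status = 'reserved',
    reserved_by = 'u_12345',
    reserved_until = NOW() + INTERVAL '2 minutes'
WHERE seat_id = 'A-101'
  AND event_id = 'evt_ts_2024'
  AND status = 'available';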

COMMIT;

The FOR UPDATE lock prevents any other transaction from modifying this row until the current transaction commits. Even if (due to a bug) two consumers process the same seat concurrently, the lock serializes them at the database level.

Layer 3: Optimistic Concurrency Control

An alternative or complementary approach is optimistic locking with a version field:

UPDATE seats
SET status = 'reserved',
    reserved_by = 'u_12345',
    reserved_until = NOW() + INTERVAL '2 minutes',
    version = version + 1
WHERE seat_id = 'A-101'
  AND status = 'available'
  AND version = 42;

-- If 0 rows affected → someone else got there first

Optimistic locking works best when contention is low: it doesn't hold a lock while processing, it simply tries the update and handles failure. Under heavy contention most attempts fail the version check and must be retried or rejected. In the ticketing use case the queue already serializes writes, so pessimistic locking in the Consumer is safe and cheap (no contention, since only one Consumer processes a given event's requests).
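A minimal runnable sketch of the version-check update, using Python's built-in `sqlite3` as a stand-in for the production database. The table is reduced to the columns the check needs, and `try_reserve` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats ("
    "seat_id TEXT PRIMARY KEY, status TEXT, reserved_by TEXT, version INTEGER)"
)
conn.execute("INSERT INTO seats VALUES ('A-101', 'available', NULL, 42)")
conn.commit()


def try_reserve(conn, seat_id, user_id, expected_version):
    """Return True only if the row still has the version the caller read."""
    cur = conn.execute(
        """UPDATE seats
           SET status = 'reserved', reserved_by = ?, version = version + 1
           WHERE seat_id = ? AND status = 'available' AND version = ?""",
        (user_id, seat_id, expected_version),
    )
    conn.commit()
    # 0 rows affected means someone else got there first.
    return cur.rowcount == 1


first = try_reserve(conn, "A-101", "u_12345", 42)
second = try_reserve(conn, "A-101", "u_67890", 42)  # stale version
print(first, second)
```

The second caller read the same version but loses, because the first update already bumped it.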

Distributed Locking (Redis)

A fourth option is a distributed lock using Redis (SET seat_lock:{seat_id} {owner_id} NX EX 30, where the value identifies the lock holder). This provides a lock outside the database, useful when multiple services need to coordinate.

However, for this use case, a distributed lock adds unnecessary complexity. The queue + database lock already guarantees exactly-once reservation. A Redis lock would be useful if you had multiple independent Booking Services writing directly to different databases (which our architecture avoids).

Recommendation: Queue-based serialization as the primary mechanism. Database-level pessimistic lock as defense in depth. Avoid distributed locks unless the architecture demands cross-service coordination.

Cache-Database Consistency

A subtle edge case: the Seat Cache shows a seat as available but the database has it as reserved. Two users see it as available, both submit reservation requests. The queue serializes them — only the first succeeds. The second gets rejected and can join the waitlist.

The cache is eventually consistent by design. It's updated after each reservation and release, but there's always a small window (milliseconds to seconds) where the cache lags behind the database. This is acceptable because the cache is not the source of truth — the database is, and the queue ensures only valid reservations succeed.

How do you implement a waitlist?

Waitlist Design

When a seat is already reserved, the user shouldn't just see "unavailable" — they should have a way to queue up. If the current holder doesn't complete payment, the next person in line gets a shot.

Data Structure: Redis List per Seat

A Redis list provides O(1) push/pop operations and natural FIFO ordering:

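import redis

# Assumes a reachable Redis server. decode_responses=True returns str, not bytes.
# event_id, seat_id, user_id, create_reservation and notify_user come from the
# surrounding service.
r = redis.Redis(decode_responses=True)
key = f"waitlist:{event_id}:{seat_id}"

# User joins waitlist (new entries go to the head of the list)
r.lpush(key, user_id)

# Check position: LPOS counts from the head (newest entry), but RPOP serves
# from the tail, so convert the index into a 1-based place in line
index = r.lpos(key, user_id)
position = r.llen(key) - index if index is not None else None

# When seat becomes available, dequeue the longest-waiting user
next_user = r.rpop(key)
if next_user:
    # Create reservation for next_user
    create_reservation(event_id, seat_id, next_user)
    # Notify via SSE
    notify_user(next_user, "Your waitlisted seat is now available!")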

Waitlist Lifecycle

User A reserves seat A-101 (status → reserved)
User B tries to reserve A-101 → seat is taken → offered to join waitlist
User B joins → LPUSH waitlist:evt:A-101 user_B (position 1)
User C joins → LPUSH waitlist:evt:A-101 user_C (position 2)
User A's reservation expires (didn't pay within 2 minutes)
Scheduler releases seat A-101 → status back to available
Scheduler dequeues User B → RPOP waitlist:evt:A-101
System creates a fresh reservation for User B with a new 2-minute window
User B receives SSE notification: "Seat A-101 is available — complete payment now!"

Edge Cases

What if User B's app is closed when they're dequeued? Send a push notification and give them a slightly longer window (e.g., 5 minutes instead of 2).
What if the waitlist is very long? Show the user their position and a realistic estimate ("unlikely" if they're position 50 for a single seat).
What if the user no longer wants the seat? Allow them to leave the waitlist (LREM). Users behind them move up one place; the relative order of everyone else is unchanged.

How do you ensure fair access during high-demand events?

Fair Access and Virtual Waiting Rooms

When 10 million fans refresh the page at exactly 10:00 AM for a Taylor Swift concert with 80,000 seats, the system faces a thundering herd problem. Without protection, the first few hundred milliseconds determine everything — and bots are faster than humans.

Virtual Waiting Room

Before the on-sale time, users enter a virtual queue (like a checkout line). The system works like this:

Users arrive at the event page before on-sale time and click "Join Queue"
Each user gets a random queue position (lottery-based) or a first-come position (arrival-based)
At on-sale time, the system admits users in controlled batches (e.g., 1,000 users every 5 seconds)
Admitted users can browse available seats and make reservations
Users still in the queue see their position and estimated wait time

The waiting room is typically implemented as a separate lightweight service (or even a CDN-level gate) that sits in front of the API Gateway. It serves a static waiting page to queued users and only routes admitted users to the actual Booking Service.

Why Random Position?

Arrival-based ordering (first to click "Join" gets position 1) rewards fast internet connections and bots. A randomized lottery at on-sale time gives everyone who joined before on-sale an equal chance, regardless of when they clicked (as long as it was before the cutoff). This is fairer and simulates a physical venue where doors open simultaneously.
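The lottery and the batched admission can be sketched in a few lines. Function names, the optional seed, and the batch size are assumptions for illustration; a real waiting room would also persist positions and pace the batches with a timer.

```python
import random


def assign_lottery_positions(user_ids, seed=None):
    """Shuffle everyone who joined before the cutoff into a random order."""
    rng = random.Random(seed)
    order = list(user_ids)
    rng.shuffle(order)
    return order


def admission_batches(order, batch_size):
    """Yield successive batches; the caller admits one batch per interval."""
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


joined = [f"u_{i}" for i in range(10)]
order = assign_lottery_positions(joined, seed=1)
batches = list(admission_batches(order, batch_size=4))
```

With the numbers from the text, `batch_size` would be 1,000 and one batch would be admitted every 5 seconds.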

Bot Protection Stack

Layer	Mechanism	What It Catches
Network	Rate limiting per IP + per user ID	Rapid-fire reservation attempts
Application	CAPTCHA before reservation	Automated scripts without browser
Behavioral	Device fingerprinting + interaction patterns	Sophisticated bots that solve CAPTCHAs
Business	Purchase limits per user/account	Scalpers buying bulk tickets
Monitoring	Anomaly detection on request patterns	Coordinated bot networks

Time-Slot Allocation

For extremely high-demand events, split the available inventory across multiple time slots (e.g., 10:00 AM, 10:15 AM, 10:30 AM). Each slot opens a portion of seats. This spreads the traffic spike into three smaller peaks, each manageable by the system. Users are assigned a time slot when they join the waiting room.

How do you scale for hot events without over-provisioning?

Hot Event Scaling Strategy

Most events see modest traffic — a local theater show or a minor league game. But a Taylor Swift or Beyoncé concert can spike traffic 100× within seconds. If you provision for peak Taylor Swift traffic all the time, you're wasting 99% of your infrastructure budget most of the time.

Traffic Tiers

Classify events into traffic tiers based on expected demand:

Tier	Examples	Expected Peak Concurrent	Strategy
Standard	Local theater, comedy shows	<1,000	Shared infrastructure
High	Popular bands, sports playoffs	10K-50K	Pre-scaled, dedicated cache
Mega	Taylor Swift, World Cup	100K-10M+	Dedicated cluster + waiting room

Pre-Scaling for Mega Events

For known mega events, the system pre-scales before the on-sale time:

Cache warming: Pre-populate the Seat Cache with the entire venue layout before on-sale. No cache misses during the initial rush.
Queue partitioning: Create dedicated queue partitions for the mega event. Standard events share a common partition pool.
Read replica scaling: Spin up additional read replicas for the event's database shard 30 minutes before on-sale.
CDN pre-staging: Push static event assets (venue map, images, seat layouts) to CDN edge nodes in advance.
Auto-scaling triggers: Set aggressive auto-scaling rules that trigger on queue depth rather than CPU (queue depth reacts faster to traffic spikes).

Geographic Sharding

Events are naturally geographic — a concert in LA is only relevant to users in the western US (mostly). Shard the Seat DB by geographic region:

North America West, North America East, Europe, Asia-Pacific
Each region has its own Search Service, Seat Cache, and queue infrastructure
Global events (e.g., World Cup finals) are handled by a dedicated global cluster

This contains blast radius — a traffic spike for an LA concert doesn't affect users browsing events in London.

Post-Event Cleanup

After on-sale completes (usually within 30-60 minutes), automatically scale down the dedicated infrastructure. Set an auto-cleanup timer to decommission extra replicas and cache entries after a cooling period.

Staff-Level Discussion Topics

The following topics contain open-ended architectural questions for staff+ conversations.

Event Sourcing for Seat Inventory

Context: Instead of updating seat status in-place (available → reserved → booked), what if you stored every state transition as an immutable event? The current status is derived by replaying events.

Discussion Points:

Append-only event log guarantees full audit trail (who reserved, when, from which IP, what happened)
Replay events to rebuild the Seat Cache after a crash — no stale data
Temporal queries: "What was seat A-101's status at 10:00:03 AM?" is trivial with event sourcing, but cannot be answered from in-place updates unless a separate history table is kept.
Tradeoff: event replay is slower than a simple status read. Snapshot the current state periodically to bound replay time.
CQRS natural fit: the event store is the write model, the Seat Cache is the read model. They evolve independently.
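
A minimal sketch of replay with an optional snapshot, assuming events arrive in append order with non-decreasing timestamps. The `SeatEvent` fields and the snapshot shape `(snapshot_time, {seat_id: status})` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class SeatEvent:
    seat_id: str
    timestamp: float  # seconds since epoch
    new_status: str   # "available", "reserved", or "booked"
    actor: str        # who caused the transition (audit trail)


Snapshot = Tuple[float, Dict[str, str]]


def status_at(events: Iterable[SeatEvent], seat_id: str, as_of: float,
              snapshot: Optional[Snapshot] = None) -> str:
    """Derive a seat's status at time `as_of` by replaying the event log."""
    status = "available"  # every seat starts available
    start = float("-inf")
    # A snapshot is only usable if it was taken at or before the query time.
    if snapshot is not None and snapshot[0] <= as_of:
        start, states = snapshot
        status = states.get(seat_id, "available")
    for ev in events:
        # Apply only this seat's events after the snapshot and up to the query time.
        if ev.seat_id == seat_id and start < ev.timestamp <= as_of:
            status = ev.new_status
    return status
```

The snapshot bounds replay cost: only events newer than the snapshot need to be read, which is the trade-off described in the list above.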

Multi-Region Active-Active for Global Events

Context: A World Cup final has fans buying tickets from every continent simultaneously. A single region can't serve all of them with low latency. But seat inventory is a shared, strongly-consistent resource. How do you go multi-region?

Discussion Points:

Reads can be served from local replicas per region (eventual consistency acceptable for seat availability)
Writes must be routed to a single leader region for inventory consistency
Alternatively: partition inventory by section — Region A processes sections 1-10, Region B processes sections 11-20. Each region is the leader for its partition.
Is cross-region write latency (100-200ms) acceptable, or does it cause timeout cascades during peak?
Conflict resolution: what if a section's leader region goes down mid-sale? Failover time vs data consistency trade-off.
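
The section-partitioning option can be sketched as a static ownership table. The region names and section ranges mirror the example in the list above; the function names are hypothetical, and failover is deliberately left out.

```python
from typing import List, Tuple

# Each region is the write leader for a contiguous range of sections.
SECTION_LEADERS: List[Tuple[range, str]] = [
    (range(1, 11), "region-a"),   # sections 1-10
    (range(11, 21), "region-b"),  # sections 11-20
]


def leader_for(section: int) -> str:
    """Return the region that owns writes for this section."""
    for sections, region in SECTION_LEADERS:
        if section in sections:
            return region
    raise ValueError(f"no leader configured for section {section}")


def route_write(section: int, local_region: str) -> Tuple[str, bool]:
    """Return (leader region, whether the write must cross regions)."""
    leader = leader_for(section)
    return leader, leader != local_region
```

A write is local only when the buyer's region happens to own the section, so the cross-region latency question still applies to every other request.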

Idempotency and the Payment-Booking Race

Context: The payment webhook from Stripe arrives, but the booking confirmation message to the queue is lost due to a network glitch. The payment is recorded as successful in the Payment Service, but the seat is still reserved in the Seat DB. The scheduler fires and releases it. Now the user has paid but their seat is gone.

Discussion Points:

Idempotent payment webhook handler: use the payment_id as a deduplication key. Processing the same webhook twice is a no-op.
Outbox pattern: the Payment Service writes the booking confirmation to an outbox table in the same transaction as recording the payment. A separate reader publishes to the queue. No dual-write risk.
Check payment status before release: the scheduler queries the Payment Service and releases the seat only if no successful payment is recorded for the reservation.
Compensating transactions: if the seat was released after payment, automatically issue a refund + notify the user with apology.
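
The first three points can be sketched together with SQLite standing in for the Payment Service's database. Table names, column names, and the payload format are assumptions for illustration, and `safe_to_release` reads the local table where a real scheduler would call the Payment Service.

```python
import sqlite3


def init(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE payments (
            payment_id TEXT PRIMARY KEY,      -- deduplication key
            reservation_id TEXT NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def handle_webhook(conn, payment_id: str, reservation_id: str) -> bool:
    """Record the payment and its outbox row atomically. False means duplicate."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO payments VALUES (?, ?)",
            (payment_id, reservation_id))
        if cur.rowcount == 0:
            return False  # webhook already processed: no-op
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (f"booking_confirmed:{reservation_id}",))
    return True


def publish_pending(conn, publish) -> int:
    """Relay unpublished outbox rows to the queue (at-least-once delivery)."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)  # a crash here means the row is re-sent next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


def safe_to_release(conn, reservation_id: str) -> bool:
    """The scheduler releases a seat only if no payment exists for it."""
    row = conn.execute("SELECT 1 FROM payments WHERE reservation_id = ?",
                       (reservation_id,)).fetchone()
    return row is None
```

Because the relay is at-least-once, the downstream booking consumer also has to treat a repeated confirmation as a no-op.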

Level Expectations

Dimension
Requirements	Identify core FRs: search events, view seats, reserve, book. Mention no double-booking. Basic scale numbers.	Define NFRs precisely: consistency vs availability tension, burst scalability requirements, read/write ratio implications. Reservation expiry as a requirement.	Challenge assumptions — is first-come-first-served the right model? Argue for lottery-based access. Identify the payment-booking race as a distributed systems problem.
High-Level Design	Draw basic flow: User → Service → DB. Mention caching for reads. Understand that writes need some form of locking.	Separate read/write paths. Introduce message queue for write serialization. Design async payment with webhooks. Reservation expiry via scheduler.	Event sourcing for audit trail, CQRS for read/write model separation, virtual waiting rooms as a first-class component, geographic sharding for blast radius containment.
Consistency	Mention that double-booking is bad. Suggest database constraints or locking.	Explain pessimistic vs optimistic locking tradeoffs. Design queue-based serialization. Understand cache-DB consistency gap.	Multi-layer defense (queue + DB lock + version check). Outbox pattern for payment-booking atomicity. Compensating transactions for failure recovery.
Scalability	Mention caching and load balancing.	Design for 100× traffic spikes: CDN, cache warming, queue buffering. Explain read replica strategy.	Pre-scaling playbook for mega events. Traffic tier classification. Geographic sharding. Cost optimization via auto-scaling with queue-depth triggers.
Real-Time	Mention the user needs to know their seat status.	Choose SSE for server-to-client updates. Design notification flow through message queue. Handle reservation countdown on client.	SSE scaling across multiple server instances with a pub/sub backplane. Graceful degradation when notification service is overloaded — batch updates vs individual pushes.

Interview Cheatsheet

Core Architecture in 60 Seconds

"Read path → cache, write path → queue. The read side (event search, seat availability) is served entirely from cache (Redis + CDN). The write side (reservation, booking) goes through an ordered message queue that serializes requests — first-come, first-served, zero double-bookings."

"Reservation is a temporary hold. User reserves a seat → 2-minute payment window. Queue Consumer marks it reserved in the DB and cache. If payment doesn't arrive, a delayed task scheduler releases it back to available."

"Payment is asynchronous. Booking Service sends a payment link. External provider sends a webhook on success. Payment Service records it, Booking Service transitions seat to booked. Outbox pattern ensures no message loss."

"Waitlist for fairness. If a seat is taken, the user joins a FIFO Redis list. When a reservation expires, the next waitlisted user gets a fresh window. Notification via SSE."

"Virtual waiting room for mega events. Before on-sale, users enter a queue. At on-sale time, users are admitted in controlled batches. Lottery randomization for fairness. CAPTCHA and rate limiting for bot protection."

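The lottery admission step of the waiting room can be sketched as a shuffle followed by fixed-size batches. The function name, the batch size, and the `seed` parameter are illustrative assumptions; the seed exists only to make the example reproducible.

```python
import random
from typing import List, Optional, Sequence


def admission_batches(waiting_users: Sequence[str], batch_size: int,
                      seed: Optional[int] = None) -> List[List[str]]:
    """Shuffle everyone who joined before on-sale, then split into admission batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rng = random.Random(seed)
    order = list(waiting_users)  # copy so the caller's list is untouched
    rng.shuffle(order)           # arrival time before on-sale gives no advantage
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

Each batch is admitted only when downstream capacity allows, which keeps the reservation queue at a depth the consumers can drain.
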
Key Trade-offs to Mention

Trade-off	Option A	Option B	When to Choose
Double-booking prevention	Queue serialization	Database locking	Queue for high concurrency (buffering); DB lock as defense in depth
Locking strategy	Pessimistic (`FOR UPDATE`)	Optimistic (version check)	Pessimistic when single consumer; optimistic for concurrent consumers
Seat map freshness	Real-time SSE push	Short-TTL cache + poll	SSE for on-sale events; polling for standard browsing
Queue ordering	Lottery (random)	FIFO (arrival)	Lottery for fairness; FIFO when arrival order matters
Notification channel	SSE (unidirectional)	WebSocket (bidirectional)	SSE for server-push-only; WebSocket if client needs to send frequent messages
Reservation expiry	Redis TTL	Delayed queue message	Redis TTL for simplicity; delayed queue for reliability and idempotency
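
The optimistic option in the locking row can be sketched as a conditional UPDATE that succeeds only if the row still carries the version the caller read. SQLite stands in for the Seat DB, and the `seats` table with its `version` and `holder` columns is an assumed schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE seats (
            seat_id TEXT PRIMARY KEY,
            status  TEXT NOT NULL DEFAULT 'available',
            holder  TEXT,
            version INTEGER NOT NULL DEFAULT 0
        )
    """)


def reserve_seat(conn: sqlite3.Connection, seat_id: str, user_id: str,
                 expected_version: int) -> bool:
    """Reserve the seat only if nobody changed it since `expected_version` was read."""
    with conn:
        cur = conn.execute(
            "UPDATE seats SET status = 'reserved', holder = ?, version = version + 1 "
            "WHERE seat_id = ? AND status = 'available' AND version = ?",
            (user_id, seat_id, expected_version))
    # Zero rows updated means another writer won the race; the caller should re-read.
    return cur.rowcount == 1
```

Two consumers that both read version 0 can both attempt the update, but only the first one matches the WHERE clause, so the seat is never reserved twice.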

Common Mistakes to Avoid

❌ Calling a payment provider synchronously while holding a database lock — blocks resources for seconds per reservation
❌ Using the cache as the source of truth for seat availability — cache lag means it's eventually consistent, not strongly consistent
❌ Ignoring reservation expiry — seats get locked forever if users don't pay
❌ No plan for traffic spikes — "just add more servers" isn't a strategy when traffic goes from 1K to 10M in 1 second
❌ Skipping the waiting room for high-demand events — bots win, real fans lose
❌ Forgetting about the payment-booking race condition — if the webhook and the scheduler race, the user can pay but lose their seat