yelp

Introduction

Yelp is a local business discovery platform where users search for nearby businesses, browse their profiles with ratings and reviews, and contribute their own reviews. At its core, the system solves three interconnected problems: geospatial search over hundreds of millions of business listings, high-throughput page rendering for business profiles with aggregated ratings, and durable review ingestion that updates derived data asynchronously.

The engineering challenges are layered:

Challenge	Why It's Hard
Geospatial search	Finding businesses within a radius of a point on Earth's surface can't use regular B-tree indexes — latitude and longitude are two-dimensional, and distance calculations involve spherical trigonometry
Full-text + geo combined	Users search "sushi near me" — combining keyword relevance with geographic proximity requires specialized indexing structures
Read-heavy traffic	A 1000:1 read-to-write ratio means nearly every request is a read. Without aggressive caching, databases become the bottleneck long before write capacity matters
Rating aggregation	Computing `AVG(stars)` across millions of reviews on every page load would be catastrophically slow — aggregates must be precomputed and kept fresh
Search index freshness	When a restaurant changes its hours, how quickly does the search index reflect the change? The sync mechanism determines the trade-off between freshness and operational complexity

LLD Connection: This problem connects to search indexing concepts (inverted indexes, scoring) and message queue patterns for decoupling writes from derived data updates.

The system serves 10 million daily active users, hosts 200 million business listings, and handles 50,000 peak search queries per second — a scale where naive approaches collapse and every architectural decision carries real performance implications.

Functional Requirements

We extract three core user operations from the problem statement:

FR1 — Search Businesses by Text and Location. Users enter a search query (e.g., "italian restaurant") along with their location or a specified area. The system returns a ranked list of matching businesses within the geographic radius, with support for filters (price range, rating, open now) and sorting (distance, rating, relevance). Results must be paginated for mobile and web clients.

FR2 — View Business Profile. Each business has a detail page showing its name, address, phone number, hours, photos, aggregate star rating, total review count, and paginated reviews sorted by date or helpfulness. The page loads fast despite aggregating data from multiple sources with no noticeable latency.

FR3 — Write a Review. Authenticated users submit a review consisting of a star rating (1–5), text body, and optional photos. The review must be durably persisted, and derived data — the business's aggregate rating and search index signals — must eventually update to reflect the new review.

Out of Scope

Business owner management — Claiming/editing business profiles, responding to reviews
Photo gallery browsing — Dedicated photo browsing beyond review attachments
Social features — Following users, activity feeds, check-ins
Advertising / sponsored results — Paid placement in search results
Messaging — Direct messages between users and business owners
Reservations / ordering — Table booking, food delivery integration

Scale Requirements

Metric	Value
Daily active users	10,000,000
Total businesses	200,000,000
Total reviews	500,000,000
Read-to-write ratio	1000:1
Peak search QPS	50,000
Average search QPS	~12,000
Review writes	~1.2 per second (103K per day)
Average review size	~1 KB (text) + optional photos

The 1000:1 read-to-write ratio is the defining characteristic. Nearly every request is a search or page view. Write volume is modest — about 100K reviews per day — but each write triggers a cascade of derived data updates: aggregate recalculation, cache invalidation, and search index refresh.

Non-Functional Requirements

Requirement	Target	Rationale
Low Search Latency	p95 < 200 ms	Users expect instant results; every 100 ms of latency costs ~1% of engagement
Low Page Load Latency	p95 < 100 ms	Business pages must feel instantaneous — users bounce if the page takes > 1 second
High Scalability	50K peak search QPS, 200M businesses	Must handle viral traffic spikes (TikTok-famous restaurants) without degradation
Eventual Consistency	Reviews visible within seconds; aggregates within minutes	Users accept slight delay for rating updates; reviews must be durable immediately
High Availability	99.95% uptime	Yelp is a destination for real-time decisions ("where to eat right now") — downtime directly impacts revenue
Durability	Zero lost reviews after acknowledgment	Once we confirm a review submission, it must survive infrastructure failures

The consistency model is intentionally tiered:

Strong consistency for the review write itself — the author's review is durable the moment the write returns
Read-your-writes for the author — they see their own review immediately
Eventual consistency for everything else — other users see the review within seconds, aggregate ratings update within minutes, search index reflects the change within minutes

Resource Estimation

Traffic Estimation

Metric	Average	Peak
Search queries per second	~12,000	50,000
Business page views per second	~1,700	~7,000
Review writes per second	~1.2	~5
Total read QPS	~14,000	~57,000

With 10M DAU and an average of ~100 search queries per user per day: 10M × 100 / 86,400 ≈ 11,574 QPS.
Page views follow search at roughly an 8:1 ratio (users view ~12 business pages per session).
Review writes: ~100K reviews/day ÷ 86,400 ≈ 1.2 QPS — trivially low write throughput.

Storage Estimation

Data	Calculation	Result
Business metadata	200M × ~2 KB avg	~400 GB
Reviews (text)	500M × ~1 KB avg	~500 GB
Review photos	500M × 5% with photos × 2 MB avg	~50 TB (object storage)
Search index	200M documents × ~3 KB avg	~600 GB
Rating aggregates	200M × 16 bytes	~3.2 GB
Redis cache	Hot business data + popular searches	~50 GB

Infrastructure Estimation

Component	Requirement
Search cluster (Elasticsearch)	5–10 data nodes, 3 masters, ~600 GB index storage
Business database (PostgreSQL)	Primary + 2 read replicas, ~400 GB
Review database (PostgreSQL)	Primary + 2 read replicas, ~500 GB
Redis cache cluster	3–6 nodes, ~50 GB total
Message queue (Kafka)	3-broker cluster for CDC + review events
CDN	Global edge network for photos and static assets

The system is read-dominated and latency-sensitive. The primary engineering challenge is serving 57K peak read QPS with sub-200ms latency — this requires effective caching, index optimization, and horizontal scaling of read paths.

API Endpoints

Search Businesses

GET /api/v1/search?query=pizza&lat=40.75&lng=-73.99&radius=5000&filters=price:$$,open_now:true&sort=relevance&cursor=eyJv...

Query Parameters:

Parameter	Type	Required	Description
`query`	string	Yes	Search text (e.g., "pizza", "hair salon")
`lat`	float	Yes	User's latitude
`lng`	float	Yes	User's longitude
`radius`	integer	No	Search radius in meters (default: 5000, max: 40000)
`filters`	string	No	Comma-separated key:value pairs (price, rating, open_now, category)
`sort`	string	No	`relevance` (default), `distance`, `rating`, `review_count`
`cursor`	string	No	Opaque pagination cursor for next page

Response:

{
  "businesses": [
    {
      "id": "biz_4kMBvIEWPxWk...",
      "name": "Joe's Pizza",
      "rating_avg": 4.5,
      "rating_count": 1243,
      "category": "Pizza",
      "price_level": "$$",
      "distance_meters": 320,
      "address": "7 Carmine St, New York, NY",
      "is_open": true,
      "thumbnail_url": "https://cdn.yelp.example/photos/biz_4k/thumb.jpg"
    }
  ],
  "next_cursor": "eyJvZmZzZXQiOjIwfQ==",
  "total_results": 847
}

View Business Details

GET /api/v1/businesses/{business_id}

Response:

{
  "id": "biz_4kMBvIEWPxWk...",
  "name": "Joe's Pizza",
  "rating_avg": 4.5,
  "rating_count": 1243,
  "category": "Pizza",
  "price_level": "$$",
  "address": "7 Carmine St, New York, NY 10014",
  "phone": "+1-212-366-1182",
  "hours": {
    "monday": {"open": "10:00", "close": "23:00"},
    "tuesday": {"open": "10:00", "close": "23:00"}
  },
  "coordinates": {"lat": 40.7305, "lng": -74.0023},
  "photos": ["https://cdn.yelp.example/photos/biz_4k/1.jpg"],
  "is_open": true
}

Get Business Reviews

GET /api/v1/businesses/{business_id}/reviews?cursor=eyJv...&sort=recent

Response:

{
  "reviews": [
    {
      "id": "rev_8xNqP2...",
      "user": {"id": "usr_3Kd...", "name": "Sarah M.", "avatar_url": "..."},
      "stars": 5,
      "text": "Best pizza in NYC. The crust is perfectly crispy...",
      "photos": ["https://cdn.yelp.example/photos/rev_8x/1.jpg"],
      "created_at": "2026-03-15T14:30:00Z",
      "helpful_count": 42
    }
  ],
  "next_cursor": "eyJvZmZzZXQiOjEwfQ==",
  "total_reviews": 1243
}

Submit a Review

POST /api/v1/businesses/{business_id}/reviews
Authorization: Bearer <token>

Request Body:

{
  "stars": 5,
  "text": "Best pizza in NYC. The crust is perfectly crispy and the sauce is incredible.",
  "photo_urls": ["https://uploads.yelp.example/tmp/abc123.jpg"]
}

Response: 201 Created

{
  "id": "rev_9yMrQ3...",
  "stars": 5,
  "text": "Best pizza in NYC...",
  "created_at": "2026-03-17T10:30:00Z"
}

High-Level Design

The high-level design addresses three core design decisions (DDQs), each building on the previous:

Search Businesses — How do we make 200M businesses searchable by text + location at 50K QPS?
View Business Details — How do we serve business pages with aggregated ratings at sub-100ms latency?
Write a Review — How do we reliably ingest reviews and update all derived data?

Each DDQ follows the same pattern: identify the naive approach, explain why it breaks at scale, then build the solution incrementally.

1. Search Businesses

How do we search 200M businesses by text and location efficiently?

Users type "sushi near me" and expect ranked, filtered results within 200ms. With 200 million business listings, the search path is the single most critical component. Let's build it incrementally.

The Problem

A direct SQL query against the business table hits two walls simultaneously:

Full-text search — WHERE name LIKE '%sushi%' OR description LIKE '%sushi%' requires a full table scan. Even with a GIN index on tsvector, PostgreSQL isn't designed for 50K QPS full-text search across 200M rows.
Geospatial filtering — WHERE ST_DWithin(location, user_point, 5000) computes haversine distance for every candidate. Even with a spatial index, combining this with text search in a single query is slow.

At 50K peak QPS, a relational database simply cannot serve both dimensions efficiently. We need a dedicated search engine.

Yelp search architecture with Elasticsearch and CDC pipeline

The Solution: Elasticsearch for Combined Geo + Text Search

Elasticsearch is purpose-built for this workload. It provides:

Inverted indexes for fast full-text search — terms like "sushi" map directly to document IDs without scanning
geo_point field type with built-in geohash encoding — geo_distance queries efficiently filter by radius
Combined queries — a single bool query can match text, filter by radius, and sort by a scoring function in one pass
Horizontal scaling — shards distribute across data nodes, with each shard handling a subset of the 200M documents

A search query becomes:

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "sushi",
            "fields": ["name^3", "category^2", "description"],
            "type": "best_fields"
          }
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "5km",
            "coordinates": {"lat": 40.75, "lon": -73.99}
          }
        },
        {"term": {"is_active": true}},
        {"term": {"price_level": "$$"}}
      ]
    }
  },
  "sort": [
    {"_score": "desc"},
    {"_geo_distance": {
      "coordinates": {"lat": 40.75, "lon": -73.99},
      "order": "asc"
    }}
  ],
  "size": 20,
  "from": 0
}

The multi_match query scores text relevance with boosted weights — name^3 means a match in the business name is 3× more important than a match in the description. The geo_distance filter eliminates businesses outside the radius without affecting the relevance score. Filters are cached by Elasticsearch for repeated queries.

Why Elasticsearch over alternatives?

Option A: PostgreSQL with PostGIS + Full-Text Search

PostgreSQL supports both tsvector for full-text search and PostGIS for geospatial queries. For a small catalogue (<1M businesses), this works well — a single database handles everything, reducing operational complexity.

Limitation: At 200M rows and 50K QPS, neither the full-text nor the spatial index scales horizontally. PostgreSQL read replicas help but are limited by single-node index size and rebuild times. Query planning for combined text + geo predicates produces suboptimal plans — the optimizer can't efficiently merge two different index types.

Option B: Separate Text Index + PostGIS

Use Elasticsearch for text search, PostGIS for geospatial filtering, and merge results in the application layer. Each system does what it's best at.

Limitation: Application-level joins are complex and slow. You need to query both systems, intersect results, then sort by combined relevance. Two network round-trips, two result sets in memory, and edge cases where pagination breaks because the intersection is smaller than either result set.

Option C: Elasticsearch for Both (Recommended)

Elasticsearch handles text and geo in a single query plan. The bool query composes naturally — text match in must, geo filter in filter, additional filters stack without performance cliffs. Horizontal scaling via sharding supports 200M documents and 50K QPS.

Trade-off: Elasticsearch is not the source of truth — it's a read-optimized index that must be kept in sync with the primary database. This sync mechanism becomes a critical architectural decision.

Keeping the Search Index Fresh: How Do Changes Reach Elasticsearch?

When a business owner updates their hours or a new business registers, how does Elasticsearch learn about the change? Two primary approaches:

Approach 1: Dual Write (Simple but Fragile)

The application writes to both PostgreSQL and Elasticsearch in the same request handler:

def update_business(business):
    db.update(business)                  # Step 1: Write to database
    elasticsearch.index(business)        # Step 2: Write to search index

Advantages:

Trivially simple — no extra infrastructure
Changes appear in search immediately (synchronous)
Easy to understand, debug, and test

Failure mode: If the database write succeeds but the Elasticsearch write fails (network timeout, ES cluster overloaded), the data diverges silently. The database shows updated hours, but search returns stale data. There's no built-in mechanism to detect or repair this drift.

At small scale (<10K businesses), a nightly reconciliation script fixes drift before users notice. At 200M businesses, reconciliation takes hours and can't keep up with change volume.

Approach 2: Change Data Capture — CDC (Reliable at Scale)

A CDC connector (Debezium) reads the database's write-ahead log (WAL) and streams every change to Kafka. A consumer reads from Kafka and updates Elasticsearch. The application only writes to the database — it doesn't even know Elasticsearch exists.

How it works:

Application writes to PostgreSQL (business update, new listing, etc.)
PostgreSQL writes the change to its WAL (transaction log) — this happens anyway as part of normal database operation
Debezium reads the WAL entry and publishes a change event to a Kafka topic
An index consumer reads from Kafka and upserts the document in Elasticsearch

Advantages:

Guaranteed consistency — every committed database change eventually reaches the index
Decoupled — the application has zero knowledge of Elasticsearch
Replayable — Kafka retains events; replaying the topic rebuilds the entire index
No performance impact on writes — the application only waits for the database commit

Trade-off: The pipeline introduces operational complexity — Debezium, Kafka, and the consumer all need monitoring, alerting, and failure recovery. CDC lag (typically 1–5 seconds) means the index is slightly behind the database.

Our Choice: CDC for Yelp Scale

At 200M businesses and 50K search QPS, data drift is unacceptable. A user searching "restaurants open now" must see accurate hours. CDC guarantees that every database change reaches the search index, with typical lag under 5 seconds. The operational cost of running the CDC pipeline is justified by the consistency guarantee.

For a startup with <100K listings, dual writes with a reconciliation script would be the pragmatic choice — simpler infrastructure, acceptable drift.

2. View Business Details

How do we serve business detail pages with aggregated ratings at low latency?

A user taps a restaurant in search results. The business page must load in under 100ms, showing the name, hours, photos, a 4.5-star aggregate rating from 1,243 reviews, and the first page of reviews. Let's build this incrementally.

The Problem

The naive approach queries the business row and then computes the aggregate on the fly:

SELECT b.*, AVG(r.stars) AS rating_avg, COUNT(r.id) AS rating_count
FROM businesses b
LEFT JOIN reviews r ON r.business_id = b.id
WHERE b.id = 'biz_4kMBvIEWPxWk'
GROUP BY b.id;

For a popular restaurant with 10,000 reviews, this JOIN + aggregation takes hundreds of milliseconds. At 1,700 page views per second (7K peak), the database is overwhelmed. The aggregate computation runs 7,000 times per second for the same slowly changing data.

Step 1: Separate Business and Review Services

Business metadata (name, hours, location) has different access patterns than reviews (paginated, user-specific, sorted). Splitting into two services allows independent scaling:

Business Service — Reads/writes business metadata from the Business Database
Review Service — Handles review CRUD, pagination, and submission

Yelp business page architecture with separate services and precomputed aggregates

Step 2: Precompute Rating Aggregates

Instead of calculating AVG(stars) on every page load, store rating_avg and rating_count directly on the business row:

ALTER TABLE businesses ADD COLUMN rating_avg DECIMAL(3,2) DEFAULT 0;
ALTER TABLE businesses ADD COLUMN rating_count INTEGER DEFAULT 0;

When a review is submitted, the aggregate is updated asynchronously (via the message queue). The business page reads a single row — no JOIN, no aggregation, no scan of thousands of review rows.

Review Storage: Relational vs Wide-Column

Option 1: PostgreSQL (Recommended)

Reviews have a well-defined schema: user_id, business_id, stars, text, created_at. PostgreSQL provides ACID guarantees — each review is durable the moment the write returns. Flexible indexing supports multiple access patterns: reviews by business (paginated), reviews by user (profile page), and aggregate calculations.

At ~1.2 write QPS, horizontal write scaling isn't needed. 500M reviews at ~1 KB each is ~500 GB — fits on a single node with read replicas for query load. Compound index on (business_id, created_at DESC) supports efficient cursor-based pagination.

Option 2: Cassandra / DynamoDB

Partition by business_id, cluster by created_at. Pagination is natural — read the next N rows from the partition. Scales horizontally with no theoretical limit.

Limitation: Cross-partition queries are harder. "All reviews by user X" requires either a global secondary index (expensive) or a separate denormalized table. Eventual consistency by default complicates aggregate calculations — summing stars across replicas can produce stale results. At 1.2 QPS write throughput, Cassandra's distributed write optimization provides no benefit while adding operational complexity.

Our Choice: PostgreSQL. The write volume (~1.2 QPS) doesn't justify a wide-column store's operational complexity. Relational databases handle our query patterns naturally — paginate by business, query by user, JOIN for analytics — and ACID guarantees simplify the write path. Read replicas handle the read load.

Aggregate Update Mechanism

When a review is written, a review_created event is published to Kafka. An aggregate updater consumer processes the event:

UPDATE businesses
SET rating_count = rating_count + 1,
    rating_avg = (rating_avg * rating_count + NEW_STARS) / (rating_count + 1)
WHERE id = 'biz_4kMBvIEWPxWk';

This incremental formula avoids recomputing from all reviews. The update is idempotent if we track which review IDs have been applied (a processed events table or Kafka offset tracking). The aggregate lags behind the actual review by 1–5 seconds — acceptable for a rating display.

3. Write a Review

How do we reliably ingest reviews and update all derived data?

A user finishes dinner and writes a 4-star review with a photo. The system must save the review durably, then propagate the change to multiple downstream systems: the rating aggregate, the search index, and the cache. Let's build this incrementally.

The Problem

When a user submits a review, the system needs to:

Save the review to the Review Database
Update the business's rating_avg and rating_count in the Business Database
Update search index signals (businesses with more/better reviews rank higher in search)
Invalidate cached business page data

The naive approach does all four synchronously. The user clicks submit → write review → update aggregates → update search index → invalidate cache → return success. If Elasticsearch is slow, the user waits. If Elasticsearch is down, the review submission fails even though the review itself could have been safely saved.

Yelp review write path with async fanout via Kafka

Step 1: Write to Database First, Return Success Immediately

The user cares that their review is saved. They don't care if the aggregate updates in the same request. The write path becomes:

Validate the request (authentication, input validation, duplicate check)
Write the review to the Review Database — commit
Return 201 Created to the user
Publish a review_created event to Kafka (fire-and-forget from the user's perspective)

Now writes are fast (~20ms). But derived data (aggregates, search signals, cache) is temporarily stale. The user submitted a 5-star review, but the business page might still show the old average for a few seconds.

Step 2: Async Fanout via Message Queue

After the review is committed, a review_created event is published to Kafka. Independent consumers handle each downstream update:

Aggregate Updater — Reads the event, increments rating_count and recalculates rating_avg in the Business Database
Index Updater — Updates the business's review_count and rating signals in Elasticsearch so search rankings reflect the new review
Cache Invalidator — Invalidates the cached business page and first review page so the next reader gets fresh data

This decoupling is critical: if Elasticsearch is temporarily down, reviews still save. The index updater retries when ES recovers. Even at low write QPS (~1.2/sec), the queue's value is decoupling and fault isolation, not raw throughput.

Step 3: Read-Your-Writes for the Author

The author should see their own review immediately after submission. Two approaches:

Read from primary — For the author's session, route review reads to the primary database instead of a read replica (which may have replication lag)
Optimistic client-side display — Include the full review in the 201 Created response. The client appends it to the review list locally without re-fetching from the server

Other users see the review with a slight delay (a few seconds) as replicas sync and caches invalidate. This is acceptable — most platforms work this way.

Idempotency and Duplicate Prevention

Users shouldn't submit duplicate reviews by double-clicking or network retries. We enforce this at two levels:

Database constraint: A unique index on (user_id, business_id) prevents duplicate rows. If a user tries to submit a second review for the same business, the INSERT fails with a constraint violation — the application returns a clear error message.

CREATE UNIQUE INDEX idx_unique_user_business
ON reviews(user_id, business_id);

Idempotency key: The client sends a unique X-Idempotency-Key header with each review submission. The server checks if a review with that key already exists — if so, it returns the existing review instead of creating a duplicate. This handles network retries transparently.

Both mechanisms are necessary: the database constraint is the safety net, the idempotency key provides a better user experience (no error on retry).

Photo Upload Flow

Review photos follow a separate upload path to avoid blocking the review submission:

Pre-upload: Client requests a presigned URL from the server: POST /api/v1/uploads/presigned-url
Direct upload: Client uploads the photo directly to object storage (S3) using the presigned URL — this bypasses the server entirely, reducing load
Attach to review: Client includes the uploaded photo's URL in the review submission body
CDN distribution: Photos are served via CDN with aggressive caching — review photos never change after upload

This pattern offloads large binary uploads to object storage infrastructure, keeping the review API lightweight and fast.

Deep Dive Questions

How does ranking and relevance work for search results?

Ranking & Relevance

When a user searches "pizza," they expect the best nearby pizzerias — not just the closest ones. Ranking must balance multiple competing signals at 50K QPS across 200M businesses.

The Challenge

Search results need to balance three signals:

Proximity — Users want nearby options, not restaurants an hour away
Quality — High ratings matter; a 4.5-star restaurant should outrank a 2-star one at the same distance
Confidence — A 4.5-star rating from 500 reviews is far more trustworthy than a perfect 5.0 from a single review

At 50K QPS, ranking must be fast — there's no room for complex per-document computation across millions of candidates.

Approach 1: Distance-Only Ranking

Sort results purely by distance from the user's location. Trivially simple and extremely fast — Elasticsearch handles this natively with _geo_distance sort.

Limitation: The closest pizza place might have a 2-star rating with reviews mentioning "cold pizza and rude staff." Users expect quality, not just proximity. Distance-only ranking creates a poor user experience for any query where quality matters more than convenience.

Approach 2: Rating × Inverse Distance

Combine signals with a formula: score = rating × (1 / distance). Higher-rated places rank above closer mediocre ones. This captures both proximity and quality.

Limitation: A brand-new restaurant with a single 5-star review (possibly from the owner) outranks an established 4.5-star restaurant with 500 genuine reviews. Raw star ratings don't account for statistical confidence — small sample sizes produce unreliable averages.

Approach 3: Bayesian Smoothed Ratings

Apply Bayesian smoothing to pull ratings with few reviews toward the global average. The formula:

$smoothed = \frac{\sum stars + prior \times global\_avg}{num\_reviews + prior}$

With prior = 10 and global_avg = 3.7:

A single 5-star review → smoothed to ~3.8 stars (pulled toward average)
500 reviews averaging 4.5 → smoothed to ~4.49 stars (barely affected)

This rewards established businesses with consistent quality over newcomers with inflated ratings.

Limitation: At 50K QPS, applying complex scoring formulas to millions of candidates is too slow. We need a way to limit the scoring to a manageable subset.

Approach 4: Two-Stage Retrieval (Recommended)

Split ranking into two stages:

Stage 1 — Retrieval: The search index returns the top ~1,000 candidates using simple, fast scoring: text relevance (BM25) + geo filter. This leverages precomputed index structures and runs in single-digit milliseconds.

Stage 2 — Reranking: A lightweight reranking service rescores only the 1,000 candidates with richer signals:

Bayesian-smoothed rating
Review freshness (recent reviews weighted more)
Category relevance (exact category match vs. partial)
User personalization (if available)
Photo quality signal (businesses with photos rank higher)

Returns the top 20 results to the user.

This is both fast (only 1,000 documents scored) and accurate (rich signals applied in reranking). The reranking service is stateless and horizontally scalable.

Our Choice: Two-stage retrieval with Bayesian-smoothed ratings in the reranking stage. Start with Elasticsearch's built-in function_score for the retrieval stage. Add a dedicated reranking service when you need ML models, A/B testing flexibility, or personalization — these are hard to implement inside Elasticsearch's scoring pipeline.

Two-stage retrieval and reranking pipeline for Yelp search

How does geo indexing work at scale?

Geo Indexing at Scale

A user at coordinates (40.75, -73.99) searches for pizza within 2 km. With 200 million businesses, we need a spatial index that narrows candidates efficiently. The naive approach — calculating haversine distance for every business — takes seconds and collapses at 50K QPS.

The Challenge

Finding "all points within a radius" appears simple on a 2D plane but is fundamentally hard because:

Earth is a sphere — distance calculations involve trigonometry, not Pythagoras
Standard B-tree indexes handle one dimension, not two
At 200M businesses, even efficient algorithms must eliminate 99.99%+ of candidates before computing exact distances

Approach 1: B-tree on Latitude / Longitude

Index the lat column and query WHERE lat BETWEEN 40.73 AND 40.77. The B-tree narrows to ~1M rows in that latitude band.

Limitation: Latitude 40.75 spans the entire globe — you get businesses in Spain, China, and everywhere at that latitude. A compound index (lat, lng) helps, but B-trees handle one-dimensional ranges efficiently. The second dimension (longitude) is still a scan within the latitude range. Circular radius queries become rectangular bounding boxes that require expensive post-filtering.

Approach 2: Geohash Encoding (Recommended)

Encode latitude/longitude as a single string: (40.75, -73.99) → "dr5ru7". The key insight: nearby locations share prefixes. The prefix "dr5ru" covers a roughly 1 km² area.

To find nearby businesses:

Compute which geohash prefixes intersect the search circle
Query for businesses with those prefixes — this is a simple string prefix match, which B-trees handle efficiently
Post-filter with exact haversine distance — the candidate set is now small

Why this works: Geohashing reduces a 2D spatial problem to a 1D string prefix problem. Standard indexes, caches, and distributed systems all handle strings well.

Edge case: Grid cells are rectangular, so businesses near cell boundaries may be closer than businesses in the same cell. The solution is to also query adjacent cells (the 8 neighbors). This is well-understood and handled automatically by Elasticsearch's geo_point type.

Approach 3: R-Tree (PostGIS)

R-trees group nearby points into bounding rectangles organized hierarchically. A query traverses the tree, pruning branches whose bounding boxes don't intersect the search area. More accurate for complex shapes (polygons, irregular boundaries).

Limitation: R-trees are single-node data structures — harder to distribute across shards. Well-suited for PostgreSQL with PostGIS, but scaling horizontally for 50K QPS requires application-level sharding or read replicas.

Approach 4: S2 / H3 Cells

Modern geo-indexing systems use spherical cells (Google's S2) or hexagonal cells (Uber's H3). These have mathematically superior properties:

S2 cells project the sphere onto a cube, then subdivide — cell sizes are more uniform than geohash rectangles
H3 hexagons have equidistant neighbors (unlike squares, where corner neighbors are √2 farther than edge neighbors)

Limitation: Both require specialized libraries and are less commonly supported by out-of-the-box search engines. Best suited for precision coverage analysis (delivery zones, ride-matching) rather than general proximity search.

Our Choice: Geohash encoding via Elasticsearch's geo_point field type. It combines naturally with text search in a single query, scales horizontally via sharding, and is operationally straightforward. Elasticsearch handles geohash encoding transparently — a geo_distance query is efficient across 200M documents at 50K QPS.

Consider R-tree (PostGIS) if you're already using PostgreSQL and don't need combined text + geo queries. Consider S2/H3 for precise area calculations (delivery zones, ride-matching) where cell uniformity matters.

Geo Distance vs. Road Distance

All spatial indexes above compute geodesic distance — straight-line, "as the crow flies." Real-world travel follows roads. Two restaurants 3 km apart across the San Francisco Bay might be 40 km by road, around the Golden Gate Bridge and back.

How much does this matter? In dense urban grids (Manhattan, central London), geodesic distance closely approximates road distance. Near rivers, bays, mountains, or highways without exits, the gap widens significantly. For areas with many users near such barriers, this affects a meaningful share of results.

A layered approach handles this without slowing search:

First pass: geo distance. The search index filters candidates by straight-line radius. This is fast and eliminates 99.9%+ of irrelevant businesses.
Reranking: cached route estimates. Precompute approximate driving distances between geohash-pair grid cells using a routing API (Google Maps, OSRM). Store results in a lookup table. During reranking, replace geodesic distance with the cached road distance for the top ~1,000 candidates.
Detail view: on-demand routing. When a user opens a specific business page, fetch the actual driving/walking route. A single routing API call is acceptable latency for a detail view.

For interview discussions, acknowledging the geo-vs-road distinction and sketching the layered mitigation demonstrates real-world thinking.

Geohash grid showing how nearby locations share prefixes for efficient spatial queries

What caching strategy handles read-heavy traffic?

Caching Strategy for 1000:1 Read/Write Ratio

With a 1000:1 read-to-write ratio, every read hitting a database won't scale. At 1,700 business page views per second (7K peak), the database connection pool saturates. Caching is not an optimization — it's a requirement.

The Challenge

Caching involves two decisions: what to cache (and for how long) and how to keep the cache fresh when the underlying data changes. Different data types have different staleness tolerances — business hours should be accurate within minutes, but a rating aggregate lagging by a minute is fine.

What to Cache: TTL by Data Type

Data Type	TTL	Rationale
Business metadata (name, hours, location)	1 hour	Rarely changes. High cache hit rate — a popular restaurant's metadata is requested thousands of times per hour
Rating aggregates (avg stars, count)	5 minutes	Changes with each review submission, but slight staleness (minutes) is acceptable for a display number
Search results (for popular queries)	1 minute	Ranking signals change with each review and rating update. Short TTL still absorbs burst traffic effectively
Review pages (paginated)	5 minutes	First page changes most often (new reviews appear at top). Deeper pages are more stable

Invalidation Approach 1: TTL-Only

Let cached entries expire naturally based on their TTL. No invalidation logic needed — the simplest possible caching strategy.

Limitation: A restaurant permanently closes. With a 1-hour TTL on business metadata, users see "Open" for up to an hour. For critical updates like business closures, hours changes, or safety alerts, passive expiration creates unacceptable staleness.

Invalidation Approach 2: Event-Based Invalidation

Publish cache invalidation events whenever data changes:

Business updated → invalidate biz:{business_id} immediately
Review created → invalidate agg:{business_id} and reviews:{business_id}:page:1

This provides near-instant freshness for important changes.

Limitation: If the event system (Kafka consumer, invalidation worker) fails or lags, the cache serves stale data indefinitely — there's no fallback expiration.

Invalidation Approach 3: TTL + Event-Based (Recommended)

Combine both strategies: event-based invalidation for immediate freshness on important changes, TTL as a safety net for eventual expiration. If the event system fails, data still expires at the TTL boundary.

Implementation:

Business metadata changes → fire invalidation event + 1-hour TTL fallback
Review submitted → fire invalidation for aggregate + first review page + 5-minute TTL fallback
Search results → TTL-only (1 minute); no event invalidation needed because the cache turns over fast enough

This provides fast freshness when the event system is healthy and guaranteed maximum staleness when it isn't.

Our Choice: TTL + event-based invalidation. Set TTLs by data type based on staleness tolerance. Fire invalidation events for critical changes (business closures, hours updates, new reviews). Redis cluster with key prefixes (biz:*, agg:*, reviews:*, search:*) handles Yelp-scale workloads; add local in-process caching (e.g., Caffeine/Guava) for extreme hot spots.

How do you handle celebrity businesses (hotspots)?

Hotspot Handling

A viral TikTok video sends 100,000 users to a single restaurant's page within an hour. The page normally receives 10 requests per minute. Now it receives 1,000 requests per second — a 6,000× spike. How do we prevent this from degrading the entire system?

The Challenge

Without protection, the spike hammers a single database row and cache key. The database connection pool fills with queries for one business ID. A single Redis node serving the hot key spikes to 100% CPU. Other businesses' pages slow down. The entire system degrades because of one hot key — a classic noisy neighbor problem.

Mitigation 1: Aggressive Caching with Short TTL

Cache the full business page response with a short TTL (10 seconds). At 1,000 req/sec, the cache absorbs 9,990 requests out of every 10,000 — only one request per TTL window reaches the database.

Limitation: All 1,000 requests per second still hit the same Redis key on the same Redis node. The cache layer itself becomes a hotspot. This is the distinction between reducing database load (solved) and distributing cache load (unsolved).

Mitigation 2: Cache Key Replication

Replicate the hot key across multiple cache nodes: instead of a single biz:viral-restaurant, create biz:viral-restaurant:shard:{0-9}. Each request randomly picks a shard. Load distributes across 10 Redis nodes instead of concentrating on one.

Limitation: When the TTL expires, all replicas expire simultaneously. The next burst of requests all see a cache miss — a thundering herd hits the database at once.

Mitigation 3: Request Coalescing (Single-Flight)

When multiple requests hit a cache miss for the same key simultaneously, only the first request fetches from the database. All others wait for that result. 1,000 simultaneous cache misses collapse into a single database query.

Implementation: Use a distributed lock or an in-process single-flight library (e.g., Go's singleflight, Java's CacheLoader).

Limitation: Only helps when many requests arrive during the same cache-miss window. Doesn't help with different cache keys (e.g., paginated review pages — each page is a separate key).

Mitigation 4: Graceful Degradation

When a service is overwhelmed, serve partial data instead of failing entirely:

Return cached business info without live reviews: "Reviews are temporarily unavailable"
Pre-cache the first 5 review pages; return a "try again later" message for deeper pages
If the Review Service is unresponsive, serve the business page without the review section

Users see the business name, hours, and location — the most critical information for deciding whether to visit. Reviews load when traffic subsides.

Detection and Preemptive Warming

Monitor per-key access rates in real time. When a single key exceeds a threshold (e.g., 500 req/sec), automatically:

Extend the TTL for that key
Replicate the key across additional cache shards
Enable request coalescing for that key's cache misses

Integrating with social media monitoring (tracking Yelp business mentions on TikTok/Instagram) can preemptively warm caches before the traffic spike fully materializes.

Our Choice: Layer all four mitigations: aggressive caching reduces database load, key replication distributes cache load, request coalescing prevents thundering herds, and graceful degradation keeps the system usable under extreme conditions. Automated detection triggers escalation through these layers as traffic intensity increases.

What are the trade-offs between search index sync approaches?

CDC Trade-offs for Search Index Sync

The search index must reflect changes in the Business Database. A restaurant updates its hours from "closes at 9 PM" to "closes at 10 PM." How quickly and reliably does the search index reflect that change? The sync mechanism determines the freshness-vs-complexity trade-off.

Approach 1: Dual Write

The application writes to both the database and the search index in the same request handler:

def update_business(business):
    db.update(business)               # Step 1: primary write
    search_index.upsert(business)     # Step 2: index write

The simplest approach — no extra infrastructure, no event pipeline, no operational overhead.

Failure Mode: If step 2 fails after step 1 succeeds, the database and index silently diverge. There's no built-in mechanism to detect or repair the inconsistency. Wrapping both in a distributed transaction adds latency and couples availability — if the search index is down, database writes fail too.

Best for: Most applications — startups, internal tools, any system where a small team values simplicity. A nightly reconciliation script catches drift before users notice. At Yelp's scale (200M businesses), reconciliation becomes impractical — scanning 200M rows takes hours.

Approach 2: Periodic Batch Reindex

A scheduled job (every 5–15 minutes) queries the database for recently changed records and bulk-updates the index:

def reindex_job():
    changed = db.query(
        "SELECT * FROM businesses WHERE updated_at > %s",
        [last_run_timestamp]
    )
    search_index.bulk_upsert(changed)

Limitation: The index can be up to 15 minutes behind the database. A restaurant marks itself as closed, but search results show it as "Open" for the full reindex interval. The batch job also creates periodic load spikes on the database — a thundering herd of reads every 5 minutes.

Best for: Systems where minutes of staleness are acceptable and operational simplicity matters more than freshness — internal tools, analytics dashboards, secondary search features.

Approach 3: CDC Pipeline (Recommended at Scale)

A CDC connector (Debezium) tails the database's transaction log (WAL) and streams changes to the index in near real-time. The database doesn't know the pipeline exists — no application code changes needed.

CDC solves the consistency problem but introduces four dimensions of operational complexity:

1. Stream Joins. Yelp stores businesses and reviews in separate tables. CDC emits row-level change events — a review event carries business_id but not the business name, hours, or coordinates. The consumer must hydrate each event by joining with business data, either via a stream processor (Kafka Streams, Flink) or a lookup service. This is significantly more complex than the SQL JOIN that dual write can do inline.

2. Infrastructure Sprawl. Dual write has two components: the application and the search index. CDC introduces a longer chain: App → DB (WAL) → Connector → Kafka → Consumer → Search Index. Each link needs monitoring, alerting, and operational runbooks. If the connector crashes and the database log rotates before recovery, those changes are permanently lost — a full reindex is the only recovery path.

3. Schema Evolution. An ALTER TABLE ADD COLUMN changes the binary log format. The connector may not recognize the new field, consumers crash on unknown columns, and the search index has no mapping for them. Production CDC pipelines use a schema registry (Confluent Schema Registry, AWS Glue) that enforces compatibility rules (backward, forward, or full compatibility) so schema changes propagate safely through the pipeline.

4. Ordering and Duplication. A user creates a review, then deletes it seconds later. If the delete event arrives before the create event — possible when the consumer parallelizes across partitions — the review becomes a "zombie": permanently in the index, deleted in the database. The fix is partitioning by entity ID so all events for a given entity are processed sequentially within a single partition.

When to Use Each Approach

The decision depends on three factors: required index freshness, team operational capacity, and write volume.

Scenario	Recommended Approach
Write volume < 10 QPS, small team, staleness acceptable	Dual write + nightly reconciliation
Minutes of staleness acceptable, batch-friendly workload	Periodic reindex every 5–15 minutes
Near real-time freshness required, production-grade system	CDC pipeline with monitoring and alerting

Comparison Summary

Dimension	Dual Write	CDC Pipeline
Implementation effort	Trivial (few lines of code)	High (DevOps/data engineering required)
Data consistency	Poor — network failures cause permanent drift	Excellent — guaranteed eventual consistency
Write latency impact	Higher — user waits for both writes	None — user only waits for DB commit
Coupling	High — application knows about search index schema	Low — application only writes to database
Reindexing	Difficult — must query production database	Easy — replay the Kafka topic from offset 0
Failure detection	Manual — requires reconciliation scripts	Automated — consumer lag monitoring alerts

Verdict for Yelp: CDC is the right choice. Users filtering by "open now" must see accurate hours across 200M businesses. Data drift directly erodes user trust. The operational cost of the CDC pipeline is justified by the consistency guarantee and the ability to replay/rebuild the index from Kafka. For a smaller directory with 10K–100K listings, dual write with reconciliation is the pragmatic choice.

Staff-Level Discussion Topics

The following topics contain open-ended architectural questions without prescriptive answers. They're designed for staff+ conversations where you demonstrate systems thinking, trade-off analysis, and cross-functional strategic reasoning.

Search Index Operations

Context: Your search index has drifted out of sync with the database. Some businesses display incorrect hours, permanently closed businesses still appear in search results, and newly registered businesses aren't discoverable. Users are filing complaints.

Discussion Points:

How do you detect index drift? What metrics or monitoring would surface the problem — consumer lag, document count divergence, spot-check sampling?
How do you design reindex/backfill jobs that don't impact production traffic? (Blue-green indexing, throttled reindex, off-peak scheduling)
What's your strategy for zero-downtime index schema changes? (Alias swapping, reindex-in-background, compatible vs. breaking schema changes)
How do you handle the trade-off between index freshness and indexing throughput? (Batch size tuning, backpressure, consumer scaling)
What organizational processes ensure index health is monitored and maintained over time?

Multi-Region Deployment

Context: Yelp expands to Europe. Users in Paris searching for restaurants should hit European infrastructure, not US servers. But a business owner in the US updates their Paris restaurant's hours — that update must propagate to Europe.

Discussion Points:

How do you route users to the nearest region while maintaining data consistency? (GeoDNS, Anycast, region-aware load balancing)
Where does the source-of-truth database live? How do you handle cross-region writes? (Single-leader, multi-leader, CRDTs, conflict resolution)
What's acceptable replication lag for search index vs. business data vs. reviews?
How do you handle a region outage? Failover strategy — active-passive vs. active-active?
What trade-offs exist between latency, consistency, and operational complexity in a multi-region setup?

Review Integrity and Abuse

Context: A competitor is paying people to leave 1-star reviews on a client's restaurant chain. Legitimate business owners are furious. Your trust & safety team needs engineering solutions alongside policy.

Discussion Points:

What signals indicate fake or coordinated review attacks? (Burst patterns, new account clustering, IP correlation, review text similarity, rating distribution anomalies)
How do you balance aggressive fraud detection with not blocking legitimate negative reviews? (False positive cost vs. false negative cost)
What's the user experience for flagged reviews? Immediate removal vs. held for manual review vs. visible with reduced trust weight?
How do you handle appeals from businesses who claim reviews are fake? (Automated review, human escalation queue, transparency reports)
What cross-functional processes involve legal, support, and engineering in abuse response?

Level Expectations

The following table summarizes what interviewers typically expect at each seniority level when discussing the Yelp system design.

Dimension	Mid-Level (L4)	Senior (L5)	Staff (L6)
Requirements	List core features (search, view, review); identify the read-heavy traffic pattern	Define latency SLAs per endpoint; distinguish consistency requirements for writes vs. derived data	Detailed failure mode analysis; cross-region consistency trade-offs; capacity planning
Search Architecture	Recognize the need for a dedicated search index; mention Elasticsearch	Explain combined geo + text query flow; discuss index sync mechanisms (dual write vs. CDC)	Index operations (zero-downtime reindex, schema evolution); multi-region search; reranking pipeline
Data Model	Separate business and review storage; basic pagination approach	Precomputed rating aggregates; cache strategy with TTL by data type; cursor-based pagination	Hotspot handling (key replication, request coalescing); cross-region data ownership; eventual consistency proofs
Write Path	Async updates for derived data (aggregates, index) via message queue	Idempotency enforcement (unique constraints + idempotency keys); read-your-writes consistency	Abuse detection and review integrity; CDC operational challenges (stream joins, schema evolution, ordering)

Interview Cheatsheet

Core Architecture in 60 Seconds

"Search is geo + text → search index." Direct database queries can't achieve sub-200ms latency at 200M businesses. Elasticsearch handles geo filtering and text matching in a single query. CDC keeps the index in sync with the primary database.

"Business page reads details + reviews + cached aggregates." Separate services for business metadata and reviews (different access patterns, independent scaling). Precompute rating_avg and rating_count — never calculate aggregates on every request. Redis cache with TTL-based + event-driven invalidation.

"Write review to DB first; async updates aggregates and index." Durable write to the source of truth, return success. Then fan out via message queue: update aggregates, refresh search signals, invalidate caches. User sees their review immediately (read-your-writes); others see it within seconds (eventual consistency).

"Cache for read-heavy traffic; eventual consistency for derived data." 1000:1 read/write ratio demands aggressive caching. Business metadata (1h TTL), aggregates (5min TTL), search results (1min TTL), reviews (5min TTL). Layer TTL + event-based invalidation for critical updates.

Key Trade-offs to Mention

Trade-off	Option A	Option B
Search index sync	Dual write (simple, drifts at scale)	CDC pipeline (reliable, operationally complex)
Aggregate computation	Calculate on read (always fresh, slow at scale)	Precompute on write (fast reads, slight lag)
Cache invalidation	TTL-only (simple, may serve stale data)	TTL + events (fresh, more infrastructure)
Geo indexing	Geohash (simple, good for proximity)	R-tree/S2/H3 (precise, harder to distribute)
Review consistency	Strong everywhere (slow, complex)	Tiered: strong writes, eventual derived data
Hotspot handling	Over-provision (expensive, wasteful)	Layered mitigation (complex, efficient)