Skip to main content

Challenges in Complex Projects

Introduction

As your SQL projects grow in complexity, you'll encounter challenges that simple queries cannot solve elegantly. Understanding these challenges is the first step toward mastering subqueries – one of SQL's most powerful features.

In this tutorial, you will learn:

  • Common challenges in complex SQL projects
  • Why single-level queries become limiting
  • Real-world scenarios that demand advanced techniques
  • How subqueries address these challenges

The Problem with Simple Queries

When you first learn SQL, you write straightforward queries:

SELECT customer_city, COUNT(*) AS total
FROM olist_customers_dataset
GROUP BY customer_city
ORDER BY total DESC;

This works perfectly for simple questions. But what happens when your business asks:

"Show me customers who have spent more than the average customer."

Now you need to:

  1. Calculate the average spending
  2. Compare each customer against that average
  3. Return only those above the threshold

The challenge: You need to use a calculated value (the average) as a filter condition, but you don't know this value ahead of time.

Challenge 1: Filtering by Aggregated Values

The Scenario

You want to find all orders where the payment amount is above the overall average payment.

Why It's Hard

This doesn't work:

SELECT *
FROM olist_order_payments_dataset
WHERE payment_value > AVG(payment_value);  -- Error!

Why? Aggregate functions like AVG() cannot be used directly in the WHERE clause. The WHERE clause filters rows before aggregation happens.

Traditional Workaround

Without subqueries, you would need to:

  1. Run a query to calculate the average
  2. Note down the result (e.g., 154.10)
  3. Run another query with that hardcoded value
-- Step 1: Find the average (run separately)
SELECT AVG(payment_value) FROM olist_order_payments_dataset;
-- Result: 154.10

-- Step 2: Use that value (manual process)
SELECT * FROM olist_order_payments_dataset
WHERE payment_value > 154.10;

Problems with this approach:

  • Requires multiple queries
  • Hardcoded values become outdated
  • Manual intervention needed
  • Cannot be automated or scheduled

Challenge 2: Comparing Rows to Group Statistics

The Scenario

You want to find products that have a higher-than-average review score within their category.

Why It's Hard

You need to:

  1. Calculate the average review score per category
  2. Compare each product's score to its category average
  3. Return products that exceed their category's average

This requires comparing individual rows against aggregated group data – something a single-level query cannot do directly.

The Complexity

┌─────────────────────────────────────────────────────┐
│  Product A (Category: Electronics)                  │
│  Review Score: 4.5                                  │
│                                                     │
│  Need to compare against:                           │
│  Average score of ALL Electronics products          │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│  Calculate average by category FIRST               │
│  Then compare each product to its category average │
└─────────────────────────────────────────────────────┘

You need a way to nest one query inside another.

Challenge 3: Existence Checks

The Scenario

Find all customers who have never placed an order.

Why It's Hard

With a simple JOIN:

SELECT c.*
FROM olist_customers_dataset c
INNER JOIN olist_orders_dataset o ON c.customer_id = o.customer_id;

This gives customers who have orders. But you want the opposite – customers who don't exist in the orders table.

The Challenge

You need to check for the absence of matching rows, not their presence. This requires:

  • Looking up data in another table
  • Checking if rows exist or don't exist
  • Returning results based on that check

While LEFT JOIN with NULL check works, subqueries with EXISTS or NOT IN often provide clearer, more intuitive solutions.

Challenge 4: Dynamic Value Lists

The Scenario

Find all orders from customers located in the top 5 states by customer count.

Why It's Hard

First, you need to identify the top 5 states:

SELECT customer_state
FROM olist_customers_dataset
GROUP BY customer_state
ORDER BY COUNT(*) DESC
LIMIT 5;

Then, you need to use those states as a filter:

SELECT *
FROM olist_orders_dataset o
JOIN olist_customers_dataset c ON o.customer_id = c.customer_id
WHERE c.customer_state IN ('SP', 'RJ', 'MG', 'RS', 'PR');  -- Hardcoded!

The Problem

  • The list of top 5 states changes as new customers are added
  • Hardcoding values makes your query static and brittle
  • You want the query to dynamically determine which states to include

Challenge 5: Multi-Step Calculations

The Scenario

Calculate the percentage of total revenue that each product category contributes.

Why It's Hard

You need to:

  1. Calculate total revenue (sum of all sales)
  2. Calculate revenue per category
  3. Divide each category's revenue by the total
  4. Display as a percentage

The Multi-Step Nature

Step 1: Total Revenue = $15,000,000
        ↓
Step 2: Category Revenue
        Electronics = $5,000,000
        Furniture = $3,000,000
        ...
        ↓
Step 3: Percentage
        Electronics = 5M / 15M = 33.3%
        Furniture = 3M / 15M = 20.0%

Each step depends on a calculation from another level. A flat, single query cannot express this dependency naturally.

The Solution: Subqueries

All these challenges share a common theme:

You need to use the result of one query inside another query.

This is exactly what subqueries provide.

What is a Subquery?

A subquery is a query nested inside another query. It allows you to:

  • Calculate values dynamically
  • Compare against aggregated data
  • Check for existence of related rows
  • Generate dynamic lists for filtering
  • Build complex logic step by step

Preview: Solving Challenge 1

Remember the average payment challenge?

-- With a subquery:
SELECT *
FROM olist_order_payments_dataset
WHERE payment_value > (
    SELECT AVG(payment_value)
    FROM olist_order_payments_dataset
);

The inner query calculates the average, and the outer query uses that result directly. No manual steps, no hardcoded values.

Coming Up Next

In the following tutorials, you'll learn:

  1. What exactly is a subquery and its formal definition
  2. Why use subqueries over other techniques
  3. Types of subqueries and where they can appear
  4. Step-by-step examples with our e-commerce dataset

Summary

Key Takeaways

  1. Complex projects require more than simple SELECT queries

  2. Common challenges include:

    • Filtering by aggregated values (averages, sums, counts)
    • Comparing rows to group-level statistics
    • Checking for existence or non-existence of related data
    • Creating dynamic value lists for filtering
    • Performing multi-step calculations
  3. Traditional workarounds require:

    • Multiple separate queries
    • Hardcoded values
    • Manual intervention
    • Maintenance as data changes
  4. Subqueries solve these problems by allowing you to nest queries

  5. Benefits of subqueries:

    • Dynamic calculations
    • Single, self-contained queries
    • No manual intervention needed
    • Always uses current data

What's Next?

Now that you understand the challenges, let's formally learn what a subquery is and how to construct one.