Challenges in Complex Projects
Introduction
As your SQL projects grow in complexity, you'll encounter challenges that simple queries cannot solve elegantly. Understanding these challenges is the first step toward mastering subqueries – one of SQL's most powerful features.
In this tutorial, you will learn:
- Common challenges in complex SQL projects
- Why single-level queries become limiting
- Real-world scenarios that demand advanced techniques
- How subqueries address these challenges
The Problem with Simple Queries
When you first learn SQL, you write straightforward queries:
SELECT customer_city, COUNT(*) AS total
FROM olist_customers_dataset
GROUP BY customer_city
ORDER BY total DESC;
This works perfectly for simple questions. But what happens when your business asks:
"Show me customers who have spent more than the average customer."
Now you need to:
- Calculate the average spending
- Compare each customer against that average
- Return only those above the threshold
The challenge: You need to use a calculated value (the average) as a filter condition, but you don't know this value ahead of time.
Challenge 1: Filtering by Aggregated Values
The Scenario
You want to find all orders where the payment amount is above the overall average payment.
Why It's Hard
❌ This doesn't work:
SELECT *
FROM olist_order_payments_dataset
WHERE payment_value > AVG(payment_value); -- Error!
Why? Aggregate functions like AVG() cannot be used directly in the WHERE clause. The WHERE clause filters rows before aggregation happens.
Traditional Workaround
Without subqueries, you would need to:
- Run a query to calculate the average
- Note down the result (e.g., 154.10)
- Run another query with that hardcoded value
-- Step 1: Find the average (run separately)
SELECT AVG(payment_value) FROM olist_order_payments_dataset;
-- Result: 154.10
-- Step 2: Use that value (manual process)
SELECT * FROM olist_order_payments_dataset
WHERE payment_value > 154.10;
Problems with this approach:
- Requires multiple queries
- Hardcoded values become outdated
- Manual intervention needed
- Cannot be automated or scheduled
Challenge 2: Comparing Rows to Group Statistics
The Scenario
You want to find products that have a higher-than-average review score within their category.
Why It's Hard
You need to:
- Calculate the average review score per category
- Compare each product's score to its category average
- Return products that exceed their category's average
This requires comparing individual rows against aggregated group data – something a single-level query cannot do directly.
The Complexity
┌─────────────────────────────────────────────────────┐
│ Product A (Category: Electronics) │
│ Review Score: 4.5 │
│ │
│ Need to compare against: │
│ Average score of ALL Electronics products │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ Calculate average by category FIRST │
│ Then compare each product to its category average │
└─────────────────────────────────────────────────────┘
You need a way to nest one query inside another.
Challenge 3: Existence Checks
The Scenario
Find all customers who have never placed an order.
Why It's Hard
With a simple JOIN:
SELECT c.*
FROM olist_customers_dataset c
INNER JOIN olist_orders_dataset o ON c.customer_id = o.customer_id;
This gives customers who have orders. But you want the opposite – customers who don't exist in the orders table.
The Challenge
You need to check for the absence of matching rows, not their presence. This requires:
- Looking up data in another table
- Checking if rows exist or don't exist
- Returning results based on that check
While LEFT JOIN with NULL check works, subqueries with EXISTS or NOT IN often provide clearer, more intuitive solutions.
Challenge 4: Dynamic Value Lists
The Scenario
Find all orders from customers located in the top 5 states by customer count.
Why It's Hard
First, you need to identify the top 5 states:
SELECT customer_state
FROM olist_customers_dataset
GROUP BY customer_state
ORDER BY COUNT(*) DESC
LIMIT 5;
Then, you need to use those states as a filter:
SELECT *
FROM olist_orders_dataset o
JOIN olist_customers_dataset c ON o.customer_id = c.customer_id
WHERE c.customer_state IN ('SP', 'RJ', 'MG', 'RS', 'PR'); -- Hardcoded!
The Problem
- The list of top 5 states changes as new customers are added
- Hardcoding values makes your query static and brittle
- You want the query to dynamically determine which states to include
Challenge 5: Multi-Step Calculations
The Scenario
Calculate the percentage of total revenue that each product category contributes.
Why It's Hard
You need to:
- Calculate total revenue (sum of all sales)
- Calculate revenue per category
- Divide each category's revenue by the total
- Display as a percentage
The Multi-Step Nature
Step 1: Total Revenue = $15,000,000
↓
Step 2: Category Revenue
Electronics = $5,000,000
Furniture = $3,000,000
...
↓
Step 3: Percentage
Electronics = 5M / 15M = 33.3%
Furniture = 3M / 15M = 20.0%
Each step depends on a calculation from another level. A flat, single query cannot express this dependency naturally.
The Solution: Subqueries
All these challenges share a common theme:
You need to use the result of one query inside another query.
This is exactly what subqueries provide.
What is a Subquery?
A subquery is a query nested inside another query. It allows you to:
- Calculate values dynamically
- Compare against aggregated data
- Check for existence of related rows
- Generate dynamic lists for filtering
- Build complex logic step by step
Preview: Solving Challenge 1
Remember the average payment challenge?
-- With a subquery:
SELECT *
FROM olist_order_payments_dataset
WHERE payment_value > (
SELECT AVG(payment_value)
FROM olist_order_payments_dataset
);
The inner query calculates the average, and the outer query uses that result directly. No manual steps, no hardcoded values.
Coming Up Next
In the following tutorials, you'll learn:
- What exactly is a subquery and its formal definition
- Why use subqueries over other techniques
- Types of subqueries and where they can appear
- Step-by-step examples with our e-commerce dataset
Summary
Key Takeaways
-
Complex projects require more than simple SELECT queries
-
Common challenges include:
- Filtering by aggregated values (averages, sums, counts)
- Comparing rows to group-level statistics
- Checking for existence or non-existence of related data
- Creating dynamic value lists for filtering
- Performing multi-step calculations
-
Traditional workarounds require:
- Multiple separate queries
- Hardcoded values
- Manual intervention
- Maintenance as data changes
-
Subqueries solve these problems by allowing you to nest queries
-
Benefits of subqueries:
- Dynamic calculations
- Single, self-contained queries
- No manual intervention needed
- Always uses current data
What's Next?
Now that you understand the challenges, let's formally learn what a subquery is and how to construct one.