Introduction to Hashing
Description
Hashing is a technique that maps data of arbitrary size to fixed-size values using a hash function. The resulting value (called a hash code or hash value) is used as an index into a hash table — a data structure that enables near-instant lookups, insertions, and deletions.
The three core components of hashing are:
- Key — The input data you want to store or look up (e.g., a string, number, or object)
- Hash Function — A function that converts the key into an index within a fixed range
- Hash Table — An array-like structure where data is stored at positions determined by the hash function
Hashing solves a fundamental problem: how can we check if an element exists in a collection in O(1) time, rather than O(n) for linear search or O(log n) for binary search?
In this lesson, we will understand how hashing works, explore what makes a good hash function, learn how to handle collisions (when two different keys produce the same hash), and see the distinction between a Hash Set (stores only keys) and a Hash Map (stores key-value pairs).
Given an array of integers and a series of queries, implement a hash-based structure to answer "does element X exist in the array?" in O(1) average time.
Examples
Example 1
Input: keys = ["apple", "banana", "cherry"], table_size = 7
Hash Computation:
- hash("apple") = (97+112+112+108+101) % 7 = 530 % 7 = 5
- hash("banana") = (98+97+110+97+110+97) % 7 = 609 % 7 = lemen
- hash("cherry") = (99+104+101+114+114+121) % 7 = 653 % 7 = 2
Hash Table:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Value | — | — | cherry | banana | — | apple | — |
Explanation: Each key is transformed by the hash function into an index. The key is stored at that index in the table. To look up "apple", we compute hash("apple")=5 and directly access index 5 — no searching needed.
Example 2
Input: arr = [4, 7, 1, 9, 3], query = 7
Without Hashing (Linear Search):
Compare 7 with arr[0]=4 (no), arr[1]=7 (YES!) → Found at index 1. Took 2 comparisons.
Worst case: O(n) comparisons.
With Hashing (Hash Set):
- Build hash set by inserting all elements: {4, 7, 1, 9, 3}
- Query: Is 7 in the set? → YES (single O(1) lookup)
Explanation: With a hash set, every lookup is O(1) regardless of the collection size. For repeated queries, this is dramatically faster than scanning the array each time.
Example 3 — Collision
Input: keys = [15, 25, 35], table_size = 10
Hash Function: hash(key) = key % 10
- hash(15) = 15 % 10 = 5
- hash(25) = 25 % 10 = 5 ← Same index as 15! COLLISION
- hash(35) = 35 % 10 = 5 ← Another collision!
Chaining Resolution: At index 5, store a linked list: 15 → 25 → 35
Explanation: When multiple keys hash to the same index, this is called a collision. One common resolution is chaining — each slot holds a list of all elements that map to it. Lookups check the list at the computed index.
Constraints
- Hash table size is typically a prime number for better distribution
- A good hash function should distribute keys uniformly across the table
- Average time complexity: O(1) for insert, delete, and search
- Worst case (all keys collide): O(n) — but extremely rare with a good hash function
- Space complexity: O(n) for storing n elements
Editorial
Brute Force
Intuition
Before understanding why hashing is powerful, let's first see the problem it solves. Suppose you have a collection of n elements and you repeatedly need to check whether a given element exists in it.
Without hashing, the simplest approach is linear search: scan through every element until you find a match or reach the end.
Imagine you have an unsorted bookshelf with 1000 books. Someone asks, "Do you have War and Peace?" You'd have to check each book one by one from left to right. If the book is the very last one — or doesn't exist at all — you'd check all 1000 books.
For a single query, linear search is acceptable. But what if you receive 1000 queries? That's 1000 × 1000 = 1,000,000 comparisons in the worst case.
Step-by-Step Explanation
Let's trace a simple existence check with arr = [4, 7, 1, 9, 3] and queries = [7, 5]:
Query 1: Does 7 exist?
- Compare 7 with arr[0]=4 → not equal
- Compare 7 with arr[1]=7 → MATCH! Return true.
- Took 2 comparisons.
Query 2: Does 5 exist?
- Compare 5 with arr[0]=4 → not equal
- Compare 5 with arr[1]=7 → not equal
- Compare 5 with arr[2]=1 → not equal
- Compare 5 with arr[3]=9 → not equal
- Compare 5 with arr[4]=3 → not equal
- Reached end. Return false.
- Took 5 comparisons (worst case).
Total for 2 queries: 7 comparisons. For q queries on n elements: O(q × n) total.
Linear Search — Scanning for Element Existence — Watch the pointer scan element by element to find whether the query exists. Every element must potentially be checked.
Algorithm
- Given array
arrof n elements and query elementx - For each index
ifrom 0 to n-1:- If
arr[i] == x, return true (element found)
- If
- If loop finishes without finding x, return false
- For q queries, repeat steps 2-3 for each query
Code
#include <vector>
using namespace std;
// Linear search for each query — O(q * n)
bool linearSearch(vector<int>& arr, int x) {
for (int i = 0; i < arr.size(); i++) {
if (arr[i] == x) return true;
}
return false;
}
vector<bool> answerQueries(vector<int>& arr, vector<int>& queries) {
vector<bool> results;
for (int q : queries) {
results.push_back(linearSearch(arr, q));
}
return results;
}# Linear search for each query — O(q * n)
def linear_search(arr, x):
for i in range(len(arr)):
if arr[i] == x:
return True
return False
def answer_queries(arr, queries):
results = []
for q in queries:
results.append(linear_search(arr, q))
return resultsimport java.util.*;
public class Solution {
// Linear search for each query — O(q * n)
static boolean linearSearch(int[] arr, int x) {
for (int i = 0; i < arr.length; i++) {
if (arr[i] == x) return true;
}
return false;
}
static List<Boolean> answerQueries(int[] arr, int[] queries) {
List<Boolean> results = new ArrayList<>();
for (int q : queries) {
results.add(linearSearch(arr, q));
}
return results;
}
}Complexity Analysis
Time Complexity: O(q × n)
For each of the q queries, we scan through the entire array of n elements in the worst case. This gives O(q × n) total time.
Space Complexity: O(1)
Linear search uses no extra data structures — just a loop variable. Space is O(1) excluding the output.
Why This Approach Is Not Efficient
Linear search takes O(n) per query. If we have q queries, the total time is O(q × n). With n = 10^5 and q = 10^5, that's 10^10 operations — impossibly slow.
The root problem: we're not remembering anything between queries. Each query starts from scratch, scanning the array from the beginning. There's no structure that lets us jump directly to where an element should be.
Hashing solves this by building a precomputed lookup structure in O(n) time, after which each query is answered in O(1) average time. Total: O(n + q) instead of O(n × q).
The key insight: a hash function converts the element directly into an address (index). Instead of searching for an element, we compute where it should be. This transforms search from a comparison-based operation into an arithmetic operation.
Optimal Approach - Hash Table
Intuition
A hash table is like a well-organized library with a special catalogue system. Instead of walking through every shelf to find a book, you consult the catalogue: it tells you exactly which shelf and position your book is on. The catalogue is the hash function, and the shelves form the hash table.
How it works:
-
Hash Function: Takes a key and produces an index. For integers, a simple hash function is
h(key) = key % table_size. For strings, we might sum the ASCII values and take modulo. -
Insertion: To insert element
x, computeindex = h(x), then storexattable[index]. -
Lookup: To check if
xexists, computeindex = h(x), then check ifxis attable[index]. This is O(1)! -
Collision Handling: Sometimes two different keys produce the same index. For example, with table_size=10, both 15 and 25 hash to index 5. This is a collision. Two common solutions:
- Chaining: Each slot holds a linked list of all elements that hash to that index
- Open Addressing: If a slot is occupied, probe the next slot (linear probing, quadratic probing, or double hashing)
Hash Set vs Hash Map:
- Hash Set: Stores only keys. Used to check existence: "Is X in the set?" (like Python's
set, C++unordered_set, JavaHashSet) - Hash Map: Stores key-value pairs. Used to associate values with keys: "What is the value for key X?" (like Python's
dict, C++unordered_map, JavaHashMap)
Step-by-Step Explanation
Let's build a hash table of size 7 and insert elements [4, 7, 1, 9, 3], then query for 7 and 5.
Hash Function: h(key) = key % 7
Step 1 — Insert 4: h(4) = 4 % 7 = 4. Store 4 at index 4.
Table: [_, _, _, _, 4, _, _]
Step 2 — Insert 7: h(7) = 7 % 7 = 0. Store 7 at index 0.
Table: [7, _, _, _, 4, _, _]
Step 3 — Insert 1: h(1) = 1 % 7 = 1. Store 1 at index 1.
Table: [7, 1, _, _, 4, _, _]
Step 4 — Insert 9: h(9) = 9 % 7 = 2. Store 9 at index 2.
Table: [7, 1, 9, _, 4, _, _]
Step 5 — Insert 3: h(3) = 3 % 7 = 3. Store 3 at index 3.
Table: [7, 1, 9, 3, 4, _, _]
Step 6 — Query: Does 7 exist?
h(7) = 0. Check table[0] = 7. MATCH! Return true. ✓ (1 operation)
Step 7 — Query: Does 5 exist?
h(5) = 5. Check table[5] = empty. Return false. ✗ (1 operation)
Both queries took O(1) time — no scanning needed!
Hash Table Construction and Lookup — Watch how elements are hashed to compute their index and inserted into the table. Then see how queries are answered in O(1) by computing the hash and checking the slot directly.
Algorithm
Building the Hash Table:
- Choose a table size
m(preferably prime for better distribution) - Define hash function:
h(key) = key % m - For each element in the input array:
- Compute
index = h(element) - Insert element at
table[index] - If a collision occurs, use chaining (linked list at each slot)
- Compute
Answering Queries:
- For query element
x:- Compute
index = h(x) - Check if
xexists attable[index](or in the chain at that index) - Return true/false
- Compute
Collision Handling — Chaining:
- Each slot in the table holds a linked list (or dynamic array)
- On collision, append the new element to the list at that slot
- On lookup, traverse the list at the computed slot
- Average chain length = n/m (called the load factor)
Collision Handling — Open Addressing (Linear Probing):
- If slot
h(key)is occupied, try sloth(key)+1, thenh(key)+2, etc. - Wrap around to the beginning if needed
- On lookup, follow the same probing sequence until finding the key or an empty slot
Code
#include <vector>
#include <unordered_set>
using namespace std;
// Using built-in hash set for O(1) lookups
vector<bool> answerQueries(vector<int>& arr, vector<int>& queries) {
// Build hash set — O(n)
unordered_set<int> hashSet(arr.begin(), arr.end());
// Answer each query in O(1)
vector<bool> results;
for (int q : queries) {
results.push_back(hashSet.count(q) > 0);
}
return results;
}
// --- Manual hash table with chaining ---
class HashTable {
int size;
vector<vector<int>> table; // Each slot is a chain (list)
int hashFunction(int key) {
return ((key % size) + size) % size; // Handle negatives
}
public:
HashTable(int sz) : size(sz), table(sz) {}
void insert(int key) {
int index = hashFunction(key);
table[index].push_back(key);
}
bool search(int key) {
int index = hashFunction(key);
for (int val : table[index]) {
if (val == key) return true;
}
return false;
}
};# Using built-in hash set for O(1) lookups
def answer_queries(arr, queries):
# Build hash set — O(n)
hash_set = set(arr)
# Answer each query in O(1)
return [q in hash_set for q in queries]
# --- Manual hash table with chaining ---
class HashTable:
def __init__(self, size):
self.size = size
self.table = [[] for _ in range(size)] # Each slot is a chain
def _hash(self, key):
return key % self.size
def insert(self, key):
index = self._hash(key)
self.table[index].append(key)
def search(self, key):
index = self._hash(key)
return key in self.table[index]
# Example usage:
ht = HashTable(7)
for num in [4, 7, 1, 9, 3]:
ht.insert(num)
print(ht.search(7)) # True — O(1)
print(ht.search(5)) # False — O(1)import java.util.*;
public class Solution {
// Using built-in HashSet for O(1) lookups
static List<Boolean> answerQueries(int[] arr, int[] queries) {
// Build hash set — O(n)
Set<Integer> hashSet = new HashSet<>();
for (int num : arr) hashSet.add(num);
// Answer each query in O(1)
List<Boolean> results = new ArrayList<>();
for (int q : queries) {
results.add(hashSet.contains(q));
}
return results;
}
// --- Manual hash table with chaining ---
static class HashTable {
int size;
List<List<Integer>> table;
HashTable(int sz) {
size = sz;
table = new ArrayList<>();
for (int i = 0; i < sz; i++) {
table.add(new ArrayList<>());
}
}
int hashFunction(int key) {
return ((key % size) + size) % size;
}
void insert(int key) {
int index = hashFunction(key);
table.get(index).add(key);
}
boolean search(int key) {
int index = hashFunction(key);
return table.get(index).contains(key);
}
}
}Complexity Analysis
Time Complexity:
- Build: O(n) — Insert each of n elements in O(1) amortized time
- Query: O(1) average per query, O(q) for q queries
- Total: O(n + q)
Worst case (all elements hash to the same slot): O(n) per query due to long chain. This is extremely rare with a good hash function and appropriately sized table.
Space Complexity: O(n)
The hash table stores all n elements. With chaining, each element occupies one node in a chain. The table itself has m slots (where m is the table size, usually proportional to n).
Load Factor: α = n/m. When α approaches 1, collisions increase. Most implementations automatically resize (rehash) when α exceeds a threshold (e.g., 0.75 in Java's HashMap).
Why this is optimal: We must read all n input elements (Ω(n) lower bound). We must answer each of q queries (Ω(q) lower bound). Our O(n + q) solution matches both bounds. No algorithm can do better for this problem.
Key Takeaways
When to Use Hashing
- Fast lookups: Check if an element exists in O(1)
- Counting frequencies: Count occurrences using a hash map
- Finding pairs/groups: Two Sum, grouping anagrams, etc.
- Deduplication: Remove duplicates using a hash set
- Caching: Memoization with hash maps
When NOT to Use Hashing
- Ordered data needed: Hash tables don't maintain insertion or sorted order (use balanced BST or sorted array instead)
- Range queries: "Find all elements between 10 and 20" — hash tables can't do this efficiently (use segment trees or sorted structures)
- Worst-case guarantees: If O(1) worst-case is required, hashing (which has O(n) worst case) may not be suitable
Hash Set vs Hash Map — Quick Reference
| Feature | Hash Set | Hash Map |
|---|---|---|
| Stores | Keys only | Key-value pairs |
| Use case | Existence checking | Associating values |
| C++ | unordered_set | unordered_map |
| Python | set() | dict() / {} |
| Java | HashSet | HashMap |
Properties of a Good Hash Function
- Deterministic: Same input always produces the same output
- Uniform distribution: Spreads keys evenly across the table
- Efficient: Computes quickly (O(1) for fixed-size keys)
- Minimizes collisions: Different keys should produce different indices as much as possible