Design HashSet

Description

Design a HashSet data structure from scratch without using any built-in hash table libraries.

A HashSet is a collection that stores unique elements and supports three core operations:

add(key): Insert the value key into the set. If the key already exists, the set remains unchanged.
contains(key): Check whether the value key exists in the set. Return true if it does, false otherwise.
remove(key): Delete the value key from the set. If the key does not exist, do nothing.

Your task is to implement the MyHashSet class with these three methods, handling all operations correctly and efficiently.

Examples

Example 1

Input:

["MyHashSet", "add", "add", "contains", "contains", "add", "contains", "remove", "contains"]
[[], [1], [2], [1], [3], [2], [2], [2], [2]]

Output:

[null, null, null, true, false, null, true, null, false]

Explanation:

MyHashSet myHashSet = new MyHashSet(); — Create an empty set.
myHashSet.add(1); — The set becomes {1}.
myHashSet.add(2); — The set becomes {1, 2}.
myHashSet.contains(1); — Returns true because 1 is in the set.
myHashSet.contains(3); — Returns false because 3 is not in the set.
myHashSet.add(2); — The set remains {1, 2} since 2 already exists.
myHashSet.contains(2); — Returns true because 2 is in the set.
myHashSet.remove(2); — The set becomes {1} after removing 2.
myHashSet.contains(2); — Returns false because 2 was removed.

Constraints

0 ≤ key ≤ 10^6
At most 10^4 calls will be made to add, remove, and contains.

Editorial

Brute Force

Intuition

The simplest way to design a HashSet is to completely avoid hashing altogether. Since the problem tells us that keys range from 0 to 1,000,000, we can create a giant boolean array of size 1,000,001 where each index represents a possible key.

Think of it like a wall of light switches in a hallway, one switch for every possible number from 0 to 1,000,000. To add a number, flip its switch ON. To remove it, flip it OFF. To check if it exists, just look at whether the switch is ON or OFF.

This approach is called direct addressing — we use the key itself as the index into our storage array. It works perfectly when the key range is known and bounded, but it wastes enormous memory if only a few keys are actually stored.

Step-by-Step Explanation

Let's trace through the operations from Example 1:

Step 1: Initialize a boolean array data of size 1,000,001, all set to false.

data = [false, false, false, false, ...] (1,000,001 slots)

Step 2: add(1) — Set data[1] = true.

data[1] is now true. The set logically contains {1}.

Step 3: add(2) — Set data[2] = true.

data[2] is now true. The set logically contains {1, 2}.

Step 4: contains(1) — Return data[1].

data[1] is true, so return true. Key 1 exists.

Step 5: contains(3) — Return data[3].

data[3] is false, so return false. Key 3 does not exist.

Step 6: add(2) — Set data[2] = true.

data[2] was already true, so nothing effectively changes. Duplicates are naturally handled.

Step 7: contains(2) — Return data[2].

data[2] is true, so return true.

Step 8: remove(2) — Set data[2] = false.

data[2] is now false. The set logically contains {1}.

Step 9: contains(2) — Return data[2].

data[2] is false, so return false. Key 2 was removed.

Direct Addressing — Boolean Array Operations — Watch how each add, contains, and remove operation directly accesses the boolean array at the key's index, achieving O(1) time for every operation.

Algorithm

Initialize: Create a boolean array data of size 1,000,001, filled with false.
add(key): Set data[key] = true.
remove(key): Set data[key] = false.
contains(key): Return data[key].

Code

class MyHashSet {
private:
    vector<bool> data;

public:
    MyHashSet() {
        data.resize(1000001, false);
    }

    void add(int key) {
        data[key] = true;
    }

    void remove(int key) {
        data[key] = false;
    }

    bool contains(int key) {
        return data[key];
    }
};

class MyHashSet:
    def __init__(self):
        self.data = [False] * 1000001

    def add(self, key: int) -> None:
        self.data[key] = True

    def remove(self, key: int) -> None:
        self.data[key] = False

    def contains(self, key: int) -> bool:
        return self.data[key]

class MyHashSet {
    private boolean[] data;

    public MyHashSet() {
        data = new boolean[1000001];
    }

    public void add(int key) {
        data[key] = true;
    }

    public void remove(int key) {
        data[key] = false;
    }

    public boolean contains(int key) {
        return data[key];
    }
}

Complexity Analysis

Time Complexity: O(1) per operation

Each of the three operations — add, remove, and contains — performs a single array access at a known index. Array access by index is a constant-time operation, so every call runs in O(1).

Space Complexity: O(N) where N = 10^6 + 1

We allocate a boolean array of fixed size 1,000,001 regardless of how many keys are actually stored. Even if we only add 5 keys, we still consume memory for over a million slots. This is extremely wasteful when the actual number of stored keys is much smaller than the key range.

Why This Approach Is Not Efficient

While the boolean array gives us O(1) time for every operation, it has a critical flaw: space waste.

We allocate an array of 1,000,001 booleans no matter what. If the problem only requires storing 10 keys, we are wasting 99.999% of the allocated memory. In real-world systems, the key range might be billions (e.g., IP addresses, user IDs), making a direct-addressing array completely impractical.

Moreover, this approach only works because the problem guarantees keys are non-negative integers within a fixed range. It cannot handle string keys, negative numbers, or keys beyond the pre-allocated range.

The fundamental insight: a real hash set uses a hash function to map keys to a much smaller array of buckets, and handles the inevitable collisions using techniques like chaining (linked lists within each bucket). This lets us use O(K) space where K is the number of stored keys, rather than O(N) where N is the entire key range.

Optimal Approach - Hash Function with Bucket Chaining

Intuition

A proper hash set works like a library with numbered shelves. Instead of having a separate shelf for every possible book (which would require billions of shelves), we use a formula to decide which shelf each book goes on. For example, shelf number = book_id modulo total_shelves. Multiple books might end up on the same shelf — that is called a collision — but we simply stack them on the same shelf and search through the small stack when needed.

Here is the design:

Choose a bucket count: Pick a number of buckets (e.g., 1000). A prime number is often chosen to reduce clustering, but any reasonable size works.
Hash function: hash(key) = key % bucket_count. This maps any key to a bucket index between 0 and bucket_count - 1.
Collision resolution via chaining: Each bucket holds a list of keys that hash to the same index. When adding, we check the list first to avoid duplicates. When searching or removing, we scan the list.

With a good bucket count, each bucket's list stays very short on average, so operations are effectively O(1) amortized while using memory proportional to the number of stored keys.

Hash set with bucket chaining: keys are mapped to buckets via modulo, and collisions are resolved by storing multiple keys in the same bucket's linked list

Step-by-Step Explanation

Let's trace through the Example 1 operations using bucket_count = 5 and hash(key) = key % 5:

Step 1: Initialize 5 empty buckets: [[], [], [], [], []].

Step 2: add(1) — hash(1) = 1 % 5 = 1. Bucket 1 is empty, so append 1.

Buckets: [[], [1], [], [], []]

Step 3: add(2) — hash(2) = 2 % 5 = 2. Bucket 2 is empty, so append 2.

Buckets: [[], [1], [2], [], []]

Step 4: contains(1) — hash(1) = 1. Scan Bucket 1: [1]. Found 1! Return true.

Step 5: contains(3) — hash(3) = 3. Scan Bucket 3: []. Empty bucket! Return false.

Step 6: add(2) — hash(2) = 2. Scan Bucket 2: [2]. Key 2 already present, do nothing.

Buckets remain: [[], [1], [2], [], []]

Step 7: contains(2) — hash(2) = 2. Scan Bucket 2: [2]. Found 2! Return true.

Step 8: remove(2) — hash(2) = 2. Scan Bucket 2: [2]. Found 2, remove it.

Buckets: [[], [1], [], [], []]

Step 9: contains(2) — hash(2) = 2. Scan Bucket 2: []. Empty! Return false.

Hash Set with Bucket Chaining — Full Operation Trace — Watch how the hash function maps keys to buckets using modulo, and how chaining handles collisions by storing multiple keys in the same bucket's list.

Algorithm

Initialize: Create an array of bucket_count empty lists (buckets). Choose a reasonable bucket count (e.g., 769 — a prime number).
Hash function: For any key, compute bucket_index = key % bucket_count.
add(key):
- Compute the bucket index.
- Scan the bucket's list. If key is already present, do nothing.
- Otherwise, append key to the list.
remove(key):
- Compute the bucket index.
- Scan the bucket's list. If key is found, remove it.
- If not found, do nothing.
contains(key):
- Compute the bucket index.
- Scan the bucket's list. Return true if key is found, false otherwise.

Code

class MyHashSet {
private:
    static const int BUCKET_COUNT = 769;
    vector<list<int>> buckets;

    int hash(int key) {
        return key % BUCKET_COUNT;
    }

public:
    MyHashSet() {
        buckets.resize(BUCKET_COUNT);
    }

    void add(int key) {
        int idx = hash(key);
        for (int val : buckets[idx]) {
            if (val == key) return;
        }
        buckets[idx].push_back(key);
    }

    void remove(int key) {
        int idx = hash(key);
        buckets[idx].remove(key);
    }

    bool contains(int key) {
        int idx = hash(key);
        for (int val : buckets[idx]) {
            if (val == key) return true;
        }
        return false;
    }
};

class MyHashSet:
    def __init__(self):
        self.bucket_count = 769
        self.buckets = [[] for _ in range(self.bucket_count)]

    def _hash(self, key: int) -> int:
        return key % self.bucket_count

    def add(self, key: int) -> None:
        idx = self._hash(key)
        if key not in self.buckets[idx]:
            self.buckets[idx].append(key)

    def remove(self, key: int) -> None:
        idx = self._hash(key)
        if key in self.buckets[idx]:
            self.buckets[idx].remove(key)

    def contains(self, key: int) -> bool:
        idx = self._hash(key)
        return key in self.buckets[idx]

class MyHashSet {
    private static final int BUCKET_COUNT = 769;
    private List<List<Integer>> buckets;

    public MyHashSet() {
        buckets = new ArrayList<>();
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    private int hash(int key) {
        return key % BUCKET_COUNT;
    }

    public void add(int key) {
        int idx = hash(key);
        if (!buckets.get(idx).contains(key)) {
            buckets.get(idx).add(key);
        }
    }

    public void remove(int key) {
        int idx = hash(key);
        buckets.get(idx).remove(Integer.valueOf(key));
    }

    public boolean contains(int key) {
        int idx = hash(key);
        return buckets.get(idx).contains(key);
    }
}

Complexity Analysis

Time Complexity: O(N / K) per operation on average, where N is the number of stored keys and K is the bucket count.

With a well-chosen bucket count (e.g., 769) and a reasonable number of keys (up to 10^4), the average chain length per bucket is about 10^4 / 769 ≈ 13 elements. Scanning a chain of 13 elements is effectively constant. In the best case (uniform distribution), this approaches O(1). In the worst case (all keys hash to the same bucket), it degrades to O(N), but this is extremely unlikely with a prime bucket count.

Space Complexity: O(K + N) where K is the bucket count and N is the number of stored keys.

We allocate K empty bucket lists upfront, and each of the N inserted keys occupies one slot in one bucket's list. This is much more memory-efficient than the boolean array approach when N is much smaller than the key range (10^6).

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

MEDIUM

HARD

EASY

MEDIUM

HARD

MEDIUM

HARD

EASY

MEDIUM

HARD

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

HARD

EASY

MEDIUM

EASY

MEDIUM

Design HashSet

Description

Examples

Example 1

Constraints

Editorial

Brute Force

Intuition

Step-by-Step Explanation

Algorithm

Code

Complexity Analysis

Why This Approach Is Not Efficient

Optimal Approach - Hash Function with Bucket Chaining

Intuition

Step-by-Step Explanation

Algorithm

Code

Complexity Analysis