Skip to main content

KMP / Rabin-Karp String Search

Description

Given two strings haystack and needle, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

This is the classic substring search (or string matching) problem — one of the most fundamental problems in computer science. While built-in library functions solve it in one call, understanding the algorithms behind it (naive sliding window, Rabin-Karp rolling hash, and KMP prefix matching) is essential for interviews and for solving harder string problems that build on these techniques.

Examples

Example 1

Input: haystack = "sadbutsad", needle = "sad"

Output: 0

Explanation: The string "sad" appears at index 0 and at index 6 inside "sadbutsad". Since we want the first occurrence, we return 0.

Example 2

Input: haystack = "leetcode", needle = "leeto"

Output: -1

Explanation: "leeto" does not appear anywhere in "leetcode". The first four characters "leet" match, but the fifth character 'o' does not match 'c'. No other starting position works either, so we return -1.

Example 3

Input: haystack = "aabaabaaf", needle = "aabaaf"

Output: 3

Explanation: Starting at index 0, "aabaab" does not match "aabaaf" (the sixth character differs: 'b' vs 'f'). Starting at index 3, "aabaaf" matches perfectly. So we return 3. This example is particularly interesting because a naive approach wastes work on the failed attempt at index 0, while KMP reuses the partial match.

Constraints

  • 1 ≤ haystack.length, needle.length ≤ 10^4
  • haystack and needle consist of only lowercase English characters.

Editorial

Brute Force

Intuition

The most natural approach is to try every possible starting position in the haystack and check whether the needle matches character by character at that position.

Think of it like searching for a word on a ruler. You place the word over the ruler at position 0, compare letter by letter. If any letter disagrees, you slide the word one position to the right and start comparing from scratch. You repeat until the word fits perfectly somewhere or you run out of positions to try.

The last valid starting position is n - m (where n = len(haystack) and m = len(needle)), because starting any later would mean there are not enough characters remaining to fit the needle.

Step-by-Step Explanation

Let's trace Example 3: haystack = "aabaabaaf", needle = "aabaaf".

Step 1: n = 9, m = 6. Valid positions: 0 through 3 (9 - 6 = 3). Set i = 0.

Step 2: Position 0, j=0: h[0]='a' = n[0]='a' ✓. Match.

Step 3: j=1: h[1]='a' = n[1]='a' ✓. j=2: h[2]='b' = n[2]='b' ✓. Matching continues.

Step 4: j=3: h[3]='a' = n[3]='a' ✓. j=4: h[4]='a' = n[4]='a' ✓. Five of six characters match.

Step 5: j=5: h[5]='b' ≠ n[5]='f' ✗. Mismatch on the last character! All five characters of work are discarded. Move to position 1.

Step 6: Position 1, j=0: h[1]='a' = n[0]='a' ✓. j=1: h[2]='b' ≠ n[1]='a' ✗. Fail after one match. Move to position 2.

Step 7: Position 2, j=0: h[2]='b' ≠ n[0]='a' ✗. Immediate fail. Move to position 3.

Step 8: Position 3: h[3]='a'=n[0], h[4]='a'=n[1], h[5]='b'=n[2], h[6]='a'=n[3], h[7]='a'=n[4], h[8]='f'=n[5]. All 6 characters match!

Step 9: Return 3.

Brute Force — Slide and Compare at Every Position — Watch the naive approach try each starting position, comparing character by character, discarding all progress on mismatch and restarting from scratch.

Algorithm

  1. Let n = len(haystack), m = len(needle).
  2. For each starting position i from 0 to n - m:
    a. For each offset j from 0 to m - 1:
    • Compare haystack[i + j] with needle[j].
    • If they differ, break out of the inner loop (mismatch at this position).
      b. If the inner loop completed without a mismatch (all m characters matched), return i.
  3. If no position yielded a full match, return -1.

Code

class Solution {
public:
    int strStr(string haystack, string needle) {
        int n = haystack.size(), m = needle.size();
        for (int i = 0; i <= n - m; i++) {
            bool found = true;
            for (int j = 0; j < m; j++) {
                if (haystack[i + j] != needle[j]) {
                    found = false;
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }
};
class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        n, m = len(haystack), len(needle)
        for i in range(n - m + 1):
            match = True
            for j in range(m):
                if haystack[i + j] != needle[j]:
                    match = False
                    break
            if match:
                return i
        return -1
class Solution {
    public int strStr(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        for (int i = 0; i <= n - m; i++) {
            boolean found = true;
            for (int j = 0; j < m; j++) {
                if (haystack.charAt(i + j) != needle.charAt(j)) {
                    found = false;
                    break;
                }
            }
            if (found) return i;
        }
        return -1;
    }
}

Complexity Analysis

Time Complexity: O(n × m)

Where n = len(haystack) and m = len(needle). The outer loop runs at most n - m + 1 times, and for each position the inner loop compares up to m characters. In the worst case (e.g., haystack = "aaaa...aab", needle = "aa...ab"), nearly every position triggers a near-complete comparison before failing at the last character, giving O((n - m) × m) ≈ O(n × m).

Space Complexity: O(1)

We only use a few integer variables (i, j, found). No auxiliary data structures are created.

Why This Approach Is Not Efficient

The brute force takes O(n × m) time. With both n and m up to 10^4, that is up to 10^8 character comparisons — enough to cause timeouts under tight limits.

The core waste is re-examining characters we have already compared. In Example 3, when the match at position 0 fails at the 6th character, we already know that haystack[0..4] = "aabaa". Positions 1 and 2 can be ruled out from this knowledge alone (because "aabaa" does not start with "abaab" or "baaba"). The brute force ignores all this information and naively re-reads the same characters.

Two classic algorithms fix this:

  • Rabin-Karp uses a rolling hash to compare entire windows in O(1) expected time, avoiding per-character inner loops. Expected O(n + m).
  • KMP pre-processes the needle into a failure function that lets the text pointer only move forward, guaranteeing O(n + m) even in the worst case.

Better Approach - Rabin-Karp Rolling Hash

Intuition

Instead of comparing the needle character by character at every position, what if we could compare the entire window at once in O(1) time?

That is exactly what a hash function gives us. We compute a numeric fingerprint (hash) of the needle. Then for each window of the same length in the haystack, we compute a hash and compare it with the needle's hash. If the hashes differ, the strings definitely differ — skip immediately. If the hashes match, we verify character by character to rule out a false positive (hash collision).

The trick that makes this fast is the rolling hash. When we slide the window one position to the right, we do not recompute the hash from scratch. Instead, we:

  1. Remove the contribution of the old leftmost character.
  2. Add the contribution of the new rightmost character.

This update takes O(1), so scanning all n - m + 1 windows costs just O(n). The only O(m) work per window happens on a hash match (rare if the hash function is good), giving O(n + m) expected time.

Step-by-Step Explanation

Let's trace Example 3: haystack = "aabaabaaf", needle = "aabaaf". Use base = 26, mod = 101 for illustration.

Step 1: Compute needle hash. hash("aabaaf") = some value H_needle.

Step 2: Compute initial window hash. hash(haystack[0..5]) = hash("aabaab") = H_0.

Step 3: Compare H_0 vs H_needle. They differ ("aabaab" ≠ "aabaaf"). No match. Roll the window.

Step 4: Roll hash: remove h[0]='a', add h[6]='a'. H_1 = hash("abaaba"). Compare with H_needle → differ. Roll.

Step 5: Roll hash: remove h[1]='a', add h[7]='a'. H_2 = hash("baabaa"). Compare → differ. Roll.

Step 6: Roll hash: remove h[2]='b', add h[8]='f'. H_3 = hash("aabaaf"). Compare with H_needle → MATCH!

Step 7: Hash match — verify character by character: h[3..8] = "aabaaf" = needle. Confirmed.

Step 8: Return 3.

Rabin-Karp — Rolling Hash Window Scan — Watch how the rolling hash compares entire windows in O(1) by incrementally updating the hash value. Only when hashes match do we verify character by character.

Algorithm

  1. Compute hash of needlepatHash.
  2. Compute hash of first m characters of haystacktxtHash.
  3. Precompute power = base^(m-1) mod p for rolling.
  4. For each window position i from 0 to n - m:
    a. If txtHash == patHash, verify character by character. If verified, return i.
    b. Roll the hash: txtHash = ((txtHash - haystack[i] * power) * base + haystack[i + m]) mod p.
  5. Return -1.

Code

class Solution {
public:
    int strStr(string haystack, string needle) {
        int n = haystack.size(), m = needle.size();
        if (m > n) return -1;
        
        long long base = 31, mod = 1e9 + 7;
        long long power = 1;
        
        // Compute base^(m-1) % mod
        for (int i = 0; i < m - 1; i++)
            power = power * base % mod;
        
        // Compute initial hashes
        long long patHash = 0, txtHash = 0;
        for (int i = 0; i < m; i++) {
            patHash = (patHash * base + needle[i]) % mod;
            txtHash = (txtHash * base + haystack[i]) % mod;
        }
        
        for (int i = 0; i <= n - m; i++) {
            if (patHash == txtHash) {
                // Verify to rule out collision
                if (haystack.substr(i, m) == needle)
                    return i;
            }
            if (i < n - m) {
                // Roll the hash window
                txtHash = ((txtHash - haystack[i] * power % mod + mod) % mod
                           * base + haystack[i + m]) % mod;
            }
        }
        return -1;
    }
};
class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        n, m = len(haystack), len(needle)
        if m > n:
            return -1
        
        base, mod = 31, 10**9 + 7
        
        # Compute base^(m-1) % mod
        power = pow(base, m - 1, mod)
        
        # Compute initial hashes
        pat_hash = 0
        txt_hash = 0
        for i in range(m):
            pat_hash = (pat_hash * base + ord(needle[i])) % mod
            txt_hash = (txt_hash * base + ord(haystack[i])) % mod
        
        for i in range(n - m + 1):
            if pat_hash == txt_hash:
                # Verify to rule out hash collision
                if haystack[i:i + m] == needle:
                    return i
            if i < n - m:
                # Roll the hash: remove leftmost, add new rightmost
                txt_hash = ((txt_hash - ord(haystack[i]) * power) * base
                            + ord(haystack[i + m])) % mod
        return -1
class Solution {
    public int strStr(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        if (m > n) return -1;
        
        long base = 31, mod = 1_000_000_007;
        long power = 1;
        
        // Compute base^(m-1) % mod
        for (int i = 0; i < m - 1; i++)
            power = power * base % mod;
        
        // Compute initial hashes
        long patHash = 0, txtHash = 0;
        for (int i = 0; i < m; i++) {
            patHash = (patHash * base + needle.charAt(i)) % mod;
            txtHash = (txtHash * base + haystack.charAt(i)) % mod;
        }
        
        for (int i = 0; i <= n - m; i++) {
            if (patHash == txtHash) {
                // Verify character by character
                if (haystack.substring(i, i + m).equals(needle))
                    return i;
            }
            if (i < n - m) {
                // Roll the hash
                txtHash = ((txtHash - haystack.charAt(i) * power % mod + mod)
                           % mod * base + haystack.charAt(i + m)) % mod;
            }
        }
        return -1;
    }
}

Complexity Analysis

Time Complexity: O(n + m) expected, O(n × m) worst case

Computing the initial hashes takes O(m). Scanning n - m + 1 windows takes O(n) since each hash roll is O(1). Verification happens only on hash matches. With a good hash function and large modulus, collisions are rare, so verification is O(m) amortized once. Expected total: O(n + m). In the pathological worst case (many hash collisions), every window triggers verification, giving O(n × m).

Space Complexity: O(1)

We store only a few variables: two hash values, the base, the modulus, and the power. No auxiliary arrays are needed.

Why This Approach Is Not Efficient

Rabin-Karp achieves O(n + m) on average, which is excellent in practice. However, its worst case is still O(n × m) — the same as brute force. This happens when the hash function produces many collisions (e.g., all windows hash to the same value), forcing character-by-character verification at every position.

For competitive programming and interviews where worst-case guarantees matter, we need an algorithm that is O(n + m) even in the worst case. The KMP (Knuth-Morris-Pratt) algorithm achieves this by pre-processing the needle into a failure function that eliminates all redundant comparisons, guaranteeing that the text pointer never moves backward.

Optimal Approach - KMP (Knuth-Morris-Pratt)

Intuition

KMP's genius insight is: when a mismatch occurs, we already know what the matched portion of the text looks like (it is identical to a prefix of the needle). We can use this knowledge to skip ahead instead of starting over.

Consider Example 3: at position 0 we match "aabaa" (5 characters) before failing at 'b' vs 'f'. The matched portion "aabaa" ends with "aa", which is also the beginning of the needle ("aa..."). So the next potential match starts at position 3 — not position 1 — because positions 1 and 2 cannot possibly work (the text does not start with 'a' at position 1 relative to the needle's expected prefix).

KMP captures this pattern in the failure function (prefix function). For each position j in the needle, fail[j] stores the length of the longest proper prefix of needle[0..j] that is also a suffix. When a mismatch occurs at position j, instead of resetting j to 0, we set j = fail[j-1] and continue comparing from there — never moving the text pointer i backward.

This guarantees that each character of the text is visited at most twice (once in the main scan, at most once via failure fallback), giving O(n + m) worst-case time.

Step-by-Step Explanation

Let's trace Example 3: haystack = "aabaabaaf", needle = "aabaaf".

Step 1: Build the failure function for needle = "aabaaf":

  • fail[0] = 0 (by definition: single char has no proper prefix = suffix)
  • fail[1] = 1 ('a' = 'a': prefix "a" matches suffix "a")
  • fail[2] = 0 ('b' ≠ 'a': no prefix of "aab" equals a suffix)
  • fail[3] = 1 ('a' = 'a': prefix "a" matches suffix "a" of "aaba")
  • fail[4] = 2 ('a' = 'a': prefix "aa" matches suffix "aa" of "aabaa")
  • fail[5] = 0 ('f' ≠ 'b': no prefix of "aabaaf" equals a suffix)
  • Result: fail = [0, 1, 0, 1, 2, 0]

Step 2: i=0, j=0: h[0]='a' = n[0]='a'. Match. j→1.

Step 3: i=1, j=1: h[1]='a' = n[1]='a'. Match. j→2.

Step 4: i=2, j=2: h[2]='b' = n[2]='b'. Match. j→3.

Step 5: i=3, j=3: h[3]='a' = n[3]='a'. Match. j→4. i=4, j=4: h[4]='a' = n[4]='a'. Match. j→5.

Step 6: i=5, j=5: h[5]='b' ≠ n[5]='f'. MISMATCH! Consult fail[4] = 2. Set j = 2. The key insight: we know h[3..4]="aa" matches needle[0..1]="aa", so we resume at j=2 without re-reading h[3] or h[4].

Step 7: i=5, j=2: h[5]='b' = n[2]='b'. Match! j→3. KMP reused 2 characters of progress.

Step 8: i=6, j=3: h[6]='a' = n[3]='a'. j→4. i=7, j=4: h[7]='a' = n[4]='a'. j→5.

Step 9: i=8, j=5: h[8]='f' = n[5]='f'. Match. j→6 = m. Full pattern matched!

Step 10: Return i - m + 1 = 8 - 6 + 1 = 3.

KMP — Failure Function Avoids Redundant Comparisons — Watch how KMP's failure function lets us skip ahead after a mismatch, reusing characters we already compared. The text pointer i never moves backward.

Algorithm

  1. Build failure function for needle:
    • Initialize fail array of size m with all zeros.
    • For i from 1 to m - 1: let j = fail[i-1]. While j > 0 and needle[i] ≠ needle[j], set j = fail[j-1]. If needle[i] == needle[j], increment j. Set fail[i] = j.
  2. Match against haystack:
    • Set j = 0 (position in needle).
    • For each i from 0 to n - 1:
      • While j > 0 and haystack[i] ≠ needle[j], set j = fail[j-1] (use failure function to skip).
      • If haystack[i] == needle[j], increment j.
      • If j == m, return i - m + 1 (full match found).
  3. Return -1.

Code

class Solution {
public:
    int strStr(string haystack, string needle) {
        int n = haystack.size(), m = needle.size();
        
        // Build KMP failure function
        vector<int> fail(m, 0);
        for (int i = 1; i < m; i++) {
            int j = fail[i - 1];
            while (j > 0 && needle[i] != needle[j])
                j = fail[j - 1];
            if (needle[i] == needle[j])
                j++;
            fail[i] = j;
        }
        
        // KMP matching
        int j = 0;
        for (int i = 0; i < n; i++) {
            while (j > 0 && haystack[i] != needle[j])
                j = fail[j - 1];
            if (haystack[i] == needle[j])
                j++;
            if (j == m)
                return i - m + 1;
        }
        return -1;
    }
};
class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        n, m = len(haystack), len(needle)
        
        # Build KMP failure function
        fail = [0] * m
        for i in range(1, m):
            j = fail[i - 1]
            while j > 0 and needle[i] != needle[j]:
                j = fail[j - 1]
            if needle[i] == needle[j]:
                j += 1
            fail[i] = j
        
        # KMP matching
        j = 0
        for i in range(n):
            while j > 0 and haystack[i] != needle[j]:
                j = fail[j - 1]
            if haystack[i] == needle[j]:
                j += 1
            if j == m:
                return i - m + 1
        return -1
class Solution {
    public int strStr(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        
        // Build KMP failure function
        int[] fail = new int[m];
        for (int i = 1; i < m; i++) {
            int j = fail[i - 1];
            while (j > 0 && needle.charAt(i) != needle.charAt(j))
                j = fail[j - 1];
            if (needle.charAt(i) == needle.charAt(j))
                j++;
            fail[i] = j;
        }
        
        // KMP matching
        int j = 0;
        for (int i = 0; i < n; i++) {
            while (j > 0 && haystack.charAt(i) != needle.charAt(j))
                j = fail[j - 1];
            if (haystack.charAt(i) == needle.charAt(j))
                j++;
            if (j == m)
                return i - m + 1;
        }
        return -1;
    }
}

Complexity Analysis

Time Complexity: O(n + m)

Building the failure function takes O(m): the pointer j can decrease at most as many times as it increases, so the inner while loop runs O(m) total across all iterations. The matching phase is O(n) by the same argument: i advances monotonically from 0 to n-1, and j's total decreases are bounded by its total increases. Combined: O(n + m) — and this is a worst-case guarantee, not just expected.

Space Complexity: O(m)

The failure array stores one integer per character of the needle, so O(m) space. No copy of the haystack is made.