March 26, 2026 · 10 min read

String Matching Algorithms: KMP, Rabin-Karp, and When Brute Force Is Actually Fine

A practical guide to string pattern matching algorithms — naive search, KMP, Rabin-Karp, and how to choose between them for interviews and real code.

string-matching kmp rabin-karp algorithms interviews

String matching is one of those topics that sounds simple — find where a pattern appears in a text — but has surprisingly deep algorithmic ideas behind it. The naive approach is O(n*m) in the worst case. KMP gets it down to O(n+m). Rabin-Karp uses hashing for a different kind of elegance. And in practice, the naive approach often works fine.

Let's go through all of them and build honest intuition about when each one matters.

The Naive Approach

Try the pattern at every position in the text. For each position, compare character by character.

def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    results = []
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            results.append(i)
    return results
Time complexity: O((n - m + 1) * m) worst case. The worst case happens with inputs like text = "aaaaaa" and pattern = "aaab" — you compare almost all characters at each position before the mismatch. Average case: O(n + m) for random text and patterns. When characters are drawn from a large alphabet (like ASCII), mismatches happen early at most positions, and you move on quickly.

Here's the honest truth: for most practical applications and many interview problems, the naive approach is perfectly fine. If n is 10^4 and m is 10^2, even the worst case is 10^6 operations — well within time limits. KMP and Rabin-Karp are for when n and m are both large, or when the worst case actually occurs.
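The same brute-force idea can be written even more compactly with slicing, which is handy for quick sanity checks (naive_search_slices is a name chosen here for illustration, not a library function):

```python
def naive_search_slices(text, pattern):
    # Compare a slice of length m at each position; same O(n*m) worst case,
    # but the per-position comparison is done by Python's C-level slice compare.
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

print(naive_search_slices("abracadabra", "abra"))  # [0, 7]
```

The slice comparison short-circuits on the first mismatching character, so the practical behavior mirrors the explicit inner loop above.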

KMP (Knuth-Morris-Pratt)

The insight behind KMP: when you get a mismatch, you've already seen some characters. Don't throw away that information. Instead of restarting from scratch, jump to the next position where a match is still possible.

KMP achieves O(n + m) worst-case time. It never re-examines a character in the text.

The Failure Function (Prefix Table)

The magic of KMP is in preprocessing the pattern. For each position j in the pattern, compute the length of the longest proper prefix of pattern[0..j] that's also a suffix. This is called the failure function, prefix function, or partial match table.

Example for pattern "ABABAC":

Pattern:  A B A B A C
Index:    0 1 2 3 4 5
Failure:  0 0 1 2 3 0
  • At index 0: "A" has no proper prefix that's also a suffix. 0.
  • At index 1: "AB" — prefix "A" != suffix "B". 0.
  • At index 2: "ABA" — prefix "A" == suffix "A". 1.
  • At index 3: "ABAB" — prefix "AB" == suffix "AB". 2.
  • At index 4: "ABABA" — prefix "ABA" == suffix "ABA". 3.
  • At index 5: "ABABAC" — no matching prefix-suffix. 0.

Building the Failure Function

def build_failure(pattern):
    m = len(pattern)
    failure = [0] * m
    length = 0  # length of the previous longest prefix suffix
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            failure[i] = length
            i += 1
        else:
            if length != 0:
                length = failure[length - 1]  # fall back, don't advance i
            else:
                failure[i] = 0
                i += 1

    return failure

The key is the fallback: when there's a mismatch, instead of resetting length to 0, we jump back to failure[length - 1]. This reuses previously computed information. The fallback might happen multiple times, but amortized over the whole construction, the total work is O(m).

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    failure = build_failure(pattern)
    results = []
    j = 0  # position in pattern

    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = failure[j - 1]  # fall back

        if text[i] == pattern[j]:
            j += 1

        if j == m:
            results.append(i - m + 1)
            j = failure[j - 1]  # look for next match

    return results

The pointer i in the text never goes backward. That's the whole point. When there's a mismatch, we move j (the pattern pointer) backward using the failure function, which tells us the longest prefix of the pattern that's already matched.

Time complexity: O(n + m). Building the failure function is O(m). The search is O(n) because i advances every iteration and j can only fall back as many times as it advanced.

Why KMP Works — The Intuition

Imagine you're matching "ABABAC" against a text and you've matched "ABABA" (5 characters) when you hit a mismatch. The failure function says the longest prefix-suffix of "ABABA" has length 3 ("ABA").

That means the last 3 characters you matched ("ABA") are also the first 3 characters of the pattern. So instead of starting over, you jump to position 3 in the pattern — the "ABA" prefix is already matched, and you just need to continue from there.

You never look at the text characters you've already passed. That's what makes it O(n).
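The jump itself is a single table lookup. A minimal sketch, using the failure table for "ABABAC" computed earlier in the post:

```python
# Failure table for pattern "ABABAC", as derived in the worked example above.
failure = [0, 0, 1, 2, 3, 0]

j = 5               # we had matched "ABABA" (5 characters) when the mismatch hit
j = failure[j - 1]  # fall back to the longest prefix that is still matched
print(j)            # 3: "ABA" is already matched, so resume comparing at pattern[3]
```

Three characters of progress survive the mismatch, and the text pointer never moves backward.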

Rabin-Karp

Rabin-Karp uses hashing. Compute the hash of the pattern, then slide a window over the text computing rolling hashes. When hashes match, verify with a character-by-character comparison (to handle hash collisions).

Rolling Hash

The key to efficiency is that you don't recompute the hash from scratch at each position. You "roll" it: remove the contribution of the outgoing character and add the incoming one.

For a polynomial hash with base b and modulus q:

hash("ABCD") = A*b^3 + B*b^2 + C*b^1 + D*b^0  (mod q)

To go from hash("ABCD") to hash("BCDE"):

  1. Subtract A*b^3 (removes the outgoing character)
  2. Multiply by b (shifts everything left)
  3. Add E (the incoming character)
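Those three steps can be checked numerically before wiring them into the full search. A small sketch (poly_hash is a throwaway helper for the cross-check, not part of the algorithm):

```python
base, mod = 256, 10**9 + 7

def poly_hash(s):
    # Straightforward O(len(s)) polynomial hash, used only to verify the roll.
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

old = poly_hash("ABCD")
h_pow = pow(base, 3, mod)                # b^(m-1) with m = 4
rolled = (old - ord("A") * h_pow) % mod  # 1. subtract A*b^3
rolled = (rolled * base) % mod           # 2. shift everything left by one character
rolled = (rolled + ord("E")) % mod       # 3. add the incoming E

print(rolled == poly_hash("BCDE"))  # True
```

The rolled value matches a from-scratch hash of the new window, which is exactly the invariant the search loop relies on.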
def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    base = 256
    mod = 10**9 + 7
    results = []

    # precompute base^(m-1) mod q, used to remove the outgoing character
    h = pow(base, m - 1, mod)

    # hash of the pattern and of the first window of text
    p_hash = 0
    t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    for i in range(n - m + 1):
        if p_hash == t_hash:
            # hash match — verify character by character
            if text[i:i + m] == pattern:
                results.append(i)

        # roll the hash to the next window
        if i < n - m:
            t_hash = (t_hash - ord(text[i]) * h) % mod
            t_hash = (t_hash * base + ord(text[i + m])) % mod
            t_hash = (t_hash + mod) % mod  # ensure non-negative

    return results

Time complexity: O(n + m) expected. The rolling hash computation is O(1) per position. Character-by-character verification only happens on hash matches, which are rare if the hash function is good.

Worst case is still O(n*m) if every hash matches (think text = "aaaa", pattern = "aa" with a bad hash function). But with a good modulus, this essentially never happens in practice.

Rabin-Karp's real superpower is searching for multiple patterns simultaneously. Compute the hash of each pattern, store them in a set, and check each window's hash against the set.

from collections import defaultdict

def multi_pattern_search(text, patterns):
    n = len(text)
    base = 256
    mod = 10**9 + 7
    results = {p: [] for p in patterns}

    # group patterns by length so each group shares one rolling hash
    by_length = defaultdict(list)
    for p in patterns:
        by_length[len(p)].append(p)

    for m, pats in by_length.items():
        if m > n:
            continue

        # hashes of all patterns of this length
        pat_hashes = {}
        for p in pats:
            h = 0
            for c in p:
                h = (h * base + ord(c)) % mod
            pat_hashes.setdefault(h, []).append(p)

        # rolling hash over the text
        h_pow = pow(base, m - 1, mod)
        t_hash = 0
        for i in range(m):
            t_hash = (t_hash * base + ord(text[i])) % mod

        for i in range(n - m + 1):
            if t_hash in pat_hashes:
                window = text[i:i + m]
                for p in pat_hashes[t_hash]:
                    if window == p:
                        results[p].append(i)

            if i < n - m:
                t_hash = (t_hash - ord(text[i]) * h_pow) % mod
                t_hash = (t_hash * base + ord(text[i + m])) % mod
                t_hash = (t_hash + mod) % mod

    return results

For k patterns of the same length m, this is O(n + k*m) instead of the O(k*n*m) of running naive search once per pattern. This is where Rabin-Karp really shines: plagiarism detection, DNA sequence matching, and similar multi-pattern problems.

Z-Algorithm

The Z-algorithm is another linear-time string matching approach that's sometimes cleaner than KMP. For a string S, Z[i] is the length of the longest substring starting at position i that matches a prefix of S.

def z_function(s):
    n = len(s)
    z = [0] * n
    z[0] = n
    l, r = 0, 0

    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])

        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1

        if i + z[i] > r:
            l, r = i, i + z[i]

    return z

def z_search(text, pattern):
    # concatenate pattern + separator + text
    combined = pattern + "$" + text
    z = z_function(combined)
    m = len(pattern)
    results = []
    for i in range(m + 1, len(combined)):
        if z[i] == m:
            results.append(i - m - 1)
    return results

The trick: concatenate pattern + a separator character + text, then compute the Z-array. Any position where Z[i] equals the pattern length is a match. O(n + m) time, clean implementation.

Which Algorithm to Use?

  • Naive: when the input is small, or when you're in an interview and the problem doesn't require linear time. Don't overcomplicate things.
  • KMP: when you need guaranteed O(n + m) worst case. Also useful when you need the failure function itself (some problems use the prefix function directly, like finding the shortest period of a string).
  • Rabin-Karp: when you're searching for multiple patterns, or when the problem involves matching with a hash-based approach (like detecting duplicate substrings).
  • Z-Algorithm: when you want linear-time matching with a simpler implementation than KMP. Good for interview settings because it's easier to derive from scratch.
  • Built-in functions: in practice, text.find(pattern) or text.index(pattern) in Python uses a highly optimized algorithm (a variant of Boyer-Moore-Horspool). For single-pattern search in production code, use the standard library.

Common Mistakes

  • Off-by-one in the Rabin-Karp sliding window. Make sure your rolling hash computation handles the boundaries correctly: the window starting at position i uses characters i through i+m-1.
  • Not handling hash collisions. A hash match doesn't guarantee a string match. Always verify with a character-by-character comparison; this is the difference between O(n + m) expected time and a correct algorithm.
  • Using a bad modulus. If mod is too small, you'll get tons of collisions. Use a large prime (10^9 + 7 or 10^9 + 9). Some people use double hashing (two different base/mod pairs) for extra safety.
  • Confusing the KMP failure function with other prefix computations. The failure function is the length of the longest proper prefix that's also a suffix; "proper" means it can't be the whole string itself. Make sure your implementation handles this correctly.
  • Building the failure function wrong. The fallback step (length = failure[length - 1]) is subtle. Walk through an example by hand before coding it.
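The double-hashing idea mentioned above can be sketched like this (the specific base/mod pairs are arbitrary illustrative choices, not canonical constants):

```python
# Two independent (base, mod) pairs. A false positive now requires the same
# pair of strings to collide under both hashes simultaneously, which is
# astronomically less likely than a single-hash collision.
PAIRS = [(256, 10**9 + 7), (31, 10**9 + 9)]

def double_hash(s):
    # Return a tuple of polynomial hashes, one per (base, mod) pair.
    hashes = []
    for base, mod in PAIRS:
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        hashes.append(h)
    return tuple(hashes)

print(double_hash("abc") == double_hash("abc"))  # True
print(double_hash("abc") == double_hash("abd"))  # False
```

You still verify candidate matches character by character; double hashing just makes those verifications almost never trigger on a non-match.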

Practice string matching problems on CodeUp. Start with implementing each algorithm from scratch — the implementation itself is the hard part, not the concept. Once you've built KMP and Rabin-Karp a few times, you'll know which one to reach for in any given situation.
