Rabin Karp Substring Search in Java

This post will explore one of the substring search algorithms known as Rabin-Karp created by Richard M. Karp and Michael O. Rabin. There are two concepts to talk about, the hashing algorithm and rolling hash. In this post, we will use "equi" as the needle and "thequickbrownfo..." as the haystack.

Hashing

One method to hash a given string is to multiply the decimal value of each character of a given string by a prime raised to the power of the index of that given character. Using this method and 3 as the prime, the hash for "equi" is 4328.

e 101 * 3^0 = 101
q 113 * 3^1 = 339
u 117 * 3^2 = 1053
i 105 * 3^3 = 2635
        sum = 4328

This algorithm first encodes the needle, "equi", and the first section of the haystack with equal length as the needle, "theq", as hash strings. We already know "equi" yields 4328. Hashing "theq" yields 4388. If the two hashes are equal, then the strings are the same*.

t 116 * 3^0 = 116
h 104 * 3^1 = 312
e 101 * 3^2 = 909
q 113 * 3^3 = 3051
        sum = 4388

Rolling Hash

The second part of the algorithm is the rolling hash. Essentially, its goal is to continuously select the next section of the haystack with equivalent length of the needle, hash it, compare with the hash of the needle, and return the index if the hashes are equal. The way the Rabin-Karp algorithm selects the next section does not require rehashing. It selects the next section by first subtracting the decimal value of the first character of the previous section, divide the difference by the prime (this essentially negates 1 from all the exponents), and adding to the quotient the product of the newly introduced character and prime raised to the power of the largest index.

t 116 * 3^0 = 116
h 104 * 3^1 = 312
e 101 * 3^2 = 909
q 113 * 3^3 = 3051
        sum = 4388
u 4388 - 116 = 4272
  4272 / 3 = 1424
  1424 + (117 * 3^3) = 4583
i 4583 - 104 = 4479
  4479 / 3 = 1493
  1493 + (105 * 3^3) = 4328 // Matches the hash of equi

The following shows another way to visualize this. Spaces are added increase readability.

e q u i 4328
t h e q u i c k b r o ...
t h e q 4388
  h e q u 4583
    e q u i 4328 // Matches the hash of equi

Implementation

public static int find (String haystack, String needle) {
    int prime = 3;
    int h_haystack = 0;
    int h_needle = 0;

    for (int i = 0; i < needle.length(); i++) {
        h_needle += needle.charAt(i) * Math.pow(prime, i);
        h_haystack += haystack.charAt(i) * Math.pow(prime, i);
    }

    if (h_haystack == h_needle)
        return 0;

    for (int i = 1; i < haystack.length() - needle.length(); i++) {
        h_haystack -= haystack.charAt(i - 1);
        h_haystack /= prime;
        h_haystack += haystack.charAt(i + needle.length() - 1) 
            * Math.pow(prime, needle.lenth() - 1);

        if (h_haystack == h_needle)
            return i;
    }

    return -1;
}

A Note on Selecting Prime

3 is probably a good enough prime for this example because the numbers we are dealing with are the decimal representation of the alphabet, which ranges from 97 to 122. When the variations of character increases, a bigger prime might be required to avoid collisions.

This article is my 34th oldest. It is 593 words long, and it’s got 0 comments for now.