Home Data Structures Rabin-Karp Algorithm for String Searching & Pattern Matching

Rabin-Karp Algorithm for String Searching & Pattern Matching

Rabin-Karp is a pattern-matching algorithm that works by calculating the hash of the pattern to be searched(say Length M) and the hash of M characters from the given text. If the hash values are the same, then it matches the individual M-character sequence. If the hash values are not equal then it will calculate the hash value for the next M-character sequence of a given text.

Hence the Rabin-Karp algorithm involves the calculation of the hash value for –

  1. Pattern (to be searched) of Length M
  2. All substrings of the given text of Length M.

Naive Approach

Let’s say we are given a text of length N. The naive approach would be to calculate the hash of 0 to M  characters (hash(x)=hash(text[0……M]) and if the match doesn’t occur, we will calculate the next M characters ranging from 1 to M+1 and so on.

Naive_Rabin_Karp(text, pattern){

    n = length(text)
    m = length(pattern)

    //iterative algorithm to calculate hash in O(M) time
    hash_patt = hash(pattern, 0, m);
     
    for(i=0; i<N; i++){
        hash_text = hash(text, i, m+i) 
        if(hash_text == hash_patt)
        {    
            for(j=0; i<M; j++){
                if(text[j+i] != pattern[i])
                    break; 
            }
            if(j == M)
                print("Pattern found at" +i)
        }
    }
}

Optimized Approach

We can choose the hash function in such a way that rehashing a string will take only O(1) time, but calculating hash for the first substring will still take O(M) time.

The hash function we will use will be the rolling hash. We will first start by calculating the integer value for each character. Say we have a string containing combinations of characters in range (a-z) or (A-Z), we can assign each character is converted to value equivalent to its position in range 1 to 26, such a way that a = 1, b=2, c=3 …… z = 26. Example – “aab” = 112, “bde”= 245 etc.

Next, we will choose a number to multiply each character in a way explained below –

Hash(“c1c2…cm“) = (cm*pm-1 + cm-1*pm-2  + ……………c1*pm-1)

Since the hash value generated will be very big, so we can perform a modulus with a prime number to get the final result in a given range. The prime number chosen for modulus should not be very small as it will be repeated for lots of text substrings and it should not be too big that the purpose of using modulus is lost.

Hash(“c1c2…cm“) = (cm*pm-1 + cm-1*pm-2  + ……………c1*pm-1 ) mod q 

Now, have chosen a hash function calculating first hash will take O(M) time but the rehashing will take only O(1) time.

/*note here he modulo q has been excluded readability and simplicity,
 but the calculations still hold when modulo q will be present */

Example -> text = "abcde", M = 3, p = 26

hash(abc) = c*26²+ b*26¹+ a*26°          
          = 3*26²+ 2*26¹ + 1*26°
          = 2028 + 52 + 1 = 2081

hash(bcd) = hash(d)*26² + (hash(abc) - hash(a))/26 
          = 4*26² +( 3*26²+ 2*26¹ + 1 - 1)/26
          = 4*26² + 3*26¹ + 2 = 2784

hash(cde) = hash(e)*26² + (hash(bcd) - hash(b))/26 
          = 5*26² +( 2784 - 2)/26
          = 4*26² + 107  = 2811

Let’s see how this optimized algorithm can be implemented.

Implementation of the above algorithm


//Rabin-Karp Algorithm implementation in CPP
#include <bits/stdc++.h> 
using namespace std; 
  
#define p 26
  
void Rabin_Karp(char pat[], char txt[], int q)  
{  
    int M = strlen(pat);  
    int N = strlen(txt);  
    int patt_hash = 0;     //hash value of pattern  
    int text_hash = 0;     //hash value of text
    int hash = 1;  
  
    for(int i=0; i<M-1; i++)  
        hash = (hash*p) % q;  
  
    for (int i = 0; i < M; i++)  
    {  
        patt_hash = (p * patt_hash + pat[i]) % q;  
        text_hash = (p * text_hash + txt[i]) % q;  
    }  
  
    for (int i=0; i<=N-M; i++)  
    {  
        if (patt_hash == text_hash)  
        {   
            int j;
            for (j = 0; j < M; j++)  
            {  
                if (txt[i+j] != pat[j])  
                    break;  
            }  
            if (j == M)  
                cout<<"Pattern found at index "<< i<<endl;  
        }  
   
        if ( i < N-M )  
        {  
            text_hash = (p*(text_hash - txt[i]*hash) + txt[i+M])%q;  
            if (text_hash < 0)  
            text_hash = (text_hash + q);  
        }  
    }  
}  
  
int main()  
{  
    char text[] = "nlognbestportalforcsisnlogn";  
    char pattern[] = "nlogn";  
    int q = 99;     
    Rabin_Karp(pattern, text, q);  
    return 0;  
}

Time complexity

The average and best-case time complexity of the Rabin-Karp algorithm are O(N+M), where M is the length of pattern we have to search, and N is the length of the given text. But the worst-case time complexity is O(NM). The worst case of the Rabin-Karp algorithm occurs when all characters of pattern and text are the same as the hash values of all the substrings of text matches with the hash value of pattern.

For example pat[] = “AAA” and txt[] = “AAAAAAA”.

Subscribe to our weekly newsletter

Join our community of 1000+ developers and stay updated with the fast moving world of computer science

We promise, we won't spam
Even we hate spam as much as you hate them

LEAVE A REPLY

Please enter your comment!
Please enter your name here