Friday, March 4, 2016

KMP Substring Match Algorithm (LPS array/Multiple Occurences) - Java

There is another implementation that uses 2D array for transition table of a DFA. Link

Example case:

text: aaabacaabaazq
pat : aabaax

a a b a a x
lps : [0 1 0 1 2 0]

j = pat index
i = txt index

What do lps values mean?

lps values at any index indicate where in (at what index) we should go in pat so
that our search for the pattern continues in case a mismatch occurs in the middle of successful matching.
If a mismatch occurs at index j in pat, we consider pat[j-1] as the index in pattern string which should be matched next.
Think about prefix suffix match (the tutorial link above) and you'll get the intuition.

For example, if we have matched upto "aa"a in txt
and "aa"b in pat.

Now the next character is different in both strings (j = 2, i = 2). So we look at lps[j-1] = lps[2-1] = lps[1] = 1.
It means we need to resume our search again at index 1 in pat with the current value of i=2.

Lets see what is at index 1 in pat and index 2 in txt.

M
0 1 2 3 4 5 6 7 8 9
txt: a a a b a c a a b a a z q
pat: a a b a a x

This is good as we did not go to index 1 if we were using bruteforce. We reused the already done computation i.e.
aa in the pat matches upto index 2 in text (beginning at index 1). So we start again at index 3 in txt and index 2 in pat.

Now consider that we already processed
|
text: aaabac"aabaa"zq
pat:  "aabaa"x
|

so i = 11, j = 5, the indices we are considering now.
A mismatch!, so now look at lps[j-1] = lps[4] = 2
That means we can resume our search with current i the same = 11 and j = 2.
Now we match characters at i = 11, j = 2 in their respective strings.

z!=x.
A mismatch again. But we saved the time for looking at 2 characters "aa"baax in aabaax. If it were bruteforce, we would have
started matching again from the beginning at index 7 in text and index 0 in pat. But now it is index 11 in text and index 2 in pat
that we are matching.

Source :

```public class KMPOD {
static int[] lps;
static int[] locations;

public static int[] lps(String pat){
lps = new int[pat.length()];

int i=1, len=0, L = pat.length();
while(i<L){
if(pat.charAt(i)==pat.charAt(len)){
len++;
lps[i] = len;
i++;
}else{
if(len!=0){
len = lps[len-1];
}else{
lps[i] = 0;
i++;
}
}
}
return lps;
}

public static int[] match(String txt, String pat){
locations = new int[txt.length()+1];
int z=1;

int M = pat.length();
int N = txt.length();

lps(pat);

int j = 0;
int i = 0;

while(i<N){
if(pat.charAt(j) == txt.charAt(i)){
j++;i++;
}
if(j==M){
locations[z++]=(i-j);
locations[0]++;         //0th index to store the size
j = lps[j-1];
}else if(i<N && pat.charAt(j) != txt.charAt(i)){
if(j!=0) j = lps[j-1];
else i++;
}
}
return locations;
}

public static void main(String[] args){
String text = "aneedleinahaystackneedlehereanotherneedlehere";
String pat = "needle";
int[] res = match(text, pat);
for(int i=1;i<=res[0];++i){
System.out.println("Match at Index : "+res[i]);
}
}
}```

Output:

Match at Index : 1
Match at Index : 18
Match at Index : 35