A spelling corrector based on Bayes Theorem (PHP, C#)

Jia.Chen

3.64/5 (8 votes)

Aug 31, 2007

6 min read

61063

This article introduces SpellingDice a spelling corrector based on Bayes Thorem and Dr. Peter Norvig's essay

Website of the project

For information about this project, please refer to http://www.horizonideas.com/spellingdice/

Intro

No matter you are a native English speaker or not, you may make some spelling mistakes every now and then. Isn't it neat, if the computer can provide some spelling suggestion for your unsure input? The most famous one should be Google Suggest. Every time you input some mistakes, there will be "Did you mean? XXX XXX XXX" on the next page. We know the whole system is based on Bayesian network, but how? On the other hand, if our own website or small size program could include this great function, that would be a valuable asset to attract people.

Peter Norvig, the Director of Google Research revealed the secrets of Google Suggest. I believe it's not the whole algorithm, but it's good enough for us to add spices ourselves and utilize in our own small-scale program.

I believe his essay is much better than my following descriptions. You can find his paper at

Here [English]

Here[Chinese]

Here[Japanese]

Here[Russian]

Suppose someone input "Majar", the program should be able to identify what the user actually meant was "major".

Ok, then how does it work?

We let P(s|i) denotes the probability of one suggested word s by providing an input word i.

So we have P(s|i)=P(s,i)/P(i)

P(i) is the probability of one random input string. As the user can input anything, so the probability is neglectable. We let it be 1. So we have:

P(s|i)=P(s,i)/1=P(s,i)

At the same time,

P(s,i)=P(i|s) * P(s) so we have :

P(s|i)=P(s,i)= P(i|s) * P(s)

Where P(s) is the probability of one word's probability. Question: How can we define this number?

Answer: find any book and read it through then make statics of each word's occurrence. Or download the API which includes the 5000 most commonly used English word. For instance, the word "the" must have a lot higher probability than "tae" [therapeutic arterial embolization].

Ok, what about P(i|s)? literately, P(i|s) is the probability of one input string by providing an recommend string. But it does not make sense. So it becomes the probability of one suggested string by calculating the editing distance.

Oh, edit distance. You don't know what? Read http://en.wikipedia.org/wiki/Edit_distance

The most famous one is Levenshtein Distance which is a build-in function in PHP. Another example is Hamming Distance in Information Theory. For instance D(01010, 10010)=2 . now you get what I mean by edit distance, right?

Ok, let's continue. They less edit distance you have, the higher probability you have.

For instance if someone inputs "hallo", "hello" has much higher P(i|s) than "helen". What if the word is correct? In other words, if one in the dictionary matches the input. Then the program should not return any suggesition. To save time complexity, ths program should break here.

Understand what I said? If not, again, please refer to his original article.

Ok. Let's take a closer look at "hallo". The program may find "Hallow" "Mallo"(if indeed, there is such word) and "Hello". They may have the same P(s) as we won't see too many people use tons of "hello" in their formal writings. And they have the same edit distance, so should they be placed in the order of alphabets? No! Absolutely not. Use your head to think! How many people would mis-type the first letter? Not so many~ So get rid of "mallo". Cool! Then you will find people tend to mis-type more vowels than consonants. The reason is because it's very obvious for human eyes to discover the typos of consonants. So they may correct them before submit to computer. Thus, "hello" must have much better P(i|s) "hallow". Is that clear?

The above are all in Dr. Peter Norvig's essay. But I don't think that's totally enough for determining P(i|s). So I observe how people type and find out some of us like me have giant fingers. So instead of type "hello" , we type "hrllo" or "hekko". If you take a closer look at your keyboard you will realize why. Because these 'e' and 'r' are near so are 'k' and 'l'. So here we have another standard, the P(i|s) will be increased if a letter is replaced by its neighbors on the keyboard. So when type "hrllo", "hello" must have higher probability than "hillo" (if indeed this is such word"). Ok, I hope you are not lost here. This is the idea of SpellingDice. If you don't want to know about the Algorithm, you can skip the following and go directly to download the APIs for PHP and C# at HERE.

Examples

Try out the online SpellingDice Corrector PHP

Please click HERE

Algorithm

Array<string> ReturnSuggestion(string inputWord)

{

Array<string> result=null;

if( inputWord[0].ToLowerCase()<='z' && inputWord[0].ToLowerCase()<='a') // if continues only if it's a English letter

{

Array<string> Candidates = ReadFromXML(inputWord[0].ToLowerCase()); // To speed up the process, only read the dictionary that start with the first letter. For instance, if input is "hallo" only read 'h.xml'.

Maps<string, double> RefinedCandidates;

foreach(string candidate in Candidates)

{

if(LevenshteinDistance(candidate, inputWord)<=2) //only check those words who have an edit distance of 2 comparing to the input. Otherwise the candidate list will be long and it does not help that much.

{

RefinedCandidates[candidate]=CalculateConditionProbability(candidate, inputWord)*probability(candidate);

}

Sort(RefinedCandidates);

result=ConvertMapIntoArray(RefinedCandidates);

}

return result;

}

APIs

To Download the Set, please click HERE

PHP

SpellDice::ReturnSuggestion($inputString);

// no matter the input is a sentence or a word, it will return a suggested sentence or 5 suggested words. if no suggestion, null will be returned. NOTE: When the input is a word, the return value will be a string array, if it is a sentence, a sentence string will be returned.

SpellDice::SentenceProcess($inputString);

//only process a sentence, return a suggested sentence

SpellDice::WordProcess($inputString)

//only process a word, a array of at most 5 suggested words will be returned.

namespace SpellCheck

public class SpellingDice

{

public SpellingDice() //intilize method

public List<string> ReturnSuggestion(string input); ///no matter the input is a sentence or a word, it will return a suggested sentence or 5 suggested words. if no suggestion, null will be returned. NOTE: When the input is a word, the return value will be a string array, if it is a sentence, an array of only one element will be returned.

public List<string> WordProcess(string word) //only process a word, a array of at most 5 suggested words will be returned.

}

A Snippet of Codes

PHP

        function GuessCandidates($compareString)
        {
        global $vocabularyArray;
        $letter=$compareString[0];
        if(count($vocabularyArray[$letter])==0)
        {
        $parser = xml_parser_create(); 

        xml_parser_set_option($parser,XML_OPTION_SKIP_WHITE,1); <o:p />
    
        xml_parser_set_option($parser,XML_OPTION_CASE_FOLDING,0);
        <o:p> </o:p>
        $data = implode("",file('dictionary/'.$letter.'.xml'));  
        xml_parse_into_struct($parser,$data,&$d_ar,&$i_ar)
            or print_error();
        $counter=0;
        for($i=0; $i<count($i_ar['WordEntry']);
            $i++) {       
             if($d_ar[$i_ar['WordEntry'][$i]]['type']=='open') {
                 for($j=$i_ar['WordEntry'][$i]; $j<$i_ar['WordEntry'][$i+1];
            $j++) {
                     if($d_ar[$j]['tag'] == 'word')
            {
                         $word = $d_ar[$j]['value'];
                     }elseif($d_ar[$j]['tag'] == 'probability')
            {
                         $probability = $d_ar[$j]['value'];
                     }
                 }
                 $editDistance=SpellDice::StringDistance($compareString,$word);
                 if($compareString==$word)
                 return $word;
                 else  
                 if($editDistance<=20)
                   $CandidateArray[$compareString][$word]=$probability*(20/$editDistance);
             }
        }
        xml_parser_free($parser);
        }
        arsort($CandidateArray[$compareString]);
        if(count($CandidateArray[$compareString])>5)
        $CandidateArray[$compareString]=array_slice($CandidateArray[$compareString],
            0, 5);
        $result=array();
        foreach ($CandidateArray[$compareString]
            as $key=> $value)
        { 
                         array_push($result,$key);
        } 
        return $result;     <o:p />
 

        }

public List<string> WordProcess(string
            word) 

                 {
                      List<string> result = new List<string>();
                     SortedDictionary<double, string>
            listOfCandidates = new SortedDictionary<double, string>(); 
                     if (word[0] <= 'z' &&
            word[0] >= 'a')
                     {
                         if (candidatesList[word[0]].Count
            == 0)
                             ReadFromXMLFile(word[0]);
                         foreach (Dictionary
            DictionaryItem in candidatesList[word[0]])
                         {
                             if
            (DictionaryItem.word == word)
                             {                           
        result.Add(word);                          
        return result;
                             }
                             else
                             {                
        int LD = Levenshtein(word, DictionaryItem.word);       
                      if (LD <= 2)                        
        {             
                double editDistance = StringDistance(word, DictionaryItem.word,
            LD);              
                                                          
                    listOfCandidates[1 / (DictionaryItem.probability * (1 / editDistance))] = DictionaryItem.word;
                                
        }                }   } <o:p />
     foreach (KeyValuePair<double,string> kvp in listOfCandidates) 
                         {
                             result.Add(kvp.Value);
                         }
                         // result.Reverse();
                         if (listOfCandidates.Count > 5)
                             result.RemoveRange(5,  listOfCandidates.Count - 5);
                     }
                     else
                         result.Add(word); <o:p />
                 return result; 
                 }

Conclusion

The performance of this algorithm is actually very good as we have some heuristic to ensure the least loops.

It needs O(n) to read the dictionary

Then O(n) to loop the dictionary

O(m) to check every word( m) is the length of words)

O(nlogn) to sort

So in totally we are looking at a O(n)+O(nlogn)+o(n*m) algorithm.

The PHP version is slightly faster than C# as it has some neat build-in functions. C# don't have a auto-sorted data structure to handle the array (SortedDictionary only sort the keys not value, and it's from low to high which is something we want in opposite way).

The program also largely depends on your dictionary. My dictionary included has 5000 English words. But it's not enough. You may want to expand the dictionary yourself. I made them 26 XML files. You can do it more yourself. This program could deal with regular spelling corrections, but I cannot guarantee it provides the same effects as Google Suggest does as they have much larger data stored and other advantages.

In all, this is a small-size fast program to help you achieve a neat feature on your website or small project. If you have any question or comments, please contact me directly.