Hướng dẫn soundex python

Is there a soundex function for python and if not how would you go about making a soundex code?

Soundex
Code    Letters 
1   B, F, P, V  
2   C, G, J, K, Q, S, X, Z  
3   D, T    
4   L   
5   M, N    
6   R   
SKIP   A, E, H, I, O, U, W, Y, H, W, and Y

For example:

Jackson = J250

Washington = W252

Clement = C455

Ashcraft = A261

Wu = W000

In this article, we will cover word similarity matching using the Soundex algorithm in Python.

What is Soundex Algorithm and how does it work?

Soundex is a phonetic algorithm that can locate phrases with similar sounds. A Soundex search method takes a word as input, such as a person’s name, and outputs a character string that identifies a group of words that are (roughly) phonetically similar or sound (approximately) the same.

What are stemming and lemmatization in NLP?

Stemming and lemmatization, are two methods used in Natural Language Processing (NLP) to standardize text and get words or documents ready for some more machine learning processing. 

Though these are two of the most popular canonicalization techniques, they happen to have certain limitations, our task is to reduce the words to their root form for pre-processing and this technique is known as canonicalization. Canonicalization is a method for transforming data with many representations into a standard or normal form. In this article, we’ll try to understand these limitations and how phonetic hashing comes to the rescue.

What is the difference between stemming and lemmatization?

Stemming is a canonicalization technique that attempts to reduce a word down to its root form by dropping its affixes. Lemmatization is another canonicalization technique, that tries to map a word to its lemma – thus, for correct results using lemmatization, the words must be spelled correctly in the corpus.

Limitations of Stemming and Lemmatization

A significant issue occurs in both methods when dealing with words with multiple variants of spellings due to different pronunciations. For example, the words ‘Colour’ and ‘Color’ might be treated differently by a stemmer, although they both mean the same thing. Likewise, ‘traveling’ and ‘traveling’ would give rise to two stems/lemmas despite being the variations of the same word. 

Phonetic Hashing

Phonetic hashing is a technique used to canonicalize words that have the same phonetic characteristics. As a result of phonetic hashing, each word is assigned a hash code based on its phoneme, which is the smallest unit of sound. Thus, the words that are phonetic variants of each other tend to have the same code. Phonetic hashing is achieved using Soundex algorithms which are available for different languages. In this section, we will explore the 

American Soundex Algorithm

American Soundex Algorithm, which performs Phonetic Hashing on English language word, i.e. it takes a word in English and generates its hash code.

The complete American Soundex algorithm to find Soundex code is as below:

Step 1: Retain the Initial Letter

This step is grounded upon the reasoning that in the English language, the first letter of any word determines its pronunciation and is imperative to the word’s comprehensibility. In this step, we retain the first letter of the word as it is in the hash code.

Step 2: Encode the Consonants

This step is based on phonetic studies in the English language. It tries to replace the consonants with the same phonetic characteristics with the same letter. The encoding is done as follows:

  • b, f, p, v → 1 
  • c, g, j, k, q, s, x, z → 2 
  • d, t → 3 
  • l → 4 
  • m, n → 5
  • r → 6 
  • h, w, y → Not Coded

Step 3: Drop the Vowels

This step is done as vowels in English do not contribute much to the phonetic distinction of the word. Thus, all the vowels are directly dropped from the word.

Step 4: Make the Code Length 4

The last step is based upon the standard that hash codes must be of length 4. In order to make the code length four we do the following:

  1. If the length is less than 4, pad zeros in the end.
  2. If the length exceeds 4, truncate the hash code after position 4.

Example of American Soundex Algorithm 

For example, Suppose we have to find the hash code of the word Bangalore.

Step 1: Retain the first letter

Hash Code: B

Step 2: Encode the Consonants

Hash Code:[n →5, g →2, l →4, r →6] = Ba524o6e

Step 3: Now, drop all the Vowels

Hash Code: B5246

Step 4: Make the Code Length 4 

Hash Code: B524
[Truncating everything after position 4]

Step 5: Final Output

Hash Code of Bangalore: B524

Likewise, if we calculate the hashcode of Bengaluru (a phonetic variant of Bangalore), it turns out to be the same, i.e. B524

Code Implementation

Example 1:

Let us now apply the American Soundex algorithm in Python. We create a function soundex_generator that generates the hash code of any word input to it. We can see that the below code gives the Soundex for Bangalore as B524, which is the same as the one we calculated manually. 

Python3

def soundex_generator(token):

    token = token.upper()

    soundex = ""

    soundex += token[0]

    dictionary = {"BFPV": "1", "CGJKQSXZ": "2",

                  "DT": "3",

                  "L": "4", "MN": "5", "R": "6",

                  "AEIOUHWY": "."}

    for char in token[1:]:

        for key in dictionary.keys():

            if char in key:

                code = dictionary[key]

                if code != '.':

                    if code != soundex[-1]:

                        soundex += code

    soundex = soundex[:7].ljust(7, "0")

    return soundex

print(soundex_generator('Bangalore'))

print(soundex_generator('GeeksforGeeks'))

Output:

B524600

Example 2:

In this example, we are passing GeeksforGeeks as an argument that gives output as G216200. Here, we Trim or Pad to make Soundex a 7-character code.

Python3

def soundex_generator(token):

    token = token.upper()

    soundex = ""

    soundex += token[0]

    dictionary = {"BFPV": "1", "CGJKQSXZ": "2",

                  "DT": "3",

                  "L": "4", "MN": "5", "R": "6",

                  "AEIOUHWY": "."}

    for char in token[1:]:

        for key in dictionary.keys():

            if char in key:

                code = dictionary[key]

                if code != '.':

                    if code != soundex[-1]:

                        soundex += code

    soundex = soundex[:7].ljust(7, "0")

    return soundex

print(soundex_generator('GeeksforGeeks'))

Output:

G216200