What is Levenshtein distance in Python?

In this article, I will go over the intuition behind how Levenshtein distance works and how to use it to build a plagiarism detection pipeline.

Table of Contents

  • Introduction to Text Similarity
  • Levenshtein Distance
    - Intuition
    - Mathematical Understanding
    - Python Implementation
  • Problem Statement
    - Solution Architecture
    - Installation Requirements
  • Fetch Data
  • Clean Data
  • Find Similarity
  • Check Plagiarism
  • Caveats
  • Concluding Remarks
  • Resources

Introduction to Text Similarity

Identifying similarity between texts is a common problem in NLP and is used by many companies worldwide. The most common application of text similarity comes in the form of identifying plagiarized text. Educational institutions ranging from elementary schools to high schools, colleges, and universities all around the world use services like Turnitin to ensure that the work submitted by students is original and their own. Text similarity is also commonly used by companies with a structure similar to Stack Overflow or Stack Exchange. They want to be able to identify and flag duplicated questions so that the user posting a question can be referred to the original post with the solution. This reduces the number of duplicate questions being asked and increases the number of unique questions on their platform.

Text similarity can be broken down into two components: semantic similarity and lexical similarity. Given a pair of texts, semantic similarity refers to how close the documents are in meaning, whereas lexical similarity is a measure of the overlap in vocabulary. If both documents in the pair have the same vocabulary, they have a lexical similarity of 1; if there is no overlap in vocabulary, the lexical similarity is 0 [2].

Achieving true semantic similarity is a very difficult and unsolved task in both NLP and mathematics. It’s a heavily researched area, and many of the proposed solutions involve a certain degree of lexical similarity. For the focus of this article, I will not dive much deeper into semantic similarity and will instead concentrate on lexical similarity.

Levenshtein Distance

There are many ways to identify the lexical similarity between a pair of texts; the one we’ll be covering today is Levenshtein distance, an algorithm developed in 1965 by Vladimir Levenshtein, a Soviet mathematician [1].

Intuition

Levenshtein distance is particularly useful because it does not require two strings to be of equal length for them to be compared. Intuitively speaking, it is quite easy to understand.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. [1]
- https://en.wikipedia.org/wiki/Levenshtein_distance

Essentially, the output distance is the cumulative count of single-character edits. A larger distance means more changes were necessary to make the two words equal, while a smaller distance means fewer changes were necessary. For example, given the pair of words dream and dream, the resulting Levenshtein distance would be 0 because the two words are the same. However, if the words were dream and steam, the Levenshtein distance would be 2, as you would need to make 2 edits to change dr to st.

Thus, a large Levenshtein distance implies that the two documents are not similar, and a small distance implies that they are.

Mathematical Understanding

The Levenshtein distance can be mathematically represented by the following piecewise function:

$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big) & \text{if } a[0] = b[0], \\
1 + \min
\begin{cases}
\operatorname{lev}\big(\operatorname{tail}(a), b\big) \\
\operatorname{lev}\big(a, \operatorname{tail}(b)\big) \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big)
\end{cases} & \text{otherwise.}
\end{cases}
$$

Definition taken from the Levenshtein distance Wikipedia page [1].

Here, a and b correspond to the two input strings, and |a| and |b| are the lengths of each respective string. The tail of a string is the string with its first character removed, and a[0] and b[0] denote the first characters of a and b (the characters at index 0).
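To make the recurrence concrete, here is a direct (and deliberately naive) transcription of the piecewise definition into Python. This is a sketch for illustration only: it runs in exponential time, so the dynamic-programming version shown later should be preferred for anything beyond short strings.

```python
def lev(a: str, b: str) -> int:
    """Naive recursive Levenshtein distance, mirroring the piecewise definition."""
    if len(b) == 0:          # b is empty: delete every remaining character of a
        return len(a)
    if len(a) == 0:          # a is empty: insert every remaining character of b
        return len(b)
    if a[0] == b[0]:         # first characters match: no edit needed
        return lev(a[1:], b[1:])
    return 1 + min(
        lev(a[1:], b),       # deletion from a
        lev(a, b[1:]),       # insertion into a
        lev(a[1:], b[1:]),   # substitution
    )


print(lev("dream", "steam"))  # 2
```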

Let’s go through the following example to see how Levenshtein distance works.

[Image: example of Levenshtein distance on the strings King and Kind, provided by the author.]

Python Implementation

The Python code below implements Levenshtein distance using dynamic programming. The same algorithm can also be written as a brute-force or iterative solution (be aware that the brute-force solution is not optimal in terms of time complexity).
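Since the original embedded snippet is not reproduced here, below is a minimal sketch of such a dynamic-programming implementation; the function name is my own choice rather than the author's original code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Compute the Levenshtein distance between a and b with a full DP table."""
    m, n = len(a), len(b)
    # dp[i][j] holds the distance between the first i characters of a
    # and the first j characters of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i                      # delete all i characters of a
    for j in range(1, n + 1):
        dp[0][j] = j                      # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or free match)
            )
    return dp[m][n]


print(levenshtein_distance("king", "kind"))    # 1
print(levenshtein_distance("dream", "steam"))  # 2
```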

Problem Statement

Similar to software like Turnitin, we want to build a pipeline that identifies whether an input article is plagiarized.

Solution Architecture

To solve this problem, a few things will be required. First, we need the information passed in by the user of the pipeline: not only the article they want to check for plagiarism, but also a keyword tag corresponding to the theme of that article. For the simplicity of this tutorial, we’ll use the initial text I’ve written for this article, with the tag being Levenshtein Distance. Second, we need a large corpus of documents to compare the user input text against. We can leverage the Wikipedia-API to get access to Wikipedia articles associated with the tag of the user input data. We then clean the user input document of redundancies like stopwords and punctuation to better optimize the calculation of Levenshtein distance. Finally, we compare this cleaned document against each document in our corpus under the same tag and identify whether any of them is very similar to the user-submitted document.

[Image: the solution architecture described above, provided by the author.]

Installation Requirements

Python=3.8.8
python-Levenshtein=0.12.2
nltk=3.6.1
numpy=1.20.1
Wikipedia-API=0.5.4

For the purposes of this pipeline, we will be using an open-source package which calculates Levenshtein distance for us. We’re going to use this package because it implements Levenshtein distance in C and is most likely faster than anything we could programmatically write in pure Python.

To download the Levenshtein package for this tutorial, you can run the following command in your terminal: pip install python-Levenshtein, as per the installation guide found here [3]. Or you can clone the GitHub repository associated with the package and use it accordingly.

For the purpose of comparing the user input article to other articles, we will reference Wikipedia. You can fetch Wikipedia text data very easily through the wikipedia library by running the following command in your console: pip install wikipedia, as per the installation guide found here. Or you can clone the GitHub repository associated with the package and use it accordingly.
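As a quick sanity check that both packages installed correctly (a sketch; the wikipedia call requires an internet connection):

```python
import Levenshtein  # from the python-Levenshtein package
import wikipedia    # from the wikipedia package

# One substitution (g -> d) turns "king" into "kind".
print(Levenshtein.distance("king", "kind"))          # 1
print(wikipedia.page("Levenshtein distance").title)  # Levenshtein distance
```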

Fetch Data

This step passes the initial parts of my article as the user-submitted article under the tag Levenshtein Distance. It also fetches the Wikipedia page for Levenshtein distance through the wikipedia module; a minimal sketch of this step is shown below.
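A sketch of the fetch step, assuming the wikipedia package mentioned above; the variable names (user_article, tag, wiki_content) are my own placeholders, not the author's original code.

```python
import wikipedia

# Placeholder for the user-submitted article (the opening paragraphs of this post).
user_article = """Identifying similarity between texts is a common problem in NLP
and is used by many companies worldwide..."""
tag = "Levenshtein Distance"

# Fetch the Wikipedia page associated with the tag and keep its plain-text content.
wiki_page = wikipedia.page(tag)
wiki_content = wiki_page.content
print(wiki_page.title)
```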

Clean Data

Now we can clean the data associated with the user-submitted and Wikipedia-fetched articles. We’ll do some simple preprocessing on the text: lowercasing it and removing stop words and punctuation.
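A sketch of the cleaning step using nltk stop words, building on the placeholder variables from the fetch sketch; the helper name is my own.

```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stop word list
STOP_WORDS = set(stopwords.words("english"))


def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop English stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(token for token in text.split() if token not in STOP_WORDS)


# Clean both documents from the fetch step (placeholder names from that sketch).
user_clean = clean_text(user_article)
wiki_clean = clean_text(wiki_content)
print(user_clean[:80])
```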

Find Similarity
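A sketch of the similarity step, assuming the cleaned strings user_clean and wiki_clean from the previous step (placeholder names):

```python
import Levenshtein

# user_clean and wiki_clean are the cleaned strings produced in the Clean Data step.
distance = Levenshtein.distance(user_clean, wiki_clean)
print(f"Levenshtein distance between the two documents: {distance}")
```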

[Image: the Levenshtein distance between the user input article and the Wikipedia content fetched for the tag Levenshtein Distance, provided by the author.]

Now, this value might seem relatively arbitrary to you; it’s hard to determine whether it reflects plagiarized content or not. Based on our understanding of Levenshtein distance, the larger the value, the less likely the content is to be considered plagiarized. However, it’s difficult to determine the threshold below which a distance is small enough to count as plagiarism.

Check Plagiarism

We will check for plagiarism through a simple formulaic approach. First, we take the maximum of the lengths of the user-submitted article and the reference content. If the Levenshtein distance is less than or equal to that length multiplied by some threshold (I used 0.4), the user-submitted article is considered plagiarized.
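A sketch of this check, reusing the placeholder names from the earlier steps and the 0.4 threshold described above:

```python
import Levenshtein


def is_plagiarized(user_text: str, source_text: str, threshold: float = 0.4) -> bool:
    """Flag the user text as plagiarized when the Levenshtein distance is small
    relative to the length of the longer of the two documents."""
    max_len = max(len(user_text), len(source_text))
    return Levenshtein.distance(user_text, source_text) <= max_len * threshold


verdict = "Plagiarized" if is_plagiarized(user_clean, wiki_clean) else "Not Plagiarized"
print(verdict)
```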

[Image: the result of the pipeline; the initial input article was categorized as “Not Plagiarized”, provided by the author.]

Caveats

There are a number of caveats to using the pipeline outlined above.

1) This pipeline does not identify which areas are plagiarized and which areas are not, it only yields an overall score of plagiarism.
2) The process does not account for properly cited and quoted pieces of text. This would misleadingly increase the overall plagiarism score.
3) It’s difficult to determine a threshold for how small or large a distance should be to be considered plagiarism.

Concluding Remarks

Levenshtein distance is a lexical similarity measure which identifies the distance between a pair of strings. It does so by counting the number of times you would have to insert, delete, or substitute a character in string 1 to turn it into string 2. A larger distance implies that the strings are less similar to each other, and vice versa.

I created this pipeline in a manner that makes it easy to integrate with other text similarity measures. Levenshtein distance is a great measure for identifying lexical similarity between a pair of texts, but it is not the only well-performing similarity measure. The Jaro-Winkler score in particular comes to mind and can easily be implemented in this pipeline. Be aware that the Jaro similarity outputs a result which is interpreted differently from the Levenshtein distance. You can find documentation on other fuzzy string matching measures like Levenshtein distance here.
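For reference, the python-Levenshtein package also exposes a Jaro-Winkler function; note that it returns a similarity between 0 and 1 (1 means identical), which reads in the opposite direction to a distance:

```python
import Levenshtein

print(Levenshtein.distance("king", "kind"))      # 1 (lower means more similar)
print(Levenshtein.jaro_winkler("king", "kind"))  # ~0.88 (higher means more similar)
```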

You can follow along with this pipeline in the Jupyter Notebook I created for this project. You can find the notebook on my GitHub page here.

Resources

  • [1] https://en.wikipedia.org/wiki/Levenshtein_distance
  • [2] https://en.wikipedia.org/wiki/Lexical_similarity
  • [3] https://pypi.org/project/python-Levenshtein/
