What is levenshtein distance in python?
In this article I will go over the intuition behind how Levenshtein distance works and how to use Levenshtein distance in building a plagiarism detection pipeline. Show
Table of Contents
Introduction to Text SimilarityIdentifying similarity between text is a common problem in NLP and is used by many companies world wide. The most common application of text similarity comes from the form of identifying plagiarized text. Educational facilities ranging from elementary school, high school, college and universities all around the world use services like Turnitin to ensure that the work submitted by students is original and their own. Other applications of text similarity are commonly used by companies which have a similar structure to Stack Overflow or Stack Exchange. They want to be able to identify and flag duplicated questions so the user posting the question can be referenced to the original post with the solution. This reduces the number of duplicate questions being asked and increases the number of unique questions on their platform. Text similarity can be broken down into two components, semantic similarity and lexical similarity. Given a pair of text, the semantic similarity of the pair refers to how close the documents are in meaning. Whereas, lexical similarity is a measure of overlap in vocabulary. If both documents in the pairs have the same vocabularies, then they would have a lexical similarity of 1 and vice versa of 0 if there was no overlap in vocabularies [2]. Achieving true semantic similarity is a very difficult and unsolved task in both NLP and Mathematics. It’s a heavily researched area and a lot of the solutions proposed does involve a certain degree of lexical similarity in them. For the focus of this article, I will not dive much deeper into semantic similarity, but prioritize a lot more on lexical similarity. Levenshtein DistanceThere are many ways to identify the lexical similarities between a pair of texts, the one which we’ll be covering today is Levenshtein distance. An algorithm invented in 1965 by Vladimir Levenshtein, a Soviet mathematician [1]. IntuitionLevenshtein distance is very impactful because it does not require two strings to be of equal length for them to be compared. Intuitively speaking, Levenshtein distance is quite easy to understand.
Essentially implying that the output distance between the two is the cumulative sum of
the single-character edits. The larger the output distance implies that more changes were necessary to make the two words equal to each other, and the lower the output distance implies that fewer changes were necessary. For example, given a pair of words Thus a large value for Levenshtein distance implies that the two documents were not similar, and a small value for the distance implies that the two documents were similar. Mathematical UnderstandingThe Levenshtein distance can be mathematically represented by the following piecewise function: Image taken from Levenshtein Distance WikipediaWhere Let’s go through the following example to see how Levenshtein distance works. Example of Levenshtein distance on the strings King and Kind. Image provided by author.ImplementationThe Python code associated with implementing Levenshtein distance using dynamic programming. The same code can be implemented through a brute force and iterative solution (be aware that the brute force solution would not be optimal in terms of time complexity). Problem StatementSimilar to softwares like Turnitin, we want to build a pipeline which identifies if an input article is plagiarized. Solution ArchitectureTo solve this problem a few things will be required. Firstly, we need to get the information passed on by the user of the pipeline, for this we not only require the article that they want to check the plagiarism against but also a keyword tag which corresponds to the theme of that article. For the simplicity of this tutorial, we’ll use the initial text I’ve written for this article with the tag being Installation RequirementsPython=3.8.8 For the purposes of this pipeline, we will be using an open source package which will calculate Levenshtein distance for us. We’re going to use this package because it implements Levenshtein distance in C and is most likely faster than anything we could programatically write in Python. To download the Levenshtein package for this tutorial, you can run the following command on terminal : For the purpose of comparing the user input article to another article, we will reference Wikipedia. You are able to fetch Wikipedia text data very easily through the Wikipedia library. You can run the following command on your console : Fetch DataThe code above essentially passes the initial parts of my article as the user submitted the article under the tag of Clean DataNow we can clean out the data associated with the user submitted and wikipedia fetched article. We’ll do some simple data preprocessing on the text by lowering the text, removing stop words and punctuations. Find SimilarityNow this value might seem relatively arbitrary to you, it’s hard to determine if this value reflects that the content is plagiarized or not. The larger the value is the less likely it is to be considered plagiarized based on our understanding of Levenshtein distance. However it’s difficult to determine the threshold of what distance is not large enough. Check PlagiarismWe will check for plagiarism through a simple formulaic approach. First we get the maximum length between the user submitted article and the content. Then we check if the Levenshtein distance is less than or equal to that value multiplied by some threshold (I used 0.4), then that user submitted article can be considered plagiarized. The result of the pipeline, the initial input article was categorized as “Not Plagiarized” by the pipeline. Image provided by the author.CaveatsThere are a number of caveats to using the pipeline outlined above. 1) This pipeline does not identify which
areas are plagiarized and which areas are not, it only yields an overall score of plagiarism. Concluding RemarksLevenshtein distance is a lexical similarity measure which identifies the distance between one pair of strings. It does so by counting the number of times you would have to insert, delete or substitute a character from string 1 to make it like string 2. The larger the distance between the pair implies that the strings are not similar to each other and vice versa. I created this pipeline in a manner such that it’s easily integratable with other text similarity measures. Levenshtein distance is a great measure to use to identify lexical similarity between a pair of text, but it does not mean there aren’t other well performing similarity measures. The Jaro-Winkler score in particular comes to mind and can be easily implemented in this pipeline. Be aware that the Jaro similarity outputs a result which is interpreted differently than the Levenshtein distance. You can find documentation on other fuzzy string matching like Levenshtein distance here. You can follow through with this pipeline in the Jupyter Notebook I created for this project. You can find the notebook on my GitHub page here. Resources
If you enjoyed this article, here are a few others which you might also enjoy: What is the use of Levenshtein distance?The Levenshtein distance is a similarity measure between words. Given two words, the distance measures the number of edits needed to transform one word into another. There are three techniques that can be used for editing: Insertion.
Is Levenshtein distance NLP?The Levenshtein distance used as a metric provides a boost to accuracy of an NLP model by verifying each named entity in the entry. The vector search solution does a good job, and finds the most similar entry as defined by the vectorization.
What is the difference between edit distance and Levenshtein distance?Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
Is Levenshtein distance an algorithm?The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
|