Cosine similarity between two sentences in Python

    Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; it is the cosine of the angle between them.
    Similarity = (A · B) / (||A|| · ||B||), where A and B are vectors.
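
    For instance, with toy binary vectors A = [1, 1, 0, 1] and B = [0, 1, 1, 0]: A · B = 1, ||A|| = √3, ||B|| = √2, so Similarity = 1 / √6 ≈ 0.408.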

    Cosine similarity and the nltk toolkit module are used in this program. To execute this program, nltk must be installed on your system. To install the nltk module, follow the steps below –

    1. Open a terminal (Linux).
    2. sudo pip3 install nltk
    3. python3
    4. import nltk
    5. nltk.download('all')

    Functions used:

    nltk.tokenize: It is used for tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns a list.

    nltk.corpus: In this program, it is used to get a list of stopwords. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”).
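
    As a quick, minimal illustration of these two helpers (assuming the nltk data from the installation step above has been downloaded), a sample sentence and its stop-word-filtered tokens look like this:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # split a sample sentence into word tokens
    tokens = word_tokenize("The cat sat on the mat")
    print(tokens)   # ['The', 'cat', 'sat', 'on', 'the', 'mat']

    # keep only the tokens that are not English stop words
    sw = set(stopwords.words('english'))
    print([w for w in tokens if w.lower() not in sw])   # ['cat', 'sat', 'mat']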


    Below is the Python implementation –

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    X = "I love horror movies"
    Y = "Lights out is a horror movie"

    # tokenize both sentences into lists of words
    X_list = word_tokenize(X)
    Y_list = word_tokenize(Y)

    # English stop words to filter out
    sw = stopwords.words('english')
    l1 = []
    l2 = []

    # remove stop words from each sentence
    X_set = {w for w in X_list if w not in sw}
    Y_set = {w for w in Y_list if w not in sw}

    # form a set containing the keywords of both sentences
    rvector = X_set.union(Y_set)
    for w in rvector:
        if w in X_set:
            l1.append(1)   # binary vector for X
        else:
            l1.append(0)
        if w in Y_set:
            l2.append(1)   # binary vector for Y
        else:
            l2.append(0)

    # cosine formula: dot product divided by the product of the magnitudes
    # (for binary vectors, sum(l) equals the sum of squared components)
    c = 0
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
    print("similarity: ", cosine)

    Output:

    similarity:  0.2886751345948129
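
    After stop-word removal, the only keyword the two sentences share is "horror", so the dot product is 1. X contributes 4 keywords and Y contributes 3, giving 1 / √(4 · 3) = 1 / √12 ≈ 0.2887.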
    


    The short answer is "no, it is not possible to measure the semantic similarity of sentences in a principled way that works even remotely well". It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work. I'll very briefly summarize where we are and point you to a few publications:

    Meaning of words

    The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in question. This vector is usually chosen to capture the contexts the word can appear in. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented as [98, 1, 87], because if you were to read a very very long piece of text (a few billion words is not uncommon by today's standards), the word "cat" would appear very often in the context of "fluffy" and "eat", but not that often in the context of "red". In the same way, "dog" might be represented as [87, 2, 34] and "umbrella" might be [1, 13, 0]. Imagining these vectors as points in 3D space, "cat" is clearly closer to "dog" than it is to "umbrella", therefore "cat" also means something more similar to "dog" than to an "umbrella".
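
    To make the geometry concrete, here is a minimal sketch that compares those toy context-count vectors with cosine similarity (the numbers are the illustrative counts from the paragraph above, not real corpus statistics):

    import numpy as np

    def cos_sim(a, b):
        # cosine of the angle between two vectors: (a . b) / (||a|| * ||b||)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy context counts over the contexts ("eat", "red", "fluffy")
    cat = np.array([98, 1, 87])
    dog = np.array([87, 2, 34])
    umbrella = np.array([1, 13, 0])

    print(cos_sim(cat, dog))        # ~0.94: "cat" and "dog" share contexts
    print(cos_sim(cat, umbrella))   # ~0.06: almost no shared contexts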

    This line of work has been investigated since the early 90s (e.g. this work by Grefenstette) and has yielded some surprisingly good results. For example, here are a few random entries in a thesaurus I built recently by having my computer read Wikipedia:

    theory -> analysis, concept, approach, idea, method
    voice -> vocal, tone, sound, melody, singing
    james -> william, john, thomas, robert, george, charles
    

    These lists of similar words were obtained entirely without human intervention: you feed text in and come back a few hours later.

    The problem with phrases

    You might ask why we are not doing the same thing for longer phrases, such as "ginger foxes love fruit". It's because we do not have enough text. In order for us to reliably establish what X is similar to, we need to see many examples of X being used in context. When X is a single word like "voice", this is not too hard. However, as X gets longer, the chances of finding natural occurrences of X get exponentially smaller. For comparison, Google has about 1B pages containing the word "fox" and not a single page containing "ginger foxes love fruit", despite the fact that it is a perfectly valid English sentence and we all understand what it means.

    Composition

    To tackle the problem of data sparsity, we want to perform composition, i.e. to take vectors for words, which are easy to obtain from real text, and to put them together in a way that captures their meaning. The bad news is that nobody has been able to do that well so far.

    The simplest and most obvious way is to add or multiply the individual word vectors together. This has the undesirable side effect that "cats chase dogs" and "dogs chase cats" would mean the same to your system. Also, if you are multiplying, you have to be extra careful or every sentence will end up represented by [0, 0, 0, ..., 0], which defeats the point.
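
    A minimal sketch of additive composition makes the word-order problem visible (the word vectors below are made-up values standing in for real ones):

    import numpy as np

    # hypothetical pre-trained word vectors (toy values for illustration)
    vec = {
        "cats": np.array([1.0, 0.2, 0.0]),
        "chase": np.array([0.1, 0.9, 0.3]),
        "dogs": np.array([0.8, 0.1, 0.2]),
    }

    def compose_add(words):
        # additive composition: the phrase vector is the sum of its word vectors
        return sum(vec[w] for w in words)

    a = compose_add(["cats", "chase", "dogs"])
    b = compose_add(["dogs", "chase", "cats"])
    print(np.array_equal(a, b))   # True: addition is blind to word order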

    Further reading

    I will not discuss the more sophisticated methods for composition that have been proposed so far. I suggest you read Katrin Erk's "Vector space models of word meaning and phrase meaning: a survey". This is a very good high-level survey to get you started. Unfortunately, it is not freely available on the publisher's website; email the author directly to get a copy. In that paper you will find references to many more concrete methods. The more comprehensible ones are by Mitchell and Lapata (2008) and Baroni and Zamparelli (2010).


    Edit after comment by @vpekar: The bottom line of this answer is to stress the fact that while naive methods do exist (e.g. addition, multiplication, surface similarity, etc.), these are fundamentally flawed and in general one should not expect great performance from them.

    How do you find the cosine similarity between two sentences in Python?

    Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; it is the cosine of the angle between them. Similarity = (A · B) / (||A|| · ||B||), where A and B are vectors.

    How do you find the similarity between two sentences?

    The logic is this:

    1. Take a sentence and convert it into a vector.
    2. Take many other sentences and convert them into vectors.
    3. Find the sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them.
    4. We now have a measure of semantic similarity between sentences (see the sketch below). Easy!
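
    A minimal, runnable sketch of that pipeline, using a toy vocabulary-count vector in place of a real sentence encoder (the vocabulary and sentences here are invented for illustration):

    import numpy as np

    # toy vocabulary-count "embedding"; a real sentence encoder would
    # replace this (the vocabulary here is an invented assumption)
    VOCAB = ["love", "horror", "movie", "movies", "lights", "comedy"]

    def embed(sentence):
        words = sentence.lower().split()
        return np.array([words.count(v) for v in VOCAB], dtype=float)

    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    query = "I love horror movies"
    candidates = ["Lights out is a horror movie", "I love comedy movies"]

    # rank the candidate sentences by cosine similarity to the query
    for s in sorted(candidates, key=lambda s: cos_sim(embed(query), embed(s)), reverse=True):
        print(round(cos_sim(embed(query), embed(s)), 3), s)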

    How do you define cosine similarity in Python?

    Python example:

    import numpy as np

    def cosine_similarity(x, y):
        # the two vectors (numpy arrays) must have the same length
        if len(x) != len(y):
            return None
        dot_product = np.dot(x, y)
        magnitude_x = np.sqrt(np.sum(x**2))
        magnitude_y = np.sqrt(np.sum(y**2))
        cosine_similarity = dot_product / (magnitude_x * magnitude_y)
        return cosine_similarity

    How do you find the similarity between two text files?

    The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector. This is the vector that's the average of all the word vectors in the document.
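
    As a brief sketch of that centroid approach (the toy word embeddings below are invented; in practice pre-trained vectors such as word2vec or GloVe would supply them):

    import numpy as np

    # toy word embeddings (made-up values standing in for real ones)
    vec = {
        "cat": np.array([0.9, 0.1]),
        "dog": np.array([0.8, 0.2]),
        "rain": np.array([0.1, 0.9]),
    }

    def centroid(doc):
        # document centroid: the average of the word vectors in the document
        vectors = [vec[w] for w in doc.split() if w in vec]
        return np.mean(vectors, axis=0)

    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    doc1 = "cat dog"
    doc2 = "dog rain"
    print(cos_sim(centroid(doc1), centroid(doc2)))   # ~0.76 for these toy values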