Guide: spaCy synonyms in Python

If you have a look at the semantic relatedness produced by this model: http://sense2vec.spacy.io , would these results be sufficient for you?

We don't have this integrated into spaCy yet. But that's the plan. For now you could use the built-in word vectors. The following function is relatively slow. You should probably iterate over the vocab and cache all the results.

>>> def most_similar(word):
...     by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
...     return [w.orth_ for w in by_similarity[:10]]
... 
>>> most_similar(nlp.vocab[u'dog'])
[u'dog', u'Dog', u'DOG', u'DoG', u'doG', u'cat', u'Cat', u'CAT', u'dogs', u'Dogs']
>>> most_similar(nlp.vocab[u'scrape'])
[u'scrape', u'Scrape', u'SCRAPE', u'rustle', u'Rustle', u'RUSTLE', u'gouge', u'Gouge', u'GOUGE', u'gnaw']

Looking at these results, it'd be nice to make it a bit smarter about casing (only keep candidates with the same casing as the query). We should also exclude rare terms:

>>> def most_similar(word):
...     queries = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
...     by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
...     return by_similarity[:10]
... 
>>> [w.lower_ for w in most_similar(nlp.vocab[u'dog'])]
[u'dog', u'cat', u'dogs', u'dachshund', u'pig', u'hamster', u'goat', u'rabbit', u'chimp', u'llama']
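
If you need to look up many words, the caching advice above applies: rather than re-sorting the vocab on every call, you can memoize the results. Here's a minimal sketch (my addition, not from the original answer), assuming nlp and the refined most_similar above are already defined:

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_most_similar(text):
    # nlp and most_similar are assumed to be defined as above;
    # results are returned as a tuple so they can be cached safely
    return tuple(w.lower_ for w in most_similar(nlp.vocab[text]))

The first call for a given word still pays the full cost of scanning the vocab; repeated calls are just dictionary lookups.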

Finally, you can also consider the Brown clusters as a way to speed up the search:

>>> nlp.vocab[u'dog'].cluster
37
>>> nlp.vocab[u'cat'].cluster
37
>>> nlp.vocab[u'imagination'].cluster
1893
>>> nlp.vocab[u'always'].cluster
15994
>>> nlp.vocab[u'goat'].cluster
57
>>> nlp.vocab[u'pig'].cluster
121

Try restricting the candidates to words whose Brown cluster is within some distance of the word you're looking for. I haven't tried this, but it should work pretty well.
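
Here's a rough, untested sketch of that idea, assuming a model that actually populates lex.cluster (older spaCy models did). It treats numerically close cluster IDs as "nearby", which is only a heuristic, and the distance threshold is an arbitrary illustration:

def most_similar_clustered(word, max_cluster_distance=100, n=10):
    # Only consider lexemes whose Brown cluster ID is numerically close to
    # the query's; this shrinks the candidate set before the expensive
    # similarity sort.
    candidates = [
        w for w in word.vocab
        if w.is_lower == word.is_lower
        and w.prob >= -15
        and abs(w.cluster - word.cluster) <= max_cluster_distance
    ]
    by_similarity = sorted(candidates, key=lambda w: word.similarity(w), reverse=True)
    return by_similarity[:n]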

I want to find synonyms of words.

For example, if the phrase is "tall building", I want to find all similar phrases, like "long apartment", "large building", etc.

I used spaCy.

import en_core_web_sm
nlp = en_core_web_sm.load()

# inside a loop over my tokens:
for i in range(len(mytokens)):
    nlp('tall building').similarity(nlp(mytokens[i]))

I can't use this because it takes a lot of time.

I can't use PhraseMatcher for this either.

Please help me.

Thanks in advance.

asked May 3, 2020 at 11:52

You could try using Beautiful Soup to parse data from an online thesaurus, or use a Python module such as py-thesaurus (https://pypi.org/project/py-thesaurus/):

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from urllib.error import HTTPError


def find_synonym(string):
    """ Function to find synonyms for a string"""


    try:

        # Remove whitespace before and after word and use underscore between words
        stripped_string = string.strip()
        fixed_string = stripped_string.replace(" ", "_")
        print(f"{fixed_string}:")

        # Set the url using the amended string
        my_url = f'https://thesaurus.plus/thesaurus/{fixed_string}'
        # Open and read the HTML
        uClient = uReq(my_url)
        page_html = uClient.read()
        uClient.close()

        # Parse the html into text
        page_soup = soup(page_html, "html.parser")
        word_boxes = page_soup.find("ul", {"class": "list paper"})
        results = word_boxes.find_all("div", "list_item")

        # Iterate over results and print
        for result in results:
            print(result.text)

    except HTTPError:
        if "_" in fixed_string:
            print("Phrase not found! Please try a different phrase.")

        else:
            print("Word not found! Please try a different word.")


if __name__ == "__main__":
    find_synonym("hello ")

answered May 3, 2020 at 12:31

steve2020


So it's a little hard to tell from your example, but it looks like you're creating a new spaCy doc in every iteration of your loop, which will be slow. You should do something like this instead:

import spacy
nlp = spacy.load('en')

query = nlp('tall building')
for token in mytokens:
    query.similarity(nlp(token))

This way spaCy only has to create the query doc once.
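
If you also want to avoid calling nlp() once per candidate inside the loop, you could batch the candidates with nlp.pipe. A small sketch (my addition, assuming mytokens is the asker's list of strings):

query = nlp('tall building')

# nlp.pipe processes the strings as a stream, which is faster than
# calling nlp() separately for each one
scores = [query.similarity(doc) for doc in nlp.pipe(mytokens)]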

If you want to make repeated queries, you should put the vector for each doc into Annoy or a similar approximate nearest-neighbour index to get the most similar docs quickly.
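
As a rough illustration of that Annoy suggestion (not part of the original answer; it assumes mytokens is a non-empty list of strings, that the model provides word vectors, and the parameter values are just starting points):

from annoy import AnnoyIndex

docs = list(nlp.pipe(mytokens))
dim = docs[0].vector.shape[0]

# Build an approximate nearest-neighbour index over the doc vectors.
index = AnnoyIndex(dim, 'angular')   # angular distance ~ cosine similarity
for i, doc in enumerate(docs):
    index.add_item(i, doc.vector)
index.build(10)                      # number of trees; more trees = better recall

# Repeated queries are now fast lookups against the index.
query = nlp('tall building')
nearest_ids = index.get_nns_by_vector(query.vector, 5)
print([mytokens[i] for i in nearest_ids])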

Also, I generally wouldn't call this finding "synonyms" since every example you gave is multiple words. You're really looking for similar phrases. "Synonyms" would usually imply single words, like you'd find in a thesaurus, but that won't help you here.

answered May 4, 2020 at 7:43

polm23

