How do you find duplicate words in a paragraph in Python?

I can see where you are going with sort: once the words are sorted, you can reliably tell when you have hit a new word and keep a running count for each unique word. However, what you really want is a hash (dictionary) to keep track of the counts, since dictionary keys are unique. For example:

words = sentence.split()
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

That gives you a dictionary where the key is the word and the value is the number of times it appears. You can simplify this with collections.defaultdict(int), which starts every missing key at 0, so you can just add to the value:

counts = collections.defaultdict(int)
for word in words:
    counts[word] += 1

But there is something even better than that: collections.Counter, which takes your list of words and turns it into a dictionary (a subclass of dict, actually) containing the counts.

counts = collections.Counter(words)

From there you want the words in sorted order with their counts so you can print them. items() gives you the (word, count) pairs as tuples, and sorted() sorts (by default) by the first item of each tuple (the word in this case), which is exactly what you want. Note that the comparison is case-sensitive, so "As" and "as" are counted separately and capitalized words sort first.

import collections
sentence = """As far as the laws of mathematics refer to reality they are not certain as far as they are certain they do not refer to reality"""
words = sentence.split()
word_counts = collections.Counter(words)
for word, count in sorted(word_counts.items()):
    print('"%s" is repeated %d time%s.' % (word, count, "s" if count > 1 else ""))

OUTPUT

"As" is repeated 1 time.
"are" is repeated 2 times.
"as" is repeated 3 times.
"certain" is repeated 2 times.
"do" is repeated 1 time.
"far" is repeated 2 times.
"laws" is repeated 1 time.
"mathematics" is repeated 1 time.
"not" is repeated 2 times.
"of" is repeated 1 time.
"reality" is repeated 2 times.
"refer" is repeated 2 times.
"the" is repeated 1 time.
"they" is repeated 3 times.
"to" is repeated 2 times.

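The loop above prints every word, including those that occur only once. If only the duplicates are wanted, one possible sketch is to filter the Counter before printing. The name `duplicates` is introduced here for illustration and is not part of the original answer.

```python
import collections

sentence = ("As far as the laws of mathematics refer to reality they are not "
            "certain as far as they are certain they do not refer to reality")

word_counts = collections.Counter(sentence.split())

# Keep only the words that occur more than once (hypothetical helper name)
duplicates = {word: count for word, count in word_counts.items() if count > 1}

for word, count in sorted(duplicates.items()):
    print('"%s" is repeated %d times.' % (word, count))
```

Because the comparison is still case-sensitive, "As" (1 occurrence) is not merged with "as" (3 occurrences); lower-casing the sentence first would change that.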

    Prerequisite: Dictionary data structure

    Given a string, find the first repeated word in it. Examples:

    Input : "Ravi had been saying that he had been there"
    Output : had
     
    Input : "Ravi had been saying that"
    Output : No Repetition
    
    Input : "he had had he"
    Output : he

    An existing solution for this problem is described in "Find the first repeated word in a string". In Python, the problem can be solved quickly with the dictionary data structure. The approach is simple:

    1. First, split the given string on spaces.
    2. Convert the list of words into a dictionary using collections.Counter(iterable). The dictionary contains each word as a key and its frequency as the value.
    3. Traverse the list of words again and report the first word whose frequency is greater than 1.

    Python3

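    from collections import Counter

    def firstRepeat(text):
        # Split on single spaces, as described in step 1
        words = text.split(' ')
        # Map each word to the number of times it occurs
        counts = Counter(words)
        # Report the first word, in order of appearance, that occurs more than once
        for word in words:
            if counts[word] > 1:
                print(word)
                return
        # No word occurs twice, as in the second example above
        print('No Repetition')

    if __name__ == "__main__":
        text = 'Ravi had been saying that he had been there'
        firstRepeat(text)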

    Output:

    had

    Time Complexity: O(n), where n is the number of words

    Auxiliary Space: O(k), where k is the number of distinct words stored in the dictionary

    Often we need to analyse a text only in terms of the unique words it contains, so the duplicate words have to be eliminated. This can be done by combining the word tokenization available in nltk with Python's built-in set type.

    Without preserving the order

    In the example below we first tokenize the sentence into words. Then we apply the set() function, which creates an unordered collection of unique elements. The result contains the unique words in no particular order.

    import nltk
    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First, word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    # Applying set() drops the duplicates (and the original order)
    no_order = list(set(nltk_tokens))

    print(no_order)

    When we run the above program, we get output like the following (the order of the words may differ between runs) −

    ['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']
    

    Preserving the Order

    To remove the duplicates while still preserving the order of the words in the sentence, we read the words one by one and append each word to a list the first time we see it.

    import nltk
    word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

    # First, word tokenization
    nltk_tokens = nltk.word_tokenize(word_data)

    ordered_tokens = set()   # words seen so far, for fast membership tests
    result = []              # unique words in order of first appearance
    for word in nltk_tokens:
        if word not in ordered_tokens:
            ordered_tokens.add(word)
            result.append(word)

    print(result)

    When we run the above program, we get the following output −

    ['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
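The same order-preserving idea can be sketched without an explicit loop, relying on the fact that a dict keeps its keys in insertion order (guaranteed since Python 3.7). This is only an illustration: it assumes plain whitespace splitting instead of nltk.word_tokenize, so the final full stop stays attached to "colour." rather than becoming its own token.

```python
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

# Assumption: str.split() stands in for nltk.word_tokenize here
tokens = word_data.split()

# dict.fromkeys keeps the first occurrence of each key, in insertion order
result = list(dict.fromkeys(tokens))

print(result)
```

Like the loop version, this treats "The" and "the" as different words.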
    

    How do I remove repetitive words in Python?

    1. Split the input sentence on spaces into words.
    2. If the input is a list of strings, first join the strings together so that all the words can be processed as one.
    3. Create a dictionary with the Counter method, with the words as keys and their frequencies as values.
    4. Join the keys, each of which is unique, to form a single string.
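The steps above can be sketched roughly as follows for a single input sentence. The function name `remove_duplicate_words` is made up for this example, and the sketch relies on Counter (a dict subclass) keeping its keys in first-insertion order, which holds on Python 3.7 and later.

```python
from collections import Counter

def remove_duplicate_words(sentence):
    # Step 1: split the sentence into words on whitespace
    words = sentence.split()
    # Step 3: words become keys, frequencies become values
    counts = Counter(words)
    # Step 4: iterating a Counter yields each unique key once
    return " ".join(counts)

print(remove_duplicate_words("the sky is blue the ocean is blue"))
```

Since only the keys are used in the end, the frequencies are not strictly needed; dict.fromkeys(words) would serve the same purpose.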

    How do you print repeated letters in a string in Python?

    Method 2:
    1. Define a function that takes a word and the values m and n as arguments.
    2. If m is greater than the length of the word, set m equal to the length of the word.
    3. Store the characters to be repeated in a string named repeat_string, using slicing.
    4. Multiply repeat_string by n.
    5. Print the string.
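A minimal sketch of this method is shown below. The steps do not say which characters are sliced, so the sketch assumes it is the first m characters of the word; the function name `repeat_prefix` is likewise an assumption made for illustration.

```python
def repeat_prefix(word, m, n):
    # Clamp m to the length of the word
    m = min(m, len(word))
    # Assumption: the characters to repeat are the first m characters
    repeat_string = word[:m]
    # Multiplying a string by n repeats it n times
    return repeat_string * n

print(repeat_prefix("python", 2, 3))
```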

    How do you find duplicate words in a paragraph in Java?

    ALGORITHM
    STEP 1: START
    STEP 2: DEFINE String string = "Big black bug bit a big black dog on his big black nose"
    STEP 3: DEFINE count
    STEP 4: CONVERT string into lower-case
    STEP 5: INITIALIZE words[] to SPLIT the string
    STEP 6: PRINT "Duplicate words in a given string:"
    STEP 7: SET i=0. ...
    STEP 8: SET count = 1
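The listed steps break off before the counting loop is complete. For comparison, the visible part (lower-casing, splitting, and reporting duplicates) can be sketched in Python, the language used in the rest of this page. Counting with collections.Counter is a substitution made here and is not taken from the Java algorithm; the function name `duplicate_words` is also hypothetical.

```python
from collections import Counter

def duplicate_words(text):
    # Steps 4 and 5: lower-case the string, then split it into words
    words = text.lower().split()
    # Assumption: Counter replaces the explicit counting loop of the algorithm
    counts = Counter(words)
    # Keep the words that occur more than once, in order of first appearance
    return [word for word, count in counts.items() if count > 1]

string = "Big black bug bit a big black dog on his big black nose"
print("Duplicate words in a given string:")
for word in duplicate_words(string):
    print(word)
```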
