Remove duplicate sentences in python

I have a file with one column. How do I delete the repeated lines in the file?


asked Jul 31, 2009 at 22:37


On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

lines_seen = set() # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen: # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines, but just drop duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop, do outfile.writelines(sorted(lines_seen)).
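For instance, the sorted variant the update describes might look like this (the file names and sample data here are placeholders, not from the question):

```python
# create a sample input file with duplicates (made-up data)
with open("infile.txt", "w") as f:
    f.write("banana\napple\nbanana\ncherry\napple\n")

# collect unique lines into a set, then write them out sorted
lines_seen = set()
with open("infile.txt") as infile:
    for line in infile:
        lines_seen.add(line)

with open("outfile.txt", "w") as outfile:
    outfile.writelines(sorted(lines_seen))

print(open("outfile.txt").read())
```

Since writing is deferred until after the loop, this variant needs to hold all unique lines in memory, just like sort does.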

answered Jul 31, 2009 at 22:46

Vinay Sajip



If you're on *nix, try running the following command:

sort <file> | uniq

answered Jul 31, 2009 at 22:43

David Locke



uniqlines = set(open('/tmp/foo').readlines())

this will give you the set of unique lines.

writing that back to some file would be as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)

bar.close()


answered Aug 1, 2009 at 12:51

marcell



You can do:

import os
os.system("awk '!x[$0]++' /path/to/file > /path/to/rem-dups")

Here you are using bash from within Python :)

There is also another way:

with open('/tmp/result.txt') as result:
    uniqlines = set(result.readlines())
    with open('/tmp/rmdup.txt', 'w') as rmdup:
        rmdup.writelines(uniqlines)

answered Jun 7, 2014 at 13:15

MLSC


Get all your lines in a list, make a set of those lines, and you are done. For example,

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

If you need to preserve the ordering of lines - as a set is an unordered collection - try this:

y = []
for l in x:
    if l not in y:
        y.append(l)

and write the content back to the file.
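An order-preserving alternative (my addition, not from the answer above): since Python 3.7, dicts preserve insertion order, so dict.fromkeys can deduplicate while keeping the first occurrence of each line:

```python
x = ["line1", "line2", "line3", "line2", "line1"]
# dict keys are unique and keep insertion order (Python 3.7+),
# so this dedupes while preserving the original line order
deduped = list(dict.fromkeys(x))
print(deduped)  # → ['line1', 'line2', 'line3']
```

This avoids the O(n²) cost of the `if l not in y` list scan in the loop above.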

answered Aug 1, 2009 at 15:18

shahjapan



It's a rehash of what's already been said here - here's what I use.

import optparse

def removeDups(inputfile, outputfile):
    lines = open(inputfile, 'r').readlines()
    lines_set = set(lines)
    out = open(outputfile, 'w')
    for line in lines_set:
        out.write(line)

def main():
    parser = optparse.OptionParser('usage %prog ' +
                                   '-i <inputfile> -o <outputfile>')
    parser.add_option('-i', dest='inputfile', type='string',
                      help='specify your input file')
    parser.add_option('-o', dest='outputfile', type='string',
                      help='specify your output file')
    (options, args) = parser.parse_args()
    inputfile = options.inputfile
    outputfile = options.outputfile
    if (inputfile is None) or (outputfile is None):
        print(parser.usage)
        exit(1)
    else:
        removeDups(inputfile, outputfile)

if __name__ == '__main__':
    main()

answered Mar 31, 2015 at 10:29

Arthur M


Python one-liner:

python -c "import sys; lines = sys.stdin.readlines(); print(''.join(sorted(set(lines))))" < InputFile > OutputFile

answered Sep 15, 2013 at 9:16

Rahul Patil


Adding to @David Locke's answer, on *nix systems you can run

sort -u messy_file.txt > clean_file.txt

which will create clean_file.txt with duplicates removed and the lines sorted alphabetically.
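A rough Python equivalent of sort -u might look like this (the file names are taken from the example above; the sample data is made up):

```python
# create a sample messy file (made-up contents)
with open("messy_file.txt", "w") as f:
    f.write("b\na\nb\nc\n")

# iterating a file object yields its lines; set() dedupes them
# and sorted() puts them in alphabetical order, like sort -u
with open("messy_file.txt") as src:
    unique_sorted = sorted(set(src))

with open("clean_file.txt", "w") as dst:
    dst.writelines(unique_sorted)
```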

answered Jan 27, 2017 at 13:18

All Іѕ Vаиітy



Look at this script I created to remove duplicate emails from text files. Hope this helps!

# function to remove duplicate emails
def remove_duplicate():
    # opens emails.txt in r mode and reads it as one long string
    emails = open('emails.txt', 'r').read()
    # .split() splits the string on whitespace, returning a list
    emails = emails.split()
    # empty list to store non-duplicate e-mails
    clean_list = []
    # for loop to append non-duplicate emails to clean list
    for email in emails:
        if email not in clean_list:
            clean_list.append(email)
    return clean_list

# assigns no_duplicate_emails.txt to variable below
no_duplicate_emails = open('no_duplicate_emails.txt', 'w')

# loop over the deduplicated e-mails and write them out
for email in remove_duplicate():
    # .strip() method to remove commas
    email = email.strip(',')
    no_duplicate_emails.write(f"E-mail: {email}\n")
# close no_duplicate_emails.txt file
no_duplicate_emails.close()

answered May 10, 2018 at 19:12

If anyone is looking for a solution that uses hashing and is a little more flashy, this is what I currently use:

import os

def remove_duplicate_lines(input_path, output_path):

    if os.path.isfile(output_path):
        raise OSError('File at {} (output file location) exists.'.format(output_path))

    with open(input_path, 'r') as input_file, open(output_path, 'w') as output_file:
        seen_lines = set()

        def add_line(line):
            seen_lines.add(line)
            return line

        output_file.writelines([add_line(line) for line in input_file
                                if line not in seen_lines])

answered Feb 28, 2017 at 7:05

Torkoal



Edit it within the same file:

lines_seen = set() # holds lines already seen

with open("file.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i not in lines_seen:
            f.write(i)
            lines_seen.add(i)
    f.truncate()

answered Apr 1, 2020 at 22:58

Readable and Concise

with open('sample.txt') as fl:
    content = fl.read().split('\n')

content = set([line for line in content if line != ''])

content = '\n'.join(content)

with open('sample.txt', 'w') as fl:
    fl.write(content)

answered Oct 26, 2020 at 7:21

Ravgeet Dhillon



Here is my solution

if __name__ == '__main__':
    f = open('temp.txt', 'w+')
    flag = False
    with open('file.txt') as fp:
        for line in fp:
            for temp in f:
                if temp == line:
                    flag = True
                    print('Found Match')
                    break
            if flag == False:
                f.write(line)
            elif flag == True:
                flag = False
            f.seek(0)
        f.close()

answered Jun 28, 2013 at 2:15

cat <file> | grep -E '^[a-zA-Z]+$' | sort -u > outfile.txt

To filter and remove duplicate values from the file.

answered Jun 11, 2021 at 9:00

Ashwaq


Here is my solution

d = input("your file:")  # write your file name here
file1 = open(d, mode="r")
file2 = open('file2.txt', mode='w')
file2 = open('file2.txt', mode='a')
file1row = file1.readline()


while file1row != "":
    file2 = open('file2.txt', mode='a')
    file2read = open('file2.txt', mode='r')
    file2r = file2read.read().strip()
    if file1row not in file2r:
        file2.write(file1row)
    file1row = file1.readline()
    file2read.close()
    file2.close()

answered Sep 10 at 17:51
