Guide: Python replace multiple punctuation

I would like to find multiple occurrences of exclamation marks, question marks and periods (such as !!?!, ...?, ...!) and replace them with just the final punctuation.

i.e. !?!?!? would become ?

and ....! would become !

Is this possible?

asked Jan 27, 2016 at 15:39

text = re.sub(r'[\?\.\!]+(?=[\?\.\!])', '', text)

That is, remove any sequence of ?!. characters that are going to be followed by another ?!. character.

[...] is a character class. It matches any character inside the brackets.

+ means "1 or more of these".

(?=...) is a lookahead. It looks to see what is going to come next in the string.
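As a quick check, the pattern can be tried on the examples from the question. This is only a sketch: the helper name collapse_punctuation and the longer sample sentence are made up for illustration.

```python
import re

# Hypothetical helper name; the pattern is the one from the answer,
# written without the backslashes, which are optional inside [...].
RUN_BEFORE_LAST = re.compile(r'[?.!]+(?=[?.!])')

def collapse_punctuation(text):
    """Drop every run of ?, . or ! that is followed by another such mark."""
    return RUN_BEFORE_LAST.sub('', text)

print(collapse_punctuation('!?!?!?'))                # ?
print(collapse_punctuation('....!'))                 # !
print(collapse_punctuation('Wait....! Really!?!?'))  # Wait! Really?
```

A single mark such as the period in 'Fine.' is left alone, because the lookahead finds nothing after it.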

answered Jan 27, 2016 at 15:49

khelwood

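text = re.sub(r'[.?!]*([.?!])', r'\1', text)

The way this works is that the parentheses create a capture group around the final punctuation mark of each run. The \1 in the replacement refers to the text captured by that group, so every run is replaced by its last character. (re.search(...).group(1) would return only the captured mark of the first run, not the rewritten text.)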
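To see what the capture group holds, here is a small sketch. The sample string is invented for illustration. re.search only reports the first run it finds, while re.sub with the backreference \1 rewrites every run.

```python
import re

sample = 'Wait....! Really!?!?'  # made-up input
pattern = r'[.?!]*([.?!])'

# re.search stops at the first run; group(1) is the last mark of that run.
first = re.search(pattern, sample)
print(first.group(0))  # ....!
print(first.group(1))  # !

# re.sub visits every run; \1 puts back only the captured final mark.
collapsed = re.sub(pattern, r'\1', sample)
print(collapsed)       # Wait! Really?
```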

answered Jan 27, 2016 at 15:58

Zachary Sloss

All of these answers seem to be complicating things or not understanding regex very well. I recommend using special sequences to catch any and all punctuation you're trying to replace with spaces.

Contents

  • Method 1: Remove Punctuation from a String with Translate
  • Method 2: Remove Punctuation from a String with Python loop
  • Method 3: Remove Punctuation from a String with regex
  • Method 4: Using for loop, punctuation string and not in operator
  • How do I get rid of punctuation in Python?
  • How do I get rid of punctuation in pandas?
  • How do you remove punctuation from Python using NLTK?
  • Does string punctuation include space?

My answer is a simplification of Jonathan's leveraging Python regex special sequences rather than a manual list of punctuation and spaces to catch.

import re

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'''      # Start raw string block
               \W+       # Accept one or more non-word characters
               \s*       # plus zero or more whitespace characters,
               ''',      # Close string block
               ' ',      # and replace it with a single space
               tweet,
               flags=re.VERBOSE)
print(tweet + '\n' + clean)

Results:

I am tired! I like fruit...and milk
I am tired I like fruit and milk

Compact version:

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'\W+\s*', ' ', tweet)
print(tweet + '\n' + clean)

What separates my version from Jonathan's is that symbols like hyphens, tildes, parentheses, brackets, etc. are all caught and removed, not just the given list of punctuation. It also catches any non-space whitespace, like tabs and newlines, and converts it to a single space.

Jonathan's version is good if you want to remove a specific list of punctuation but not all punctuation, like my solution does.

If you don't want to even allow underscores in your text, you can replace the special sequence \W with just a simple [^a-zA-Z0-9], i.e.

tweet = 'I am tired! I like fruit...and milk'
clean = re.sub(r'[^a-zA-Z0-9]+\s*', ' ', tweet)
print(tweet + '\n' + clean)

Special sequence explanation, from Python's documentation on regex:

"The special sequences consist of '\' and a character from the list below."

\W: Matches any character which is not a word character. (A word character, \w, includes most characters that can be part of a word in any language, as well as numbers and the underscore.)

\s: For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).
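The difference between \W and the explicit class [^a-zA-Z0-9] shows up with underscores and accented letters. The following is a small sketch with made-up sample strings; it also checks that \s matches a non-breaking space in a str pattern.

```python
import re

samples = ['snake_case is fun!', 'café au lait!']  # made-up inputs

results = []
for text in samples:
    by_w = re.sub(r'\W+', ' ', text)                # \W leaves _ and é in place
    by_ascii = re.sub(r'[^a-zA-Z0-9]+', ' ', text)  # ASCII class replaces both
    results.append((by_w, by_ascii))
    print(repr(by_w), '|', repr(by_ascii))

# \s also matches Unicode whitespace such as the non-breaking space.
nbsp_is_space = re.fullmatch(r'\s', '\u00a0') is not None
print(nbsp_is_space)  # True
```

Note that the ASCII-only class also strips letters such as é, which may not be what you want for non-English text.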

Many times while working with Python strings, we need to remove certain characters from them. This comes up in data preprocessing in the Data Science domain and also in day-to-day programming. Let's discuss several ways to perform this task in Python.

Method 1: Remove Punctuation from a String with Translate

str.maketrans is called with two empty strings as its first two arguments, and its third argument is a string of the punctuation characters to delete (here string.punctuation). The resulting table is passed to str.translate, which removes those characters from the string. This is one of the best ways to strip punctuation from a string.

Python3

import string

test_str = 'Gfg, is best: for ! Geeks ;'

# Map every punctuation character to None, then apply the table
test_str = test_str.translate(str.maketrans('', '', string.punctuation))

print(test_str)

Output:

Gfg is best for  Geeks 
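To make the role of the translation table more concrete, the sketch below prints a tiny table and then reuses one full table for several strings. The second input string is made up for illustration.

```python
import string

# With two empty strings, only the third argument matters: each of its
# characters is mapped to None, which translate() treats as "delete".
small_table = str.maketrans('', '', ',!')
print(small_table)  # {44: None, 33: None}

# Build the full table once and reuse it for many strings.
table = str.maketrans('', '', string.punctuation)
lines = ['Gfg, is best: for ! Geeks ;', 'Hello, world!']  # second item is made up
cleaned = [line.translate(table) for line in lines]
print(cleaned)  # ['Gfg is best for  Geeks ', 'Hello world']
```

Building the table once avoids repeating the maketrans call for every string.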

Method 2: Remove Punctuation from a String with Python loop

This is the brute-force way to perform the task. We check each character against a string of punctuation characters and build the result by removing every character found in it.

Python3

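import string

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# Assumption: the punctuation literal is missing from the source,
# so string.punctuation (ASCII punctuation) stands in for it
punc = string.punctuation

# Remove each punctuation character found in the string
for ele in test_str:
    if ele in punc:
        test_str = test_str.replace(ele, "")

print("The string after punctuation filter : " + test_str)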

Output: 

The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks 

Method 3: Remove Punctuation from a String with regex 

Punctuation can also be removed with a regex. Here we replace every character that is neither a word character (\w) nor whitespace (\s) with an empty string. Because the underscore counts as a word character, it is not removed.

Python3

import re

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# Delete everything that is not a word character or whitespace
res = re.sub(r'[^\w\s]', '', test_str)

print("The string after punctuation filter : " + res)

Output : 

The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks 
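The output above still contains doubled spaces where punctuation stood between two spaces. If single spacing is wanted, one possible follow-up step (an addition for illustration, not part of the method itself) is to split on whitespace and join again:

```python
import re

test_str = "Gfg, is best : for ! Geeks ;"

no_punct = re.sub(r'[^\w\s]', '', test_str)
# split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace, so join() leaves single spaces only.
tidy = ' '.join(no_punct.split())

print(repr(no_punct))  # 'Gfg is best  for  Geeks '
print(repr(tidy))      # 'Gfg is best for Geeks'
```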

Method 4: Using for loop, punctuation string and not in operator

Python3

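import string

test_str = "Gfg, is best : for ! Geeks ;"
print("The original string is : " + test_str)

# Assumption: the punctuation literal is missing from the source,
# so string.punctuation (ASCII punctuation) stands in for it
punc = string.punctuation

# Start from an empty string so the result has no leading space
res = ""
for ele in test_str:
    if ele not in punc:
        res += ele

print("The string after punctuation filter : " + res)

Output:

The original string is : Gfg, is best : for ! Geeks ;
The string after punctuation filter : Gfg is best  for  Geeks 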

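The time and space complexity is the same for Methods 1, 3 and 4, where n is the length of the string:

Time Complexity: O(n)

Auxiliary Space: O(n)

Method 2 calls replace() once for each punctuation character it finds, and every call scans the whole string, so its running time grows to O(n·m) for a string containing m punctuation characters.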
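A single-pass variant that stays linear regardless of how much punctuation the string contains can be sketched with a set lookup and str.join. The names PUNCT and strip_punctuation are hypothetical, and string.punctuation is assumed as the punctuation set.

```python
import string

PUNCT = frozenset(string.punctuation)  # assumed punctuation set (ASCII only)

def strip_punctuation(text):
    """Keep every character that is not in PUNCT, in one pass."""
    return ''.join(ch for ch in text if ch not in PUNCT)

result = strip_punctuation("Gfg, is best : for ! Geeks ;")
print(repr(result))  # 'Gfg is best  for  Geeks '
```

Each character is checked once against the set, and join builds the result in one step instead of growing a string piece by piece.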


How do I get rid of punctuation in Python?

We can use the replace() method to remove punctuation from a Python string by replacing each punctuation mark with an empty string. We iterate over the punctuation marks one by one and replace each of them with an empty string in our text.
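A minimal sketch of that loop, assuming string.punctuation as the list of marks and a made-up input string:

```python
import string

text = 'Hello, world! (test)'  # made-up input

# Replace each punctuation mark in turn with an empty string.
for mark in string.punctuation:
    text = text.replace(mark, '')

print(text)  # Hello world test
```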

How do I get rid of punctuation in pandas?

To remove punctuation with pandas, we can use the str.replace method of a DataFrame column (a Series). We call str.replace with a regex that matches all punctuation characters and replace them with empty strings, passing regex=True so that the pattern is treated as a regular expression (since pandas 2.0 the default is a literal match). str.replace returns a new Series, which we assign back to df['text'].

How do you remove punctuation from Python using NLTK?

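Use nltk's RegexpTokenizer with a pattern that keeps only runs of word characters:

import nltk

sentence = "Think and wonder, wonder and think."

# Each token is a run of word characters, so punctuation is dropped
tokenizer = nltk.RegexpTokenizer(r"\w+")
new_words = tokenizer.tokenize(sentence)

print(new_words)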

Does string punctuation include space?

Note: string.punctuation does not include Unicode symbols or whitespace characters.
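The sketch below checks this and shows one possible way to also drop non-ASCII punctuation, using unicodedata.category from the standard library. The function name strip_unicode_punctuation and the sample string are made up for illustration.

```python
import string
import unicodedata

print(' ' in string.punctuation)  # False
print('«' in string.punctuation)  # False

def strip_unicode_punctuation(text):
    """Drop characters whose Unicode category starts with 'P' (punctuation)."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = '«Hola», ¿qué tal?'  # made-up input
ascii_only = sample.translate(str.maketrans('', '', string.punctuation))
print(ascii_only)                         # «Hola» ¿qué tal
print(strip_unicode_punctuation(sample))  # Hola qué tal
```

The two approaches are not interchangeable: characters such as $ and + are in string.punctuation but belong to Unicode symbol categories, so the category test leaves them in place.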