Guide: removing contractions in Python

Text preprocessing is a crucial step in NLP: it means cleaning raw text and converting it into a consistent form that can be analyzed for the task at hand. This article discusses contractions and how to handle them in text.

What are contractions?

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe.

Now that so much communication has moved online, we talk to others largely through text messages and posts on social media such as Facebook, Instagram, WhatsApp, Twitter, and LinkedIn. With so many people to talk to, we rely on abbreviations and shortened forms of words when texting.

For example: I’ll be there within 5 min. Are u not gng there? Am I mssng out on smthng? I’d like to see u near d park.

In informal abbreviations like these, vowels are often dropped from a word, while standard contractions such as I’ll and I’d replace the dropped letters with an apostrophe. Expanding contractions contributes to text standardization and is useful when working with Twitter data or product reviews, where individual words play an important role in sentiment analysis.

How to expand contractions?

1. Using the contractions library

First, install the library. You can try it on Google Colab, where installation is straightforward.

Using pip (the leading ! runs a shell command from a notebook cell):

!pip install contractions

In a Jupyter notebook, to install into the environment of the running kernel:

import sys  
!{sys.executable} -m pip install contractions

Code 1: Expanding contractions with the contractions library


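import contractions

# Sample text containing several contractions (same text as in the output below)
text = '''I'll be there within 5 min. Shouldn't you be there too?
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.'''

# Expand each whitespace-separated token and collect the results
expanded_words = []
for word in text.split():
    expanded_words.append(contractions.fix(word))

# Rejoin the expanded tokens with single spaces
expanded_text = ' '.join(expanded_words)

print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)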

Output:

Original text: I'll be there within 5 min. Shouldn't you be there too? 
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. should not you be there too? 
          I would love to see you there my dear. it is awesome to meet new friends. 
          we have been waiting for this day for so long.

Expanding contractions before forming word vectors helps reduce dimensionality: forms such as "don't" and "do not" then map to the same tokens, so the vocabulary is smaller.
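
To make the vocabulary effect concrete, here is a minimal sketch that counts distinct tokens before and after expansion. The two-entry mapping TINY_MAP and the helpers vocabulary and expand are hypothetical names made up for this illustration; they are not part of the contractions library.

```python
# Hypothetical two-entry mapping, for illustration only.
TINY_MAP = {"don't": "do not", "it's": "it is"}

def vocabulary(sentences):
    """Return the set of distinct lowercase, whitespace-separated tokens."""
    return {token for s in sentences for token in s.lower().split()}

def expand(sentence):
    """Replace each token found in TINY_MAP with its expanded form."""
    return " ".join(TINY_MAP.get(tok, tok) for tok in sentence.lower().split())

corpus = ["I don't know", "I do not know", "It's late", "It is late"]
before = vocabulary(corpus)
after = vocabulary(expand(s) for s in corpus)

print(len(before), len(after))  # prints: 9 7
```

On this toy corpus the contracted and expanded spellings collapse into shared tokens, so a bag-of-words representation would need two fewer dimensions.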

Code 2: Simply using contractions.fix to expand the text.


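# Sample text reconstructed from the output below
text = '''She'd like to know how I'd done that!
          She's going to the park and I don't think I'll be home for dinner.
          They're going to the zoo and she'll be home for dinner.'''

contractions.fix(text)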

Output:

'she would like to know how I would done that! 
 she is going to the park and I do not think I will be home for dinner.
 they are going to the zoo and she will be home for dinner.'

Contractions can also be handled with other techniques, such as dictionary mapping or the pycontractions library. Note that some contractions are ambiguous: in the output above, "I'd done" was expanded to "I would done" although "I had done" was meant. For more on pycontractions, see its documentation: https://pypi.org/project/pycontractions/
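
As a rough illustration of the dictionary-mapping technique mentioned above, the sketch below uses a small hand-written table and a single regular expression built from its keys. CONTRACTION_MAP and expand_contractions are hypothetical names chosen for this example, and the table is far from complete; it also assumes straight apostrophes ('), so typographic ones would need to be normalized first.

```python
import re

# Small hand-written mapping for illustration; a real table would be far larger.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'll": "I will",
    "it's": "it is",
    "shouldn't": "should not",
}

# One alternation of all keys, longest first so longer keys are tried first.
_KEYS = sorted(CONTRACTION_MAP, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _KEYS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand every contraction found in CONTRACTION_MAP, ignoring case."""
    def _replace(match):
        word = match.group(0)
        expanded = CONTRACTION_MAP[word.lower()]
        # Keep a leading capital if the original word had one.
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(_replace, text)

print(expand_contractions("I'll go, but it's late and I can't stay."))
# prints: I will go, but it is late and I cannot stay.
```

A fixed table like this cannot resolve ambiguous forms such as 'd (would or had); it simply applies whichever expansion the table lists.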


I would do something like this.

import re

def remove_contraction_apostraphes(input):
    # Match letters, then ' or `, then letters, and keep only the two letter groups
    text = re.sub(r"([A-Za-z]+)['`]([A-Za-z]+)", r'\1'r'\2', input)
    return text

print(remove_contraction_apostraphes("can't"))  # prints: cant

  1. It matches one or more letters: [A-Za-z]+
  • Characters in square brackets mean "any one of these characters"; the plus means one or more of what comes before.
  2. followed by one of the following: ' or `
  3. followed by one or more letters

and replaces it with

  1. what was found in the first set of parentheses: r'\1'
  • r'\1' returns the text that was matched by the first ([A-Za-z]+)
  2. followed by what was found in the second set of parentheses: r'\2'

If you have other apostrophe-like characters, such as the typographic apostrophe ’ (or whatever character your data actually contains), and you know what they all are, you can place them within the square brackets. This line will match any of those characters and also allows for optional whitespace on either side of the apostrophe:

text = re.sub(r"([A-Za-z]+)\s?['`’]\s?([A-Za-z]+)", r'\1'r'\2', input)
  • \s : any whitespace character
  • ? : 0 or 1 of the previous

You could also use [^A-Za-z0-9]

    text = re.sub(r'([A-Za-z]+)[^A-Za-z0-9]([A-Za-z]+)', r'\1'r'\2', input)

to match any number of letters, followed by any single character that isn't a letter or a number, followed by any number of letters. If you want to add the \s? in there, I would recommend adding \., \?, \!, \: ... to your regex, making it '([A-Za-z]+)\s?[^A-Za-z0-9\.\!\?\:]\s?([A-Za-z]+)', because otherwise your regex will match things like the ends of sentences, which are not contractions. Be aware that a negated class like this still matches spaces, commas, and hyphens, so "hello world" would become "helloworld"; listing the apostrophe characters explicitly is safer.


This will match any contraction, no matter how many letters there are before or after the apostrophe. You will need to put all the different apostrophes that you have within the ['`] block.
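
One way to keep the list of apostrophe characters in a single place is sketched below. APOSTROPHES and strip_apostrophes are hypothetical names for this example, and the inclusion of the typographic apostrophe (U+2019) is an assumption about what your data contains; extend the string with whatever characters you actually see.

```python
import re

# Apostrophe-like characters to strip; extend this string to suit your data.
APOSTROPHES = "'`\u2019"

# A run of letters, one apostrophe-like character, then another run of letters.
_CONTRACTION = re.compile(
    r"([A-Za-z]+)[" + re.escape(APOSTROPHES) + r"]([A-Za-z]+)"
)

def strip_apostrophes(text):
    """Join the two halves of each contraction, dropping the apostrophe."""
    return _CONTRACTION.sub(r"\1\2", text)

print(strip_apostrophes("I can\u2019t go. It`s late, isn't it?"))
# prints: I cant go. Its late, isnt it?
```

Because the character class lists only apostrophe-like characters, sentence punctuation and spaces between words are left alone.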