Lesson 12: Text Preparation in NLP


Text Preparation in NLP

Text preparation, or preprocessing, involves cleaning and normalizing text. Text data comes from a variety of sources, including social media sites, and such data is usually messy and inconsistent, containing noise such as spelling errors, special characters, symbols, emojis, and abbreviations. The goal of text preparation is to remove or reduce noise and variability in text data and to make the text more structured, meaningful, and relevant for building a good NLP model. Hence, text preparation allows us to build NLP models that are more efficient and effective.

Text cleaning removes noise from data and includes special character or punctuation removal, stop word removal, and HTML parsing (tag and URL removal).
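
As an illustration of the HTML-related cleaning steps, here is a minimal sketch that strips HTML tags and URLs with regular expressions (the patterns and the sample string below are illustrative, not exhaustive):

code
import re

text = 'Visit <a href="https://example.com">our site</a> at https://example.com now!'
# remove HTML tags such as <a ...> and </a>
clean_text = re.sub(r'<[^>]+>', '', text)
# remove URLs starting with http:// or https://
clean_text = re.sub(r'https?://\S+', '', clean_text)
print(clean_text)
## Visit our site at  now!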

Text normalization standardizes text: it transforms the text so that variations of each word are represented in a standard or common form. Examples of text normalization tasks include lowercasing, Unicode normalization, spelling correction, tokenization, stemming, and lemmatization.
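
For instance, Unicode normalization can be implemented with Python's built-in unicodedata module; the sketch below uses NFKD decomposition to fold accented characters to their ASCII equivalents (one common, lossy approach, shown on an illustrative sample string):

code
import unicodedata

text = "café naïve résumé"
# decompose accented characters into a base character plus a combining mark (NFKD)
decomposed = unicodedata.normalize('NFKD', text)
# drop the combining marks by keeping only ASCII characters
ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
print(ascii_text)
## cafe naive resume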

Punctuation Removal

Punctuation or special character removal is the process of removing punctuation marks or special characters, such as exclamation marks, commas, backward and forward slashes, and square brackets, from text data. The punctuation constant in Python's string module contains the punctuation marks and special characters that can be removed from a sequence of text.

code
import string

punctuations = string.punctuation
print(punctuations)
## !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Let’s use string.punctuation to remove special characters in a text.

code
import string
punctuations = string.punctuation
text = 'Found it, ENC  &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
clean_text = text
for char in text:
    if char in punctuations:
        clean_text = clean_text.replace(char, "")
print(clean_text)
## Found it ENC  ltgt SMS sun081 says HI Stop Send STOP to 62468

To keep a specific special character in the text, you can use the .replace() string method to remove that character from string.punctuation before cleaning. For example, let's keep the exclamation mark (!) in the text while the other punctuation marks are removed.

code
import string
punctuations = string.punctuation
punctuations = punctuations.replace("!", '')
text = 'Found it, ENC  &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
clean_text = text
for char in text:
    if char in punctuations:
        clean_text = clean_text.replace(char, "")
print(clean_text)
## Found it ENC  ltgt SMS sun081 says HI!!! Stop Send STOP to 62468

Regular expressions provide a more efficient way to remove punctuation marks from a sequence of text.

code
import re

text = 'Found it, ENC  &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)
## Found it ENC  ltgt SMS sun081 says HI Stop Send STOP to 62468

Lowercasing

Lowercasing is the conversion of text to lowercase, to normalize the text. The .lower() string method can be used to implement lowercasing.

code
import re

text = 'Found it, ENC  &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
# remove punctuation 
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
print(clean_text)
## found it enc  ltgt sms sun081 says hi stop send stop to 62468

Tokenization

Tokenization is the splitting of a sequence of text into text units called tokens. The tokens could be characters, words, n-grams of words, sentences, etc. Word tokenization is the splitting of a sequence of text into a list of words, while sentence tokenization is the splitting of a sequence of text into a list of sentences. Typically, the sequence of text is split based on whitespace.
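
Since n-grams do not appear in the examples that follow, here is a minimal sketch of building word bigrams (n-grams with n = 2) from a whitespace-tokenized string; the sample text is illustrative:

code
text = "mary did you know that your baby boy"
# split the text into word tokens based on whitespace
words = text.split()
# pair each word with the word that follows it to form bigrams
bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
print(bigrams)
## ['mary did', 'did you', 'you know', 'know that', 'that your', 'your baby', 'baby boy']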

Word Tokenization with the .split() Method of the String Object in Python

code
import re

text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation 
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words based on whitespace
clean_text = clean_text.split()
# display only the first 10 words for convenience
print(clean_text[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']

Word Tokenization with the re.split() Method

Regular expressions can handle tokenization where the text is split on a pattern more complex than a single whitespace.

code
import re

text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation 
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words based on whitespace
clean_text = re.split(pattern=r'\s+', string=clean_text)
# display only the first 10 words for convenience
print(clean_text[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']

Word Tokenization with nltk Tokenizer

You can use the nltk package in Python to tokenize a sequence of text:

  • Activate the project’s conda environment through the command line: conda activate <PATH/CONDA_ENV>
  • Install nltk through the command line: conda install nltk or pip install nltk
  • Import nltk in your code: import nltk
  • Include this line in your code: nltk.download('all') or nltk.download('punkt')

code
import re
import nltk
from nltk.tokenize import word_tokenize
# nltk.download("punkt")

text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation 
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words
tokens = word_tokenize(clean_text)
# display only the first 10 words for convenience
print(tokens[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']

Sentence Tokenization with nltk Tokenizer

code
import re
import nltk
from nltk.tokenize import sent_tokenize
# nltk.download("punkt")

text = "How is it going? You inspire me. Please call me back."
# tokenize the text into a list of sentences
tokens = sent_tokenize(text)
print(tokens)
## ['How is it going?', 'You inspire me.', 'Please call me back.']
# convert each sentence to lowercase
clean_tokens = [sent.lower() for sent in tokens]
# remove punctuation from each sentence
clean_tokens = [re.sub(r'[^\w\s]', '', sent) for sent in clean_tokens]
print(clean_tokens)
## ['how is it going', 'you inspire me', 'please call me back']

Stop Word Removal

Stop words are common words in a language, such as and, the, and is, and are considered not to carry relevant information about the text. Because stop words are inconsequential, they can be removed from the text to reduce the size of the text or feature space. Removing stop words allows the algorithm to be more efficient and to learn from important or relevant words. We will use the nltk package to illustrate how to remove stop words.

code
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('stopwords')

stop_words = stopwords.words('english')
text = "How is it going? You inspire me. Please call me back."
# convert text to lowercase and remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text.lower())
# tokenize a sequence of text into a list of words
tokens = word_tokenize(clean_text)
# remove stop words from a sequence of text
relevant_tokens = [token for token in tokens if token not in stop_words]
print(relevant_tokens)
## ['going', 'inspire', 'please', 'call', 'back']

Stemming

Stemming is the transformation of word variations in a sequence of text into their common root form, called a stem. It is a more extreme form of normalization that eliminates inflectional affixes, such as the -ed and -s suffixes in English, to capture the same underlying concept or root. Stemming allows you to group word variations into the same feature. The algorithm that implements stemming is called a stemmer, and it usually applies a series of regular expression substitutions. The stems obtained from stemming are sometimes not meaningful words.

code
from nltk.stem import PorterStemmer

ps = PorterStemmer()
tokens = ["computer", "compute", "computes", "computing", "computation"]
# stem each word in the list of tokens
stem_tokens = [ps.stem(word) for word in tokens]
print(stem_tokens)
## ['comput', 'comput', 'comput', 'comput', 'comput']

Lemmatization

Similar to stemming, lemmatization transforms word variants into their root form, called a lemma, but a lemma is more meaningful than a stem. Let's use the nltk package to implement lemmatization.

code
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

tokens = ["computer", "compute", "computes", "computing", "computation"]

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# lemmatize each word as a verb ('v' is the part-of-speech tag)
lemma_tokens = [lemmatizer.lemmatize(word, 'v') for word in tokens]
print(lemma_tokens)
## ['computer', 'compute', 'compute', 'compute', 'computation']