import string
punctuations = string.punctuation
print(punctuations)
## !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Text preparation or preprocessing involves cleaning and normalizing text. Text data comes from a variety of sources, including social media sites, and such data is usually messy and inconsistent, containing noise such as spelling errors, special characters, symbols, emojis, and abbreviations. The goal of text preparation is to remove or reduce noise and variability in text data and to make the text more structured, meaningful, and relevant for building a good NLP model. Hence, text preparation allows us to build NLP models that are more efficient and effective.
Text cleaning removes noise from data and includes special character or punctuation removal, stop word removal, and HTML parsing (tag and URL removal).
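For instance, HTML tags and URLs can be stripped with simple regular expressions. The snippet below is a minimal sketch on a made-up message; the patterns shown are illustrative rather than exhaustive.
import re
text = 'Great deal at https://example.com <br> Reply STOP to opt out'
# remove URLs
clean_text = re.sub(r'https?://\S+|www\.\S+', '', text)
# remove HTML tags
clean_text = re.sub(r'<[^>]+>', '', clean_text)
# collapse the extra whitespace left behind
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
print(clean_text)
## Great deal at Reply STOP to opt out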
Text normalization standardizes text: it transforms the text so that variations of each word are represented in a standard or common form. Examples of text normalization tasks include lowercasing, Unicode normalization, spelling correction, tokenization, stemming, and lemmatization.
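As a quick illustration of Unicode normalization, Python's built-in unicodedata module can decompose accented characters and, if desired, strip the accents; the example text here is made up.
import unicodedata
text = "café résumé"
# decompose accented characters into base characters plus combining marks
normalized = unicodedata.normalize("NFKD", text)
# drop the combining marks to keep plain ASCII characters
ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
print(ascii_text)
## cafe resume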
Punctuation or special character removal is the process of removing punctuation marks or special characters, such as exclamation marks, commas, backslashes, forward slashes, and square brackets, from text data. The punctuation constant in the string module in Python contains the punctuation marks or special characters that can be removed from a sequence of text.
Let’s use string.punctuation to remove special characters from a text.
import string
punctuations = string.punctuation
text = 'Found it, ENC &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
clean_text = text
for char in text:
    if char in punctuations:
        clean_text = clean_text.replace(char, "")
print(clean_text)
## Found it ENC ltgt SMS sun081 says HI Stop Send STOP to 62468
To keep a specific special character in the text, you can remove that character from the string.punctuation constant with the .replace() string method before using the constant to clean the text. For example, let’s keep the exclamation mark (!) in the text while the other punctuation marks are removed.
import string
punctuations = string.punctuation
punctuations = punctuations.replace("!", '')
text = 'Found it, ENC &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
clean_text = text
for char in text:
    if char in punctuations:
        clean_text = clean_text.replace(char, "")
print(clean_text)
## Found it ENC ltgt SMS sun081 says HI!!! Stop Send STOP to 62468
Regular expressions provide a more efficient way to remove punctuation marks from a sequence of text. Lowercasing, the conversion of text to lowercase, is another normalization step; the .lower() string method can be used to implement it. The example below removes punctuation with a regular expression and then lowercases the text.
import re
text = 'Found it, ENC &lt;#&gt; SMS. sun081 says HI!!!" Stop? Send STOP to 62468'
# remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
print(clean_text)
## found it enc ltgt sms sun081 says hi stop send stop to 62468
Tokenization is the splitting of a sequence of text into text units called tokens. The tokens could be characters, words, n-grams of words, sentences, and so on. Word tokenization splits a sequence of text into a list of words, while sentence tokenization splits a sequence of text into a list of sentences. Typically, the sequence of text is split on whitespace.
text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words based on whitespace
clean_text = clean_text.split()
# display only the first 10 words for convenience
print(clean_text[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']
Regular expressions can also handle tokenization where the text is split on a pattern that is more complex than a single whitespace character.
import re
text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words based on whitespace
clean_text = re.split(pattern=r'\s+', string=clean_text)
# display only the first 10 words for convenience
print(clean_text[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']
You can use the nltk package in Python to tokenize a sequence of text. First, install nltk in your environment and download the tokenizer models:
conda activate <PATH/CONDA_ENV>
conda install nltk
# or: pip install nltk
import nltk
nltk.download('all')
# or: nltk.download('punkt')
import re
import nltk
from nltk.tokenize import word_tokenize
# nltk.download("punkt")
text = "Mary, did you know that your baby boy Would one day walk on water?"
# remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text)
# convert the text to lowercase
clean_text = clean_text.lower()
# tokenize a sequence of text into a list of words
tokens = word_tokenize(clean_text)
# display only the first 10 words for convenience
print(tokens[0:10])
## ['mary', 'did', 'you', 'know', 'that', 'your', 'baby', 'boy', 'would', 'one']
import re
import nltk
from nltk.tokenize import sent_tokenize
# nltk.download("punkt")
text = "How is it going? You inspire me. Please call me back."
# tokenize the text into a list of sentences
tokens = sent_tokenize(text)
print(tokens)
## ['How is it going?', 'You inspire me.', 'Please call me back.']
# convert each sentence to lowercase
clean_tokens = [sent.lower() for sent in tokens]
# remove punctuation from each sentence
clean_tokens = [re.sub(r'[^\w\s]', '', sent) for sent in clean_tokens]
print(clean_tokens)
## ['how is it going', 'you inspire me', 'please call me back']
Stop words are common words in a language, such as and, the, and is, and are considered not to carry relevant information about the text. Stop words are inconsequential and so can be removed from the text to reduce the size of the text or feature space. Removing stop words allows the algorithm to be more efficient and to learn from the important or relevant words. We will use the nltk package to illustrate how to remove stop words.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('stopwords')
stop_words = stopwords.words('english')
text = "How is it going? You inspire me. Please call me back."
# convert text to lowercase and remove punctuation
clean_text = re.sub(r'[^\w\s]', '', text.lower())
# tokenize a sequence of text into a list of words
tokens = word_tokenize(clean_text)
# remove stop words from a sequence of text
relevant_tokens = [token for token in tokens if token not in stop_words]
print(relevant_tokens)
## ['going', 'inspire', 'please', 'call', 'back']
Stemming is the transformation of word variations in a sequence of text into their common root form, called the stem. It is a more extreme form of normalization that eliminates inflectional affixes, such as the -ed and -s suffixes in English, to capture the same underlying concept or root. Stemming allows you to group word variations into the same feature. The algorithm used to implement stemming is called a stemmer, which usually applies a series of regular expression substitutions. The stems obtained from stemming are sometimes not meaningful words.
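Here is a minimal sketch of stemming using the PorterStemmer class from nltk, applied to a small made-up list of word variations (the same list is reused for lemmatization below).
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
stemmer = PorterStemmer()
tokens = ["computer", "compute", "computes", "computing", "computation"]
# reduce each word variation to its stem
stem_tokens = [stemmer.stem(word) for word in tokens]
print(stem_tokens)
## ['comput', 'comput', 'comput', 'comput', 'comput']
Note how every variation collapses to the stem comput, which is not itself an English word.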
Similar to stemming, lemmatization transforms word variants into a root form called the lemma, but a lemma is more meaningful than a stem. Let’s use the nltk package to implement lemmatization.
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
tokens = ["computer", "compute", "computes", "computing", "computation"]
# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# lemmatize each word, treating each token as a verb ('v')
lemma_tokens = [lemmatizer.lemmatize(word, 'v') for word in tokens]
print(lemma_tokens)
## ['computer', 'compute', 'compute', 'compute', 'computation']