Text representation in natural language processing is the encoding or vectorization of text data into a numerical form suitable for learning algorithms. Algorithms feed on numerical data and do not understand raw text data or strings. Hence, for NLP tasks, it is necessary to first represent the raw input text in a numerical form for learning algorithms to process.
Text representation can be viewed as feature extraction since it involves extracting new features from text data. The representation of text units (words, n-grams of words, sentences, paragraphs, documents, etc.) as vectors in a common vector space is called a vector space model or vector representation.
There are various vector representation approaches that allow us to represent text in a numerical or machine-readable form, ranging from basic to state-of-the-art. Basic vector representation approaches include one-hot encoding, bag of words, bag of n-grams, and TF-IDF, while advanced vector representations include word embeddings such as Word2vec and GloVe, and contextual models such as BERT.
Before we dive into the various vector representation approaches, it’s important to define the following terms commonly used in NLP.
Definitions
Corpus: a body of text
Token: a text unit such as a character, word, n-gram of words, sentence, paragraph, or document
Vocabulary: the set of unique words in a corpus
Document: a record in a text dataset
One-Hot Encoding
In traditional machine learning, one-hot encoding (also known as dummy encoding) is used to transform categorical data into a numerical form through the creation of dummy variables.
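As a quick illustration, pandas can create such dummy variables with get_dummies; the toy color column below is made up purely for demonstration.
code
import pandas as pd

# a toy categorical column (hypothetical data for illustration)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# produces one indicator (dummy) column per category: color_blue, color_green, color_red
print(pd.get_dummies(df, columns=["color"]))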
In natural language processing, one-hot encoding involves the representation of words as vectors of zeros and ones. The following process is used to generate one-hot encodings for a sequence of text.
The text is tokenized into a list of words.
Unique words are extracted from the tokens to create the vocabulary.
Each word in the vocabulary is then represented as a vector whose length equals the vocabulary size, consisting of zeros and a single 1. For a given word, the position of the 1 corresponds to the index of that word in the vocabulary.
Consider the following documents in a corpus:
Documents in a Corpus
doc1 = “I am royalty”
doc2 = “I am a child of the King”
Lowercasing and Punctuation Removal
code
import re

def clean_text(text):
    return re.sub(r'[^\w\s]', '', text.lower())

doc1 = "I am royalty"
doc2 = "I am a child of the King"
corpus = [doc1, doc2]

for doc in corpus:
    print(clean_text(doc))
## i am royalty
## i am a child of the king
Extract vocabulary from corpus
code
import re
import nltk
from nltk.tokenize import word_tokenize
# nltk.download("punkt")

# extract vocabulary
def extract_vocab(text):
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(clean_text)
    vocab = []
    for word in tokens:
        if word not in vocab:
            vocab.append(word)
    return vocab

doc1 = "I am royalty"
doc2 = "I am a child of the King"
corpus = [doc1, doc2]

# extract vocab from corpus
vocabulary = extract_vocab(" ".join(corpus))
print(vocabulary)
## ['i', 'am', 'royalty', 'a', 'child', 'of', 'the', 'king']
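Continuing from the snippet above, here is a minimal sketch of building the one-hot vectors themselves; the one_hot helper is a name introduced here just for illustration.
code
# one-hot vector for a word: all zeros, with a 1 at the word's index in the vocabulary
def one_hot(word, vocab):
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# assumes `vocabulary` from the previous snippet:
# ['i', 'am', 'royalty', 'a', 'child', 'of', 'the', 'king']
print(one_hot("i", vocabulary))
## [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("royalty", vocabulary))
## [0, 0, 1, 0, 0, 0, 0, 0]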
Now that we have each word represented as a vector, we can replace the words in each document with their vectors to create document vectors, which are nested lists of lists. We could also add the word vectors for each document element-wise to obtain a more compact document vector whose entries represent the number of times each vocabulary word occurs in that document.
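A minimal sketch of the element-wise sum option, assuming the clean_text function, word_tokenize, and vocabulary from the snippets above (the doc_one_hot_sum helper is a name introduced here for illustration):
code
# document vector as the element-wise sum of its word one-hot vectors
def doc_one_hot_sum(doc, vocab):
    tokens = word_tokenize(clean_text(doc))
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab.index(token)] += 1
    return vector

print(doc_one_hot_sum(doc1, vocabulary))  # "I am royalty"
## [1, 1, 1, 0, 0, 0, 0, 0]
print(doc_one_hot_sum(doc2, vocabulary))  # "I am a child of the King"
## [1, 1, 0, 1, 1, 1, 1, 1]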
One-hot encoding is easy to understand and implement but has some limitations. Vector representations from one-hot encoding can be sparse (mostly zeros), which can be problematic. A large vocabulary produces vectors in a high-dimensional space, which can be slow to process. High-dimensional data can also lead to overfitting when used to train machine learning models.
Also, if the documents in the corpus do not have the same number of words, the length of each document vector will differ. That is, one-hot encoding does not give a fixed-length representation for different documents.
If a model is trained on vectors from one-hot encoding, the input text it receives in production may contain words that were not in the vocabulary used during training. One-hot encoding cannot handle such out-of-vocabulary words unless the model is retrained with a larger vocabulary that includes them.
Bag of Words
The bag of words approach to text representation is popularly used, especially in text classification. Each document is treated as a bag (an unordered collection) of its words, meaning that word order is ignored and documents containing the same words end up with the same representation. The bag of words representation encodes each document in the corpus as a vector of numbers whose values are the frequencies of the vocabulary words in that document.
Hence, representing text data using the bag of words approach involves:
Creating a vocabulary from the corpus
Counting how many times each word in the vocabulary occurs in each document.
The bag of words model can be represented as a term-document matrix where the entries are the frequencies of vocabulary words in a document. That is, a bag of words is a collection of document vectors in a vector space where the vector entries are the frequencies of vocabulary words in the documents.
Let’s define a function for extracting vocabulary and document vectors.
code
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download("punkt")
# nltk.download('stopwords')

stop_words = stopwords.words('english')

# extract vocabulary
def extract_vocab(text):
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(clean_text)
    relevant_tokens = [token for token in tokens if token not in stop_words]
    vocab = []
    for word in relevant_tokens:
        if word not in vocab:
            vocab.append(word)
    return vocab

# get document vector where entries are frequency of vocabulary words
def doc_vec(vocab, doc):
    """
    Parameters
    ----------
    vocab: list
        a list of unique words in a corpus
    doc: str
        a sequence of text

    Returns
    -------
    document vector: list
        a list of values representing frequencies of vocabulary words in a document
    """
    # punctuation removal, tokenization, stop words removal
    clean_doc = re.sub(r'[^\w\s]', '', doc.lower())
    doc_tokens = word_tokenize(clean_doc)
    # convert document to document vector
    doc_vector = [doc_tokens.count(word) for word in vocab]
    return doc_vector
Let’s convert one of the documents to a document vector
code
doc1 = "Mary, did you know that your baby boy Would one day walk on water? Mary, did you know that your baby boy Would save our sons and daughters?"
doc2 = "Did you know that your baby boy Has come to make you new? This child that you delivered, will soon deliver you"
doc3 = "Mary, did you know that your baby boy. Will give sight to a blind man? Mary, did you know that your baby boy Will calm the storm with his hand?"
corpus = [doc1, doc2, doc3]

# extract vocab from corpus
vocabulary = extract_vocab(" ".join(corpus))

# create document vector
doc1_vector = doc_vec(vocabulary, doc1)

# prepare doc1
clean_doc1 = re.sub(r'[^\w\s]', '', doc1.lower())
doc1_tokens = word_tokenize(clean_doc1)
doc1_tokens = [token for token in doc1_tokens if token not in stop_words]
doc1_prepared = " ".join(doc1_tokens)

print(vocabulary)
## ['mary', 'know', 'baby', 'boy', 'would', 'one', 'day', 'walk', 'water', 'save', 'sons', 'daughters', 'come', 'make', 'new', 'child', 'delivered', 'soon', 'deliver', 'give', 'sight', 'blind', 'man', 'calm', 'storm', 'hand']
print(doc1_prepared)
## mary know baby boy would one day walk water mary know baby boy would save sons daughters
print(doc1_vector)
## [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Let’s use the following documents to create a bag of words representation.
Documents in a Corpus
doc1 = “Mary, did you know that your baby boy Would one day walk on water? Mary, did you know that your baby boy Would save our sons and daughters?”
doc2 = “Did you know that your baby boy Has come to make you new? This child that you delivered, will soon deliver you”
doc3 = “Mary, did you know that your baby boy. Will give sight to a blind man? Mary, did you know that your baby boy Will calm the storm with his hand?”
The bag of words representation will be created as a term-document matrix with frequencies as entries, representing the vector space.
code
import pandas as pd

corpus = [doc1, doc2, doc3]
bag_of_words = [doc_vec(vocabulary, doc) for doc in corpus]

bow_df = pd.DataFrame(bag_of_words, columns=vocabulary)
bow_df.index = ["doc1", "doc2", "doc3"]
bow_df
##       mary  know  baby  boy  would  one  ...  sight  blind  man  calm  storm  hand
## doc1     2     2     2    2      2    1  ...      0      0    0     0      0     0
## doc2     0     1     1    1      0    0  ...      0      0    0     0      0     0
## doc3     2     2     2    2      0    0  ...      1      1    1     1      1     1
##
## [3 rows x 26 columns]
The features in the bag of words representation are the words in the vocabulary, while the records are the documents. This representation is suitable for NLP tasks such as classification, information retrieval, and document clustering.
Let’s use the sklearn package to create a bag of words representation through the CountVectorizer class.
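A minimal sketch of what that could look like, assuming the same doc1, doc2, and doc3 defined above; note that sklearn's default tokenization may give a slightly different vocabulary than our extract_vocab function.
code
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import pandas as pd

stop_words = stopwords.words('english')
corpus = [doc1, doc2, doc3]

count_vect = CountVectorizer(stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)

# term-document matrix with word frequencies as entries
bow_sklearn = pd.DataFrame(X.toarray(),
                           columns=count_vect.get_feature_names_out(),
                           index=["doc1", "doc2", "doc3"])
print(bow_sklearn)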
Instead of having frequencies as the entries of the term-document matrix, binary ones and zeros could also be used, where a one indicates the presence of a word in a document and a zero its absence. This binary representation has been found useful in sentiment analysis, where it is the presence of words, rather than their frequencies, that determines the sentiment of a document.
The CountVectorizer in sklearn has a binary parameter which can be set to True to create a term-document matrix with zeros and ones as entries.
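For instance, continuing from the previous snippet:
code
# binary=True records word presence/absence (1/0) rather than counts
binary_vect = CountVectorizer(binary=True, stop_words=stop_words, lowercase=True)
X_binary = binary_vect.fit_transform(corpus)

binary_df = pd.DataFrame(X_binary.toarray(),
                         columns=binary_vect.get_feature_names_out(),
                         index=["doc1", "doc2", "doc3"])
print(binary_df)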
This representation allows us to easily compare documents based on vocabulary. Documents with similar vocabulary will be close in the vector space.
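As an illustration, cosine similarity is one common way to quantify this closeness; the sketch below assumes the bow_df term-document matrix built earlier.
code
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# pairwise cosine similarity between the document vectors in bow_df
similarity = cosine_similarity(bow_df.values)
print(pd.DataFrame(similarity, index=bow_df.index, columns=bow_df.index))
# doc1 and doc3 share many words ("mary", "know", "baby", "boy", ...),
# so their similarity is higher than either document's similarity to doc2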
Generally, the bag of words representation loses word order information. The number of features grows with vocabulary size, which can lead to sparsity, though restricting the vocabulary to the n most frequent words can mitigate this, as sketched below.
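One way to do that with sklearn is the max_features parameter of CountVectorizer, which keeps only the most frequent terms; this small sketch assumes the corpus and stop_words defined above.
code
from sklearn.feature_extraction.text import CountVectorizer

# keep only the 10 most frequent words across the corpus
count_vect = CountVectorizer(max_features=10, stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names_out())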
Bag of N-Grams
A bag of n-grams representation is an extension of bag of words in which the tokens are sequences of n consecutive words (phrases). An n-gram of one, two, or three words is called a unigram, bigram, or trigram respectively. Because a bag of n-grams captures some context and word order information, it preserves more of the semantic similarity between documents: documents sharing the same n-grams are closer to each other in the vector space. However, dimensionality and out-of-vocabulary issues can still be present, and are often worse, with this representation.
We can create a bag of n-grams representation using the CountVectorizer in sklearn by specifying the ngram_range parameter.
code
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# nltk.download('stopwords')

stop_words = stopwords.words('english')
corpus = [doc1, doc2, doc3]

count_vect = CountVectorizer(ngram_range=(2, 2), stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)
features = count_vect.get_feature_names_out()
data = X.toarray()

bow_df = pd.DataFrame(data, columns=features)
print(bow_df)
##    baby boy  blind man  boy calm  ...  water mary  would one  would save
## 0         2          0         0  ...           1          1           1
## 1         1          0         0  ...           0          0           0
## 2         2          1         1  ...           0          0           0
##
## [3 rows x 27 columns]
TF-IDF
A TF-IDF representation is a term-document matrix whose entries are term frequency-inverse document frequency (tf-idf) values. The tf-idf value indicates the importance of a word in a document. TF, or term frequency, measures how frequently a word occurs in a document, while IDF, or inverse document frequency, measures how rare a word is across the documents in the corpus: the greater the IDF value, the rarer the word.
The tf-idf measure is calculated as:

\[\text{TF-IDF}(t, D) = TF(t, D) \times IDF(t)\]

Where:

\[TF(t, D) = \frac{\text{Number of times term } t \text{ occurs in document } D}{\text{Total number of words in document } D}\]

\[IDF(t) = \log_e\left(\frac{\text{Total number of documents in the corpus}}{\text{Number of documents in which term } t \text{ occurs}}\right)\]
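Below is a small sketch that follows these formulas directly on a toy pre-tokenized corpus; note that sklearn's TfidfVectorizer uses a smoothed variant of IDF, so its values differ slightly from this plain definition.
code
import math

def tf(term, doc_tokens):
    # frequency of the term in the document / total words in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # log of (total documents / documents containing the term)
    n_docs = len(corpus_tokens)
    n_docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / n_docs_with_term)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# toy pre-tokenized corpus for illustration
corpus_tokens = [
    ["mary", "know", "baby", "boy"],
    ["baby", "boy", "come", "make", "new"],
    ["mary", "know", "calm", "storm"],
]

print(tf_idf("mary", corpus_tokens[0], corpus_tokens))  # (1/4) * ln(3/2) ≈ 0.101
print(tf_idf("calm", corpus_tokens[2], corpus_tokens))  # (1/4) * ln(3/1) ≈ 0.275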