Text representation in natural language processing is the encoding or vectorization of text data into a numerical form suitable for learning algorithms. Algorithms feed on numerical data and do not understand raw text data or strings. Hence, for NLP tasks, it is necessary to first represent the raw input text in a numerical form for learning algorithms to process.
Text representation can be viewed as feature extraction since it involves extracting new features from text data. The representation of text units (words, n-grams of words, sentences, paragraphs, documents, etc.) as vectors in a common vector space is called a vector space model or vector representation.
There are various vector representation approaches that allow us to represent text in a numerical or machine-readable form, ranging from basic to state-of-the-art. Basic vector representation approaches include one-hot encoding, bag of words, bag of n-grams, and TF-IDF, while advanced vector representations include word embeddings such as Word2vec and GloVe, and contextual models such as BERT.
Before we dive into the various vector representation approaches, it’s important to define the following terms commonly used in NLP.
Definitions
Corpus: a body of text
Token: a text unit such as a character, word, n-gram of words, sentence, paragraph, or document
Vocabulary: the set of unique words in a corpus
Document: a record in a text dataset
One-Hot Encoding
In traditional machine learning, one-hot encoding (also known as dummy encoding) is used to transform categorical data into a numerical form through the creation of dummy variables.
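As a quick illustration, pandas can create such dummy variables with get_dummies; the toy color column below is made up purely for demonstration.
code
import pandas as pd

# a toy categorical column (hypothetical data for illustration)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# produces one indicator (dummy) column per category: color_blue, color_green, color_red
print(pd.get_dummies(df, columns=["color"]))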
In natural language processing, one-hot encoding involves the representation of words as vectors of zeros and ones. The following process is used to generate one-hot encodings for a sequence of text.
The text is tokenized into a list of words.
Unique words are extracted from the tokens to create the vocabulary.
Each word in the vocabulary is then represented as a vector whose length equals the vocabulary size, consisting of zeros and a single 1. For a given word, the position of the 1 corresponds to the index of that word in the vocabulary.
Consider the following documents in a corpus:
Documents in a Corpus
doc1 = “I am royalty”
doc2 = “I am a child of the King”
Lowercasing and Punctuation Removal
code
import re

def clean_text(text):
    return re.sub(r'[^\w\s]', '', text.lower())

doc1 = "I am royalty"
doc2 = "I am a child of the King"
corpus = [doc1, doc2]

for doc in corpus:
    print(clean_text(doc))
## i am royalty
## i am a child of the king
Extract vocabulary from corpus
code
import re
import nltk
from nltk.tokenize import word_tokenize
# nltk.download("punkt")

# extract vocabulary
def extract_vocab(text):
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(clean_text)
    vocab = []
    for word in tokens:
        if word not in vocab:
            vocab.append(word)
    return vocab

doc1 = "I am royalty"
doc2 = "I am a child of the King"
corpus = [doc1, doc2]

# extract vocab from corpus
vocabulary = extract_vocab(" ".join(corpus))
print(vocabulary)
## ['i', 'am', 'royalty', 'a', 'child', 'of', 'the', 'king']
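Continuing from the snippet above, here is a minimal sketch of building the one-hot vectors themselves; the one_hot helper is a name introduced here just for illustration.
code
# one-hot vector for a word: all zeros, with a 1 at the word's index in the vocabulary
def one_hot(word, vocab):
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# assumes `vocabulary` from the previous snippet:
# ['i', 'am', 'royalty', 'a', 'child', 'of', 'the', 'king']
print(one_hot("i", vocabulary))
## [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("royalty", vocabulary))
## [0, 0, 1, 0, 0, 0, 0, 0]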
Now that we have each word represented as a vector, we can replace the words in each document with their vectors to create document vectors, which are nested lists of lists. We could also add the word vectors for each document element-wise to obtain a more compact document vector whose entries represent the number of times each vocabulary word occurs in that document.
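A minimal sketch of the element-wise sum option, assuming the clean_text function, word_tokenize, and vocabulary from the snippets above (the doc_one_hot_sum helper is a name introduced here for illustration):
code
# document vector as the element-wise sum of its word one-hot vectors
def doc_one_hot_sum(doc, vocab):
    tokens = word_tokenize(clean_text(doc))
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab.index(token)] += 1
    return vector

print(doc_one_hot_sum(doc1, vocabulary))  # "I am royalty"
## [1, 1, 1, 0, 0, 0, 0, 0]
print(doc_one_hot_sum(doc2, vocabulary))  # "I am a child of the King"
## [1, 1, 0, 1, 1, 1, 1, 1]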
One-hot encoding is easy to understand and implement but has some limitations. Vector representations from one-hot encoding can be sparse (mostly zeros), which can be problematic. A large vocabulary produces vectors in a high-dimensional space, which can be slow to process. High-dimensional data can also lead to overfitting when used to train machine learning models.
Also, if the documents in the corpus do not have the same number of words, the length of each document vector will differ. That is, one-hot encoding does not give a fixed-length representation for different documents.
If a model is trained on vectors from one-hot encoding, the input text it receives in production may contain words that were not in the vocabulary used during training. One-hot encoding cannot handle such out-of-vocabulary words unless the model is retrained with a larger vocabulary that includes them.
Bag of Words
The bag of words approach to text representation is popularly used, especially in text classification. Each document is treated as a bag (an unordered collection) of its words, meaning that word order is ignored and documents containing the same words end up with the same representation. The bag of words representation encodes each document in the corpus as a vector of numbers whose values are the frequencies of the vocabulary words in that document.
Hence, representing text data using the bag of words approach involves:
Creating a vocabulary from the corpus
Counting how many times each word in the vocabulary occurs in each document.
The bag of words model can be represented as a term-document matrix where the entries are the frequencies of vocabulary words in a document. That is, a bag of words is a collection of document vectors in a vector space where the vector entries are the frequencies of vocabulary words in the documents.
Let’s define a function for extracting vocabulary and document vectors.
code
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download("punkt")
# nltk.download('stopwords')

stop_words = stopwords.words('english')

# extract vocabulary
def extract_vocab(text):
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(clean_text)
    relevant_tokens = [token for token in tokens if token not in stop_words]
    vocab = []
    for word in relevant_tokens:
        if word not in vocab:
            vocab.append(word)
    return vocab

# get document vector where entries are frequency of vocabulary words
def doc_vec(vocab, doc):
    """
    Parameters
    ----------
    vocab: list
        a list of unique words in a corpus
    doc: str
        a sequence of text

    Returns
    -------
    document vector: list
        a list of values representing frequencies of vocabulary words in a document
    """
    # punctuation removal, tokenization, stop words removal
    clean_doc = re.sub(r'[^\w\s]', '', doc.lower())
    doc_tokens = word_tokenize(clean_doc)
    # convert document to document vector
    doc_vector = [doc_tokens.count(word) for word in vocab]
    return doc_vector
Let’s convert one of the documents to a document vector
code
doc1 = "Mary, did you know that your baby boy Would one day walk on water? Mary, did you know that your baby boy Would save our sons and daughters?"
doc2 = "Did you know that your baby boy Has come to make you new? This child that you delivered, will soon deliver you"
doc3 = "Mary, did you know that your baby boy. Will give sight to a blind man? Mary, did you know that your baby boy Will calm the storm with his hand?"
corpus = [doc1, doc2, doc3]

# extract vocab from corpus
vocabulary = extract_vocab(" ".join(corpus))

# create document vector
doc1_vector = doc_vec(vocabulary, doc1)

# prepare doc1
clean_doc1 = re.sub(r'[^\w\s]', '', doc1.lower())
doc1_tokens = word_tokenize(clean_doc1)
doc1_tokens = [token for token in doc1_tokens if token not in stop_words]
doc1_prepared = " ".join(doc1_tokens)

print(vocabulary)
## ['mary', 'know', 'baby', 'boy', 'would', 'one', 'day', 'walk', 'water', 'save', 'sons', 'daughters', 'come', 'make', 'new', 'child', 'delivered', 'soon', 'deliver', 'give', 'sight', 'blind', 'man', 'calm', 'storm', 'hand']
print(doc1_prepared)
## mary know baby boy would one day walk water mary know baby boy would save sons daughters
print(doc1_vector)
## [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Let’s use the following documents to create a bag of words representation.
Documents in a Corpus
doc1 = “Mary, did you know that your baby boy Would one day walk on water? Mary, did you know that your baby boy Would save our sons and daughters?”
doc2 = “Did you know that your baby boy Has come to make you new? This child that you delivered, will soon deliver you”
doc3 = “Mary, did you know that your baby boy. Will give sight to a blind man? Mary, did you know that your baby boy Will calm the storm with his hand?”
The bag of words representation will be created as a term-document matrix with frequencies as entries, representing the vector space.
code
import pandas as pd

corpus = [doc1, doc2, doc3]
bag_of_words = [doc_vec(vocabulary, doc) for doc in corpus]

bow_df = pd.DataFrame(bag_of_words, columns=vocabulary)
bow_df.index = ["doc1", "doc2", "doc3"]
bow_df
##       mary  know  baby  boy  would  one  ...  sight  blind  man  calm  storm  hand
## doc1     2     2     2    2      2    1  ...      0      0    0     0      0     0
## doc2     0     1     1    1      0    0  ...      0      0    0     0      0     0
## doc3     2     2     2    2      0    0  ...      1      1    1     1      1     1
##
## [3 rows x 26 columns]
The features in the bag of words representation are the words in the vocabulary, while the records are the documents. This representation is suitable for NLP tasks such as classification, information retrieval, and document clustering.
Let’s use the sklearn package to create a bag of words representation through the CountVectorizer class.
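A minimal sketch of what that could look like, assuming the same doc1, doc2, and doc3 defined above; note that sklearn's default tokenization may give a slightly different vocabulary than our extract_vocab function.
code
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import pandas as pd

stop_words = stopwords.words('english')
corpus = [doc1, doc2, doc3]

count_vect = CountVectorizer(stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)

# term-document matrix with word frequencies as entries
bow_sklearn = pd.DataFrame(X.toarray(),
                           columns=count_vect.get_feature_names_out(),
                           index=["doc1", "doc2", "doc3"])
print(bow_sklearn)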
Instead of having frequencies as the entries of the term-document matrix, binary ones and zeros could also be used, where a one indicates the presence of a word in a document and a zero its absence. This binary representation has been found useful in sentiment analysis, where it is the presence of words, rather than their frequencies, that determines the sentiment of a document.
The CountVectorizer in sklearn has a binary parameter which can be set to True to create a term-document matrix with zeros and ones as entries.
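For instance, continuing from the previous snippet:
code
# binary=True records word presence/absence (1/0) rather than counts
binary_vect = CountVectorizer(binary=True, stop_words=stop_words, lowercase=True)
X_binary = binary_vect.fit_transform(corpus)

binary_df = pd.DataFrame(X_binary.toarray(),
                         columns=binary_vect.get_feature_names_out(),
                         index=["doc1", "doc2", "doc3"])
print(binary_df)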
This representation allows us to easily compare documents based on vocabulary. Documents with similar vocabulary will be close in the vector space.
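As an illustration, cosine similarity is one common way to quantify this closeness; the sketch below assumes the bow_df term-document matrix built earlier.
code
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# pairwise cosine similarity between the document vectors in bow_df
similarity = cosine_similarity(bow_df.values)
print(pd.DataFrame(similarity, index=bow_df.index, columns=bow_df.index))
# doc1 and doc3 share many words ("mary", "know", "baby", "boy", ...),
# so their similarity is higher than either document's similarity to doc2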
Generally, the bag of words representation loses word order information. The number of features grows with vocabulary size, which can lead to sparsity, though restricting the vocabulary to the n most frequent words can mitigate this, as sketched below.
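One way to do that with sklearn is the max_features parameter of CountVectorizer, which keeps only the most frequent terms; this small sketch assumes the corpus and stop_words defined above.
code
from sklearn.feature_extraction.text import CountVectorizer

# keep only the 10 most frequent words across the corpus
count_vect = CountVectorizer(max_features=10, stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names_out())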
Bag of N-Grams
A bag of n-grams representation is an extension of bag of words in which the tokens are sequences of n consecutive words (phrases). An n-gram of one, two, or three words is called a unigram, bigram, or trigram respectively. Because a bag of n-grams captures some context and word order information, it preserves more of the semantic similarity between documents: documents sharing the same n-grams are closer to each other in the vector space. However, dimensionality and out-of-vocabulary issues can still be present, and are often worse, with this representation.
We can create a bag of n-grams representation using the CountVectorizer in sklearn by specifying the ngram_range parameter.
code
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# nltk.download('stopwords')

stop_words = stopwords.words('english')
corpus = [doc1, doc2, doc3]

count_vect = CountVectorizer(ngram_range=(2, 2), stop_words=stop_words, lowercase=True)
X = count_vect.fit_transform(corpus)
features = count_vect.get_feature_names_out()
data = X.toarray()

bow_df = pd.DataFrame(data, columns=features)
print(bow_df)
##    baby boy  blind man  boy calm  ...  water mary  would one  would save
## 0         2          0         0  ...           1          1           1
## 1         1          0         0  ...           0          0           0
## 2         2          1         1  ...           0          0           0
##
## [3 rows x 27 columns]
TF-IDF
A TF-IDF representation is a term-document matrix whose entries are term frequency-inverse document frequency (tf-idf) values. The tf-idf value indicates the importance of a word in a document. TF, or term frequency, measures how frequently a word occurs in a document, while IDF, or inverse document frequency, measures how rare a word is across the documents in the corpus: the greater the IDF value, the rarer the word.
The tf-idf measure is calculated as:

\[\text{TF-IDF}(t, D) = TF(t, D) \times IDF(t)\]

Where:

\[TF(t, D) = \frac{\text{Number of times term } t \text{ occurs in document } D}{\text{Total number of words in document } D}\]

\[IDF(t) = \log_e\left(\frac{\text{Total number of documents in the corpus}}{\text{Number of documents in which term } t \text{ occurs}}\right)\]
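Below is a small sketch that follows these formulas directly on a toy pre-tokenized corpus; note that sklearn's TfidfVectorizer uses a smoothed variant of IDF, so its values differ slightly from this plain definition.
code
import math

def tf(term, doc_tokens):
    # frequency of the term in the document / total words in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # log of (total documents / documents containing the term)
    n_docs = len(corpus_tokens)
    n_docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / n_docs_with_term)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# toy pre-tokenized corpus for illustration
corpus_tokens = [
    ["mary", "know", "baby", "boy"],
    ["baby", "boy", "come", "make", "new"],
    ["mary", "know", "calm", "storm"],
]

print(tf_idf("mary", corpus_tokens[0], corpus_tokens))  # (1/4) * ln(3/2) ≈ 0.101
print(tf_idf("calm", corpus_tokens[2], corpus_tokens))  # (1/4) * ln(3/1) ≈ 0.275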