Lesson 10: Regular Expressions

Regular Expressions and NLP

Regular expressions, usually abbreviated as regex or regexp, are generally helpful in finding parts of a text that meet certain conditions. They are vital for searching, extracting, or replacing string elements in a text. Regular expressions have been applied in different ways in natural language processing, including searching for and removing special characters for text cleaning. They are also used with hard-coded logic to build pattern-based chatbots.

For example, given some input text, you could use a regular expression to find the word “John” in the text. You could also be given a text file to search for valid passwords. Suppose a valid password has eight characters, including at least one number, one lowercase letter, one uppercase letter, and one special character. A regular expression or search pattern could specify a rule that matches only valid passwords. Regular expressions can also be used to validate emails and phone numbers entered into an electronic form or device.

What is a Regular Expression?

Note

A regular expression is a sequence of characters that defines a search pattern. The search pattern is then used to match, locate or find specific strings in a text.

There are different types of characters in a search pattern:

Alphanumeric characters: these are alphabetic (letters) or numeric (numbers) characters. Alphanumeric characters are also called literals. Literals are the simplest form of regular expressions, and consist of letters from a to z (lowercase or uppercase) or numbers from 0 to 9.
Special characters: these are non-alphanumeric characters, also called meta-characters. Examples of special characters include \, +, (, ), $, [, ], ?, ., *, and |.

We will use the findall() method inside the regular expression module in Python to demonstrate these concepts. The findall() method returns a list of all the matches found. For example, the pattern /John/ can be searched in the sentence, My name is John, and I like regular expressions! For the sake of illustration in this lesson, we will put every pattern inside forward slashes, / /. These forward slashes are not needed in code.

Code

import re

pattern = "John"
text = "My name is John, I like regular expressions!"
re.findall(pattern=pattern, string=text)
## ['John']

The result of the above code shows that one match was found for the pattern.

We can use a literal in combination with special characters in a pattern. For example, /John|Jonas/ could be used to find the word John or Jonas in the text, My name is John, and my brother’s name is Jonas. The vertical bar | is a special character that stands for or. Let’s demonstrate this example in Python.

Code

import re

pattern = "John|Jonas"
text = "My name is John, and my brother's name is Jonas"
re.findall(pattern=pattern, string=text)
## ['John', 'Jonas']

If you want a pattern to match a special character literally, you need to escape it by preceding it with a backslash. These special characters should be escaped in a pattern if you intend to match them in a text: \, +, *, (, ), ^, $, [, ], {, }, ?, . and |.

Note

Programming languages may have different mechanisms for escaping special characters in patterns. For example, Python provides the re.escape() method, which returns a copy of a string with its special characters escaped.
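
As a minimal sketch of escaping, the example below searches for a literal price in a sentence. The sample text and variable names are made up for illustration. It compares an unescaped pattern, a pattern escaped by hand with backslashes, and a pattern escaped with re.escape().

```python
import re

# Illustrative sample text (an assumption, not from the lesson)
text = "The price is $5.00 (plus tax)."

# Unescaped: "$" means end of text and "." means any character,
# so this pattern cannot find the literal string "$5.00".
unescaped = re.findall(r"$5.00", text)

# Escaped by hand with backslashes
manual = re.findall(r"\$5\.00", text)

# Escaped automatically with re.escape()
automatic = re.findall(re.escape("$5.00"), text)

print(unescaped)
## []
print(manual)
## ['$5.00']
print(automatic)
## ['$5.00']
```

re.escape() is most useful when the text to match comes from a variable or from user input, where you cannot escape the special characters by hand.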

Pattern matching with regular expressions returns no results if no match is found. One or multiple matches could be found and returned. If we apply the pattern /mangoes/ to find all the matches in the phrase sweet mangoes are the best mangoes in the world, the result displayed will contain two matches because two matches are found in the phrase.
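
The mangoes example above can be sketched with findall() as follows. The second pattern, /oranges/, is an illustrative addition to show what is returned when nothing matches.

```python
import re

text = "sweet mangoes are the best mangoes in the world"

# Two occurrences of the pattern are found
matches = re.findall(r"mangoes", text)

# No occurrence is found, so findall() returns an empty list
no_matches = re.findall(r"oranges", text)

print(matches)
## ['mangoes', 'mangoes']
print(len(matches))
## 2
print(no_matches)
## []
```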

Regular Expression Syntax

Using Character Classes

A character class is a range or set of characters specified in a pattern to find any part of a text that matches the pattern. Square brackets, [], are used to specify a character class. For example, if we want to search for the word analyze (American English) or analyse (British English) we can use character classes in the pattern, /analy[zs]e/. Ranges of character classes can be specified as follows.

Character Classes

/[0-9]/ or /\d/ matches any single numeric character from 0 to 9.
/[^0-9]/ or /\D/ matches any single character that is not a number.
/[a-z]/ matches any single lowercase letter from a to z.
/[A-Z]/ matches any single uppercase letter from A to Z.
/[a-zA-Z]/ matches any single uppercase or lowercase letter. This is similar to the POSIX character class [:alpha:].
/[0-9a-zA-Z]/ matches any single alphanumeric character (number, lowercase letter or uppercase letter). This is similar to the POSIX character class [:alnum:]. The shortcut /\w/ matches the same characters plus the underscore _.
/[^0-9a-zA-Z]/ matches any single non-alphanumeric character. The shortcut /\W/ matches any single character that /\w/ does not match.
[ ] is used to specify a set of characters. For example, /[-\s]/ matches any single hyphen or whitespace such as in a phone number.

Note

The POSIX standard provides a simple and consistent syntax for character classes, for example [:alpha:] and [:alnum:]. Python's re module does not support this bracket syntax, so use the equivalent ranges or shortcuts such as /[a-zA-Z]/ and /\d/ instead.
The caret ^ symbol is used to negate character classes. For example, /[^0-9]/ matches any character that is not a number.
Some character classes have shortcuts. For example, /\d/ is a shortcut for /[0-9]/.
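
The character classes listed above can be tried out with findall(). This is a small sketch; the sample strings are assumptions chosen for illustration.

```python
import re

# Illustrative sample text (an assumption, not from the lesson)
text = "Room B2, call 555-0199"

digits = re.findall(r"[0-9]", text)            # every single digit
uppercase = re.findall(r"[A-Z]", text)         # every uppercase letter
non_alnum = re.findall(r"[^0-9a-zA-Z]", text)  # ^ negates the class
spellings = re.findall(r"analy[zs]e", "analyze analyse analyte")

# \w also matches the underscore, unlike [0-9a-zA-Z]
word_chars = re.findall(r"\w", "a_1")
alnum_only = re.findall(r"[0-9a-zA-Z]", "a_1")

print(digits)
## ['2', '5', '5', '5', '0', '1', '9', '9']
print(uppercase)
## ['R', 'B']
print(non_alnum)
## [' ', ',', ' ', ' ', '-']
print(spellings)
## ['analyze', 'analyse']
print(word_chars)
## ['a', '_', '1']
print(alnum_only)
## ['a', '1']
```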

Other Regular Expression Patterns:

/./ matches any character except the newline character \n. Using character classes is a better practice for clarity.
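/\s/ matches any single whitespace character. This is similar to the POSIX character class [:space:].
/\S/ matches any single non-whitespace character.
/^/ negates a character set when it is the first character inside the square brackets, so the set matches any character except the ones listed. For example, /[^abc]/ matches any character other than a, b or c.
/\// matches a forward slash /.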
/hello|hi|hola/ matches either hello, hi or hola. The vertical bar or pipe means OR.
/greetings:(hello|hi)/ matches greetings:hello or greetings:hi.
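
The sketch below tries out a few of these patterns. The sample strings are assumptions for illustration. It also shows why the parentheses in /greetings:(hello|hi)/ matter: without them, the vertical bar splits the whole pattern into greetings:hello OR hi.

```python
import re

# Alternation: match any one of three greetings
greetings = re.findall(r"hello|hi|hola", "hi there, hola amigo, hello world")

# With parentheses, the alternatives are limited to hello or hi
grouped = re.search(r"greetings:(hello|hi)", "greetings:hi").group()

# Without parentheses, the alternatives are "greetings:hello" or "hi"
ungrouped = re.search(r"greetings:hello|hi", "greetings:hi").group()

# The dot matches any character except a newline
dot_matches = re.findall(r"c.t", "cat cut c\nt")

print(greetings)
## ['hi', 'hola', 'hello']
print(grouped)
## greetings:hi
print(ungrouped)
## hi
print(dot_matches)
## ['cat', 'cut']
```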

Quantifiers

Quantifiers are special characters in a pattern that indicate how many times the preceding character, character class or group can be repeated.

Note

* is a special character that matches the preceding character zero or more times. A pattern such as /file*/ matches fil, file and filee, because * applies only to the letter e. To match file followed by any characters, such as files, file.txt, file1 or fileA, use /file.*/.
? is a special character that matches the preceding character zero or one time. The pattern /files?/ will match file and files, and /apples?/ will match the word apple and its plural apples. ? is useful for matching characters that are optional such as the country code in a phone number.
+ matches the preceding character one or more times.
{1,3} matches the preceding character 1 to 3 times.
{2,} matches the preceding character 2 or more times. Do not put a space inside the braces; Python treats {1, 3} as literal text rather than as a quantifier.

Note that quantifiers are always applied only to the previous token. So /\d+/ matches any decimal or numeric character repeating one or more times.
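
The following sketch compares the quantifiers on the same text. The sample words are made-up strings chosen so that only the number of o characters varies.

```python
import re

# Illustrative sample text (an assumption, not from the lesson)
text = "ggle gogle google gooogle"

zero_or_more = re.findall(r"go*gle", text)     # o repeated 0 or more times
one_or_more = re.findall(r"go+gle", text)      # o repeated 1 or more times
zero_or_one = re.findall(r"go?gle", text)      # o repeated 0 or 1 time
one_to_three = re.findall(r"go{1,3}gle", text) # o repeated 1 to 3 times
two_or_more = re.findall(r"go{2,}gle", text)   # o repeated 2 or more times

# An optional plural s, restricted to full words with \b
plurals = re.findall(r"\bfiles?\b", "file files filed")

print(zero_or_more)
## ['ggle', 'gogle', 'google', 'gooogle']
print(one_or_more)
## ['gogle', 'google', 'gooogle']
print(zero_or_one)
## ['ggle', 'gogle']
print(one_to_three)
## ['gogle', 'google', 'gooogle']
print(two_or_more)
## ['google', 'gooogle']
print(plurals)
## ['file', 'files']
```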

Boundary Matchers

So far, we have been interested in whether a match was found at any location in the input text. For example, the pattern, /compute/ will find matches in any location in the string “compute, computes and computer”.

Code

import re

pattern = "compute"
text = "compute, computes, and computer"
re.findall(pattern=pattern, string=text)
## ['compute', 'compute', 'compute']

Instead of searching for a match at any location in the text or matching just a part of a word, you may be interested in using boundary matchers to:

match a full word.
find a match at the beginning of a line.
find a match at the end of a line.
find a match at the end of a previous match.

Boundary matchers restrict the match to happen at certain locations and could be used to match full words. For example, if we want to match only the word compute, then we need to use a word boundary \b in the pattern, for example, /\bcompute\b/.

Code

import re

pattern = r"\bcompute\b"
text = "compute, computes, and computer"
re.findall(pattern=pattern, string=text)
## ['compute']

Note

Python’s raw string prefix r is used in the pattern so that Python does not interpret the backslashes as escape sequences. This is a good practice even when backslashes are not used, as you may need to adjust the string later to include backslashes.
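
To see why the raw string matters, the sketch below runs the same search with and without the r prefix. In a regular Python string, "\b" is the backspace character, so the pattern never reaches the regex engine as a word boundary.

```python
import re

text = "compute, computes, and computer"

# Regular string: "\b" is a single backspace character
plain_pattern = "\bcompute\b"

# Raw string: r"\b" is a backslash followed by b, which the
# regex engine reads as a word boundary
raw_pattern = r"\bcompute\b"

plain_result = re.findall(plain_pattern, text)
raw_result = re.findall(raw_pattern, text)

print(len("\b"), len(r"\b"))
## 1 2
print(plain_result)
## []
print(raw_result)
## ['compute']
```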

Here is a List of Boundary Matchers:

Boundary Matchers

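^ matches the beginning of a line. ^ is placed before the word to match. In Python, ^ matches only at the beginning of the input text unless the re.MULTILINE flag is used.
$ matches the end of a line. $ is placed after the word to match. In Python, $ matches only at the end of the input text unless the re.MULTILINE flag is used.
\b matches a word boundary. \b checks whether a pattern begins or ends on a word boundary.
\B matches a non-word boundary, that is, any position that is not a word boundary.
\A matches the beginning of the input text.
\Z matches the end of the input text.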
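
The boundary matchers can be compared on a short two-line text, as in the sketch below. The sample text is an assumption chosen so that cat appears as a full word and inside other words.

```python
import re

# Illustrative two-line sample text (an assumption, not from the lesson)
text = "cat scatter\ncat concat"

start_of_text = re.findall(r"^cat", text)                       # first line only
start_of_lines = re.findall(r"^cat", text, flags=re.MULTILINE)  # every line
end_of_text = re.findall(r"cat$", text)
full_words = re.findall(r"\bcat\b", text)
inside_words = re.findall(r"\Bcat\B", text)                     # the cat in scatter
very_start = re.findall(r"\Acat", text, flags=re.MULTILINE)     # \A ignores lines
very_end = re.findall(r"cat\Z", text)

print(start_of_text)
## ['cat']
print(start_of_lines)
## ['cat', 'cat']
print(end_of_text)
## ['cat']
print(full_words)
## ['cat', 'cat']
print(inside_words)
## ['cat']
print(very_start)
## ['cat']
print(very_end)
## ['cat']
```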

Regular Expressions in Python

The built-in re module allows you to use regular expressions in Python. Note, however, that regular expressions are a separate language. Regular expressions can be more powerful and efficient for specifying complex search patterns than Python's built-in string methods such as find() and replace().

The module can be imported with import re. It provides several methods, which can be inspected with dir(re) after importing the module.

Code

import re

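# names of a few methods in the re module to display
# (looked up by name, since positions in dir(re) vary across Python versions)
names = ["compile", "escape", "findall", "finditer", "match", "search", "split"]
[name for name in dir(re) if name in names]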
## ['compile', 'escape', 'findall', 'finditer', 'match', 'search', 'split']

We will explore some of these regular expression methods in Python, mostly through examples.

The findall() Method of the `re` Module

The findall() method finds all pattern matches in the input text and returns a list of all the matches. findall() allows us to find the number of occurrences of a pattern in a text.

Let’s find every word in a text.

Code

import re

pattern = r"\w+"
text = "Scan this text and match all words or tokens"
re.findall(pattern=pattern, string=text)
## ['Scan', 'this', 'text', 'and', 'match', 'all', 'words', 'or', 'tokens']

Note

\w matches word characters (letters, digits and the underscore). For the English text in this example it gives the same result as [A-Za-z0-9].

Code

import re

pattern = r"[A-Za-z0-9]+"
text = "Scan this text and match all words or tokens"
re.findall(pattern=pattern, string=text)
## ['Scan', 'this', 'text', 'and', 'match', 'all', 'words', 'or', 'tokens']

Let’s adjust the pattern to match only words that start with the letter t in the text.

Code

import re

pattern = r"\bt\w+"
text = "Scan this text and match words or tokens that start with letter t"
re.findall(pattern=pattern, string=text)
## ['this', 'text', 'tokens', 'that']

The compile() Method of the `re` Module

The compile() method allows you to compile the pattern into a regular expression object, which can then be used with other regular expression methods such as findall(), search() and match().

Code

import re

pattern = r"\bt\w+"
pattern = re.compile(pattern)
text = "Scan this text and match words or tokens that start with letter t"
pattern.findall(text)
## ['this', 'text', 'tokens', 'that']

Note

The compile() method returns a regular expression object that you can store and reuse with other regular expression methods. You should use compile() if the pattern needs to be reused multiple times, as this is more efficient. Otherwise, module-level methods such as search() and match() automatically compile the pattern (and keep recently used patterns in an internal cache) before finding a match.

Using the compile() method gives you the flexibility of using the optional parameters, pos and endpos to limit the search to include only matches from pos to endpos - 1 in other regular expression methods.

Code

import re

pattern = r"\bt\w+"
pattern = re.compile(pattern)
text = "Scan this text and match words or tokens that start with letter t"
pattern.findall(text, pos=6, endpos=40)
## ['text', 'tokens']

The search() Method of the `re` Module

The search() method in the re module returns a match object for the first match found, whereas the findall() method returns a list of all matches.

Code

import re

pattern = "compute[a-z]*"
text = "computer, computing, computes,  compute, compete"
re.search(pattern=pattern, string=text)
## <re.Match object; span=(0, 8), match='computer'>

Note

The span in the results of the search indicates the position of the first match found.

Note

To extract the string that was matched, you can call the group() method on the match object.

Code

import re

pattern = "compute[a-z]*"
text = "computer, computing, computes, compute, compete"
re.search(pattern=pattern, string=text).group()
## 'computer'

The search() method returns a match object when a match is found; otherwise None is returned. Hence, an if statement usually follows the search method to test whether a match was found. We can use the search method to validate whether a pattern has a match in a given string.

Let’s use the search method to check whether a phone number is valid. A valid phone number in this case is the one that follows the format, xxx-xxx-xxxx or 1-xxx-xxx-xxxx.

Code

import re

pattern = r"^(1-)?\d{3}-\d{3}-\d{4}$"
text = "1-701-876-1234"
match = re.search(pattern=pattern, string=text)

if match:
    print("Match found, phone number is valid")
else:
    print("No match found, phone number is not valid")
## Match found, phone number is valid

Code

import re

pattern = r"^(1-)?\d{3}-\d{3}-\d{4}$"
text = "720-90-100"
match = re.search(pattern=pattern, string=text)

if match:
    print("Match found, phone number is valid")
else:
    print("No match found, phone number is not valid")
## No match found, phone number is not valid

Instead of matching only phone numbers with dashes (-), we could modify the pattern to also match phone numbers with spaces. To do this, we replace each dash - in the pattern with [-\s], which matches either a dash or a whitespace character, so that phone numbers in the format xxx xxx xxxx or 1 xxx xxx xxxx are also identified as valid. The search pattern ^(1-)?\d{3}-\d{3}-\d{4}$ then becomes ^(1[-\s])?\d{3}[-\s]\d{3}[-\s]\d{4}$.

Let’s loop over a list of numbers and extract only the valid phone numbers.

Code

import re

pattern = r"^(1[-\s])?\d{3}[-\s]\d{3}[-\s]\d{4}$"
phone_list = ["1-701-876-1234", "1 701 876 1234",  
              "720-900-100", "717 550 1675", "2487620356"]

valid_phone_numbers = []
for phone_number in phone_list:
    match = re.search(pattern=pattern, string=phone_number)
    if match:
        valid_phone_numbers.append(phone_number)
print(valid_phone_numbers)
## ['1-701-876-1234', '1 701 876 1234', '717 550 1675']

Notice that 2487620356 is a valid phone number but is reported as invalid because the pattern is not flexible enough to handle a phone number that has no spaces or dashes.

Let’s modify the pattern to also handle phone numbers without spaces or dashes, in the format xxxxxxxxxx. We need to add a question mark (?) after each [-\s] that separates the digit groups, which makes the separator optional.

Code

import re

pattern = r"^(1[-\s])?\d{3}[-\s]?\d{3}[-\s]?\d{4}$"
phone_list = ["1-701-876-1234", "1 701 876 1234",  
              "720-900-100", "717 550 1675", "2487620356"]

valid_phone_numbers = []
for phone_number in phone_list:
    match = re.search(pattern=pattern, string=phone_number)
    if match:
        valid_phone_numbers.append(phone_number)
print(valid_phone_numbers)
## ['1-701-876-1234', '1 701 876 1234', '717 550 1675', '2487620356']

The match() Method of the `re` Module

The match() method tries to find a match at the beginning of the input text. If the pattern is not at the beginning of the input text, None is returned; otherwise a match object is returned. You can then use an if statement to test if a match was found.

Code

import re

pattern = r"Hello"
text = "Hello is used for greeting. An alternative word for Hello is Hi"
result = re.match(pattern=pattern, string=text)
print(result)
## <re.Match object; span=(0, 5), match='Hello'>

As shown below, when the pattern /Hello/ is not at the beginning of the string, no match is found.

Code

import re

pattern = r"Hello"
text = "Say Hello!"
result = re.match(pattern=pattern, string=text)
print(result)
## None

We can explicitly compile the pattern to use the pos optional parameter in the match() method of the pattern object to specify the appropriate location where the search should start as shown below.

Code

import re

pattern = r"Hello"
pattern = re.compile(pattern)
text = "Say Hello!"
match = pattern.match(text, pos=4)
print(match)
## <re.Match object; span=(4, 9), match='Hello'>

The finditer() Method of the `re` Module

The finditer() method of the re module finds pattern matches in the string and returns them as an iterator. Let’s find the patterns in the string below using the finditer() method.

Code

import re

pattern = r"[0-9]+"
text = """
          John 0987
          James 8765
          Mary 6543
          Nathalia 39873
          Kenzie 2133
        """
re.finditer(pattern=pattern, string=text)
## <callable_iterator object at 0x7f1941eb4dc0>
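
The iterator yields one match object per match, so a for loop is the usual way to consume it. The sketch below reuses the pattern and text from the example above and collects the matched strings. The variable name numbers is introduced here for illustration.

```python
import re

pattern = r"[0-9]+"
text = """
          John 0987
          James 8765
          Mary 6543
          Nathalia 39873
          Kenzie 2133
        """

numbers = []
for match in re.finditer(pattern=pattern, string=text):
    # each item is a match object; group() returns the matched string
    # and span() returns its start and end positions in the text
    numbers.append(match.group())

print(numbers)
## ['0987', '8765', '6543', '39873', '2133']
```

Compared to findall(), finditer() gives access to the full match object, including its position, and does not build the whole list of results up front.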

Groups

Parentheses in regular expressions can be used to specify a group of characters as a unit. For example, let’s examine whether a string has one or multiple ab together followed by c.

Code

import re

pattern = r"(ab)+c"
text = "ababcaba"
re.search(pattern=pattern, string=text)
## <re.Match object; span=(0, 5), match='ababc'>

Parentheses can also be used to group subpatterns to specify alternatives. Let’s scan each string in a list and check whether the string contains the word color or colour.

Code

import re

pattern = r"col(o|ou)r"
string_list = ["color", "colour", "colur"]
for string in string_list:
    match = re.search(pattern=pattern, string=string)
    if match:
        print(match.group())
## color
## colour

If we used a pattern such as col[ou]{1,2}r to scan text such as "color colour colur coloor coluur colar", we would still detect color and colour, but the pattern would also match other strings such as colur, coloor and coluur. This can introduce a bug in the program if the intention is to detect only color or colour. This is why parentheses are needed to restrict the search to specific alternatives such as color and colour.

Code

import re

pattern = r"col[ou]{1,2}r"
text = "color colour colur coloor coluur colar"
re.findall(pattern=pattern, string=text)
## ['color', 'colour', 'colur', 'coloor', 'coluur']

Let’s examine another example that involves groups. Our goal is to capture each name together with the last four digits of the social security number that follows it in a given string. If the number after the name is not a four-digit number, it should not be matched.

Code

import re

pattern = r"([a-zA-Z]+)\s*(\b[0-9]{4}\b)"
text = """
          John 0987
          James 8765
          Mary 6543
          Nathalia 39873
          Kenzie 2133
        """
re.findall(pattern=pattern, string=text)
## [('John', '0987'), ('James', '8765'), ('Mary', '6543'), ('Kenzie', '2133')]
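
When search() is used with the same pattern, each parenthesized group can be read from the match object by its number. The sketch below is a minimal illustration on a single made-up record; group(0), or group() with no argument, is the whole match.

```python
import re

pattern = r"([a-zA-Z]+)\s*(\b[0-9]{4}\b)"
match = re.search(pattern=pattern, string="John 0987")

whole = match.group()    # the entire match
name = match.group(1)    # first parenthesized group
digits = match.group(2)  # second parenthesized group
both = match.groups()    # all groups as a tuple

print(whole)
## John 0987
print(name)
## John
print(digits)
## 0987
print(both)
## ('John', '0987')
```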

Greedy and Non-greedy Matching

By default, regular expressions use greedy matching. The regular expression engine tries to match the pattern at each position in the input text, moving to the next position if no match is found, and at each position a quantifier consumes as many characters as it can while still allowing the rest of the pattern to match.

Note

In the greedy mode (by default) a quantified character is repeated as many times as possible.

The example below shows how the repeated quantified character, b in the pattern produces different matches due to greedy matching.

Code

import re

pattern = r"ab*"
text = "acabcbabb"
re.findall(pattern=pattern, string=text)
## ['a', 'ab', 'abb']

In the following example, the pattern /<.+>/ uses a dot ., which represents any single character except a newline, followed by the quantifier +. We are interested in searching for and extracting the opening tag element of the HTML syntax in the string, HTML tag example: <h1> This is a level 1 heading </h1>. is an example of a tag element.

After matching the first character < in the pattern /<.+>/, the regex engine lets the subpattern .+ consume every following character, including closing angle brackets, until the end of the line is reached. Since the pattern still needs a closing angle bracket >, the regex engine backtracks one character at a time until it finds one. Hence, the string <h1> This is a level 1 heading </h1> is matched as shown below.

Code

import re

pattern = r"<.+>"
text = """HTML tag example: <h1> This is a level 1 heading </h1>. is an 
          example of a tag element"""
re.search(pattern=pattern, string=text).group()
## '<h1> This is a level 1 heading </h1>'

To search for and extract only the first opening tag element, <h1>, we need lazy matching instead of greedy matching. Lazy matching is specified by adding the question mark symbol ? after the quantifier.

Code

import re

pattern = r"<.+?>"
text = """HTML tag example: <h1> This is a level 1 heading </h1>. is an 
          example of a tag element"""
re.search(pattern=pattern, string=text).group()
## '<h1>'

Note

In lazy matching, a quantified character is repeated the least number of times possible.
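
The difference is easy to see when findall() is run with both versions of a quantifier, as in this short sketch. It reuses the heading string and the ab* pattern from the examples above.

```python
import re

text = "<h1> This is a level 1 heading </h1>"

# Greedy: .+ takes as much as possible, so one long match is returned
greedy_tags = re.findall(r"<.+>", text)

# Lazy: .+? takes as little as possible, so each tag is matched separately
lazy_tags = re.findall(r"<.+?>", text)

# Lazy b*? repeats b zero times whenever that is enough to match
greedy_ab = re.findall(r"ab*", "acabcbabb")
lazy_ab = re.findall(r"ab*?", "acabcbabb")

print(greedy_tags)
## ['<h1> This is a level 1 heading </h1>']
print(lazy_tags)
## ['<h1>', '</h1>']
print(greedy_ab)
## ['a', 'ab', 'abb']
print(lazy_ab)
## ['a', 'a', 'a']
```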

Chatbots and Regular Expressions

Early AI applications that used natural language processing were built with rule-based logic using regular expressions. For example, ELIZA is an early natural language processing chatbot developed at MIT by Joseph Weizenbaum between 1964 and 1967. ELIZA responded to questions using keywords from users’ questions, pattern matching and substitution methodology.

The program behaved as though it were intelligent, but it was explicitly programmed with rule-based logic and regular expressions. Hence, ELIZA does not meet today’s definition of AI, but it is instrumental in understanding the history and evolution of AI. ELIZA was one of the first programs that allowed a conversation between humans and machines. Natural language processing allows machines to be programmed or trained to provide human-like responses to questions.

Here is an example of responses from the ELIZA chatbot, generated from https://www.masswerk.at/elizabot/.

[Figure: Eliza Chatbot]

In addition to ELIZA, other popular chatbots were later developed, such as Siri in 2010, Google Now in 2012, and Alexa and Cortana in 2015.

Keyword-based Chatbots

Simple chatbots can be created as keyword-based chatbots with conditional statements. Keyword-based chatbots, also known as rule-based chatbots, are programmed to recognize a list of keywords or phrases and then provide fixed or predefined responses based on the keywords identified. Though keyword-based chatbots do not understand the user’s intent or context, they still provide value for certain types of use cases.

Rule-based chatbots are still used today to provide static information to users. For example, a rule-based chatbot is able to provide a user with product details or descriptions, answers to frequently asked questions, and links to specific sites. Hence these chatbots can be useful for customer support and information retrieval. Customer service is one of the most popular use cases of conversational or chatbot technology today.

How do rule-based chatbots work? A user typically interacts with a rule-based chatbot by typing a message or question. The chatbot receives the input text and searches through it to identify keywords or phrases that match its predefined list of keywords or phrases. Once a matching keyword or phrase is found in the input text, the chatbot returns the programmed response associated with the matched keyword.
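
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the keywords, the responses and the names RESPONSES, DEFAULT_RESPONSE and keyword_bot are all hypothetical.

```python
import re

# Hypothetical keyword-to-response table (illustrative assumptions)
RESPONSES = {
    "price": "Our basic plan costs $10 per month.",
    "hours": "We are open from 9am to 5pm, Monday to Friday.",
    "refund": "You can request a refund within 30 days of purchase.",
}
DEFAULT_RESPONSE = "Sorry, I did not understand that. Could you rephrase?"


def keyword_bot(message):
    """Return the predefined response for the first keyword found."""
    for keyword, response in RESPONSES.items():
        # \b restricts the match to the full keyword; letter case is ignored
        keyword_pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(keyword_pattern, message, flags=re.IGNORECASE):
            return response
    # no keyword matched, so fall back to a fixed reply
    return DEFAULT_RESPONSE


print(keyword_bot("What are your HOURS?"))
## We are open from 9am to 5pm, Monday to Friday.
print(keyword_bot("Tell me a joke"))
## Sorry, I did not understand that. Could you rephrase?
```

Because the bot only looks for keywords, a message such as "I do not care about the price" still triggers the price response, which illustrates the lack of intent and context mentioned above.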

Pattern-based Chatbots

Other virtual assistants use pattern-based technologies powered by regular expressions. Amazon Alexa initially used pattern-based matching and intent classification (pre-defined intents) to respond to user commands. Pattern-based matching involves using regular expressions to identify complex sequences of characters or words within text. A complex pattern can match a specific sequence and its variants, whereas keyword matching finds only exact keywords. For example, a pattern could be written to match a phone number with dashes or white space.
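
A pattern-based chatbot in the style of ELIZA can be sketched with groups: each rule pairs a pattern with a response template, and the text captured by the groups is substituted into the reply. The rules, templates and the names RULES and pattern_bot below are hypothetical and are not taken from ELIZA or any real assistant.

```python
import re

# Hypothetical rules: (compiled pattern, response template).
# {0} and {1} are filled with the text captured by the groups.
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.IGNORECASE),
     "Nice to meet you, {0}!"),
    (re.compile(r"\bI (need|want) ([^.!?]+)", re.IGNORECASE),
     "Why do you {0} {1}?"),
    (re.compile(r"\b(hello|hi|hola)\b", re.IGNORECASE),
     "Hello! How can I help you?"),
]


def pattern_bot(message):
    """Return the response of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # substitute the captured groups into the template
            return template.format(*match.groups())
    return "Please tell me more."


print(pattern_bot("Hi, my name is Ada"))
## Nice to meet you, Ada!
print(pattern_bot("I want a new phone!"))
## Why do you want a new phone?
print(pattern_bot("this is fine"))
## Please tell me more.
```

The rules are tried in order, so more specific patterns are listed before the general greeting rule.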
