Tokenization using regular expression sub patterns


A while back I was writing some stuff on this blog about regular expressions. While that remains unfinished, here’s a mini regex example – nothing earth-shattering, but a useful technique if you hadn’t already seen it. Prompted by a real-world example, one often-overlooked feature of most regular expression engines is how subpatterns can be used to whip up tokenizers relatively easily. The problem? I needed to match any of the words “Canton”, “Region” or “Group” in a string and perform a follow-up action depending on which matched. Dealing with the four main languages in Switzerland (German, French, Italian and English), it gets a bit more interesting: “Canton” translates to “Kanton” in German and “Cantone” in Italian, “Region” is “Regione” in Italian, and “Group” is “Gruppe”, “Groupe” and “Gruppo” in German, French and Italian respectively. Composing those into three straightforward regular expressions I have;

  • Canton: /cantone?|kanton/i
  • Region: /regione?/i
  • Group: /groupe?|grupp(?:o|e)/i
Now, on examining some input string, I could test each of those regexes individually against it, but that’s a) inefficient and b) likely to lead to lengthier code. Instead I combine them into a single regular expression using subpatterns: /(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i …then figure out which subpattern matched after a match is made. Note that technically this problem isn’t really one of tokenization but rather one of classifying the input with a common name, but the technique can easily be extended. In PHP the solution comes courtesy of the third argument to preg_match(), for example;

$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');

foreach ( $inputs as $input ) {
    preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches);
    print_r($matches);
}

I get output like;

Array
(
    [0] => Kanton
    [1] => Kanton
)
Array
(
    [0] => Regione
    [1] => 
    [2] => Regione
)
Array
(
    [0] => Gruppe
    [1] => 
    [2] => 
    [3] => Gruppe
)
Notice how the first element of this array is always the full text I matched, while the elements indexed 1 and upwards correspond to the subpatterns, from left to right in the pattern – I can use this to tell which subpattern actually matched, e.g.;

$inputs = array( 'Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe');
$tokens = array('canton','region','group'); // the token names

foreach ( $inputs as $input ) {
    
    if ( preg_match("/(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))/i", $input, $matches) ) {
        
        foreach ( array_keys( $matches) as $key) {
            if ( $key == 0 ) { continue; } // skip the first element
            
            // Look for the subpattern we matched...
            if ( $matches[$key] != "" ) {
                printf("Input: '%s',  Token: '%s'n", $input, $tokens[$key-1]);
            }
        }
    }
}

Which gives me output like;

Input: 'Kanton Zuerich',  Token: 'canton'
Input: 'Frauenfeld Regione',  Token: 'region'
Input: 'Fricktal Gruppe',  Token: 'group'
…so I’m now able to classify the input to one of a set of known tokens and react accordingly. Most regex APIs provide something along these lines; for example, here’s the same (and much cleaner) in Python, which is what I actually used on this problem;

import re

p = re.compile('(cantone?|kanton)|(regione?)|(groupe?|grupp(?:o|e))', re.I)
inputs = ('Kanton Zuerich', 'Frauenfeld Regione', 'Fricktal Gruppe')
tokens = ('canton','region','group')

for input in inputs:
    m  = p.search(input)
    if not m: continue
    for group, token in zip(m.groups(), tokens):
        if group is not None:
            print "Input: '%s', Token: '%s'" % ( input, token )

This could be reduced further using list comprehensions – as in the sketch above – but I don’t think it helps readability in this case.

Here’s an alternative problem to give you a feel for how this technique can be applied. Let’s say you want to parse an HTML document and list a subset of the block-level vs. the inline-level tags it contains. You might do this with two subpatterns, e.g. (</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>) (note this regex as-is is geared to Python’s idea of greediness – you’d need to change it for PHP), leading to something like this in Python;

p = re.compile('(</?(?:div|h[1-6]{1}|p|ol|ul|pre).*?>)|(</?(?:span|code|em|strong|a).*?>)')

for match in p.finditer('foo <div> test <strong>bar</strong> test 1</div> bar'):
    print "[pos: %s] matched %s" % ( match.start(), str(match.groups()) )

The call to match.groups() returns a tuple which tells you which subpattern matched, while match.start() tells you the character position in the document where the match was made, allowing you to pull substrings out of the document.
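
To make that concrete, here’s a small sketch that labels each match and slices it back out of the source string – the doc variable and the block/inline labels are my own, assuming the compiled pattern p from above;

doc = 'foo <div> test <strong>bar</strong> test 1</div> bar'

for match in p.finditer(doc):
    # one entry per subpattern; the non-empty one says which kind of tag we hit
    block, inline = match.groups()
    if block is not None:
        kind = 'block'
    else:
        kind = 'inline'
    # the match offsets let us slice the tag straight back out of the document
    print "%s tag %s at position %d" % ( kind, doc[match.start():match.end()], match.start() )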

Frequently Asked Questions on Tokenization Using Regular Expression Sub-Patterns

What is tokenization in the context of regular expressions?

Tokenization, in the context of regular expressions, is the process of breaking a piece of text down into smaller units such as words, keywords, phrases, symbols, and other elements, which are called tokens. These tokens help in understanding the context or developing the structure of the text data. The process is widely used in Natural Language Processing (NLP) and text analytics to help machines understand human language.

How does tokenization using regular expression sub-patterns work?

Tokenization using regular expression sub-patterns involves using specific patterns to match and extract tokens from a given text. Regular expressions are a powerful tool for text manipulation. They can match almost any desired character sequence or pattern. When used in tokenization, they can efficiently break down large text corpuses into smaller, more manageable tokens.

What are the benefits of using regular expressions for tokenization?

Regular expressions provide a flexible and efficient way to perform text processing tasks such as tokenization. They allow for complex pattern matching and extraction, which can be particularly useful when dealing with unstructured text data. Regular expressions can also handle a wide variety of text processing tasks beyond tokenization, such as string replacement, pattern finding, and text cleaning.

Can you provide an example of tokenization using regular expressions?

Sure, let’s consider a simple example. Suppose we have a sentence “I love programming in Python!”. We can use the regular expression “\w+” to tokenize this sentence. The “\w+” pattern matches one or more word characters. So, when applied to our sentence, it will break it down into the following tokens: [‘I’, ‘love’, ‘programming’, ‘in’, ‘Python’].
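
For instance, in Python (an illustrative sketch – the variable names here are mine):

import re

sentence = "I love programming in Python!"
print re.findall(r'\w+', sentence)   # ['I', 'love', 'programming', 'in', 'Python']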

How can I use regular expressions for tokenization in Python?

Python’s re module provides functions to work with regular expressions. To use it for tokenization, you can use the re.findall() function, which returns all non-overlapping matches of a pattern in a string as a list of strings. The string is scanned from left to right, and matches are returned in the order found.

What are sub-patterns in regular expressions?

Sub-patterns in regular expressions are parts of the pattern that are enclosed in parentheses. These sub-patterns can be used to extract specific parts of the matched text. They can also be used to apply repetition operators to a group of characters instead of just a single character.
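
A short illustrative sketch of both uses in Python:

import re

# extraction: each parenthesised sub-pattern becomes a group you can read back
m = re.match(r'(\w+)\.(\w+)', 'example.com')
print m.groups()   # ('example', 'com')

# repetition: the + applies to the whole 'ab' sequence, not just a single character
print re.match(r'(ab)+', 'ababab').group()   # 'ababab'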

How can I use sub-patterns in tokenization?

Sub-patterns can be used in tokenization to extract specific parts of the text. For example, if you want to extract all email addresses from a text, you can use a regular expression with a sub-pattern that matches the structure of an email address.
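
A sketch of that idea in Python – the pattern is deliberately simplified, since real email addresses are messier:

import re

text = "Contact alice@example.com or bob@example.org for details."

for m in re.finditer(r'([\w.+-]+)@([\w.-]+)', text):
    print "user: %s, domain: %s" % ( m.group(1), m.group(2) )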

Can you provide an example of using sub-patterns in tokenization?

Sure, let’s consider an example where we want to extract all dates in the format “dd-mm-yyyy” from a text. We can use the regular expression “(\d{2})-(\d{2})-(\d{4})” for this. Here, “\d{2}” and “\d{4}” are sub-patterns that match two- and four-digit numbers respectively.
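
In Python, that might look like (an illustrative sketch):

import re

text = "Invoices dated 01-02-2023 and 15-06-2023 are overdue."

for m in re.finditer(r'(\d{2})-(\d{2})-(\d{4})', text):
    day, month, year = m.groups()
    print "day: %s, month: %s, year: %s" % ( day, month, year )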

What are some common challenges in tokenization using regular expressions?

Some common challenges in tokenization using regular expressions include handling different languages and scripts, dealing with ambiguous patterns, and managing the complexity of regular expressions for complex patterns. It can also be computationally expensive for large texts.

How can I overcome these challenges?

To overcome these challenges, you can use libraries and tools that provide pre-built regular expressions for common patterns. You can also use machine learning techniques to improve the accuracy of tokenization. For handling large texts, you can use techniques such as streaming to process the text in chunks instead of all at once.

Harry Fuecks

Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.
