Saturday, May 14, 2016

A Regular Expression Engine in Python


Regular expressions are quiet useful in programming tasks. They are a useful method to search for patterns in long strings of text. For example the pattern "(c|b)at" allows us to search for either 'cat' or 'bat' in any string that this pattern is run over. This example was a simple one and regular expressions in practice are far more powerful and complex.

One must take care however in remembering that regular expressions are a loaded mathematical term. They are not the same as regex and regexp since those two do not make regular languages.

Here, for the sake of brevity we shall call regular expressions regex. Regexes are usually made using Deterministic Finite Automatons. They take a string of input symbols and tell us if the DFA accepted the symbol string or not. Such an activity is called the run of the DFA.

Steps
  1. We define that our regex engine will only provide the basic elements of concatenation, union and kleen star. Thus '|*' are the only special characters allowed besides the other things in the alphabet. 
  2. We first build a Non-Deterministic Finite Automaton with epsilon transitions using Thompson's construction algorithm from the given pattern.
  3. Then we convert this e-NFA to a regular NFA without epsilon transitions by the epsilon closure method.
  4. The new NFA is converted to an equivalent DFA by the power set construction method.
  5. The DFA is finally run on a string. The Union (|) and the Kleen closure (*). Concatenation is assumed for consecutive symbols.
  6. Upon completing the run, the DFA either returns True or False depicting that it has accepted the string or rejected it.
Example

For the regex '0|10*' (which means either a zero or one followed by any number of zeros) on the alphabet '01' the following strings return results as shown:
  • '1' : True
  • '0' : True
  • '01': False
  • '10': True
  • '11': False 
The entire code is available as a gist.