# Lecture 2: Regular Expressions and Finite State Machines

### Background Material

• Regular expressions (CS 265, CS 270, ECE 200)
• Finite state machines (CS 270, ECE 200)
• Chapter 10 of Alfred V. Aho and Jeffrey D. Ullman, Foundations of Computer Science - C Edition, W. H. Freeman and Company, 1995 (text for CS 270).

• Sec. 4.1 of the text (tokens and scanning)

### Theme

Regular expressions describe patterns that can be recognized by finite state machines (FSMs). An FSM corresponding to a given regular expression can be constructed algorithmically. An FSM can be described by a transition table (its program), which can in turn be represented as a string. An FSM can be simulated to recognize the patterns it accepts.

### Topics

1. Deterministic Finite Automata (DFA) - language recognizer.
• Definition: (A,S,s0,F,T), where A = alphabet, S = states, s0 = start state, F = accepting states, T = transition function. For s in S and a in A, T[s,a] = s' in S.
• Simulation on input string. Starting in s0, read one symbol at a time, applying T to determine the next state. The input string is accepted if the final state is in F.
• Language accepted by DFA = set of input strings that cause the DFA to end up in an accepting state.
• Textual representation of DFAs.
• Graphical representation of DFAs using Graphviz. See fsm.pdf and the DOT program it was generated from, fsm.dot.
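
The DFA simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `dfa_accepts`, the dictionary encoding of T, and the example machine (strings over {a, b} with an even number of a's) are all assumptions made for this sketch.

```python
def dfa_accepts(T, s0, F, string):
    """Simulate a DFA (A, S, s0, F, T) on `string`.

    T is a dict mapping (state, symbol) -> next state.
    Assumption of this sketch: a missing entry in T means "reject".
    """
    state = s0
    for symbol in string:
        key = (state, symbol)
        if key not in T:
            return False
        state = T[key]  # apply the transition function
    return state in F   # accept iff the final state is accepting


# Hypothetical example DFA: accepts strings with an even number of a's.
EVEN_A = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

if __name__ == "__main__":
    for w in ["", "ab", "abab", "bbb"]:
        print(repr(w), dfa_accepts(EVEN_A, "even", {"even"}, w))
```

Storing T as a dictionary keyed by (state, symbol) mirrors the T[s,a] notation used in the definition; a two-dimensional array indexed by state and symbol numbers would work equally well.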
2. Non-deterministic Finite Automata (NDFA)
• Definition: Same as the DFA (A,S,s0,F,T), except that there can be multiple transitions from a given s in S on a given a in A, i.e. T need not be a function and T[s,a] is a set of states. Epsilon transitions are also allowed, i.e. it is possible to move to a new state without reading a symbol from the input.
• Set of possible transition states and epsilon closure.
• Simulation of NDFA M: Compute S = the set of states that M could be in after reading each symbol of the input.
1. Initialize S_0 = EpsilonClosure({s0}), since states reachable from s0 by epsilon transitions are possible before any input is read.
2. Let S_i be the set of states that M could be in after reading the first i symbols of str. S_i is computed by taking the union of all possible transitions from the states in S_{i-1} on the i-th symbol, and then computing the epsilon closure (i.e. adding the states that can be reached by applying epsilon transitions).

T_i = Union_{s in S_{i-1}} M->T[s,str[i]]
S_i = EpsilonClosure(T_i)
3. If, after the entire string has been read, the final set of states contains an accepting state, report that the input string was accepted by M.
• For any NDFA there exists an equivalent DFA which accepts the same strings, i.e. defines the same language. This means that for finite automata non-determinism does not add any more power.
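
The set-of-states simulation outlined above might look as follows in Python. This is a sketch under stated assumptions: epsilon transitions are labelled with the empty string `EPS`, T maps (state, symbol) to a set of states, and the example machine (an NDFA for (a|b)*abb with one extra epsilon transition) is hypothetical. The start set is taken to be the epsilon closure of {s0}.

```python
EPS = ""  # assumed label for epsilon transitions in this sketch


def epsilon_closure(T, states):
    """Return all states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in T.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(T, s0, F, string):
    """Simulate an NDFA by tracking the set S_i of possible states."""
    current = epsilon_closure(T, {s0})           # S_0
    for symbol in string:
        moved = set()                            # T_i: union of transitions
        for s in current:
            moved |= T.get((s, symbol), set())
        current = epsilon_closure(T, moved)      # S_i
    return bool(current & set(F))


# Hypothetical NDFA for (a|b)*abb: q0 -eps-> q1, q1 loops on a and b.
ABB = {
    ("q0", EPS): {"q1"},
    ("q1", "a"): {"q1", "q2"},
    ("q1", "b"): {"q1"},
    ("q2", "b"): {"q3"},
    ("q3", "b"): {"q4"},
}

if __name__ == "__main__":
    for w in ["abb", "babb", "ab", ""]:
        print(repr(w), nfa_accepts(ABB, "q0", {"q4"}, w))
```

The sets S_i computed here are exactly the states of the equivalent DFA obtained by the subset construction, which is one way to see why non-determinism adds no power.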
3. Regular expressions (language generator)
• Definition:
1. Base case: a single character a in A, the symbol epsilon (denoting the empty string), and the empty set are regular expressions.
2. Recursion: If R and S are regular expressions then R|S [union], RS [concatenation], and R* [closure] are regular expressions.
• Examples: a, (a|b), (aa)*, (a|b)*abb, b*(b*ab*ab*a)*
• Constructing an NDFA that accepts the language described by a regular expression. The construction produces an NDFA with a single accepting state.
1. Base case. For R = a, create a 2-state DFA with a start state s0 and an accepting state s1, and the transition T[s0,a] = s1.
2. Construction for recursive part of definition.
• [R|S] Add new start state with epsilon transitions into the start states of R and S. Add epsilon transitions from accepting states [no longer accepting states] of R and S to a new accepting state.
• [RS] Make the start state of R the start state of RS and connect, via an epsilon transition, the accepting state [no longer an accepting state] of R to the start state of S. The accepting state of RS is the accepting state of S.
• [R*] Add a new start state and a new accepting state, with an epsilon transition from the new start state to the start state of R and an epsilon transition from the new start state to the new accepting state. Also add an epsilon transition from the accepting state [no longer an accepting state] of R to the new accepting state. Finally, add an epsilon transition from the accepting state of R back to the start state of R.
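
The construction above can be sketched in Python as a set of combinators, one per case of the regular expression definition. The names `Frag`, `sym`, `union`, `concat`, `star`, and `accepts` are hypothetical helpers introduced for this illustration; epsilon is again represented by the empty string, and states are simply fresh integers. A small matcher is included only so the sketch is self-contained.

```python
import itertools

EPS = ""                    # assumed label for epsilon transitions
_ids = itertools.count()    # source of fresh state numbers


class Frag:
    """An NDFA with one start state and a single accepting state."""
    def __init__(self, start, accept, T):
        self.start, self.accept, self.T = start, accept, T


def _add(T, s, a, t):
    T.setdefault((s, a), set()).add(t)


def _merge(*tables):
    T = {}
    for table in tables:
        for key, targets in table.items():
            T.setdefault(key, set()).update(targets)
    return T


def sym(a):
    """Base case R = a: two states with T[s0, a] = s1."""
    s0, s1 = next(_ids), next(_ids)
    T = {}
    _add(T, s0, a, s1)
    return Frag(s0, s1, T)


def union(r, s):
    """[R|S]: new start and accepting states joined by epsilon transitions."""
    start, accept = next(_ids), next(_ids)
    T = _merge(r.T, s.T)
    _add(T, start, EPS, r.start)
    _add(T, start, EPS, s.start)
    _add(T, r.accept, EPS, accept)
    _add(T, s.accept, EPS, accept)
    return Frag(start, accept, T)


def concat(r, s):
    """[RS]: epsilon transition from R's accepting state to S's start state."""
    T = _merge(r.T, s.T)
    _add(T, r.accept, EPS, s.start)
    return Frag(r.start, s.accept, T)


def star(r):
    """[R*]: new start/accepting states, a bypass, and a loop back into R."""
    start, accept = next(_ids), next(_ids)
    T = _merge(r.T)
    _add(T, start, EPS, r.start)
    _add(T, start, EPS, accept)
    _add(T, r.accept, EPS, accept)
    _add(T, r.accept, EPS, r.start)
    return Frag(start, accept, T)


def accepts(frag, string):
    """Set-of-states simulation, included so the sketch can be run."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for t in frag.T.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({frag.start})
    for ch in string:
        moved = set()
        for s in current:
            moved |= frag.T.get((s, ch), set())
        current = closure(moved)
    return frag.accept in current


# (a|b)*abb, one of the example expressions listed above.
ABB = concat(concat(concat(star(union(sym("a"), sym("b"))),
                           sym("a")), sym("b")), sym("b"))
```

Each call to `sym` allocates fresh states, so a symbol that occurs several times in the expression gets its own pair of states each time; sharing a fragment between two subexpressions would break the construction.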
4. The awk language

### Exercises

• Simulate a deterministic FSM
• Construct an FSM corresponding to a regular expression
• Assignment 1
Created: Jan. 8, 2006 by jjohnson AT cs DOT drexel DOT edu