Lecture 3: Lexical Analysis and Scanning

Background Material

Reading

Theme

Regular expressions provide a convenient mechanism for formally specifying patterns. Since such patterns can be used for searching, regular expressions are found in most editors. There are also tools, such as grep and sed, for finding patterns and editing strings that match patterns. Some scripting languages, such as awk and perl, also incorporate regular expressions; these languages perform specified actions when strings matching the patterns are found.

Regular expressions are an important tool for describing the tokens (identifiers, keywords, constants, separators) of a programming language. The tokens are the basic symbols that are put together to form the syntax of a language; for example, identifiers can typically be described by the regular expression letter (letter | digit)*. Processing the tokens separately from determining whether they are put together properly (syntax) simplifies the task of checking syntax: white space and comments can be removed, identifiers and numeric constants can be processed, and the level of abstraction is raised so that the syntax checker deals with tokens rather than individual characters. A small sketch of testing such a pattern appears below.
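
As a sketch of how such a pattern can be checked in practice, the following C fragment uses the POSIX regular expression library to test whether a string is an identifier of the form letter (letter | digit)*. The function name is_identifier and the exact character classes are illustrative choices, not part of the lecture.

    #include <regex.h>
    #include <stdio.h>

    /* Return 1 if s matches letter (letter | digit)*, written here in
     * POSIX extended syntax.  Anchoring with ^ and $ forces the whole
     * string, not just a substring, to match. */
    int is_identifier(const char *s)
    {
        regex_t re;
        int ok;
        if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0)
            return 0;
        ok = (regexec(&re, s, 0, NULL, 0) == 0);
        regfree(&re);
        return ok;
    }

    int main(void)
    {
        printf("%d\n", is_identifier("count1"));  /* prints 1 */
        printf("%d\n", is_identifier("1count"));  /* prints 0 */
        return 0;
    }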

In this lecture we show how to build a scanner for processing tokens. A hand-written scanner amounts to encoding a finite state machine that recognizes the language described by a regular expression. Alternatively, a scanner can be derived automatically from a description of the allowable tokens; we use the scanner generator lex as an example.
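
The following is a minimal sketch of such a hand-written scanner in C. Each branch of the code corresponds to a state of the underlying finite state machine: the loop on isalnum() implements the self-loop of the identifier automaton, and the loop on isdigit() the self-loop of the integer automaton. The token names and the get_token interface are illustrative assumptions, not a fixed API from the lecture.

    #include <ctype.h>
    #include <stdio.h>

    /* Illustrative token codes; a real scanner would share these with
     * the parser. */
    enum token { TOK_ID, TOK_NUM, TOK_OTHER, TOK_EOF };

    /* Text (lexeme) of the most recent token. */
    static char lexeme[128];

    /* Return the next token from stdin. */
    enum token get_token(void)
    {
        int c = getchar();
        int i = 0;

        while (c == ' ' || c == '\t' || c == '\n')   /* discard white space */
            c = getchar();

        if (c == EOF)
            return TOK_EOF;

        if (isalpha(c)) {             /* identifier: letter (letter|digit)* */
            do {
                if (i < (int)sizeof lexeme - 1)
                    lexeme[i++] = c;
                c = getchar();
            } while (isalnum(c));
            ungetc(c, stdin);         /* put back the lookahead character */
            lexeme[i] = '\0';
            return TOK_ID;
        }

        if (isdigit(c)) {             /* integer constant: digit digit* */
            do {
                if (i < (int)sizeof lexeme - 1)
                    lexeme[i++] = c;
                c = getchar();
            } while (isdigit(c));
            ungetc(c, stdin);
            lexeme[i] = '\0';
            return TOK_NUM;
        }

        lexeme[0] = c;                /* any other single character */
        lexeme[1] = '\0';
        return TOK_OTHER;
    }

    int main(void)
    {
        enum token t;
        while ((t = get_token()) != TOK_EOF)
            printf("%d\t%s\n", (int)t, lexeme);
        return 0;
    }

This is exactly the construction that lex automates: given the regular expressions for the tokens, it produces C code for the corresponding finite state machine.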

Topics

  1. Using regular expressions to describe patterns
  2. Recognizing patterns described by regular expressions
  3. Writing a scanner
  4. Limitations of regular expressions
  5. The awk language

References and programs

Exercises

Created: Jan. 15, 2006 by jjohnson AT cs DOT drexel DOT edu