Lecture 3: Lexical Analysis and Scanning
- Sec. 4.1 of the text (tokens and scanning)
Regular expressions provide a convenient mechanism for formally specifying
patterns. The patterns can be used for search and consequently regular
expressions are found in most editors. There are also tools for finding
patterns and editing strings matching patterns, such as grep and sed, that
make use of regular expressions. Some scripting languages, such as awk
and perl, also incorporate regular expressions. These languages perform
specified actions when text matching a regular expression is found.
Regular expressions are an important tool for describing the tokens of a
programming language: identifiers, keywords, constants, and separators.
Tokens are the basic symbols that are put together to form the syntax
of a language. Processing the tokens separately from determining whether
they are put together properly (syntax) simplifies the task of checking
syntax. For example, white space and comments can be removed, identifiers
and numeric constants processed, and the level of abstraction raised, so
that the syntax checker deals with tokens rather than individual characters.
In this lecture we show how to build a scanner for processing tokens.
Writing a scanner by hand amounts to encoding a finite state machine
that recognizes the language described by a regular expression. Alternatively,
a scanner can be derived automatically from a description of the allowable
tokens. We use the scanner generator lex as an example.
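To make the connection between a hand-written scanner and a finite state machine concrete, here is a minimal sketch in C. It assumes a single token class, signed decimal integers (an optional minus sign followed by one or more digits). The names state, step, and is_integer are hypothetical and are not taken from the text or from Fig4-1.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States of a hypothetical FSM for an optional '-' followed by digits. */
enum state { START, SIGN, DIGITS, REJECT };

/* One transition: the next state for current state s and input character c. */
static enum state step(enum state s, int c)
{
    switch (s) {
    case START:
        if (c == '-') return SIGN;
        return isdigit(c) ? DIGITS : REJECT;
    case SIGN:                      /* a sign must be followed by a digit */
    case DIGITS:                    /* further digits keep us in DIGITS   */
        return isdigit(c) ? DIGITS : REJECT;
    default:
        return REJECT;              /* REJECT is a dead state */
    }
}

/* Return 1 if the whole string s is a signed integer, 0 otherwise. */
int is_integer(const char *s)
{
    enum state st = START;

    assert(s != NULL);
    for (; *s != '\0' && st != REJECT; s++)
        st = step(st, (unsigned char)*s);
    return st == DIGITS;            /* DIGITS is the only accepting state */
}
```

Each case of the switch corresponds to one state of the machine, and each test on c corresponds to one labelled edge. A scanner for several token classes would have more states, but the same shape.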
- Using regular expressions to describe patterns
- Extensions (grep, egrep)
- Numbers: [0-9]+, [1-9][0-9]*, [0-9]+|-[0-9]+, -?[0-9]+
- Lists: ()|(\(number,\)*number)
- Recognizing patterns described by regular expressions
- Finite state machine to recognize numbers and lists
- Reading lists
- Writing a scanner
- Hand-written scanner. See Fig4-1.c for the
scanner from Figure 4-1 of the text.
- Automatically generated scanners (lex, flex). See
Fig4-1.l for the flex input for the tokens
scanned by the scanner in Fig4-1.c.
- Limitations of regular expressions
- Syntactic structure is not readily apparent from a regular expression.
It may be better to use a grammar for some constructs that can be
described by regular expressions (e.g. lists).
- Not all patterns (syntax) can be described by regular expressions
(e.g. a^nb^n, higher order lists and arithmetic expressions).
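The list pattern from the outline, ()|(\(number,\)*number), can be recognized by a small finite state machine. The sketch below is one possible encoding in C. It assumes that number means [0-9]+ and that no white space is allowed. The names lstate, list_step, and is_list are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* L_OPEN: seen '('.  L_NUM: inside a number.  L_COMMA: seen ',' and
   waiting for a digit.  L_DONE: seen ')'.  L_REJECT: dead state. */
enum lstate { L_START, L_OPEN, L_NUM, L_COMMA, L_DONE, L_REJECT };

/* One transition of the list-recognizing machine. */
static enum lstate list_step(enum lstate s, int c)
{
    switch (s) {
    case L_START:
        return c == '(' ? L_OPEN : L_REJECT;
    case L_OPEN:                        /* empty list or first number */
        if (c == ')') return L_DONE;
        return isdigit(c) ? L_NUM : L_REJECT;
    case L_NUM:
        if (isdigit(c)) return L_NUM;
        if (c == ',') return L_COMMA;
        return c == ')' ? L_DONE : L_REJECT;
    case L_COMMA:                       /* a comma must be followed by a number */
        return isdigit(c) ? L_NUM : L_REJECT;
    default:
        return L_REJECT;                /* nothing may follow ')' */
    }
}

/* Return 1 if the whole string s is a list such as "()" or "(1,22,3)". */
int is_list(const char *s)
{
    enum lstate st = L_START;

    assert(s != NULL);
    for (; *s != '\0' && st != L_REJECT; s++)
        st = list_step(st, (unsigned char)*s);
    return st == L_DONE;
}
```

Note that is_list rejects "((1))". Accepting lists nested to arbitrary depth would require counting open parentheses, and a machine with a fixed number of states cannot do that. This is the same limitation that rules out a^nb^n.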
- The awk language
- See Lecture 1 notes from CS 265 for a brief introduction
to awk and other UNIX commands that use regular expressions.
- Here are some sample awk programs for finding and counting the number
of occurrences of the word "war". Try them on the sample file
wp.txt. Note that in order to capture all occurrences
of the word, you have to be careful with your regular expression. I have
provided a sequence of attempts culminating in something that captures
all of the occurrences. The program in countwar.awk does not count
multiple occurrences of "war" on a single line. This can be
fixed by first splitting the words on each line so that there is only one
word per line. This is done in split.awk.
Another approach uses awk's associative arrays to count the number of
occurrences of all words in a text file. This is done in
countwords.awk. Before using this
program it is a good idea to remove all punctuation. This can be done
using sed, a stream editor, which also uses regular expressions to match
patterns. The sed command in rp.sed uses the
substitute command, s, to remove all the listed punctuation. Note that the
g following the command forces all occurrences on each line to be
substituted. Another useful thing to do is to translate all upper case
letters to lower case so that the same word is not counted twice. Here
is the full UNIX command I used.
tr A-Z a-z < wp.txt | sed -f rp.sed | awk -f countwords.awk | sort -nr +1
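For comparison with the awk and sed pipeline, here is a rough C sketch of counting whole-word occurrences of a single word. It is not a translation of countwar.awk or countwords.awk. It assumes that a word is a maximal run of letters and that case is ignored. The name count_word is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count whole-word, case-insensitive occurrences of word in text.
   Assumption of this sketch: a word is a maximal run of letters, so
   "warm" and "postwar" do not count as occurrences of "war". */
int count_word(const char *text, const char *word)
{
    size_t wlen;
    int count = 0;

    assert(text != NULL && word != NULL);
    wlen = strlen(word);
    while (*text != '\0') {
        size_t len = 0, i;

        /* Skip punctuation, digits, and white space between words. */
        while (*text != '\0' && !isalpha((unsigned char)*text))
            text++;
        /* Measure the run of letters that starts here. */
        while (isalpha((unsigned char)text[len]))
            len++;
        /* Compare only when the lengths agree, ignoring case. */
        if (len > 0 && len == wlen) {
            for (i = 0; i < len; i++)
                if (tolower((unsigned char)text[i]) !=
                    tolower((unsigned char)word[i]))
                    break;
            if (i == len)
                count++;
        }
        text += len;
    }
    return count;
}
```

Comparing whole runs of letters plays the role of the carefully written regular expression: it prevents a match inside a longer word, and it counts every occurrence on a line rather than at most one.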
References and programs
- Chapter 10 of Alfred V. Aho and Jeffrey D. Ullman,
Foundations of Computer Science - C Edition, W. H. Freeman and Company, 1995 (text for CS 270).
- John R. Levine, Tony Mason, and Doug Brown,
lex & yacc, 2nd Edition,
O'Reilly & Associates, Inc.
- man pages and info for grep, egrep, flex, and awk.
- Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger,
The AWK Programming Language, Addison-Wesley, 1988.
- Gawk - GNU Project
- Wikipedia entry on AWK
Created: Jan. 15, 2006 by jjohnson AT cs DOT drexel DOT edu
- Show a regular expression matching C floating point numbers.
- Construct a FSM to recognize C floating point numbers.
- Write a function to read C floating point numbers.
- Write flex specifications to describe C floating point numbers
and use flex to generate a scanner to read C floating point numbers.
- Assignment 1