Section 2: Finite Automata and JLex

CS 164 Programming Language and Compilers
Spring 2014
Section 2: Finite Automata and JLex
GSI: Peter Xi (2/5) and Wontae Choi (2/6)
• Announcement
– WA2 and PA3 are out next Monday (2/10).
– WA1 due next Monday (2/10) at 10:30am.
– PA2 due next Tuesday (2/11) at 11:59pm.
• Outline
1. Understanding finite automata.
2. Introduction to JLex.
3. WA1 and PA2 QA.
1
Finite Automata
Finite automata1 can be used to implement regular languages. Finite automaton is defined as tuple
(Σ, S, n, F, δ) where Σ is the alphabet, S is the set of state, n ∈ S is the start state, F ⊆ S is the
set of accepting states, and δ : S × S is the transition relation.
Theorem 1. Theorem: A language is regular if and only if it is accepted by a finite automaton.
1. Miscellaneous questions:
• How does NFA differ from a DFA?
• When does NFA and DFA accept its input?
• Which is more powerful?
• Why would you choose one over the other?
2. What language is accepted by the following DFA?
Answer: Binary strings of even length.
3. What language is accepted by the following NFA?
Answer:Binary strings ending in 01 or 010
1 Automata
is plural. The singular form is automaton
1
4. Construct DFA and NFA for binary strings with different first and last bits.
Answer:
0,1
0
1
1
0
1
1
0
0
0
1
1
0
0,1
1
NFA
0
DFA
5. Convert regular expression a(bb|cc)∗ to a NFA using Thompson construction.
6. Convert the NFA from Exercise 5 to a DFA using subset construction.
Answer:
a
0
3
E
E
1
b
b
5
7
E
E
2
E
4
c
9
c
6
8
E
10
E
E
'E' is an epsilon
Thompson construction
{5}
b
{1,2,7,9,10}
b
b
c
a
{0}
b
{1,2,3,4}
b
b
{6}
{1,2,8,9,10}
Subset construction
7. (Challenge) :
(a) Construct NFA for decimal numbers divisible by 3. (Leading zeroes are okay.)
2
(b) Construct NFA for binary numbers divisible by 5. (Leading zeros are okay.)
(c) Write a regular expression for the language in each of Exercise (a) and (b).
Answer: Very complex. See http://www.andrew.cmu.edu/user/ko/pdfs/lecture-5.
pdf for one algorithm to convert NFAs to regular expressions.
2
Introduction to JLex
JLex is an automatic tool that will spit out Java code for a scanner (lexer) for you, given a bunch
of regular expressions. JLex take regular expressions, convert them interally to a DFA, and then
emit code for that DFA.
1. JLex syntax for regular expressions:
a
"a"
\a
.
a*
a+
a?
a|b
(a)
^a
a$
[ab]
[a-z]
[^a]
a{m,n}
{xx}
literal a.
an "a", even if a is an operator.
an \a, even if a is an operator.
all character not equal to \n (new line).
0 or more a’s.
1 or more a’s.
0 or 1 a’s.
an a or b.
an a.
an a at the beginning of a line.
an a at the end of a line.
a or b.
all characters between a and z.
all characters not equal to a.
m through n occurrences of a.
the translation of xx from the definitions section.
3
2. JLex input format:
[headers for lexer, such as <class def> and <import>]
%%
[JLex directives]
%%
[lexer rules]
3. JLex example: toy.lex
class Toy{
public static void main(String argv[]) throws java.io.IOException{
Yylex yy = new Yylex(System.in);
yy.yylex();
}
}
class Yytoken{ /* dummy */ }
%%
%{
private int count = 0;
private void out(String message){
System.out.println(message);
count++;
}
%}
DIGIT
ID
[0-9]
[a-z][a-z0-9]*
%state COMMENT
%%
<YYINITIAL>
<YYINITIAL>
<YYINITIAL>
<YYINITIAL>
<YYINITIAL>
<YYINITIAL>
<YYINITIAL>
exit
"/*"
{DIGIT}+
{DIGIT}+"."{DIGIT}*
{ID}
"+"|"-"|"*"|"/"
[ \t\n]+
{
{
{
{
{
{
{
return new Yytoken(); }
yybegin(COMMENT); }
out("int: " + yytext()); }
out("float: " + yytext()); }
out("identifier: " + yytext()); }
out("operator: " + yytext()); }
/* eat up whitespace */ }
<YYINITIAL> .
{ out("unrecognized character: " + yytext()); }
<COMMENT> [^*]*
<COMMENT> "*"+[^*/]*
<COMMENT> "*"+"/"
{ /* skip */ }
{ /* skip */ }
{ yybegin(YYINITIAL); }
This lexer works! Give it a try.
4
% jlex -o toy.java toy.lex
% javac toy.java
% java Toy
Some notes:
• The funny COMMENT section defines a start condition–that is, a special context for defining a
separate set of rules. Here, we switch into the COMMENT context when consuming comment
text.
• Be careful! Your lexer will need to handle nested comments.
• Skim the Jlex manual for more detail on all of these features.
4. Questions:
(a) How do we handle ambiguity?
(b) How do we handle errors?
Further Readings
• Finite automata
– Dragon Book (Compilers: Principles, Techniques, and Tools), Ch. 3.6-3.7
– Regular Expression Matching Can Be Simple and Fast (http://swtch.com/~rsc/regexp/
regexp1.html)
• Lexical analyzer generators
– Jlex manual (http://www.cs.princeton.edu/~appel/modern/java/JLex/)
– The original Lex23 and Flex45 online manual (http://dinosaur.compilertools.net/)
– RE2C : a very fast lexical analyzer generator. (http://re2c.org)
2 Eric
Schmidt (Google) developed Lex
is a Java version of Lex.
4 Vern Paxon (UC Berkeley Professor, Computer Security) developed Flex
5 Jflex is a java version of Flex.
3 Jlex
5