Presentation Transcript
JFlex: Basically, a lexer is a finite-state “transducer” plus bells and whistles
Arbitrary Java code can be attached to state transitions as actions
Specify the “transducer” in a .flex file; JFlex compiles it into a .java file
By default, JFlex gives you convenience methods to access results of state transitions
A simple task: Charniak’s statistical parser takes input sentences delimited by <s> ... </s>
Suppose we want to take a Reader over such input and get back a Tokenizer over the tokens, which returns Word objects, plus a special end-of-sentence marker
garbage garbage garbage
<s> Stocks skyrocketed on news that
investigation of Cheney ’s energy
taskforce was dropped . </s> more garbage
edu.stanford.nlp.process.AbstractTokenizer:
...
/**
 * Internally fetches the next token.
 *
 * @return the next token in the token stream, or null if none exists.
*/
protected abstract Object getNext();
...
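A minimal sketch of how a base class could build hasNext() and next() on top of getNext(). SketchTokenizer and ArrayTokenizer are hypothetical stand-ins written for illustration; this is not the actual Stanford source, only one plausible way to use the "null means no more tokens" contract.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a getNext()-based tokenizer base class. */
abstract class SketchTokenizer implements Iterator<Object> {
    private Object lookahead;        // token fetched but not yet handed out
    private boolean primed = false;  // true once lookahead holds a fetched value

    /** Subclasses return the next token, or null when the stream is exhausted. */
    protected abstract Object getNext();

    @Override
    public boolean hasNext() {
        if (!primed) {
            lookahead = getNext();
            primed = true;
        }
        return lookahead != null;
    }

    @Override
    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        primed = false;
        return lookahead;
    }
}

/** Toy subclass (assumption, for illustration) serving tokens from a fixed list. */
class ArrayTokenizer extends SketchTokenizer {
    private final Iterator<String> source;

    ArrayTokenizer(String... tokens) {
        this.source = Arrays.asList(tokens).iterator();
    }

    @Override
    protected Object getNext() {
        return source.hasNext() ? source.next() : null;
    }
}
```

With this shape, a subclass only has to supply getNext(); a generated lexer can do that by delegating to yylex().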
Lexical Rules: Basically you’re specifying a finite-state automaton* with actions associated with state transitions
*though not strictly limited by FSA expressivity
Schematic .flex file:
{user code}
%%
{options and declarations}
%%
{lexical rules}
Lexical Rules (schematic):
<YYINITIAL> {
  {BeginSentence} { yybegin(SENTENCE);
                    return yylex(); }
  {WhiteSpace}    { /* ignore */ return yylex(); }
  .               { /* ignore */ return yylex(); }
}
<SENTENCE> {
  {EndSentence}   { yybegin(YYINITIAL);
                    return SENTENCE_BOUNDARY; }
  {Token}         { return new Word(yytext()); }
  {WhiteSpace}    { /* ignore */ return yylex(); }
}
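The two-state behaviour of these rules can be sketched by hand in plain Java. This is not what JFlex generates; it is a simplified illustration that assumes the delimiters are the literal strings <s> and </s> and that they are separated from other text by whitespace. The class name TwoStateScanner and the use of plain strings instead of Word objects are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class TwoStateScanner {
    enum State { YYINITIAL, SENTENCE }

    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    /** Returns the tokens inside each sentence, with a boundary marker after each one. */
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        State state = State.YYINITIAL;
        // Splitting on whitespace stands in for the whitespace-ignoring rules.
        for (String chunk : input.trim().split("[ \\t\\r\\n\\f]+")) {
            switch (state) {
                case YYINITIAL:
                    if (chunk.equals("<s>")) {
                        state = State.SENTENCE;   // like yybegin(SENTENCE)
                    }
                    // anything else outside a sentence is ignored
                    break;
                case SENTENCE:
                    if (chunk.equals("</s>")) {
                        out.add(SENTENCE_BOUNDARY);
                        state = State.YYINITIAL;  // like yybegin(YYINITIAL)
                    } else {
                        out.add(chunk);           // like new Word(yytext())
                    }
                    break;
            }
        }
        return out;
    }
}
```

The current state decides which rules apply, and an action may switch the state; that is all the state blocks in the .flex file express.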
Lexical Rules (detail):
<YYINITIAL> {
  {BeginSentence} / .* { yybegin(SENTENCE);
                         return yylex(); }
  ...
}
<SENTENCE> {
  {EndSentence} / .*   { yybegin(YYINITIAL);
                         return SENTENCE_BOUNDARY; }
  {Token}              { return new Word(yytext()); }
  ...
}
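In a JFlex rule, r1 / r2 matches r1 only when it is followed by r2, and the text matched by r2 is not consumed. A rough Java analogue is a lookahead group in java.util.regex. The sketch below assumes the begin delimiter is the literal <s> and uses a whitespace lookahead purely for illustration (the slides use .* as the trailing context); TrailingContextDemo is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingContextDemo {
    /** Matches "<s>" only when a whitespace character follows; the whitespace is not consumed. */
    static final Pattern BEGIN = Pattern.compile("<s>(?=[ \\t\\r\\n\\f])");

    /** Returns the text matched at the start of input, or null if the pattern does not match there. */
    static String matchAtStart(String input) {
        Matcher m = BEGIN.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

The matched text excludes the lookahead, in the same way that yytext() excludes the trailing context.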
Options and declarations: States and Macros:
Macros can be used to define other macros
Order of macro definition is irrelevant
%state SENTENCE
SentenceLetter = s
BeginSentence = "<" {SentenceLetter} ">"
EndSentence = "</" {SentenceLetter} ">"
WhiteSpace = [ \t\r\n\f]
Token = [^ \t\r\n\f]+
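For readers more at home with java.util.regex, the macros can be mimicked with string constants. This is only a rough analogue: JFlex parses macros as regular expressions rather than pasting text. The delimiter definitions below assume the sentence markers are <s> and </s>; MacroDemo and its member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroDemo {
    // Counterparts of the WhiteSpace and Token macros.
    static final String WHITE_SPACE = "[ \\t\\r\\n\\f]";
    static final String TOKEN = "[^ \\t\\r\\n\\f]+";

    // "Macros defined from other macros" (assumed delimiter shapes).
    static final String SENTENCE_LETTER = "s";
    static final String BEGIN_SENTENCE = "<" + SENTENCE_LETTER + ">";
    static final String END_SENTENCE = "</" + SENTENCE_LETTER + ">";

    /** Returns every maximal run of non-whitespace characters in input. */
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(TOKEN).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

Note that Token also matches the delimiters themselves, so in the lexer the delimiter rule has to be listed before the Token rule to win on equal-length matches.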
Other options and declarations:
%class CharniakTokenizer
%implements Tokenizer
%extends AbstractTokenizer
%unicode
%type Object
%eofval{
return null;
%eofval}
Options & declarations: class-internal code (1):
%{
  static final Word SENTENCE_BOUNDARY =
    new Word("SENTENCE_BOUNDARY");

  public Object getNext() {
    try {
      Object o = yylex();
      return o;
    }
    catch (IOException e) {
      return null;
    }
  }
  ...
%}
Options & declarations: class-internal code (2):
%{
  ...
  public static void main(String[] args) throws IOException {
    Reader r = new FileReader(args[0]);
    Tokenizer t = new CharniakTokenizer(r);
    while (t.hasNext()) {
      System.out.println(t.next());
    }
  }
%}
User Code inserted directly into the file:
package rog;
import java.util.*;
import java.io.*;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.*;
/** A lexer for Charniak input sentences
* @author Roger Levy
*/
Beyond FSA expressivity:
%class ParenCounter
%{
  private int numParens = 0;
%}
...
%%
...
<YYINITIAL> {
  \( { numParens++; return yytext(); }
  \) { if (numParens == 0)
         throw new RuntimeException("error - too many close parens!");
       else {
         numParens--;
         return yytext();
       }
     }
}
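The counter is what takes this past finite-state power: a finite automaton cannot track an unbounded nesting depth, but an int field updated in the actions can. The same logic as a hand-written Java sketch, where ParenCounterSketch, scan and openCount are hypothetical names and every character other than a parenthesis is simply skipped (the slide elides the remaining rules):

```java
import java.util.ArrayList;
import java.util.List;

class ParenCounterSketch {
    private int numParens = 0;  // number of currently open parentheses

    /** Returns the parentheses found in input, in order; throws on an unmatched ')'. */
    List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == '(') {
                numParens++;
                out.add("(");
            } else if (c == ')') {
                if (numParens == 0) {
                    throw new RuntimeException("error - too many close parens!");
                }
                numParens--;
                out.add(")");
            }
            // all other characters are skipped in this sketch
        }
        return out;
    }

    /** Number of parentheses still open after scanning. */
    int openCount() {
        return numParens;
    }
}
```

A caller could check openCount() at end of input to detect unclosed parentheses, which the rules shown do not do.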