JFlex Tutorial

Added: November 27, 2007
Presentation Transcript

JFlex:
- Basically, a lexer is a finite-state “transducer” plus bells and whistles.
- Arbitrary Java code can be attached as actions to state transitions.
- Specify the “transducer” in a .flex file; JFlex compiles it into a .java file.
- By default, JFlex gives you convenience methods to access the results of state transitions.


A simple task:
Charniak’s statistical parser takes input sentences delimited by <s> ... </s>. Suppose we want to take a Reader over such input and get back a Tokenizer over the tokens, which returns Word objects plus a special end-of-sentence marker.

    garbage garbage garbage <s> Stocks skyrocketed on news that investigation of Cheney ’s energy taskforce was dropped . </s> more garbage


edu.stanford.nlp.process.AbstractTokenizer:

    ...
    /**
     * Internally fetches the next token.
     *
     * @return the next token in the token
     *         stream, or null if none exists.
     */
    protected abstract Object getNext();
    ...
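The getNext() contract above (return the next token, or null when the stream is exhausted) is enough to build hasNext() and next() with one token of lookahead. The following is a minimal sketch of that idea, not the actual Stanford source; the class name LookaheadTokenizerSketch and the helper over(...) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only: builds hasNext()/next() on top of a null-terminated getNext(). */
abstract class LookaheadTokenizerSketch implements Iterator<Object> {
    // Buffered token; null means "nothing buffered" (not yet fetched, or stream exhausted).
    private Object nextToken;

    /** Returns the next token, or null if none exists (same contract as on the slide). */
    protected abstract Object getNext();

    @Override
    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();   // fetch one token ahead
        }
        return nextToken != null;
    }

    @Override
    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Object result = nextToken;
        nextToken = null;            // consume the buffered token
        return result;
    }

    /** Hypothetical helper: a tokenizer over a fixed list, standing in for a real lexer. */
    static LookaheadTokenizerSketch over(Object... tokens) {
        Iterator<Object> source = Arrays.asList(tokens).iterator();
        return new LookaheadTokenizerSketch() {
            @Override
            protected Object getNext() {
                return source.hasNext() ? source.next() : null;
            }
        };
    }
}
```

With this shape, a lexer-backed subclass only has to implement getNext(), which is the role the generated yylex() method plays later in the slides.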


Lexical Rules:
Basically, you’re specifying a finite-state automaton* with actions associated with state transitions.

*Though not strictly limited by FSA expressivity.


Schematic .flex file:

    {user code}
    %%
    {options and declarations}
    %%
    {lexical rules}


Lexical Rules (schematic):

    <YYINITIAL> {
      {BeginSentence}  { yybegin(SENTENCE); return yylex(); }
      {WhiteSpace}     { /* ignore */ return yylex(); }
      .                { /* ignore */ return yylex(); }
    }

    <SENTENCE> {
      {EndSentence}    { yybegin(YYINITIAL); return SENTENCE_BOUNDARY; }
      {Token}          { return new Word(yytext()); }
      {Space}          { /* ignore */ return yylex(); }
    }


Lexical Rules (detail):

    <YYINITIAL> {
      {BeginSentence} / .*  { yybegin(SENTENCE); return yylex(); }
      ...
    }

    <SENTENCE> {
      {EndSentence} / .*    { yybegin(YYINITIAL); return SENTENCE_BOUNDARY; }
      {Token}               { return new Word(yytext()); }
      ...
    }
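To make the two lexical states concrete, here is a hand-written Java analogue of the rules above. It is only a sketch under stated assumptions: it assumes sentences are wrapped in <s> and </s>, it splits on whitespace instead of running a generated DFA, and it returns plain strings rather than Word objects. The class name SentenceStateSketch and the method tokenize are hypothetical; this is not what JFlex generates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written analogue of the YYINITIAL / SENTENCE rules; not JFlex output. */
class SentenceStateSketch {
    /** Plays the role of the two lexical states. */
    enum State { YYINITIAL, SENTENCE }

    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    /** Returns the tokens inside each <s> ... </s> span, followed by a boundary marker. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        State state = State.YYINITIAL;
        for (String chunk : input.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;                          // empty input yields one empty chunk
            }
            switch (state) {
                case YYINITIAL:
                    if (chunk.equals("<s>")) {
                        state = State.SENTENCE;    // like yybegin(SENTENCE)
                    }
                    // everything else outside a sentence is ignored
                    break;
                case SENTENCE:
                    if (chunk.equals("</s>")) {
                        out.add(SENTENCE_BOUNDARY);
                        state = State.YYINITIAL;   // like yybegin(YYINITIAL)
                    } else {
                        out.add(chunk);            // like return new Word(yytext())
                    }
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("garbage <s> Stocks skyrocketed . </s> more garbage"));
        // prints [Stocks, skyrocketed, ., SENTENCE_BOUNDARY]
    }
}
```

The point of the sketch is the control flow: the current state decides which patterns are active, and an action may switch state before the next token is read.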


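Options and declarations: States and Macros:
- Macros can be used to define other macros.
- Order of macro definition is irrelevant.

    %state SENTENCE

    SentenceLetter = s
    /* The right-hand sides of the next two macros were lost in extraction;
       reconstructed here assuming <s> ... </s> delimiters. */
    BeginSentence  = "<" {SentenceLetter} ">"
    EndSentence    = "</" {SentenceLetter} ">"
    WhiteSpace     = [ \t\r\n\f]
    Token          = [^ \t\r\n\f]+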


Other options and declarations:

    %class CharniakTokenizer
    %implements Tokenizer
    %extends AbstractTokenizer
    %unicode
    %type Object
    %eofval{
      return null;
    %eofval}


Options & declarations: class-internal code (1):

    %{
      static final Word SENTENCE_BOUNDARY = new Word("SENTENCE_BOUNDARY");

      public Object getNext() {
        try {
          Object o = yylex();
          return o;
        } catch (IOException e) {
          return null;
        }
      }
      ...
    %}


Options & declarations: class-internal code (2):

    %{
      ...
      public static void main(String[] args) throws IOException {
        Reader r = new FileReader(args[0]);
        Tokenizer t = new CharniakTokenizer(r);
        while (t.hasNext()) {
          System.out.println(t.next());
        }
      }
    %}


User Code inserted directly into the file:

    package rog;

    import java.util.*;
    import java.io.*;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.process.*;

    /** A lexer for Charniak input sentences
     *  @author Roger Levy
     */


Beyond FSA expressivity:

    %class ParenCounter
    %{
      private int numParens = 0;
    %}
    ...
    %%
    ...
    <YYINITIAL> {
      \(  { numParens++; return yytext(); }
      \)  { if (numParens == 0)
              throw new RuntimeException("error – too many close parens!");
            else { numParens--; return yytext(); } }
    }
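The counter is what takes this lexer past finite-state power: a plain automaton cannot track arbitrarily deep nesting, but an integer field updated by the actions can. The same logic can be sketched in standalone Java, outside any generated lexer. The class name ParenCounterSketch and the helper isBalanced are hypothetical, and the leftover-open-paren check at the end is an addition that the slide does not show.

```java
/** Standalone analogue of the ParenCounter actions; illustrative only. */
class ParenCounterSketch {
    private int numParens = 0;

    /** Mirrors the two actions: count up on '(', count down (or fail) on ')'. */
    void accept(char c) {
        if (c == '(') {
            numParens++;
        } else if (c == ')') {
            if (numParens == 0) {
                throw new RuntimeException("error: too many close parens!");
            }
            numParens--;
        }
        // all other characters leave the counter unchanged
    }

    /** Returns true if every ')' has a matching '(' and no '(' is left open. */
    static boolean isBalanced(String text) {
        ParenCounterSketch counter = new ParenCounterSketch();
        try {
            for (char c : text.toCharArray()) {
                counter.accept(c);
            }
        } catch (RuntimeException e) {
            return false;              // a ')' arrived with nothing open
        }
        return counter.numParens == 0; // assumption: unclosed '(' also counts as unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a (b) c)")); // prints true
        System.out.println(isBalanced("(a))"));      // prints false
    }
}
```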