jflex tutorial


JFlex: 

Basically, a lexer is a finite-state “transducer” plus bells and whistles.
Arbitrary Java code (actions) can be associated with state transitions.
Specify the “transducer” in a .flex file; JFlex compiles it into a .java file.
By default, JFlex gives you convenience methods (such as yytext()) to access the results of state transitions.

A simple task: 

Charniak’s statistical parser takes input sentences delimited by <s>...</s>.
Suppose we want to take a Reader over such input and get back a Tokenizer over the tokens, which returns Word objects plus a special end-of-sentence token.

Example input:

    garbage garbage garbage <s>Stocks skyrocketed on news that investigation of Cheney ’s energy taskforce was dropped . </s>more garbage

edu.stanford.nlp.process.AbstractTokenizer: 

    ...
    /**
     * Internally fetches the next token.
     *
     * @return the next token in the token
     * stream, or null if none exists.
     */
    protected abstract Object getNext();
    ...

Lexical Rules: 

Basically, you’re specifying a finite-state automaton* with actions associated with state transitions.

* Though not strictly limited by FSA expressivity.

Schematic .flex file: 

    {user code}
    %%
    {options and declarations}
    %%
    {lexical rules}

Lexical Rules (schematic): 

    <YYINITIAL> {
      {BeginSentence}  { yybegin(SENTENCE); return yylex(); }
      {WhiteSpace}     { /* ignore */ return yylex(); }
      .                { /* ignore */ return yylex(); }
    }
    <SENTENCE> {
      {EndSentence}    { yybegin(YYINITIAL); return SENTENCE_BOUNDARY; }
      {Token}          { return new Word(yytext()); }
      {WhiteSpace}     { /* ignore */ return yylex(); }
    }

Lexical Rules (detail): 

    <YYINITIAL> {
      {BeginSentence} / .*  { yybegin(SENTENCE); return yylex(); }
      ...
    }
    <SENTENCE> {
      {EndSentence} / .*    { yybegin(YYINITIAL); return SENTENCE_BOUNDARY; }
      {Token}               { return new Word(yytext()); }
      ...
    }
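To make the role of the two lexical states concrete, here is a hand-written Java sketch of roughly what the rules above do. It is not the code JFlex generates. The class name TwoStateSketch, the plain String tokens standing in for Word objects, and the string value used for SENTENCE_BOUNDARY are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written illustration of the YYINITIAL / SENTENCE state switch (not JFlex output). */
final class TwoStateSketch {
    enum State { YYINITIAL, SENTENCE }

    // Assumption: a plain String stands in for the Word-valued boundary marker.
    static final String SENTENCE_BOUNDARY = "SENTENCE_BOUNDARY";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < input.length()) {
            if (state == State.YYINITIAL) {
                if (input.startsWith("<s>", i)) {        // {BeginSentence}
                    state = State.SENTENCE;              // yybegin(SENTENCE)
                    i += 3;
                } else {
                    i++;                                 // whitespace or any other character: ignore
                }
            } else {
                if (input.startsWith("</s>", i)) {       // {EndSentence}
                    state = State.YYINITIAL;             // yybegin(YYINITIAL)
                    tokens.add(SENTENCE_BOUNDARY);
                    i += 4;
                } else if (isSpace(input.charAt(i))) {   // {WhiteSpace}: ignore
                    i++;
                } else {                                 // {Token}: maximal run of non-whitespace
                    int start = i;
                    while (i < input.length() && !isSpace(input.charAt(i))) {
                        i++;
                    }
                    tokens.add(input.substring(start, i));
                }
            }
        }
        return tokens;
    }

    static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f';
    }

    public static void main(String[] args) {
        String input = "garbage <s>Stocks skyrocketed . </s>more garbage";
        // Text outside <s>...</s> is dropped; the sentence end becomes a marker token.
        System.out.println(tokenize(input));
    }
}
```

The same character is handled differently depending on the current state, which is exactly what the <YYINITIAL> and <SENTENCE> blocks express declaratively.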

Options and declarations: States and Macros: 

Macros can be used to define other macros.
The order of macro definitions is irrelevant.

    %state SENTENCE

    SentenceLetter = s
    BeginSentence = <{SentenceLetter}>
    EndSentence = <\/{SentenceLetter}>
    WhiteSpace = [ \t\r\n\f]
    Token = [^ \t\r\n\f]+
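As a loose analogy for how one macro is built from another, the sketch below composes the same patterns as Java strings and checks them with java.util.regex. JFlex's regular-expression syntax is not identical to Java's, and the class name MacroSketch and its helper method are hypothetical, so treat this only as an illustration of the composition idea.

```java
import java.util.regex.Pattern;

/** Analogy only: JFlex macros rebuilt as Java regex strings (syntax differs from JFlex). */
final class MacroSketch {
    // Unlike JFlex macros, Java fields must be declared before they are used here.
    static final String SENTENCE_LETTER = "s";
    static final String BEGIN_SENTENCE = "<" + SENTENCE_LETTER + ">";   // <{SentenceLetter}>
    static final String END_SENTENCE = "</" + SENTENCE_LETTER + ">";    // <\/{SentenceLetter}>
    static final String WHITE_SPACE = "[ \\t\\r\\n\\f]";
    static final String TOKEN = "[^ \\t\\r\\n\\f]+";

    /** True if the whole text matches the given regex. */
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches(BEGIN_SENTENCE, "<s>"));
        System.out.println(matches(END_SENTENCE, "</s>"));
        System.out.println(matches(TOKEN, "Stocks"));
        System.out.println(matches(TOKEN, "two words"));
    }
}
```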

Other options and declarations: 

    %class CharniakTokenizer
    %implements Tokenizer
    %extends AbstractTokenizer
    %unicode
    %type Object
    %eofval{
      return null;
    %eofval}

Options & declarations: class-internal code (1): 

    %{
      static final Word SENTENCE_BOUNDARY = new Word("SENTENCE_BOUNDARY");

      public Object getNext() {
        try {
          Object o = yylex();
          return o;
        } catch (IOException e) {
          return null;
        }
      }
      ...
    %}
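The generated lexer only has to supply getNext(); the hasNext() and next() methods come from the base class. The sketch below shows one way a base class could build them on top of getNext() with a one-token lookahead buffer. It is a hypothetical stand-in, not the Stanford AbstractTokenizer source, and the names LookaheadTokenizerSketch and over(...) are invented for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a base class like AbstractTokenizer; not the Stanford source. */
abstract class LookaheadTokenizerSketch implements Iterator<Object> {
    private Object nextToken;        // one-token lookahead buffer
    private boolean primed = false;  // true once nextToken holds the result of getNext()

    /** Subclasses return the next token, or null when the stream is exhausted. */
    protected abstract Object getNext();

    private void prime() {
        if (!primed) {
            nextToken = getNext();
            primed = true;
        }
    }

    @Override
    public boolean hasNext() {
        prime();
        return nextToken != null;
    }

    @Override
    public Object next() {
        prime();
        if (nextToken == null) {
            throw new NoSuchElementException();
        }
        Object result = nextToken;
        nextToken = null;
        primed = false;
        return result;
    }

    /** Demo subclass that serves tokens from a fixed array instead of calling yylex(). */
    static LookaheadTokenizerSketch over(Object... tokens) {
        Iterator<Object> source = Arrays.asList(tokens).iterator();
        return new LookaheadTokenizerSketch() {
            @Override
            protected Object getNext() {
                return source.hasNext() ? source.next() : null;
            }
        };
    }

    public static void main(String[] args) {
        LookaheadTokenizerSketch t = over("Stocks", "skyrocketed", "SENTENCE_BOUNDARY");
        while (t.hasNext()) {
            System.out.println(t.next());
        }
    }
}
```

Under this design, returning null from getNext() (as the %eofval block and the IOException handler both do) is what signals the end of the token stream.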

Options & declarations: class-internal code (2): 

    %{
      ...
      public static void main(String[] args) throws IOException {
        Reader r = new FileReader(args[0]);
        Tokenizer t = new CharniakTokenizer(r);
        while (t.hasNext()) {
          System.out.println(t.next());
        }
      }
    %}

User Code inserted directly into the file: 

    package rog;

    import java.util.*;
    import java.io.*;
    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.process.*;

    /** A lexer for Charniak input sentences
     * @author Roger Levy
     */

Beyond FSA expressivity: 

    %class ParenCounter
    %{
      private int numParens = 0;
    %}
    ...
    %%
    ...
    <YYINITIAL> {
      \(  { numParens++; return yytext(); }
      \)  { if (numParens == 0)
              throw new RuntimeException("error – too many close parens!");
            else { numParens--; return yytext(); }
          }
    }
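The counter is what takes this lexer beyond a finite-state automaton: a fixed set of states cannot track arbitrarily deep nesting, but an int field updated in the actions can. The plain-Java sketch below isolates that counting logic outside of JFlex. The class name ParenCounterSketch and the method countOpen are hypothetical, and returning the number of still-open parentheses is an addition for demonstration purposes.

```java
/** Plain-Java illustration of the counting logic in the actions above (not JFlex output). */
final class ParenCounterSketch {

    /**
     * Scans the input, mirroring the two actions: increment on '(' and
     * decrement on ')', failing if a ')' arrives with nothing open.
     * Returns how many parentheses are still open at the end of the input.
     */
    static int countOpen(String input) {
        int numParens = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                numParens++;
            } else if (c == ')') {
                if (numParens == 0) {
                    throw new RuntimeException("error - too many close parens!");
                }
                numParens--;
            }
            // All other characters are ignored in this sketch.
        }
        return numParens;
    }

    public static void main(String[] args) {
        System.out.println(countOpen("(a (b) c)"));
        System.out.println(countOpen("(("));
    }
}
```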