Finite-State Methods in Natural Language Processing: Finite-State Methods in Natural Language Processing Lauri Karttunen
LSA 2005 Summer Institute
July 27, 2005
Course Outline: Course Outline July 18:
Intro to computational morphology
XFST
Readings
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)
July 20:
Regular expressions
More on XFST
Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”
Slide3: July 25
More on XFST: Date Parser
Concatenative morphotactics: The LEXC language
Readings
Chapter 4. “The LEXC Language”
July 27
Constraining non-local dependencies: Flag Diacritics
Complex morphotactics and alternations: Finnish Numerals
Readings
Chapter 5. “Flag Diacritics”
”
Slide4: August 1
Non-concatenative morphotactics
Reduplication, interdigitation
Realizational morphology
Readings
Chapter 8. “Non-Concatenative Morphotactics”
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
August 3
Optimality theory
Readings
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp.109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.
Syllabification revisited: Syllabification revisited define MarkNonDiphthongs [
[. .] -> "." || [HighV | MidV] _ LowV, # i.a, e.a
LowV _ MidV, # a.e
i _ [MidV - e], # i.o, i.ä
u _ [MidV - o], # u.e
y _ [MidV - ö], # y.e
$V i _ e, # poiki.en
V u _ o, #
$V y _ ö, #
$V [MidV | LowV] _ [u|y] C C|.#.]]; # oike.us
define Syllabify [ C* V+ C* @-> ... "." || _ C V ];
regex FinnWords .o. MarkNonDiphthongs .o. Syllabify;
Constraints: Constraints ge hund bon ne
mal eg
et in
eg
et o a ec j n MF%+ => _ ~$[%+Fem] %+Pl ; MF+ +Fem +Pl
Constraining by composition: Constraining by composition xfst[0]: read lexc _ ~$["+Fem"] "+Pl" ;
1.2 Kb, 2 states, 7 arcs, Circular
xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
xfst[1]: up gehundino
xfst[1]: *** Not accepted ***
Less words, bigger network.
Esperanto with Flags: Esperanto with Flags Multichar_Symbols
+Noun +Adj +Nsuff
+ASuff +Nize
+Pl +Sg +Acc MF+
+Aug +Dim +Fem Op+ Neg+
@U.MF.Yes@ @U.MF.No@
LEXICON Root
Nouns ;
Adjectives ;
LEXICON Nouns
NounRoots ;
@U.MF.Yes@ Ge ;
LEXICON Ge
MF+:ge NounRoots;
LEXICON NounRoots bird Nmf ;
hund Nmf ;
kat Nmf ;
LEXICON Nmf
+Noun:0 AugDimFem ;
LEXICON AugDimFem
@U.MF.No@ Fem ;
+Dim:et AugDimFem ;
+Aug:eg AugDimFem ;
Nend ;
Adjend ;
LEXICON Fem
+Fem:in AugDimFem ;
Constraining by flags: Constraining by flags xfst[0]: read lexc < esperanto-flags.lexc
xfst[1]: up gehundino
xfst[1]:
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
xfst[1]:
xfst[1]: set obey-flags off
variable obey-flags = off
xfst[1]: up gehundino
xfst[1]: MF+hund+Noun+Fem+NSuff+Sg
xfst[1]: set show-flags on
variable show-flags = on
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@
Flags in the sigma: Flags in the sigma xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r
t u v +ASuff +Acc +Adj +Aug +Dim +Fem +Nsuff
+Nize +Noun +Pl +Sg @U.MF.No@ @U.MF.Yes@
Size: 35
@U.MF.Yes@: UNIFY feature 'MF' with value 'Yes'
@U.MF.No@: UNIFY feature 'MF' with value 'No'
2 flag diacritics
Eliminating flags: Eliminating flags xfst[1]: eliminate flag MF
3.2 Kb. 61 states 89 arcs, Circular
Size: 35
xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u
v +ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize
+Noun +Pl +Sg
Size: 33
The eliminate flag command composes the network with constraint networks that have the same effect as the flag diacritics that are removed.
Flag Diacritics: Flag Diacritics
Special symbols for encoding features, that is, attribute-value pairs.
Checked at runtime to avoid the cost of compiling them into the structure of the network
If a check fails, the path is abandoned.
Attributes and Values: Attributes and Values Epsilon arcs with feature constraints. @U.Feature.Value@ @C.Feature@ Unify ‘Feature’ with ‘Value’ if possible. Set ‘Feature’ to the unspecified value.
Rules: Rules
There can be any number of attributes.
An attribute can have any number of values.
If the value of an attribute is unspecified, it unifies successfully with any given value and is set to that value.
If the value of an attribute is specified, it unifies only with the given value.
Actions: Unify, Positive Set: Actions: Unify, Positive Set @U.Feature.Value@ Unify Value with the current setting of Feature, if possible. Otherwise fail.
@P.Feature.Value@ Set Feature to Value regardless of the current
setting. Always succeeds.
More Actions: Negative Set, Clear: More Actions: Negative Set, Clear @N.Feature.Value@ Set Feature to the
complement of Value regardless of the current
setting. Always succeeds.
@C.Feature@ Make Feature be
unspecified. Always
succeeds.
More Actions: Require: More Actions: Require @R.Feature.Value@ Succeed in Feature is set
to Value. Otherwise fail.
@R.Feature@ Succeed if Feature has
been set to some value.
Otherwise fail.
More Actions: Equality: More Actions: Equality @E.Feature1.Feature2@ Succeed if Feature1
has the same value as
Feature2. Otherwise fail.
Eliminating flags: Eliminating flags The constraints on "@U.FEATURE.VALUE@" have the form
~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]
Constraint for eliminating @U.MF.No@:
~[?* ["@U.MF.Yes@"] # prohibit
~$["@P.MF.No@" | ”@C.MF@”] # allow
"@U.MF.No@"
?*]
Finnish Numerals: Finnish Numerals
Numbers and Numerals: Numbers and Numerals The mapping from integers 0, 1, 2, 3 … to the corresponding numerals one, two, three… is a regular relation.
Some languages have a very simple numeral system, some are more complicated:
seventy-three, soixante-treize, drei-und-sibzig
We can compile transducers that map between the numbers and the corresponding numerals.
Number-to-Numeral transducer: Number-to-Numeral transducer
The Goal Ahead: Finnish: The Goal Ahead: Finnish
Finnish Numerals: Finnish Numerals Compound numerals written as one word
2 • 1000 + 5 • 100 + 3 • 10 + 1 = 2531
kaksituhattaviisisataakolmekymmentäyksi Express ordinality, number, and case
sata+Sg+Nom (100) sata+Ord+Sg+Nom (100th)
sata sadas
sata+Sg+Gen (100) sata+Ord+Sg+Gen (100th)
sadan sadannen
sata+Pl+Gen (100) sata+Ord+Pl+Gen (100th)
satojen sadansien
Singular vs. Plural: Singular vs. Plural Numerals generally occur with singular nouns
kaksi+Sg+Gen kenkä+Sg+Gen
kahden kengän omistaja
(owner of two shoes) Sets and public events may be in plural
kaksi+Pl+Gen kenkä+Pl+Gen
kaksien kenkien omistaja
(owner of two pairs of shoes)
kolme+Ord+Pl+Nom olympialainen+Pl+Nom
kolmannet olympialaiset
(third olympic games)
yksi+Pl+Nom hää+Pl+Nom
yhdet häät
(one wedding)
Morphotactics: Morphotactics All parts of compound numerals agree in all respects
two thousand five hundred (2500)
kaksi+Sg+Gen tuhat+Sg+Gen viisi+Sg+Gen sata+Sg+Gen
kahden tuhannen viiden sadan
two ten eighth (28th)
kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
kahde ns i en kymmene ns i en kahdeksa ns i en
Singular nominative is exceptional: Singular nominative is exceptional Numeral with a noun
kaksi+Gen kenkä+Gen
kahden kengän (two shoes)
kaksi+Nom kenkä+Part
kaksi kenkää (two shoes) Compound numeral
kaksi+Gen tuhat+Gen viisi+Gen sata+Gen kolme+Gen (2503)
kahden tuhannen viiden sadan kolmen
kaksi+Nom tuhat+Part viisi+Nom sata+Part kolme+Nom (2503)
(kaksi • tuhatta) + (viisi • sataa) + kolme
Morphological Alternations: Morphological Alternations Semiregular stem alternations
yksi+Sg+Nom : yksi (one)
yksi+Sg+Ess : yhtenä
yksi+Sg+Gen : yhden
yksi+Sg+Part : yhtä
yksi+Pl+Gen : yksien
Irregular stem alternations
yksi+Ord+Sg+Nom : ensimmäinen (first) Regular suffix alternations
Vowel harmony
kolme+Sg+Part : kolmea vs. neljä+Sg+Part : neljää
Illative vowel
kolme+Sg+Ill : kolmeen vs. neljä+Ill+Part : neljään
Partitive t
yksi+Sg+Part : yhtä vs. neljä+Sg+Part : neljää
Solution for Finnish: Solution for Finnish
Example: Example
Sublexicon for One: Sublexicon for One LEXICON Yksi
YKSI+Sg:yksi Nom; # singular nominative
YKSI+Sg:yhde WeakGrade; # weak stem (most cases)
YKSI+Sg:yhte StrongGrade; # strong stem (essive, ill.)
YKSI+Sg:yht Par; # partitive stem
YKSI:yks PlStem1; # plural stem
YKSI+Ord1+Sg:ensimmäinen Nom; # singular nominative
YKSI+Ord1+Sg:ensimmäise AnyGrade; # weak/strong stem
YKSI+Ord1+Sg:ensimmäis Par; # partitive stem
YKSI+Ord+Sg:yhdes Nom; # singular nominative
YKSI+Ord+Sg:yhdenne WeakGrade; # weak stem
YKSI+Ord+Sg:yhdente StrongGrade; # strong stem
YKSI+Ord+Sg:yhdet Par; # partitive stem
YKSI+Ord:yhdens PlStem1; # plural stem
Some sublexicons: Some sublexicons LEXICON WeakGrade
SgGen; ! Singular Genitive
PlNom; ! Plural Nominative
InvarWeak; ! Invariant (plural and singular) cases
LEXICON InvarWeak
+Tra:ksi Next; ! Translative “into”
+Ine:ssA Next; ! Inessive “in”
+Ela:ltA Next; ! Elative “from” (inside)
+Ade:llA Next; ! Adessive “on”
+Abl:ltA Next; ! Ablative “from” (outside)
+All:lle Next; ! Allative “onto”
+Abe:ttA Next; ! Abessive “without”
Sample paths for Two: Sample paths for Two kaksi+Sg+Nom kaksi+Sg+Gen kaksi+Sg+Ess
kaksi kahde n kahte na
kaksi+Sg+Par kaksi+Pl+Gen kaksi+Pl+Ill
kah TA kaks i en kaks i Vn
kaksi+Ord+Sg+Nom kaksi+Ord1+Sg+Nom
kahde s toinen
kaksi+Ord+Sg+Ill kaksi+Ord1+Sg+Ill
kahde nte Vn toise Vn
Morphophonologial rules: Morphophonologial rules define BackV [a | o | u];
define FrontV [ä | ö | y];
define Vow [BackV | FrontV | i | e]; define VHarmony [A -> a || BackV ~$[FrontV] _
.o.
A -> ä]; define IllativeV [V -> a || a (h) _ ,
V -> e || e (h) _ , … ] define PartitiveT [T -> 0 || \Vow Vow _ ];
Example again: Example again
Remaining problems: Remaining problems Special ordinals for yksi (one), kaksi (two)
ensimmäinen (1st) vs. kahdeskymmenesyhdes (21st)
Compose the lexicon with an appropriate filter to eliminate unwanted variants.
No internal tags
2+Sg+Gen00+Sg+Gen
Delete them: 0 %+Par // %+Sg %+Nom ~$Tag %+Sg _ ;
Ordinal/Plural/Case agreement
Flag diacritics!
Flags for Finnish numerals: Flags for Finnish numerals @U.Type.Card@ @U.Type.Ord@
@U.Number.Sg@ @U.Number.Pl@
@U.Case.Nom@ @U.Case.Gen@ @U.Case.Par@ @U.Case.Tra@
@U.Case.Ess@ @U.Case.Abe@ @U.Case.Ine@ @U.Case.Ela@
@U.Case.Ill@ @U.Case.Ade@ @U.Case.Abl@ @U.Case.All@
@U.Case.Com@ @U.Case.Ins@ 300+Sg+Gen
kolmensadan
Conclusion: Conclusion Mapping from numbers to numerals can be done in a simple and elegant way even for languages with complex morphology.
Necessary for text to speech applications.
Tervetuloa kahdensienkymmenensienkahdeksansien olympialaisten avajaisiin!
Welcome to the opening ceremonies of the 28th Olympic Games!
Demo!: Demo!