Optimizing Local Probability Models for Statistical Parsing
Kristina Toutanova, Mark Mitchell, Christopher Manning
Computer Science Department
Stanford University
Highlights
Choosing a local probability model P(expansion(n)|history(n)) for statistical parsing – a comparison of commonly used models
A new player – memory-based models and their relation to interpolated models
Joint likelihood, conditional likelihood and classification accuracy for models of this form
Motivation
Many problems in natural language processing are disambiguation problems
word senses
jaguar – a big cat, a car, name of a Java package
line – a phone line, a queue, a line in mathematics, an airline, etc.
part-of-speech tags (noun, verb, proper noun, etc.)
[Figure: part-of-speech ambiguity for 'Joy makes progress every day .' – candidate tags for each word include NN, NNP, VB, VBZ, NNS, DT]
Parsing as Classification
Input: a sentence, e.g. 'I would like to meet with you again on Monday'
Classify it to one of the possible parses
Motivation – Classification Problems
There are two major differences from typical ML domains:
The number of classes can be very large or even infinite; the set of available classes for an input varies (and depends on a grammar)
Data is usually very sparse and the number of possible features is large (e.g. words)
Solutions
The possible parse trees are broken down into small pieces defining features
features are now functions of input and class, not input only
Discriminative or generative models are built using these features
we concentrate on generative models here; when a huge number of analyses are possible, they are the only practical ones
History-Based Generative Parsing Models
Example: 'Tuesday Marks bought Brooks' [parse tree figure rooted at TOP]
The generative models learn a distribution P(S,T) on <sentence, parse tree> pairs:
P(S,T) = ∏_{n ∈ T} P(expansion(n) | history(n))
and select a single most likely parse based on:
T_best = argmax_T P(T|S) = argmax_T P(S,T)
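A minimal sketch (not the authors' implementation) of how such a model scores and selects parses; the node attributes `expansion` and `history` and the `local_prob` function are placeholder assumptions:

```python
from math import log

def tree_log_prob(tree_nodes, local_prob):
    """Sum log P(expansion(n) | history(n)) over the nodes of one parse tree."""
    return sum(log(local_prob(n.expansion, n.history)) for n in tree_nodes)

def best_parse(candidate_trees, local_prob):
    """argmax_T P(S,T); for a fixed sentence S, P(T|S) is proportional to P(S,T)."""
    return max(candidate_trees, key=lambda t: tree_log_prob(t.nodes, local_prob))
```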
Factors in the Performance of Generative History-Based Models
The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions
The model family chosen for representing local probability distributions:
Decision Trees, Naïve Bayes, Log-linear Models
The optimization method for fitting major and smoothing parameters:
Maximum likelihood, maximum conditional likelihood, minimum error rate, etc.
Previous Studies and This Work
The influence of the previous three factors has not been isolated in previous work: authors presented specific choices for all components, and the importance of each was unclear.
We assume the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and we carefully study the other two factors.
Deleted Interpolation
Estimating the probability P(y|X) by interpolating relative frequency estimates for lower-order distributions:
P~(y|x1,…,xn) = λ·P^(y|x1,…,xn) + (1 − λ)·P~(y|x1,…,xn−1)
Most commonly used (in linear feature subsets order): Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Decision Trees with path interpolation, Memory-Based Learning (a sketch follows below)
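A minimal sketch of linear deleted interpolation with a fixed Jelinek-Mercer weight, assuming `counts[k]` maps a length-k context tuple to a Counter of outcomes; Witten-Bell with varying d differs only in making the weight depend on the counts of the context:

```python
from collections import Counter

def interpolated_prob(y, context, counts, lam=0.5):
    """P~(y | x1..xn) = (1 - lam) * P^(y | x1..xn) + lam * P~(y | x1..x_{n-1}),
    bottoming out in the relative-frequency estimate P^(y)."""
    if not context:
        c = counts[0].get((), Counter())
        total = sum(c.values())
        return c[y] / total if total else 0.0
    c = counts[len(context)].get(tuple(context), Counter())
    total = sum(c.values())
    rel_freq = c[y] / total if total else 0.0
    return (1 - lam) * rel_freq + lam * interpolated_prob(y, context[:-1], counts, lam)
```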
Memory-Based Learning as Deleted Interpolation
In k-NN, the probability of a class given features is estimated as:
P(y|X) = Σ_{x_i ∈ k-NN(X)} w(d(X, x_i)) · I(class(x_i) = y) / Σ_{x_i ∈ k-NN(X)} w(d(X, x_i))
If the distance function depends only on the positions of the matching features*, it is a case of deleted interpolation
Memory-Based Learning as Deleted Interpolation
Example: P(eye-color=blue | hair-color=blond)
We have N=12 samples of people
d=0 (match) or d=1 (mismatch); w(0)=w0, w(1)=w1; K=12
This is deleted interpolation where the interpolation weights depend on the counts and weights of nearest neighbors at all accepted distances (a toy sketch follows below)
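A toy sketch of this example; the 12 samples below are made up for illustration (the slide only fixes N=12, K=12, and the weights w(0)=w0, w(1)=w1):

```python
def knn_prob(query_hair, target_eye, samples, w0=1.0, w1=0.5):
    """Weighted k-NN with K = all samples and a 0/1 distance:
    d = 0 if the hair color matches the query, d = 1 otherwise."""
    num = den = 0.0
    for hair, eye in samples:
        w = w0 if hair == query_hair else w1   # w(0) = w0, w(1) = w1
        num += w * (eye == target_eye)
        den += w
    return num / den

# Hypothetical 12 people
samples = ([("blond", "blue")] * 3 + [("blond", "brown")] * 1 +
           [("dark", "blue")] * 2 + [("dark", "brown")] * 6)
p = knn_prob("blond", "blue", samples)
# w1 = 0 recovers P^(blue | blond); w1 = w0 recovers P^(blue): the estimate is a
# deleted interpolation whose weights depend on the counts and on w0, w1.
```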
The Task and Features Used
Maximum ambiguity – 507, minimum – 2
Experiments
Linear Feature Subsets Order
Jelinek-Mercer with fixed weight
Witten-Bell with varying d
Linear Memory-Based Learning
Arbitrary Feature Subsets Order
Decision Trees
Memory-Based Learning
Log-linear Models
Experiments on the connection among likelihoods and accuracy
Experiments – Linear Sequence
The features {1,2,…,8} ordered by gain ratio: {1,8,2,3,5,4,7,6}
[Chart: accuracy vs. amount of smoothing for Jelinek-Mercer with fixed weight and Witten-Bell with varying d]
Experiments – Linear Sequence
Heavy smoothing is needed for the best results
MBL Linear Subsets Sequence
Restrict MBL to be an instance of the same linear subsets sequence deleted interpolation as follows:
Weighting functions INV3 and INV4 performed best:
LKNN3 best at K=3,000: 79.94%
LKNN4 best at K=15,000: 80.18%
LKNN4 is best of all
Experiments
Linear Feature Subsets Order
Jelinek-Mercer with fixed weight
Witten-Bell with varying d
Linear Memory-Based Learning
Arbitrary Feature Subsets Order
Decision Trees
Memory-Based Learning
Log-linear Models
Experiments on the connection among likelihoods and accuracy
Model Implementations – Decision Trees (DecTreeWBd)
n-ary decision trees; if we choose a feature f to split on, all its values form subtrees
splitting criterion – gain ratio
final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root (see the sketch below)
[Decision-tree figure: splits on feat:1 and feat:2, leaves labeled HCOMP and NOPTCOMP – these trees are instances of deleted interpolation models!]
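A rough sketch of the leaf estimates described above, assuming `path_counts` lists the outcome Counters at each node from the root split down to the leaf; the uniform back-off at the top and the exact form of the Witten-Bell weight are assumptions for illustration, not the paper's exact recipe:

```python
def dectree_wbd_prob(y, path_counts, d=5.0, n_classes=100):
    """Interpolate relative-frequency estimates along the root-to-leaf path with
    Witten-Bell weights lam = c / (c + d * number_of_distinct_outcomes)."""
    p = 1.0 / n_classes                       # back off to a uniform distribution
    for counts in path_counts:                # root first, leaf last
        c = sum(counts.values())
        lam = c / (c + d * len(counts)) if c else 0.0
        p = lam * (counts[y] / c if c else 0.0) + (1 - lam) * p
    return p
```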
Model Implementations – Log-linear Models
Binary features formed by instantiating templates
Three models with different allowable features
Single attributes only: LogLinSingle
Pairs of attributes, but only pairs involving the most important feature (node label): LogLinPairs
Linear feature subsets – comparable to the previous models: LogLinBackoff
Gaussian smoothing was used
Trained by Conjugate Gradient (Stanford Classify Package)
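A minimal sketch of the kind of model this describes, with binary features represented as integer indices into a weight vector; the Gaussian-prior objective is what a conjugate-gradient trainer would minimize (the Stanford Classify package's actual interface is not shown here):

```python
import numpy as np

def loglin_probs(weights, feats_by_class):
    """P(y | X) proportional to exp(sum of weights of active binary features of (X, y))."""
    scores = np.array([sum(weights[f] for f in feats) for feats in feats_by_class])
    scores -= scores.max()                      # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def objective(weights, data, sigma2=1.0):
    """Negative conditional log-likelihood plus a Gaussian (L2) smoothing penalty."""
    ll = sum(np.log(loglin_probs(weights, feats_by_class)[gold])
             for feats_by_class, gold in data)
    return -ll + np.dot(weights, weights) / (2 * sigma2)
```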
Model Implementations – Memory-Based Learning
Weighting functions INV3 and INV4
KNN4 better than DecTreeWBd and Log-linear models
KNN4 gives a 5.8% error reduction over WBd (significant at the 0.01 level)
Accuracy Curves for MBL and Decision Trees
[Figure: accuracy curves]
Experiments
Linear Feature Subsets Order
Jelinek-Mercer with fixed weight
Witten-Bell with varying d
Linear Memory-Based Learning
Arbitrary Feature Subsets Order
Decision Trees
Memory-Based Learning
Log-linear Models
Experiments on the connection among likelihoods and accuracy
Joint Likelihood, Conditional Likelihood, and Classification Accuracy
Our aim is to maximize parsing accuracy, but:
Smoothing parameters are usually fit on held-out data to maximize joint likelihood
Sometimes conditional likelihood is optimized
We look at the relationship among the maxima of these three scoring functions, depending on the amount of smoothing, finding that:
Much heavier smoothing is needed to maximize accuracy than joint likelihood
Conditional likelihood also increases with smoothing, even long after the maximum for joint likelihood
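A hypothetical sketch of this comparison: sweep one smoothing parameter and see which value each criterion prefers on held-out data; the `model(alpha)` interface is assumed for illustration:

```python
import math

def pick_smoothing(alphas, heldout, model):
    """For each criterion, return the smoothing amount `alpha` that maximizes it.
    `model(alpha)` is assumed to expose joint_logprob(x, y), cond_prob(y, x), predict(x)."""
    criteria = {
        "joint":       lambda m: sum(m.joint_logprob(x, y) for x, y in heldout),
        "conditional": lambda m: sum(math.log(m.cond_prob(y, x)) for x, y in heldout),
        "accuracy":    lambda m: sum(m.predict(x) == y for x, y in heldout),
    }
    return {name: max(alphas, key=lambda a: crit(model(a)))
            for name, crit in criteria.items()}

# Expectation from the results above: the accuracy-maximizing alpha is much larger
# (heavier smoothing) than the joint-likelihood-maximizing one.
```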
Test Set Performance versus Amount of Smoothing – I [figure]
Test Set Performance versus Amount of Smoothing [figure]
Test Set Performance versus Amount of Smoothing – PP Attachment [figure: Witten-Bell with varying d]
Summary
The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored
We showed that the chosen local distribution model matters
We showed the relationship between MBL and deleted interpolation models
MBL with large numbers of neighbors and appropriate weighting outperformed more expensive and popular algorithms – Decision Trees and Log-linear Models
Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance
Future Work
Compare MBL to other state-of-the-art smoothing methods
Better ways of fitting MBL weight functions
Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions