Statistical Linguistics: Statistical Linguistics Why to count, what to count and how to count it
Applications of statistical thinking: Applications of statistical thinking Language Models: predict the next word
Applications: act on limited information
Linguistics1: prefer economical theories
Linguistics2: explain natural language phenomena.
Linguistics3: explain language acquisition.
Probability and language models: Probability and language models Language Models: predict next word on basis of knowledge or assumptions.
Probability theory developed as a way of reasoning about uncertainty and (especially) games of chance.
Strategy: conceptualise language as a game of chance, then use probability theory.
Probability distributions: Probability distributions Suppose there are just three words: “the”, “old” and “dog”. 0 1 the old dog Uniform distribution Non-uniform distribution
Events, Trials and Outcomes: Events, Trials and Outcomes We sometimes say that an event (Heads), is the outcome of a trial (Tossing a coin).
Now imagine a series of trials (repeatedly tossing a coin). Plausible that the outcome of trial n+1 will be unaffected by that of trial n. Earlier events in the sequence do not affect later events.
Propositional variables: Propositional variables Variables in propositional logic formalise the idea of a truth value.
They acquire their meaning from their relationships with other variables
These relationships are expressed using connectives with evocative names like AND, OR and NOT
Or, as Charles Pierce knew, we can just use a NAND gate and have the full expressive power of Boolean logic
Propositional variables: Propositional variables For convenience, we often choose to use a wider range of connectives than just NAND, because words like “or”, “and” and “implies” are more familiar.
In essence, what we are able to do is to make statements about the relationships between truths, formalizing things like “If rain would have made the grass wet, and the grass isn’t wet, then it didn’t rain”
A pack of cards: A pack of cards Could we draw a 2? Yes
Could we draw an 8? Yes
Could we get a queen? Yes
Could we draw a 4? Don’t know
Could we draw a 19? Don’t know
Random variables: Random variables Random variables formalise the idea of a trial.
Random variables represent what you know about the trial before you have seen its outcome.
Before you draw a card from the deck, you know that there are fifty-two cards, four suits, two colours, twelve face cards, ...
Notation for probabilities: Notation for probabilities P(X=xi) a probability (number) for xi
P(xi) an abbreviation for the above
P(X), or either of above to mean the function that assigns a value to each xi
|X=xi| for the number of times xi occurs.
Can define P(X=xi) as |X=xi| / Σj |X=xj| if large number of trials
Conditional probabilities: Conditional probabilities Conan-Doyle does not play dice
“Holmes” follows every use of “Sherlock”.
The formal statement of this fact is that the conditional probability of the nth word being “Holmes” if the n-1th is “Sherlock” appears to be 1 for Conan-Doyle stories.
P(Wn = holmes | Wn-1 = sherlock) = 1
Joint events: Joint events Joint event of the n-1th word being “Sherlock” and the nth “Holmes”.
P(Wn-1 = sherlock ,Wn = holmes)
Know identity of the next word when we have seen the ``Sherlock'', so
P(Wn = holmes,Wn-1 = sherlock) = P(Wn-1 = sherlock)
Decomposing joint events: Decomposing joint events In general, for any pair of words wk, wk’, we will have:
P(Wn = wk’,Wn+1 = wk) = P(Wn = wk ) P(Wn+1 = wk’|Wn = wk )
which is usually written more compactly in a form like:
P(wkn , wk’n+1) = P(wkn ) P(wk’n+1| wkn)
Pitfalls: Pitfalls Just because Holmes is the only word that follows Sherlock, it need not be that Holmes is always preceded by Sherlock.
Bayes’ theorem: Bayes’ theorem P(wkn , wk’n+1) = P(wkn ) P(wk’n+1| wkn) = P(wk’n+1 ) P(wkn | wk’n+1)
Divide through by P(wkn )
P(wk’n+1| wkn) = P(wk’n+1 ) P(wkn | wk’n+1)/ P(wkn )
Which is an instance of Bayes’ theorem
P(A| B) = P(A) P(B|A)/ P(B)
Medical diagnosis: Medical diagnosis The doctors problem: P(S,C) vs P(S,P)
Causal Information: P(S|C) = P(S|P) = 1
Base Rates: P(P) = 10-6 P(C) = 0.25 P(S) = 0.33
Wants: P(C|S) and P(P|S)
Bayes’ rule applied: Bayes’ rule applied P(P|S) = ( P(P) × P(S|P))/P(S)
In this case (10-6 × 1)/0.33 = 3 × 10-6
the doctor probably wouldn’t panic.
In an epidemic the prior might change dramatically, affecting the outcome.
The prior dominates the posterior.
Being Bayesian: Being Bayesian Posterior α Prior × Likelihood
P(L|X) α P(L) × P(X|L)
Grammar inference = Prior beliefs about possible grammars × Language model
Linguistic diagnosis: Language Identification: Linguistic diagnosis: Language Identification e preebas bioquimica
man immunodeficiency
faits se sont produi
Bad techniques for language identification: Bad techniques for language identification Common words: needs longish test data.
Proper names: especially bad, since very numerous, and highly likely to appear in documents in another language.
Unique letter pairs or sequences: OK sometimes, but doesn’t weight evidence by frequency
Tutorial Paper by Ted Dunning: Tutorial Paper by Ted Dunning Dunning asks the following questions:
Q: How simple can the program be?
A: Small program based on statistical principles
Q: What does it need to learn?
A: No hand-coded linguistic knowledge is needed. Only training data plus the assumption that texts are somehow made of bytes.
Q: How much training data needed?
A: A few thousand words of sample text from each language suffices. Ideally about 50 Kbytes
More questions: More questions Q: How much test data?
A: 10 characters work, 500 characters very well.
Q: Can it generalise?
A: If trained on French, English and Spanish, thinks German is English.
Markov Models: Markov Models States and transitions (with probabilities)
Tabular form of Markov models: Tabular form of Markov models Transition Matrix(A) The Dogs Bit The 0.33 0.33 0.33 Dogs 0.33 0.33 0.33 Bit 0.33 0.33 0.33
Start with initial probabilities p(0) The 0.7 Dogs 0.2 Bit 0.1
Using Markov models: Using Markov models Choose initial state from p(0)
Choose transition from relevant row of A
Repeat ad lib
Simple probabilistic generator for sequences of words. Yields sequence and its probability.
Same can be done with characters, giving a probabilistic generator for character strings.
Higher order Markov models : Higher order Markov models If you want to take account of more context, you can re-draw the machine so that each state is labelled with a longer fragment of the context.
Example
Randomly generated strings from Markov models: Randomly generated strings from Markov models 0 hm 1 imuando~doc ni leotLs Aiqe1pdt6cf tlc.teontctrrdsxo~es loo oil3s
1 ~ a meston s oflas n, ~ 2 nikexihiomanotrmo s,~125 0 3 1 35 fo there
2 s ist ses anat p sup sures Alihows raaial on terliketicany of prelly
3 approduction where. If the lineral wate probability the or likelihood
4 sumed normal of the normal distribution. Gale, Church,Willings. This
5 ~k sub 1} sup {n-k} .EN where than roughly 5. This agreemented by th
6 these mean is not words can be said to specify appear McDonald. 1989
Bayesian Decision Rules: Bayesian Decision Rules Spanish and English as diseases causing the symptom “immunotechno”
Compare P(immunotechno,Spanish) with P(immunotechno,English)
If we want we can compare P(immunotechno|Spanish)p(Spanish) with P(immunotechno|English)p(English)
Bayesian Decision Rules2: Bayesian Decision Rules2 Assume that P(Spanish) = P(English).
So compare P(immunotechno|Spanish) with P(immunotechno|English)
To do this, we need a model which will generate character strings from data.
A character-based Markov model is what is needed
Character Markov Models: Character Markov Models P(s0…sn) = P(s0)P(s1| s0) P(s2| s0 ,s1)P(s3| s0,s1 ,s2 ) … P(sn| s0…sn-1)
The above is exact, but impractical, because we don’t have sufficient data for reliable estimates.
Approximate with P*(s0…sn) = P(s0)P(s1| s0) P(s2| s0 ,s1)P(s3| s1 ,s2 ) … P(sn| sn-2,sn-1)
Character Markov Models: Character Markov Models That was for character-level Markov models of order 2, which corresponds to a trigram model. Word-level trigram models are common.
For the language identification application we actually use character 4-grams.
Probability estimation: Probability estimation We need transition probabilities p(s4|s1,s2,s3)
Obvious way to proceed is to collect |s1,s2,s3,s4 | and |s1,s2,s3 | then divide.
This gives zero if |s1,s2,s3,s4 | = 0, undefined if |s1,s2,s3 | = 0
Probability estimation 2: Probability estimation 2 It's a great model for the training set, but fails catastrophically in the real world.
There may be k+1-grams in the test data which are absent from the training data.
Probability estimation 3: Probability estimation 3 By bad luck may be a k+1 gram in the training data for one language, even though it is in fact rare in all the languages.
If this happens, all strings containing that k+1-gram will be judged to be from that language, because all the others will be 0.
maximum likelihood estimator is too brittle.
Smoothing: Smoothing P((s4|s1,s2,s3, Lang) = ( |s1,s2,s3,s4 | + 1)/(|s1,s2,s3 |+M)
this is more stable, can be seen as mixing in a small measure of uniform distribution into the empirical data
Discounts the data, but also prevents zeros
Laplace’s M-estimate
Results : Results In a binary choice between English and Spanish strings drawn from a bilingual corpus, an accuracy of 92% from 20 bytes of test data and 50Kbytes of training data,
improving to about 99.9% when 500 bytes of test data are allowed.