LanguageModeling

Uploaded from authorPOINTLite
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Language Modeling: 

Language Modeling I. Scott MacKenzie

Plan: 

Plan My gosh, language is a complicated subject! How shall we proceed? How about… Models of description Models of relation Models of prediction

Models of Description: 

Models of Description

Describing Language: 

Describing Language How can you describe a language such as English or Finnish? We may describe… Letters Sequences of letters Words Sequences of words Grammar Phrases Ideas

Language: In what form?: 

Language: In what form? Spoken vs. written vs. ... Formal vs. informal vs. ... Old vs. new vs. ... Etc.

Corpora : 

Corpora The written form of a language is typically represented in the form of a corpus A corpus is… A large representative body of text for a given language English examples Brown Corpus, British National Corpus (BNC), etc Finnish example Text from Turun Sanomat

Example #1 BNC Corpus (“d1”): 

Example #1 BNC Corpus (“d1”) Descriptive stats 27 unique letters, 368,832,032 total letters 9022 unique words, 67,962,112 total words 729 digrams, 368,832,032 total digrams Files d1-letterfreq.txt – letters and their frequencies d1-wordfreq.txt – words and their frequencies d1-digramfreq.txt – digrams and their frequencies

Example #2 BNC Corpus (“d2”): 

Example #2 BNC Corpus (“d2”) Descriptive stats 27 unique letters, 505,863,847 total letters 64,566 unique words, 90,563,946 total words 729 digrams, 505,863,847total digrams Files d2-letterfreq.txt – letters and their frequencies d2-wordfreq.txt – words and their frequencies d2-digramfreq.txt – digrams and their frequencies

Example #3 Brown Corpus (“bc”): 

Example #3 Brown Corpus (“bc”) Descriptive stats 27 unique letters, 5,666,538 total letters 41,532 unique words, 997,552 total words 729 digrams, 5,666,538 total digrams Files bc-letterfreq.txt – letters and their frequencies bc-wordfreq.txt – words and their frequencies bc-digramfreq.txt – digrams and their frequencies

Example #43 Turun Sanomat Text (“fi”): 

Example #43 Turun Sanomat Text (“fi”) Descriptive stats 29 unique letters, 161,652,647 total letters 200,000 unique words, 19,646,670 total words 841 digrams, 161,652,647 total digrams Files fi-letterfreq.txt – letters and their frequencies fi-wordfreq.txt – words and their frequencies fi-digramfreq.txt – digrams and their frequencies

Word Frequency Stats -- d1 vs. fi: 

Word Frequency Stats -- d1 vs. fi Total words: 9022 Total of frequencies: 67962112 Longest word: telecommunications (18 letters) Shortest word: a Average word size (unnormalized): 7.088 Average word size (normalized): 4.427 Total words: 200000 Total of frequencies: 19646670 Longest word: perustuslainsäätämisjärjestyksessä (34 letters) Shortest word: a Average word size (unnormalized): 11.267 Average word size (normalized): 7.228 d1 fi

Models of Relation: 

Models of Relation

Relations in Language: 

Relations in Language Are there interesting models of relation in languages? Let’s see…

Word Rank vs. Probability – d1: 

Word Rank vs. Probability – d1 Hmm… There appears to be a relationship between word rank and word probability. Plotting both on log scales, as above, reveals a linear, or straight line, relationship. How strong is this relation? (next slide)

Slide15: 

File: WordFrequencies.xls

Phrase Sets for Evaluating Text Entry Methods: 

Phrase Sets for Evaluating Text Entry Methods In evaluating text entry methods, users are typically asked to enter phrases of text while their performance is measured It is seems worthwhile to use phrases that are “representative” of the intended language In other words, there should be a relation between the phrases entered and the intended language Let’s see… http://www.yorku.ca/mack/chi03b.html Files: phrases2.txt d1-letterfreq.txt PhraseSets.zip

Models of Prediction: 

Models of Prediction

Models of Prediction: 

Models of Prediction Letter guessing experiment Take a printout of the entire BNC. Cut out all the letters and place them in a bag. Shake the bag. Pick a letter. Try to guess the letter. Your best guess is “e” (or SPACE if these are included in the experiment) In short… Language is highly redundant This is why you can understand conversations in noisy environments – you “fill in”, or predict, the missing pieces An example follows…

90% Removed: 

90% Removed k r ce r oz y t a e t h c o , d o a a o b a a e g v t a l a ss m n t i s n f d w i t l h e n s - w le e a r i w f e n t e h r w e , h v e r d l Wi

80% Removed: 

80% Removed D r s r s n h s d f w e e b e i e in th w i r o a t e t ar e t e k i t i c i ver e . as d on, ho ve n o n an o h t h sp r f a e of s a h o a u o e e n a n s - au h m r s s o h l e ld a e s n in i f li w h mas u i e i m g in i i o i t e o t w t a g z a N n d

70% Removed: 

70% Removed D uc ore frown d on i h r id n ter a h t ee a ent o e whi e in f fro y seeme t l a s ac c s e ad A l ei d h h lf wa s a s, th mo t, a ha th p o v at sa s e a a i i g t aught m r t h y ne s a g t at wa m es e e o i laug s e fr s p k f h gr nes i f t e fu an i c u l s m of r i u l f a he ef rt o il , v , ro ar t land il

60% Removed: 

60% Removed k ruc res f w n th r id t e fr z n t r y. T r d stri d b a rece t n t w i cove n o r s a d t e e e to e n h ot er, l o no s e ad ng l g . A st n eig d v e l d. h n t e w a d s ti , i es , wi mo e t, so a d co t a pi it i as ev hat dn ss he a n it a ht r b t f l u e te bl t n ess a u e t r e s the sm e t p i ugh o d a he fr d p ta in h g ne s llib l It as e a ful d mm c le w e nity lau h g a t e l t o and ff t . I a he W d s ge fr z a e N th a W ld

50% Removed: 

50% Removed D k r ce fore t o on i h r si the fr en wa erw tr s ha ee i ed y a ec nt wi d f the r whit erin o f an e e med t lean t war ch o er, ck a d mi ou in g i t. t le r ig d r he l d la d t el as d so i if , w t o m v ent, s lone d d hat h spi t f t was t ven th sa e s h as a h nt n i f aug t , t f l u t mo e ter ble than an a ss - lau ter hat as m h as t s ile f he ph n a ghte ol as he os a d r a ing e grimn ss o n a il y. a h m s er u in om a i d e it laug n a t fu i y f l e he ff t of if . It he W l , t e s ag , f e - ed N r n ld

40% Removed: 

40% Removed Dark s r c res ow e n eit er side the froze w erwa . T trees h d b n tr p d b c t w n o heir white c v ring of f s and h y s med e n towa ds e c o her la k and i o s, i e f l gh . A as ilence reig ed o he and. The lan e f a a de l tion, life e s, i mo ment, so n nd cold h the pi t o it s n t ev n hat o sa nes . h e wa hin n it of laughter, bu a lau h r e t rib e a dn - laught r a was mi hle s as the i e of t e s a la hte ol as he ro t a d ar ak n f th g mne of inf llib i . It wa he mas erful n incom un b e w s om of t rnity l ugh n at e lity f fe nd e effort f ife. I as t e , savage, fro n- h a r hland Wi d.

30% Removed: 

30% Removed Da s ruce fores r on ith r i e the froz n ater ay The t s ha een st ippe b re e wi d f heir white c vering of ost, an they e d o ean towards eac othe , lac a d om nou , n t e f in ight. A vast il nc re ned o e he la d. The l d s l wa d s a ion, if s, without m men o e and old that t e pi t of it was not e en tha o sadnes T ere s a hint in t of laug r, ut o a laugh e o er ble t n a y sadn s a la ghte at as irthless s the mil o h phi x, a ghter cold as the fro t a d par ak ng f the grim es of n al b l ty was th m s r ul and co mu i able wi m of ter la ghi t t e futil ty o if an the ef rt of li e. It as the Wild, h avage, froze - hearte Northland Wil .

20% Removed: 

20% Removed Dark spruce forest frowned n either side the roze ate wa . The trees had ee stripp d by ecent wi d f thei white coverin of frost d they eemed to lean towa ds e ch o h r, bl ck and mino s, in the ad ng l ght. A vas sil n e reigned over the land. The land i sel was e olatio , ifeless, wit out movem n , s lon n cold ha e spi it of s n t eve hat adn s. The e was hi i it of a ght , but f a laughter ore ible t an any s ne s - a laughter t a was mi hless as he mile of he sphinx, a aug ter col as the f ost nd arta in of th grimn ss of i fall bility It as the asterful and inc mmunicabl wisdo of e ernity ugh ng at the futilit of li and t e for of ife. It w the Wild, the avage, rozen- hea t d N rthla d W ld.

10% Removed: 

10% Removed Dark s ru e forest frowned on either side the frozen waterw y. The trees had bee stripped by a recent ind of t eir w ite covering of rost, and they seemed o lean towards each ot er, black and ominous, in the fading li h . A vast silence reigned ver the land The land itself w s a deso ation, lifel ss, without ovement, s l n and cold hat the spiri of it wa not even that of sa ness. here was a hint in it of laughte , but of a l ug ter more terrible than any sadness - a ughte that as mirthless s he smile of the sphinx a laug ter cold as the rost a d p rt k ng of the grimness of infal i ility. It as the masterful and incommunica le wisdom of eternity laughing at th futility of life an the effort f life. It was e Wild, the sa ag , froz n- earte Northland Wild.

0% Removed: 

0% Removed Dark spruce forest frowned on either side the frozen waterway. The trees had been stripped by a recent wind of their white covering of frost, and they seemed to lean towards each other, black and ominous, in the fading light. A vast silence reigned over the land. The land itself was a desolation, lifeless, without movement, so lone and cold that the spirit of it was not even that of sadness. There was a hint in it of laughter, but of a laughter more terrible than any sadness - a laughter that was mirthless as the smile of the sphinx, a laughter cold as the frost and partaking of the grimness of infallibility. It was the masterful and incommunicable wisdom of eternity laughing at the futility of life and the effort of life. It was the Wild, the savage, frozen- hearted Northland Wild. From Jack London’s “White Fang” Files: paragraph.txt | RemoveText.class

Language as Information: 

Language as Information Information is commonly measured in “bits” Since language is highly redundant, perhaps it can be viewed somewhat like information Can language be measured or coded in “bits”? Sure. Examples include… ASCII (7 bits per “symbol”) Unicode (16 bits per symbol) But coding schemes, such as ASCII or Unicode, do not account for the redundancy In the language Two questions: Can a coding system be developed for a language (e.g., English) that accounts for the redundancy in the language? If so, how may bits per symbol are required?

How many bits? : 

How many bits? ASCII codes have seven bits 27 = 128 codes Codes include… 33 control codes 95 symbols, including 26 uppercase letters, 26 lowercase letters, space, 42 “other” symbols In general, if we have n symbols, the number of bits to encode them is log2n (Note: log2128 = 7) What about bare bones English – 26 letters plus space? How many bits?

How many bits? (2): 

How many bits? (2) It takes log227 = 4.75 bits/character to encode bare bones English But, what about redundancy in English? Since English is highly redundant, is there a way to encode letters in fewer bits? Yes How many bits? The answer (drum roll please)…

How many bits? (3): 

How many bits? (3) The minimum number of bits to encode English is (approximately)… 1 bit/character How is this possible? E.g., Huffman coding Want to learn more? (Hint: do a Google search) More importantly, how is this answer computed? Want to learn how? Read… Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 51-64.

Formulas: 

Formulas Files: Entropy.html | Entropy.class

Disambiguation: 

Disambiguation A special case of prediction is disambiguation Consider the telephone keypad… Is it possible to enter text using this keypad? Yes. But the keys are ambiguous.

Ambiguity Continuum: 

Ambiguity Continuum 53 keys 27 keys 8 keys 1 key Less Ambiguity More Ambiguity

Coping With Ambiguity: 

Coping With Ambiguity There are two approaches to disambiguating the telephone keypad Explicit Use additional keys or keystrokes to select the desired letter E.g., multitap Implicit Add “intelligence” (i.e., a language model) to the interface to guess the intended letter E.g., T9, Letterwise

Multitap: 

Multitap Press a key once for the 1st letter, twice for the 2nd letter, and so on Example... 84433.778844422255.22777666966.33366699. th e q u i c k b r o wn f o x 58867N7777.66688833777.84433.55529999N999.36664. ju mp s o v e r th e l az y do g

Word + Frequency + Keystroke Files: 

Word + Frequency + Keystroke Files Language modeling files: Letters + frequencies Digrams + frequencies Words + frequencies A useful variation appends keystrokes for a given interaction technique, such as multitap: Letters + frequencies + keystrokes Digrams + frequencies + keystrokes Words + frequencies + keystrokes Let’s see... java BuildKeystrokes java BuildKeystrokes d1-wordfreq.txt –wf –mt | more

T9: 

T9 Product of Tegic Communications (www.tegic.com), a subsidiary of AOL Inc. Licensed to many mobile phone companies The idea is simple: one key = one character A language model works “behind the scenes” to disambiguate Example (next slide)...

Guess the Word: 

Guess the Word

“Quick Brown Fox” Using T9: 

“Quick Brown Fox” Using T9 843.78425.27696.369.58677.6837.843.5299.364. the quick brown fox jumps over the lazy dog But, there is problem. The key sequences are ambiguous and other words may exist for some sequences. See below.

Alternate Word Selection: 

Alternate Word Selection The problem just illustrated is easily addressed A special “next word” (N) key1 cycles through the list of alternate words when the set size > 1 Words presented by decreasing probability So... 1 Nokia phones: “next word” = asterisk

T9 Keystrokes : 

T9 Keystrokes Generating a T9 word + frequency + keystrokes file is a bit tricky, because “next word” keystrokes must be included Let’s see... java T9 d1-wordfreq-phoneks.txt –k | more

Ambiguity Analysis: 

Ambiguity Analysis Some questions of interest… For any given word, is its encoding unique? How many words in a dictionary are ambiguous? How many words require 1 press of N? How many words require 2 presses of N? (and so on) What is the greatest number of presses of N required? Can we produce the sets of ambiguous words? Let’s see... java Decode d1-wordfreq-phoneks.txt 5 java T9 d1-wordfreq-phoneks.txt –s java T9 d1-wordfreq-phoneks.txt –a | more

Keystrokes Per Character (KSPC): 

Keystrokes Per Character (KSPC) Earlier examples used the “quick brown fox” phrase, which has 44 characters (including one character after each word) Multitap and T9 require very different keystroke sequences Compare…

KSPC as a Characteristic of Text Entry Methods: 

KSPC as a Characteristic of Text Entry Methods The preceding analysis suggests that KSPC is useful to characterise and compare text entry methods The examples were for an example phrase Can we compute KSPC for a language? Yes. Let’s see… java KSPCWords d1-wordfreq-mtks.txt java KSPCWords d1-wordfreq-t9ks.txt

(KSPC for) Other Text Entry Methods: 

(KSPC for) Other Text Entry Methods Some other text entry methods to consider LetterWise MessageEase 5-key pager 3-key date stamp 1-key method (Can you think of a method?)

LetterWise: 

LetterWise Product of Eatoni Ergonomics (www.eatoni.com) Uses “prefix-based disambiguation” Guesses each letter based on preceding letter(s) E.g., pressing “3” with prefix “_th” is very likely “e” because “_the” is more likely than “_thd” or “_thf” Does not use a dictionary! Uses probabilities of letter sequences Standard implementation uses prefix size of 3 Wrong letters still produced occasionally, and, in this case, the user presses the “next letter” (N) key to sequence through the possibilities

LetterWise Keystrokes and KSPC: 

LetterWise Keystrokes and KSPC File: d1-wordfreq-lwks.txt java KSPCWords d1-wordfreq-lwks.txt

MessageEase: 

MessageEase Product of EXideas Inc. (www.exideas.com) Uses mobile phone keypad Letters organized in 3 sets Operation: For the nine most-common letters (anihortes), the corresponding key is pressed twice (e.g., 4-4 for 'h') For the next-most-common letters (upbjdgcq), the 5 key is pressed first followed by the key "pointed to" by the letter (e.g., 5-3 for 'p') For the least-common letters (vlxkmywf) the corresponding key is pressed followed by the 5 key (e.g., 4-5 for 'k') To enter a SPACE, 0 is pressed once

MeassageEase Keystrokes and KSPC: 

MeassageEase Keystrokes and KSPC File: d1-wordfreq-meks.txt java KSPCWords d1-wordfreq-meks.txt

5-Key Pager: 

5-Key Pager LCD display with soft keyboard 5 keys for text entry (Left, Right, Up, Down, Select) Navigate soft keyboard with arrows keys, press S to enter a character Space character in centre of soft keyboard Cursor “snaps to home” (space) after each character entered

5-key Pager Keystrokes and KSPC: 

5-key Pager Keystrokes and KSPC File: d1-wordfreq-pagerks.txt java KSPCWords d1-wordfreq-pagerks.txt

3-key “Date Stamp” Text Entry: 

3-key “Date Stamp” Text Entry Alphabet appears as a linear sequence of letters, e.g., Three keys: Left, Right, Select Named after a teller’s date stamp _abcdefghijklmnopqrstuvwxyz

3-key “Date Stamp” Text Entry (2): 

3-key “Date Stamp” Text Entry (2) Six methods considered: #1 – space on left, cursor “persists” #2 – space on left, cursor “snaps to home” (space) #3 – space in middle, persist #4 – space in middle, snap to home #5 – space on left, FOCL level #1, snap to home #6 - space on left, FOCL level #2, snap to home Notes FOCL = fluctuating optimal character layout Strictly speaking, only two keys are required for methods #2, #5, and #6. However, a Left key is useful for correcting “overshoot errors”

Date Stamp Keystrokes and KSPC: 

Date Stamp Keystrokes and KSPC Files: d1-wordfreq-datestamp1ks.txt d1-wordfreq-datestamp2ks.txt d1-wordfreq-datestamp3ks.txt d1-wordfreq-datestamp4ks.txt d1-FOCL1.txt d1-wordfreq-datestamp5ks.txt d1-FOCL2.txt d1-wordfreq-datestamp6ks.txt java KSPCWords d1-wordfreq-datestamp1ks.txt

KSPC Summary (Part 1): 

KSPC Summary (Part 1)

KSPC – Part 2: 

KSPC – Part 2 We’ve seen text entry methods with KSPC ranging from 1 to >10 Is KSPC < 1 possible? Yes. Using word prediction

Word Prediction: 

Word Prediction The problem… Given some amount of preceding text, predict some amount of text in the future Many levels Letters, words, word sequences, phrases or ideas Demo java WordPredict d1-wordfreq.txt 5

KSPC With Word Prediction: 

KSPC With Word Prediction Interesting question: What is the lowest possible value for KSPC? We can answer this question, but… First, we must define the interaction technique Some design parameters are Word-level vs. phrase-level prediction Candidate list size (e.g., 1, 2, 5, 10) Input method (e.g., keypad, stylus)

Design Parameters: 

Design Parameters Static vs. dynamic (viz. adaptive) language model Let’s consider only a static language model Word-level vs phrase-level prediction Let’s consider only word-level prediction Candidate list size (n) Let’s use n = 1, 2, 5, or 10 With each character entered, a candidate list is produced based on the current word stem Input method Let’s consider keypad and stylus When/if the desired word appears, it is selected by... tapping on it (stylus input), or pressing arrow keys to “reach” the candidate word (keypad input)

KSPC Analysis: 

KSPC Analysis The KSPCWords program will compute the KSPC characteristic for any text entry method, provided we have a word + frequency + keystrokes file for the method The problem, therefore, is to build such files for the word prediction designs just considered

Example: 

Example Consider the word “vegetable” How many and what keystrokes are required? 4 keystrokes 6 keystrokes

Demo: 

Demo java BuildKeystrokesWP java BuildKeystrokesWP d1-wordfreq.txt –v | more java BuildKeystrokesWP d1-wordfreq.txt –k | more java KSPCWords d1-wordfreq-wps5ks.txt n = 1, 2, 5, or 10 s = stylus, k = keypad

KSPC Summary: 

KSPC Summary KSPC bottoms out at about 0.5

Word Prediction with Phone Keypad: 

Word Prediction with Phone Keypad Can we combine T9 disambiguation with word prediction on a mobile phone keypad? Sure. Let’s see...

Demo : 

Demo java PhoneKeypad d1-wordfreq-phoneks.txt

Thank You: 

Thank You References MacKenzie, I. S. (2002). KSPC (keystrokes per character) as a characteristic of text entry techniques. Proceedings of the Fourth International Symposium on Human Computer Interaction with Mobile Devices, 195-210. Heidelberg: Springer Verlag. 2. Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 51-64.