Presentation Transcript

Wed, Feb 23: 

Wed, Feb 23. CS1573, Spring 2005. Kurt VanLehn

Course Outline: 

Course Outline
- Knowledge representation
- Search
- Natural Language Processing
  - Classical NLU
  - Probabilistic NLU (2 classes)
  - Dialogue systems (2 classes)
- Planning

Probabilistic NLP outline: 

Probabilistic NLP outline
- Language modelling
- Information retrieval
- Information extraction
- Text classification

Information Retrieval: 

Information Retrieval
Already covered:
- The problem: given a query, return documents
- Language modelling method
- Recall & Precision
- Presentation ordering
To be covered today:
- Vector space methods
- Latent semantic analysis (LSA)

Vector space method for IR: 

Vector space method for IR
Each document & query is treated as a bag of words, and occurrences are counted:
'Monkey see, monkey do' → {monkey: 2, do: 1, see: 1}
Think of it as a sparse vector with one slot per lexicon word, from 'aardvark' to 'zebra':
[0, 0, …, 1, …, 2, …, 1, …, 0]
where the nonzero slots hold the counts of 'do', 'monkey', and 'see'.
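As a minimal sketch (not from the slides), the counting step in Python using collections.Counter, on the example document above:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, drop commas, and count word occurrences."""
    words = text.lower().replace(",", " ").split()
    return Counter(words)

print(bag_of_words("Monkey see, monkey do"))
# Counter({'monkey': 2, 'see': 1, 'do': 1})
```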

Measure similarity of two documents via dot product: 

Measure similarity of two documents via dot product
Dot product of two vectors:
V1 · V2 = length(V1) * length(V2) * cos(angle between V1 and V2)
V1 · V2 = Σ_{d ∈ dimensions} component(V1, d) * component(V2, d)
Dot product of two documents:
D1 · D2 = Σ_{w ∈ lexicon} count(D1, w) * count(D2, w)
Examples:
'monkey see, monkey do' · 'monkey' = 2
'monkey see, monkey do' · 'see' = 1
'monkey see, monkey do' · 'monkey monkey see' = 5
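A sketch of the document dot product, which reduces to summing count products over shared words; it reproduces the last example above:

```python
from collections import Counter

def dot(doc1, doc2):
    """Dot product of two bag-of-words vectors: sum of count products."""
    c1 = Counter(doc1.lower().split())
    c2 = Counter(doc2.lower().split())
    return sum(c1[w] * c2[w] for w in c1)  # missing words count as 0

print(dot("monkey see monkey do", "monkey monkey see"))  # 2*2 + 1*1 = 5
```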

IR via vector spaces: 

IR via vector spaces
1. Precompute the vector for each document
2. Given a query, compute its vector
3. Take the dot product of the query with all document vectors
4. Sort documents, highest dot product first
5. Return the first N documents
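The whole loop as a sketch, with toy documents and query:

```python
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def dot(v1, v2):
    return sum(v1[w] * v2[w] for w in v1)

docs = ["monkey see monkey do", "monkeys at the zoo", "stock market report"]
doc_vecs = [vec(d) for d in docs]   # step 1: precompute document vectors
query_vec = vec("monkey see")       # step 2: vector for the query

# Steps 3-5: score, sort highest first, return the top N.
N = 2
ranked = sorted(zip(docs, doc_vecs), key=lambda p: dot(query_vec, p[1]),
                reverse=True)
print([d for d, _ in ranked[:N]])
```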

Problem: No representation of semantic similarity: 

Problem: No representation of semantic similarity
'monkey see, monkey do' · 'monkeys' = 0
'monkey see, monkey do' · 'vision' = 0
Vectors have too many zeros, so similar documents have too low a dot product.

Latent Semantic Analysis: 

Latent Semantic Analysis
Reduce the number of dimensions via principal components analysis.
Intuitively, this replaces a set of words that tend to co-occur with a single new dimension that represents the set: if 'see' and 'vision' tend to occur together, replace their 2 slots in the vector with 1 slot.
Each document is now represented by a short vector:
- Length shrinks from ~15,000 to ~400
- Most 'counts' become small real numbers instead of 0

Using LSA for IR: 

Using LSA for IR
Given a (large) sample of the document collection, compute the principal components.
This yields a function that takes a long word-count vector as input and outputs a short vector.
- Apply the function to every document's vector; represent each document by its new (short) vector
- Given a query, apply the function to the query, then compute dot products with all document vectors
- Return the N documents with the largest dot products
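A minimal sketch of both steps, assuming LSA is implemented as a truncated SVD of the term-document count matrix (the usual PCA-without-centering formulation); the tiny matrix and k = 2 are illustrative:

```python
import numpy as np

# Rows = words, columns = documents; entries are word counts (toy data).
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 2., 0.]])

# Truncated SVD: keep the top-k components of the word space.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def project(counts):
    """Map a long word-count vector to a short LSA vector."""
    return U[:, :k].T @ counts

doc_vecs = [project(X[:, j]) for j in range(X.shape[1])]
query = project(np.array([1., 0., 1., 0.]))  # the query's word counts

# Rank documents by dot product with the projected query.
scores = [float(query @ d) for d in doc_vecs]
print(sorted(range(len(scores)), key=lambda j: -scores[j]))
# document indices, best match first
```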

IR summary: 

IR summary
Main methods: language modeling, vector space, LSA
Accessories: presentation ordering; stemming, case folding
Evaluation:
- Precision = of documents retrieved, % relevant
- Recall = of relevant documents in the collection, % retrieved
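The two evaluation measures in code, on illustrative sets:

```python
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # of retrieved, fraction relevant
recall = len(hits) / len(relevant)      # of relevant, fraction retrieved
print(precision, recall)                # 0.5 0.666...
```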

Information Extraction: 

Information Extraction
Given a text (e.g., a Wall Street Journal article):
1. Determine whether it describes a certain kind of event (corporate merger, terrorist attack)
2. If so, extract information to fill the template's slots
   - Mergers: Who? Where? Amount? Product?
   - Attacks: Type? Location? Injuries? Organization?
3. Add the new record to a database

A common approach: 

A common approach
1. Tokenization: characters → words
2. Semantic parsing: words → NPs and VPs with semantic features
3. Extraction of slot fillers: phrases → templates with some slots filled
4. Reference resolution and merger: templates → fewer templates with more slots filled

Example input/output: 

Example input/output
Input: 'Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.'
Output:
Template id: Tie-up-1
  Relationship: Tie-up
  Entities: {'Bridgestone Sports Co.', 'a local concern', 'a Japanese trading house'}
  Joint: 'Bridgestone Sports Taiwan Co.'
  Activity: Activity-1
  Amount: NT$(20000000)
Template id: Activity-1
  Company: 'Bridgestone Sports Taiwan Co.'
  Product: 'iron and "metal wood" clubs'
  Start Date: During(January(1990))

Tokenization: 

Tokenization
Segments a character string into words:
'…Sports Taiwan Co., capitalized…' → 'Sports' 'Taiwan' 'Co.' ',' 'capitalized'
Easier for English than Japanese, since blanks separate most words in English.
Must treat the period in 'Co.' as part of an abbreviation rather than as a sentence ender.
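A minimal sketch of such a tokenizer in Python; the abbreviation list is illustrative:

```python
import re

# Known abbreviations keep their period; all other punctuation splits off.
TOKEN = re.compile(r"Co\.|Inc\.|Ltd\.|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Sports Taiwan Co., capitalized"))
# ['Sports', 'Taiwan', 'Co.', ',', 'capitalized']
```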

Semantic parsing is often based on hand-coded knowledge: 

Semantic parsing is often based on hand-coded knowledge
Use domain-specific grammars:
CompanyName → CappedWords CompanySuffix
CompanySuffix → 'Co' | 'Inc' | 'Ltd'
CappedWords → CappedWord
CappedWords → CappedWord CappedWords
Example:
'Bridgestone Sports Co' →
[CompanyName [CappedWords [CappedWord 'Bridgestone'] [CappedWords [CappedWord 'Sports']]] [CompanySuffix 'Co']]
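Since these grammars are typically restricted to regular languages (see the 'To make it fast…' slide below), the CompanyName rule can be sketched as a single regular expression; the pattern is a simplification:

```python
import re

# CompanyName   -> CappedWords CompanySuffix
# CappedWords   -> one or more capitalized words
# CompanySuffix -> 'Co' | 'Inc' | 'Ltd'
COMPANY = re.compile(r"(?:[A-Z][a-z]+\s+)+(?:Co|Inc|Ltd)\b")

m = COMPANY.search("It said Bridgestone Sports Co. has set up a venture")
print(m.group())  # 'Bridgestone Sports Co'
```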

Often use semantic features: 

Often use semantic features
Use conventional non-terminals (NP, VP, PP, …), but the arguments have domain-specific semantic features:
NP(company) → CompanyName
NP(company) → GenericCompany
NP(?x) → NP(?x) Conjunction NP(?x)
'Bridgestone Co and a local concern' →
[NP(company) [NP(company) [CompanyName 'Bridgestone Co']] [Conjunction 'and'] [NP(company) [GenericCompany 'a local concern']]]

Use multi-pass bottom-up parsing: 

Use multi-pass bottom-up parsing
1. Recognize multi-word phrases, numbers, and proper names
2. Recognize simple noun phrases, verb groups, and particles
3. Recognize complex noun phrases (conjunctions, PP modifiers, …) and complex verb phrases
Don't bother to recognize full sentences (S); leave each sentence as a list of NPs, VPs, PPs, and unrecognized words.

Output of semantic parsing: 

Output of semantic parsing
[NP(company) Bridgestone Sports Co.] said Friday it has [VP(set_up) set up] [NP(joint_venture) a joint venture] in Taiwan with [NP(company) a local concern and a Japanese trading house] [VP(produce) to produce] [NP(product) golf clubs] to be shipped to Japan.
[NP(company) The joint venture, Bridgestone Sports Taiwan Co.,] [VP(capitalized) capitalized] at [NP(currency) 20 million new Taiwan dollars],
[NP(company) The joint venture, Bridgestone Sports Taiwan Co.,] will [VP(start) start production] in [NP(date) January 1990] with production of [NP(product) 20,000 iron and 'metal wood' clubs] a month.

A common approach: 

A common approach
1. Tokenization: characters → words
2. Semantic parsing: words → NPs & VPs with semantic features
3. Extraction of slot fillers: phrases → templates with some slots filled   ← next
4. Reference resolution and merger: templates → fewer templates with more slots filled

Extraction via pattern matching: 

Extraction via pattern matching
Pattern: NP(company) VP(set_up) NP(joint_venture) with NP(company)
Matching ignores extra words in the input:
[NP(company) Bridgestone Sports Co.] said Friday it has [VP(set_up) set up] [NP(joint_venture) a joint venture] in Taiwan with [NP(company) a local concern and a Japanese trading house] [VP(produce) to produce] [NP(product) golf clubs] to be shipped to Japan.
If the pattern matches:
- Create a template
- Fill selected slots with the words spanned by the matched phrases
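A sketch of the matching step over the parser's output, with phrases as (label, text) pairs; the skip-extra-words behavior is the key point, and the template layout is illustrative:

```python
# Parser output: labeled phrases plus unlabeled filler (label None).
phrases = [
    ("NP(company)", "Bridgestone Sports Co."),
    (None, "said Friday it has"),
    ("VP(set_up)", "set up"),
    ("NP(joint_venture)", "a joint venture"),
    (None, "in Taiwan with"),
    ("NP(company)", "a local concern and a Japanese trading house"),
]

# Pattern: NP(company) VP(set_up) NP(joint_venture) ... NP(company)
PATTERN = ["NP(company)", "VP(set_up)", "NP(joint_venture)", "NP(company)"]

def match(phrases, pattern):
    """Match pattern labels in order, ignoring extra material in between."""
    fillers, i = [], 0
    for label, text in phrases:
        if i < len(pattern) and label == pattern[i]:
            fillers.append(text)
            i += 1
    return fillers if i == len(pattern) else None

fillers = match(phrases, PATTERN)
if fillers:  # pattern matched: create a template and fill selected slots
    template = {"Relationship": "Tie-up",
                "Entities": {fillers[0], fillers[3]}}
    print(template)
```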

Output from extraction: 5 partially filled templates: 

Output from extraction: 5 partially filled templates
1. Relationship: Tie-up; Entities: {'Bridgestone Sports Co.', 'a local concern', 'a Japanese trading house'}
2. Activity: Production; Product: 'golf clubs'
3. Relationship: Tie-up; Joint: 'Bridgestone Sports Taiwan Co.'; Amount: NT$(20000000)
4. Activity: Production; Company: 'Bridgestone Sports Taiwan Co.'; Start date: During(January(1990))
5. Activity: Production; Product: 'iron and "metal wood" clubs'

Merger does co-reference resolution: 

Merger does co-reference resolution
Input: the 5 partially filled templates from the previous slide.
Output:
Template id: Tie-up-1
  Relationship: Tie-up
  Entities: {'Bridgestone Sports Co.', 'a local concern', 'a Japanese trading house'}
  Joint: 'Bridgestone Sports Taiwan Co.'
  Activity: Activity-1
  Amount: NT$(20000000)
Template id: Activity-1
  Activity: Production
  Company: 'Bridgestone Sports Taiwan Co.'
  Product: 'iron and "metal wood" clubs'
  Start Date: During(January(1990))
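A sketch of one simple merge rule, a deliberate simplification (real co-reference resolution uses much more evidence): merge two templates when none of their filled slots conflict, so each picks up the other's fillers:

```python
def merge(t1, t2):
    """Merge two templates if no shared slot conflicts; else return None."""
    for slot in t1.keys() & t2.keys():
        if t1[slot] != t2[slot]:
            return None  # conflicting fillers: assume different events
    return {**t1, **t2}

a = {"Relationship": "Tie-up",
     "Entities": "Bridgestone Sports Co., a local concern, ..."}
b = {"Relationship": "Tie-up",
     "Joint": "Bridgestone Sports Taiwan Co.",
     "Amount": "NT$(20000000)"}
print(merge(a, b))  # one template with all four slots filled
```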

To make it fast…: 

To make it fast…
- Restrict semantic grammars to be regular expressions
- Compile them into finite-state automata
- Cascade/stream the levels: tokenization → complex words → simple phrases → …
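A toy sketch of the cascade idea, with two compiled regex passes standing in for the real levels; the rewrite markers (_SUFFIX, [COMPANY …]) are invented for illustration:

```python
import re

# Each level is a compiled regex pass whose output feeds the next level.
levels = [
    # Level 1 (complex words): mark company suffixes.
    (re.compile(r"Co\.|Inc\.|Ltd\."),
     lambda m: m.group().rstrip(".") + "_SUFFIX"),
    # Level 2 (simple phrases): group capped words ending in a suffix.
    (re.compile(r"(?:[A-Z]\w+ )+\w+_SUFFIX"),
     lambda m: "[COMPANY " + m.group() + "]"),
]

text = "Bridgestone Sports Co. said Friday it has set up a joint venture"
for pattern, rewrite in levels:
    text = pattern.sub(rewrite, text)
print(text)
# [COMPANY Bridgestone Sports Co_SUFFIX] said Friday it has set up ...
```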

Info Extraction Summary: 

Info Extraction Summary
- Use domain-specific semantic grammars (knowledge-based development → corpus-based?)
- Use multi-stage, bottom-up recognition, skipping unrecognized words at the last stage
- Merge templates by guessing about co-referring phrases

Probabilistic NLP outline: 

Probabilistic NLP outline
- Language modelling
- Information retrieval
- Information extraction
- Text classification

Text Classification: 

Text Classification
Given some text, decide which class (if any) it is an instance of.
Often used in dialogue systems. System: 'How can I help you?'
User utterance → class:
'I need to get to Boston' → reservations
'When does my kid's flight arrive?' → schedules
'Have my bags arrived yet?' → service
'Is the lady of the house at home?' → hang up
'kill yourself' → none

Some Text Classification Methods: 

Some Text Classification Methods
Knowledge-based:
- Information extraction & semantic grammars
Corpus-based:
- Language modeling
- LSA and other vector methods

Language modeling method: 

Language modeling method
Gather data:
- Ask thousands of callers 'How can I help you?'
- Convert their answers to text
- Human coders tag each utterance:
  'I need to get to Boston' → reservations
  'When does my kid's flight arrive?' → schedules
  'Have my bags arrived yet?' → customer service
  'Is the lady of the house at home?' → hang up
Generalize:
- From counts of <tag> given <word was said>, generate a unigram model P(<tag> | <new word string>), or a bigram, trigram, etc.
- Apply smoothing
In use:
- Given a new utterance with word string wstring, choose the tag that maximizes P(<tag> | wstring)
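A minimal sketch of the unigram variant with add-one smoothing; the three training utterances stand in for the thousands of tagged calls:

```python
import math
from collections import Counter, defaultdict

# Tagged training utterances (illustrative).
data = [("i need to get to boston", "reservations"),
        ("when does my kids flight arrive", "schedules"),
        ("have my bags arrived yet", "service")]

counts = defaultdict(Counter)  # per-tag word counts
priors = Counter()             # per-tag utterance counts
vocab = set()
for text, tag in data:
    words = text.split()
    counts[tag].update(words)
    priors[tag] += 1
    vocab.update(words)

def classify(utterance):
    """Choose the tag maximizing P(tag) * prod_w P(w | tag), add-one smoothed."""
    best, best_lp = None, -math.inf
    for tag in counts:
        total = sum(counts[tag].values())
        lp = math.log(priors[tag] / len(data))
        for w in utterance.split():
            lp += math.log((counts[tag][w] + 1) / (total + len(vocab) + 1))
        if lp > best_lp:
            best, best_lp = tag, lp
    return best

print(classify("i need a flight to boston"))  # reservations
```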

LSA Method: 

LSA Method
Gather data (same as with the language modeling method):
- Ask thousands of callers 'How can I help you?'
- Convert their answers to text
- Human coders tag each utterance: 'I need to get to Boston' → reservations, etc.
Compute a function that reduces vector lengths:
- Convert each utterance to a long vector
- Do principal components analysis
- Convert each utterance's long vector to a short one
For each class:
- Average the vectors of the utterances in that class
- Each class is now represented by a single (short) vector
In use:
- Given a new utterance, compute its (short) vector
- Take the dot product with each class's vector
- Choose the class with the largest dot product
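A sketch of the same pipeline for classification, again assuming the length-reducing function is a truncated SVD; the vocabulary and utterances are toy stand-ins:

```python
import numpy as np

vocab = ["need", "get", "boston", "flight", "arrive", "bags"]

def counts(text):
    """Long word-count vector over the toy vocabulary."""
    return np.array([text.split().count(w) for w in vocab], dtype=float)

# Tagged utterances (illustrative).
data = [("need get boston", "reservations"),
        ("flight arrive", "schedules"),
        ("bags arrive", "service")]

# Length-reducing function from truncated SVD of the count matrix (k = 2).
X = np.stack([counts(t) for t, _ in data], axis=1)  # words x utterances
U, _, _ = np.linalg.svd(X, full_matrices=False)

def project(v):
    return U[:, :2].T @ v

# Average the short vectors of the utterances in each class.
classes = {}
for text, tag in data:
    classes.setdefault(tag, []).append(project(counts(text)))
centroids = {tag: np.mean(vs, axis=0) for tag, vs in classes.items()}

# In use: classify a new utterance by largest dot product.
q = project(counts("need flight boston"))
print(max(centroids, key=lambda tag: float(q @ centroids[tag])))
```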

Questions?: 

Questions?
- Language modelling
- Information retrieval
- Information extraction
- Text classification
