Wed, Feb 23: Wed, Feb 23 CS1573 of Spring 05
Kurt VanLehn
Course Outline: Course Outline Knowledge representation
Search
Natural Language Processing
Classical NLU
Probabilistic NLU (2 classes)
Dialogue systems (2 classes)
Planning
Probabilistic NLP outline: Probabilistic NLP outline Language modelling
Information retrieval
Information extraction
Text classification
Information Retrieval: Information Retrieval Already covered
The problem: Given query, return documents
Language modelling method
Recall andamp; Precision
Presentation ordering
To be covered today
Vector space methods
Latent semantic analysis (LSA)
Vector space method for IR: Vector space method for IR Each document andamp; query is treated as a bag of words, and occurrences are counted
'Monkey see, monkey do' {monkey:2, do:1, see:1}
Think of it as a sparse vector: [0, 0, …, 1, …., 2,…, 1, ….0]
Number of 'Aardvark' occurrences Number of 'do' occurrences Number of 'monkey' occurrences Number of 'see' occurrences Number of 'zebra'
Measure similarity of two documents via dot product: Measure similarity of two documents via dot product Dot product of two vectors
V1 V2 = length(V1) * length(V2) * cosine(andlt;angle between V1 and V2andgt;)
V1 V2 = ddimension (component(V1,d)*component(V2,d)
Dot product of two documents
D1 D2 = wlexicon (count(D1,w)*count(D2,w)
'monkey see, monkey do' 'monkey' = 2
'monkey see, monkey do' 'see' = 1
'monkey see, monkey do' 'monkey monkey see' = 5
IR via vector spaces: IR via vector spaces Precompute the vector for each document
Given a query, compute its vector
Take dot product of query with all document vectors
Sort documents, highest dot product first
Return first N documents
Problem: No representation of semantic similarity: Problem: No representation of semantic similarity 'monkey see, monkey do' 'monkeys' = 0
'monkey see, monkey do' 'vision' = 0
Vectors have too many zeros
Similar documents have too low a dot product
Latent Semantic Analysis: Latent Semantic Analysis Reduce the number of dimensions via principle component analysis
Intuitively, this replaces a set of words that tend to co-occur with a single new dimension that represents the set
If 'see' and 'vision' tend to occur together, then replace their 2 slots in the vector with 1 slot
Each document now represented by short vector
Length shrinks from ~15,000 to ~400
Most 'counts' become small real numbers instead of 0
Using LSA for IR: Using LSA for IR Given a (large) sample of the document collection, compute principle components
Yields a function that inputs a long, word-count vector and outputs a short vector
Apply the function to every document’s vector
Represent the document by its the new (short) vector
Given a query,
apply the function to the query
compute dot products with all document vectors
Return N documents with largest dot products
IR summary: IR summary Main methods
Language modeling
Vector space
LSA
Accessories
Presentation ordering
Stemming, case folding
Evaluation
Precision = Of documents retrieved, % relevant
Recall = Of relevant documents in collection, % retrieved
Information Extraction: Information Extraction Given a text
e.g,. A Wall Street Journal article
Determine if it describes a certain kind of event
Corporate merger
Terrorist attack
If so, extract information to fill the template’s slots
Mergers: Who? Where? Amount? Product?
Attacks: Type? Location? Injuries? Organization?
Add the new record to a database
A common approach: A common approach Tokenization
Characters words
Semantic parsing
Words NPs and VPs with semantic features
Extraction of slot fillers
Phrases templates with some slots filled
Reference resolution and merger
Templates fewer templates with more slots filled
Example input/output: Example input/output Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and 'metal wood' clubs a month
Template id: Tie-up-1 Relationship: Tie-up Entities: {'Bridgestone Sports Co.', 'A local concern', 'a Japanese trading house'} Joint: 'Bridgestone Sports Taiwan Co.' Activity: Activity-1 Amount: NT$(20000000)
Template id: Activity-1 Company: 'Bridgestone Sports Taiwan Co.' Product: 'iron and ‘metal wood’ clubs' Start Date: During(January(1990))
Tokenization: Tokenization Segments character string into words
'…Sports Taiwan Co., capitalized…'
'Sports'
'Taiwan'
'Co'
,
'capitalized'
Easier for English than Japanese
Blanks separate most words in English
Treats period as abbreviation instead of sentence ender
Semantic parsing is often based on hand-coded knowledge: Semantic parsing is often based on hand-coded knowledge Use domain-specific grammars
CompanyName CappedWords CompanySuffix
CompanySuffix 'Co' | 'Inc' | 'Ltd'
CappedWords CappedWord
CappedWords CappedWord CappedWords
Example
'Bridgestone Sports Co'
[CompanyName [CappedWords [CappedWord 'Bridgestone'] [CappedWords [CappedWord 'Sports']]]] [CompanySuffix 'Co']]
Often use semantic features: Often use semantic features Use conventional non-terminals, e.g., NP, VP, PP… but arguments have domain-specific semantic features
NP(company) CompanyName
NP(company) GenericCompany
NP(?x) NP(?x) Conjunction NP(?x)
'Bridgestone Co and a local concern'
[NP(company) [NP(company) [CompanyName 'Bridgestone Co']] [Conjunction 'and'] [NP(company) [GenericCompany 'a local concern']]]
Use multi-pass bottom-up parsing: Use multi-pass bottom-up parsing Recognize multi-word phrase, numbers and proper names
Recognize simple noun phrases, verb groups, and particles
Recognize complex noun phrases (conjunctions, PP modifiers, ..) and complex verb phrases
Don’t bother to recognize Ss.
Leave each sentence as a list of NPs, VPs and PPs and unrecognized words
Output of semantic parsing: Output of semantic parsing [NP(company) Bridgestone Sports Co.] said Friday it has
[VP(set_up) set up]
[NP(joint_venture) a joint venture] in Taiwan with
[NP(company) a local concern and a Japanese trading house]
[VP(produce) to produce]
[NP(product) golf clubs] to be shipped to Japan.
[NP(company) The joint venture, Bridgestone Sports Taiwan Co., ]
[VP(capitalized) capitalized] at
[NP(currency) 20 million new Taiwan dollars],
[NP(company) The joint venture, Bridgestone Sports Taiwan Co., ] will
[VP(start) start production] in
[NP(date) January 1990] with production of
[NP(product) 20,000 iron and 'metal wood' clubs] a month
A common approach: A common approach Tokenization
Characters words
Semantic parsing
Words NPs andamp; VPs with semantic features
Extraction of slot fillers
Phrases templates with some slots filled
Reference resolution and merger
Templates fewer templates with more slots filled Next
Extraction via pattern matching: Extraction via pattern matching Pattern:
NP(company) VP(set_up) NP(joint_venture) with NP(company)
Matching ignores extra words in the input
[NP(company) Bridgestone Sports Co.] said Friday it has
[VP(set_up) set up]
[NP(joint_venture) a joint venture] in Taiwan with
[NP(company) a local concern and a Japanese trading house]
[VP(produce) to produce]
[NP(product) golf clubs] to be shipped to Japan.
If the pattern matches
Create a template
Fill selected slots with words spanned by phrases matches this
Output from extraction: 5 partially filled templates: Output from extraction: 5 partially filled templates Relationship: Tie-up Entities: {'Bridgestone Sports Co.', 'a local concern', 'A Japanese trading house'}
Activity: Production Product: 'golf clubs'
Relationship: Tie-up Joint: 'Bridgestone Sports Taiwan Co.' Amount: NT$(20000000)
Activity: Production Company: 'Bridgestone Sports Taiwan Co.' Start date: during(January(1990))
Activity: Production Product: 'iron and ‘metal wood’ clubs'
Merger does co-reference resolution: Merger does co-reference resolution OUTPUT
Template id: Tie-up-1 Relationship: Tie-up Entities: {'Bridgestone Sports Co.', 'A local concern', 'a Japanese trading house'} Joint: 'Bridgestone Sports Taiwan Co.' Activity: Activity-1 Amount:NT$(20000000)
Template id: Activity-1 Activity: Production Company: 'Bridgestone Sports Taiwan Co.' Product: 'iron and ‘metal wood’ clubs' Start Date: During(January(1990))
INPUT
Relationship: Tie-up Entities: {'Bridgestone Sports Co.', 'a local concern', 'A Japanese trading house'}
Activity: Production Product: 'golf clubs'
Relationship: Tie-up Joint: 'Bridgestone Sports Taiwan Co.' Amount: NT$(20000000)
Activity: Production Company: 'Bridgestone Sports Taiwan Co.' Start date: during(January(1990))
Activity: Production Product: 'iron and ‘metal wood’ clubs'
To make it fast…: To make it fast… Restrict semantic grammars to be regular expressions
Compile into finite-state autonomata
Cascade/stream the levels
tokenization Complex words Simple phrases
Info Extraction Summary: Info Extraction Summary Use domain-specific semantic grammars
Knowledge-based development corpus-based?
Use multi-stage, bottom-up recognition, skipping words at the last stage if unrecognized
Merge templates by guessing about co-referring phrases
Probabilistic NLP outline: Probabilistic NLP outline Language modelling
Information retrieval
Information extraction
Text classification
Text Classification: Text Classification Given some text, decide which class (if any) is is an instance of
Often used in dialogue systems
System: 'How can I help you?'
User:
'I need to get to Boston' reservations
'When does my kid’s flight arrive?' schedules
'Have my bags arrived yet?' service
'Is the lady of the house at home?' hang up
'kill yourself' none
Classes
Some Text Classification Methods: Some Text Classification Methods Knowledge-based
Information extraction andamp; semantic grammars
Corpus-based
Language modeling
LSA and other vector methods
Language modeling method: Language modeling method Gather data
Ask thousands of callers 'How can I help you?'
Convert their answers to text
Human coders tag each utterance:
'I need to get to Boston' reservations
'When does my kid’s flight arrive?' schedules
'Have my bags arrived yet?' customer service
'Is the lady of the house at home?' hang up
Generalize
From counts of andlt;tagandgt; given andlt;word was saidandgt;, generate unigram model P(andlt;tagandgt; | andlt;new word stringandgt;) or bigram, trigram, etc.
Smoothing
Use
Given new utterance with words wstring
Choose tag that maximizes P(andlt;tagandgt;|wstring)
LSA Method: LSA Method Gather data (same as with language modeling method)
Ask thousands of callers 'How can I help you?'
Convert their answers to text
Human coders tag each utterance:
'I need to get to Boston' reservations
etc
Compute function that reduces vector lengths
Convert each utterance to a long vector
Do principle components analysis
Convert each utterance’s long vector to a short one
For each class,
Average the vectors of the utterances in that class
Each class now represented by a single (short) vector
In use
Given new utterance, compute its (short) vector
Take dot product with each class’s vector
Choose class with largest dot product
Questions?: Questions? Language modelling
Information retrieval
Information extraction
Text classification