LINGUIST 180: Introduction to Computational Linguistics: LINGUIST 180: Introduction to Computational Linguistics
Dan Jurafsky
Lecture 3: Dialogue and Conversational Agents (part 1)
Outline: Outline The Linguistics of Conversation
Basic Conversational Agents
ASR
NLU
Generation
Dialogue Manager
Dialogue Manager Design
Finite State
Frame-based
Initiative: User, System, Mixed
VoiceXML
Information-State
Dialogue-Act Detection
Dialogue-Act Generation
Conversational Agents: Conversational Agents AKA:
Spoken Language Systems
Dialogue Systems
Speech Dialogue Systems
Applications:
Travel arrangements (Amtrak, United airlines)
Telephone call routing
Tutoring
Communicating with robots
Anything with limited screen/keyboard
A travel dialog: Communicator: A travel dialog: Communicator
Call routing: ATT HMIHY: Call routing: ATT HMIHY
A tutorial dialogue: ITSPOKE: A tutorial dialogue: ITSPOKE
Linguistics of Human Conversation: Linguistics of Human Conversation Turn-taking
Speech Acts
Grounding
Conversational Structure
Implicature
Turn-taking: Turn-taking Dialogue is characterized by turn-taking.
A:
B:
A:
B:
…
Resource allocation problem:
How do speakers know when to take the floor?
Total amount of overlap relatively small (5% - Levinson 1983)
Speakers don’t leave long pauses between turns either
Must be a way to know who should talk and when.
Turn-taking rules: Turn-taking rules At each transition-relevance place of each turn:
a. If during this turn the current speaker has selected B as the next speaker then B must speak next.
b. If the current speaker does not select the next speaker, any other speaker may take the next turn.
c. If no one else takes the next turn, the current speaker may take the next turn.
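As a rough illustration, the three subrules can be written as one small decision function. This is only a sketch: the name next_speaker and its arguments are hypothetical, and real turn-taking depends on timing and prosodic cues that are ignored here.

```python
from typing import Optional, Sequence


def next_speaker(current: str,
                 selected: Optional[str],
                 volunteers: Sequence[str]) -> str:
    """Pick the next speaker at a transition-relevance place.

    current:    who holds the floor now
    selected:   the speaker chosen by `current`, if any (subrule a)
    volunteers: participants who try to self-select, in order (subrule b)
    """
    if selected is not None:
        # Subrule a: the selected speaker must speak next.
        return selected
    others = [v for v in volunteers if v != current]
    if others:
        # Subrule b: any other speaker may take the next turn.
        return others[0]
    # Subrule c: no one else took the turn, so the current speaker may go on.
    return current


print(next_speaker("A", "B", []))       # subrule a applies
print(next_speaker("A", None, ["C"]))   # subrule b applies
print(next_speaker("A", None, []))      # subrule c applies
```

The ordering of the checks mirrors the ordering of the subrules: b is only considered when a does not apply, and c only when b does not.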
Implications of subrule a: Implications of subrule a For some utterances the current speaker selects the next speaker
Adjacency pairs
Question/answer
Greeting/greeting
Compliment/downplayer
Request/grant
Silence between the 2 parts of an adjacency pair is different from silence after the pair is complete
A: Is there something bothering you or not?
(1.0)
A: Yes or no?
(1.5)
A: Eh
B: No.
Speech Acts: Speech Acts Austin (1962): An utterance is a kind of action
Clear case: performatives
I name this ship the Titanic
I second that motion
I bet you five dollars it will snow tomorrow
Performative verbs (name, second)
Austin’s idea: not just these verbs
Each utterance is 3 acts: Each utterance is 3 acts Locutionary act: the utterance of a sentence with a particular meaning
Illocutionary act: the act of asking, answering, promising, etc., in uttering a sentence.
Perlocutionary act: the (often intentional) production of certain effects upon the thoughts, feelings, or actions of addressee in uttering a sentence.
Illocutionary and perlocutionary: Illocutionary and perlocutionary “You can’t do that!”
Illocutionary force:
Protesting
Perlocutionary force:
Intent to annoy addressee
Intent to stop addressee from doing something
The 3 levels of act revisited: The 3 levels of act revisited
Illocutionary Acts: Illocutionary Acts What are they?
5 classes of speech acts: Searle (1975): 5 classes of speech acts: Searle (1975) Assertives: committing the speaker to something’s being the case (suggesting, putting forward, swearing, boasting, concluding)
Directives: attempts by the speaker to get the addressee to do something (asking, ordering, requesting, inviting, advising, begging)
Commissives: committing the speaker to some future course of action (promising, planning, vowing, betting, opposing).
Expressives: expressing the psychological state of the speaker about a state of affairs (thanking, apologizing, welcoming, deploring).
Declarations: bringing about a different state of the world via the utterance (I resign; You’re fired)
Grounding: Grounding Dialogue is a collective act performed by speaker and hearer
Common ground: set of things mutually believed by both speaker and hearer
Need to achieve common ground, so hearer must ground or acknowledge speaker’s utterance.
Clark (1996):
Principle of closure. Agents performing an action require evidence, sufficient for current purposes, that they have succeeded in performing it
(Interestingly, Clark points out that this idea draws from Norman’s (1988) work on non-linguistic acts)
Need to know whether an action succeeded or failed
Clark and Schaefer: Grounding: Clark and Schaefer: Grounding Continued attention: B continues attending to A
Relevant next contribution: B starts in on next relevant contribution
Acknowledgement: B nods, says a continuer like uh-huh or yeah, or gives an assessment (great!)
Demonstration: B demonstrates understanding of A’s contribution by paraphrasing or reformulating it, or by collaboratively completing A’s utterance
Display: B displays verbatim all or part of A’s presentation
A human-human conversation: A human-human conversation
Grounding examples: Grounding examples Display:
C: I need to travel in May
A: And, what day in May did you want to travel?
Acknowledgement
C: He wants to fly from Boston
A: mm-hmm
C: to Baltimore Washington International
[Mm-hmm (usually transcribed “uh-huh”) is a backchannel, continuer, or acknowledgement token]
Grounding Examples (2): Grounding Examples (2) Acknowledgement + next relevant contribution
And, what day in May did you want to travel?
And you’re flying into what city?
And what time would you like to leave?
The “and” indicates to the client that the agent has successfully understood the answer to the last question.
Grounding negative responses From Cohen et al. (2004): Grounding negative responses From Cohen et al. (2004) System: Did you want to review some more of your personal profile?
Caller: No.
System: Okay, what’s next?
System: Did you want to review some more of your personal profile?
Caller: No.
System: What’s next?
Grounding and Dialogue Systems: Grounding and Dialogue Systems Grounding is not just a tidbit about humans
Is key to design of conversational agent
Why?
HCI researchers find users of speech-based interfaces are confused when system doesn’t give them an explicit acknowledgement signal
Stifelman et al. (1993), Yankelovich et al. (1995)
Conversational Structure: Conversational Structure Telephone conversations
Stage 1: Enter a conversation
Stage 2: Identification
Stage 3: Establish joint willingness to converse
Stage 4: First topic is raised, usually by caller
Why is this customer confused?: Why is this customer confused? Customer: (rings)
Operator: Directory Enquiries, for which town please?
Customer: Could you give me the phone number of um: Mrs. um: Smithson?
Operator: Yes, which town is this at please?
Customer: Huddleston.
Operator: Yes. And the name again?
Customer: Mrs. Smithson
Conversational Implicature: Conversational Implicature A: And, what day in May did you want to travel?
C: OK, uh, I need to be there for a meeting that’s from the 12th to the 15th.
Note that client did not answer question.
Meaning of client’s sentence:
Meeting
Start-of-meeting: 12th
End-of-meeting: 15th
Doesn’t say anything about flying!!!!!
What is it that licenses agent to infer that client is mentioning this meeting so as to inform the agent of the travel dates?
Conversational Implicature (2): Conversational Implicature (2) A: … there’s 3 non-stops today.
This would still be true if there were 7 non-stops today.
But no, the agent means: 3 and only 3.
How can client infer that agent means:
only 3
Grice: conversational implicature: Grice: conversational implicature Implicature means a particular class of licensed inferences.
Grice (1975) proposed that what enables hearers to draw correct inferences is:
Cooperative Principle
This is a tacit agreement by speakers and listeners to cooperate in communication
4 Gricean Maxims: 4 Gricean Maxims Relevance: Be relevant
Quantity: Do not make your contribution more or less informative than required
Quality: try to make your contribution one that is true (don’t say things that are false or for which you lack adequate evidence)
Manner: Avoid ambiguity and obscurity; be brief and orderly
Relevance: Relevance A: Is Regina here?
B: Her car is outside.
Implication: yes
Hearer thinks: why would he mention the car? It must be relevant. How could it be relevant? If her car is here, she is probably here too.
Client: I need to be there for a meeting that’s from the 12th to the 15th
Hearer thinks: Speaker is following the maxims, so would only have mentioned the meeting if it was relevant. How could the meeting be relevant? If the client meant me to understand that he had to depart in time for the meeting.
Quantity: Quantity A: How much money do you have on you?
B: I have 5 dollars
Implication: not 6 dollars
Similarly, 3 non-stops can’t mean 7 non-stops (hearer thinks:
if speaker meant 7 non-stops she would have said 7 non-stops)
A: Did you do the reading for today’s class?
B: I intended to
Implication: No
B’s answer would be true if B intended to do the reading AND did the reading, but would then violate the maxim of Quantity (B would be saying less than is required)
Dialogue System Architecture: Dialogue System Architecture
Speech recognition: Speech recognition Or ASR (Automatic Speech Recognition)
Speech to words
Input: acoustic waveform
Output: string of words
We’ll introduce the algorithms in week 10
Basic components:
a recognizer for phones, small sound units like [k] or [ae].
a pronunciation dictionary like cat = [k ae t]
a grammar telling us what words are likely to follow what words
a search algorithm to find the best string of words
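To make the division of labour concrete, here is a toy sketch of how a pronunciation dictionary, a word grammar, and a search fit together. It assumes the phone recognizer has already produced a clean phone string, which real recognizers do not. The lexicon entries, the bigram probabilities, and the function name decode are all invented for illustration.

```python
import math
from functools import lru_cache

# Pronunciation dictionary: word -> phones (toy entries).
LEXICON = {
    "i": ("ay",),
    "ice": ("ay", "s"),
    "scream": ("s", "k", "r", "iy", "m"),
    "cream": ("k", "r", "iy", "m"),
}

# Grammar: log-probability of a word given the previous word (made-up numbers).
BIGRAM = {
    ("<s>", "ice"): math.log(0.6),
    ("ice", "cream"): math.log(0.9),
    ("<s>", "i"): math.log(0.4),
    ("i", "scream"): math.log(0.1),
}
UNSEEN = math.log(1e-6)  # penalty for word pairs the grammar has never seen


def decode(phones):
    """Search for the most probable word string covering the phone sequence."""
    phones = tuple(phones)

    @lru_cache(maxsize=None)
    def best(pos, prev):
        if pos == len(phones):
            return (0.0, ())
        candidates = []
        for word, pron in LEXICON.items():
            end = pos + len(pron)
            if phones[pos:end] == pron:
                rest_score, rest_words = best(end, word)
                score = BIGRAM.get((prev, word), UNSEEN) + rest_score
                candidates.append((score, (word,) + rest_words))
        # A position with no matching word is a dead end.
        return max(candidates) if candidates else (-math.inf, ())

    return best(0, "<s>")


score, words = decode(["ay", "s", "k", "r", "iy", "m"])
print(words)
```

The same phones can be segmented as "i scream" or "ice cream"; the dictionary proposes both and the grammar scores decide between them.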
Natural Language Understanding: Natural Language Understanding Or “NLU”
Or “Computational semantics”
There are many ways to represent the meaning of sentences
For speech dialogue systems, most common is “Frame and slot semantics”.
An example of a frame: An example of a frame Show me morning flights from Boston to SF on Tuesday.
SHOW:
  FLIGHTS:
    ORIGIN:
      CITY: Boston
      DATE: Tuesday
      TIME: morning
    DEST:
      CITY: San Francisco
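One simple way to hold such a frame in a program is a nested dictionary. The sketch below assumes the nesting SHOW > FLIGHTS > ORIGIN/DEST, with DATE and TIME attached to ORIGIN; the helper get_slot is a hypothetical name, not part of any particular system.

```python
from typing import Any, Dict, Sequence

# The frame for "Show me morning flights from Boston to SF on Tuesday."
frame: Dict[str, Any] = {
    "SHOW": {
        "FLIGHTS": {
            "ORIGIN": {"CITY": "Boston", "DATE": "Tuesday", "TIME": "morning"},
            "DEST": {"CITY": "San Francisco"},
        }
    }
}


def get_slot(frame: Dict[str, Any], path: Sequence[str]) -> Any:
    """Follow a path of slot names down into the nested frame."""
    node: Any = frame
    for key in path:
        node = node[key]
    return node


print(get_slot(frame, ["SHOW", "FLIGHTS", "DEST", "CITY"]))
```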
How to generate this semantics?: How to generate this semantics? Many methods,
Simplest: “semantic grammars”
We’ll come back to these after we’ve seen parsing.
But a quick teaser for those of you who might have already seen parsing:
CFG in which the LHS of rules is a semantic category:
LIST -> show me | I want | can I see|…
DEPARTTIME -> (after|around|before) HOUR | morning | afternoon | evening
HOUR -> one|two|three…|twelve (am|pm)
FLIGHTS -> (a) flight|flights
ORIGIN -> from CITY
DESTINATION -> to CITY
CITY -> Boston | San Francisco | Denver | Washington
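For a feel of how such rules label a sentence, the grammar above can be approximated with regular expressions. This is a sketch rather than a real CFG parser. It assumes the parenthesised "(am|pm)" and "(a)" are optional, spells out the hours that the HOUR rule abbreviates, and uses hypothetical names (RULES, parse).

```python
import re

CITY = r"(?:boston|san francisco|denver|washington)"
HOUR = (r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve)"
        r"(?: (?:am|pm))?")

# One regular expression per semantic category, mirroring the rules above.
RULES = [
    ("LIST", r"show me|i want|can i see"),
    ("FLIGHTS", r"flights|(?:a )?flight"),
    ("ORIGIN", rf"from {CITY}"),
    ("DESTINATION", rf"to {CITY}"),
    ("DEPARTTIME", rf"(?:after|around|before) {HOUR}|morning|afternoon|evening"),
]
COMPILED = [(name, re.compile(rf"\b(?:{body})\b")) for name, body in RULES]


def parse(sentence):
    """Scan left to right, labelling every span that some rule matches."""
    text = sentence.lower()
    labels = []
    pos = 0
    while pos < len(text):
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            if m:
                labels.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            # No rule starts here: skip ahead to the next word.
            nxt = text.find(" ", pos)
            pos = len(text) if nxt == -1 else nxt + 1
    return labels


for label in parse("Show me flights from Boston to San Francisco on Tuesday morning"):
    print(label)
```

Because the rules listed above have no date category, "on Tuesday" is skipped here; a fuller grammar would add a rule for it.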
Semantics for a sentence: Semantics for a sentence
LIST      FLIGHTS   ORIGIN        DESTINATION        DEPARTDATE   DEPARTTIME
Show me   flights   from Boston   to San Francisco   on Tuesday   morning
Generation and TTS: Generation and TTS Generation component
Chooses concepts to express to user
Plans out how to express these concepts in words
Assigns any necessary prosody to the words
TTS component
Takes words and prosodic annotations
Synthesizes a waveform
Generation Component: Generation Component Content Planner
Decides what content to express to user
(ask a question, present an answer, etc)
Often merged with dialogue manager
Language Generation
Chooses syntactic structures and words to express meaning.
Simplest method
All words in sentence are prespecified!
“Template-based generation”
Can have variables:
What time do you want to leave CITY-ORIG?
Will you return to CITY-ORIG from CITY-DEST?
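Template-based generation is little more than string substitution. A minimal sketch follows, using the two prompts above. The template names and the generate function are hypothetical, and the variables are written CITY_ORIG and CITY_DEST because Python keyword arguments cannot contain hyphens.

```python
# Prespecified sentences with variables to be filled in at run time.
TEMPLATES = {
    "ask_depart_time": "What time do you want to leave {CITY_ORIG}?",
    "confirm_return": "Will you return to {CITY_ORIG} from {CITY_DEST}?",
}


def generate(template_name: str, **slots: str) -> str:
    """Fill the named template with the given slot values."""
    return TEMPLATES[template_name].format(**slots)


print(generate("ask_depart_time", CITY_ORIG="Boston"))
print(generate("confirm_return", CITY_ORIG="Boston", CITY_DEST="Denver"))
```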
More sophisticated language generation component: More sophisticated language generation component Natural Language Generation
This is a field, like Parsing, or Natural Language Understanding, or Speech Synthesis, with its own (small) conference
Approach:
Dialogue manager builds representation of meaning of utterance to be expressed
Passes this to a “generator”
Generators have three components
Sentence planner
Surface realizer
Prosody assigner
Architecture of a generator for a dialogue system (after Walker and Rambow 2002): Architecture of a generator for a dialogue system (after Walker and Rambow 2002)
HCI constraints on generation for dialogue: “Coherence”: HCI constraints on generation for dialogue: “Coherence” Discourse markers and pronouns (“Coherence”):
(1) Please say the date.
…
Please say the start time.
…
Please say the duration…
…
Please say the subject…
(2) First, tell me the date.
…
Next, I’ll need the time it starts.
…
Thanks. Now, how long is it supposed to last?
…
Last of all, I just need a brief description
HCI constraints on generation for dialogue: coherence (II): tapered prompts: HCI constraints on generation for dialogue: coherence (II): tapered prompts Prompts which get incrementally shorter:
System: Now, what’s the first company to add to your watch list?
Caller: Cisco
System: What’s the next company name? (Or, you can say, “Finished”)
Caller: IBM
System: Tell me the next company name, or say, “Finished.”
Caller: Intel
System: Next one?
Caller: America Online.
System: Next?
Caller: …
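Tapering can be sketched as a list of prompts indexed by how many times the question has already been asked. The prompt wording is taken from the example above; the names TAPERED_PROMPTS and prompt_for are assumptions made for this sketch.

```python
# Prompts ordered from most to least explicit.
TAPERED_PROMPTS = [
    "Now, what's the first company to add to your watch list?",
    'What\'s the next company name? (Or, you can say, "Finished")',
    'Tell me the next company name, or say, "Finished."',
    "Next one?",
    "Next?",
]


def prompt_for(turn: int) -> str:
    """Return the prompt for the given turn; stay on the shortest one after that."""
    return TAPERED_PROMPTS[min(turn, len(TAPERED_PROMPTS) - 1)]


for turn in range(6):
    print(prompt_for(turn))
```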
Dialogue Manager: Dialogue Manager Controls the architecture and structure of dialogue
Takes input from ASR/NLU components
Maintains some sort of state
Interfaces with Task Manager
Passes output to NLG/TTS modules
Four architectures for dialogue management: Four architectures for dialogue management Finite State
Frame-based
Information State (including a probabilistic version based on Markov Decision Processes)
AI Planning
Finite-State Dialogue Mgmt: Finite-State Dialogue Mgmt Consider a trivial airline travel system
Ask the user for a departure city
For a destination city
For a time
Whether the trip is round-trip or not
Finite State Dialogue Manager: Finite State Dialogue Manager
Finite-state dialogue managers: Finite-state dialogue managers System completely controls the conversation with the user.
It asks the user a series of questions
Ignoring (or misinterpreting) anything the user says that is not a direct answer to the system’s questions
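A finite-state manager of this kind can be sketched in a few lines: a fixed sequence of states, each with one prompt, visited in order. The prompt wording, the slot names, and the function run_fsm are assumptions for illustration; user input is simulated with a scripted list.

```python
from typing import Callable, Dict, List, Tuple

# Each state is (slot name, prompt); the system visits them in this fixed order.
STATES: List[Tuple[str, str]] = [
    ("origin", "What city are you leaving from?"),
    ("destination", "Where are you going?"),
    ("time", "What time would you like to leave?"),
    ("round_trip", "Is this a round trip?"),
]


def run_fsm(get_answer: Callable[[str], str]) -> Dict[str, str]:
    """Ask every question in order and store whatever the user says."""
    answers: Dict[str, str] = {}
    for slot, prompt in STATES:
        # Whatever comes back is treated as the answer to this question,
        # even if the user was really talking about something else.
        answers[slot] = get_answer(prompt)
    return answers


scripted = iter(["Boston", "Denver", "9 am", "yes"])
result = run_fsm(lambda prompt: next(scripted))
print(result)
```

Note that nothing here can cope with a user who answers two questions at once; the second answer would simply be stored in the wrong slot or lost.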
Dialogue Initiative: Dialogue Initiative Systems that control conversation like this are system initiative or single initiative.
“Initiative”: who has control of conversation
In normal human-human dialogue, initiative shifts back and forth between participants.
System Initiative: System Initiative Systems which completely control the conversation at all times are called system initiative.
Advantages:
Simple to build
User always knows what they can say next
System always knows what user can say next
Known words: Better performance from ASR
Known topic: Better performance from NLU
Ok for VERY simple tasks (entering a credit card, or login name and password)
Disadvantage:
Too limited
User Initiative: User Initiative User directs the system
Generally, user asks a single question, system answers
System can’t ask questions back, engage in clarification dialogue, confirmation dialogue
Used for simple database queries
User asks question, system gives answer
Web search is user initiative dialogue.
Problems with System Initiative: Problems with System Initiative Real dialogue involves give and take!
In travel planning, users might want to say something that is not the direct answer to the question.
For example answering more than one question in a sentence:
Hi, I’d like to fly from Seattle Tuesday morning
I want a flight from Milwaukee to Orlando one way leaving after 5 p.m. on Wednesday.
Single initiative + universals: Single initiative + universals We can give users a little more flexibility by adding universal commands
Universals: commands you can say anywhere
As if we augmented every state of FSA with these
Help
Start over
Correct
This describes many implemented systems
But still doesn’t allow user to say what they want to say
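Adding universals amounts to checking for a few commands in every state before treating the input as an answer. The sketch below assumes that "Correct" means going back one question, which the slide does not specify; the function name and the event log are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def run_with_universals(states: Sequence[Tuple[str, str]],
                        utterances: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Finite-state dialogue in which every state also accepts the universals."""
    answers: Dict[str, str] = {}
    events: List[str] = []
    i = 0
    for said in utterances:
        if i >= len(states):
            break
        slot, _prompt = states[i]
        command = said.strip().lower()
        if command == "help":
            events.append(f"help:{slot}")      # stay in the same state
        elif command == "start over":
            answers.clear()
            i = 0
            events.append("restart")
        elif command == "correct":
            i = max(i - 1, 0)                  # assumed: redo the previous question
            answers.pop(states[i][0], None)
            events.append("correct")
        else:
            answers[slot] = said
            i += 1
    return answers, events


states = [("origin", "What city are you leaving from?"),
          ("destination", "Where are you going?")]
turns = ["Boston", "help", "correct", "Denver", "start over", "Seattle", "Miami"]
print(run_with_universals(states, turns))
```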
Mixed Initiative: Mixed Initiative Conversational initiative can shift between system and user
Simplest kind of mixed initiative: use the structure of the frame itself to guide dialogue
Slot        Question
ORIGIN      What city are you leaving from?
DEST        Where are you going?
DEPT DATE   What day would you like to leave?
DEPT TIME   What time would you like to leave?
AIRLINE     What is your preferred airline?
Frames are mixed-initiative: Frames are mixed-initiative User can answer multiple questions at once.
System asks questions of user, filling any slots that user specifies
When frame is filled, do database query
If user answers 3 questions at once, system has to fill slots and not ask these questions again!
Anyhow, we avoid the strict constraints on order of the finite-state architecture.
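The control loop for a frame-based system can be sketched as: ask about the first empty slot, fill every slot the reply mentions, repeat. The toy keyword patterns below stand in for a real NLU component, and the city and airline lists, CUES, extract, and run_frame_dialogue are all assumptions made for this sketch.

```python
import re
from typing import Dict, List, Optional, Sequence, Tuple

QUESTIONS = {
    "ORIGIN": "What city are you leaving from?",
    "DEST": "Where are you going?",
    "DEPT DATE": "What day would you like to leave?",
    "DEPT TIME": "What time would you like to leave?",
    "AIRLINE": "What is your preferred airline?",
}

# Toy NLU: one value pattern per slot, plus cue words for the two city slots.
CITY = r"(boston|denver|milwaukee|orlando|seattle)"
VALUES = {
    "ORIGIN": CITY,
    "DEST": CITY,
    "DEPT DATE": r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)",
    "DEPT TIME": r"(morning|afternoon|evening)",
    "AIRLINE": r"(united|delta)",
}
CUES = {"ORIGIN": "from ", "DEST": "to "}


def extract(utterance: str, asked: str) -> Dict[str, str]:
    """Return every slot value found in the utterance."""
    text = utterance.lower()
    found: Dict[str, str] = {}
    for slot, value in VALUES.items():
        m = re.search(rf"\b{CUES.get(slot, '')}{value}\b", text)
        if m:
            found[slot] = m.group(1)
    if not found:
        # A bare answer such as "Boston" fills the slot that was just asked.
        m = re.search(rf"\b{VALUES[asked]}\b", text)
        if m:
            found[asked] = m.group(1)
    return found


def run_frame_dialogue(user_turns: Sequence[str]) -> Tuple[Dict[str, Optional[str]], List[str]]:
    frame: Dict[str, Optional[str]] = {slot: None for slot in QUESTIONS}
    asked: List[str] = []
    turns = iter(user_turns)
    while any(value is None for value in frame.values()):
        slot = next(s for s, value in frame.items() if value is None)
        asked.append(slot)                 # the system asks QUESTIONS[slot]
        reply = next(turns, None)
        if reply is None:                  # the scripted user has nothing more to say
            break
        frame.update(extract(reply, slot))
    return frame, asked


frame, asked = run_frame_dialogue([
    "I want a flight from Milwaukee to Orlando one way leaving after 5 p.m. on Wednesday.",
    "evening",
    "United",
])
print(asked)
print(frame)
```

Because the first reply fills three slots at once, the system only asks three of the five questions.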
Multiple frames: Multiple frames flights, hotels, rental cars
Flight legs: Each flight can have multiple legs, which might need to be discussed separately
Presenting the flights (if there are multiple flights meeting the user’s constraints)
It has slots like 1ST_FLIGHT or 2ND_FLIGHT so user can ask “how much is the second one”
General route information:
Which airlines fly from Boston to San Francisco
Airfare practices:
Do I have to stay over Saturday to get a decent airfare?
Multiple Frames: Multiple Frames Need to be able to switch from frame to frame
Based on what user says.
Disambiguate which slot of which frame an input is supposed to fill, then switch dialogue control to that frame.
Main implementation: production rules
Different types of inputs cause different productions to fire
Each of which can flexibly fill in different frames
Can also switch control to different frame
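A production-rule controller can be sketched as a list of (condition, frame, slot) triples: every rule whose condition matches the input fires, fills its slot, and moves control to its frame. The rules, frame names, and slot names below are invented for illustration and are far simpler than those of a deployed system.

```python
import re

# Each production: (pattern on the input, frame it belongs to, slot it fills).
PRODUCTIONS = [
    (re.compile(r"\bfly to (\w+)"), "flight", "DEST"),
    (re.compile(r"\bhotel in (\w+)"), "hotel", "CITY"),
    (re.compile(r"\brent an? (\w+)"), "car", "CAR_TYPE"),
]


def new_state():
    """No frame is active yet; all frames start empty."""
    return {"active": None, "frames": {"flight": {}, "hotel": {}, "car": {}}}


def step(state, utterance):
    """Fire every production that matches and switch control to its frame."""
    text = utterance.lower()
    for pattern, frame, slot in PRODUCTIONS:
        m = pattern.search(text)
        if m:
            state["frames"][frame][slot] = m.group(1)
            state["active"] = frame
    return state


state = new_state()
step(state, "I need a hotel in Denver")
print(state["active"], state["frames"])
step(state, "Actually I also want to fly to Boston")
print(state["active"], state["frames"])
```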
Defining Mixed Initiative: Defining Mixed Initiative Mixed Initiative could mean
User can arbitrarily take or give up initiative in various ways
This is really only possible in very complex plan-based dialogue systems
No commercial implementations
Important research area
Something simpler and quite specific which we will define in the next few slides
True Mixed Initiative: True Mixed Initiative
How mixed initiative is usually defined: How mixed initiative is usually defined First we need to define two other factors
Open prompts vs. directive prompts
Restrictive versus non-restrictive grammar
Open vs. Directive Prompts: Open vs. Directive Prompts Open prompt
System gives user very few constraints
User can respond how they please:
“How may I help you?” “How may I direct your call?”
Directive prompt
Explicitly instructs user how to respond
“Say yes if you accept the call; otherwise, say no”
Restrictive vs. Non-restrictive grammars: Restrictive vs. Non-restrictive grammars Restrictive grammar
Language model which strongly constrains the ASR system, based on dialogue state
Non-restrictive grammar
Open language model which is not restricted to a particular dialogue state
Definition of Mixed Initiative: Definition of Mixed Initiative
VoiceXML: VoiceXML Voice eXtensible Markup Language
An XML-based dialogue design language
Makes use of ASR and TTS
Deals well with simple, frame-based mixed initiative dialogue.
Most common in commercial world (too limited for research systems)
But useful to get a handle on the concepts.
Voice XML: Voice XML Each dialogue is a . (Form is the VoiceXML word for frame)
Each generally consists of a sequence of s, with other commands
Sample vxml doc: Sample vxml doc
Please choose airline, hotel, or rental car.
[airline hotel "rental car"]
You have chosen .
VoiceXML interpreter: VoiceXML interpreter Walks through a VXML form in document order
Iteratively selecting each item
If multiple fields, visit each one in order.
Special commands for events
Another vxml doc (1): Another vxml doc (1)
I'm sorry, I didn't hear you.
- “noinput” means silence exceeds a timeout threshold
I'm sorry, I didn't understand that.
- “nomatch” means confidence value for utterance is too low
- notice “reprompt” command
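The interpreter's behaviour can be imitated in Python to show the control flow: visit the fields in document order, and on a noinput or nomatch event play the handler message and reprompt. This is a loose sketch, not the VoiceXML form interpretation algorithm itself. Here nomatch is simplified to "utterance not in the field's grammar" rather than a confidence threshold, silence is represented by None, and the field names and the function interpret_form are hypothetical.

```python
NOINPUT_MSG = "I'm sorry, I didn't hear you."
NOMATCH_MSG = "I'm sorry, I didn't understand that."

CITIES = {"san francisco", "denver", "new york", "barcelona"}

# Fields in document order: (name, prompt, grammar).
FIELDS = [
    ("origin", "Which city do you want to leave from?", CITIES),
    ("destination", "And which city do you want to go to?", CITIES),
]


def interpret_form(fields, inputs):
    """Fill each field in order; return the values and what the system said."""
    values = {}
    transcript = []
    turns = iter(inputs)
    for name, prompt, grammar in fields:
        while name not in values:
            transcript.append(prompt)            # prompt, or reprompt after an event
            try:
                said = next(turns)
            except StopIteration:
                return values, transcript        # simulated caller hung up
            if said is None:                     # noinput: silence past the timeout
                transcript.append(NOINPUT_MSG)
            elif said.lower() not in grammar:    # nomatch: nothing in the grammar fits
                transcript.append(NOMATCH_MSG)
            else:
                values[name] = said.lower()
    return values, transcript


values, transcript = interpret_form(FIELDS, [None, "Paris", "Denver", "Barcelona"])
print(values)
for line in transcript:
    print(line)
```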
Another vxml doc (2): Another vxml doc (2)
Welcome to the air travel consultant.
Which city do you want to leave from?
[(san francisco) denver (new york) barcelona]
OK, from
- “filled” tag is executed by interpreter as soon as field filled by user
Another vxml doc (3): Another vxml doc (3)
And which city do you want to go to?
[(san francisco) denver (new york) barcelona]
OK, to
And what date do you want to leave?
OK, on
Another vxml doc (4): Another vxml doc (4)
OK, I have you are departing from
to on
send the info to book a flight...
Summary: VoiceXML: Summary: VoiceXML Voice eXtensible Markup Language
An XML-based dialogue design language
Makes use of ASR and TTS
Deals well with simple, frame-based mixed initiative dialogue.
Most common in commercial world (too limited for research systems)
But useful to get a handle on the concepts.
Summary: Summary The Linguistics of Conversation
Basic Conversational Agents
ASR
NLU
Generation
Dialogue Manager
Dialogue Manager Design
Finite State
Frame-based
Initiative: User, System, Mixed
VoiceXML
Information-State
Dialogue-Act Detection
Dialogue-Act Generation