DARPA NoD

Presentation Transcript

Multimodal Technology Integration for News-on-Demand: 

SRI International
News-on-Demand Compare & Contrast
DARPA
September 30, 1998

Personnel: 

- Speech: Dilek Hakkani, Madelaine Plauche, Zev Rivlin, Ananth Sankar, Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gokhan Tur
- Natural language: David Israel, David Martin, John Bear
- Video analysis: Bob Bolles, Marty Fischler, Marsha Jo Hannah, Bikash Sabata
- OCR: Greg Myers, Ken Nitz
- Architectures: Luc Julia, Adam Cheyer

SRI News-on-Demand Highlights: 

- Focus on technologies
- New technologies: scene tracking, speaker tracking, flash detection, sentence segmentation
- Exploit technology fusion
- MAESTRO multimedia browser

Outline: 

- Goals for News-on-Demand
- Component Technologies
- The MAESTRO testbed
- Information Fusion
- Prosody for Information Extraction
- Future Work
- Summary

High-level Goal: 

Develop techniques to provide direct and natural access to a large database of information sources through multiple modalities, including video, audio, and text.

Information We Want: 

- Geographical location
- Topic of the story
- News-makers
- Who or what is in the picture
- Who is speaking

Component Technologies: 

- Speech processing
  - Automatic speech recognition (ASR)
  - Speaker identification
  - Speaker tracking/grouping
  - Sentence boundary/disfluency detection
- Video analysis
  - Scene segmentation
  - Scene tracking/grouping
  - Camera flashes
- Optical character recognition (OCR)
  - Video caption
  - Scene text (light or dark)
  - Person identification
- Information extraction (IE)
  - Names of people, places, organizations
  - Temporal terms
  - Story segmentation/classification

Component Flowchart: 

[Figure: component flowchart]

MAESTRO: 

- Testbed for multimodal News-on-Demand technologies
- Links input data and output from component technologies through a common time line
- The MAESTRO “score” visually correlates the output of the component technologies
- New technologies are easy to integrate through a uniform data representation format
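
As a rough illustration of what a uniform, time-aligned representation could look like, the sketch below stores the output of every component as a record with a source name, a start time, an end time, and a label, and then looks up everything that overlaps a given interval. The `Annotation` class, the `overlapping` helper, the component names, and the sample values are all hypothetical; the slides do not describe MAESTRO's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One time-stamped output from a component (hypothetical format)."""
    source: str   # assumed component name, e.g. "asr", "speaker", "ocr"
    start: float  # seconds from the start of the broadcast
    end: float    # seconds from the start of the broadcast
    label: str    # recognized words, speaker id, caption text, ...


def overlapping(annotations, start, end):
    """Return annotations from any component that overlap [start, end)."""
    return [a for a in annotations if a.start < end and a.end > start]


# Made-up sample data on a shared time line.
score = [
    Annotation("asr", 12.0, 14.5, "tanks are in the area"),
    Annotation("speaker", 10.0, 20.0, "speaker_1"),
    Annotation("ocr", 30.0, 33.0, "MOORE"),
]

hits = overlapping(score, 13.0, 15.0)
print([a.source for a in hits])
```

Because every component writes the same record shape, adding a new technology in this sketch only means appending more `Annotation` records; the lookup does not change.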

MAESTRO Interface: 

[Figure: MAESTRO interface, with panels labeled Score, ASR Output, Video, and IR Results]

The Technical Challenge: 

- Problem: knowledge sources are not always available or reliable
- Approaches
  - Make existing sources more reliable
  - Combine multiple sources for increased reliability and functionality (fusion)
  - Exploit new knowledge sources

Two Examples: 

- Technology fusion: speech recognition + named-entity finding = better OCR
- New knowledge source: speech prosody for finding names and sentence boundaries

Fusion Ideas: 

- Use the names of people detected in the audio track to suggest names in captions
- Use the names of people detected in yesterday’s news to suggest names in audio
- Use a video caption to identify a person speaking, and then use their voice to recognize them again

Information Fusion: 

[Figure: “Moore” + “Moore” → add to lexicon: moore]
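
One way to picture the lexicon idea is the minimal sketch below: names found in the audio track are added to the OCR lexicon, and a noisy caption token is then snapped to the closest lexicon entry. The string-similarity matching via Python's `difflib`, the `augment_lexicon` and `correct_caption` helpers, the 0.7 cutoff, and the sample names are assumptions made for illustration, not the method used in the SRI system.

```python
import difflib


def augment_lexicon(base_lexicon, asr_names):
    """Add names detected in the audio track to the OCR lexicon."""
    return set(base_lexicon) | {name.lower() for name in asr_names}


def correct_caption(ocr_token, lexicon, cutoff=0.7):
    """Snap a noisy OCR token to the closest lexicon entry, if any.

    The cutoff is a hand-picked, hypothetical similarity threshold.
    """
    token = ocr_token.lower()
    matches = difflib.get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token


base = {"clinton"}                          # made-up starting lexicon
augmented = augment_lexicon(base, ["Moore"])  # name heard in the audio track

print(correct_caption("M0ORE", base))       # no close entry: token is kept
print(correct_caption("M0ORE", augmented))  # close entry found after fusion
```

With the base lexicon alone the misread token is left unchanged; once the name from the audio track is in the lexicon, the same token resolves to `moore`.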

Slide15: 

[Figure: fusion diagram. Legend: input processing paths; first-pass fusion opportunities]

INPUT MODALITIES
- Video imagery
- Audio track
- Auxiliary text news sources

TECHNOLOGY COMPONENTS
- Face Det/Rec
- Caption Recog
- Scene Text Det/Rec
- Video object tracking
- Scene Seg/Clust/Class
- Speaker Seg/Clust/Class
- Audio event detection
- Speech Recog
- Name Extraction
- Topic detection

EXTRACTED INFORMATION
- Story start/end
- Geographic focus
- Story topic
- Who / What’s in view
- Who’s speaking

Augmented Lexicon Improves Recognition Results: 

[Figure: recognition results with the augmented lexicon]

Prosody for Enhanced Speech Understanding: 

- Prosody = rhythm and melody of speech
- Measured through duration (of phones and pauses), energy, and pitch
- Can help extract information crucial to speech understanding
- Examples: sentence boundaries and named entities

Prosody for Sentence Segmentation: 

- Finding sentence boundaries is important for information extraction and for structuring output for retrieval
- Example: “Any surprises? No. Tanks are in the area.”
- Experiment: predict sentence boundaries from duration and pitch using decision-tree classifiers
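
To make the experiment more concrete, here is a tiny hand-written decision tree over two prosodic features: the pause duration at a word boundary and the change in pitch across it. The slides describe trees learned from data; the function name, the feature choice, and both thresholds below are hypothetical and hand-picked purely for illustration.

```python
def is_sentence_boundary(pause_sec, pitch_before_hz, pitch_after_hz,
                         pause_threshold=0.5, pitch_rise_threshold=1.1):
    """Toy two-level decision tree (thresholds are made up, not learned).

    Level 1: a long pause is taken as a boundary.
    Level 2: otherwise, a clear pitch rise across the gap is taken
             as a boundary.
    """
    if pause_sec >= pause_threshold:
        return True
    if pitch_before_hz <= 0:
        # No usable pitch estimate before the gap (e.g. unvoiced speech).
        return False
    return pitch_after_hz / pitch_before_hz >= pitch_rise_threshold


# Made-up feature values for three candidate word boundaries.
print(is_sentence_boundary(0.8, 120.0, 125.0))  # long pause
print(is_sentence_boundary(0.1, 100.0, 150.0))  # short pause, pitch rise
print(is_sentence_boundary(0.1, 120.0, 118.0))  # short pause, flat pitch
```

A learned tree would choose its own split features and thresholds from labeled data; the point of the sketch is only the shape of the decision.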

Sentence Segmentation: Results: 

- Baseline accuracy = 50% (same number of boundaries and non-boundaries)
- Accuracy using prosody = 85.7%
- Boundaries indicated by: long pauses, low pitch before, high pitch after
- Pitch cues work much better in Broadcast News than in Switchboard
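
The 50% baseline follows from the balanced test set: with equal numbers of boundaries and non-boundaries, always guessing one class is right half the time. The short sketch below works through that arithmetic and also expresses the reported 85.7% as a relative error reduction over the baseline. The helper names and the toy label list are hypothetical; only the 50% and 85.7% figures come from the slide.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's errors that the system removes."""
    return (system_acc - baseline_acc) / (1.0 - baseline_acc)


# Toy label sequence (made up) with equal numbers of each class.
labels = ["boundary", "non-boundary"] * 4
baseline = majority_baseline(labels)

# Accuracy figures taken from the slide.
reduction = relative_error_reduction(0.50, 0.857)
print(f"baseline={baseline:.3f} error_reduction={reduction:.3f}")
```

Read this way, moving from 50% to 85.7% accuracy removes roughly 71% of the errors the baseline makes.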

Prosody for Named Entities: 

- Finding names (of people, places, organizations) is key to information extraction
- Names tend to be important to content, hence they receive prosodic emphasis
- Prosodic cues can be detected even if words are misrecognized, which could help find new named entities

Named Entities: Results: 

- Baseline accuracy = 50%
- Using prosody only: accuracy = 64.9%
- N.E.s indicated by
  - longer duration (more careful pronunciation)
  - more within-word pitch variation
- Challenges
  - only first mentions are accented
  - only one word in a longer N.E. is marked
  - non-names are also accented
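
The two cues named above can be written as simple per-word features. The sketch below measures within-word pitch variation as the standard deviation of the voiced pitch samples, and duration relative to an expected duration for the word. The function names, the use of standard deviation, the convention that unvoiced frames are marked with 0, and the sample values are all assumptions for illustration; the slides do not say how the features were computed.

```python
import statistics


def pitch_variation(f0_hz):
    """Standard deviation of voiced pitch samples within one word.

    Frames with a value of 0 or less are treated as unvoiced and skipped
    (an assumed convention).
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    return statistics.pstdev(voiced)


def relative_duration(word_sec, expected_sec):
    """Observed word duration divided by a reference duration for that word."""
    return word_sec / expected_sec


# Made-up pitch track (Hz) and durations (seconds) for a single word.
print(pitch_variation([100.0, 0.0, 140.0]))
print(relative_duration(0.6, 0.4))
```

Values well above typical levels on both features would point toward an emphasized word, which a classifier could then weigh as evidence for a named entity.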

Using Prosody in NoD: Summary: 

- Prosody can help information extraction independent of word recognition
- Preliminary positive results for sentence segmentation and N.E. finding
- Other uses: topic boundaries, emotion detection

Ongoing and Future Work: 

- Combine prosody and words for name finding
- Implement additional fusion opportunities:
  - OCR helping speech
  - speaker tracking helping topic tracking
- Leverage geographical information for recognition technologies

Conclusions: 

- News-on-Demand technologies are making great strides
- Robustness is still a challenge
- Improved reliability through data fusion and new knowledge sources