DARPA NoD
Lindon
Added: October 08, 2007

Presentation Transcript

Multimodal Technology Integration for News-on-Demand:
- SRI International
- News-on-Demand Compare & Contrast
- DARPA
- September 30, 1998

Personnel:
- Speech: Dilek Hakkani, Madelaine Plauche, Zev Rivlin, Ananth Sankar, Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gokhan Tur
- Natural language: David Israel, David Martin, John Bear
- Video analysis: Bob Bolles, Marty Fischler, Marsha Jo Hannah, Bikash Sabata
- OCR: Greg Myers, Ken Nitz
- Architectures: Luc Julia, Adam Cheyer

SRI News-on-Demand Highlights:
- Focus on technologies
- New technologies: scene tracking, speaker tracking, flash detection, sentence segmentation
- Exploit technology fusion
- MAESTRO multimedia browser

Outline:
- Goals for News-on-Demand
- Component Technologies
- The MAESTRO testbed
- Information Fusion
- Prosody for Information Extraction
- Future Work
- Summary

High-level Goal:
- Develop techniques to provide direct and natural access to a large database of information sources through multiple modalities, including video, audio, and text.

Information We Want:
- Geographical location
- Topic of the story
- News-makers
- Who or what is in the picture
- Who is speaking

Component Technologies:
- Speech processing
  - Automatic speech recognition (ASR)
  - Speaker identification
  - Speaker tracking/grouping
  - Sentence boundary/disfluency detection
- Video analysis
  - Scene segmentation
  - Scene tracking/grouping
  - Camera flashes
- Optical character recognition (OCR)
  - Video caption
  - Scene text (light or dark)
  - Person identification
- Information extraction (IE)
  - Names of people, places, organizations
  - Temporal terms
  - Story segmentation/classification

Component Flowchart:
- [Figure: component flowchart]

MAESTRO:
- Testbed for multimodal News-on-Demand technologies
- Links input data and output from component technologies through a common time line
- MAESTRO “score” visually correlates component technologies' output
- Easy to integrate new technologies through a uniform data representation format

MAESTRO Interface:
- [Figure: MAESTRO interface, with panels labeled Score, ASR Output, Video, IR Results]

The Technical Challenge:
- Problem: Knowledge sources are not always available or reliable
- Approaches
  - Make existing sources more reliable
  - Combine multiple sources for increased reliability and functionality (fusion)
  - Exploit new knowledge sources

Two Examples:
- Technology fusion: Speech recognition + named entity finding = better OCR
- New knowledge source: Speech prosody for finding names and sentence boundaries

Fusion Ideas:
- Use the names of people detected in the audio track to suggest names in captions
- Use the names of people detected in yesterday’s news to suggest names in audio
- Use a video caption to identify a person speaking, and then use their voice to recognize them again

Information Fusion:
- [Figure: “Moore” + “Moore”, add to lexicon: moore]

Slide15:
- [Figure: fusion diagram]
- Input modalities: Video imagery, Auxiliary text news sources, Audio track
- Technology components: Face Det/Rec, Caption Recog, Scene Text Det/Rec, Speaker Seg/Clust/Class, Audio event detection, Speech Recog, Name Extraction, Topic detection, Video object tracking, Scene Seg/Clust/Class
- Extracted information: Story start/end, Geographic focus, Story topic, Who / What’s in view, Who’s speaking
- Legend: Input processing paths, First-pass fusion opportunities

Augmented Lexicon Improves Recognition Results:
- [Figure: augmented lexicon improves recognition results]

Prosody for Enhanced Speech Understanding:
- Prosody = rhythm and melody of speech
- Measured through duration (of phones and pauses), energy, and pitch
- Can help extract information crucial to speech understanding
- Examples: sentence boundaries and named entities

Prosody for Sentence Segmentation:
- Finding sentence boundaries is important for information extraction and for structuring output for retrieval
- Ex.: Any surprises? No. Tanks are in the area.
- Experiment: Predict sentence boundaries based on duration and pitch using decision tree classifiers

Sentence Segmentation: Results:
- Baseline accuracy = 50% (same number of boundaries & non-boundaries)
- Accuracy using prosody = 85.7%
- Boundaries indicated by: long pauses, low pitch before, high pitch after
- Pitch cues work much better in Broadcast News than in Switchboard

Prosody for Named Entities:
- Finding names (of people, places, organizations) is key to info extraction
- Names tend to be important to content, hence prosodic emphasis
- Prosodic cues can be detected even if words are misrecognized: could help find new named entities

Named Entities: Results:
- Baseline accuracy = 50%
- Using prosody only: accuracy = 64.9%
- N.E.s indicated by
  - longer duration (more careful pronunciation)
  - more within-word pitch variation
- Challenges
  - only first mentions are accented
  - only one word in longer N.E. marked
  - non-names accented

Using Prosody in NoD: Summary:
- Prosody can help information extraction independent of word recognition
- Preliminary positive results for sentence segmentation and N.E. finding
- Other uses: topic boundaries, emotion detection

Ongoing and Future Work:
- Combine prosody and words for name finding
- Implement additional fusion opportunities:
  - OCR helping speech
  - speaker tracking helping topic tracking
- Leverage geographical information for recognition technologies

Conclusions:
- News-on-Demand technologies are making great strides
- Robustness still a challenge
- Improved reliability through data fusion and new knowledge sources
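
The first fusion idea in the transcript (names detected in the audio track suggesting names in captions, with “Moore” added to a lexicon) might be sketched roughly as follows. This is a minimal illustration, not the SRI system: the function names, the sample tokens, and the 0.75 similarity cutoff are all assumptions made for the example, and the fuzzy matching uses Python's standard `difflib` rather than anything described in the slides.

```python
from difflib import get_close_matches


def build_name_lexicon(asr_names):
    """Map lowercase keys to the names found in the audio track (hypothetical helper)."""
    return {name.lower(): name for name in asr_names}


def correct_caption_tokens(ocr_tokens, lexicon, cutoff=0.75):
    """Replace OCR tokens that closely match a lexicon name with that name.

    The cutoff is an illustrative value, not one taken from the presentation.
    """
    corrected = []
    for token in ocr_tokens:
        matches = get_close_matches(token.lower(), list(lexicon), n=1, cutoff=cutoff)
        # Keep the original token when no lexicon entry is similar enough.
        corrected.append(lexicon[matches[0]] if matches else token)
    return corrected


# Names assumed to come from speech recognition plus named entity finding.
asr_names = ["Moore", "Clinton"]
# A caption as a noisy OCR pass might read it (made-up example).
ocr_tokens = ["M0ore", "reports", "from", "Washington"]

lexicon = build_name_lexicon(asr_names)
fixed_caption = correct_caption_tokens(ocr_tokens, lexicon)
print(fixed_caption)
```

In this sketch only the misread token "M0ore" is close enough to a lexicon entry to be replaced; ordinary words stay as the OCR produced them. A real system would likely weigh recognizer confidence as well as string similarity.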
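
The sentence segmentation slides report that boundaries are signaled by long pauses, low pitch before, and high pitch after. The sketch below hand-writes a tiny two-level decision rule over those three cues to show the idea. It is not the trained decision tree from the experiment: the `WordGap` type, the thresholds, and every measurement value are hypothetical and chosen only so the slide's example ("Any surprises? No. Tanks are in the area.") splits into three sentences.

```python
from dataclasses import dataclass


@dataclass
class WordGap:
    """Prosodic measurements at the gap after a word (all values illustrative)."""
    pause_sec: float        # silence duration after the word
    pitch_before_hz: float  # pitch at the end of the word
    pitch_after_hz: float   # pitch at the start of the next word


def is_sentence_boundary(gap, speaker_mean_hz, long_pause_sec=0.5, short_pause_sec=0.15):
    """Hand-written decision rule; the thresholds are assumptions, not learned values."""
    if gap.pause_sec >= long_pause_sec:
        return True  # a long pause alone is treated as a boundary
    if gap.pause_sec >= short_pause_sec:
        # Shorter pause: require low pitch before and high pitch after.
        return gap.pitch_before_hz < speaker_mean_hz < gap.pitch_after_hz
    return False


def segment(words, gaps, speaker_mean_hz):
    """Group words into sentences, closing a sentence at each predicted boundary."""
    sentences, current = [], []
    for word, gap in zip(words, gaps):
        current.append(word)
        if is_sentence_boundary(gap, speaker_mean_hz):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


words = ["any", "surprises", "no", "tanks", "are", "in", "the", "area"]
gaps = [
    WordGap(0.02, 180, 185),
    WordGap(0.60, 230, 150),  # long pause after "surprises"
    WordGap(0.30, 120, 210),  # medium pause, pitch falls then resets after "no"
    WordGap(0.05, 175, 170),
    WordGap(0.03, 165, 168),
    WordGap(0.02, 160, 162),
    WordGap(0.04, 158, 155),
    WordGap(0.80, 110, 110),  # long pause at the end of the utterance
]
sentences = segment(words, gaps, speaker_mean_hz=170)
print(sentences)
```

Because the rule never looks at the words themselves, it would behave the same on misrecognized text, which is the property the summary slide highlights.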