Story Segmentation in English, Mandarin and Arabic Broadcast News:
  Andrew Rosenberg, Julia Hirschberg
  Columbia University
  5/31/06

Outline:
  Introduction
  Approach
  Motivating Example
  Results

Why do we need story segmentation?
  News shows commonly contain many distinct stories.
  Identifying story (topic) boundaries improves:
    Summarization [Hearst and Plaunt 93]
    Information Retrieval [Hearst and Plaunt 93]
    Anaphora Resolution [Kozima 93]

Our Approach:
  Input: speech signal, ASR transcript, automatic speaker diarization and automatic sentence boundary hypotheses
  Assume: story boundaries occur at sentence boundaries
  JRip: a Java implementation of Cohen’s RIPPER, a rule induction algorithm.
  Build rulesets for each show individually using lexical, acoustic and speaker-dependent features.

Corpus Description:
  Broadcast News material from the TDT4 corpus, distributed by LDC
    English: 312.5 hrs, 450 broadcasts, 6 shows
    Mandarin: 134 hrs, 205 broadcasts, 3 shows
    Arabic: 88.5 hrs, 109 broadcasts, 2 shows
  Each broadcast contains between 10 and 20 stories

Story segmentation example:
  the united states finished at the top with a total of ninety seven medals thirty nine of them gold russian china and australia rounded up its four andrea run economy seemed to champion style welcome home even though she was stripped of her individual gold medal at the sydney olympics in armenian gymnast tested positive for a banned stimulant that was in a nonprescription cold medicine she took from any as government is honoring her with its own gold medal inscribed everlasting olympic champion the international olympic committee did allow run a con to keep her team gold medal and the silver medal she won in the vote compass a spokeswoman says republican senator strom thurmond is doing very well after falling ill saturday he spent the night it will to read army medical center in washington

Story segmentation example:
  the united states finished at the top with a total of ninety seven medals thirty nine of them gold russian china and australia rounded up its four andrea run economy seemed to champion style welcome home even though she was stripped of her individual gold medal at the sydney olympics
  ***
  in armenian gymnast tested positive for a banned stimulant
  ---
  that was in a nonprescription cold medicine she took from any as government is honoring her with its own gold medal inscribed everlasting olympic champion the international olympic committee did allow run a con to keep her team gold medal and the silver medal she won in the vote compass
  ***
  a spokeswoman says republican senator strom thurmond is doing very well after falling ill saturday he spent the night it will to read army medical center in washington

Lexical Features:
  TextTiling
    Identify boundaries with locally minimal cosine similarity of the preceding and following regions.
  LCSeg
    Augments the above process by weighting cosine similarity by a lexical chain score: a measure of lexical repetition.
  ‘Cue’ Unigrams
    Those (stemmed, when available) unigrams that are significantly more likely to appear near the start or end of a story.

Acoustic Features:
  Pitch and Intensity
    Min, max, median, mean, std. dev., mean absolute slope
    Calculated over the previous sentence, and as the first-order difference between the previous and following sentences
  Vowel Duration
    Mean vowel length, sentence-final vowel length, sentence-final rhyme length
  Voiced/Total 10 ms frames as an approximation of speaking rate

Speaker-dependent Features:
  Based on automatic speaker diarization
    Performed by our collaborators at U. Washington
  Normalization of acoustic features.
  Speaker participation features as a rough approximation of speaker “role”.
    What percent of the show’s sentences does the speaker of the previous sentence deliver?
    First turn? Last turn?
    Is this the first speaker in the show?

Ruleset Excerpts:
  (ENG) If (speaker_distribution > .16) and (length > 15.85) and (maxI > 80.5) and (minI < 43.6) and (vowels_per_sec < 3) Then SEGMENT
  (MAN) If (speaker_boundary) and (last_speaker_turn) and (speaker_distribution > 0.036) and (vowels_per_sec_norm > 1.03) Then SEGMENT
  (ARB) If (following_cue_words > 1) and (preceding_cue_words > 1) and (meanI < 67.8) and (stdev_I > 8.0) Then SEGMENT

Results - English:

Results - Mandarin:

Results - Arabic:

Conclusions:
  The described approach is successful at detecting story boundaries in English and Mandarin BN, though recall could be improved.
  The acoustic features shown here and elsewhere to be indicative of topic shifts are not discriminative on Arabic BN.
    Arabic has a different intonation strategy
    MSA is not any speaker’s native language
    Errors from previous processing (ASR, sentence segmentation) hinder the effectiveness of acoustic analysis.

Thank You:
  {amaxwell,julia}@columbia.edu
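
Each rule on the Ruleset Excerpts slide is a conjunction of threshold tests over the features computed at a candidate sentence boundary. The following is a minimal Python sketch of how such rules could be applied, not the authors' JRip output. The thresholds and feature names are copied from the slide; the dictionary layout, the function names, the NO_SEGMENT label, and the sample feature values are assumptions made for illustration. A full RIPPER ruleset would contain many such rules plus a default class, while the slide shows only one rule per language.

```python
# Sketch only: applies the three rule excerpts from the slides to feature
# dictionaries. Thresholds and feature names come from the slides; the
# dictionary layout, function names and sample values are hypothetical.

def eng_rule(f):
    """(ENG) excerpt: all five threshold tests must hold."""
    return (f["speaker_distribution"] > 0.16
            and f["length"] > 15.85
            and f["maxI"] > 80.5
            and f["minI"] < 43.6
            and f["vowels_per_sec"] < 3)


def man_rule(f):
    """(MAN) excerpt: two boolean speaker features plus two thresholds."""
    return bool(f["speaker_boundary"]
                and f["last_speaker_turn"]
                and f["speaker_distribution"] > 0.036
                and f["vowels_per_sec_norm"] > 1.03)


def arb_rule(f):
    """(ARB) excerpt: cue-word counts plus intensity statistics."""
    return (f["following_cue_words"] > 1
            and f["preceding_cue_words"] > 1
            and f["meanI"] < 67.8
            and f["stdev_I"] > 8.0)


RULES = {"ENG": eng_rule, "MAN": man_rule, "ARB": arb_rule}


def label(language, features):
    """Return "SEGMENT" if the excerpted rule for `language` fires.

    "NO_SEGMENT" is a placeholder name for the default class.
    """
    return "SEGMENT" if RULES[language](features) else "NO_SEGMENT"


# Hypothetical feature values for one candidate sentence boundary.
example = {
    "speaker_distribution": 0.2,
    "length": 20.0,
    "maxI": 82.0,
    "minI": 40.0,
    "vowels_per_sec": 2.5,
}

print(label("ENG", example))  # prints SEGMENT
```

Keeping one function per rule mirrors the per-show rulesets described under Our Approach: rules for a different show can be swapped in without touching the labelling code.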