Semantic Analysis for Video Contents Extraction - Spotting by Association in News Video
Paper by:
Yuichi NAKAMURA
Takeo KANADE
Presented by: Hemant Joshi
Introduction:
Enormous amount of multimedia data
Linking two news matters together
Semantic linking
Using closed-captioning along with video
Video Content Spotting by Association:
Necessity for multiple modalities
Video content extraction from only language data or only image data is not reliable.
"They say": difficult to determine without semantics.
Situation Spotting by Association:
Association between language and image clues is an important key.
Two advantages
Reliable detection utilizing both images and language
The data explained by both modalities is clearly understandable to users.
Situation Spotting by Association (Cont.):
Language Clue Detection:
Simple keyword spotting
Direct vs. indirect narration
Keyword usage for speech
Language Clue Detection (Cont.):
Keyword usage for meeting and visiting
Screening Keywords:
To avoid false detection of keywords unrelated to the subject matter of interest, parse each sentence in the transcript, check the role of each keyword, and check the semantics of the subject, the verb, and the objects. Also consider the following:
The part of speech of each word can be used in evaluating a keyword. Example: “talk” as a verb.
If the keyword is a verb, its subject or object is checked semantically. For semantic checking, use the hypernym relation in WordNet.
Negative sentences and sentences in the future tense can be ignored.
A location name that follows certain prepositions, such as “in” or “to”, is considered a language clue.
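The screening rules above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it assumes each sentence has already been parsed into a small record (the `ParsedSentence` fields are hypothetical), the verb list is an assumed sample, and a tiny hand-written `HYPERNYMS` table stands in for the WordNet hypernym relation.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-in for WordNet hypernym links (hypothetical entries).
HYPERNYMS = {
    "president": "leader",
    "leader": "person",
    "spokesman": "person",
    "report": "document",
}

# Assumed sample of speech keywords; the slides' keyword table is not reproduced here.
SPEECH_VERBS = {"say", "talk", "tell"}


@dataclass
class ParsedSentence:
    """Hypothetical parser output: lemmatized subject and verb plus two flags."""
    subject: str
    verb: str
    negated: bool = False
    tense: str = "past"


def is_a(word: Optional[str], ancestor: str) -> bool:
    """Follow hypernym links upward from `word` until `ancestor` is found or the chain ends."""
    seen = set()
    while word is not None and word not in seen:
        if word == ancestor:
            return True
        seen.add(word)
        word = HYPERNYMS.get(word)
    return False


def is_speech_key_sentence(s: ParsedSentence) -> bool:
    """Apply the screening rules: skip negative/future sentences, require a
    keyword used as the verb, and require a subject that is a kind of person."""
    if s.negated or s.tense == "future":
        return False
    if s.verb not in SPEECH_VERBS:
        return False
    return is_a(s.subject, "person")
```

For instance, `is_speech_key_sentence(ParsedSentence("president", "say"))` passes the screen, while the same verb with the subject `"report"` does not, because `"report"` has no `"person"` hypernym in the toy table.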
Process - Conditions for key-sentence detection:
In key-sentence detection, keywords are detected from transcripts.
Keywords are checked syntactically and semantically, and evaluated using the parsing results.
If we focus only on subjects and verbs, the results are more acceptable (80% correct on CNN news headlines).
A sentence that includes one or more words satisfying these conditions is considered a key-sentence.
Process - Key-sentence detection result:
The figure (X/Y/Z) in each table shows the numbers of detected key-sentences:
X is the number of sentences which include keywords
Y is the number of sentences removed by the keyword screening described above
Z is the number of sentences incorrectly removed
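As a reading aid for the X/Y/Z figures, the small sketch below derives two quantities that follow directly from the definitions: the number of sentences kept after screening (X - Y) and the number correctly removed (Y - Z). The function name and the sample figure are hypothetical; the values are not taken from the paper's tables.

```python
from typing import Tuple


def read_figure(figure: str) -> Tuple[int, int]:
    """Parse an 'X/Y/Z' figure and return (kept, correctly_removed)."""
    x, y, z = (int(part) for part in figure.split("/"))
    kept = x - y               # sentences that survive keyword screening
    correctly_removed = y - z  # removed sentences that really were not key-sentences
    return kept, correctly_removed
```

With a made-up figure such as `"10/4/1"`, six sentences remain as key-sentences and three of the four removals were correct.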
Image Clue Detection – Key Image:
Image clues:
Face close-ups
People Images
Outdoor Scenes
Usage of Face close-up
Key Image – Usage of People Images:
People images are used to describe crowds, such as people in a demonstration.
Key Image – Outdoor Scenes:
In the case of outdoor scenes, images describe the place, the degree of a disaster, etc.
Key Image Detection:
Face Close-up Detection
In this research, human faces are detected by a neural-network-based face detection program. Most face close-ups are easily detected because they are large and frontal. Overall, most frontal faces are detected, but fewer than half of the small faces and profiles.
People Image and Outdoor Scene Detection
As for images with many people, the problem becomes difficult because small faces and human figures are more difficult to detect. The same can be said of outdoor scene detection.
Automatic face and outdoor scene detection is still under development. For the experiments in this paper, we manually pick them. Since the representative image of each cut is automatically detected, it takes only a few minutes for us to pick those images from a 30-minute news video.
Association by Dynamic Programming:
Basic Idea
The detected data are a sequence of key images and a sequence of key-sentences, each with a starting and ending time. If a key image duration and a key-sentence duration overlap enough (or are close to each other) and the suggested situations are compatible, they should be associated.
Basic Assumption
The order of the key image sequence and that of the key-sentence sequence are the same.
The basic idea is to minimize the following penalty value P.
P = \sum_{j \in S_n} Skip_s(j) + \sum_{k \in I_n} Skip_i(k) + \sum_{j \in S, k \in I} Match(j, k)
where S and I are the key-sentences and key images which have corresponding clues in the other modality, and S_n and I_n are those without corresponding clues. Skip_s is the penalty value for a key-sentence without inter-modal correspondence, Skip_i is that for a key image without inter-modal correspondence, and Match(j, k) is the penalty for the correspondence between the j-th key-sentence and the k-th key image.
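A minimal sketch of how such a penalty could be minimized by dynamic programming is given below. It is not the authors' implementation: it assumes a one-to-one, order-preserving association (each key-sentence is either skipped or matched to one key image), and it takes the skip and match penalties as ready-made inputs. The function name `associate` and all numeric values in the usage note are hypothetical.

```python
from typing import List, Tuple


def associate(skip_s: List[float], skip_i: List[float],
              match: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the minimum total penalty P and the matched (sentence, image) index pairs.

    skip_s[j]   penalty for leaving key-sentence j unmatched
    skip_i[k]   penalty for leaving key image k unmatched
    match[j][k] penalty for associating key-sentence j with key image k
    """
    n, m = len(skip_s), len(skip_i)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # cost[j][k]: best penalty for the first j sentences and the first k images.
    for j in range(n + 1):
        for k in range(m + 1):
            candidates = []
            if j > 0:
                candidates.append((cost[j - 1][k] + skip_s[j - 1], "skip_s"))
            if k > 0:
                candidates.append((cost[j][k - 1] + skip_i[k - 1], "skip_i"))
            if j > 0 and k > 0:
                candidates.append((cost[j - 1][k - 1] + match[j - 1][k - 1], "match"))
            if candidates:
                cost[j][k], move[j][k] = min(candidates)

    # Walk back from the final cell to recover which pairs were matched.
    pairs = []
    j, k = n, m
    while j > 0 or k > 0:
        if move[j][k] == "match":
            pairs.append((j - 1, k - 1))
            j, k = j - 1, k - 1
        elif move[j][k] == "skip_s":
            j -= 1
        else:
            k -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

For example, `associate([1.0, 1.0], [1.0], [[5.0], [0.2]])` skips the first key-sentence and pairs the second one with the only key image, since that match is much cheaper than the alternatives. Because the recurrence only ever advances through both sequences, the same-order assumption is enforced by construction.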
Association by DP - Cost Evaluation:
Skipping Cost (Skip)
The penalty values are determined by the importance of the data, that is, the possibility of each datum having an inter-modal correspondence. In this research, the importance E of each clue is calculated by the following formula. The skip penalty Skip is taken to be -E.
E = E_{type} \cdot E_{data}
where E_{type} is the evaluation of the clue type, for example, the evaluation of the type “face close-up”, and E_{data} is the evaluation of the individual clue, for example, the face size evaluation for a face close-up.
Example of cost definition
key-sentence: speech 1.0, meeting 0.6, crowd 0.6, travel/visit 0.6, location 0.6
key image: face 1.0, people 0.6, scene 0.6
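The skip penalty can be illustrated with a few lines of code. The sketch assumes that E is the product of the type evaluation and the data evaluation, and that the values listed above are the type evaluations; the data evaluation passed in (for example a normalized face size) is a made-up number.

```python
# Type evaluations as listed in the example of cost definition.
E_TYPE_SENTENCE = {"speech": 1.0, "meeting": 0.6, "crowd": 0.6,
                   "travel/visit": 0.6, "location": 0.6}
E_TYPE_IMAGE = {"face": 1.0, "people": 0.6, "scene": 0.6}


def skip_penalty(e_type: float, e_data: float) -> float:
    """Skip = -E, with E assumed to be the product of type and data evaluations."""
    return -(e_type * e_data)
```

A face close-up with a hypothetical data evaluation of 0.5 would then get `skip_penalty(E_TYPE_IMAGE["face"], 0.5)`, that is, -0.5.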
Association by DP - Cost Evaluation:
Matching Cost (Match)
The evaluation of correspondences is calculated by the following formula.
Match(j, k) = M_{time}(j, k) \cdot M_{type}(j, k)
where M_{time} is the duration compatibility between a sentence and an image. The more their durations overlap, the smaller the penalty becomes.
A key image's duration (d_i) is the duration of the cut from which the key image is taken; the starting and ending time of a sentence in the speech is used for the key-sentence duration (d_s). Where the exact speech time is difficult to obtain, it is substituted by the time when the closed caption appears.
The actual values for M_{type} are shown in the table. They are roughly determined by the number of correspondences in our sample videos.
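The slides do not give the exact form of the time compatibility term, so the sketch below uses a stand-in: one minus the overlap divided by the shorter of the two durations, which is 0 when the shorter interval is fully covered and 1 when the intervals are disjoint. This formula, the function names, and the type value passed in are all assumptions for illustration; the sketch also ignores the "close to each other" case for non-overlapping durations.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Length of the time span shared by two intervals (0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def m_time(image: Interval, sentence: Interval) -> float:
    """Hypothetical duration-compatibility penalty: more overlap gives a smaller value."""
    shorter = min(image[1] - image[0], sentence[1] - sentence[0])
    if shorter <= 0:
        return 1.0
    return 1.0 - overlap(image, sentence) / shorter


def match_cost(image: Interval, sentence: Interval, m_type: float) -> float:
    """Match penalty as the product of the time term and a given type term."""
    return m_time(image, sentence) * m_type
```

With a cut spanning 0 to 10 seconds and a sentence spanning 5 to 15 seconds, half of the shorter duration overlaps, so `m_time` gives 0.5 and the match cost is half of whatever type value is supplied.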
Experiments & Results:
Results (Continued):
Usage of Results:
Summarization and Presentation tool
Around 70 segments are spotted for each 30-minute news video, an average of 2 to 3 segments per minute. If a topic is not too long, we can place all of the segments in one topic into one window. This view could be a good presentation of a topic as well as a good summarization tool.
Each pair of a picture and a sentence is an associated pair. The picture is a key image, and the sentence is a key-sentence. The position of the pair is determined by the situations defined
This view enables us to get an overview of how the topic is organized. Visit and place information is given first, meeting information is given second, then a few public speeches and opinions are given.
Usage of Results (Continued):
Data tagging to video segments
News Video topic explainer (Category + Time Order):
Details in Topic Explainer:
Conclusion:
The idea of Spotting by Association in news video.
Video segments with typical semantics are detected by associating language clues and image clues.
Most of the detected segments fit the typical situations
Proposed new applications by using detected news segments.
Future work
Improvement of key image and key-sentence detection
Check the effectiveness of this method with other kinds of videos.
Questions?