Overview of the TDT 2004 Evaluation and Results: Overview of the TDT 2004 Evaluation and Results Jonathan Fiscus
Barbara Wheatley
National Institute of Standards and Technology
Gaithersburg, Maryland
December 2-3, 2004
Outline: Outline TDT Evaluation Overview
Changes in 2004
2004 TDT Evaluation Result Summaries
New Event Detection
Link Detection
Topic Tracking
Experimental Tasks:
Supervised Adaptive Topic Tracking
Hierarchical Topic Detection
Topic Detection and Tracking: Topic Detection and Tracking 5 TDT Applications
Story Segmentation*
Topic Tracking
Topic Detection
First Story Detection
Link Detection
“Applications for organizing text” Terabytes of Unorganized data * Not evaluated in 2004
TDT’s Research Domain : TDT’s Research Domain Technology challenge
Develop applications that organize and locate relevant stories from a continuous feed of news stories
Research driven by evaluation tasks
Composite applications built from
Document Retrieval
Speech-to-Text (STT) – not included this year
Story Segmentation – not included this year
Definitions: Definitions An event is …
A specific thing that happens at a specific time and place along with all necessary preconditions and unavoidable consequences.
A topic is …
an event or activity, along with all directly related events and activities
A broadcast news story is …
a section of transcribed text with substantive information content and a unified topical focus
Evaluation Corpus: Evaluation Corpus Same languages as last year
Summary of differences
New time period
No broadcast news
No non-news stories
4.5 times more stories
3.1 times more topics
Topics have ½ as many on-topic stories
Topic Size Distribution: Topic Size Distribution 35 Arb+Eng+Man 62 Arb 62 Man 63 Eng 21 Eng+Man 7 Arb+Eng
Mutlilingual Topic overlap: Mutlilingual Topic overlap 25 Common Stories Topic ID Unique Stories Topics on Terrorism 107: Casablanca bombs 71: Demonstrations in Casablanca Single Overlap Topics Multiply Overlap Topics
Topic labels: Topic labels 72 Court indicts Liberian President
89 Liberian former president arrives in exile
29 Swedish Foreign Minister killed
125 Sweden rejects the Euro
151 Egyptian delegation in Gaza
189 Palestinian public uprising suspended for three months
69 Earthquake in Algeria
145 Visit of Morocco Minister of Foreign Affairs to Algeria
186 Press conference between Lebanon and US foreign ministers
193 Colin Powell Plans to visit Middle East and Europe 105 UN official killed in attack
126 British soldiers attacked in Basra
215 Jerusalem: Bus suicide bombing
227 Bin Laden Videotape
171 Morocco: death sentences for bombing suspects
107 Casablanca bombs
71 Demonstrations in Casablanca
106 Bombing in Riyadh, Saudi Arabia
118 World Economic Forum in Jordan
154 Saudi suicide bomber dies in shootout
60 Saudi King has eye surgery
80 Spanish Elections Single Overlap Topics Multiply Overlap Topics
Participation by Task:Showing the Number of Submitted System Runs: Participation by Task: Showing the Number of Submitted System Runs Foreign Domestic
New Event Detection Task: New Event Detection Task System Goal:
To detect the first story that discusses each topic
TDT Evaluation Methodology: TDT Evaluation Methodology Tasks are modeled as detection tasks
Systems are presented with many trials and must answer the question: “Is this example a target trial?”
Systems respond:
YES this is a target, or NO this is not
Each decision includes a likelihood score indicating the system’s confidence in the decision
System performance measured by linearly combining the system’s missed detection rate and false alarm rate
Detection Evaluation Methodology: Detection Evaluation Methodology Performance is measured in terms of Detection Cost
CDet = CMiss * PMiss * Ptarget + CFA * PFA * (1- Ptarget)
Constants:
CMiss = 1 and CFA = 0.1 are preset costs
Ptarget = 0.02 is the a priori probability of a target
System performance estimates
PMiss and PFA
Normalized Detection Cost generally lies between 0 and 1:
(CDet)Norm = CDet/min{CMiss*Ptarget, CFA * (1-Ptarget)}
Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between PMiss and PFA
Makes use of likelihood scores attached to the YES/NO decisions
Two important scores per system
Actual Normalized Detection Cost
Based on the YES/NO decision threshold
Minimum Normalized DET point
Based on the DET curve: Minimum score with proper threshold
Slide14: Performance Measures Example Bottom left is better DET Curve Bar Chart
Primary New Event Detection ResultsNewswire, English Texts: Primary New Event Detection Results Newswire, English Texts
New Event DetectionPerformance History : New Event Detection Performance History * 0.4283 on 2002 Topics
Slide17: TDT Link Detection Task System Goal:
To detect whether a pair of stories discuss the same topic.
(Can be thought of as a “primitive operator” to build a variety of applications)
?
Primary Link Detection ResultsNewswire, Multilingual links, 10-file deferral period: Primary Link Detection Results Newswire, Multilingual links, 10-file deferral period
Link DetectionPerformance History : Link Detection Performance History * 0.1798 on 2002 Topics
Topic Tracking Task: Topic Tracking Task System Goal:
To detect stories that discuss the target topic, in multiple source streams
Supervised Training
Given Nt samples stories that discuss a given target topic
Testing
Find all subsequent stories that discuss the target topic
Primary Tracking ResultsNewswire, Multilingual Texts, 1 English Training Story: Primary Tracking Results Newswire, Multilingual Texts, 1 English Training Story
Tracking Performance History : Tracking Performance History * 0.1618 on 2002 Topics
Supervised Adaptive Tracking Task: Supervised Adaptive Tracking Task Variation of Topic Tracking system goal:
To detect stories that discuss the target topic when a human provides feedback to the system
System receives human judgment (on or off-topic) for every retrieved story
Same task as TREC 2002 Adaptive Filtering training data test data
Supervised Adaptive Tracking Metrics: Supervised Adaptive Tracking Metrics Normalized Detection Cost
Same measure as for basic Tracking task
Linear Utility Measure
As defined for TREC 2002 Filtering Track (Robertson & Soboroff)
Measures value of the stories sent to the user:
Credit for relevant stories, debit for non-relevant stories
Equivalent to thresholding based on estimated probability of relevance
No penalty for missing relevant stories (i.e. all precision, no recall)
Implication: Challenge is to beat the “do-nothing” baseline (i.e. a system that rejects all stories)
Supervised Adaptive Tracking Metrics: Supervised Adaptive Tracking Metrics Linear Utility Measure Computation:
Basic formula: U = Wrel R - NR
R = number of relevant stories retrieved
NR = number of non-relevant stories retrieved
Wrel = relative weight of relevant vs non-relevant (set to 10, by analogy with CMiss vs. CFA weights for CDet)
Normalization across topics:
Divide by maximum possible utility score for each topic
Scaling across topics:
Define arbitrary minimum possible score, to avoid having average dominated by a few topics with huge NR counts
Corresponds to application scenario in which user stops looking at stories when system exceeds some tolerable false alarm rate
Scaled, normalized value:
Uscale = [ max(Unorm, Umin) ] / [ 1 - Umin ]
Supervised Adaptive TrackingBest Two Submissions per Site Newswire, Multilingual Texts, 1 English Training Story: Supervised Adaptive Tracking Best Two Submissions per Site Newswire, Multilingual Texts, 1 English Training Story
Effect of Supervised Adaptation: Effect of Supervised Adaptation CMU4 is a simple cosine similarity tracker
Contrastive run submitted without supervised adaptation
Supervised Adaptive TrackingUtility vs. Detection cost: Supervised Adaptive Tracking Utility vs. Detection cost Performance on Utility measure:
2/3 of systems surpassed baseline scaled utility score (0.33)
Most systems optimized for detection cost, not utility
Detection Cost and Utility are uncorrelated: R2 of 0.23
Even for CMU3 which was tuned for utility
Hierarchical Topic Detection: Hierarchical Topic Detection System goal:
To detect topics in terms of the (clusters of) stories that discuss them
Problems with past Topic Detection evaluations:
Topics are at different levels of granularity, yet systems had to choose single operating point for creating a new cluster
Stories may pertain to multiple topics, yet systems had to assign each to only one cluster
Topic Hierarchy Solves Problems: Topic Hierarchy Solves Problems System operation:
Unsupervised topic training - no topic instances as input
Assign each story to one or more clusters
Clusters may overlap or include other clusters
Clusters must be organized as directed acyclic graph (DAG) with single root
Treated as retrospective search
Semantics of topic hierarchy:
Root = entire collection
Leaf nodes = the most specific topics
Intermediate nodes represent different levels of granularity
Performance assessment:
Given a topic, find matching cluster with lowest cost Edge Vertex Story IDs
Hierarchical Topic Detection Metric: Minimal Cost: Hierarchical Topic Detection Metric: Minimal Cost Weighted combination of Detection Cost and Travel Cost:
WDET (Cdet(topic, bestVertex))Norm + (1 - WDET) Ctravel(topic, bestVertex))Norm
Detection Cost: same as for other tasks
Travel Cost: function of the hierarchy
Detection Cost weighted 2 Travel Cost (WDET = 0.66)
Minimal Cost metric selected based on study at U Mass (Allan et al.):
Effectively eliminates power set solution
Favors balance of cluster purity vs. number of clusters
Computationally tractable
Good behavior in U Mass experiments
Analytic use model:
Find best-matching cluster by traversing DAG, starting from root
Corresponds to analytic task of exploring an unknown collection
Drawbacks:
Does not model analytic task of finding other stories on same or neighboring topics
Not obvious how to normalize travel cost
Hierarchical Topic Detection Metric: Travel Cost: Hierarchical Topic Detection Metric: Travel Cost Travel Cost computation:
Ctravel(topic, vertex) = Ctravel(topic, parentOf(vertex)) +
CBRANCH NumChildren(parentOf(vertex)) + CTITLE
CBRANCH = cost per branch, for each vertex on path to best match
CTITLE = cost of examining each vertex
Relative values of CBRANCH and CTITLE determine preference for shallow, bushy hierarchy vs. deep, less bushy hierarchy
Evaluation values chosen to favor branching factor of 3
Travel Cost normalization:
Absolute travel cost depends on size of corpus, diversity of topics
Must be normalized to combine with Detection Cost
Normalization scheme for trial evaluation chosen to yield CtravelNorm = 1 for “ignorant” hierarchy (by analogy with use of prior probability for CdetNorm):
CtravelNorm = Ctravel / (CBRANCH * MAXVTS * NSTORIES / AVESPT) + CTITLE
MAXVTS = 3 (maximum number of vertices per story, controls overlap)
AVESPT = 88 (average stories per topic, computed from TDT4 multilingual data)
Hierarchical Topic Detection: Hierarchical Topic Detection
Hierarchical Topic Detection Observations: Hierarchical Topic Detection Observations All systems structured hierarchy as a tree – each vertex has one parent
Travel cost has very little effect on finding the best cluster
Setting WDET to 1.0 has little effect on topic mapping
Cost parameters favor false alarms
Average mapped cluster sizes are between 1262 and 7757 stories
Average topic size is 40 stories
Summary: Summary Eleven research groups participated in five evaluation tasks
Error rates increased for new event detection
Why?
Error rates decreased for tracking
Error rates decreased for link detection
Dry run of hierarchical topic detection completed
Solves previous problems with topic detection task, but raises new issues
Questions to consider:
Is the specified hierarchical structure (single-root DAG) appropriate?
Is the minimal cost metric appropriate?
If so, is the normalization right?
Dry run of supervised adaptive tracking completed
Promising results for including relevance feedback
Questions to consider:
Should we continue the task?
If so, should we continue using both metrics?