The METER Corpus: A corpus for analysing journalistic text reuse
Robert Gaizauskas1, Jonathan Foster2, Yorick Wilks1, John Arundel2, Paul Clough1, Scott Piao1
1Department of Computer Science, 2Department of Journalism, University of Sheffield
Outline of Talk (November 16, 2001, UMIST Seminar)
The METER Project and the METER Corpus
Text Reuse in the British Press
Construction of the Corpus
Structure of the Corpus
Annotation of the Corpus
Preliminary Experiments with the Corpus
Conclusion/Discussion
The METER Project and the METER Corpus
The MEasuring TExt Reuse (METER) project aims
to investigate how text is reused in the production of newspaper articles from newswire sources
to determine whether algorithms can be discovered to detect and quantify such reuse automatically
From this hope to gain broader insights into the nature of text derivation and paraphrase
newspaper-newswire scenario provides an ideal initial case study
newspaper-newswire scenario has considerable potential application
To assist in this study have constructed the METER corpus containing
newswire source texts
newspaper articles reporting the same stories
some derived from the newswire texts
some not derived from the newswire texts
The Text Derivation Game
A? C? B?
Text Reuse in the British Press
The Press Association (PA) is the national news agency for the UK and Ireland
provides regional, national and international news 24 hours a day, 365 days a year to media customers throughout Britain + abroad
daily sources 1,500 news, sport and feature stories
also supplies finance, arts and entertainment and television listings, and materials for websites, magazines, and periodicals
PA performs a critical function for the British media in setting the news agenda
widely regarded as a credible, authoritative and trustworthy journalistic source
PA is widely reused
directly: cut and paste; paraphrase
indirectly: fact checking; “copy tasting”
Text Reuse in the British Press: Example
The Times
Eamon Reidy, 32, a drink-driver who rammed into Queen Elizabeth the Queen Mother's Daimler, was fined £700 and banned from driving for two years. The Queen Mother was not in car when the accident happened on July 4 in Surrey.
The Telegraph
A driver was almost three times over the limit when he crashed into Queen Elizabeth the Queen Mother's Daimler then fled, a court was told yesterday.
Eamon Reidy, 32, reversed away but crashed his Citroen BX into a wall at Egham, near Windsor Great Park, Surrey. He then ran off and was caught after a mile-and-a-half chase.
The Mirror
A BOOZY driver who smashed into the Queen Mum's chauffeur-driven Daimler minutes after she had been dropped off was banned for two years and fined £700 yesterday.
Eamon Reidy, 32, fled across fields in Windsor Great Park after the crash, the court heard. Grandad John Horton, 56, head gardener on the royal estate, chased him in his slippers for one and a half miles as armed cops, dogs and helicopter joined in the pursuit. John caught up with the fugitive and grabbed his arm. But when Reidy threatened him - "he decided discretion was the better part of valour and let him go," Woking magistrates were told.
Police discovered airport worker Reidy lying in undergrowth near the Queen Mum's Royal Lodge on the Crown estate. He was found to be two-and-a-half times over the legal limit. Reidy, of Langley, Berks, admitted drink-driving and failing to stop.
The Sun
A DRUNK driver who ploughed into the Queen Mother's limo was fined £700 and banned for two years yesterday.
Eamon Reidy, 32, was 2½ times over the legal limit when he rammed the parked Daimler in a country lane. The Queen Mum - 99 last week - was not in the car at the time but her chauffeur was.
Airport worker Reidy sped off after the smash near Egham, Surrey, on July 4. He glanced off a wall and flattened some bushes before abandoning his Citroen.
Chased
Then he ran 1½ miles across fields chased by crash witness John Horton. Mr Horton finally cornered him - but Reidy threatened him and fled. Reidy, of Langley, Berks, tried to hide in some undergrowth. But he was spotted by a police helicopter and arrested, magistrates in Woking, Surrey, heard.
Defending, Lesley Barry said Reidy was trying to buy a house and had money worries. He had drunk two glasses of champagne at his parents' wedding anniversary party before drinking three pints of strong lager at a pub.
PA version
A drink-driver who ran into the Queen Mother's official Daimler was fined £700 and banned from driving for two years today.
Eamon Reidy, 32, was two-and-a-half times over the drink-drive limit when he rammed the royal car, magistrates in Woking, Surrey, were told.
The 99-year-old Queen Mother was not in the vehicle when the accident happened on July 4 in Bishopsgate, Egham, Surrey.
Magistrates were told that Reidy sped off before abandoning his car, running across fields and hiding in undergrowth until he was spotted by the police helicopter.
Prosecuting Robin Bowen said: “At 8pm the defendant was driving towards Englefield Green in a black Citroen BX and collided with a Daimler limousine, a vehicle which was used on a daily basis by the Queen Mother. She was not in it at the time. It was being driven by a chauffeur
+ 11 sentences
The Star
A DRUNK driver who crashed into the back of the Queen Mum's limo was banned for two years yesterday.
Airport worker Eamon Reidy, 32, was nearly three times the drink-drive limit when he hit the royal Daimler after a two-and-a-half hour session in the pub. He reversed his black E-reg Citroen BX after the crash and hit a wall before fleeing the crash scene. But he was chased for a mile-and-a-half by a passer-by who gave police a description of the Citroen driver. A helicopter and armed police were drafted into the search and Reidy was found hiding in bushes.
The Queen Mother who uses the Daimler daily, was not in the car when it was hit.
Reidy refused to comment after the case at Woking magistrates' court.
He hit the chauffeur-driven car, registration NLT 2, in Bishopsgate Road, Egham, Surrey, last month.
Head gardener John Horton, 56, chased Reidy, who told his pursuer to leave him alone or he would "have him". Reidy was found in bushes by police, but ran off again before he was finally arrested.
+ 11 sentences
Text Reuse in the British Press: Utility of Measuring Reuse
Like most newswire agencies, PA does not monitor uptake or dissemination of the copy it releases because it lacks
tools
technologies
conceptual framework
for measuring reuse
Potential applications of accurately measuring reuse include:
monitoring of source take-up to identify unused or little used stories
identifying the most reused stories within the British media
determining customer dependencies on PA copy
new methods for charging customers based upon the amount of copy reused
Construction of the Corpus
Texts of the METER corpus were collected manually
from the PA online service
from the paper editions of nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent
Scope of corpus is limited to two domains
British law court reporting
show business stories
Court stories
substantial amount of data in newspapers and PA
regular recurrence in British news
revolve around “facts” -- name of the accused, charge, etc. -- limited scope for journalistic interpretation
Show business
more expansive style -- greater freedom of expression/interpretation
more frivolous, light-hearted manner
Construction of the Corpus (cont)
Temporal extent of corpus is limited to
24 days for court domain
13 days for show business domain
Spread over 1 year period from July 1999 to June 2000
PA stories are classified
Into broad categories: Courts, Showbiz
Stories within these categories called catchlines – e.g. Courts(Axe), Courts(Strangle), Courts(Gamekeeper)
Updates for each catchline, called PA pages, throughout the day
For each selected catchline
All PA pages downloaded
Final Southern paper editions of 9 dailies from next day examined
Selected newspaper articles were scanned and spell-corrected
Construction of the Corpus: Statistics
[Slide content (table or figure) not preserved]
Construction of the Corpus: Story Overlap
[Slide content (table or figure) not preserved]
Structure of the Corpus
[Figure: directory tree of the METER corpus. Labels recovered from the diagram: meter corpus; PA, newspapers; annotated, rawtext; Courts, Showbiz; dates 12.07.99 ... 21.06.00; Catch line 1 ... Catch line N; Page 1 ... Page N (PA); Newspaper 1 ... Newspaper N (newspapers); "Lowest level of alignment"]
Annotation of the Corpus
The METER corpus is annotated at two levels:
The document level – indicating degree of derivation from PA
The word sequence level – indicating extent of text reuse
All annotations were carried out by a single professional journalist
Second judgments are being collected for 5% of the material to validate the annotations
Annotation of the Corpus: Classification at the Document Level
Each document in the newspaper portion of the corpus is classified to indicate its derivational relation to the PA:
Wholly derived (WD) – all content of the target text is derived only from the PA source text
Partially derived (PD) – some content of the target text is derived from the source text. Other sources have also been used
Non-derived (ND) – no content of the target text is derived from the source text. Although verbatim and rewritten text may appear in the target text, the context, overlap of entities or use of source text is not indicative of reuse
Annotation of the Corpus: Classification at the Word Sequence Level
About ½ of the newspaper texts (~450) are annotated at the level of word sequences
Verbatim: text that is reused from PA word-for-word in the same context
Rewrite: text that is reused from PA, but paraphrased to create a different surface appearance. The context is still the same
New: text that does not appear in PA, or that appears to be verbatim or rewritten but is used in a different context.
Annotation of the Corpus: DTD
(required)
Attributes: filename: filename of the text (required)
newspaper: the newspaper name (required)
domain: courts or showbiz (required)
classification: either wholly-derived, partially-derived or non-derived (optional)
pagenumber: the newspaper page number (optional)
date: the date of publication (required)
catchline: the catchline as given by the journalist (required)
Annotation of the Corpus: DTD -- Example
Original PA version:
BANKER'S BITTERNESS LED TO SYSTEMATIC THEFTS
By Lyndsay Moss, PA News
A middle-aged banker who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years today.
Trusted Derek Boe, 48, used some of the money to splash out on holidays, buy a car and a caravan, and pay for expensive home improvements.
Telegraph version:
A BANKER who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years yesterday.
Derek Boe, 48, used some of the money for holidays, to buy a car and a caravan, and to pay for home improvements.
Annotated Telegraph version:
A
BANKER who stole more than
£270,000
from his bosses because he resented younger staff being promoted over his head, was jailed for four years
yesterday.
Derek Boe, 48, used some of the money
for
holidays,
to
buy a car and a caravan, and
to
pay for
home improvements.
Preliminary Experiments with the Corpus
Initial experiments are underway to explore techniques for detecting whether a candidate reused text is wholly derived, partially derived or non-derived from a PA source text.
Techniques being investigated include:
Dotplot
Information retrieval text similarity measures (tf.idf)
Word n-gram overlap measures
50-70% correct identification of document level classification
Statistical alignment techniques
80-90% correct identification of document level classification
Dotplot - visualising patterns of reuse (1)
Can be used to visually identify derived newspaper stories using patterns formed from matching verbatim text.
Using simple combination of n-grams and Dotplot1.
Dotplot immune to change in word order (not substitution).
Useful for displaying relationships between long texts, biological subsequences or software programs.
Specific Dotplot “patterns” indicate relationships between sequences analysed: long diagonal lines imply verbatim text in same order, blocks indicate verbatim blocks re-ordered.
No quantitative score given and relies on human analysis.
1 J. Helfman, Dotplot: a program for exploring self-similarity in millions of lines of text and code, Journal of Computational and Graphical Statistics, 2(2), pp. 153-174, June 1993.
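The matching step behind such a plot can be sketched in a few lines. This is only an illustrative sketch, not Helfman's Dotplot program or the project's code: the function names, the whitespace tokenisation and the default n-gram length are all assumptions made for the example.

```python
def word_ngrams(text, n):
    """Return the word n-grams of a text (as tuples) in reading order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dotplot(source, target, n=2):
    """Return a 0/1 matrix: cell [i][j] is 1 when target n-gram i equals source n-gram j."""
    src = word_ngrams(source, n)
    tgt = word_ngrams(target, n)
    return [[1 if t == s else 0 for s in src] for t in tgt]


def render(matrix):
    """Draw the matrix as text: '#' marks a match, '.' marks no match."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)


# Hypothetical toy inputs: the second text re-orders two blocks of the first.
print(render(dotplot("a b c d e f", "d e f a b c", n=2)))
```

Text copied in the same order shows up as a diagonal run of `#`; a re-ordered block shows up as a shorter diagonal displaced from the main one. As on the slide, the output is a picture to be read by a person, not a score.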
Dotplot - visualising patterns of reuse (2)
[Figure: dotplots for three cases, labelled UNRELATED TEXTS, NON-DERIVED TEXTS and DERIVED TEXTS. Callouts: "This part will show self-reuse (of PA)"; "This part will show any reuse between X and PA"; "Quite obvious diagonals - implies verbatim reuse"; "PA duplicated text"]
Information Retrieval Approach
Used Okapi1 - state-of-the-art Information Retrieval system.
Probabilistic approach using Best Match (BM) set of similarity operators - tested BM25.
Indexed all newspaper articles (removed function words and stemmed).
Used PA copy as the query - removed function words and selected n words (using tf*idf weighting based upon the index).
Calibrated the BM25 scores between PA query and returned newspaper documents - to enable classification.
Would like to try vector-space approach for comparison.
1 S. Robertson et al., Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track, NIST Special Publication 500-242 (TREC-7), pp. 253-264, 1998.
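For orientation, the scoring step can be sketched as follows. This is a commonly used textbook form of BM25, not the Okapi system itself: the `k1` and `b` defaults are conventional values rather than the ones used in the experiments, the idf variant (with `1 +` inside the logarithm so weights stay non-negative) is an assumption, and stemming, function-word removal and query-term selection are left out.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query with a textbook BM25 formula."""
    if not documents:
        return []
    n_docs = len(documents)
    # Average document length; fall back to 1 if every document is empty.
    avg_len = sum(len(doc) for doc in documents) / n_docs or 1
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical toy index: the PA copy supplies the query terms.
newspaper_docs = [["driver", "fined", "daimler"], ["singer", "tour", "album"]]
print(bm25_scores(["driver", "daimler"], newspaper_docs))
```

Raw BM25 scores are not comparable across queries of different lengths, which is presumably why the scores had to be calibrated before being used for classification.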
N-grams
Extracted unique word n-grams from newspaper and PA texts.
Hypothesis: derived texts will share longer n-grams than non-derived texts.
Compared PA and newspaper texts using set-theoretic association scores.
Dice, Jaccard, resemblance, containment, cosine.
Compared texts for n-grams between 1 and 10 words.
Tried removing function words, morphological analysis and removing direct quotes.
Found hypothesis to be true, with two exceptions: quotes (direct and indirect) produce unexpected 10-grams in non-derived texts, and rewriting can leave wholly-derived texts with no shared 10-grams.
Experimenting with approximate word matching e.g. edit distance to allow more 10-grams to match WD texts.
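The set-theoretic scores listed above can be sketched as follows. This is a minimal sketch assuming whitespace tokenisation and the usual set-based definitions; resemblance is omitted, and taking containment relative to the newspaper text is an assumption about direction, not something the slides state.

```python
import math


def ngram_set(text, n):
    """Return the set of unique word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_scores(pa_text, paper_text, n):
    """Compare the unique n-grams of a PA text and a newspaper text."""
    pa, paper = ngram_set(pa_text, n), ngram_set(paper_text, n)
    if not pa or not paper:
        # One text is shorter than n words: no n-grams to compare.
        return {"dice": 0.0, "jaccard": 0.0, "containment": 0.0, "cosine": 0.0}
    shared = len(pa & paper)
    return {
        "dice": 2 * shared / (len(pa) + len(paper)),
        "jaccard": shared / len(pa | paper),
        # Assumed direction: share of the newspaper's n-grams also found in PA.
        "containment": shared / len(paper),
        "cosine": shared / math.sqrt(len(pa) * len(paper)),
    }


# Scores for n-gram lengths 1 to 10 on two hypothetical toy sentences.
pa = "a drink-driver who ran into the royal car was fined and banned today"
paper = "a driver who ran into the royal car was fined and banned yesterday"
for n in range(1, 11):
    print(n, round(overlap_scores(pa, paper, n)["containment"], 2))
```

On such a pair the scores fall as n grows, because a single substituted word breaks every n-gram that spans it; this is the effect that makes long n-grams rare even in rewritten wholly-derived texts.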
Longest common substrings
Extract the longest common substrings between PA and newspaper texts.
Hypothesis: derived texts will have more and longer common substrings than non-derived texts
Using Greedy String Tiling1 - an efficient and popular method used in plagiarism detection.
Finds maximal tiling between two texts. Greedy because longer matches are preferred; between matches of the same length, the first one found is used.
Gets over limitations of longest common subsequence in that re-ordering does not affect the GST algorithm.
Used the longest common substring length, the mean tile length, the standard deviation of tile lengths and the PA and newspaper file lengths as features to a classifier.
1 M. Wise, YAP3: improved detection of similarities in computer programs and other texts, presented at SIGCSE'96, pp. 130-134, 1996.
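The tiling idea and the features built on it can be sketched as follows. This is a simplified quadratic version of Greedy String Tiling over word tokens, without the Karp-Rabin hashing that YAP3 uses for speed; the function names, the `min_match_length` default and the use of population standard deviation are assumptions for the example.

```python
from statistics import mean, pstdev


def greedy_string_tiling(a, b, min_match_length=3):
    """Return tiles (start_a, start_b, length) covering matching token runs, longest first."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match_length
        matches = []
        # Find the longest runs of equal, still unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        # Turn the maximal matches into tiles, first found first, skipping overlaps.
        for i, j, k in matches:
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                marked_a[i:i + k] = [True] * k
                marked_b[j:j + k] = [True] * k
                tiles.append((i, j, k))
        if max_match <= min_match_length:
            return tiles


def tile_features(a, b, tiles):
    """Build the classifier features named on the slide from a tiling."""
    lengths = [k for _, _, k in tiles]
    return {
        "longest": max(lengths, default=0),
        "mean": mean(lengths) if lengths else 0.0,
        "stdev": pstdev(lengths) if lengths else 0.0,
        "pa_length": len(a),
        "newspaper_length": len(b),
    }


# Hypothetical toy texts: the second re-orders two blocks of the first.
pa = "the queen mother was not in the car".split()
paper = "was not in the car the queen mother".split()
tiles = greedy_string_tiling(pa, paper, min_match_length=2)
print(tiles, tile_features(pa, paper, tiles))
```

Both blocks are recovered as tiles even though their order is swapped, which is the property that distinguishes tiling from a longest common subsequence.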
Alignment
The METER task is construed as a translation task – from the source to derived texts.
First align sentences between the candidate source and derived texts.
Align by pair-wise comparison between all sentences using a similarity score – some sentences may fail to be aligned.
Based on the successfully aligned sentences, estimate the probability that the document is derived from the PA.
Combined with machine learning classifier, the candidate derived texts are classified into WD, PD or ND.
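The pair-wise sentence comparison can be sketched as follows. This is not the project's alignment model: the Dice word-overlap similarity, the 0.5 threshold, the regular-expression sentence splitter and the use of the aligned fraction as a stand-in for the derivation probability are all assumptions chosen to keep the example small.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_dice(a, b):
    """Dice coefficient over the sets of words in two sentences."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def align(source_text, derived_text, threshold=0.5):
    """Pair each derived sentence with its most similar source sentence, or None."""
    source = split_sentences(source_text)
    alignments = []
    for sentence in split_sentences(derived_text):
        best = max(source, key=lambda s: word_dice(sentence, s), default=None)
        score = word_dice(sentence, best) if best is not None else 0.0
        alignments.append((sentence, best if score >= threshold else None, score))
    return alignments


def aligned_fraction(alignments):
    """Share of derived sentences that found a source sentence above the threshold."""
    if not alignments:
        return 0.0
    return sum(1 for _, match, _ in alignments if match is not None) / len(alignments)


# Hypothetical toy texts.
pa = "The driver was fined. He ran across fields."
paper = "The driver was fined yesterday. Police used dogs."
print(aligned_fraction(align(pa, paper)))
```

A value like this could serve as one input feature to the WD/PD/ND classifier; sentences left unaligned correspond to the "some sentences may fail to be aligned" case on the slide.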
Comparison of initial results - courts domain
Randomly selected 166 files from all domains (used all WD files).
Used same files for all approaches.
Used results as features to Naïve Bayes classifier.
Comments on the results
Alignment gives the best results – but not clear if they are significantly better
Although alignment is best, the very simple n-gram approach is close behind.
N-gram intuition correct: WD share more long n-grams than ND, but for classification not enough examples of 10-grams exist. Therefore 1-grams give best classification.
Direct quotes and rewriting cause problems with exact matching.
Problem in classification is between the PD and ND classes. WD separates well whichever method used.
Combining the features from different approaches together with some computational approximations to human judgements might work best, e.g. use longest strings, distribution of matches, types of matches etc.
Conclusion/Discussion (1)
Have presented the METER corpus
first corpus to attempt to support the study of (legitimate) text reuse
first corpus to attempt to systematically align source/derived text in the journalistic world
Texts are derived from two domains (Courts and Showbiz) over a period of one year
Texts are annotated at two levels
Document level – a coarse indication of derivation/reuse
Word sequence level – a fine-grained indication of derivation/reuse
Conclusion/Discussion (2)
Corpus is limited in terms of
Scope (2 domains only)
Temporal extent (36 days over 1 year only)
Size (1717 stories in total)
Annotation content (no links back to source texts)
Annotation accuracy (one annotator; evolving conception of annotation guidelines)
Primary purpose is to serve as a pilot – if useful/interesting subsequent versions or related corpora can be created
Distribution through ELRA/LDC underway
Conclusion/Discussion (3)
Experiments with various string matching algorithms have been carried out with a view to automatically classifying texts as
wholly derived (WD)
partially derived (PD) or
non-derived (ND)
from PA source
Results suggest the WD-ND distinction can be quite accurately captured, with alignment techniques apparently performing best; the 3-way WD-PD-ND distinction is harder to capture
Algorithms to test derivation at the word sequence level are currently being tested
An open question is how to utilize richer linguistic models for this task