Presentation Transcript
Slide1: CERN European Organization for Nuclear Research
Automatic Keyword Assignment for High Energy Physics Literature
Arturo Montejo Ráez
ETT/SI Data Handling Group - CERN, Geneva (Switzerland)
Joint Research Center, Ispra (Italy) - 4 March 2002
Slide2: What we are going to see today...
(Slide header: Automatic Keywording for HEP literature, Ispra (Italy), 4 March 2002; CERN European Organization for Nuclear Research; Data Handling Group)
Keyword assignment process
Why keywords?
How it is done for High Energy Physics papers
The HEPindexer project: Data, Algorithm, Experiments, Results
Future work
Slide3: Keyword assignment process
[Diagram: Authors, Indexer, Keyworded papers]
Slide4: Keyword assignment process
Slide5: Keyword assignment process
The document...
Full text paper
Stored in a database
Simplified representation needed
Slide6: Keyword assignment process
The thesaurus...
Controlled vocabulary of concepts
Relationships between keywords
Categories and subcategories
Can be domain specific
Can be translated into multiple languages
Slide7: Keyword assignment process
The thesaurus: a relational model for terms
cheese
MT 6016 processed agricultural produce
BT1 milk product
NT1 blue-veined cheese
NT1 cow's milk cheese
NT1 fresh cheese
NT1 goat's milk cheese
NT1 hard cheese
NT1 processed cheese
NT1 semi-soft cheese
NT1 sheep's milk cheese
NT1 soft cheese
RT cheese factory (6031)
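As a minimal sketch of how such an entry could be held in memory, the relations MT (microthesaurus), BT (broader term), NT (narrower term) and RT (related term) map onto a small record type. The class name, field names and helper function are illustrative assumptions, not part of any thesaurus software.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusTerm:
    """One thesaurus entry with its relations (field names are illustrative)."""
    label: str
    microthesaurus: str = ""                      # MT: subject field of the term
    broader: list = field(default_factory=list)   # BT: more general terms
    narrower: list = field(default_factory=list)  # NT: more specific terms
    related: list = field(default_factory=list)   # RT: associated terms


# The "cheese" entry from the slide, transcribed into the record type.
cheese = ThesaurusTerm(
    label="cheese",
    microthesaurus="6016 processed agricultural produce",
    broader=["milk product"],
    narrower=[
        "blue-veined cheese", "cow's milk cheese", "fresh cheese",
        "goat's milk cheese", "hard cheese", "processed cheese",
        "semi-soft cheese", "sheep's milk cheese", "soft cheese",
    ],
    related=["cheese factory"],
)


def is_narrower_than(term: ThesaurusTerm, candidate: str) -> bool:
    """True if `candidate` is listed as a direct narrower term of `term`."""
    return candidate in term.narrower


print(is_narrower_than(cheese, "hard cheese"))
```

Keeping each relation as its own list makes it easy to walk up (BT) or down (NT) the hierarchy when expanding a query.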
Slide8: Keyword assignment process
The thesaurus: a subject tree
04 POLITICS
0406 political framework
0411 political party
0416 electoral procedure and voting
0421 parliament
0426 parliamentary proceedings
0431 politics and public safety
0436 executive power and public service
08 INTERNATIONAL RELATIONS
0806 international affairs
0811 cooperation policy
0816 international balance
0821 defence
10 EUROPEAN COMMUNITIES
1006 Community institutions and European civil service
1011 Community law
1016 European construction
1021 Community finance
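In the listing above, each four-digit code appears to start with the two-digit code of its domain (0421 under 04, 0821 under 08). Assuming that prefix convention holds, a sketch of navigating the tree could look like this; the dictionary holds only a few entries from the slide and the function names are hypothetical.

```python
# A few subject-tree entries copied from the slide: code -> label.
SUBJECT_TREE = {
    "04": "POLITICS",
    "0406": "political framework",
    "0421": "parliament",
    "08": "INTERNATIONAL RELATIONS",
    "0821": "defence",
    "10": "EUROPEAN COMMUNITIES",
    "1011": "Community law",
}


def parent_domain(code):
    """Label of the two-digit domain a four-digit code belongs to."""
    return SUBJECT_TREE[code[:2]]


def children(domain_code):
    """Four-digit codes filed under a two-digit domain, in sorted order."""
    return sorted(
        c for c in SUBJECT_TREE
        if len(c) == 4 and c.startswith(domain_code)
    )


print(parent_domain("0821"))
print(children("04"))
```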
Slide9: Keyword assignment process
The indexer...
An expert in the domain of the documents
An expert in the use of the thesaurus
Heavy task
Does not always propose the same keywords
Expensive!
Slide10: Why keywords?
They allow documents to be indexed in a coherent way
Can be viewed as the "index" at the end of a book
Concepts that better represent the content
Human-made (value added)
Meaningful
Can establish relations between documents
Multilingual
Slide11: Why keywords?
Access to documents
But... we already have full-text indexing!
Slide12: Why keywords?
Classification:
To store (libraries)
To access (narrow searches)
[Diagram: Category 1, Category 2, Category 3]
Slide13: Why keywords?
Crosslingual access
[Diagram: Navaja, Razor, Couteau, Lametta; query: Razor?]
Slide14: Why keywords?
Multilingual comparison
[Diagram: Razor, Lametta, Murder, Fabbrica]
Slide15: Why keywords?
Advantages over full-text searches:
No ambiguity
Better relevance and precision
More advanced tools for searching and classification are coming!
Slide16: Why keywords?
The BIG problem... - E X P E N S I V E -
Slide17: Why keywords?
The BIG problem? E X P E N S I V E ?
Slide18: Why keywords?
The BIG problem? E X P E N S I V E ?
Slide19: CERN
The world's largest particle physics centre
Explores what matter is made of, and what forces hold it together
Employs just under 3000 people
6500 scientists come for their research
Slide20: How it is done for High Energy Physics papers
DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)
DESY thesaurus
Group of indexers (students, experts...)
Only High Energy Physics related papers
Slide21: How it is done for High Energy Physics papers
The DESY thesaurus
A
*a4(2040) ('postulated particle, a4(2040)', was delta(2040))
*a6(2450) ('postulated particle, a6(2450)', was delta(2450))
*abelian
*aberration
absorption
-absorptive model (model, absorption)
accelerator
. . .
B
B
B anti-B
B+
B+L number
B*(5320) (excited B)
-B** ('B*2...', similar for B/s, etc.)
*B*2(5732) (postulated particle, B*2(5732))
B-
-B-factory (B, particle source)
B-L number
. . .
Slide22: How it is done for High Energy Physics papers
The DESY thesaurus:
Few categories, rarely used
Only two types of keywords:
main keywords (1191)
secondary keywords (949)
No relationships between terms
Specific terminology
Slide23: How it is done for High Energy Physics papers
The DESY thesaurus: specific terminology
Energy declarations: 1.5-2.7 GeV-cms
Resonances: Delta (1232)
Reaction equations: anti-p p ---> K0 K- pi+
Combinations: angular distribution, (photon), mass spectrum (pi+ pi- pi0)
Two-particle initial state: 'anti-p p', 'electron positron'
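Keywords of this kind carry structure that plain word matching would lose. As a rough sketch, a reaction equation such as the one above can be split into its initial and final states. The grammar assumed here (a single arrow made of hyphens and ">", particles separated by whitespace) is a guess based on the one example on the slide, and `parse_reaction` is a hypothetical helper.

```python
import re


def parse_reaction(equation):
    """Split 'initial ---> final' into two lists of particle names."""
    # The arrow is one or more hyphens followed by '>'. Hyphens inside
    # particle names (anti-p, K-) are not followed by '>', so they survive.
    parts = re.split(r"\s*-+>\s*", equation.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"no reaction arrow found in {equation!r}")
    initial, final = (side.split() for side in parts)
    return initial, final


initial, final = parse_reaction("anti-p p ---> K0 K- pi+")
print(initial, final)
```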
Slide24: How it is done for High Energy Physics papers
The problem
More than 500 preprints per week!
[Diagram: Physicists, Indexer]
Slide25: The HEPindexer project
The solution
[Diagram: Physicists, Indexer, Keyworded papers]
Slide26: The HEPindexer project
Use of IR techniques
Objective evaluation
Real-time answer
Easily portable
Fully integrable into CDS
Possibility of growth
Usable as a fully automatic tool or as an aid for the indexer
Slide27: The HEPindexer project
[Diagram: Keyworded papers (collection), Keyword, Term]
Slide28: The HEPindexer project
[Diagram: Keyword, Term, DESY keywords, Documents]
Slide29: The HEPindexer project
Data
3,661 documents (2,441 in the training collection, 1,220 in the test collection)
19,143 terms
1,191 main keywords
Slide30: The HEPindexer project
Algorithm
Slide31: The HEPindexer project
Algorithm: Preprocessing
Punctuation
Lower case
Remove stop words
Stemming
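The four preprocessing steps above can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule below are deliberately tiny stand-ins chosen for illustration. The slides do not say which stop-word list or stemmer HEPindexer uses, so a real system would substitute a full list and a proper stemming algorithm.

```python
import string

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Naive suffix stripping stands in for a real stemmer.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "s")

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Punctuation removal, lower-casing, stop-word removal, stemming."""
    tokens = text.translate(_PUNCT_TO_SPACE).lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The decays of B mesons, measured in collisions."))
```

Note that removing punctuation wholesale would also break up notation such as "anti-p" or "pi+", which is one reason domain-specific handling may be needed for physics text.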
Slide32: The HEPindexer project
Algorithm
Weight term - document
Weight keyword - document
Weight keyword - term
Similarity keyword - document
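The formulas behind these four quantities did not survive in the transcript, so the following is only one plausible instantiation, not the HEPindexer algorithm itself. It assumes TF-IDF for the term-document weight, a 0/1 keyword-document weight (1 when the keyword was assigned to the training document), a keyword-term weight obtained by summing term weights over the documents carrying the keyword, and cosine similarity between a keyword's term vector and a new document. The toy training data and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training collection: (preprocessed terms, assigned keywords).
TRAINING = [
    (["quark", "mass", "lattice"], {"quark", "lattice field theory"}),
    (["quark", "decay", "meson"], {"quark", "meson"}),
    (["neutrino", "oscillation", "mass"], {"neutrino"}),
]


def idf_table(docs):
    """Inverse document frequency of each term over the training documents."""
    n = len(docs)
    df = Counter(t for terms, _ in docs for t in set(terms))
    return {t: math.log(n / count) + 1.0 for t, count in df.items()}


def term_doc_weights(terms, idf):
    """Weight term-document: term frequency times IDF (assumed scheme)."""
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}


def keyword_term_weights(docs, idf):
    """Weight keyword-term: term weights summed over documents with the keyword."""
    weights = defaultdict(lambda: defaultdict(float))
    for terms, keywords in docs:
        doc_vec = term_doc_weights(terms, idf)
        for k in keywords:  # keyword-document weight: 1 if assigned, else 0
            for t, w in doc_vec.items():
                weights[k][t] += w
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(terms, docs):
    """Keywords ordered by similarity keyword-document, best first."""
    idf = idf_table(docs)
    doc_vec = term_doc_weights(terms, idf)
    kw_vecs = keyword_term_weights(docs, idf)
    scores = {k: cosine(vec, doc_vec) for k, vec in kw_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_keywords(["neutrino", "mass"], TRAINING)
print(ranking[0])
```

In such a scheme the top-ranked keywords above some cutoff would be proposed for the new document; how the cutoff is chosen is left open here.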
Slide33: The HEPindexer project
Experiments
Slide34: The HEPindexer project
Experiments
[Venn diagram: sets A and B with intersection A∩B, inside the keywords in the training collection]
A: keywords proposed by DESY
B: keywords proposed by HEPindexer
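With A and B defined this way, the usual evaluation measures follow from the intersection: precision is |A∩B| / |B| and recall is |A∩B| / |A|. The sketch below computes both for one document. The keyword sets are made up for illustration, and how the talk averaged the scores over the test collection is not stated.

```python
def precision_recall(desy_keywords, proposed_keywords):
    """Precision and recall of proposed keywords against the DESY ones."""
    a, b = set(desy_keywords), set(proposed_keywords)
    common = a & b  # keywords both DESY and the system agree on
    precision = len(common) / len(b) if b else 0.0
    recall = len(common) / len(a) if a else 0.0
    return precision, recall


# Hypothetical keyword sets for one document (not from the experiments).
desy = {"quark", "meson", "lattice field theory", "mass"}
hepindexer = {"quark", "meson", "decay"}

p, r = precision_recall(desy, hepindexer)
print(f"precision={p:.3f} recall={r:.3f}")
```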
Slide35: The HEPindexer project
Results
52.7% precision
58.5% recall
Response in 2 seconds
Slide36: The HEPindexer project
Results
Slide37: The HEPindexer project
Results
Slide38: The HEPindexer project
Software
C++ / STL
UNIX
Command line interface
Digilib: Web interface (PHP)
http://cern.ch/digilib
Installation on the CERN Document Server
http://cds.cern.ch
Slide39: The HEPindexer project
Software
Slide40: The HEPindexer project
Software
Slide41: The HEPindexer project
Software
Slide42: Future Work
Automatic proposition of secondary keywords
Improve the algorithm (lemmatizer, multiwords, segmentation...)
Use of references to link documents based on common concepts
Specific algorithms for handling energies, particle decays, disintegrations, etc.
Agents
OAI
Apply Semantic Web approaches