Slide1: 

INIST Machine-Aided Indexing
Abdelmajid Khayari, Stéphane Schneider
INIST/CNRS, France
NFAIS Forum, New York, April 22, 2005

Slide2: 

INIST: Institute for Scientific and Technical Information
- A service of the French CNRS.
- Activities: collection, analysis and dissemination of the results and findings of worldwide research.
- Fields covered: science, technology, medicine, humanities and social sciences.
- Leading scientific and technical document supplier in France.
- Producer of the multilingual, multidisciplinary bibliographic databases PASCAL, FRANCIS and ISD, covering the core worldwide scientific literature.
INIST Machine-Aided Indexing. NFAIS forum. New York. April 22, 2005.

Slide3: 

INIST (continued)
- Provider of customized services to the scientific community (portals, current awareness, training, etc.).
- Partner in open access initiatives.
- Research partner in the ICT (information and communication technologies) community.

Slide4: 

Aims and scope
Introducing a degree of automation into the indexing process:
- To what extent can the process be automated?
- Which approach is suitable?
- What are the prerequisites?
Evaluation of the final result:
- Is there support from the indexers?
- Does it meet the expectations of the indexers?
- Are the results acceptable?

Slide5: 

Current indexing practices
- About 70 internal and external specialized indexers.
- Documents in diverse languages (mainly English, French, Spanish and German).
- Semi-manual assignment of descriptors and classification codes.
- Use of controlled vocabularies and classification schemes, supplemented with free keywords.
- Multilingual descriptors.

Slide6: 

Workstation
- In-house platform, accessed by INIST indexers over the intranet since 2000.
- Setup and fine-tuning of the indexing programs.
- Collaborative work on terminology resources.
- About 2000 input records processed each night.
- Fully automated indexing for periodicals that are not manually analyzed (May 2004).

Slide7: 

Two indexing programs
Lexical method:
- Uses equivalence rules gathered in subject terminology resources to assign descriptors and classification codes to documents.
Statistical approach (lexical collocation), a two-stage process:
- Training stage: a corpus of human-indexed citations is processed to create association dictionaries.
- Indexing stage: using the association dictionaries, controlled-vocabulary descriptors and classification codes are assigned to the incoming documents.
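The two-stage statistical approach can be illustrated with a minimal sketch. The slides do not say how the association dictionaries are weighted, so the weighting used here (the share of training citations containing a word that also carry a given descriptor) is an assumption, as are the function names `train` and `index` and the toy corpus.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Simplistic lowercase word split; stands in for real text processing.
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w]

def train(corpus):
    """Training stage: build an association dictionary word -> {descriptor: weight}."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text, descriptors in corpus:
        for word in set(tokenize(text)):
            word_counts[word] += 1
            for descriptor in descriptors:
                pair_counts[word][descriptor] += 1
    # Assumed weight: fraction of citations containing the word that carry the descriptor.
    return {
        word: {d: n / word_counts[word] for d, n in descs.items()}
        for word, descs in pair_counts.items()
    }

def index(text, associations, top_n=3):
    """Indexing stage: score descriptors for an incoming document."""
    scores = Counter()
    for word in set(tokenize(text)):
        for descriptor, weight in associations.get(word, {}).items():
            scores[descriptor] += weight
    return [d for d, _ in scores.most_common(top_n)]

# Toy human-indexed corpus (illustrative only).
corpus = [
    ("Chronic pain in rats", ["Pain", "Animal"]),
    ("Pain management in adults", ["Pain", "Human"]),
    ("Motor control in rats", ["Motor control", "Animal"]),
]
associations = train(corpus)
candidates = index("Pain response in rats", associations, top_n=2)
```

With this toy corpus, "Pain" and "Animal" come out as the two strongest candidates for the new citation. A real system would at least filter stop words such as "in", which here contribute weight to every descriptor.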

Slide8: 

Lexical method
Text processing of bibliographic records (titles + abstracts + author keywords):
- Parsing text into phrases and phrases into words.
- Lemmatization.
Matching with subject terminology resources:
- Searching for terms that correspond to descriptors and classification codes.
- Searching intervals for compound terms (2 to 6 words).
Ordering of candidate keywords and classification codes:
- Semantic categories are used to construct the indexing grid.
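The text-processing and matching steps above can be sketched as follows. This is one possible reading, not the INIST implementation: the terminology entries are hypothetical, the lemmatizer is a toy that only strips a plural "s", and "searching intervals" is interpreted as sliding a window of up to 6 words over each phrase, longest match first.

```python
import re

# Hypothetical terminology resource: lemmatized term -> descriptor.
TERMINOLOGY = {
    "rat": "Rat",
    "pointing task": "Pointing task",
    "motor control": "Motor control",
}

def lemmatize(word):
    # Toy lemmatizer (assumption): lowercase and strip a plural "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def split_phrases(text):
    # Phrases are delimited by punctuation, so a term never spans a boundary.
    return [p for p in re.split(r"[.,;:!?()]", text) if p.strip()]

def match_terms(text, terminology=TERMINOLOGY, max_len=6):
    """Return descriptors whose terms occur in the text, in order of appearance."""
    found = []
    for phrase in split_phrases(text):
        lemmas = [lemmatize(w) for w in phrase.split()]
        i = 0
        while i < len(lemmas):
            # Try the longest window first (up to max_len words), then shorter ones.
            for n in range(min(max_len, len(lemmas) - i), 0, -1):
                candidate = " ".join(lemmas[i:i + n])
                if candidate in terminology:
                    found.append(terminology[candidate])
                    i += n
                    break
            else:
                i += 1  # no term starts at this word
    return found

descriptors = match_terms("Pointing tasks in rats. Motor control was impaired.")
```

Trying the longest window first means a compound term such as "pointing task" is preferred over any shorter term that starts at the same word.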

Slide9: 

Lexical method (continued)
Generation of additional descriptors and codes: each keyword or code may trigger another one through association rules, in a cascade-like manner:
- Rat -> Animal / Acropulpitis -> Finger
- Pointing task -> Manual task -> Motor control
- Pointing task + Vision -> Visuomotor integration
Ranking of keywords and codes.
Filtering of the assigned elements: the number of occurrences for each category is used as a filter to set the desired number of candidate descriptors.
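The cascade of association rules can be sketched as a simple forward-chaining loop that applies rules until nothing new is produced. The rule contents are the examples from the slide; the representation (a set of premises paired with one conclusion) and the function name `expand` are assumptions made for illustration.

```python
# Each rule: (set of descriptors that must all be present, descriptor to add).
RULES = [
    ({"Rat"}, "Animal"),
    ({"Acropulpitis"}, "Finger"),
    ({"Pointing task"}, "Manual task"),
    ({"Manual task"}, "Motor control"),
    ({"Pointing task", "Vision"}, "Visuomotor integration"),
]

def expand(descriptors, rules=RULES):
    """Apply the rules repeatedly until no new descriptor is generated."""
    result = set(descriptors)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= result and conclusion not in result:
                result.add(conclusion)
                changed = True
    return result

expanded = expand({"Pointing task", "Vision"})
```

Starting from "Pointing task" and "Vision", the loop adds "Manual task", then "Motor control" (triggered by the newly added "Manual task"), and "Visuomotor integration" from the two-premise rule.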

Slide10: 

Lexical method (continued)
- Checking by human indexers: candidate descriptors and codes are validated and completed.
- Continuous feedback on terminology resources runs in parallel: new equivalence rules are introduced, new data (synonyms) are added to existing rules, and noise-producing rules are deleted.
- Feedback on the indexing program (changing parameters and the combination of terminology resources).

Slide11: 

Initial version
- Deployment of a basic version using the two methods.
- A semi-automated process is used to construct the subject terminology resources.
- Bibliographic databases are used to extract corpora dealing with a specific subject (e.g. pain).
- Corpora are processed to extract a ranked list of descriptors, which is run against the controlled vocabulary to extract synonyms, translations, semantic categories, etc.
- This core thematic dictionary is enriched with new concepts and new data.
- The performance of the dictionary is improved through an iterative indexing and re-indexing process.
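The construction of a core thematic dictionary can be sketched roughly as below. The vocabulary entries, the field names (`synonyms`, `fr`, `category`), the frequency threshold `min_count` and the function name are all hypothetical; the slides only state that a ranked list of descriptors is run against the controlled vocabulary.

```python
from collections import Counter

# Hypothetical controlled vocabulary entries (all values are illustrative).
VOCABULARY = {
    "Pain": {"synonyms": ["Ache"], "fr": "Douleur", "category": "Symptom"},
    "Analgesic": {"synonyms": ["Painkiller"], "fr": "Analgésique", "category": "Drug"},
}

def build_core_dictionary(records, vocabulary=VOCABULARY, min_count=2):
    """Rank descriptors by frequency in the corpus, then look each one up."""
    counts = Counter(d for descriptors in records for d in descriptors)
    core = {}
    for descriptor, n in counts.most_common():
        # Keep frequent descriptors that the controlled vocabulary knows about.
        if n >= min_count and descriptor in vocabulary:
            core[descriptor] = {"count": n, **vocabulary[descriptor]}
    return core

# Descriptor lists of records extracted for one subject (toy data).
records = [
    ["Pain", "Analgesic", "Rat"],
    ["Pain", "Analgesic"],
    ["Pain", "Chronic disease"],
]
core = build_core_dictionary(records)
```

The resulting dictionary is only a starting point; as the slide notes, it is then enriched by hand and refined through repeated indexing runs.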

Slide12: 

Initial version (continued)
- Sharing of thematic resources: someone else may already have the dictionary of diseases or of geographical names that I need.
- Access to full-text articles (OCR-processed or obtained directly from publishers).
- Direct feedback to administrators and developers is built in.
- Evaluation of performance on each citation or collection of citations: the final indexing is compared with the initial indexing proposed by the program.

Slide13: 

Evaluation
Indexers' support for MAI was not easy to obtain:
- Strong psychological reluctance at the beginning ("the machine will never be able to perform a highly intellectual task such as abstraction").
- The crucial need to formalize specialist knowledge is becoming well understood.
- Many concerns about fully automated indexing, since the reference standard is human-produced indexing (e.g. a candidate descriptor that is not inaccurate per se will be counted as wrong if the human indexer did not include it in the final record).

Slide14: 

Evaluation (continued)
- The lexical method is the one predominantly used by indexers. The statistical method is used mainly to determine classification codes.
- Statistics are obtained by comparing machine indexing with the final record after human revision.
- Performance rises with the degree of improvement of the terminology resources (in pilot subject fields, up to 80% of candidate descriptors are accurate).
- Unsuccessful machine indexing triggers feedback on the computer programs and on the terminology resources.
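The comparison between machine indexing and the final record can be sketched as a set comparison. Reading "accurate candidate descriptors" as the share of proposed descriptors that survive human revision (precision) is an assumption; recall and the lists of rejected and added descriptors are included only as natural by-products of the same comparison.

```python
def evaluate(proposed, final):
    """Compare machine-proposed descriptors with the human-validated record."""
    proposed, final = set(proposed), set(final)
    kept = proposed & final
    return {
        # Share of machine candidates kept by the indexer.
        "precision": len(kept) / len(proposed) if proposed else 0.0,
        # Share of the final record that the machine had proposed.
        "recall": len(kept) / len(final) if final else 0.0,
        "rejected": sorted(proposed - final),  # candidates for rule clean-up
        "added": sorted(final - proposed),     # candidates for new rules
    }

report = evaluate(
    proposed=["Rat", "Animal", "Pain", "Finger", "Vision"],
    final=["Rat", "Animal", "Pain", "Finger", "Analgesic"],
)
```

In this toy case four of the five proposed descriptors are kept, so precision is 0.8. The rejected and added lists are the kind of information that could feed the feedback loop on programs and terminology resources described above.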

Slide15: 

Evaluation (continued)
During the deployment phase, time is not always saved, because feedback on terminology resources is time-consuming. Nevertheless, the benefits are real in terms of:
- Indexing consistency (less intra- and inter-individual variation).
- Indexers' expertise and the knowledge acquired during the abstracting and indexing processes are integrated into an organizational resource (knowledge capitalization and sharing).

Slide16: 

Future trends
Improvement of the indexing programs:
- Improved extraction of textual patterns (genes, etc.).
- Introduction of advanced natural language processing: extraction of concepts and of relationships between concepts.
- Improved pre-classification of citations, so that the right combination of subject resources can be assigned.
Creation of a unified terminology database.