C3 Yudina


Presentation Transcript

Slide1: 

Tatyana N. Yudina
yudina@mail.cir.ru
University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium)
www.cir.ru
Research Computing Center of Moscow State University
NCO Center for Information Research

Plan: 

Plan
-- UIS RUSSIA. General
-- Thesaurus
-- ALTP
-- Bilingual Information Retrieval
-- Text Categorization
-- Examples

Slide3: 

University Information System RUSSIA
Collections: 1,500,000 / 17.5 GB (www.cir.ru)

Slide4: 

NLP technology in UIS RUSSIA
[Diagram labels: holdings; convertors; *.HTM; Automatic Linguistic Text Processing / Linguistic Processors; *.POD, *.OUT, *.LEM, *.HDR; ORACLE; WEB www.cir.ru (Apache; OAS); Administrator]

UIS RUSSIA: 

UIS RUSSIA
Collections of documents in English:
-- RePEc (Research Papers in Economics, www.repec.org): abstracts and full texts
-- Council of Europe / European Court of Human Rights documents
-- UNESCO documents

Thesaurus: 

Thesaurus

Slide13: 

Sociopolitical Thesaurus
-- 29,000 concepts, 75,000 terms, 110,000 conceptual relations
-- constructed specifically as a tool for automatic text processing
-- contains terms from the economic, financial, political, military, social, legislative and cultural domains
-- its set of relations is adapted to information-retrieval applications
-- regularly tested during automatic text processing

THESAURUS for Information Retrieval in Sociopolitical Domain: 

THESAURUS for Information Retrieval in Sociopolitical Domain
-- The Thesaurus supports query refinement (reformulation and expansion)
-- Thesaurus terminology covers 95-98% of the words and terms in Russian government publications, academic papers and mass media texts from 1991 onward
-- The Thesaurus is a core element of ALTP (automatic linguistic text processing) technology
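
The query expansion mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `THESAURUS` entries, the `narrower` relation name and the `expand_query` helper are hypothetical and are not taken from UIS RUSSIA. The idea shown is only that a query term is mapped to its concept, and the terms of that concept and of its narrower concepts are added to the query.

```python
from collections import deque

# Hypothetical toy thesaurus: concept -> its terms and its narrower concepts.
THESAURUS = {
    "TAX": {"terms": ["tax", "taxation"], "narrower": ["INCOME TAX", "VAT"]},
    "INCOME TAX": {"terms": ["income tax"], "narrower": []},
    "VAT": {"terms": ["value added tax", "VAT"], "narrower": []},
}


def term_to_concept(thesaurus):
    """Build a lowercase term -> concept lookup table."""
    index = {}
    for concept, entry in thesaurus.items():
        for term in entry["terms"]:
            index[term.lower()] = concept
    return index


def expand_query(query, thesaurus, max_depth=1):
    """Return the terms of the query's concept plus terms of narrower
    concepts down to max_depth levels (breadth-first)."""
    start = term_to_concept(thesaurus).get(query.lower())
    if start is None:
        return [query]  # unknown term: leave the query unchanged
    expanded, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        concept, depth = queue.popleft()
        expanded.extend(thesaurus[concept]["terms"])
        if depth < max_depth:
            for child in thesaurus[concept]["narrower"]:
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return expanded


print(expand_query("taxation", THESAURUS))
```

Limiting the depth keeps the expansion from pulling in the whole hierarchy; a real thesaurus with 110,000 relations would need a more careful choice of which relation types to follow.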

Sociopolitical Thesaurus vs. Legislative Indexing Vocabulary (LIV): 

Sociopolitical Thesaurus vs. Legislative Indexing Vocabulary (LIV)

General Structure of Thesaurus: 

General Structure of Thesaurus

Query Refinement: 

Query Refinement

Navigation in Thesaurus : 

Navigation in Thesaurus

ALTP: 

ALTP: Automatic Linguistic Text Processing
-- Conceptual Indexing
-- Automatic Coherent Summarisation
-- Automatic Text Categorisation

Term Extraction for Russian Official Documents (RF Government Regulation N604 26.06.1995): 

Term Extraction for Russian Official Documents (RF Government Regulation N604 26.06.1995)

Slide21: 

Thematic Lines of Thesaurus Terms (RF Government Regulation N604 26.06.1995)

Slide22: 

Network of Thematic Nodes (RF Government Regulation N604 26.06.1995)

Slide23: 

Network of Thematic Nodes (RF Government Regulation N604 26.06.1995)

Slide24: 

Structure of Thematic Representation
-- Main Thematic Nodes
-- Specific Thematic Nodes

Structural Thematic Summary (RF Government Regulation N604 26.06.1995): 

Structural Thematic Summary (RF Government Regulation N604 26.06.1995)

Bilingual Information Retrieval: 

Bilingual Information Retrieval

English-Russian Sociopolitical Thesaurus: 

English-Russian Sociopolitical Thesaurus
-- Hierarchical conceptual net of 65 thousand English terms
-- Manual work
-- Use of general and specialized English-Russian dictionaries
-- Study of conventional American and British dictionaries and information-retrieval thesauri
-- Cross-checking of translations
-- Addition of multiword variants
-- Internet checks

English-Russian Sociopolitical Thesaurus: testing and use in new applications: 

English-Russian Sociopolitical Thesaurus: testing and use in new applications
-- Automatic text categorization of economic papers and abstracts using JEL subject headings (700 categories) (supported by Ford Foundation, USA)
-- Automatic text processing of statistical tables (in cooperation with Berkeley University, USA)
-- Automatic text processing of European documents (European Court of Human Rights, Council of Europe, European Union): problems of harmonization of Russian legislation

Thesaurus Terminology in Sociopolitical Domain: 

Thesaurus Terminology in Sociopolitical Domain

Adding languages to Sociopolitical Thesaurus: 

Adding languages to Sociopolitical Thesaurus
-- Developing a multilingual Sociopolitical Thesaurus is a challenge: terms of the sociopolitical domain from different languages must be described in the same hierarchical net
-- A project under discussion: adding the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia

Approach to Organization of Bilingual Search in UIS RUSSIA: 

Approach to Organization of Bilingual Search in UIS RUSSIA
Development of a bilingual ontology in the sociopolitical domain, based on the Russian Sociopolitical Thesaurus for automatic text processing

Thematic Bilingual Indexing: 

Thematic Bilingual Indexing
[Diagram labels: Russian-language document; English-language document; English-language representation; Russian-language representation; Thematic representation of document content]
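
The bilingual indexing scheme on this slide can be read as follows: a document in either language is mapped to a language-independent thematic representation, which can then be rendered in Russian or in English. The sketch below illustrates that reading only. The `CONCEPTS` table, `index_document` and `render` are hypothetical names, and naive substring matching stands in for the linguistic processing (lemmatization, disambiguation) that a real system would need.

```python
# Hypothetical bilingual concept table: concept id -> terms per language.
CONCEPTS = {
    "COURT": {"ru": ["суд"], "en": ["court", "tribunal"]},
    "WAGE": {"ru": ["заработная плата", "зарплата"], "en": ["wage", "salary"]},
}


def index_document(text, lang, concepts=CONCEPTS):
    """Map a document to a language-independent set of concept ids.
    Naive substring matching; no morphology is handled here."""
    lowered = text.lower()
    return {
        concept
        for concept, terms in concepts.items()
        if any(term in lowered for term in terms[lang])
    }


def render(concept_ids, lang, concepts=CONCEPTS):
    """Show a concept set using the first term of each concept in `lang`."""
    return sorted(concepts[c][lang][0] for c in concept_ids)


doc_ru = "Суд постановил: заработная плата должна быть выплачена"
doc_en = "The court ruled that the wage must be paid"

ru_index = index_document(doc_ru, "ru")
en_index = index_document(doc_en, "en")
print(ru_index == en_index)      # both documents share one representation
print(render(ru_index, "en"))    # Russian document shown with English terms
```

Because both documents end up as the same concept set, a query in one language can retrieve documents written in the other.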

Use of Thesaurus in Information Retrieval applications: 

Use of Thesaurus in Information Retrieval applications
-- Flexible knowledge-based categorization systems (9 systems)
-- NEW: Automatic text categorization of Russian legislation (200,000 documents, 3,000 categories)
-- Knowledge-based text summarization system (SUMMAC conference)
-- Thesaurus-based information retrieval: a specially constructed thesaurus can significantly improve the efficiency of information retrieval (3-point average precision)
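
For readers unfamiliar with the metric named above, here is a minimal sketch of 3-point average precision. It assumes the commonly used definition, the mean of interpolated precision at recall levels 0.2, 0.5 and 0.8; the slides do not state which recall levels were used, so the `levels` default is an assumption, and the function names are illustrative.

```python
def interpolated_precision(ranked_relevance, total_relevant, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    best = 0.0
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        recall = hits / total_relevant
        precision = hits / rank
        if recall >= recall_level:
            best = max(best, precision)
    return best


def three_point_average_precision(ranked_relevance, total_relevant,
                                  levels=(0.2, 0.5, 0.8)):
    """Mean interpolated precision over the given recall levels
    (the levels are an assumed, conventional choice)."""
    if total_relevant == 0:
        return 0.0
    return sum(
        interpolated_precision(ranked_relevance, total_relevant, level)
        for level in levels
    ) / len(levels)


# Relevance judgments for a ranked result list (True = relevant).
ranking = [True, False, True, False, True]
print(three_point_average_precision(ranking, total_relevant=3))
```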

Bilingual Search in UIS RUSSIA: 

Bilingual Search in UIS RUSSIA

Slide36: 

www.cir.ru/is4/

Text Categorization: 

Text Categorization

Expert Subjectivity: 

Expert Subjectivity
-- Agreement between different experts in manual categorization: 60%
-- High precision, BUT low recall
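
The "high precision, low recall" pattern noted on this slide can be made concrete with a small sketch. The heading names and the reference set below are invented for illustration, and the slide does not say how agreement was measured, so only precision and recall over sets of assigned headings are shown.

```python
def precision_recall(assigned, reference):
    """Precision and recall of a set of assigned headings against a
    reference set of headings considered correct for the document."""
    assigned, reference = set(assigned), set(reference)
    common = assigned & reference
    precision = len(common) / len(assigned) if assigned else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: the expert assigns only the most obvious headings.
expert = {"TAXES", "BUDGET"}
reference = {"TAXES", "BUDGET", "CUSTOMS", "BANKING"}

precision, recall = precision_recall(expert, reference)
print(precision, recall)
```

In this example every heading the expert assigned is correct, but half of the applicable headings are missed.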

Automatic Text Categorization: 

Automatic Text Categorization

Text Categorization Using Thematic Representation: 

Text Categorization Using Thematic Representation
Systems of Subject Headings:
-- RF Subject Headings System for Legal Acts (RF President Decree N511, 15.03.200; 1169 items, 4 levels)
-- RF Central Election Committee Legal Subject Headings (450 items; 4 levels)
-- 80 Top Terms of the Legislative Indexing Vocabulary (LIV), Congressional Research Service of the US Congress

Subject Heading Description Scheme: 

Subject Heading Description Scheme
[Diagram labels: Subject Heading; Alternative1; Alternative2; conditions C11, C12, C13, C21, Condition22; operators OR and AND; marks + and -]
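
One possible reading of the scheme above is a Boolean formula: a subject heading holds if any of its alternatives holds (OR), an alternative holds if all of its conditions hold (AND), and a condition holds if any of its support concepts is present in the document. The sketch below implements that reading only. Treating the "-" mark as negation is an assumption, and the class names and example concepts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    concepts: set          # satisfied if ANY of these concepts is present (OR)
    negated: bool = False  # assumed meaning of the '-' mark in the scheme

    def holds(self, doc_concepts):
        present = bool(self.concepts & doc_concepts)
        return not present if self.negated else present


@dataclass
class Heading:
    name: str
    alternatives: list  # each alternative is a list of Condition (AND)

    def matches(self, doc_concepts):
        # OR over alternatives, AND over the conditions inside each one.
        return any(
            all(cond.holds(doc_concepts) for cond in alternative)
            for alternative in self.alternatives
        )


tax_law = Heading(
    name="TAX LAW",
    alternatives=[
        [Condition({"TAX"}), Condition({"LAW", "REGULATION"}),
         Condition({"SPORT"}, negated=True)],
        [Condition({"TAX CODE"})],
    ],
)

print(tax_law.matches({"TAX", "LAW"}))           # first alternative holds
print(tax_law.matches({"TAX", "LAW", "SPORT"}))  # blocked by negated condition
```

Here `doc_concepts` stands for the set of thesaurus concepts found in a document by conceptual indexing.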

Subject Heading as Formulae of Support Concepts: 

Subject Heading as Formulae of Support Concepts

Full Representation of Subject Heading (expansion of support concepts): 

Full Representation of Subject Heading (expansion of support concepts)

Examples: 

Examples

Slide45: 

http://www.cir.ru/docs/ips/techno/gmtpod/index_e.jsp

Results of Text Categorization: 

Results of Text Categorization Info

Known terms: 

Known terms
Word/Phrase Sense Disambiguation
T_M ok M not

Related Terms for judgment: 

Related Terms for judgment

Thematic Summary: 

Thematic Summary
Link between two thematic lines:
-- DISCHARGE; ORIGINATE; DISMISSAL WAGE; LABOR
-- HUNGARY; FORINT; HUNGARIANS; BUDAPEST; STATE, COUNTRY

Two thematic lines: JUDICIAL TRIAL … and HUNGARY… in text: 

Two thematic lines: JUDICIAL TRIAL … and HUNGARY… in text

Automatic Summary: 

Automatic Summary

Russian Text – English Terms: 

Russian Text – English Terms

Russian Text – English Thematic Summary: 

Russian Text – English Thematic Summary

Bilingual Text Categorization: 

Bilingual Text Categorization
Support of Subject Heading