March2006 english final

Slide1: 

Bilingual Russian-English Thesaurus and Domain Ontologies. Thesaurus-Based Technologies and Value-Added Services at University Information System RUSSIA

Slide2: 

University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium), www.cir.ru

Prepared for the Seminar at the Finnish Social Science Data Archive, Helsinki, March 9-10, 2006, by:
- Tatyana N. Yudina, Leading researcher, Ph.D. (history), Moscow State University Research Computing Center, yudina@mail.cir.ru
- Anna Bogomolova, Assistant professor, Ph.D. (economics), Moscow State University Economic faculty, bogo@mail.cir.ru

Moscow State University Research Computing Center; NCO Center for Information Research

Slide6: 

University Information System RUSSIA Collections: 2,000,000 / 20 GB (www.cir.ru)

UIS RUSSIA: 

UIS RUSSIA collections of documents in English:
- OECD Health Data,
- RePEc (Research Papers in Economics, www.repec.org) abstracts and full texts,
- Council of Europe documents,
- European Court of Human Rights archive,
- Publications of the Kennan Institute, USA.

Slide8: 

NLP technology in UIS RUSSIA
[Diagram labels: Automatic Linguistic Text Processing / Linguistic Processors; file types *.HTM, *.POD, *.OUT, *.LEM, *.HDR; holdings; convertors; ORACLE; WEB www.cir.ru (Apache; OAS); Administrator]

Slide9: 

Automatic Linguistic Text Processing (ALTP) is know-how developed by the UIS RUSSIA team. ALTP is tuned for content-based processing and integration of all main types of business prose text corpora (documents and statistics): government publications, daily records of the parliament chambers, think tank reports, scientific journals, mass media, and public opinion polls.

Content-based processing includes:
-- Conceptual Indexing,
-- Coherent Summarization,
-- Text Categorisation.

Thesaurus: 

Thesaurus

Slide11: 

Sociopolitical Thesaurus
- 29,000 concepts,
- 75,000 terms,
- 110,000 conceptual relations.

The Thesaurus:
- was constructed specifically as a tool for automatic text processing;
- contains terms from the economic, financial, political, military, social, legislative and cultural domains;
- is regularly tested during automatic text processing;
- has a set of relations adjusted to serve content-based search, navigation and query refinement.
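The concept / term / relation counts above suggest a three-part data model. The following is a minimal sketch of such a model, not the UIS RUSSIA implementation: the class names, the concept IDs and the relation name "related" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Concept:
    """One thesaurus concept with its terms and typed relations."""
    concept_id: str
    terms: Set[str] = field(default_factory=set)
    # Relation name -> ids of related concepts (relation names are hypothetical).
    relations: Dict[str, Set[str]] = field(default_factory=dict)


class Thesaurus:
    """Stores concepts and an inverted index from terms to concept ids."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}
        self.term_index: Dict[str, Set[str]] = {}

    def add_concept(self, concept_id: str, terms) -> Concept:
        # Terms are stored lower-cased so lookup is case-insensitive.
        concept = Concept(concept_id, {t.lower() for t in terms})
        self.concepts[concept_id] = concept
        for term in concept.terms:
            self.term_index.setdefault(term, set()).add(concept_id)
        return concept

    def relate(self, source: str, relation: str, target: str) -> None:
        # Records a directed, typed relation between two existing concepts.
        self.concepts[source].relations.setdefault(relation, set()).add(target)

    def lookup(self, term: str) -> Set[str]:
        # Returns the ids of all concepts that list this term.
        return self.term_index.get(term.lower(), set())


# Illustrative usage with made-up concept ids.
thesaurus = Thesaurus()
thesaurus.add_concept("C1", ["Foreign Exchange Rate", "exchange rate"])
thesaurus.add_concept("C2", ["Ruble"])
thesaurus.relate("C1", "related", "C2")
```

Keeping several terms per concept is what lets the term count exceed the concept count, as in the figures on this slide.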

General Structure of Thesaurus: 

General Structure of Thesaurus

English-Russian Sociopolitical Thesaurus: 

English-Russian Sociopolitical Thesaurus
Hierarchical conceptual net of 65 thousand English terms.

Manual work:
- Use of general and special-domain English-Russian and Russian-English dictionaries,
- Study of conventional American and British dictionaries and thesauri,
- Cross-checking of translations,
- Checking by Internet search.

Thesaurus terminology in social and political domain: 

Thesaurus terminology in social and political domain

Adding languages to Thesaurus: 

Adding languages to Thesaurus
Developing a multilingual Sociopolitical Thesaurus is a challenge: terms of the social and political domains from different languages have to be described and arranged in a single multilingual hierarchical net. A project under discussion is to add the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia.

Term Extraction for Russian Official Documents (RF Government Regulation N604 26.06.1995): 

Term Extraction for Russian Official Documents (RF Government Regulation N604 26.06.1995)

Slide17: 

Thematic Lines of Thesaurus Terms (RF Government Regulation N604 26.06.1995)

Slide18: 

Network of Thematic Nodes (RF Government Regulation N604 26.06.1995)

Slide19: 

Network of Thematic Nodes in English (RF Government Regulation N604 26.06.1995)

Slide20: 

Structure of Thematic Representation Main Thematic Nodes Specific Thematic Nodes

Structural Thematic Summary (RF Government Regulation N604 26.06.1995): 

Structural Thematic Summary (RF Government Regulation N604 26.06.1995)

THESAURUS for Information Retrieval in Sociopolitical Domain: 

THESAURUS for Information Retrieval in Sociopolitical Domain
- The Thesaurus supports query refinement, reformulation and expansion;
- The Thesaurus terminology covers 95-98% of business prose terms in Russian government publications, academic papers and mass media texts from 1991;
- The Thesaurus is a main element of the ALTP (automatic linguistic text processing) technology at UIS RUSSIA.
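Query expansion through a thesaurus can be illustrated with a short sketch. This is an assumption about how such expansion might work, not the UIS RUSSIA algorithm: the function name, the concept IDs, and the choice to follow only a "narrower" relation are hypothetical. The sample terms are borrowed from the Foreign Exchange Rate slide later in this deck.

```python
def expand_query(term, term_to_concept, concept_terms, narrower):
    """Return the terms of the query concept and of all its narrower concepts."""
    start = term_to_concept.get(term.lower())
    if start is None:
        # Unknown term: fall back to the literal query term.
        return {term.lower()}
    seen = set()
    stack = [start]
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue  # Guards against cycles in the relation graph.
        seen.add(concept)
        stack.extend(narrower.get(concept, ()))
    expanded = set()
    for concept in seen:
        expanded.update(concept_terms.get(concept, ()))
    return expanded


# Illustrative data; ids and the hierarchy are made up for this sketch.
TERM_TO_CONCEPT = {
    "economic development": "ECON_DEV",
    "economic crisis": "ECON_CRISIS",
    "economic growth": "ECON_GROWTH",
}
CONCEPT_TERMS = {
    "ECON_DEV": {"economic development"},
    "ECON_CRISIS": {"economic crisis"},
    "ECON_GROWTH": {"economic growth"},
}
NARROWER = {"ECON_DEV": ["ECON_CRISIS", "ECON_GROWTH"]}

expanded_terms = expand_query(
    "Economic Development", TERM_TO_CONCEPT, CONCEPT_TERMS, NARROWER
)
```

A search engine could then match documents against any term in `expanded_terms` instead of the single term the user typed.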

Query Refinement: 

Query Refinement

Navigation in Thesaurus: 

Navigation in Thesaurus

Bilingual Information Retrieval: 

Bilingual Information Retrieval

Document content representation in two languages scheme: 

Document content representation in two languages: scheme
[Diagram labels: Document in Russian; Document in English; Content representation of a document; Content representation in Russian; Content representation in English]

Documents content representation in two languages example: 

Document content representation in two languages: example
[Diagram labels: Document in Russian; Document in English; Content representation of a document; Content representation in Russian; Content representation in English]
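One way to read the scheme above is that a document in either language is mapped to one language-independent content representation, which can then be displayed in Russian or in English. The sketch below illustrates that idea under strong assumptions: the concept IDs, the two-entry label table and the naive substring matching are hypothetical, and real Russian text would need morphological processing that this sketch ignores.

```python
# Hypothetical bilingual label table: concept id -> label per language.
CONCEPT_LABELS = {
    "C_EXCHANGE_RATE": {"en": "foreign exchange rate", "ru": "валютный курс"},
    "C_RUBLE": {"en": "ruble", "ru": "рубль"},
}


def index_document(text, lang):
    """Map a text in the given language to a set of concept ids."""
    lowered = text.lower()
    return {
        concept_id
        for concept_id, labels in CONCEPT_LABELS.items()
        if labels[lang] in lowered  # Naive substring match, no lemmatization.
    }


def render(concept_ids, lang):
    """Show a content representation as sorted labels in the chosen language."""
    return sorted(CONCEPT_LABELS[concept_id][lang] for concept_id in concept_ids)


def bilingual_search(query, query_lang, indexed_docs):
    """Return ids of documents whose concepts cover all query concepts."""
    wanted = index_document(query, query_lang)
    return [
        doc_id
        for doc_id, concepts in indexed_docs.items()
        if wanted and wanted <= concepts
    ]


# A Russian and an English document end up in the same concept space.
docs = {
    "ru_doc": index_document("Валютный курс и рубль", "ru"),
    "en_doc": index_document("Ruble outlook", "en"),
}
hits = bilingual_search("foreign exchange rate", "en", docs)
```

Because matching happens on concept IDs, an English query can retrieve a Russian document, and the document's representation can be rendered in either language with `render`.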

Bilingual Search in UIS RUSSIA: 

Bilingual Search in UIS RUSSIA

Slide33: 

www.cir.ru/is4/

Text Categorization: 

Text Categorization

Expert-made classification: 

Expert-made classification
- 60% coincidence
- High accuracy
- Not high relevance

Classification in automatic mode: 

Classification in automatic mode

Text Categorization Using Thematic Representation: 

Text Categorization Using Thematic Representation

Systems of Subject Headings:
- UIS RUSSIA system of subject headings,
- RF Central Election Committee Legal Subject Headings (450 items; 4 levels),
- 80 Top Terms of the Legislative Indexing Vocabulary (LIV) of the Congressional Research Service of the US Congress.

English-Russian Sociopolitical Thesaurus: new applications: 

English-Russian Sociopolitical Thesaurus: new applications
- Automatic text categorization of research papers in economics exploiting JEL subject headings (700 categories),
- Automatic text processing of statistical tables,
- Automatic text processing of documents of European organizations (European Court of Human Rights, Council of Europe, European Union).

System of Subject Headings for Budget Data: 

System of Subject Headings for Budget Data
87 hierarchic categories.

First-level categories are:
- Macroeconomic Indicators
- Budget Revenues and Expenditures
- Tax Concessions
- Budget Deficit/Surplus
- State and Municipal Debt
- Budget Process
- Budget Federalism
- Extra-Budgetary Funds
- State Authorities
- Fiscal Misconduct

Foreign Exchange Rate: 

Foreign Exchange Rate
1. ((US Dollar OR Euro Currency OR Ruble) AND Foreign Exchange Rate)
OR
2. ((US Dollar OR Euro Currency) AND Ruble AND Economic Development (Economic Crisis; Economic Forecasting; Economic Indicator; Economic Growth; Economic Laws; Economic Situation))
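The Boolean rule above can be checked against the set of thesaurus concepts found in a document. The sketch below is one possible reading, not the system's own code: the function name is hypothetical, and it assumes that the parenthesized list after Economic Development means "Economic Development or any of the listed concepts".

```python
# Concept groups taken from the rule; the grouping names are hypothetical.
ANY_CURRENCY = {"US Dollar", "Euro Currency", "Ruble"}
FOREIGN_CURRENCY = {"US Dollar", "Euro Currency"}
# Assumption: the parenthesized list expands Economic Development.
ECONOMIC_DEVELOPMENT = {
    "Economic Development",
    "Economic Crisis",
    "Economic Forecasting",
    "Economic Indicator",
    "Economic Growth",
    "Economic Laws",
    "Economic Situation",
}


def matches_foreign_exchange_rate(doc_concepts):
    """Return True if the document's concepts satisfy clause 1 OR clause 2."""
    concepts = set(doc_concepts)
    # Clause 1: any currency AND the Foreign Exchange Rate concept.
    clause_1 = bool(concepts & ANY_CURRENCY) and "Foreign Exchange Rate" in concepts
    # Clause 2: a foreign currency AND Ruble AND an economic development concept.
    clause_2 = (
        bool(concepts & FOREIGN_CURRENCY)
        and "Ruble" in concepts
        and bool(concepts & ECONOMIC_DEVELOPMENT)
    )
    return clause_1 or clause_2


# Illustrative check on a made-up document representation.
is_match = matches_foreign_exchange_rate({"US Dollar", "Ruble", "Economic Crisis"})
```

Clause 2 lets a document be assigned the heading even when the term "foreign exchange rate" never appears, as long as a foreign currency and the ruble are discussed together in an economic development context.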

Slide44: 

Thank you!

Tatyana N. Yudina, Leading researcher, Ph.D. (history), Moscow State University Research Computing Center, yudina@mail.cir.ru
Anna Bogomolova, Assistant professor, Ph.D. (economics), Moscow State University Economic faculty, bogo@mail.cir.ru