loukachevitch

Presentation Transcript

Slide1: 

Natalia V. Loukachevitch (louk@mail.cir.ru)
Russian Language in Cross-Language Information Retrieval: Resources and Tools in Russia
Research Computing Center of Moscow State University
NCO Center for Information Research

Plan.1: 

- Morphological analyzers of Russian
- Morphology of East Slavonic languages
- Multilingual information-retrieval thesauri
- Electronic bilingual dictionaries
- Russian and bilingual text collections

Plan.2: 

- Machine translation systems
- Example-based machine translation system and conceptual information retrieval
- Bilingual ontologies
- Russian WordNet
- Sociopolitical Thesaurus for automatic text processing

Morphological analysis of Russian language: 

- Not a problem: many high-quality morphological analyzers of Russian are available
- They are based on the classification in the "Grammatical Dictionary of the Russian Language" by A.A. Zalizniak (first edition published in 1977)

Morphological analyzers: 

- Zalizniak dictionary and a morphological analyzer: http://starling.rinet.ru/morpho.htm
- Under the LGPL license: http://www.aot.ru/download.html (site in Russian)
- http://linguist.nm.ru/index.htm (Russian and Ukrainian): paid resources, used in several well-known commercial Russian systems

Russian Internet search engines with Russian morphological analysis: 

- Yandex – www.yandex.ru
- Rambler – www.rambler.ru
- Aport – www.aport.ru

Morphology of East Slavonic Languages in Search Engines: 

- Ukrainian Internet search engine Meta (www.meta.ua): Russian, English and Ukrainian morphology
- Byelorussian search engine (www.akavita.by): Russian, English and Byelorussian morphology (will be added)

Traditional multilingual information-retrieval thesauri: 


Thesaurus of European Union: EUROVOC : 

- Translated into 9 languages
- Translated into Russian by specialists of the Parliamentary Library
- Supplemented with Russia-specific terms (9646 descriptors in the Russian version)
- Used for manual indexing of documents in the library

Electronic dictionaries: 


MultiLex dictionaries: 

- www.medialingua.com
- English, French, Spanish, German, Italian
- Licensed versions of dictionaries from publishers
- Usually include a general dictionary and several domain-specific dictionaries

Lingvo dictionaries: 

- www.abbyy.co.uk
- Abbyy Lingvo 8.0 Multilingual edition: eight translation directions, 41 general and specialised dictionaries
- FineReader: the best Russian OCR system. Supports more than 100 languages. Winner in 70 comparative tests worldwide

Polyglossum dictionaries: 

- ETS publishing house: www.ets.ru
- Electronic versions (plain text format is possible) and traditional printed versions
- Bilingual dictionaries: English, German, French, Spanish, plus Finnish

Russian Text Collections: 


Internet Library of Moshkov: 

- www.lib.ru
- Fiction in Russian, including classic works
- 3300 Mb of text files and 300 Mb of other files
- Free access
- No copyright

Internet library - www.public.ru : 

- More than 1000 periodical titles published after 1990
- Free access
- No copyright
- Holds a license for library activity

Morphologically tagged corpus of Russian “Russian Standard”: 

- Creation of a morphologically tagged corpus of Russian has begun in Russia
- Russian fiction: 583,814 words
- Serge Sharoff: http://corpus.leeds.ac.uk/

Parallel collections: 


Parallel translation of news reports: 

- ITAR-TASS agency: news reports in 6 languages (http://corp.itar-tass.com/english/about/)
- RIA-Novosti agency: news reports in 12 languages (http://en.rian.ru/rian/index.cfm)
- Internet newspaper PRAVDA On-Line (http://english.pravda.ru/): translation into English

Translation of Russian Legislation: 

- GARANT company: legal information systems (http://www.garant.ru/nav.php?pid=286&ssid=89)
- More than 25 thousand Russian legal acts have been translated into English
- The translations are disseminated via the network of the American company LEXIS/NEXIS

Machine translation systems: 


ETAP machine translation system: 

- Based on the Meaning-Text Theory of I. Melchuk and Y. Apresyan
- Detailed rule-based syntactic analysis
- English-Russian
- http://cl.iitp.ru/etap/index.html

Best-known commercial machine translation system: PROMT: 

- www.e-prompt.com
- Russian paired with English, French, German, Spanish and Italian
- English-German
- Development of domain-specific systems
- Online translation: www.translate.ru

Example-Based Machine Translation: ETRANS, RTRANS: 

- Gerold Belonogov
- The idea was published in 1975
- VINITI: All-Russian Scientific and Technical Information Institute of the Russian Academy of Sciences (www.viniti.ru)

Example-based machine translation in VINITI - 2: 

- VINITI: manual indexing; search images of technical literature and abstracts, collected over many years
- 900 thousand Russian terms were extracted (length 1-13 words)
- Parallel collection of English abstracts and their translations into Russian => 800 thousand English terms

Conceptual indexing in VINITI: 

- A bilingual base of terms can serve as a resource for bilingual search
- It is not an ontology, only bilingual pairs
- An important tool for VINITI: access for foreign researchers to Russian technical literature, but (as far as I know) not implemented yet
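
To illustrate how a base of bilingual term pairs could support bilingual search, here is a minimal sketch, not VINITI's actual implementation. It assumes a plain dictionary of term pairs and a greedy longest-match lookup over the query words; the toy entries, the function name `translate_query` and the absence of morphological normalization are all assumptions made for illustration. Only the 1-13 word limit on term length comes from the slides.

```python
from typing import Dict, List

# Hypothetical toy term base: Russian term -> English term.
TERM_PAIRS: Dict[str, str] = {
    "информационный поиск": "information retrieval",
    "машинный перевод": "machine translation",
    "перевод": "translation",
}

MAX_TERM_LEN = 13  # the slides mention extracted terms of 1-13 words


def translate_query(query: str, pairs: Dict[str, str],
                    max_len: int = MAX_TERM_LEN) -> List[str]:
    """Greedy longest-match lookup of query word sequences in the term base.

    Assumption: words are matched as written; a real system would first
    reduce word forms to lemmas with a morphological analyzer.
    """
    words = query.lower().split()
    found: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in pairs:
                found.append(pairs[candidate])
                i += n
                break
        else:
            i += 1  # no known term starts here; skip this word
    return found
```

Preferring the longest match keeps a multiword term such as "машинный перевод" from being split into its shorter component "перевод".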

Multilingual ontologies: 


Russian WordNet - RussNet: 

- Saint-Petersburg State University
- 2003: 15000 words, 5000 synsets, 8000 relations
- Addition of several new types of relations, such as derivative synonyms and derivative semantic roles

Slide29: 

University Information System RUSSIA (Center for Information Research)
Collections: 800,000 / 7.5 Gb (www.cir.ru)

UIS RUSSIA: 

Collections of documents in English:
- RePEc (Research Papers in Economics, www.repec.org): abstracts and full texts
- Collection of Council of Europe documents
- Access to parallel collections of legislation; harmonization of legislation

Approach to Organization of Bilingual Search in UIS RUSSIA: 

- Development of a bilingual ontology in the sociopolitical domain, based on the Russian Sociopolitical Thesaurus for automatic text processing

Slide32: 

Sociopolitical Thesaurus
- 28,000 concepts, 70,000 terms, 105,000 conceptual relations
- constructed specially as a tool for automatic text processing;
- contains terms from economic, financial, political, military, social, legislative and cultural domains;
- a set of relations is specially adapted to information-retrieval applications;
- regularly tested during automatic text processing
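
The concept/term/relation organization described above can be pictured with a small data structure. The following is only a sketch under stated assumptions: the slides do not name the relation types, so the single `narrower` relation, the `Concept` class, the `expand` function and the toy entries are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """A thesaurus concept with its text terms and related concepts."""
    name: str
    terms: List[str]
    narrower: List[str] = field(default_factory=list)  # assumed relation type


def build_term_index(concepts: Dict[str, Concept]) -> Dict[str, str]:
    """Map every lower-cased term to the name of its concept."""
    return {t.lower(): c.name for c in concepts.values() for t in c.terms}


def expand(term: str, concepts: Dict[str, Concept],
           index: Dict[str, str]) -> Set[str]:
    """Expand a query term with all terms of its concept and of the
    concepts reachable through the 'narrower' relation."""
    start = index.get(term.lower())
    if start is None:
        return {term.lower()}  # unknown term: leave the query unchanged
    expanded: Set[str] = set()
    seen: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        concept = concepts[current]
        expanded.update(t.lower() for t in concept.terms)
        stack.extend(concept.narrower)
    return expanded


# Hypothetical toy fragment of a thesaurus.
CONCEPTS = {
    "TAX": Concept("TAX", ["tax", "taxation"], narrower=["INCOME_TAX"]),
    "INCOME_TAX": Concept("INCOME_TAX", ["income tax"]),
}
INDEX = build_term_index(CONCEPTS)
```

With this toy fragment, a query on "taxation" would also match documents that mention only "income tax", which is the kind of effect a retrieval-oriented set of relations is meant to provide.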

Use of Thesaurus in Information Retrieval applications: 

- Flexible knowledge-based categorization systems (9 systems)
- Automatic text categorization of Russian legislation (200 000 documents): 3000 categories
- Knowledge-based text summarization system: SUMMAC conference
- Thesaurus-based information retrieval: a specially constructed thesaurus can significantly improve the effectiveness of information retrieval (measured by 3-point average precision)
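
For readers unfamiliar with the measure, the sketch below computes 3-point average precision for a single query. It assumes the common definition (the mean of interpolated precision at recall levels 0.2, 0.5 and 0.8); the slides do not state which recall levels were used, so the levels and the function names are assumptions.

```python
from typing import Sequence, Set


def interpolated_precision(ranking: Sequence[str], relevant: Set[str],
                           recall_level: float) -> float:
    """Highest precision reached at any rank where recall >= recall_level."""
    hits = 0
    best = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            recall = hits / len(relevant)
            precision = hits / rank
            if recall >= recall_level:
                best = max(best, precision)
    return best


def three_point_average_precision(
        ranking: Sequence[str], relevant: Set[str],
        levels: Sequence[float] = (0.2, 0.5, 0.8)) -> float:
    """Mean interpolated precision over the given recall levels
    (0.2, 0.5, 0.8 is an assumed, conventional choice)."""
    if not relevant:
        return 0.0
    total = sum(interpolated_precision(ranking, relevant, level)
                for level in levels)
    return total / len(levels)
```

Comparing this value for the same queries with and without thesaurus support is one way to quantify the kind of improvement the slide refers to.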

English-Russian Sociopolitical Thesaurus: 

- Hierarchical conceptual net of 63 thousand English terms
- Manual work
- Use of general and special English-Russian dictionaries
- Study of conventional American and British dictionaries and information-retrieval thesauri
- Cross-checking of translations
- Addition of multiword variants
- Internet checks

Bilingual Search in UIS RUSSIA: 


Slide36: 

www.cir.ru/is4/

English-Russian Sociopolitical Thesaurus: testing and use in new applications: 

- Automatic text categorization of economic papers and abstracts using JEL subject headings (700 categories) (supported by the Ford Foundation, USA)
- Automatic text processing of statistical tables (in cooperation with Berkeley University, USA)
- Automatic text processing of European documents (European Court of Human Rights, Council of Europe, European Union): problems of harmonization of Russian legislation

Adding languages to Sociopolitical Thesaurus: 

- It is a challenge to develop a multilingual Sociopolitical Thesaurus, that is, to describe terms of the sociopolitical domain from different languages in the same hierarchical net
- A project under discussion: adding the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia

Russian Information Retrieval Evaluation Seminar - 2003: 

- Web collection: 7 Gb (www.narod.yandex.ru)
- Thematic classification of Web sites
- Web search: 10000 real queries from the Internet were given; 50 queries will be evaluated
- 8 Russian participants