Added: October 26, 2007

Presentation Transcript

Slide1:
Natalia V. Loukachevitch, louk@mail.cir.ru
Russian Language in Cross-Language Information Retrieval: Resources and Tools in Russia
Research Computing Center of Moscow State University
NCO Center for Information Research

Plan.1:
Morphological analyzers of Russian
Morphology of East Slavonic languages
Multilingual information-retrieval thesauri
Electronic bilingual dictionaries
Russian and bilingual text collections

Plan.2:
Machine translation systems
Example-based machine translation system and conceptual information retrieval
Bilingual ontologies
Russian WordNet
Sociopolitical Thesaurus for automatic text processing

Morphological analysis of Russian language:
No problem: there are many high-quality morphological analyzers of Russian
Based on the classification in the “Grammatical Dictionary of the Russian Language” by A.A. Zalizniak (the first edition was published in 1983)

Morphological analyzers:
Zalizniak dictionary and a morphological analyzer: http://starling.rinet.ru/morpho.htm
Under the LGPL license: http://www.aot.ru/download.html (site in Russian)
http://linguist.nm.ru/index.htm (Russian and Ukrainian): paid resources, used in several well-known commercial Russian systems

Russian Internet search engines with Russian morphological analysis:
Yandex: www.yandex.ru
Rambler: www.rambler.ru
Aport: www.aport.ru

Morphology of East Slavonic Languages in Search Engines:
Ukrainian Internet search engine Meta (www.meta.ua): Russian, English and Ukrainian morphology
Byelorussian search engine (www.akavita.by): Russian and English morphology; Byelorussian morphology will be added

Traditional multilingual information-retrieval thesauri

Thesaurus of European Union: EUROVOC:
Translated into 9 languages
Translated into Russian by specialists of the Parliamentary Library
Supplemented with Russia-specific terms (9,646 descriptors in the Russian version)
Used for manual indexing of documents in the library

Electronic dictionaries

MultiLex dictionaries:
www.medialingua.com
English, French, Spanish, German, Italian
Licensed versions of dictionaries from publishers
Usually include a general dictionary and several domain-specific dictionaries

Lingvo dictionaries:
www.abbyy.co.uk
Abbyy Lingvo 8.0 Multilingual edition: eight translation directions, 41 general and specialised dictionaries
FineReader: the best Russian OCR system. Supports more than 100 languages. Winner of 70 comparative tests worldwide

Polyglossum dictionaries:
ETS publishing house, www.ets.ru
Electronic versions (plain text format is possible) and traditional printed versions
Bilingual: English, German, French, Spanish, plus Finnish

Russian Text Collections

Internet Library of Moshkov:
www.lib.ru
Fiction in Russian, including classic works
3,300 Mb of text files and 300 Mb of other files
Free access
No copyright

Internet library - www.public.ru:
More than 1,000 periodical titles published since 1990
Free access
No copyright
Holds a license for library activity

Morphologically tagged corpus of Russian “Russian Standard”:
Creation of a morphologically tagged corpus of Russian has begun in Russia
Russian fiction: 583,814 words
Serge Sharoff: http://corpus.leeds.ac.uk/

Parallel collections

Parallel translation of news reports:
ITAR-TASS agency: news reports in 6 languages (http://corp.itar-tass.com/english/about/)
RIA-Novosti agency: news reports in 12 languages (http://en.rian.ru/rian/index.cfm)
Internet newspaper PRAVDA On-Line (http://english.pravda.ru/): translation into English

Translation of Russian Legislation:
GARANT company, legal information systems: http://www.garant.ru/nav.php?pid=286&ssid=89
More than 25 thousand Russian legal acts have been translated into English; the translations are disseminated via the network of the American company LEXIS/NEXIS

Machine translation systems

ETAP machine translation system:
Based on Meaning-Text Theory by I. Melchuk and Y. Apresyan
Detailed rule-based syntactic analysis
English-Russian
http://cl.iitp.ru/etap/index.html

Best-known commercial machine translation system: PROMT:
www.e-prompt.com
Russian - English, French, German, Spanish, Italian
English - German
Development of domain-specific systems
Online translation: www.translate.ru

Example-Based Machine Translation: ETRANS, RTRANS:
Gerold Belonogov
The idea was published in 1975
VINITI: All-Russian Scientific and Technical Information Institute of the Russian Academy of Sciences (www.viniti.ru)

Example-based machine translation in VINITI - 2:
VINITI manual indexing: search images of technical literature and abstracts, collected over many years
900 thousand Russian terms were extracted (length 1-13 words)
Parallel collection of English abstracts and their translations into Russian => 800 thousand English terms

Conceptual indexing in VINITI:
The bilingual base of terms can serve as a resource for bilingual search
It is not an ontology, only bilingual pairs
An important tool for VINITI: access of foreign researchers to Russian technical literature, but (as far as I know) not implemented yet

Multilingual ontologies

Russian WordNet - RussNet:
Saint-Petersburg State University
2003: 15,000 words, 5,000 synsets, 8,000 relations
Addition of several new types of relations, such as derivative synonyms and derivative semantic roles

Slide29:
University Information System RUSSIA
Collections (Center for Information Research): 800,000 / 7.5 Gb (www.cir.ru)

UIS RUSSIA:
Collections of documents in English:
- RePEc (Research Papers in Economics, www.repec.org): abstracts and full texts
- collection of Council of Europe documents
Access to parallel collections of legislation
Harmonization of legislation

Approach to Organization of Bilingual Search in UIS RUSSIA:
Development of a bilingual ontology in the sociopolitical domain, based on the Russian Sociopolitical Thesaurus for automatic text processing

Slide32:
Sociopolitical Thesaurus
28,000 concepts, 70,000 terms, 105,000 conceptual relations
constructed specially as a tool for automatic text processing;
contains terms from economic, financial, political, military, social, legislative and cultural domains;
its set of relations is specially adapted to information-retrieval applications;
regularly tested during automatic text processing

Use of Thesaurus in Information Retrieval applications:
Flexible knowledge-based categorization systems (9 systems)
- Automatic text categorization of Russian legislation (200,000 documents): 3,000 categories
Knowledge-based text summarization system
- SUMMAC conference
Thesaurus-based information retrieval
- a specially constructed thesaurus can significantly improve the effectiveness of information retrieval (3-point average precision)

English-Russian Sociopolitical Thesaurus:
Hierarchical conceptual net of 63 thousand English terms
Manual work
Use of general and special English-Russian dictionaries
Study of conventional American and British dictionaries and information-retrieval thesauri
Cross-checking of translations
Addition of multiword variants
Internet checks

Bilingual Search in UIS RUSSIA

Slide36:
www.cir.ru/is4/

English-Russian Sociopolitical Thesaurus: testing and use in new applications:
Automatic text categorization of economic papers and abstracts using JEL subject headings (700 categories) (supported by the Ford Foundation, USA)
Automatic text processing of statistical tables (in cooperation with Berkeley University, USA)
Automatic text processing of European documents (European Court of Human Rights, Council of Europe, European Union): problems of harmonization of Russian legislation

Adding languages to Sociopolitical Thesaurus:
It is a challenge to develop a multilingual sociopolitical thesaurus, that is, to describe terms of the sociopolitical domain from different languages in the same hierarchical net.
A project under discussion: to add the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia.

Russian Information Retrieval Evaluation Seminar - 2003:
Web collection: 7 Gb (www.narod.yandex.ru)
Thematic classification of Web sites
Web search
10,000 real queries from the Internet were given; 50 queries will be evaluated
8 Russian participants
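The slides on the English-Russian Sociopolitical Thesaurus describe terms from two languages attached to concepts in one hierarchical net, used for bilingual search. The sketch below is only an illustration of that idea under stated assumptions, not the UIS RUSSIA implementation: the class names, the concept ids, the toy terms and the `expand_query` function are all hypothetical, and the only relation modelled is a single broader/narrower link.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Concept:
    """One node of the hierarchical net: a concept with terms per language."""
    concept_id: str
    terms: Dict[str, List[str]]       # language code -> terms naming the concept
    parent: Optional[str] = None      # id of the broader concept, if any


class BilingualThesaurus:
    """Hypothetical in-memory thesaurus: terms of both languages share concepts."""

    def __init__(self, concepts: List[Concept]) -> None:
        self.concepts: Dict[str, Concept] = {c.concept_id: c for c in concepts}
        # (language, lowercased term) -> ids of the concepts this term can name
        self.index: Dict[Tuple[str, str], Set[str]] = {}
        for concept in concepts:
            for lang, terms in concept.terms.items():
                for term in terms:
                    key = (lang, term.lower())
                    self.index.setdefault(key, set()).add(concept.concept_id)

    def narrower(self, concept_id: str) -> List[str]:
        """Ids of all concepts below concept_id in the hierarchy (transitive)."""
        found: List[str] = []
        stack = [concept_id]
        while stack:
            current = stack.pop()
            children = sorted(
                c.concept_id for c in self.concepts.values() if c.parent == current
            )
            found.extend(children)
            stack.extend(children)
        return found

    def expand_query(self, term: str, source: str, target: str,
                     include_narrower: bool = True) -> List[str]:
        """Map a source-language term to target-language terms via shared concepts."""
        result: List[str] = []
        for cid in sorted(self.index.get((source, term.lower()), set())):
            ids = [cid] + (self.narrower(cid) if include_narrower else [])
            for concept_id in ids:
                for target_term in self.concepts[concept_id].terms.get(target, []):
                    if target_term not in result:
                        result.append(target_term)
        return result


# Toy data invented for illustration; not taken from the real thesaurus.
thesaurus = BilingualThesaurus([
    Concept("TAX", {"en": ["tax", "taxation"], "ru": ["налог", "налогообложение"]}),
    Concept("INCOME_TAX", {"en": ["income tax"], "ru": ["подоходный налог"]},
            parent="TAX"),
    Concept("BUDGET", {"en": ["budget"], "ru": ["бюджет"]}),
])

print(thesaurus.expand_query("tax", "en", "ru"))
```

In this sketch an English query term is resolved to its concepts, and the Russian terms of those concepts and of their narrower concepts become the search terms. A plain list of bilingual term pairs, such as the VINITI term base described earlier, would support the direct translation step but not the expansion through the hierarchy.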