C3 Yudina Aric85
Added: August 31, 2007
Presentation Transcript
Slide1: Tatyana N. Yudina, yudina@mail.cir.ru. University Information System RUSSIA (Russian inter-University Social Sciences Information and Analytical consortium), www.cir.ru. Research Computing Center of Moscow State University; NCO Center for Information Research.
Plan: UIS RUSSIA. General; Thesaurus; ALTP; Bilingual Information Retrieval; Text categorization; Examples.
Slide3: University Information System RUSSIA. Collections: 1,500,000 / 17.5 Gb (www.cir.ru)
Slide4: NLP technology in UIS RUSSIA. Automatic Linguistic Text Processing/Linguistic Processors; *.POD, *.OUT, *.LEM, *.HDR; ORACLE; WEB www.cir.ru (Apache; OAS); Administrator.
holdings convertors *.HTM
UIS RUSSIA: Collections of documents in English:
-- RePEc (Research Papers in Economics, www.repec.org) abstracts and full texts
-- Council of Europe/European Court of Human Rights documents
-- UNESCO documents
Slide6 to Slide11: (no transcribed text)
Thesaurus
Slide13: Sociopolitical Thesaurus. 29,000 concepts, 75,000 terms, 110,000 conceptual relations. Constructed specially as a tool for automatic text processing; contains terms from the economic, financial, political, military, social, legislative and cultural domains; its set of relations is adapted to information-retrieval applications; regularly tested during automatic text processing.
THESAURUS for Information Retrieval in Sociopolitical Domain:
-- The Thesaurus provides for query refinement (reformulation/expansion)
-- The terminology of the Thesaurus covers 95-98% of the words and terms of Russian government publications, academic papers and mass media texts from 1991
-- The Thesaurus is a main element of ALTP (automatic linguistic text processing) technology
Sociopolitical Thesaurus vs.
Legislative Indexing Vocabulary (LIV)
General Structure of Thesaurus
Query Refinement
Navigation in Thesaurus
ALTP: Automatic Linguistic Text Processing
-- Conceptual Indexing
-- Automatic Coherent Summarisation
-- Automatic Text Categorisation
Term Extraction for Russian Official Documents (RF Government Regulation N604 26.06.1995)
Slide21: Thematic Lines of Thesaurus Terms (RF Government Regulation N604 26.06.1995)
Slide22: Network of Thematic Nodes (RF Government Regulation N604 26.06.1995)
Slide23: Network of Thematic Nodes (RF Government Regulation N604 26.06.1995)
Slide24: Structure of Thematic Representation: Main Thematic Nodes; Specific Thematic Nodes
Structural Thematic Summary (RF Government Regulation N604 26.06.1995)
Bilingual Information Retrieval
English-Russian Sociopolitical Thesaurus:
-- Hierarchical conceptual net of 65 thousand English terms
-- Manual work
-- Use of general and special English-Russian dictionaries
-- Study of conventional American and British dictionaries and information-retrieval thesauri
-- Cross-checking of translations. Addition of multiword variants. Internet checks.
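The bilingual conceptual net and the query expansion mentioned above might be sketched roughly as follows. This is a minimal illustration, not the UIS RUSSIA implementation: the `Concept` class, the `THESAURUS` toy data, and the functions `build_term_index` and `expand_query` are all hypothetical names invented here, and the only relation modelled is a simple broader/narrower hierarchy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of a hypothetical bilingual conceptual net."""
    concept_id: str
    terms_ru: list[str]
    terms_en: list[str]
    narrower: list[str] = field(default_factory=list)  # ids of child concepts


# Toy data invented for illustration; not taken from the real thesaurus.
THESAURUS = {
    "COURT": Concept("COURT", ["суд"], ["court", "tribunal"], narrower=["TRIAL"]),
    "TRIAL": Concept("TRIAL", ["судебный процесс"], ["trial", "court proceedings"]),
}


def build_term_index(thesaurus: dict[str, Concept]) -> dict[str, str]:
    """Map every Russian or English term (lower-cased) to its concept id."""
    index = {}
    for concept in thesaurus.values():
        for term in concept.terms_ru + concept.terms_en:
            index[term.lower()] = concept.concept_id
    return index


def expand_query(term: str, thesaurus: dict[str, Concept], index: dict[str, str]) -> list[str]:
    """Return the terms, in both languages, of the matching concept and its descendants.

    A term that is not in the thesaurus is returned unchanged.
    """
    start = index.get(term.lower())
    if start is None:
        return [term]
    seen, stack, expanded = set(), [start], []
    while stack:
        concept_id = stack.pop()
        if concept_id in seen:
            continue
        seen.add(concept_id)
        concept = thesaurus[concept_id]
        expanded.extend(concept.terms_ru + concept.terms_en)
        stack.extend(concept.narrower)
    return expanded


INDEX = build_term_index(THESAURUS)
print(expand_query("суд", THESAURUS, INDEX))
```

Because Russian and English terms attach to the same concept, a query entered in either language expands to terms in both, which is the property bilingual search relies on.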
English-Russian Sociopolitical Thesaurus: testing and use in new applications:
-- Automatic text categorization of economic papers and abstracts using JEL subject headings (700 categories) (supported by Ford Foundation, USA)
-- Automatic text processing of statistical tables (in cooperation with Berkeley University, USA)
-- Automatic text processing of European documents (European Court of Human Rights, Council of Europe, European Union): problems of harmonization of Russian legislation
Thesaurus Terminology in Sociopolitical Domain
Adding languages to Sociopolitical Thesaurus: It is a challenge to develop a multilingual sociopolitical thesaurus, that is, to describe terms of the sociopolitical domain from different languages in the same hierarchical net. A project under discussion is to add the Tatar language to the bilingual thesaurus. Tatars are the second-largest ethnic group in Russia.
Approach to Organization of Bilingual Search in UIS RUSSIA: Development of a bilingual ontology in the sociopolitical domain, based on the Russian Sociopolitical Thesaurus for automatic text processing.
Thematic Bilingual Indexing: Russian-language document; English-language document; English-language representation; Russian-language representation; thematic representation of the document content.
Use of Thesaurus in Information Retrieval applications:
-- Flexible knowledge-based categorization systems (9 systems)
-- NEW: Automatic text categorization of Russian legislation (200,000 documents), 3000
categories
-- Knowledge-based text summarization system (SUMMAC conference)
-- Thesaurus-based information retrieval: a specially constructed thesaurus can significantly improve the efficiency of information retrieval (3-point average precision)
Bilingual Search in UIS RUSSIA
Slide36: www.cir.ru/is4/
Text Categorization
Subjectivity of Experts: Agreement between different experts in manual categorization is 60%. High precision, BUT low recall.
Automatic Categorization
Text Categorization Using Thematic Representation: Systems of Subject Headings:
-- RF Subject Headings System for Legal Acts (RF President Decree N511, 15.03.200; 1169 items, 4 levels)
-- RF Central Election Committee Legal Subject Headings (450 items; 4 levels)
-- 80 Top Terms of the Legislative Indexing Vocabulary (LIV), Congressional Research Service of the US Congress
Subject Heading Description Scheme (diagram labels): Subject Heading; Alternative1; Alternative2; C11; C12; C13; C21; Condition22; OR; AND; AND; AND; +; +; +; -; OR; OR
Subject Heading as Formulae of Support Concepts
Full Representation of Subject Heading (expansion of support concepts)
Examples
Slide45: http://www.cir.ru/docs/ips/techno/gmtpod/index_e.jsp
Results of Text Categorization: Info
Known terms: Word/Phrase Sense Disambiguation; T_M ok; M not
Related Terms for judgment
Thematic Summary: Link between two thematic lines: DISCHARGE; ORIGINATE; DISMISSAL WAGE; LABOR; and HUNGARY; FORINT; HUNGARIANS; BUDAPEST; STATE, COUNTRY
Two thematic lines: JUDICIAL TRIAL … and HUNGARY… in text
Automatic Summary
Russian Text – English Terms
Russian Text – English Thematic Summary
Bilingual Text Categorization: Support of Subject Heading
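The subject heading description scheme from the Text Categorization slides (alternatives joined by OR, each alternative made of conditions joined by AND) can be read as a Boolean formula over support concepts. The sketch below is one possible reading, assuming that each condition is itself an OR over a set of support concepts and that a document is represented by the set of thesaurus concepts found in it. The heading `LABOUR_DISPUTES`, its contents, and the function `matches` are hypothetical, and the + and - marks of the scheme are not modelled because the slides do not say what they mean.

```python
from __future__ import annotations

# Hypothetical encoding of a subject heading:
#   heading     = list of alternatives        (joined by OR)
#   alternative = list of conditions          (joined by AND)
#   condition   = set of support concepts     (joined by OR)
Heading = list[list[set[str]]]

# Invented example heading; the concept names are illustrative only.
LABOUR_DISPUTES: Heading = [
    [{"DISMISSAL", "DISCHARGE"}, {"COURT", "JUDICIAL TRIAL"}],  # Alternative 1
    [{"WAGE"}, {"LABOR"}, {"COURT"}],                           # Alternative 2
]


def matches(heading: Heading, doc_concepts: set[str]) -> bool:
    """Return True if at least one alternative has all of its conditions satisfied.

    A condition is satisfied when the document contains at least one of its
    support concepts.
    """
    return any(
        all(condition & doc_concepts for condition in alternative)
        for alternative in heading
    )


print(matches(LABOUR_DISPUTES, {"DISMISSAL", "COURT", "HUNGARY"}))  # Alternative 1 holds
print(matches(LABOUR_DISPUTES, {"WAGE", "LABOR"}))                  # no alternative holds
```

Keeping the heading as plain nested data rather than code means an expert can edit a heading's alternatives and conditions without touching the matching logic.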