040916 EV WS 10 More Applications
Added: September 29, 2007

Presentation Transcript

Slide1: Multilingual text analysis applications based on automatic Eurovoc indexing
Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU
Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech

Applications mentioned so far:
Thesaurus indexing (summarise main concepts of document)
  Fully automatic
  Interactive
  Monolingual and cross-lingual
Document retrieval
  Monolingual and cross-lingual
Eurovoc indexing can be used for MUCH MORE …

Main goals of JRC’s Language Technology (LT) activity:
Gather potentially user-relevant documents
Analyse texts in various languages
  extract information from texts (Eurovoc)
  identify similarity between documents (Eurovoc)
Classify documents (Eurovoc)
Visualise contents
  of individual documents (Eurovoc)
  of whole document collections (Eurovoc)

Eurovoc indexing as part of a tool set:

(Cross-lingual) document similarity calculation:
Spanish text: Resolución sobre los residuos radioactivos (Resolution on radioactive waste)
monolingual

(Multilingual) text classification:
Most current approaches to text classification are monolingual
Text classification, via Eurovoc, is multilingual

(Multilingual) document map:
© Cartia’s ThemeScape

‘Translation Spotting’:
Why?
  To test document similarity calculation
  To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
  To detect cross-lingual document plagiarism

‘Translation Spotting’ - Results:
Task: find Spanish translations of English source document in a parallel text collection
Simple document similarity (DS)

(Multilingual) clustering of documents:
To organise unknown document collections
Algorithm:
  Find pairs of texts that are most similar
  Group them in one cluster, repeat the operation until only one cluster remains
[Figure labels: 90% 80% 75% 40% 10%]

Building a (multilingual) cluster tree:

Application to (multilingual) news analysis:
EMM system in JRC’s Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4000 articles in English) (http://emm.jrc.it)
Cluster related news stories and identify duplicates (news topic identification)
Identify keywords, people’s names, place names, main sentences (information extraction)
Find related news stories over time (news topic tracking)
Find related news stories in other languages (cross-lingual topic tracking mainly via Eurovoc and place names)

Slide13: Detection of the major news of the day (EMM)

Establish Links to Related News over time:

Establish links to related news in other languages:

Subject-specific summarisation (1):
Title: "Resolution on the 10th anniversary of the Chernobyl accident"

Subject-specific summarisation (2):

Further JRC LT applications:
Recognition and translation of:
  Place names; + visualisation
  Products
Recognition of text language

Place name recognition / Cross-lingual display:

Place name recognition / Visualisation:
18 references (Boston, American, America, New York)
11 references (Vietnam)
5 references (Iraq)
+ 1 reference to Sweden (Andre Heinz(…) Swedish based environmental consultant)

Place name recognition / Disambiguation:
Requires disambiguation
  14 Paris’, 7 Birminghams
  cities called ‘And’, ‘Annan’
  name variants (exonyms)
Zoom on Europe

Recognising names, places, … - News navigation:
Top-mentioned personalities
En/Fr news 26 July 2004

Automatic recognition of name variants:

Automatic link to online encyclopaedia:

News clusters mentioning a person:

Persons talked about in same news clusters:

Countries talked about in same news clusters:

Frequent keywords for these news clusters:

Recognising products and product groups: Sample text

Recognising products and product groups: Identified products

Recognising products and product groups: Cross-lingual display of products found

Slide33:
Multilingual Information Extraction
  Language recognition (demo)
  Keywords (monolingual; cross-lingual)
  Geographical place names (intro; new EU languages; demo)
  Products and product groups (slides; demo JRC, demo CIS)
  Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
  Dates (demo recognition)
  Terminology extraction
  Summarisation (standard sentence extraction; subject-specific summarisation)
Cross-lingual navigation and classification
  Document similarity (monolingual; cross-lingual; translation spotting)
  Bottom-up document clustering; topic detection (demo news analysis)
  Classification (multi-monolingual and cross-lingual; pre-classification clustering)
  Relevance-ranking of documents (slides)
  News topic tracking (monolingual historical; cross-lingual; demo news analysis)
  Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)
Visualisation of textual contents
  Individual documents (document profile)
  Whole document collections (document map)
  Geographical information (maps; animated maps, demo)
  Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …
Further tools
  Document Gathering (Lang-Tech crawler; WT’s EMM system)
  Document format conversion (PDF, MS-Word, PS, HTML, XML)
  Character set conversion (UTF-8, ISO-Latin, HTML, …)
Projects
  IDoRA for OLAF (slides)
  Cross-lingual Indexing (EUROVOC)
  Breaking News – Detection and Visualisation (BNDV / State-of-the-World)
  SVM for Text Classification
Modus Operandi
Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)
JRC Introduction
Multilingual and crosslingual text analysis
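
Illustrative sketch (not part of the original slides): the bottom-up clustering described on the "(Multilingual) clustering of documents" slide (find the most similar pair of texts, group them, repeat until only one cluster remains) could look roughly like the Python below. Several things here are assumptions rather than details given in the presentation: each document is represented as a mapping from Eurovoc descriptor to weight, similarity is measured with the cosine, and two clusters are compared by the average similarity of their members. The document names, descriptor labels and weights are hypothetical. Because the same descriptors would be assigned regardless of the text's language, the same comparison would apply to documents in different languages.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-weight profiles."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_similarity(c1, c2, profiles):
    """Average pairwise similarity between two clusters (assumed linkage)."""
    sims = [cosine(profiles[x], profiles[y]) for x in c1 for y in c2]
    return sum(sims) / len(sims)


def bottom_up_clustering(profiles):
    """Merge the most similar pair of clusters until only one remains.

    Returns the merge history as (cluster_a, cluster_b, similarity) tuples,
    which is the information needed to draw a cluster tree.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are most similar.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], profiles),
        )
        sim = cluster_similarity(a, b, profiles)
        # Group them in one cluster and repeat.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, sim))
    return merges


# Hypothetical descriptor profiles for three documents in three languages.
profiles = {
    "en_doc": {"radioactive waste": 0.9, "nuclear safety": 0.6},
    "es_doc": {"radioactive waste": 0.8, "nuclear safety": 0.7},
    "fr_other": {"fisheries policy": 1.0},
}

merges = bottom_up_clustering(profiles)
for a, b, sim in merges:
    print(a, "+", b, "similarity", round(sim, 2))
```

In this toy input the English and Spanish documents share descriptors and are merged first; the unrelated French document joins only in the final step, at similarity zero. The pairwise search is quadratic in the number of clusters at every step, so a collection of realistic size would need a more efficient implementation.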