Exploiting Multilingual Corpora for Machine Translation

Andreas Eisele
Saarland University & DFKI
eisele@dfki.de
Arona, September 2005
JRC Enlargement and Integration Workshop
Exploiting parallel corpora in up to 20 languages

Overview

- Multilingual/MT Projects & Tools at DFKI
- MT-Related Activities at Saarland University
- Work in the PTOLEMAIOS Project
- Plans for Near-Term Future

Multilingual Projects at DFKI

Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management

Multilingual Natural Communication

- NL Dialogue Systems (DISCO, COSMA, Interprice)
- Speech Dialogue Processing (Verbmobil, Interprice)
- Robust Speech Parsing (Verbmobil, Interprice)
- Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)
- Natural Speech Synthesis (Mary, Interprice)

Sample Application Areas: e-commerce (product search, CRM)
Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies

Multilingual Document Production

- Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)
- Grammar and Style Checking (LATESLAV, FLAG, SKATE)
- Controlled Language Checking (FLAG, WHITEBOARD, SKATE)
- Automatic XML Tagging (WHITEBOARD)
- Consistency Control (BiLD, WHITEBOARD)

Sample Application Areas: multilingual document production, web-content production
Application Project with SAP Spin-Off company

Crosslingual Information and Knowledge Management

- Crosslingual Content Management (TWENTYONE, MUCHMORE)
- Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)
- Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)
- Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)
- Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)
- Multilingual Summarization (MULINEX, MUCHMORE, MUSI)
- Multilingual Language Generation (TG/2, TEMSIS, MIETTA)

Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championship, Olympic Games), phonetic trademark search, term extraction from patent translations
Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert & Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …

Multilingual Resources at DFKI

- POS-tagger TnT (T. Brants) and Chunkie can be trained for arbitrary languages
- Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)
- Morphologies from MMorph project exist for German, English, French, Spanish, Italian
- Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation
- Adding more languages is very easy (as done for Arabic with A. Soudi)
- Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking

Multilingual Projects at DFKI

Main LT Application Areas:
- Multilingual Natural Communication
- Multilingual Document Production
- Crosslingual Information and Knowledge Management

Topic emerging since 2005: Machine Translation

Machine Translation at DFKI

Topics in Compass (Digital Olympics 2006): Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering

Open LOGOS:
- LOGOS MT ® = one of the largest and most powerful among the commercial MT engines
- DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)

Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)

MT Activities at Saarland University

Guiding principle: Start with a method that works today, improve it by adding linguistic functionality as appropriate
Starting point: Phrase-based SMT (Koehn, Och, Marcu, HLT-NAACL 2003)
- Conceptually, phrase-based SMT is an intermediate step between TM and MT; it combines TM’s ability to learn from examples with the compositionality of MT
- Among best approaches in ongoing DARPA evaluation campaign
- Easy to deploy (thanks to tools by F.J. Och and P. Koehn)
- Conceptually very simple, hence a good candidate to enrich models with linguistic sophistication

MT Activities at Saarland University

- April ’05: participation in ACL shared task on statistical machine translation with a multi-engine approach ({Finnish, French, German, Spanish} → English)
- May ’05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)
- Project seminar on empirical MT, students learned to turn parallel corpora into SMT systems (based on EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)
- Diploma thesis on corpus-based MT via RMRS alignment

Experience: Using parallel corpora for MT quickly yields very promising results!
We should have more language pairs and more data…
- Crawling of UN document repository, collection of 6-way parallel {Arabic, Chinese, English, French, Russian, Spanish} corpus (+ some German)

The PTOLEMAIOS project

Assumptions:
- Advanced language technology for truly multilingual applications is a key challenge for computational linguistics
- Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for “smaller” languages
- Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data
- Word alignments derived from simple models (GIZA++) can help to support this process

“Parallel-Text-based Optimization for Language learning ― Exploiting Multilingual Alignment for the Induction Of Syntactic grammars”

PTOLEMAIOS

Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn
Expected Duration: April 2005 – March 2009
Original Goal: Induce grammars from parallel corpora (and evaluate them in isolation)
Revised Goal (since August ’05): Evaluate grammars with respect to their impact on MT performance

First Steps:
- Use GIZA++-derived word alignment as filter to speed up parsing, several papers on suitable parsing algorithms
- Use of LinearB’s SMT decoder on phrase-aligned EuroParl corpus

Planned Steps:
- Explore the usefulness of syntactic analyses for phrase-based SMT
  - word-based and syntax-based partial analyses are offered to decoder
  - decoder can exploit syntax if useful, fall back to plain PBSMT if not
  - optimal weight of syntactic dependencies can be determined empirically
- Work on more languages (UN corpus in 6 languages, AC corpus)

EuroMatrix: current situation (joint work with Philipp Koehn and Chris Callison-Burch, Edinburgh)

MT systems per language pair (data taken from J. Hutchins’ Compendium of Translation Software, 10th Edition)
Most language pairs remain uncovered

EuroMatrix: SMT for many languages

- EuroParl Corpus has been constructed to build statistical MT systems
- Multilingual corpora can be aligned across all languages…
- SMT systems derived from the corpora vary in quality
- Difficulty of translation into and from a given language may differ widely…

Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Koehn, MT Summit X, September 2005

EuroMatrix

Ideas:
- For language pairs where rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit complementary strengths of both approaches
- Parallel corpora can then be used in two ways:
  - feeding the SMT sub-system
  - fine-tuning the integrated setup
- For language pairs where only monolingual resources (lexicons, morphologies, taggers, …) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data
- We need a generic framework that allows plugging and playing with different approaches (an open source MT toolbox)
- Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task

Conclusion

- Machine translation performance can be enabled/boosted by parallel corpora
- Current work just scratches the surface of what can be done
- SMT systems for the languages of new member states should soon emerge from AC corpus
- More parallel data for these languages would be desirable (100MW much better than 10MW!)
- It would be very helpful to cooperate with teams from “new” countries for morphologies, taggers, parsers, …
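
The slides note that multilingual corpora can be aligned across all languages and that most language pairs remain uncovered. Both points can be illustrated with a minimal Python sketch. It assumes that every language's segments already carry a shared segment ID, which is a hypothetical simplification standing in for real document and sentence alignment. The function names and toy sentences are illustrative only and do not come from the talk.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return all ordered (source, target) pairs; n languages give n * (n - 1)."""
    return list(permutations(languages, 2))


def multiway_parallel(corpora):
    """Keep only the segments that are present in every language.

    `corpora` maps a language code to {segment_id: sentence}. The shared
    segment IDs are an assumption made for this sketch.
    """
    shared = set.intersection(*(set(segs) for segs in corpora.values()))
    return {sid: {lang: corpora[lang][sid] for lang in corpora}
            for sid in sorted(shared)}


# Toy data for illustration only.
corpora = {
    "en": {1: "Resumption of the session", 2: "Thank you.", 3: "The debate is closed."},
    "de": {1: "Wiederaufnahme der Sitzung", 2: "Vielen Dank.", 3: "Die Aussprache ist geschlossen."},
    "fr": {1: "Reprise de la session", 3: "Le débat est clos."},
}

aligned = multiway_parallel(corpora)  # segment 2 is dropped: no French version
pairs = directed_pairs(corpora)       # ordered pairs over the language codes

print(len(aligned), "segments shared by all languages")
print(len(pairs), "directed language pairs for", len(corpora), "languages")
print(len(directed_pairs(range(20))), "directed language pairs for 20 languages")
```

The number of directed pairs grows quadratically with the number of languages, so a single multi-way parallel corpus can in principle supply training data for many translation directions at once, at the cost of discarding segments that are missing in any one language.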