Exploiting Multilingual Corpora for Machine Translation
Andreas Eisele
Saarland University & DFKI
eisele@dfki.de
Arona, September 2005
JRC Enlargement and Integration Workshop
Exploiting parallel corpora in up to 20 languages
Overview:
Multilingual/MT Projects & Tools at DFKI
MT-Related Activities at Saarland University
Work in the PTOLEMAIOS Project
Plans for Near-Term Future
Multilingual Projects at DFKI:
Main LT Application Areas:
Multilingual Natural Communication
Multilingual Document Production
Crosslingual Information and Knowledge Management
Multilingual Natural Communication:
NL Dialogue Systems (DISCO, COSMA, Interprice)
Speech Dialogue Processing (Verbmobil, Interprice)
Robust Speech Parsing (Verbmobil, Interprice)
Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)
Natural Speech Synthesis (Mary, Interprice)
Sample Application Areas: e-commerce (product search, CRM)
Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies
Multilingual Document Production:
Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)
Grammar and Style Checking (LATESLAV, FLAG, SKATE)
Controlled Language Checking (FLAG, WHITEBOARD, SKATE)
Automatic XML Tagging (WHITEBOARD)
Consistency Control (BiLD, WHITEBOARD)
Sample Application Areas: multilingual document production, web-content production
Application Project with SAP
Spin-Off company
Crosslingual Information and Knowledge Management:
Crosslingual Content Management (TWENTYONE, MUCHMORE)
Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)
Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)
Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)
Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)
Multilingual Summarization (MULINEX, MUCHMORE, MUSI)
Multilingual Language Generation (TG/2, TEMSIS, MIETTA)
Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championships, Olympic Games), phonetic trademark search, term extraction from patent translations
Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert & Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …
Multilingual Resources at DFKI:
POS-tagger TnT (T. Brants) and Chunkie can be trained for arbitrary languages
Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)
Morphologies from MMorph project exist for German, English, French, Spanish, Italian
Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation
Adding more languages is very easy (as done for Arabic with A. Soudi)
Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking
Multilingual Projects at DFKI:
Main LT Application Areas:
Multilingual Natural Communication
Multilingual Document Production
Crosslingual Information and Knowledge Management
Topic emerging since 2005:
Machine Translation
Machine Translation at DFKI:
Topics in Compass (Digital Olympics 2006):
Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering
Open LOGOS
LOGOS MT ® = one of the largest and most powerful among the commercial MT engines
DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)
Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)
MT Activities at Saarland University:
Guiding principle: Start with a method that works today, then improve it by adding linguistic functionality as appropriate
Starting point: Phrase-based SMT (Köhn, Och, Marcu, HLT-NAACL 2003)
Conceptually, phrase-based SMT is an intermediate step between translation memories (TM) and MT: it combines TM’s ability to learn from examples with the compositionality of MT
Among best approaches in ongoing DARPA evaluation campaign
Easy to deploy (thanks to tools by F.J. Och and P. Köhn)
Conceptually very simple, hence a good candidate for enriching its models with linguistic sophistication
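To illustrate how a phrase-based system can "learn from examples", here is a minimal sketch of phrase-pair extraction from one word-aligned sentence pair. It is a simplified variant of the usual consistency criterion (it does not extend phrases over unaligned words), and the function name, the `max_len` parameter and the toy sentence are assumptions made for illustration, not part of the tools mentioned above.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)


def extract_phrase_pairs(src: List[str], tgt: List[str],
                         alignment: Set[Link],
                         max_len: int = 3) -> Set[Tuple[str, str]]:
    """Collect phrase pairs that are consistent with the word alignment.

    A source span and a target span are consistent when at least one link
    lies inside both spans and no link connects a word inside the target
    span to a word outside the source span.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word of the source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the span if a target word is also linked to a source
            # word outside the span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for s, t in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Hypothetical toy example: "habe ... gesehen" is jointly aligned to "saw".
src = ["ich", "habe", "es", "gesehen"]
tgt = ["I", "saw", "it"]
alignment = {(0, 0), (1, 1), (3, 1), (2, 2)}

for pair in sorted(extract_phrase_pairs(src, tgt, alignment)):
    print(pair)
```

In this toy case "habe" alone is not extracted, because "saw" is also linked to "gesehen"; only the larger unit "habe es gesehen" / "saw it" survives. Such stored multi-word units are what phrase-based SMT shares with a TM, while recombining them in new sentences provides the compositionality.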
MT Activities at Saarland University:
April ’05: participation in ACL shared task on statistical machine translation with a multi-engine approach ({Finnish, French, German, Spanish} → English)
May ’05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)
Project seminar on empirical MT, students learned to turn parallel corpora into SMT systems (based on EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)
Diploma Thesis on corpus-based MT via RMRS alignment
Experience: Using parallel corpora for MT quickly yields very promising results! We should have more language pairs and more data…
Crawling of UN document repository, collection of 6-way parallel {Arabic,Chinese,English,French,Russian,Spanish} corpus (+ some German)
The PTOLEMAIOS project:
Assumptions:
Advanced language technology for truly multilingual applications is a key challenge for computational linguistics
Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for “smaller” languages
Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data
Word alignments derived from simple models (GIZA++) can help to support this process
“Parallel-Text-based Optimization for Language learning: Exploiting Multilingual Alignment for the Induction Of Syntactic grammars”
PTOLEMAIOS:
Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn
Expected Duration: April 2005 – March 2009
Original Goal:
Induce grammars from parallel corpora (and evaluate them in isolation)
Revised Goal (since August’05):
Evaluate grammars with respect to their impact on MT performance
First Steps:
Use GIZA++-derived word alignments as a filter to speed up parsing; several papers on suitable parsing algorithms
Use of LinearB’s SMT decoder on phrase-aligned EuroParl corpus
Planned Steps:
Explore the usefulness of syntactic analyses for phrase-based SMT
word-based and syntax-based partial analyses are offered to decoder
decoder can exploit syntax if useful, fall back to plain PBSMT if not
optimal weight of syntactic dependencies can be determined empirically
Work on more languages (UN corpus in 6 languages, AC corpus)
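The planned setup, in which syntax gets a weight that is set empirically and the system falls back to plain phrase-based SMT when syntax does not help, can be sketched as a weighted feature combination. This is only an illustrative reranking sketch under stated assumptions: the feature names (`tm`, `lm`, `syntax`), the exact-match quality measure and the grid search are hypothetical stand-ins, not the project's decoder or tuning procedure.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Dict[str, object]  # {"text": str, "features": Dict[str, float]}


def loglinear_score(features: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def best_hypothesis(hyps: List[Hypothesis],
                    weights: Dict[str, float]) -> Hypothesis:
    """Pick the candidate translation with the highest model score."""
    return max(hyps, key=lambda h: loglinear_score(h["features"], weights))


def tune_syntax_weight(dev_set: Sequence[Tuple[List[Hypothesis], str]],
                       base_weights: Dict[str, float],
                       candidates: Sequence[float],
                       quality: Callable[[str, str], float]) -> float:
    """Grid search: return the syntax weight with the best dev-set quality.

    A weight of 0.0 ignores the syntax feature, i.e. plain PBSMT scoring.
    """
    best_weight, best_quality = candidates[0], float("-inf")
    for w in candidates:
        weights = dict(base_weights, syntax=w)
        total = sum(quality(best_hypothesis(hyps, weights)["text"], ref)
                    for hyps, ref in dev_set)
        if total > best_quality:
            best_weight, best_quality = w, total
    return best_weight


# Hypothetical development data: one sentence with two candidates.
dev_set = [(
    [{"text": "he has seen it",
      "features": {"tm": -2.0, "lm": -3.0, "syntax": 1.0}},
     {"text": "he has it seen",
      "features": {"tm": -1.5, "lm": -3.2, "syntax": 0.0}}],
    "he has seen it",
)]
exact_match = lambda hyp, ref: 1.0 if hyp == ref else 0.0
chosen = tune_syntax_weight(dev_set, {"tm": 1.0, "lm": 1.0},
                            [0.0, 0.5, 1.0], exact_match)
print(chosen)
```

If the search settles on 0.0, the syntax feature is effectively switched off and the model behaves like plain PBSMT; any positive value indicates that the syntactic evidence improved the chosen translations on the development data.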
EuroMatrix: current situation (joint work with Philipp Köhn and Chris Callison-Burch, Edinburgh):
MT systems per language pair (data taken from J. Hutchins’ Compendium of Translation Software, 10th Edition)
EuroMatrix: current situation:
Most language pairs remain uncovered
EuroMatrix: SMT for many languages:
EuroParl Corpus has been constructed to build statistical MT systems
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT Summit X, September 2005
EuroMatrix: SMT for many languages:
Multilingual corpora can be aligned across all languages…
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT Summit X, September 2005
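As a rough illustration of why one multilingual corpus yields training data for many language pairs: with n languages there are n × (n − 1) translation directions. The sketch below assumes, purely for illustration, that each language's text is stored as a mapping from a shared segment ID to a sentence; this storage format and the function name are hypothetical and do not describe how EuroParl is actually distributed.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical layout: language code -> {segment id -> sentence}.
Corpus = Dict[str, Dict[int, str]]


def pairwise_corpora(corpus: Corpus) -> Dict[Tuple[str, str],
                                             List[Tuple[str, str]]]:
    """Build a sentence-aligned bitext for every ordered language pair.

    Only segments present in both languages are kept, so gaps in one
    language do not affect the other pairs.
    """
    bitexts = {}
    for src, tgt in permutations(sorted(corpus), 2):
        shared_ids = sorted(corpus[src].keys() & corpus[tgt].keys())
        bitexts[(src, tgt)] = [(corpus[src][i], corpus[tgt][i])
                               for i in shared_ids]
    return bitexts


toy = {
    "de": {1: "Guten Tag", 2: "Danke"},
    "en": {1: "Good day", 2: "Thank you", 3: "Goodbye"},
    "fr": {1: "Bonjour", 3: "Au revoir"},
}
bitexts = pairwise_corpora(toy)
print(len(bitexts))          # number of translation directions
print(bitexts[("en", "fr")])
```

Each resulting bitext could then be fed to a standard SMT training pipeline, one system per translation direction.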
EuroMatrix: SMT for many languages:
SMT systems derived from the corpora vary in quality
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT Summit X, September 2005
EuroMatrix: SMT for many languages:
Difficulty of translation into and from a given language may differ widely…
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT Summit X, September 2005
EuroMatrix:
Ideas:
For language pairs where rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit complementary strengths of both approaches
Parallel corpora can then be used in two ways
feeding the SMT sub-system
fine-tuning the integrated setup
For language pairs where only monolingual resources (lexicons, morphologies, taggers,…) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data
We need a generic framework that makes it possible to plug and play with different approaches (an open source MT toolbox)
Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task
Conclusion:
Parallel corpora can enable or boost machine translation performance
Current work just scratches the surface of what can be done
SMT systems for the languages of new member states should soon emerge from AC corpus
More parallel data for these languages would be desirable (100 million words would be much better than 10 million!)
It would be very helpful to cooperate with teams from “new” countries for morphologies, taggers, parsers,…