LTtartu

Added: January 23, 2008 This presentation is Public

Language Technology at the University of Tartu
Nordic Language Technology Visit, 28 October 2004, Tartu


Research and education
People of 2 faculties are involved:
  Faculty of Mathematics and Computer Science => Institute of Computer Science => Chair of Language Technology (chair exists since 1-9-2001)
  Faculty of Philosophy => Department of Estonian and Finno-Ugric Linguistics => Chair of General Linguistics
Informal research group of computational linguistics
  Head of the group – professor of general linguistics Haldur Õim


People
Chair of LT
  Mare Koit, prof.
  Tiit Roosmaa, PhD, assoc. prof., vice dean of the Faculty of Mathematics and Computer Science, head of the Institute of Computer Science
  Heli Uibo, MSc, lecturer/PhD student
  Kaili Müürisep, PhD, researcher
Chair of GL
  Haldur Õim, prof.
  Renate Pajusalu, assoc. prof.
  Heiki-Jaan Kaalep, PhD, senior researcher
  Neeme Kahusk, researcher/PhD student
  Kadri Muischnek, MA, researcher/PhD student
  Heili Orav, MA, researcher/PhD student
  Andriela Rääbis, MA, researcher/PhD student
  Kadri Vider, MA, researcher/PhD student
  Tarmo Vaino, network administrator/programmer
  Urve Talvik, specialist


Faculty of Mathematics and Computer Science
Vice dean Tiit Roosmaa


Research and education at the University of Tartu
Dr. Madis Saluveer, Department of R&D, Head of Development Office


Research group of computational linguistics
Cooperation with the Institute of Cybernetics at the Tallinn University of Technology and the Institute of Estonian Language
  2002: these 3 research units together applied to be a centre of excellence in language technology (head – prof. Haldur Õim) => potential centre
  2003/4: language technology development centre (head – Dr. Einar Meister, TUT)
Members of the research group have participated in working out the strategy of development of the Estonian language (2004-2010), the language technology part of the state programme “Estonian and national memory” (2004-2010), and the roadmap of Estonian language technology (2004-2010), and are involved in the preparation of the state programme “Technological support of Estonian” (2006-2010).
Main research fields
  computational morphology of Estonian
  computational syntax
  semantics
  spoken Estonian and dialogue modelling
  corpora and other language resources


Computational morphology Heiki-Jaan Kaalep (1/2)
  IEL: unification; guessing (stable and unstable inflectional classes); Ülle Viks
  UT: 2-level; Heli Uibo
  Filosoft (http://www.filosoft.ee/index_en.html): unification; lexicon (spelling); Heiki-Jaan Kaalep


Disambiguation Heiki-Jaan Kaalep (2/2)
  CG: Tiina Puolakainen (UT, IEL)
  HMM: Heiki-Jaan Kaalep (UT, Filosoft)
  500,000 word corpus (gold standard)


Computational syntax
  Tiit Roosmaa
  Heli Uibo
  Kadri Muischnek
  Kaili Müürisep


Syntax - Outline
Projects, funding
Software: Estonian Constraint Grammar Parser and its applications
Resources: steps towards Estonian treebank
  Constraint Grammar corpus
  Sofie Parallel Treebank
  Estonian Treebank Arborest


Projects
  Estonian Science Foundation grant No. 3314 “A formal grammar for the Estonian language” (1998-2000), total funding 11 600 EUR
  Project “Syntactically analyzed and disambiguated text corpus” (2002-2003), funded by the Estonian Ministry of Education and Research under the national program “Estonian language and national heritage”, total funding 22 500 EUR
  Project “Syntax-based language software and the resources needed for its development” (2004-2008), funded by the Ministry of Education and Research, national program “Estonian language and national memory”, in 2004: 16 000 EUR


International cooperation
Network-type projects funded by NorFA under The Nordic Language Technology Research Programme (2000-2004):
  Nordic Treebank Network (2003-2004), coordinated by Joakim Nivre, Växjö University; joins 15 academic institutions from Sweden, Norway, Denmark, Finland, Estonia and Iceland.
  PaNoLa (Parsing Nordic Languages) follow-up project (Sep-Dec 2004), coordinated by Eckhard Bick, University of Southern Denmark. The aim of the project is to create VISL teaching treebanks for smaller Nordic languages – Estonian, Faroese, Greenlandic, Icelandic and Sami.


Software: Syntactic Parser for Estonian (EstCGP)
EstCGP (Estonian Constraint Grammar Shallow Syntactic Parser) is the result of two doctoral dissertations:
  Kaili Müürisep “Computer Grammar of Estonian: Syntax” (Univ. of Tartu, 2000)
  Tiina Puolakainen “Computer Grammar of Estonian: Morphological Disambiguation” (Univ. of Tartu, 2001)
Current evaluation results of EstCGP: precision 76.4-79.2%, recall 95.5-96.9%.
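The gap between the precision and recall figures above is typical when a shallow parser may leave more than one syntactic tag on a word. The following is a minimal sketch of one common way to compute both measures in that setting; the slides do not state the evaluation procedure actually used, and the function name, tag names and toy data are illustrative assumptions.

```python
def precision_recall(predicted, gold):
    """Score a tagger that may leave several candidate tags per word.

    predicted: list of sets, the tags the parser left on each word
    gold:      list of strings, the single correct tag of each word
    (Assumed scoring scheme, not taken from the slides.)
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # A word counts as correct if the gold tag survived among its tags.
    correct = sum(1 for tags, g in zip(predicted, gold) if g in tags)
    emitted = sum(len(tags) for tags in predicted)
    precision = correct / emitted if emitted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical four-word sentence; the tag labels are made up for illustration.
predicted = [{"@SUBJ"}, {"@FMV"}, {"@OBJ", "@ADVL"}, {"@ADVL"}]
gold = ["@SUBJ", "@FMV", "@OBJ", "@OBJ"]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this scheme, every extra tag left on a word lowers precision without touching recall, which is one way a parser can reach high recall with noticeably lower precision.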


Shallow Syntactic Parser: applications
  Noun phrase extraction (K. Müürisep, T. Puolakainen)
  Automatic summarization (K. Müürisep & students)
  Syntax-based information retrieval (K. Kaljurand)
  Grammar check (H. Uibo & students)


Syntactically annotated corpora of Estonian
Estonian Constraint Grammar Corpus
  size – 200 000 running words, ca 15 000 sentences
    184 000 words of Estonian original fiction
    10 000 words of newspaper texts
    6 000 words of legal texts
  shallow annotation, using Constraint Grammar: a syntactic function is determined for every word-form
  has been built to train and test EstCGP
  is being extended semi-automatically
  planned size by Dec 2004 – 300 000 words
  website http://math.ut.ee/~heli_u/syntcorpus.html
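As a quick arithmetic check on the composition figures above, the short sketch below sums the three text classes and reports each one's share of the corpus. Only the word counts come from the slide; the variable names are illustrative.

```python
# Word counts per text class, as listed on the slide.
corpus_parts = {
    "fiction": 184_000,
    "newspaper": 10_000,
    "legal": 6_000,
}

total = sum(corpus_parts.values())
# Share of each text class in the whole corpus.
shares = {name: words / total for name, words in corpus_parts.items()}

print(f"total: {total} words")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

The parts add up to the stated 200 000 running words, and fiction clearly dominates the corpus.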


Estonian Constraint Grammar Corpus
Experiments on EstCGC (K. Kaljurand):
  Conversion of EstCGC to NEGRA export format http://psych.ut.ee/~kaarel/Programs/Treebank/EstCG2Negra/
  Automatic extraction of syntactic dependency relations http://psych.ut.ee/~kaarel/Programs/Treebank/DepDict/


Syntactically annotated corpora for Estonian (cont-d)
Two small-scale experimental treebanks:
  2. Sofie Parallel Treebank – a Penn-style phrase structure treebank of 100 sentences
  3. Arborest – a VISL-style hybrid treebank of 2500 sentences (first 149 sentences manually revised)


Sofie Parallel Treebank
Sofie Parallel Treebank is a joint effort of the members of the Nordic Treebank Network.
Material – the 1st chapter of Jostein Gaarder's novel "Sophie's World".
Currently, the parallel treebank includes Swedish, German, Norwegian, Estonian, Icelandic and two versions of Danish, 20-200 sentences from each language.
Website of the Sofie Parallel Treebank: http://omilia.uio.no/sofie


Sofie Parallel Treebank – example from the web-interface


Estonian Treebank Arborest
Joint work with Dr. Eckhard Bick, University of Southern Denmark
VISL-style (http://beta.visl.sdu.dk) treebank
Annotated for both
  function (S = subject, P = predicate, O = object, A = adverbial, STA = statement, QUE = question, etc.) and
  form (np, vp, pp, advp, adjp, fcl = finite clause, par = paratagma, etc.)


Arborest
Automatically generated from a sample of the CG-corpus (2500 sentences) with CG→PSG rules
149 sentences revised
1/3 of sentences correct
CG→PSG rules are under improvement
Webpage http://corp.hum.sdu.dk/arborest.html


Arborest – sample tree


Plans
  To enlarge all three syntactically annotated corpora.
  To improve the CG-to-PSG rules so that an Estonian treebank can be built semi-automatically with less effort.
  To investigate how much semantic information can be derived from the syntactic structure.
  To build a phrase-aligned Estonian-German-Swedish parallel treebank.


Semantics
  Haldur Õim
  Heili Orav
  Neeme Kahusk
  Kadri Vider


Semantics – PhD studies
  Kadri Vider – Word Sense Disambiguation of Estonian Verbs According to Lexical-Syntactic Information
  Heili Orav – Semantics of personal traits
  Neeme Kahusk (PhD student at Tallinn Pedagogical University) – The role of semantic relations in word explanation task demanding quick response


Semantics - Grants
Target (governmental) financing program
  Elaboration and implementation of computational linguistics tools for creation of Estonian language resources (SF0180528s98, 01.01.98-31.12.02)
  Computational models and language resources for Estonian: theoretical and applicational aspects (SF0182541s03, 01.01.03-31.12.07)
Estonian Science Foundation
  Creation of a Semantic Disambiguator for Estonian (ETF4467, 01.01.00-31.12.02)
  Concept based resources and processing tools for the Estonian language (ETF5534, 01.01.03-31.12.06)
Governmental Research Program
  Human Language Technology: Semantic analysis of Estonian simple sentences


Semantics – current courses of action (1)
Estonian Wordnet
  10,000 synsets, 18,900 word senses
  WordNet taken as a model
  EuroWordNet-2 project member 1998-1999
  Global WordNet Association member
Publications:
• EuroWordNet Technical Reports: Deliverables 2D001, 2D003, 2D006, 2D007, 2D008, 2D010, 2D014, 2D014
• Kadri Vider, Neeme Kahusk, Heili Orav, Haldur Õim, Leho Paldre, 2000. Eesti keele tesaurus (The Estonian Thesaurus). Publications of the Department of General Linguistics of the University of Tartu, vol. 1. Ed. by T. Hennoste. Tartu, 2000, pp. 127-152.
• H. Orav. “Adjectives as semantic problem: wordnet-type thesaurus collection experience”. COMPLEX 2001, Birmingham, UK.
• Orav, H. Adjectives in wordnet-type thesaurus: Estonian experience. In Proceedings of the 1st International Global WordNet Conference, Central Institute of Indian Languages, Mysore, India, 2002, pp. 22-25.
• Vider, K., Orav, H. Concerning the difference between a conception and its application in the case of the Estonian wordnet. Proceedings of the second international wordnet conference. Eds. P. Sojka, K. Pala, P. Smrz, Ch. Fellbaum, P. Vossen. Masaryk University, Brno, 2003, pp. 285-290.
• Vider, K., Orav, H. Estonian wordnet and Lexicography. Symposium on Lexicography XI. Proceedings of the Eleventh International Symposium on Lexicography, May 2-4, 2002 at the University of Copenhagen. Ed. by H. Gottlieb, J. E. Mogensen and A. Zettersten. Max Niemeyer Verlag, in press.
• Vider, K. Notes about labelling semantic relations in Estonian WordNet. Proceedings of Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). Ed. by D. N. Christodoulakis, C. Kunze, L. Lemnitzer. ELRA, Las Palmas de Gran Canaria, 2002, pp. 56-59.


Semantics – current courses of action (2)
Word Sense Disambiguation
  SensEval-2 all-words task for Estonian
  Results: 2 systems, precision & recall 66%
  Estonian WSD corpus: ~100,000 tokens, ~42,000 annotated content words
Publications:
• Kahusk, N. and Vider, K. 2002. Estonian WordNet Benefits from Word Sense Disambiguation. In Proceedings of the 1st International Global WordNet Conference, Central Institute of Indian Languages, Mysore, India, pp. 26-31.
• Vider, K. and Kaljurand, K. Automatic WSD: Does it make sense of Estonian? Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, 2001, pp. 159-162.
• Kahusk, N., Orav, H., Õim, H. Sensiting inflectionality: Estonian Task for SENSEVAL-2. Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, 2001, pp. 25-28.
• Kahusk, Neeme. A Lexicographer's Tool for Word Sense Tagging According to WordNet. Proceedings of Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). Ed. by D. N. Christodoulakis, C. Kunze, L. Lemnitzer. ELRA, Las Palmas de Gran Canaria, 2002, pp. 1-7.
• Kaarel Kaljurand. Word Sense Disambiguation of Estonian with syntactic dependency relations and WordNet. Proceedings of the Ninth ESSLLI Student Session. Ed. by L. Alonso i Alemany and P. Egre. August 2004, Nancy, pp. 128-137.


Spoken Estonian and dialogue modelling 1/3
People
  Tiit Hennoste (has been working at Helsinki University since 1/9/2004)
  Andriela Rääbis
  Mare Koit
  PhD and Master’s students
Goals
  to study spoken Estonian and its different registers
  to collect different kinds of spoken texts into the corpus of spoken Estonian
  to model human-computer interaction in Estonian


Spoken Estonian and dialogue modelling 2/3
Corpus of spoken Estonian (started 1997)
  490 tapes
  1100 transcribed texts (700,000 running words)
Dialogue corpus (started 2001)
  spoken dialogues (sub-part of the corpus of spoken Estonian: 400 texts, 100,000 running words)
  written dialogues collected by the Wizard-of-Oz method (20 texts, 2500 running words)
  dialogue acts are annotated in the dialogue corpus – a typology of dialogue acts has been worked out
  theoretical basis of the typology – conversation analysis


Spoken Estonian and dialogue modelling 3/3
We analyze how various types of dialogue acts are used in a special domain – calls for information (information offices, travel bureaus) – and how their use depends on the Estonian cultural space.
We are testing machine learning methods for automatic recognition of dialogue acts.
We have presented our work at the “Text, Speech and Dialogue” conference (2003), SIGdial workshops (2003, 2004), the LREC 2004 workshop “Compiling and Processing Spoken Language Corpora”, the 1st Baltic Conference “Human Language Technologies” (2004), etc.
Grants: Estonian Science Foundation, Estonian Ministry of Education and Research
International cooperation (previous):
  Nordic network “Corpus-based research on spoken language” (2000-2004, Tiit Hennoste)
  Nordic network for researchers in conversation studies (2000-2004)


Language resources 1/5 Kadri Muischnek
Corpora
  Corpus of Written Estonian 1890-1990
  The Mixed Corpus of Estonian
  Balanced corpus (newspaper texts + fiction + science texts)
  Morphologically disambiguated corpus
  WSD corpus (word sense disambiguation)
  Syntactically annotated corpus
Language technology resources (besides corpora)
  Corpus query
  Frequency Dictionary
  Database of Multi-Word Expressions
  Thesaurus
  Morphological analyser
  Speller of Estonian (HTML)


Language resources 2/5
Corpus of Written Estonian 1890-1990
  corpus of the 1990s (380 000 words newspaper texts + 600 000 words fiction)
  corpus of the 1980s (1 million words, Brown & LOB-style text classes)
  corpus of the 1970s (170 000 words newspaper texts + 250 000 fiction)
  corpus of the 1960s (200 000 words newspaper texts + 130 000 fiction)
  corpus of the 1950s (240 000 words newspaper texts + 60 000 fiction)
  corpus of the 1930s (120 000 words newspaper texts + 150 000 fiction)
  corpus of the 1910s (180 000 words newspaper texts + 250 000 fiction)
  corpus of the 1900s (170 000 words newspaper texts + 65 000 fiction)
  corpus of the 1890s (190 000 words newspaper texts + 50 000 fiction)


Language resources 3/5
Mixed Corpus of Estonian
Big (in our dreams, 200 million words), non-balanced; contains whole texts, not text samples.
At the moment, the corpus consists of the following:
  weekly «Eesti Ekspress» (issues 09.08.1996 - 29.11.2001, 7.5 million words)
  daily «Postimees» (issues 27.11.1995 - 10.10.2000, 1760 issues containing 88 600 articles, 32.9 million words)
  weekly «Maaleht» (6 million words, coming soon)
  journal «Horisont» (1996 - 2003, 260 000 words)
  journal «Akadeemia» (7.5 million words, coming soon)
  fiction from the year 1995 onwards (4.2 million words)
  PhD dissertations (0.5 million words)
  Parliament transcripts 1995-2001 (13 million words)
  Estonian and European legal documents (ca 1.8 million and 10 million words)


Language resources 4/5
The Mixed Corpus contains a balanced subcorpus called The Balanced Corpus.
The aim of this corpus is to enable the comparison of three main text classes of written language: newspaper, fiction and scientific texts.
  5 million words of newspaper texts
  4 million words of fiction (aim: 5 million)
  half a million words of scientific texts (aim: 5 million)
Morphologically Disambiguated Corpus
  Fiction             104 000
  G. Orwell "1984"     75 800
  Newspaper texts     111 000
  Legal documents     121 000
  Journal Horisont     99 000
  Informative texts     4 000
  Total               513 000
  disambiguated manually by 2 persons


Language resources 5/5
Frequency Dictionary
  based on 1 million words (500 000 newspaper texts + 500 000 fiction from the second half of the 1990s)
Database of Multi-Word Expressions
  based on 6 dictionaries
  subpart: Database of Multi-Word Verbs: data extracted from the dictionaries + collocations extracted from the corpora


Education
Two models of higher education:
  old: 4 years (Bachelor) + 2 years (Master of Arts or Master of Science) [+ 4 years (PhD)]
  new, since 2002/2003 (Bologna declaration): 3+2 [+4]
1 year = 40 credits (AP)
1 credit = 40 work hours (= 1.5 ECTS)
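The credit arithmetic above can be written out as a small sketch. The three conversion factors are taken from the slide; the function names are illustrative.

```python
# Conversion factors stated on the slide.
AP_PER_YEAR = 40      # 1 study year = 40 credits (AP)
HOURS_PER_AP = 40     # 1 credit = 40 work hours
ECTS_PER_AP = 1.5     # 1 credit = 1.5 ECTS


def ap_to_hours(ap):
    """Work hours represented by a number of Estonian credits (AP)."""
    return ap * HOURS_PER_AP


def ap_to_ects(ap):
    """ECTS credits equivalent to a number of Estonian credits (AP)."""
    return ap * ECTS_PER_AP


def years_to_ap(years):
    """Credits (AP) earned over a number of full study years."""
    return years * AP_PER_YEAR


# One study year under these factors.
print(ap_to_hours(years_to_ap(1)), "hours,", ap_to_ects(years_to_ap(1)), "ECTS")
```

For example, a full study year of 40 AP corresponds to 1600 work hours or 60 ECTS, and a 3+2 programme totals 200 AP, that is 300 ECTS.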


PhD studies 1/3
No speciality of language technology on the PhD level
The relevant research training is typically carried out under General Linguistics or Computer Science
The number of PhD student positions was very limited before 2004 (1-2 in GL, 0-1 in CS)
Currently, 8 PhD students are specialising in LT (4 in GL, 4 in CS)
Individual study plan for every student
  Obligatory courses 20 AP
  Optional courses related to the field of specialisation 20 AP
  PhD thesis 120 AP


PhD studies 2/3
Optional courses can also be covered by
  short courses of visiting professors
    2004: Dr. Graham Wilcock (University of Helsinki) “XML-based document transformations”, Prof. Vadim Stefanyuk (Moscow) “Lisp and artificial intelligence” (supported by Estonian Tiger University)
    2005 February: Prof. Yorick Wilks. Students of NGSLT are welcome!
  summer schools organised in Tartu
    1998 Formal grammars and their applications (8 courses, supported by HESP)
    2002 Applications of language technology
    2004 Empirical methods in language technology (2 courses, supported by FW5 programme eVikings II, Estonian Tiger University, and Nordic Treebank Network)
  short courses and summer schools abroad (our students have participated in ESSLLI, Finnish GSLT, Swedish GSLT courses, NGSLT, Vilem Mathesius lecture series etc.)


PhD studies 3/3
3 PhD theses defended in the last 5 years
  1999 Heiki-Jaan Kaalep (Creating and use of resources of Estonian in language-technological development work)
  2000 Kaili Müürisep (Computational grammar of Estonian: syntax)
  2001 Tiina Puolakainen (Computational grammar of Estonian: morphological disambiguation)


Master studies 1/2
Old model (4+2 years). Number of tuition-free positions is very limited!
Speciality of computational linguistics on the bachelor level at the Faculty of Philosophy, started in 1998 (supported by HESP)
  6 BA, 1 MA
  3 MA students at the moment
Some students of general linguistics have specialised in language technology on the master level
  4 MA
Some students of computer science are specialising in language technology
  8 BSc, 5 MSc since 1999
  4 MSc students at the moment


Master studies 2/2
New model (3+2 years, since 2002/2003)
  Computational linguistics at the Faculty of Philosophy (3+2) => master of Estonian and Finno-Ugric linguistics (not MA)
  Language technology at the Faculty of Mathematics and Computer Science (3+2) => master of informatics (not MSc)


Course for school children
Neeme Kahusk and Kadri Vider conducted a training course in computational linguistics at Hugo Treffner Gymnasium in the spring terms of 2002 and 2003.


PhD studies – personal experience
Kadri Vider (general linguistics)
Heli Uibo (computer science)
Different backgrounds
  Kadri –
    B.A. in Estonian language and literature in 1995
    M.A. in general linguistics in 1999
    PhD studies in general linguistics
  Heli –
    Bachelor’s studies in applied mathematics (≈ computer science) 1989-1993
    M.Sc. in computer science in 1999
    PhD studies in computer science


PhD courses in CL or LT abroad
Supported by NorFA:
  Graduate School of Language Technology in Finland – 4 students, 3 courses
  Swedish National Graduate School of Language Technology – at least 2 students, 3 courses
  the Nordic Graduate School of Language Technology
  Courses in Copenhagen Business School
  Treebank course, a PhD course organized by Nordic Treebank Network (Stockholm University, March 2004) – 2 students


PhD courses in CL or LT abroad
ESSLLI (European Summer School of Logic, Language and Information)
  Annual summer school
  Covers a broad variety of courses ranging from pure linguistics to pure theoretical computer science and logic, through courses combining these areas (computational linguistics, logic programming, etc.)
  Participants from the University of Tartu (students whose research topic is within CL or LT):
    1998 – 1
    1999 – 1
    2000 – 3
    2001 – 3 (participation of Estonian students supported by NorFA)
    2002 – 1
    2003 – 2
    2004 – 1
NATO ASI summer school “LT for lesser-studied languages” (Bilkent, Turkey, 2000) – 2 students


PhD courses in CL or LT abroad
Vilem Mathesius Lecture Series (Charles University, Prague)
  organized by the Vilem Mathesius Centre for Research and Education in Semiotics and Linguistics
  19 lecture series during 1992-2004
  two intensive weeks with short courses in linguistics and computational linguistics
  about 20 participants from the University of Tartu during 1997-2004