Oriental COCOSDA, LREC 2006

Oriental COCOSDA: Past, Present and Future: 

Shuichi ITAHASHI, National Institute of Informatics (NII), Tokyo, Japan; AIST, Tsukuba, Japan
Chiu-yu TSENG, Academia Sinica, Taipei, Taiwan
Satoshi NAKAMURA, ATR Spoken Language Communication Res. Labs., Kyoto, Japan

Contents: 

Necessity of Speech Corpora
Organizations for Speech Corpora
Asian Languages
Brief History
Goals & Strategies
Regional Activities
Conclusion

Necessity of Speech Corpus: 

[Diagram] Speech Data + Related Information
↑ Speech Research
→ Objectivity of Research, Openness to the Public, Preserving Cultural Legacy
↓ Preservation of Spoken Language Data

Organizing Creation & Utilization of Speech Corpora: 

Creating speech corpora is costly.
Using them requires a system for distributing corpora.
Several activities started in the early 1990s:
1991 COCOSDA
1992 LDC in the U.S.A.
1995 ELRA in Europe

COCOSDA: 

International Coordinating Committee on Speech Databases and Speech I/O Systems Assessment
Workshops are held annually at Interspeech.
COCOSDA promotes the development of spoken language corpora for building and/or evaluating spoken language technology, and offers coordination of projects and research efforts to improve their efficiency.

Features of Asian Languages: 

1. Many languages, belonging to different language families
2. Variety of orthographic systems: various letters/characters are used
3. Some tonal languages
4. No space between words in some languages
5. Non-unique romanization systems

Language Families of Asian Languages: 

Austronesian (1268 languages): Malay, Indonesian, etc.
Sino-Tibetan (403): Chinese, Tibetan, Burmese, etc.
Austro-Asiatic (169): Khmer, Vietnamese, etc.
Tai-Kadai (76): Thai, Lao, etc.
Dravidian (73): Tamil, Telugu, etc.
Altaic (66): Mongolian, Turkic, Korean, etc.
Japanese (12): Japanese, Ryukyuan, etc.
cf. Indo-European (449)
Source: Ethnologue.com

Letters, Tone & Word Order: 

1. Native scripts: Burmese, Chinese, Japanese, Khmer, Korean, Thai, etc.
2. Latin letters: Indonesian, Malay, Vietnamese, etc.
3. Tonal languages: Burmese, Chinese, Lao, Thai, Vietnamese, etc.
4. Word order: SOV, SVO, VSO, VOS

Word boundary in text: 

No space between words: Burmese, Chinese, Japanese, Khmer, Lao, Thai, etc.
Space between words: Indonesian, Malay, Mongolian, Vietnamese, etc.
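
As a rough illustration of why the word-boundary distinction matters for corpus work, the sketch below applies naive whitespace splitting to one space-delimited sentence and one sentence written without spaces. The two example sentences and the helper name `whitespace_tokens` are assumptions made for illustration; they do not come from the slides, and real corpora for the unspaced languages would need a dedicated word segmenter.

```python
def whitespace_tokens(text):
    """Naive word segmentation: split on whitespace only."""
    return text.split()

# Illustrative sentences (hypothetical examples, not taken from the slides).
indonesian = "Saya makan nasi"        # written with spaces between words
japanese = "私はご飯を食べます"          # written without spaces between words

indonesian_tokens = whitespace_tokens(indonesian)
japanese_tokens = whitespace_tokens(japanese)

print(indonesian_tokens)     # one token per word
print(len(japanese_tokens))  # the whole sentence comes back as a single token
```

Whitespace splitting recovers the words of the Indonesian sentence but returns the Japanese sentence as one unsegmented string, which is why transcription and lexicon building for such languages need an explicit segmentation step.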

Asian Activities: 

1994, 1997 Oriental COCOSDA
1999 GSK (Language Resource Association) in Japan
2001 SITEC (Speech Information Technology & Industry Promotion Center) in Korea
2002 Chinese LDC, CCC (Chinese Corpus Consortium) in China
2006 NII-SRC (National Institute of Informatics, Speech Resources Consortium) in Japan

Oriental COCOSDA: 

Proposed in 1994 to exchange ideas, share information, and discuss regional issues in spoken language processing (SLP).
Preparatory meeting held in Hong Kong in 1997.
Annual workshops held since 1998 in Japan, Taiwan, China, Korea, Thailand, Singapore, India, and Indonesia.

Necessity of Oriental COCOSDA: 

Asia is a multilingual region.
Its linguistic diversity is greater than that of Europe.
Speech research was emerging.
Speech corpora were required.
Cooperation among countries was necessary.
Organizations for speech corpora were needed.

Oriental COCOSDA Mission: 

To exchange ideas, share information, and discuss regional matters on the creation, utilization, and dissemination of spoken language corpora of oriental languages, and on assessment methods for speech input/output systems.
To promote speech research on oriental languages.

Goals of Oriental COCOSDA: 

Initiating a Speech Resources Consortium in each country.
Establishing an Asian network among the consortia.
Creating a multilingual corpus of semantically similar contents.

Strategies of Oriental COCOSDA: 

1. Foundation of Oriental COCOSDA (forum of speech corpora)
2. Establishment of regional consortia: GSK, SITEC, Chinese LDC, CCC, NII-SRC
3. Collaboration among the consortia

Oriental COCOSDA Organization: 

Convenor: Chiu-yu TSENG (2006-), S. ITAHASHI (1998-2005)
Advisory members: three, from China, Japan, and Korea
Committee members: 21 from 10 regions: China, Hong Kong, India, Indonesia, Japan, Korea, Mongolia, Singapore, Taiwan, Thailand

International Workshop on East-Asian Language Resources and Evaluation - Oriental COCOSDA WORKSHOP -: 

1998 1st Meeting, Tsukuba, Japan (30 papers, 54 participants)
1999 2nd Meeting, Taipei, Taiwan (44, 120)
2000 3rd Meeting, Beijing, China (8, 20)
2001 4th Meeting, Taejon, Korea (11, 25)
2002 5th Meeting, Hua Hin, Thailand (24, 96) + SNLP
2003 6th Meeting, Sentosa, Singapore (28, 60) + PACLIC
2004 7th Meeting, Delhi, India (55, 150) + iSTEPS, iSTRANS
2005 8th Meeting, Jakarta, Indonesia (24, 65)

Oriental COCOSDA Organizers: 

T. F. Zheng (China)
S. S. Agrawal (India)
Thanaruk T. (Thailand)
K. T. Lua (Singapore)
S. Itahashi (Japan)
L. S. Lee (Taiwan)
C. K. Chan (Hong Kong)
H. Riza (Indonesia)
Y-J Lee (Korea)

Participation: 

0. China, Japan, Korea, Taiwan (CJKTw), Hong Kong (HK)
1. CJKTw
2. CJKTw, Thailand (Th), France (F), U.S.A.
3. CJKTw, Th, Mongolia (Mg)
4. CJKTw, Th, Australia (Au)
5. CJKTw, Th, India (Id), Indonesia (Is), Guam
6. CJKTw, Th, Id, Is, Singapore (S)
7. CJKTw, Id, Is, S, Au, F, U.S.A.
8. CJKTw, Th, Is, Malaysia, Mg, HK

Some Regional Activities: 

Japan, Korea, China, Hong Kong, Mongolia, Singapore, Taiwan, Thailand, India, Indonesia

Japanese Activities: 

GSK: Language Resource Association
Launched in 1999
Reorganized as an NPO in 2003
Project accepted in 2005 for 3 years
Emphasis on written text corpora
NII-SRC launched in 2006 for speech corpora

Standardization in Japan: 

1) Open software tools: Julius, Galatea, etc.
2) Standard of Speech Synthesis System Performance Evaluation Methods, by JEITA (2003)
3) Standard of Symbols for Japanese Text-To-Speech Synthesizer, by JEIDA (2000)
JEITA: Japan Electronics and Information Technology Industries Association
JEIDA: Japan Electronic Industry Development Association

Korea: 

SITEC (Speech Information Technology & Industry Promotion Center)
Founded in 2001 as the Korean counterpart of LDC/ELRA
Hosted by Wonkwang University (7 full-time staff)

Chinese LDC: 

Launched in 2002
Creation of linguistic corpora
Management & distribution of language resources
Promotion of sharing language resources
* Chinese Corpus Consortium (CCC)

Future Prospects: Global Speech Corpus: 

Content common to all languages: digits, digit strings, days of the week, months, time, salutations, yes/no, well-known proper nouns (person names, cities, companies), well-known stories, phonetically balanced sentences, etc.

Utterance Content: 

Items widely understood in the world:
10 digits
12 months of the year
7 days of the week
4 words on weather
6 phrases of greetings
3 words of replies
4 words on time
“North Wind” from Aesop’s Fables
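
To make the size of the proposed prompt set concrete, the sketch below adds up the per-category counts listed above and estimates how many isolated-item recordings a collection would involve. The category keys, the function names, and the example numbers of languages and speakers are assumptions for illustration only; the slides give no speaker or language counts, and the Aesop story is left out because it is a passage rather than an isolated item.

```python
# Isolated-item counts per category, as listed on the slide.
ITEM_COUNTS = {
    "digits": 10,
    "months": 12,
    "days_of_week": 7,
    "weather_words": 4,
    "greeting_phrases": 6,
    "reply_words": 3,
    "time_words": 4,
}

def total_items(counts):
    """Number of isolated prompts one speaker reads in one language."""
    return sum(counts.values())

def recordings_needed(counts, n_languages, n_speakers):
    """Total isolated-item recordings for a hypothetical collection plan."""
    return total_items(counts) * n_languages * n_speakers

print(total_items(ITEM_COUNTS))
# Hypothetical plan: 2 languages, 3 speakers per language.
print(recordings_needed(ITEM_COUNTS, n_languages=2, n_speakers=3))
```

Because every language shares the same small item list, the recording effort grows only with the number of languages and speakers, which fits the goal of a multilingual corpus with semantically similar contents.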

Features of the proposed corpus: 

Contains various Asian languages
Same semantic content across languages
Recorded in a soundproof room

Future of Oriental COCOSDA: 

1. Collaboration among regional activities
2. Cooperative creation of speech corpora
3. Promotion of speech research in Asia
Future conference sites: Malaysia, Vietnam, Mongolia, Xinjiang Uygur Autonomous Region of China

Conclusion: 

1. Speech corpora are important for promoting speech research.
2. Organizations play a key role in speech corpus creation and distribution.
3. GSK, NII-SRC, SITEC, Chinese LDC, and CCC are expected to further speech corpus creation and distribution, together with Oriental COCOSDA, in East Asia.
http://www.slc.atr.jp/o-cocosda/

Oriental COCOSDA 2006: 

9-11 Dec. 2006
Universiti Sains Malaysia, Penang, Malaysia
Abstract submission: Aug. 5
Notification of acceptance: Aug. 26
Final manuscript: Sep. 30
http://www.usm.my/o-cocosda/