cldr overview

Presentation Transcript

CLDR 1.3: Overview and What’s New: 

George Rhoten (IBM), Mark Davis (IBM), Steven Loomis (IBM)

Agenda: 

- Background information
- What does CLDR contain?
- Samples of CLDR
- What is new?
- Future plans
- How does CLDR get updated?

Common Locale Data Repository: 

- Relatively new project: 2004
- Hosted by the Unicode Consortium: http://www.unicode.org/cldr/
- Goals:
  - Common, necessary software locale data for all world languages
  - Collect and maintain locale data
  - XML format for effective interchange
  - Freely available

Universal Character Encoding: 

Unicode: unique character codes for all languages …

Direct and Indirect Usage: 

Companies / Organizations:

Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…

Caveats:
- Not a complete list: usage is not tracked, so this is an estimate
- CLDR first became available in 2004; some organizations may use precursor data

What is Locale Data?: 

- Locale = identifier referring to linguistic and cultural preferences
  - en_US, en_GB, ja_JP
  - Unlike in POSIX, a locale does not refer to the data itself
- These preferences can change over time for cultural and political reasons
  - Introduction of new currencies, like the Euro
  - Changes to the standard sorting of Spanish
- Many of these preferences have varying degrees of standardization
  - 12-hour and 24-hour time formats in the United States
- This is a very broad topic
  - Scope of data is limited to common system applications
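
To make the idea of a locale as an identifier concrete, here is a minimal Python sketch that splits identifiers such as en_US into a language part and a territory part. The names `LocaleId` and `parse_locale_id` are hypothetical, and the sketch deliberately ignores script and variant subtags, so it is only an illustration and not how any CLDR implementation parses identifiers.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    """Hypothetical container for the two parts used on this slide."""
    language: str
    territory: Optional[str]


def parse_locale_id(identifier: str) -> LocaleId:
    """Split 'en_US' style identifiers; script and variant parts are not handled."""
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 and parts[1] else None
    return LocaleId(language, territory)


for ident in ("en_US", "en_GB", "ja_JP"):
    print(parse_locale_id(ident))
```

The identifier only names a set of preferences; the data for those preferences lives elsewhere and is looked up by this key.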

Types of Locale Data: 

- Date/time formats
- Number/currency formats
- Measurement system
- Collation specification
  - Sorting
  - Searching
  - Matching
- Translated names for language, territory, script, timezones, currencies, …
- Script and characters used by a language

Sample: Languages, Scripts, Territories in Danish: 

This data can be used for web site preferences.

    <localeDisplayNames>
      <languages>
        <language type='aa'>Afar</language>
        <language type='ab'>Abkhasisk</language>…
      <scripts>
        <script type='Arab'>Arabisk</script>…
      <territories>
        <territory type='AD'>Andorra</territory>
        <territory type='AE'>Forenede Arabiske Emirater</territory>…
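
As a rough illustration of how an application might read such display names, the sketch below parses a small fragment with Python's standard `xml.etree.ElementTree`. The fragment contains only the Danish values shown on the slide, with the elided parts closed off so that it is well-formed; the helper name `display_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment built from the values on the slide; the real file is larger.
SAMPLE = """
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>
  </languages>
  <scripts>
    <script type="Arab">Arabisk</script>
  </scripts>
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text: str, group: str, item: str) -> dict:
    """Map each code (the 'type' attribute) to its translated name."""
    root = ET.fromstring(xml_text.strip())
    return {
        el.get("type"): (el.text or "").strip()
        for el in root.findall(f"{group}/{item}")
    }


territories = display_names(SAMPLE, "territories", "territory")
print(territories["AE"])  # Forenede Arabiske Emirater
```

A web site could use a table like `territories` to fill a country picker in the user's language.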

Sample: Characters / Dates: 

    <characters>
      <exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
    </characters>…
    <dayContext type='format'>
      <dayWidth type='abbreviated'>
        <day type='sun'>søn</day>
        <day type='mon'>man</day>…
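
One possible use of the exemplar character set is to check whether a piece of text stays within the letters a language normally uses. The Python sketch below expands the Danish set from the slide. It handles only single characters and simple `a-z` style ranges separated by spaces, which is an assumption made for brevity and far less than the full set syntax; both function names are hypothetical.

```python
def expand_exemplars(pattern: str) -> set:
    """Expand a simple '[a-z æ å]' style set into individual characters."""
    body = pattern.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    chars = set()
    for token in body.split():
        if len(token) == 3 and token[1] == "-":
            # A range such as a-z: include every code point from start to end.
            chars.update(chr(c) for c in range(ord(token[0]), ord(token[2]) + 1))
        else:
            chars.update(token)
    return chars


def outside_exemplars(text: str, exemplars: set) -> list:
    """Return the letters in text that are not in the exemplar set."""
    return sorted({ch for ch in text.lower() if ch.isalpha() and ch not in exemplars})


DANISH = expand_exemplars("[a-z æ å ø á é í ó ú ý]")
print(outside_exemplars("søndag", DANISH))  # []
print(outside_exemplars("Straße", DANISH))  # ['ß']
```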

Sample: Timezones / Currencies: 

    <timeZoneNames>
      <zone type='America/Los_Angeles'>
        <long>
          <standard>Pacific-normaltid</standard>
          <daylight>Pacific-sommertid</daylight>
        </long>…
    <currencies>
      <currency type='GAF'>
        <displayName>Gabonesisk CFA-franc</displayName>
        <symbol>GAF</symbol>…

Sample: Collation: 

    <collation type='standard'>
      <settings caseFirst='upper' />
      <rules>
        <reset>D</reset>
        <s>đ</s>
        <t>Đ</t>
        <s>ð</s>
        <t>Ð</t>
        <reset>t</reset> …

Latest Release: CLDR 1.3: 

- Released: June 2, 2005
- 296 locales: 96 languages and 130 territories
- Data
  - Unique keys: 3,974
  - Actual values: 52,382
  - All data fields: 898,183 (not including collation, aliased data)

CLDR 1.3: 

- Complete POSIX-format data with POSIX conversion tool
- More timezone translations
- Data for UN M.49 regions, including continents and regions
- Addition of ISO 4217 currency code changeovers
- Additional number and data tests to verify CLDR implementations
- Mappings from language to script and territory
- Various other fixes, additions, and extensions
- Survey tool for improved collection of data: http://www.unicode.org/cgi-bin/cldr-survey (read-only to non-members)
- … and many other minor improvements and bug fixes

Next Release: CLDR 1.4: 

- 2005-05-31: Phase 1, Design
- 2005-08-31: Phase 2, Structure, Tools, Documentation
- 2005-09-30: Phase 2, Beta Release
- 2005-10-31: Phase 3, Data Incorporation & Vetting
- 2006-01-31: Phase 3, Beta Release
- 2006-03-31: CLDR 1.4 Released

Samples of Possible CLDR 1.4 Features: 

Data
- Enhance data for existing locales
- Verify coverage level
- Measurement unit names (e.g. metric vs. US)?
- Add European Ordering Rules to some locales
- Add data/structure to support lenient parsing, formatting; relative dates, etc.
- Enhance Indic sorting

Samples of Possible CLDR 1.4 Features (II): 

Structure
- Add structure / data for tracking priority and completeness
- Move weekend data & other country data to country info
- Improved alias structure to reduce data duplication
- Add locale-specific linebreak, transforms, etc.

Samples of Possible CLDR 1.4 Features (III): 

Tests & Tools
- Enhanced Survey tool for collecting/vetting data
- Enhanced consistency checking, more complete tests
- Improve the Java tool integration, documentation, testing

Actual feature set has not been determined yet!

Committee Process: 

- Designed for most effective participation from people around the world
- Meetings
  - By phone, never face to face
  - Short, frequent
  - Allows preparation between meetings
  - Resolves conflicts and new feature requests
- Written
  - Email
  - Bug database submissions

Vetting Process for Data: 

- Collect from different participating organizations, experts and submissions: new or revised
  - References to external sources strongly encouraged
  - Must be given before freeze date for release
  - Use CLDR Survey Tool
- Enter into the repository
  - Mark with draft attribute
  - Some may be entered as alternates
  - Differences resolved by CLDR committee

Vetting Process (II): 

- Vet by CLDR committee members
  - Consulting with country contacts
  - If disagreement, decide in committee
- Accept
  - As main form: draft attribute removed
  - As alternate form: marked with different attributes
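
The draft-marking steps described in the vetting slides can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the fragment, the choice of which entry is still a draft, the attribute spelling `draft="true"`, and the helper names `draft_items` and `accept_as_main`. The slides only say that new data is marked with a draft attribute and that the attribute is removed once the data is accepted as the main form.

```python
import xml.etree.ElementTree as ET

# Hypothetical submission: one vetted entry and one still marked as draft.
SUBMITTED = """
<territories>
  <territory type="AD">Andorra</territory>
  <territory type="AE" draft="true">Forenede Arabiske Emirater</territory>
</territories>
"""


def draft_items(root: ET.Element) -> list:
    """Collect every element that still carries the draft marker."""
    return [el for el in root.iter() if el.get("draft") == "true"]


def accept_as_main(element: ET.Element) -> None:
    """Accepting as the main form means dropping the draft attribute."""
    element.attrib.pop("draft", None)


root = ET.fromstring(SUBMITTED.strip())
pending = draft_items(root)
print([el.get("type") for el in pending])  # ['AE']

for el in pending:
    accept_as_main(el)
print(len(draft_items(root)))  # 0
```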

Causes of Conflicting Data: 

- Typographical errors
  - Canda instead of Canada
- Regional differences
  - German spelling is different between countries
- Context of usage
  - Normal German sorting versus German phonebook sorting
- Parts of speech
  - 'март 2004' versus '3 марта' when the Russian word for March is used in a date
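
The Russian example shows why a single translation of a month name is not enough: the form depends on whether the month stands alone or follows a day number. A minimal Python sketch of that distinction follows. The context key `format` mirrors the `dayContext type='format'` attribute seen in the earlier sample, while `stand-alone` and the function names are assumptions introduced here for illustration.

```python
# Two forms of "March" in Russian, keyed by the context in which they appear.
MARCH_RU = {
    "stand-alone": "март",  # used with a year only
    "format": "марта",      # used after a day number
}


def month_and_year(year: int) -> str:
    """Month name on its own, followed by the year."""
    return f"{MARCH_RU['stand-alone']} {year}"


def day_and_month(day: int) -> str:
    """Day number followed by the month name."""
    return f"{day} {MARCH_RU['format']}"


print(month_and_year(2004))  # март 2004
print(day_and_month(3))      # 3 марта
```

Two contributors who each submit one of these forms are both right, so the conflict has to be resolved by recording the context rather than by choosing a winner.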

Causes of Conflicting Data (II): 

- Standards versus common use
  - 'Republic of Laos' versus 'Laos'
- Misunderstanding
  - Translating year format 'yyyy' as 'jjjj' instead of changing localized pattern characters
- Uncommon cases
  - Translating the 'Interlingua' language name into other languages
- Individual preferences
  - 24 hour time format versus 12 hour time format
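
The 'yyyy' versus 'jjjj' misunderstanding is easier to see with a toy formatter. The Python sketch below recognizes only three pattern fields (`yyyy`, `MM`, `dd`) and treats every other run of letters as an error. It is a simplified stand-in written for this note, with hypothetical names, and it does not handle quoted literals or any other fields that a real date pattern may contain.

```python
import datetime
import re

# Only the pattern fields needed for this illustration.
FIELDS = {
    "yyyy": lambda d: f"{d.year:04d}",
    "MM": lambda d: f"{d.month:02d}",
    "dd": lambda d: f"{d.day:02d}",
}

# A field is a run of one repeated ASCII letter, e.g. 'yyyy' or 'MM'.
TOKEN = re.compile(r"([A-Za-z])\1*")


def format_date(pattern: str, date: datetime.date) -> str:
    """Replace each known field in the pattern; reject unknown fields."""
    def replace(match):
        token = match.group(0)
        if token not in FIELDS:
            raise ValueError(f"unknown pattern field: {token!r}")
        return FIELDS[token](date)
    return TOKEN.sub(replace, pattern)


release = datetime.date(2005, 6, 2)
print(format_date("dd.MM.yyyy", release))  # 02.06.2005

try:
    format_date("dd.MM.jjjj", release)
except ValueError as err:
    print(err)
```

The pattern letters are codes read by software, not words shown to the user, which is why a "translated" pattern such as 'jjjj' fails here.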

Challenges: 

- Complex formats
  - Experts knowledgeable both in technology and a specific language
    - Collation
    - Exemplar characters
    - Etc…
- Require close interaction of CLDR experts with language experts

Getting Involved: 

- Simplest: anyone!
  - Use CLDR
  - Bug report / feature request
- More involved
  - Vetting, Assessment, Tools, Policies, Decisions, …
  - Any Unicode member is eligible to name representatives, including country liaison members

Example Country Process (Finland): 

- Finnish Ministry of Education made CLDR data a major goal, 2004-06
- Research Institute for the Languages of Finland ('RILF' aka 'Kotus') designated agency
- Documenting the national preferences in the open more important than the implementation mechanism
- Results expected to lead to new/revised national standards

Example Country Process (II): 

- RILF a Unicode Liaison member, 2004-07
- Set up fully open national group on language and cultural requirements on ICT, 2004-09
- Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
- Over 30 different parties represented: commercial, non-commercial, individuals
- Public comments to be allowed: http://www.kotoistus.fi/
- Documentation for all controversial issues and deviations from any national standards

For More Information: 

- Unicode: http://www.unicode.org/
- CLDR: http://www.unicode.org/cldr/
- This presentation: http://www.unicode.org/cldr/data/docs/presentations/cldr_overview.ppt