CLDR 1.3: Overview and What’s New: George Rhoten (IBM), Mark Davis (IBM), Steven Loomis (IBM)
Agenda: Background Information
What does CLDR contain?
Samples of CLDR
What is new?
Future plans
How does CLDR get updated?
Common Locale Data Repository: Relatively new project: 2004
Hosted by Unicode Consortium
http://www.unicode.org/cldr/
Goals:
Common, necessary software locale data for all world languages
Collect and maintain locale data
XML format for effective interchange
Freely available
Universal Character Encoding: Unicode: Unique character codes for all languages …
Direct and Indirect Usage: Companies / Organizations
Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…
Caveats
Not a complete list: usage is not tracked, so this is an estimate
CLDR was first available in 2004; some may use precursor data
What is Locale Data?: Locale = identifier referring to linguistic and cultural preferences
en_US, en_GB, ja_JP
Unlike in POSIX, a locale here is an identifier and does not refer to the data itself
These preferences can change over time due to cultural and political reasons
Introduction of new currencies, like the Euro
Standard sorting of Spanish changes
Many of these preferences have varying degrees of standardization
12 and 24 hour format in the United States
This is a very broad topic
Scope of data limited to common system applications
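As a minimal sketch of the "locale = identifier" idea, the snippet below splits identifiers such as en_US into a language part and a territory part. The `LocaleId` type and `parse_locale_id` function are hypothetical names introduced for illustration, and the sketch assumes only the two-part `language_TERRITORY` form shown above (script and variant parts are ignored).

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    """Hypothetical container for the two parts of a simple locale identifier."""
    language: str
    territory: Optional[str]


def parse_locale_id(identifier: str) -> LocaleId:
    """Split an identifier such as 'en_US' into language and territory.

    Assumption: only 'language' or 'language_TERRITORY' forms are handled.
    """
    parts = identifier.split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 else None
    return LocaleId(language, territory)


for ident in ("en_US", "en_GB", "ja_JP"):
    print(ident, "->", parse_locale_id(ident))
```

The identifier itself carries no data; it is only a key that an application would use to look up the preferences described on the following slides.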
Types of Locale Data: Date/time formats
Number/Currency formats
Measurement System
Collation Specification
Sorting
Searching
Matching
Translated names for language, territory, script, timezones, currencies,…
Script and characters used by a language
Sample: Languages, Scripts, Territories in Danish: This data can be used for web site preferences
<localeDisplayNames>
<languages>
<language type='aa'>Afar</language>
<language type='ab'>Abkhasisk</language>…
<scripts>
<script type='Arab'>Arabisk</script>…
<territories>
<territory type='AD'>Andorra</territory>
<territory type='AE'>Forenede Arabiske Emirater</territory>…
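The following is a rough sketch of how an application might read such display names, for example to fill a country drop-down on a web site. It uses Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is an assumption: it repeats the Danish fragment above with the elided closing tags filled in so that it parses, and `display_names` is a hypothetical helper, not part of any CLDR tool.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment from the slide, with closing tags added so it is well-formed.
SAMPLE = """
<localeDisplayNames>
  <languages>
    <language type='aa'>Afar</language>
    <language type='ab'>Abkhasisk</language>
  </languages>
  <scripts>
    <script type='Arab'>Arabisk</script>
  </scripts>
  <territories>
    <territory type='AD'>Andorra</territory>
    <territory type='AE'>Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text: str, group: str, item: str) -> dict:
    """Map each code (the 'type' attribute) to its translated name."""
    root = ET.fromstring(xml_text.strip())
    return {
        element.get("type"): (element.text or "").strip()
        for element in root.findall(f"{group}/{item}")
    }


territories = display_names(SAMPLE, "territories", "territory")
for code, name in sorted(territories.items()):
    print(code, name)
```

The same helper can be pointed at `languages/language` or `scripts/script`, since all three groups share the code-to-name shape.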
Sample: Characters / Dates:
<characters>
<exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
</characters>…
<dayContext type='format'>
<dayWidth type='abbreviated'>
<day type='sun'>søn</day>
<day type='mon'>man</day>…
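One possible use of the exemplar characters is checking whether a piece of text stays within the letters a language normally uses. The sketch below expands the Danish set shown above into individual characters. It is deliberately limited: it assumes the set contains only space-separated single characters and simple `x-y` ranges, which is much narrower than the full set syntax, and `expand_exemplars` and `uses_only` are hypothetical helper names.

```python
from typing import Set


def expand_exemplars(pattern: str) -> Set[str]:
    """Expand a bracketed set such as '[a-z æ å ø]' into its characters.

    Assumption: tokens are single characters or simple 'x-y' ranges,
    separated by spaces. Anything else is rejected.
    """
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed set")
    chars: Set[str] = set()
    for token in body[1:-1].split():
        if len(token) == 3 and token[1] == "-":
            start, end = ord(token[0]), ord(token[2])
            chars.update(chr(cp) for cp in range(start, end + 1))
        elif len(token) == 1:
            chars.add(token)
        else:
            raise ValueError(f"unsupported token: {token!r}")
    return chars


def uses_only(text: str, exemplars: Set[str]) -> bool:
    """True if every letter in text (compared in lowercase) is in the set."""
    return all(ch.lower() in exemplars for ch in text if ch.isalpha())


DANISH = expand_exemplars("[a-z æ å ø á é í ó ú ý]")
print(uses_only("Færøerne", DANISH))
print(uses_only("Zürich", DANISH))
```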
Sample: Timezones / Currencies:
<timeZoneNames>
<zone type='America/Los_Angeles'>
<long>
<standard>Pacific-normaltid</standard>
<daylight>Pacific-sommertid</daylight>
</long>…
<currencies>
<currency type='GAF'>
<displayName>Gabonesisk CFA-franc</displayName>
<symbol>GAF</symbol>…
Sample: Collation:
<collation type='standard'>
<settings caseFirst='upper' />
<rules>
<reset>D</reset>
<s>đ</s>
<t>Đ</t>
<s>ð</s>
<t>Ð</t>
<reset>t</reset> …
Latest Release: CLDR 1.3: Released June 2, 2005
296 locales: 96 languages and 130 territories
Data
Unique keys: 3,974
Actual Values: 52,382
All data fields: 898,183
(not including collation, aliased data)
CLDR 1.3: Complete POSIX-format data with POSIX conversion tool
More timezone translations
Data for UN M.49 regions, including continents and regions
Addition of ISO 4217 currency code changeovers
Additional number and data tests to verify CLDR implementations
Mappings from language to script and territory
Various other fixes, additions, and extensions
Survey tool for improved collection of data: http://www.unicode.org/cgi-bin/cldr-survey (read-only for non-members)
… and many other minor improvements and bug fixes
Next Release: CLDR 1.4: 2005-05-31 Phase 1
Design
2005-08-31 Phase 2
Structure, Tools, Documentation
2005-09-30 Phase 2 Beta Release
2005-10-31 Phase 3
Data Incorporation & Vetting
2006-01-31 Phase 3 Beta Release
2006-03-31 CLDR 1.4 Released
Samples of Possible CLDR 1.4 Features: Data
Enhance data for existing locales
Verify coverage level
Measurement unit names (e.g., metric vs. US)?
Add European Ordering rules to some locales
Add data/structure to support lenient parsing, formatting; relative dates, etc.
Enhance Indic sorting
Samples of Possible CLDR 1.4 Features (II): Structure
Add structure / data for tracking priority and completeness
Move weekend data & other country data to country info
Improved alias structure to reduce data duplication
Add locale specific linebreak, transforms, etc.
Samples of Possible CLDR 1.4 Features (III): Tests & Tools
Enhanced Survey tool for collecting/vetting data
Enhanced consistency checking, more complete tests
Improve the Java tool integration, documentation, testing
Actual feature set has not been determined yet!
Committee Process: Designed for most effective participation from people around the world
Meetings
By phone, never face to face
Short, frequent
Allows preparation between meetings
Resolves conflicts and new feature requests
Written
Email
Bug database submissions
Vetting Process for Data: Collect from different participating organizations, experts and submissions: new or revised
References to external sources strongly encouraged
Must be given before freeze date for release
Use CLDR Survey Tool
Enter into the repository
Mark with draft attribute
Some may be entered as alternates
Differences resolved by CLDR committee
Vetting Process (II): Vet by CLDR committee members
Consulting with country contacts
If disagreement, decide in committee
Accept
As main form: draft attribute removed
As alternate form: marked with different attributes
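To make the role of the draft attribute concrete, here is a small sketch that separates accepted entries from entries still awaiting vetting. The XML fragment is hypothetical: the element names follow the Danish sample earlier in the deck, but which entry carries the attribute, and the value `'true'`, are assumptions made for illustration. `split_by_draft` is likewise a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the draft marking on 'AE' is invented for this example.
SAMPLE = """
<territories>
  <territory type='AD'>Andorra</territory>
  <territory type='AE' draft='true'>Forenede Arabiske Emirater</territory>
</territories>
"""


def split_by_draft(xml_text: str):
    """Return (accepted, pending) dictionaries keyed by the 'type' attribute.

    An entry counts as pending if it carries any draft attribute.
    """
    root = ET.fromstring(xml_text.strip())
    accepted, pending = {}, {}
    for element in root:
        target = pending if element.get("draft") is not None else accepted
        target[element.get("type")] = element.text
    return accepted, pending


accepted, pending = split_by_draft(SAMPLE)
print("accepted:", accepted)
print("pending: ", pending)
```

Once the committee accepts an entry as the main form, removing the attribute would move it from the pending group to the accepted group without any other change to the data.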
Causes of Conflicting Data: Typographical errors
Canda instead of Canada
Regional differences
German spelling is different between countries
Context of usage
Normal German sorting versus German phonebook sorting
Parts of speech
'март 2004' versus '3 марта' when the Russian word for March is used in a date
Causes of Conflicting Data (II): Standards versus common use
'Republic of Laos' versus 'Laos'
Misunderstanding
Translating year format 'yyyy' as 'jjjj' instead of changing localized pattern characters
Uncommon cases
Translating the 'Interlingua' language name into other languages
Individual preferences
24 hour time format versus 12 hour time format
Challenges: Complex Formats
Experts knowledgeable both in technology and a specific language
Collation
Exemplar characters
Etc…
Require close interaction of CLDR experts with language experts
Getting Involved: Simplest – anyone!
Use CLDR
Bug report / feature request
More Involved
Vetting, Assessment, Tools, Policies, Decisions, …
Any Unicode member, including country liaison members, is eligible to name representatives
Example Country Process (Finland): Finnish Ministry of Education made CLDR data a major goal, 2004-06
Research Institute for the Languages of Finland ('RILF' aka 'Kotus') designated agency
Documenting the national preferences in the open more important than the implementation mechanism
Results expected to lead to new/revised national standards
Example Country Process (II): RILF a Unicode Liaison member, 2004-07
Set up fully open national group on language and cultural requirements on ICT, 2004-09
Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
Over 30 different parties represented: commercial, non-commercial, individuals
Public comments to be allowed: http://www.kotoistus.fi/
Documentation for all controversial issues and deviations from any national standards
For More Information: Unicode
http://www.unicode.org/
CLDR
http://www.unicode.org/cldr/
This presentation
http://www.unicode.org/cldr/data/docs/presentations/cldr_overview.ppt