U William


Myanmar Unicode Implementation Standards: 

William W.L.K (Contribution Member, Myanmar Unicode and NLP Lab)
Thursday, May 04, 2006
The 5th Myanmar ICT Week 2006

Overview: 

Myanmar Unicode Encoding Standard
  Simplified, Standardized, and Finalized
Myanmar Unicode Implementations
  Rendering Technologies (SIL Graphite, m17n, Uniscribe, Pango, ICU)
  Applied, Tested, Enabled
Font Technologies
  OTF, TTF, Pseudo Unicode
  Developed, Tested, Applied, Debugged
Are we there yet? Simply, NO!

Myanmar Unicode Encoding Standard: 

Current standard: Unicode 4.1
Accepted new standard: ISO/IEC JTC1/SC2/WG2 N3043

Myanmar Unicode Implementations: 

Localization/Rendering Technologies and Standards
  SIL Graphite
  m17n
  ICU
  Pango
  OTF Support

Slide5: 

[Diagram: rendering stacks built on Pango and its modules (M17N, Graphite, OTF), together with the M17N libraries, the SCIM input method, an application hack for Graphite, and Pango-supported or Graphite-enabled applications.]

Rendering Technologies and Standards: 

PANGO
M17N (Multilingualization)
  Font Layout Table (.flt)
  Supports many Asian scripts
  Pango module
  Our Japanese friends (Dr. Handa and Dr. Takahashi)
Graphite
  Graphite Description Language (.gdl)
  Supports many non-Roman scripts
  Works well on Microsoft platforms
  Pango hacks on the way
  Our English friends (Mr. Martin Hosken and Mr. Keith Stribley)
Myanmar OTF Pango Module
  By our Myanmar friend U Tin Myo Htet
ICU (International Components for Unicode)
  Rendering for Myanmar already done!
  Java and C++ ready!
  Largely used by OpenOffice
  We use it for "Collation".
  Our Spanish friend (Dr. Javier Sola, living in Cambodia)

Myanmar1 OTF Font by the Lab: 

Myanmar1 OTF Font by the Lab

What are the things you do most in Data Processing?: 

Sorting (Collation)
Searching (Tokenization)

Collation: 

The process of ordering units of textual information.
Collation is usually specific to a particular language.
Also known as alphabetizing or alphabetic sorting.
The general term for the process and function of determining the sorting order of strings of characters.
The culturally expected ordering of linguistic characters in a particular language.

Collation is not uniform!: 

Collation is not uniform!

So what?: 

How can you build a Myanmar Collation Algorithm?
Canonical Encoding Order for Myanmar (Unicode 4: Table 10-3)
  http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf
The "generic" Unicode Collation Algorithm
  http://www.unicode.org/unicode/reports/tr10/
  Too generic for a script as complex as Myanmar
We need a "tailored" Unicode Collation Algorithm for Myanmar. (We need custom rules!)

Myanmar Collation Algorithm: 

Myanmar collation can be split into a 5-stage process.
A generic Myanmar syllable can be encoded as: <consonant> <medial> <vowel> <final> <tone>
This is sorted in the order:
1. <consonant>
2. <medial>
3. <final>
4. <vowel>
5. <tone>
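
The difference between encoding order and sorting order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Syllable type, the Latin placeholder values, and the rank lists below are hypothetical and are not the Lab's actual tables. It assumes the text has already been split into syllable components.

```python
from typing import NamedTuple

class Syllable(NamedTuple):
    # Fields are listed in ENCODING order: consonant, medial, vowel, final, tone.
    consonant: str
    medial: str = ""
    vowel: str = ""
    final: str = ""
    tone: str = ""

# Hypothetical rank lists using Latin placeholders (NOT real Myanmar weights).
# The empty string stands for "component absent" and sorts first.
CONSONANTS = ["k", "kh", "g", "m"]
MEDIALS = ["", "y", "r", "w", "h"]
FINALS = ["", "k", "n", "m"]
VOWELS = ["", "a", "i", "u"]
TONES = ["", "low", "high"]

def collation_key(s: Syllable) -> tuple:
    """Build a sort key in COLLATION order: consonant, medial, final, vowel, tone."""
    return (
        CONSONANTS.index(s.consonant),
        MEDIALS.index(s.medial),
        FINALS.index(s.final),
        VOWELS.index(s.vowel),
        TONES.index(s.tone),
    )

def word_key(word: list) -> tuple:
    """A word is a list of syllables; compare syllable by syllable."""
    return tuple(collation_key(s) for s in word)

kan = Syllable("k", vowel="a", final="n")
ki = Syllable("k", vowel="i")

by_encoding = sorted([kan, ki])                      # compares vowel before final
by_collation = sorted([kan, ki], key=collation_key)  # compares final before vowel
```

Sorting by the raw tuple (encoding order) and sorting by collation_key give different results for the two sample syllables, because the final is compared before the vowel in the second case.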

Consonant: 

Consonant

Medial: 

Medial

Finals: 

Finals are consonants followed by U+1039 (virama).
If the virama is visible, a U+200C follows it!
If the U+200C is omitted, the following consonant is stacked underneath the final.
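
As a rough illustration of the rule above, the following Python sketch scans a string for consonant + U+1039 pairs and reports whether each final has a visible virama or a stacked consonant. The function names are hypothetical, and the consonant range U+1000 to U+1021 is an assumption about the encoding model used in this talk.

```python
VIRAMA = "\u1039"
ZWNJ = "\u200c"

def is_consonant(ch: str) -> bool:
    # Assumption: Myanmar consonants occupy U+1000..U+1021.
    return "\u1000" <= ch <= "\u1021"

def find_finals(text: str) -> list:
    """Return (index, consonant, kind) for every consonant followed by U+1039."""
    results = []
    for i in range(len(text) - 1):
        if is_consonant(text[i]) and text[i + 1] == VIRAMA:
            following = text[i + 2] if i + 2 < len(text) else ""
            if following == ZWNJ:
                kind = "visible"   # virama is displayed
            elif is_consonant(following):
                kind = "stacked"   # next consonant sits underneath
            else:
                kind = "unknown"   # nothing usable follows the virama
            results.append((i, text[i], kind))
    return results

visible_example = find_finals("\u1000\u1004\u1039\u200c")
stacked_example = find_finals("\u1019\u1014\u1039\u1010")
```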

Vowels: 

Vowels

Tones: 

Tones

Other Issues: 

Independent Vowels

Other Issues: 

Contractions

Other Issues: 

Short Forms

Other Issues: 

Myanmar Symbols (Various Signs)
Myanmar Punctuation
Myanmar Digits

Examples: 

Examples

Examples: 

Examples

Myanmar Collation Algorithm Implementation: 

ICU (International Components for Unicode)
  ICU-compatible locale by Keith
  Used in OO (OpenOffice), can test-sort!
glibc
  Myanmar locale (by Myanmar NLP) and collation (by Keith)
  my_MM, used by GTK applications on Linux
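
On a Linux system where the glibc my_MM locale is installed, locale-aware sorting can be tried from Python's standard locale module. The sketch below is an illustration only: the helper name sort_with_locale and the locale string "my_MM.UTF-8" are assumptions, and the code falls back to plain code point order when the locale is unavailable.

```python
import locale

def sort_with_locale(words, name="my_MM.UTF-8"):
    """Sort words with the named collation locale.

    Returns (sorted_words, used_locale). If the locale is not installed,
    falls back to code point order and reports used_locale=False.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return sorted(words), False
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm), True
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the setting

words = ["\u1019\u102c", "\u1000\u102c", "\u101c\u102c"]
result, used_locale = sort_with_locale(words)
```

Checking used_locale tells you whether the result reflects the locale's collation rules or only the fallback ordering.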

Sorting in action (in OO): 

Sorting in action (in OO)

ICU challenges "Collation".: 

ICU challenges "Collation".

Searching: 

Tokenizing Myanmar
Tokenizing refers to the process of parsing a string and splitting it into different segments, or tokens.
  Useful for searching: allows keyword indexes to be built
  May also be applicable for identifying syllables for line-breaking purposes
Traditional space-based tokenizing does not work well with Myanmar.

Searching: 

Syllable-based tokenizing using a pair comparison
Step 1: Assign classes to each Myanmar code point.
Step 2: Analyze a potential break point by comparing the classes of the code points before and after it. In many cases this is enough to determine the break status.
Step 3: In a few cases, more context-sensitive analysis is required.
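
The three steps can be sketched as follows. This is a deliberately simplified illustration, not the Lab's actual class table or break rules: it assumes only four classes, allows a break only before a consonant that does not follow U+1039, and uses one lookahead rule (a consonant followed by U+1039 U+200C is treated as a final of the current syllable). Stacked finals and other cases would need more rules.

```python
def char_class(ch: str) -> str:
    """Step 1: assign a class to a code point (simplified, assumed classes)."""
    if "\u1000" <= ch <= "\u1021":
        return "C"  # consonant
    if ch == "\u1039":
        return "V"  # virama
    if ch == "\u200c":
        return "Z"  # zero width non-joiner
    return "O"      # everything else (vowel signs, tones, ...)

def break_before(text: str, i: int) -> bool:
    """Decide whether a syllable break falls between text[i-1] and text[i]."""
    before, after = char_class(text[i - 1]), char_class(text[i])
    # Step 2: pair comparison. Only a consonant can start a syllable,
    # and not when it directly follows a virama (stacked or medial form).
    if after != "C" or before == "V":
        return False
    # Step 3: context. A consonant followed by virama + ZWNJ is a visible
    # final that belongs to the current syllable, so do not break before it.
    if text[i + 1:i + 3] == "\u1039\u200c":
        return False
    return True

def syllables(text: str) -> list:
    """Split text into syllables using the rules above."""
    out, start = [], 0
    for i in range(1, len(text)):
        if break_before(text, i):
            out.append(text[start:i])
            start = i
    if text:
        out.append(text[start:])
    return out

two = syllables("\u1019\u102c\u101c\u102c")
one = syllables("\u1019\u1014\u1039\u200c")
```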

Searching: 

Details of tokenizing: you don't want to know, trust me.

Searching: 

Applications for the tokenizing algorithm:
  Syllable-based line-breaking algorithm
  Indexing text using a search engine library, e.g. Apache Lucene (Java & C++)
  Processing text into syllables for lexicon analysis, e.g. Machine Translation (MT engine)
  Checking for encoding errors: the pair algorithm can be used to detect invalid sequences and duplicate codes
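
As a small illustration of the last application, a pair comparison can flag some suspicious sequences. The two checks below are assumptions chosen for the sketch (a dependent sign in U+102C to U+1039 repeated back to back, and a U+1039 that does not follow a consonant); a real validator would need the full class table.

```python
VIRAMA = "\u1039"

def is_consonant(ch: str) -> bool:
    # Assumption: consonants occupy U+1000..U+1021.
    return "\u1000" <= ch <= "\u1021"

def is_dependent_sign(ch: str) -> bool:
    # Assumption: dependent vowel signs and other signs sit in U+102C..U+1039.
    return "\u102c" <= ch <= "\u1039"

def find_encoding_errors(text: str) -> list:
    """Return (index, reason) for each suspicious code point pair."""
    errors = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        if is_dependent_sign(ch) and ch == prev:
            errors.append((i, "duplicate"))
        elif ch == VIRAMA and not is_consonant(prev):
            errors.append((i, "virama without consonant"))
    return errors

duplicate_case = find_encoding_errors("\u1019\u102c\u102c")
virama_case = find_encoding_errors("\u1039\u1000")
```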

References: 

http://www.thanlwinsoft.org/ by Keith Stribley
http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Sorting/MyanmarCollation.pdf
http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf
http://www.unicode.org/notes/tn11/
http://www.unicode.org/unicode/reports/tr10/
http://www.unicode.org/faq/collation.html
http://icu.sourceforge.net/userguide/Collate_Intro.html
http://en.wikipedia.org/wiki/Locale
http://www.mcf.org.mm/unicode/
http://download.microsoft.com/download/2/d/a/2daed6fd-9876-4894-92c2-4ffc51ce5c1a/collationintro-current.ppt
http://www.microsoft.com/typography/developers/uniscribe/
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3043.pdf
http://www.unicode.org/charts/PDF/U1000.pdf

Thanks! william.wlk@gmail.com: 

Any questions?