U William Gabir
Added: December 29, 2007. License: All Rights Reserved.

Presentation Transcript

Myanmar Unicode Implementation Standards:
  William W.L.K (Contribution Member, Myanmar Unicode and NLP Lab)
  Thursday, May 04, 2006
  The 5th Myanmar ICT Week 2006

Overview:
  Myanmar Unicode Encoding Standard
    Simplified, Standardized, and Finalized
  Myanmar Unicode Implementations
    Rendering Technologies (SIL Graphite, m17n, Uniscribe, Pango, ICU)
    Applied, Tested, Enabled
  Font Technologies
    OTF, TTF, Pseudo Unicode
    Developed, Tested, Applied, Debugged
  Are we there yet?
  Simply, NO!

Myanmar Unicode Encoding Standard:
  Current standard: Unicode 4.1
  Accepted new standard: ISO/IEC JTC1/SC2/WG2 N3043

Myanmar Unicode Implementations:
  Localization/Rendering Technologies and Standards
    SIL Graphite
    m17n
    ICU
    Pango OTF Support

Slide 5:
  [Diagram: rendering stacks for Pango-supported and Graphite-enabled applications with the SCIM input method, showing the M17N, Graphite, and OTF Pango modules]

Rendering Technologies and Standards:
  Pango
  M17N (Multilingualization)
    Font Layout Table (.flt)
    Supports many Asian scripts
    Pango module
    By our Japanese friends (Dr. Handa and Dr. Takahashi)
  Graphite
    Graphite Description Language (.gdl)
    Supports many non-Roman scripts
    Works well on Microsoft platforms
    Pango hacks on the way
    By our English friends (Mr. Martin Hosken and Mr. Keith Stribley)
  Myanmar OTF Pango Module
    By our Myanmar friend U Tin Myo Htet
  ICU (International Components for Unicode)
    Rendering for Myanmar already done!
    Java and C++ ready!
    Largely used by OpenOffice
    We use it for "Collation".
    By our Spanish friend (Dr. Javier Sola, living in Cambodia)

Myanmar1 OTF Font by the Lab

What are the things you do most in Data Processing?:
  Sorting (Collation)
  Searching (Tokenization)

Collation:
  The process of ordering units of textual information.
  Collation is usually specific to a particular language.
  Also known as alphabetizing or alphabetic sorting.
  The general term for the process and function of determining the sorting order of strings of characters.
  The culturally expected ordering of linguistic characters in a particular language.

Collation is not uniform!

So what?:
  How can you build a Myanmar Collation Algorithm?
  Canonical Encoding Order for Myanmar (Unicode 4: Table 10-3)
    http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf
  The "generic" Unicode Collation Algorithm
    http://www.unicode.org/unicode/reports/tr10/
  Too generic for a script as complex as Myanmar
  We need a "tailored" Unicode Collation Algorithm for Myanmar. (We need custom rules!)

Myanmar Collation Algorithm:
  Myanmar collation can be split into a 5-stage process.
  A generic Myanmar syllable can be encoded as:
    <consonant> <medial> <vowel> <final> <tone>
  This is sorted in the order:
    1. <consonant>
    2. <medial>
    3. <final>
    4. <vowel>
    5. <tone>

Consonant

Medial

Finals:
  Consonants, followed by U+1039
  If the virama is visible, a U+200C follows!
  If it is omitted, the following consonant is stacked underneath the final.

Vowels

Tones

Other Issues:
  Independent Vowels

Other Issues:
  Contractions

Other Issues:
  Short Forms

Other Issues:
  Myanmar Symbols (Various Signs)
  Myanmar Punctuation
  Myanmar Digits

Examples

Examples

Myanmar Collation Algorithm Implementation:
  ICU (International Components for Unicode)
    ICU-compatible locale by Keith
    Used in OO (OpenOffice), can test-sort!
  glibc Myanmar Locale (by Myanmar NLP) and Collation (by Keith)
    my_MM, used by GTK applications on Linux

Sorting in action (in OO)

ICU challenges "Collation".

Searching:
  Tokenizing Myanmar
  Tokenizing refers to the process of parsing a string and splitting it into different segments or tokens.
  Useful for searching: allows keyword indexes to be built for searching.
  May also be applicable for identifying syllables for line-breaking purposes.
  Traditional space-based tokenizing does not work well with Myanmar.

Searching:
  Syllable-based tokenizing using a pair comparison
  Step 1: Assign classes for each Myanmar code point.
  Step 2: Analyze a potential break point by comparing the class of the code point before and after. In many cases this is enough to determine the break status.
  Step 3: In a few cases more context-sensitive analysis is required.

Searching:
  Details of Tokenizing
  You don't want to know. Trust me.

Searching:
  Applications for the tokenizing algorithm
  Syllable-based line-breaking algorithm
  Indexing text using a search engine library, e.g. Apache Lucene (Java & C++)
  Processing text into syllables for lexicon analysis, e.g. Machine Translation (MT Engine)
  Checking for encoding errors: the pair algorithm can be used to detect invalid sequences and duplicate codes.

References:
  http://www.thanlwinsoft.org/ by Keith Stribley
  http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Sorting/MyanmarCollation.pdf
  http://www.unicode.org/versions/Unicode4.0.0/ch10.pdf
  http://www.unicode.org/notes/tn11/
  http://www.unicode.org/unicode/reports/tr10/
  http://www.unicode.org/faq/collation.html
  http://icu.sourceforge.net/userguide/Collate_Intro.html
  http://en.wikipedia.org/wiki/Locale
  http://www.mcf.org.mm/unicode/
  http://download.microsoft.com/download/2/d/a/2daed6fd-9876-4894-92c2-4ffc51ce5c1a/collationintro-current.ppt
  http://www.microsoft.com/typography/developers/uniscribe/
  http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3043.pdf
  http://www.unicode.org/charts/PDF/U1000.pdf

Thanks!
  william.wlk@gmail.com
  Any questions?
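
Illustration for the Myanmar Collation Algorithm slide: a minimal Python sketch of how the encoding order (consonant, medial, vowel, final, tone) can be rearranged into the sorting order (consonant, medial, final, vowel, tone). The Syllable type, the integer weights (0 meaning "component absent"), and the per-syllable comparison are assumptions made for illustration. They are not the ICU or glibc rules mentioned in the slides, and the slides do not say whether the five stages apply per syllable or across a whole string.

```python
from typing import NamedTuple, Tuple


class Syllable(NamedTuple):
    """One syllable, with fields in ENCODING order.

    Each field holds a hypothetical integer weight; 0 means the component
    is absent. Real weights would come from a tailored collation table.
    """
    consonant: int
    medial: int
    vowel: int
    final: int
    tone: int


def collation_key(s: Syllable) -> Tuple[int, int, int, int, int]:
    """Reorder the components into the SORTING order from the slide:
    consonant, medial, final, vowel, tone."""
    return (s.consonant, s.medial, s.final, s.vowel, s.tone)


# Two syllables with the same consonant and no medial.
a = Syllable(consonant=1, medial=0, vowel=2, final=0, tone=0)  # open syllable
b = Syllable(consonant=1, medial=0, vowel=1, final=3, tone=0)  # has a final

# Plain tuple comparison follows encoding order, so the vowel decides first.
encoding_order = sorted([a, b])

# The collation key compares the final before the vowel.
collated = sorted([a, b], key=collation_key)
```

With these sample weights the two orderings disagree: comparing in encoding order puts b first because its vowel weight is lower, while the collation key puts a first because it has no final.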
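
Illustration for the pair-comparison tokenizing slide: a rough Python sketch of the three steps (assign a class to each code point, compare the classes on either side of a candidate break, and look at extra context in a few cases). The class names, the code point ranges chosen for each class, and the break rules are simplifying assumptions based only on the Finals slide (a consonant followed by U+1039, optionally followed by U+200C). This is not the algorithm actually used by the implementations cited in the references, and it only considers pure Myanmar text.

```python
from typing import List

# Step 1: hypothetical character classes.
CONSONANT, VIRAMA, OTHER = "C", "V", "X"


def char_class(ch: str) -> str:
    """Assign a class to one code point (assumed ranges, for illustration)."""
    cp = ord(ch)
    if 0x1000 <= cp <= 0x1021:   # Myanmar letters KA .. A
        return CONSONANT
    if cp == 0x1039:             # Myanmar sign virama
        return VIRAMA
    return OTHER                 # vowel signs, tone marks, U+200C, etc.


def is_break(text: str, i: int) -> bool:
    """Decide whether a syllable break falls between text[i-1] and text[i]."""
    before, after = char_class(text[i - 1]), char_class(text[i])
    # Step 2: pair comparison. Only a consonant can start a new syllable,
    # and a consonant right after a virama is stacked, not a new syllable.
    if after != CONSONANT or before == VIRAMA:
        return False
    # Step 3: extra context. A consonant followed by a virama is a final
    # and stays attached to the preceding syllable.
    if i + 1 < len(text) and char_class(text[i + 1]) == VIRAMA:
        return False
    return True


def syllables(text: str) -> List[str]:
    """Split text into syllable tokens using the break test above."""
    if not text:
        return []
    tokens, start = [], 0
    for i in range(1, len(text)):
        if is_break(text, i):
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    return tokens


# KA + AA, KHA + AA: two open syllables.
open_pair = syllables("\u1000\u102C\u1001\u102C")

# KA + (NA + virama + U+200C as a visible final), then KHA + AA.
with_final = syllables("\u1000\u1014\u1039\u200C\u1001\u102C")
```

Each returned token could then be fed to an indexer or a line breaker, which are the applications listed on the slides.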