Unicode: Mark Davis
Unicode Consortium President
IBM Chief SW Globalization Architect
2003-09-24
Universal Character Encoding: … Unique number for every character
Unifies all Languages: 96 thousand characters, so far
All characters accessible at the same time, in the same document:
A, Ž, Ш, Δ, ش,
क, க, ಔ,…
か, 上, 각, …
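To make "unique number for every character" concrete, here is a minimal Python sketch that prints the code point and standard name of the sample characters above. The helper name `describe` is hypothetical, and the names come from whatever Unicode version the local Python build ships.

```python
import unicodedata

# Sample characters from the slide, all held in one ordinary string.
sample = "AŽШΔشकகಔか上각"

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Every character has its own number, regardless of script.
for ch in sample:
    print(describe(ch))
```

No code page switch or per-script setting is involved: the same string holds Latin, Cyrillic, Greek, Arabic, Indic, and East Asian characters side by side.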
Lingua Franca for Computers: Developed & supported by industry leaders:
Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, …
Required by modern standards:
XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc.
Implemented in:
All modern operating systems, browsers, and other products
International Domain Names: Approved - Unicode-Based
Examples:
http://Юникод.com
http://Βαλκανίων.com
http://हमसब.com
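On the wire, such names travel in an ASCII-compatible form. A minimal sketch using Python's built-in `idna` codec, which implements the 2003 IDNA rules; the slides do not name a tool or revision, so that choice is an assumption, as are the helper names `to_ascii` and `to_unicode`.

```python
# Host names from the slide, without the http:// prefix.
hosts = ["Юникод.com", "Βαλκανίων.com", "हमसब.com"]

def to_ascii(host: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible (xn--) form."""
    return host.encode("idna").decode("ascii")

def to_unicode(ascii_host: str) -> str:
    """Convert an ASCII-compatible host name back to Unicode."""
    return ascii_host.encode("ascii").decode("idna")

for host in hosts:
    ascii_form = to_ascii(host)
    print(host, "->", ascii_form, "->", to_unicode(ascii_form))
```

The codec lowercases labels during preparation, so a name such as Юникод.com comes back as юникод.com.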
Standard Resources: www.unicode.org
Online Standard
Technical Reports
FAQs
General Information
Discussion Forums, Conferences
Programming Resources: System APIs:
Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, …
Languages:
Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, …
Cross-platform libraries:
ICU, Rosette, …
Stability: Developers / other standards need absolute stability
Characters are never moved or deleted
Ordering of characters is by collation, not binary order. See UTS #10: Unicode Collation Algorithm
Characters may be deprecated (discouraged).
Characters never change names
Annotations are used to clarify usage
See Unicode Policies
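The point about collation versus binary order can be illustrated with a small Python sketch. The key function `rough_key` below is a hypothetical stand-in that strips accents and case; it is not the Unicode Collation Algorithm of UTS #10, which a library such as ICU would provide.

```python
import unicodedata

# "\u00e9clair" is "éclair" with a precomposed e-acute.
words = ["zebra", "Apple", "\u00e9clair", "apple"]

def rough_key(s: str) -> str:
    """Crude sort key: drop combining marks, then fold case (not real UCA)."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

binary_order = sorted(words)                # plain code point order
rough_order = sorted(words, key=rough_key)  # closer to what readers expect

print(binary_order)
print(rough_order)
```

In code point order the accented word lands after "zebra", because é has a higher number than z. Since ordering is delegated to collation, the numbers assigned to characters never need to change.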
Indic Support in Unicode: ISCII is the basis for characters and allocation
Consortium actively engaged with Indian Government, which is a member
Welcomes addition of missing characters (e.g. Vedic), clarifications or corrections of usage
Structural Similarities with ISCII: Within script, layout and contents nearly identical
Independent + dependent vowels
Halant model for representing conjuncts
conjuncts / half-forms not directly encoded
represented by sequences instead
Phonetic sequence – order in syllables
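The halant model and phonetic ordering can be seen directly in the stored code points. A minimal Python sketch using Devanagari; the variable and function names are illustrative only.

```python
import unicodedata

KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"

# The conjunct "kssa" has no code point of its own:
# it is stored as KA + halant (virama) + SSA.
conjunct = KA + VIRAMA + SSA

# "ki" is stored in phonetic order, consonant first, even though
# the vowel sign is drawn to the left of the consonant.
ki = "\u0915\u093F"

# Independent vowel I versus the dependent vowel sign I.
independent_i, dependent_i = "\u0907", "\u093F"

def names(text: str) -> list[str]:
    """List the standard character names in storage order."""
    return [unicodedata.name(c) for c in text]

print(names(conjunct))
print(names(ki))
print(names(independent_i + dependent_i))
```

Shaping the sequence into a conjunct or half-form is left to the font and layout engine, not the encoding.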
Structural Differences with ISCII: Unicode is stateless:
No shifting to get different scripts
Each character has a unique number
Unicode is uniform:
No extension bytes necessary
All characters coded in the same space
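A minimal Python sketch of what statelessness means in practice: the letter KA of three Indic scripts sits in one string with no shift codes, and each character can be identified on its own. Reading the script from the first word of the character name is a shortcut assumed here for illustration, not an official script property.

```python
import unicodedata

# Devanagari KA, Bengali KA, Tamil KA, Latin A: one string, no mode switches.
mixed = "\u0915\u0995\u0B95A"

def script_of(ch: str) -> str:
    """Guess the script from the first word of the character name (heuristic)."""
    return unicodedata.name(ch).split()[0]

# Any character can be inspected in isolation; nothing before it matters.
print([script_of(ch) for ch in mixed])
print(script_of(mixed[2]))  # works from the middle of the string too
```

Each KA has a distinct number in the same code space, so the meaning of a character never depends on an earlier script-selection byte.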
Additional Characters: Indian Government is developing proposals for:
Additions of missing characters:
Vedic
Individual characters for certain scripts
Annotations and Descriptions
Global Applications now support languages of India: Companies supporting Indic with Unicode
OpenType fonts
Font support for Indic
Microsoft Windows
Java (IBM contributed ICU Indic Layout)
Linux
…
Benefits for India: All documents, anywhere in the world, can have Indic text
Allows seamless multilingual documents in India
including scriptures and minority languages
Opens up software export market, beyond English
Connects India to the world
How India Can Contribute: Effective Communication with the Unicode Consortium
Provide Resources for Development
Descriptions of Usage
Descriptions of Character Shaping
Transliteration Tables from Script to Script
Collation Information
OpenType fonts
…
What Developers Can Do: Interwork with existing ISCII systems
Move to Unicode for future developments
Java, Windows, Linux, …
The Future: The world is moving rapidly to Unicode
Unicode makes India open to the world
The world comes to you, and
You go to the world
You can help
Q & A
Backup Slides
Multiple Forms: UTF-8: maximal compatibility with 8-bit systems
UTF-16: good storage, interoperability with Windows/Java
UTF-32: simplest processing
Fast, lossless conversion
See Forms of Unicode
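The trade-offs between the three forms can be checked with a short Python sketch. The sample string and helper names are illustrative; the little-endian codec variants are used only so that no byte order mark is added to the counts.

```python
# Latin A, Devanagari KA, and U+10400 (a supplementary-plane character).
text = "A\u0915\U00010400"

FORMS = ("utf-8", "utf-16-le", "utf-32-le")

def sizes(s: str) -> dict[str, int]:
    """Byte length of the string in each encoding form."""
    return {enc: len(s.encode(enc)) for enc in FORMS}

def round_trips(s: str) -> bool:
    """True if every form decodes back to exactly the same string."""
    return all(s.encode(enc).decode(enc) == s for enc in FORMS)

print(sizes(text))
print(round_trips(text))
```

UTF-8 uses 1 to 4 bytes per character and leaves ASCII unchanged, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4, which is why it is the simplest to index.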