
[ESW Wiki] Update of "geoUnicodeConsiderationsWhenUpgrading" by BjörnHöhrmann

From: <w3t-archive+esw-wiki@w3.org>
Date: Sun, 31 Jul 2005 11:51:14 -0000
To: w3t-archive+esw-wiki@w3.org
Message-ID: <20050731115114.2216.60417@localhost.localdomain>
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "ESW Wiki" for change notification.

The following page has been changed by BjörnHöhrmann:
http://esw.w3.org/topic/geoUnicodeConsiderationsWhenUpgrading


------------------------------------------------------------------------------
+ = FAQ: Upgrading from language-specific legacy encoding to Unicode encoding =
  
+ == Question: What should I consider when upgrading my web pages from legacy encoding to Unicode encoding? ==
+ 
+ === Background ===
+ 
+ You have heard that using Unicode is a good idea and that it brings benefits such as standards compatibility, multilingual display on a single page, and organisation-wide applications.
+ 
+ Numerous large organizations are beginning to switch to Unicode (see [http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]). This FAQ lists some of the considerations you need to take into account when upgrading your encoding to Unicode.
+ 
+ Note that if you are using a content management system to generate web pages, you may also need to consider your storage encoding, the migration of legacy data, and software support.
+ 
+ === Answer ===
+ 
+ ==== Which Unicode encoding for web pages? ====
+ 
+ Unicode is the [http://www.w3.org/International/questions/qa-doc-charset Document Character Set for HTML and XML].
+  
+ Unicode has three main encodings: UTF-8, UTF-16, UTF-32.
+ 
+ UTF-8 is the Unicode encoding most commonly used for web pages, for two reasons:
+ 
+    * Better compatibility with legacy data that uses ASCII: the 128 ASCII characters match the first 128 Unicode codepoints, and UTF-8 encodes each of them as the same single byte that ASCII uses.
+    * No byte order problems.
+ 
+ UTF-16 is often used for the system back-end. 
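
The two points above can be checked with a short sketch. Python is used here purely for illustration (the FAQ does not prescribe a language), and the helper name `encoded_bytes` is hypothetical. It shows that ASCII text is byte-for-byte identical in UTF-8, and that UTF-16 output depends on byte order while UTF-8 output does not.

```python
def encoded_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


# An ASCII character is the same single byte in ASCII and in UTF-8.
print(encoded_bytes("A", "ascii"))      # 41
print(encoded_bytes("A", "utf-8"))      # 41

# UTF-16 uses two bytes here, and their order depends on endianness.
print(encoded_bytes("A", "utf-16-be"))  # 00 41
print(encoded_bytes("A", "utf-16-le"))  # 41 00

# A non-ASCII character (e acute) takes two bytes in UTF-8.
print(encoded_bytes("\u00e9", "utf-8"))  # c3 a9
```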
+ 
+ 
+ === How well is Unicode supported for my end users? ===
+ 
+ This depends on:
+    * browser support
+    * suitable fonts
+    * rendering software
+ 
+ ==== Browser support ====
+ 
+ Modern browsers support Unicode:
+ 
+    * Internet Explorer
+    * Firefox
+    * Mozilla
+    * Opera
+    * Netscape Navigator
+    * Safari
+ 
+ Although many mobile phones support UTF-8, some do not. Those that use a legacy encoding instead may not all use the same one: the encoding can vary from device to device. Investigation is required if you are targeting a large mobile phone market.
+ 
+ ==== Suitable fonts ====
+ 
+ Correct script display requires Unicode support at the application or operating system level, and Unicode fonts must be available on the user's machine.
+ 
+ CSS can help with font family fallbacks in the case where the user does not have a specific font but another font will display the text readably. Do end the list with a CSS generic font family, such as serif or sans-serif, eg:
+ 
+ .headline {font-family:Verdana,Arial,Helvetica,sans-serif; font-size:16px; font-weight:bold; padding-bottom:4px;}
+ 
+ Modern operating systems support Unicode:
+ 
+    * Windows NT and its descendants Windows 2000 and Windows XP
+    * UNIX-like operating systems such as GNU/Linux
+    * BSD
+    * Mac OS X
+ 
+ Fonts not available in a standard installation can often be downloaded from free sites by users, and you can point to those sites from your pages. It is not desirable to embed fonts in pages because the technology for that is proprietary and browser-specific.
+ 
+ Commonly available Unicode fonts (commercial and open source) come in the [http://en.wikipedia.org/wiki/Truetype TrueType] format or the more recent [http://en.wikipedia.org/wiki/Opentype OpenType] format.
+ 
+ Unicode fonts or ‘font families’ provide a mapping from Unicode codepoints to the graphical representation of characters, ie, glyphs. Unicode fonts usually cover [http://www.babelstone.co.uk/Fonts/Fonts.html specific scripts]. Applications such as browsers usually cover Unicode by using several fonts for different scripts and ranges.
+ 
+ Font display problems:
+ 
+    * Legacy code pages (eg ISO-8859-1/windows-1252): an operating system or browser either has a font installed for that encoding or it doesn't, so either the whole page displays correctly or none of its characters do (they appear as question marks).
+    * Unicode: the operating system or browser may have fonts for some, but not all, of the codepoints used, so some characters on a Unicode page may display correctly whilst others appear as empty rectangles.
+ 
+ ==== Rendering software ====
+ 
+ Multilingual text rendering engines are built into operating systems and browser installations. They are typically needed for '[http://people.w3.org/rishida/scripts/tutorial/all.html#Slide0430 complex scripts]' such as those used to write Arabic, Hindi, Urdu and Persian, ie scripts in which characters change appearance based on their context.
+ 
+    * Windows: Uniscribe
+    * Macintosh: Apple Type Services for Unicode Imaging, which replaced the WorldScript engine for legacy encodings.
+    * Pango (open source)
+    * Graphite (open source renderer from SIL)
+ 
+ === What I don’t need to worry about ===
+ 
+ Page weight / download cost is not really an issue: a large proportion of a web page is HTML mark-up, whose characters remain 1 byte each in UTF-8, so the difference between a legacy encoding and UTF-8 is usually negligible. In addition, many legacy encodings for East Asian languages, eg, Chinese, already use two bytes per character.
+ 
+ Same page weight as for legacy encodings:
+ 
+    * HTML markup
+    * English
+ 
+ Slightly heavier:
+ 
+ '''QUERY FOR (DRC): did you mean the following RFC? [http://www.ietf.org/rfc/rfc3629.txt?number=3629 RFC: UTF-8, a transformation format of ISO 10646] I couldn't find the useful bit you mentioned re weight. Could you point me to it. ALSO QUERY FOR ALL - should we point to a page weight tool here.'''
+ [[DRC I was mistaken in the original source, but this may be of use http://www-128.ibm.com/developerworks/unicode/library/utfencodingforms/index.html#h2 ]]
+ 
+    * Languages written in the Latin script: accented characters outside the ASCII range (128 codepoints), eg, e acute, are represented by one byte in ISO-8859-1 but by two bytes in UTF-8, so a small, but acceptable, increase in page size should be expected.
+ 
+    * Other scripts: Arabic and Russian (Cyrillic) characters use 2 bytes each in UTF-8, where their single-byte legacy encodings use 1. Most Chinese characters use 3 bytes in UTF-8, but legacy Chinese encodings already use 2 bytes per character, so the increase is smaller than it first appears.
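
The byte counts above can be verified with a small sketch. It is in Python for illustration only; the function name `byte_lengths`, the sample characters, and the particular legacy encodings chosen (ISO-8859-1, KOI8-R, ISO-8859-6, GB2312) are our assumptions, picked as representative examples rather than taken from the FAQ.

```python
from typing import Tuple


def byte_lengths(char: str, legacy_encoding: str) -> Tuple[int, int]:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for one character."""
    return len(char.encode(legacy_encoding)), len(char.encode("utf-8"))


# (description, sample character, a legacy encoding that contains it)
SAMPLES = [
    ("ASCII letter", "a", "iso-8859-1"),
    ("e acute", "\u00e9", "iso-8859-1"),
    ("Cyrillic zhe", "\u0436", "koi8-r"),
    ("Arabic ain", "\u0639", "iso-8859-6"),
    ("Chinese zhong", "\u4e2d", "gb2312"),
]

for description, char, legacy in SAMPLES:
    legacy_size, utf8_size = byte_lengths(char, legacy)
    print(f"{description}: {legacy_size} byte(s) in {legacy}, {utf8_size} in utf-8")
```

To estimate the effect on a real page, the same comparison can be made on the whole file: decode it from its legacy encoding and compare `len()` of the original bytes with `len()` of the UTF-8 re-encoding.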
+ 
+ === Don't forget ===
+ 
+ ==== Character encoding declaration ====
+ 
+ Ensure that you add a character encoding declaration, or change the existing one from the legacy encoding to Unicode (see [http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: character encoding declaration]).
+ 
+    * HTTP header content-type, eg, Content-Type: text/html; charset=utf-8
+    * HTML head, eg, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
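
When both declarations are present they should name the same encoding. The following Python sketch checks that for one page. It is a rough illustration, not a full HTML parser: the function names are hypothetical, the regular expression only recognises simple `<meta>` declarations, and looking only at the first 1024 bytes is our assumption.

```python
import re
from email.message import Message
from typing import Optional

# Matches charset=... inside a <meta> tag; deliberately simple.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent


def meta_charset(page: bytes) -> Optional[str]:
    """Extract the charset declared in a <meta> element near the top of the page."""
    match = _META_CHARSET.search(page[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declarations_agree(content_type: str, page: bytes) -> bool:
    """True only if both declarations exist and name the same encoding."""
    from_header = header_charset(content_type)
    return from_header is not None and from_header == meta_charset(page)
```

For example, `declarations_agree("text/html; charset=utf-8", page)` is true for a page whose head contains the `<meta>` element shown above, and false if the server still sends `charset=ISO-8859-1`.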
+ 
+ ==== File encoding ====
+ 
+ Ensure that the file itself is saved in the encoding you declare. With a Unicode encoding, characters can be typed directly into the source, so the source text is readable and matches the text shown on the web page. With a legacy encoding, characters outside its repertoire have to be written as escapes that point to codepoints, which makes the source text hard to read.
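
Changing the declaration is not enough: the bytes of the file must be converted as well. A minimal Python sketch of that conversion follows, assuming you know which legacy encoding the file was saved in. The function names and the file-handling wrapper are hypothetical, and decoding is strict so that a wrongly guessed legacy encoding is more likely to raise an error than to silently corrupt text.

```python
from pathlib import Path


def transcode_to_utf8(data: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from the legacy encoding and re-encode them as UTF-8."""
    return data.decode(legacy_encoding).encode("utf-8")


def convert_file(source: Path, target: Path, legacy_encoding: str) -> None:
    """Write a UTF-8 copy of `source` to `target`, leaving `source` untouched."""
    target.write_bytes(transcode_to_utf8(source.read_bytes(), legacy_encoding))
```

Remember to update any encoding declaration inside the file after converting it, otherwise the page will declare the legacy encoding while containing UTF-8 bytes.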
+ 
+ ==== Combining data ====
+ 
+ Ensure that any file fragments that are included in the web page, eg using technologies such as Apache SSI (server-side includes), are saved with the correct encoding. Because an included fragment shares the encoding declared for the parent page, its encoding must match that of the parent file, and parent and fragments must be upgraded to Unicode at the same time.
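
One way to catch a fragment that was missed during the upgrade is to test whether every fragment decodes as UTF-8. The Python sketch below does that; the function name and the name-to-bytes mapping are hypothetical. Note the limits of the check: pure ASCII fragments always pass, and a pass does not prove the author intended UTF-8, although legacy 8-bit text containing accented characters is rarely valid UTF-8 by accident.

```python
from typing import Dict, List


def non_utf8_fragments(fragments: Dict[str, bytes]) -> List[str]:
    """Return the names of fragments whose bytes are not valid UTF-8."""
    failures = []
    for name, data in fragments.items():
        try:
            data.decode("utf-8")  # strict decoding: raises on invalid bytes
        except UnicodeDecodeError:
            failures.append(name)
    return failures
```

In practice the mapping would be filled by reading each included file in binary mode before the upgraded parent page is published.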
+ 
+ [[DRC
+ ==== Forms ====
Server-side applications that process data returned from a form must be able to handle Unicode; they may need to be adapted before upgraded pages containing forms are published.
+ 
+ ]]
+ === Further reading ===
+ 
+    * [http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]
+    * [http://www.w3.org/International/questions/qa-doc-charset Document Character Set for HTML and XML]
+    * [http://www.unicode.org/help/display_problems.html Settings to change to resolve display problems in Unicode]
+    * [http://en.wikipedia.org/wiki/Truetype Information about TrueType font]
+    * [http://en.wikipedia.org/wiki/Opentype Information about OpenType font]
+    * [http://www.babelstone.co.uk/Fonts/Fonts.html Unicode fonts and specific scripts]
+    * [http://people.w3.org/rishida/scripts/tutorial/all.html#Slide0430 complex scripts]
+    * [http://www.unicode.org Unicode Consortium]
+    * [http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: Character sets & encodings in XHTML, HTML and CSS]
+    * [http://www.ietf.org/rfc/rfc3629.txt?number=3629 RFC: UTF-8, a transformation format of ISO 10646]
+    * [http://www.alanwood.net/unicode/browsers.html Unicode & multilingual web browsers]
+    * [http://en.wikipedia.org/wiki/Unicode_and_HTML Unicode & HTML]
+ 
Received on Sunday, 31 July 2005 15:09:20 GMT
