[ESW Wiki] Update of "geoUnicodeConsiderationsWhenUpgrading" by Deborah Cawkwell

The following page has been changed by Deborah Cawkwell:
http://esw.w3.org/topic/geoUnicodeConsiderationsWhenUpgrading


The comment on the change is:
New version: WIKI formatting required.

------------------------------------------------------------------------------
  These changes should be sent to the GEO public list - providing us with a  consistent notification method and an archive of changes.
  
  ----
- = FAQ: Upgrading from language-specific legacy encoding to Unicode encoding =
+ FAQ: Upgrading from language-specific legacy encoding to Unicode encoding 
+ Question: What should I consider when upgrading my web pages from legacy encoding to Unicode encoding?
+ Background 
+ You have heard that using Unicode is a good idea and that there are benefits such as standards compatibility, multilingual display on a single page, and pan-organisation applications. 
+ Numerous large organizations are beginning to switch to Unicode: [http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]
+ However, you are not sure what's involved and whether it will work for your site.
+ This FAQ will attempt to list some of the considerations you would need to take into account for the encoding of web pages.
+ Note that if you are using a content management system to generate web pages, you may also need to consider your storage encoding, migration of legacy data, and software support.
+ [MD 22 mar] Maybe mention here that some mobile phones don't yet support UTF-8 (but some do, although with a limited range of characters).
+ Answer 
+ Which Unicode encoding for web pages? 
+ Unicode has three main encodings: UTF-8, UTF-16, UTF-32.
+ UTF-8 is the Unicode encoding consistently used for web pages:
+ • Better compatibility with legacy ASCII data: the 128 ASCII codepoints are encoded as the same single bytes in UTF-8.
+ • No byte order problems: UTF-8 is a sequence of 8-bit units, so there is no byte order to get wrong.
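The two points above can be checked with a short Python sketch. It is an illustration only, not part of the FAQ text; the sample strings are arbitrary choices.

```python
# ASCII-only text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Hello, world"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# UTF-16 can be serialised in two byte orders, so the same character
# yields two different byte sequences. UTF-8 has a single serialisation.
sample = "\u00e9"  # e acute
utf16_le = sample.encode("utf-16-le")
utf16_be = sample.encode("utf-16-be")
utf8 = sample.encode("utf-8")

print(ascii_bytes == utf8_bytes)
print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
```

The UTF-16 forms differ only in the order of their two bytes, which is why UTF-16 documents need a byte order mark or an explicit label and UTF-8 documents do not.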
+ How well is Unicode supported for my end users?
+ This depends on:
+ • browser support
+ • suitable fonts
+ • rendering software
+ Browser support
+ Modern browsers support Unicode:
+ • Internet Explorer 6 (Windows)
+ • Firefox 1.0 
+ • Mozilla 1.4
+ • Opera 7.0
+ • Netscape Navigator 7.0
+ • Safari 1.03
+ • Internet Explorer 5.2 (Mac)
+ Suitable fonts
+ Correct script display requires Unicode support by the operating system and availability on the machine of Unicode fonts. 
+ CSS can help with font family fallbacks in the case where the user does not have a specific font, but another font will display the text readably. Do use CSS generic font family fallbacks, eg, serif, sans-serif.
+ Modern operating systems support Unicode:
+ • Windows NT and its descendants Windows 2000 and Windows XP
+ •	UNIX-like operating systems such as GNU/Linux
+ •	BSD
+ •	Mac OS X
+ Standard installation of an operating system includes suitable fonts for the language selected by the user. Fonts not included in a standard installation can usually be added via menu options; they can also be downloaded. Some languages currently require a font download; these languages include Pashto, Hindi, Urdu, Bengali.
+ Commonly available Unicode fonts (commercial and open source) use the [http://en.wikipedia.org/wiki/Truetype TrueType] format or the more recent [http://en.wikipedia.org/wiki/Opentype OpenType] format.
+ Unicode fonts or ‘font families’ provide a mapping from Unicode codepoints to the graphical representation of characters, ie, glyphs. Unicode fonts usually cover [http://www.babelstone.co.uk/Fonts/Fonts.html specific scripts]. Applications such as browsers usually cover Unicode by using several fonts for different scripts and ranges.
+ Font display problems:
+ •	ISO-8859-1/windows-XXXX: the operating system or browser either has a font installed for that encoding or it doesn't, so either the page displays correctly or none of the characters display (question marks appear instead).
+ • Unicode: the operating system or browser may have fonts for some, but not all, of the codepoints, so some characters on a Unicode page may display correctly whilst others appear as empty rectangles.
+ Rendering software
+ Multilingual text rendering engines are built into operating system and browser installations.
+ • Windows: Uniscribe
+ • Macintosh: Apple Type Services for Unicode Imaging, which replaced the WorldScript engine for legacy encodings.
+ • Pango - open source
+ • Graphite - (open source renderer from SIL)
+ What I don’t need to worry about
+ Page weight
+ Same page weight as for legacy encodings:
+ •	HTML markup
+ •	English
+ Slightly heavier
+ •	Latin languages: characters, eg, e acute, outside the ASCII range (128 codepoints), are represented by one byte in ISO-8859-1, but typically two bytes in UTF-8, so a small, but acceptable, increase in page size should be expected.
+ • Characters outside the ASCII range, such as those used for Chinese, Arabic and Russian, may use 2 or even 3 bytes. Chinese legacy encodings already use 2 bytes per character.
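The byte counts above can be compared directly. The following Python sketch is illustrative only: the helper name encoded_size, the sample characters, and the choice of KOI8-R and GB2312 as representative legacy encodings are assumptions made for the example, not part of the FAQ.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# (sample character, a legacy encoding able to represent it)
SAMPLES = [
    ("e", "iso-8859-1"),       # ASCII letter
    ("\u00e9", "iso-8859-1"),  # e acute
    ("\u0436", "koi8-r"),      # Cyrillic letter zhe
    ("\u4e2d", "gb2312"),      # Chinese character "middle"
]

for char, legacy in SAMPLES:
    print(repr(char), legacy, encoded_size(char, legacy),
          "utf-8", encoded_size(char, "utf-8"))
```

Because markup and English text stay at one byte per character, the overall growth of a typical page is smaller than the per-character figures alone suggest.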
+ Don't forget
+ Character encoding declaration 
+ You should ensure that you change the character encoding declaration from legacy to Unicode; see [http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: character encoding declaration]. 
+ • HTTP header content-type, eg, Content-Type: text/html; charset=utf-8
+ • HTML head, eg, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
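One way to check that a declaration has been updated, and that the bytes served really are UTF-8, is sketched below in Python. The function names declared_charset and is_valid_utf8 are hypothetical helpers written for this example, and the regular expression is a simplification rather than a full parser for HTTP headers or HTML.

```python
import re
from typing import Optional

def declared_charset(declaration: str) -> Optional[str]:
    """Extract the charset value from a Content-Type header or meta tag."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", declaration, re.IGNORECASE)
    return match.group(1).lower() if match else None

def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

header = "Content-Type: text/html; charset=utf-8"
meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
print(declared_charset(header), declared_charset(meta))

# A page still stored as ISO-8859-1 fails the check once it contains
# a non-ASCII character, even if its declaration already says utf-8.
legacy_page = "caf\u00e9".encode("iso-8859-1")
print(is_valid_utf8(legacy_page))
```

A mismatch between the declaration and the stored bytes is the usual cause of question marks or garbled accents after an upgrade, so it is worth checking both together.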
+ Further reading 
+   * [http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]
+   * [http://www.unicode.org/help/display_problems.html Settings to change to resolve display problems in Unicode]
+   * [http://en.wikipedia.org/wiki/Truetype Information about TrueType font]
+   * [http://en.wikipedia.org/wiki/Opentype Information about OpenType font]
+   * [http://www.babelstone.co.uk/Fonts/Fonts.html Unicode fonts and specific scripts].
+   * [http://www.unicode.org Unicode Consortium]
+   * [http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: Character sets & encodings in XHTML, HTML and CSS]
+   * [http://www.alanwood.net/unicode/browsers.html Unicode & multilingual web browsers]
+   * [http://en.wikipedia.org/wiki/Unicode_and_HTML Unicode & HTML]
  
- '''Question: What should I consider when upgrading my web pages from legacy encoding to Unicode encoding?'''
- 
- 
- == Background ==
- 
- '''Numerous large organizations are beginning to switch to Unicode: [http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]'''
- 
- '''You have heard that using Unicode is a good idea and that there are benefits such as multilingual display and standards compatibility. However, you are not sure what's involved and whether it will work for your site.'''
- 
- '''This FAQ will attempt to list some of the considerations you would need to take into account.'''
- 
-    * '''If you have a database-backed web-publishing system, ie, a CPMS (Content Production and Management System), you may need to consider migrating your legacy data to Unicode encoding (NB FAQ being worked on for this in response to migrating legacy data question on i18n IG list. But unlikely to be finished when this FAQ published - can be added later.)'''
- 
- [MD 22 mar] Maybe mention here that some mobile phones don't yet support UTF-8 (but some do, although with a limited range of characters).
- 
- == Answer ==
- 
- === Will my users be able to see the stuff? Which browsers do they need to be using? ===
- 
-    * How did I evaluate this?
- 
-       Internet Explorer 6 (Windows) - yes
-       Firefox 1.0 - yes
-       Mozilla 1.4 - yes
-       Opera 7.0 - yes
-       Netscape Navigator 7.0
-       Safari 1.03
-       Internet Explorer 5.2 (Mac) 
- 
- 
- 
- 
- 
-    * '''Background info: how fonts work seems to be different with Unicode.
-    * '''Which Unicode encoding''' (Another FAQ here with answers from a recently asked IG question: "which Unicode encoding on the web? Also, which Unicode encoding for my data storage?"'''
-    * '''Will my pages be heavier?''' (Could be in a what you really don't need to worry about section as suggested by RI).
-    * '''If I'm using a CPMS to generate my web pages, is there anything I need to consider. (FAQ: migration of legacy data. Also FAQ which Unicode encoding for my storage?)
-    * '''Forms: is anything different? do I need to change anything? Can I now prepare for Xforms? On my organisation's site, there are some sites using UTF-8 because they don't work with anything else & forms work, so that seems to be a non-issue?? (There are issues around form input & offering online keyboards.)
-    * '''Don't assume encoding. Pointers to other FAQs. Also, a useful FAQ might be "why do I see ? in my web pages?". Recent page I saw which had no encoding statement showed ? where there should have an e with a grave accent. I was viewing the page in FF where I'd set up default encoding to Unicode; the page assumed iso-8859-1 and displayed correctly when I changed the encoding via the toolbar (view, etc). I think this UA behaviour might be contrary to W3C HTML standards, but the display could be easily accommodated with an encoding statement &/or HTML entities / NCRs...
- 
- 
-    * '''Are there any other questions I have missed?'''
-    * '''Is this a better direction?'''
-    
- -------------------------------------
- 
- ''' OLD '''
- 
- 
- 
- === How widely is Unicode supported for my users? ===
- 
- '''This depends on:'''
-   * '''browser support'''
-   * '''suitable fonts'''
-   * '''rendering software'''
- 
- === Background Information ===
- 
- '''[How Unicode works with fonts inc using rendering engines. This is something I would want to know and understand.]'''
- 
- '''Most recently released web browsers are able to display content encoded using Unicode if a suitable font which supports Unicode is available to the system.'''
- 
- '''Multilingual Text Rendering Engines:'''
- 
-   *'''Windows: Uniscribe'''
-   *''' Macintosh: Apple Type Services for Unicode Imaging, which replaced the WorldScript engine for legacy encodings.'''
-   *''' Pango - open source'''
-   *''' Graphite - (open source renderer from SIL)'''
- 
- 
- ==== Browser Support ====
- 
- * '''[All modern browsers" pretty much correct but vague: specify which browser versions in significant families that can display Unicode and perhaps more importantly ones that can't]'''
- 
- 
- [DC 6 apr] Is this section relevant at all if we have a Unicode 'how it works for web pages' background section?
- ==== Rendering Software  ====
- 
- [AC 24 mar] although, practically, the langauge [[DRC 05 May spellling... language]] capabilities of a web browser depend on the rendering system. There are languages that IE or Firefox on WinXP-sp2 can render than the same versions of IE and firefox on WinXP can not, since WinXP and WinXP-sp2 use different versions of uniscribe. Also at issue, are languages supported by Unicode that are not official supported or are unsupported by current commercial rendering systems.
- 
- 
- ==== Fonts ====
- 
- '''Fonts which support Unicode are now commonly available, both commercial and open source, examples being TrueType and the more recent OpenType, which both support Unicode. These font families provide a mapping from Unicode codepoints to the graphical representation of characters, i.e. glyphs.'''
- 
- [MD 22 mar] please make sure here that people don't get the impression that usually a single font covers all of Unicode. You are almost there, but maybe need some tweaks, or an additional sentence, such as "Applications such as browsers usually cover Unicode by using
- several fonts for different scripts and ranges."
- 
- [telcon 23 mar]
-   * OS font vs UA distinction not clear enough, currently implies that if font isn't visible, UA doesn't support Unicode, which is wrong, probably that you need a font to display those characters on your system 
-   * Clarify: don't need huge Unicode font - user would need a font which would display appropriate necessary Unicode characters, cf, OS / UA combinations.
-   * Ensure users have fonts, which probably will have as built into system.
-   * Usually Unicode fonts cover 'specific' scripts, point to Alan Wood's v useful list of Unicode fonts & and to [http://www.babelstone.co.uk/Fonts/Fonts.html babelstone font information].
- 
- [[DRC 05 May Note. Some fonts only support older Unicode versions and most only support the base plane]]
- 
- '''If using a legacy encoding, ie, a non-Unicode encoding, eg ie/eg looks repetitious.'''
- 
- '''ISO-8859-1/windows-XXXX, then an operating system or browser either has a font installed for that encoding or it doesn't, therefore either the page displays correctly or no characters display (question marks). With Unicode, the operating system or browser has fonts for some, but not all, of the codepoints, so when displaying a Unicode page, it's not unusual to have some of the characters display correctly whilst others don't (empty rectangles) because the browser has access to fonts for some of the codepoints but not all.'''
- 
- '''For complex scripts such as Arabic and Thai, rules need to be applied to transform the underlying character sequence to the appropriate glyphs for display. Middle Eastern languages also need support for directionality. Fonts, particularly OpenType fonts, often contain information about the shaping transformations required, but usually some operating system level support is needed to help ensure the correct output from multilingual text rendering engines; these are usually bundled with either the operating system or with the browser.'''
- 
- [MD 22 mar] This is basically explaining why one doesn't need to be concerned about this issue, yes? If so, it probably can be shorter, and say something like "most browsers these days also support shaping and bidirectional display for ... or use the support provided by the operating system."
- 
- [FS 3 May] One should add something on unification of scripts, e.g. Han unification. To distinguish unified scripts the content author might have to use additional markup to allow font assignment for CJK text. See [http://www.w3.org/International/tests/sec-cjk-fonts.html] for examples.
- 
- [telcon 23 mar] 
-   * CSS generic font family fallbacks very relevant, eg, serif, sans-serif: always good practice. DC ACTION check CSS spec to double-check wording ('generic font family'). 
-   * Para beginning "ISO...", easier to say this:
-   * Issue people might see rectangles, if they don't have correct fonts. This is only the case if multi-lingual text. In the current draft sounds like a general unicode problem 
- 
- 
- [MD 22 mar] My recollection is that this depends on the browser. This should probably be in a separate doc (technique or FAQ).
- 
- [DC 30 mar] Can you be more specific?
- 
- === Which Unicode encoding? ===
- 
- '''UTF-8 is the Unicode encoding most commonly used for web pages. Unicode has essentially three encodings: UTF-8, UTF-16, UTF-32.'''
- 
- [telcon 23 mar] 
-    * Which unicode encoding? 
-    * ACTION DC - DONE PEND RESPONSE: question to IG list: "For web pages, would you consider using a Unicode encoding other than UTF-8, eg UTF-16? If so, why? or why not?"
-    * Possibly better compatibility with legacy data, ie ASCII / UTF-8?
-  
- '''QUESTION Should the Basic Multilingual Plane (BMP) & byte representation be mentioned here?'''
- 
- [telcon 23 mar]
-    * Don't mention BMP.
-    * (BMP = first 65k approx characters, then introduced 15/16 more planes of characters, all have 65k characters in them, most of more common scripts in first BMP. 1 byte stuff: only 7-bit ASCII, not upper ASCII, ie ANSI, ANSI territory -> 2 bytes, 2 bytes beg/end Arabic, 3 bytes ideographic characters Chinese & Japanese - it's the script which is important here.  
-    * UTF-8 uses 1 byte to represent characters in the old ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.
-    * UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
-    * UTF-32 uses 4 bytes everywhere. In the chart above, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
- 
- [MD 22 mar] I don't understand the question. Are BMP and byte representation two
- separate issues, or the same issue?
- 
- [DC 23 mar] This came out of comments on an earlier draft by RI. He said:
- 
- [RI 25 jan] If you use Unicode you will need to decide on an encoding. If you use UTF-8, ASCII characters represented by a single byte (just as in an ASCII encoding), while other characters in the Basic Multilingual Plane (BMP) are represented by two or three bytes.  If you use UTF-16, all characters in the BMP are represented by 2 bytes."
- 
- [RI 25 jan] I wouldn't bother talking about the higher planes, since they are rarely used and since those characters are likely to provide complications anyway. (Use the 80:20 rule).
- 
- [DC 23 mar] I am confused about the BMP & bytes.
- 
- === Will UTF-8 make web pages heavier to download? ===
- 
- '''Characters that fall in the 'traditional ASCII' space will use 1 byte per character; this is the same as legacy encodings.'''
- 
- '''Same page weight as for legacy encodings:'''
- 
-   *'''HTML markup'''
-   *''' English'''
- 
- '''Slightly heavier'''
-   *'''Latin languages'''
- 
- '''Characters, eg, e acute, outside the ASCII range are represented by one byte in ISO-8859-1, but typically two bytes in UTF-8, so a small, but acceptable, increase in page size should be expected.'''
- 
- '''Characters that do not fall into the 'traditional ASCII' space such as Chinese, Arabic, Russian may use 2 or even 3 bytes, however, Chinese encodings already use more than 1 byte per character with legacy encodings.'''
- 
- '''QUESTION With which languages/scripts does 1-byte encoding stop?'''
- '''QUESTION Should this talk about scripts, rather than languages?'''
- 
- === Does the software you use to produce your pages support Unicode, including input environment, database, programming languages? ===
- 
- '''QUESTION Are there any server issues?'''
- 
- [MD 22 mar] Of course there are. I think most of the answers you got on your question a while ago (was that on www-international?) were about server-side issues.
- 
- [DC 23 mar] Which FAQ? Maybe I should point to it.
- 
- === What happens to legacy data? Do you transcode it all or do you build a transcoder into the pipeline? ===
- 
- '''QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy data & upgrading to Unicode question)?'''
- 
- [MD 22 mar] I thought that this was what this FAQ was about. After reading it, my impression is that it's more a FAQ about Unicode support on browsers. Maybe the question should be changed to indicate it.
- 
- [DC 23 mar] Suppose that indicates my biggest concern: how would a Unicode upgrade be for the audience? Am I missing other issues?
- 
- [AC 24 mar] From the point of view of minority languages, and languages not supported by major international character sets (including the windows codepages and the ISO-8859 series) there is the issue of locating tools to convert legacy data to Unicode. In some cases this requires developing a mapping table between the legacy character set and Unicode.
- 
- == Related == 
- 
- '''Don't forget:'''
- 
- === Character encoding declaration ===
- 
- '''When using Unicode, the encoding should be specified (as with legacy encodings) in the HTTP header content-type (eg, Content-Type: text/html; charset=utf-8) and HTML head (eg, <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>): [http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: Character sets & encodings in XHTML, HTML and CSS].'''
- 
- [telcon 23 mar] Character encoding declaration: rather you need to ensure that you change the way you serve your files, so the information is up-to-date.
- 
- == Things you don't need to think about ==
- 
- [telcon 23 mar] 'Consider' a section: "things you don't have to think about", eg heaviness 
- 
- == Further reading ==
- 
-   * '''[http://www.unicode.org Unicode Consortium]'''
-   * '''[http://www.w3.org/International/tutorials/tutorial-char-enc Tutorial: Character sets & encodings in XHTML, HTML and CSS].'''
-   * '''[http://www.w3.org/International/questions/qa-who-uses-unicode FAQ: Who uses Unicode?]'''
-   * '''[http://www.alanwood.net/unicode/browsers.html Unicode & multilingual web browsers]'''
-   * '''[http://en.wikipedia.org/wiki/Unicode_and_HTML Unicode & HTML]'''
- 
- [DanConnolly 24 mar] I marked up the links in this section as links rather than PoorMansHypertext. I hope you'll do likewise for the other links. Or you can change them back, but I'd like to know why. I also formatted the list items as list items. 
- 
- [DC 30 mar] Thanks. Think I've done everything. Tell me if otherwise.
- 

Received on Tuesday, 3 May 2005 21:36:34 UTC