WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to Unicode encoding

Question: What should you consider when upgrading to Unicode?


Background

Numerous large organizations are beginning to switch to Unicode. See FAQ: Who uses Unicode? (http://www.w3.org/International/questions/qa-who-uses-unicode).

You have heard that using Unicode is a good idea and that there are benefits such as multilingual display and standards compatibility. However, you are not sure what's involved and whether it will work for your site.

This FAQ will attempt to list some of the considerations you would need to take into account.


What to consider


Character encoding declaration

When using Unicode, the encoding should be declared (as with legacy encodings) both in the HTTP Content-Type header (eg, Content-Type: text/html; charset=utf-8) and in the HTML head (eg, <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />). See Tutorial: Character sets & encodings in XHTML, HTML and CSS (http://www.w3.org/International/tutorials/tutorial-char-enc).
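
As a rough illustration of checking what a server declares, the following Python sketch extracts the charset parameter from a Content-Type header value using the standard library. The function name declared_charset is hypothetical and chosen only for this example; it is not part of any W3C tool.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value.

    The result is lowercased; None is returned when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# A header that declares UTF-8, and one that declares nothing.
print(declared_charset("text/html; charset=utf-8"))
print(declared_charset("text/html"))
```

A page served with the second kind of header leaves the browser to rely on the meta element or to guess the encoding.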


Text rendering

Most recently released web browsers are able to display content encoded using Unicode if a suitable font which supports Unicode is available to the system.

Fonts which support Unicode are now commonly available, both commercial and open source. The TrueType font format and the more recent OpenType format both support Unicode. Fonts in these formats provide a mapping from Unicode codepoints to the graphical representations of characters, ie, glyphs.

With a legacy (ie, non-Unicode) encoding such as ISO-8859-1 or windows-XXXX, an operating system or browser either has a font installed for that encoding or it doesn't, so the page either displays correctly or no characters display (question marks appear instead). With Unicode, the operating system or browser typically has fonts for some, but not all, of the codepoints, so it's not unusual for some characters on a Unicode page to display correctly whilst others appear as empty rectangles.

For complex scripts such as Arabic and Thai, rules need to be applied to transform the underlying character sequence into the appropriate glyphs for display. Middle Eastern languages also need support for directionality. Fonts, particularly OpenType fonts, often contain information about the shaping transformations required, but some operating system level support is usually needed as well to ensure correct output. This support comes from multilingual text rendering engines, which are usually bundled with either the operating system or the browser.

Multilingual Text Rendering Engines

- Windows: Uniscribe
- Macintosh: Apple Type Services for Unicode Imaging, which replaced the WorldScript engine for legacy encodings.
- Pango (open source)
- Graphite (open source renderer from SIL)

QUESTION On the web, does it help to provide generic font family fallbacks in CSS, such as serif, sans-serif, etc? Example:
body {font-family:Verdana,Arial,Helvetica,sans-serif;}


Which Unicode encoding?

Unicode has essentially three encodings: UTF-8, UTF-16 and UTF-32. Of these, UTF-8 is the one most commonly used for web pages.
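
To make the difference between the three encodings concrete, here is a small Python sketch that counts the bytes each one needs for a single character. The helper name encoded_sizes and the sample characters are assumptions made for this illustration; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


samples = [
    ("Latin capital A (U+0041)", "\u0041"),
    ("euro sign (U+20AC)", "\u20ac"),
    ("grinning face (U+1F600), outside the BMP", "\U0001F600"),
]

for label, character in samples:
    print(label, encoded_sizes(character))
```

In this sketch UTF-8 needs between 1 and 4 bytes depending on the character, UTF-16 needs 2 or 4, and UTF-32 always needs 4.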

QUESTION Should the Basic Multilingual Plane (BMP) & byte representation be mentioned here?


Will UTF-8 make web pages heavier to download?

Characters that fall in the 'traditional ASCII' space will use 1 byte per character; this is the same as legacy encodings.

Same page weight as for legacy encodings:

- HTML markup
- English

Slightly heavier:

- Latin languages

Characters outside the ASCII range, eg, e acute, are represented by one byte in ISO-8859-1 but by two bytes in UTF-8, so a small, but acceptable, increase in page size should be expected.

Characters that do not fall into the 'traditional ASCII' space, such as those used for Chinese, Arabic and Russian, use 2 or 3 bytes each in UTF-8: Arabic and Russian letters take 2 bytes, and most Chinese characters take 3. However, legacy Chinese encodings already use more than 1 byte per character.
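
The per-character cost can be checked directly. The Python sketch below compares the number of bytes a character occupies in a legacy encoding and in UTF-8. The function name byte_cost and the choice of sample characters and legacy encodings (ISO-8859-1, KOI8-R, ISO-8859-6, GB2312) are assumptions made for illustration only.

```python
def byte_cost(text, legacy_encoding):
    """Return (bytes in the legacy encoding, bytes in UTF-8) for text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))


examples = [
    ("ASCII letter a", "a", "iso-8859-1"),
    ("e acute", "\u00e9", "iso-8859-1"),
    ("Cyrillic letter ya", "\u044f", "koi8-r"),
    ("Arabic letter ain", "\u0639", "iso-8859-6"),
    ("Chinese character zhong", "\u4e2d", "gb2312"),
]

for label, character, legacy in examples:
    legacy_bytes, utf8_bytes = byte_cost(character, legacy)
    print(f"{label}: {legacy_bytes} byte(s) in {legacy}, {utf8_bytes} in utf-8")
```

Since markup is ASCII and counts for a large share of most pages, the overall increase for a whole page is smaller than these per-character figures suggest.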

QUESTION With which languages/scripts does 1-byte encoding stop?
QUESTION Should this talk about scripts, rather than languages?


Does the software you use to produce your pages support Unicode, including input environment, database, programming languages?
QUESTION Are there any server issues?

What happens to legacy data? Do you transcode it all or do you build a transcoder into the pipeline?
QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy data & upgrading to Unicode question)?
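
Whichever approach is chosen for legacy data, the core transcoding step is small. The following Python sketch shows one possible form of it, assuming the legacy encoding of each piece of data is already known. The function name transcode_to_utf8 is hypothetical, and a real pipeline would also need to decide how to handle data whose encoding is unknown or mislabelled.

```python
def transcode_to_utf8(data: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes are not valid
    # in the stated encoding, which helps catch mislabelled data early.
    text = data.decode(legacy_encoding)
    return text.encode("utf-8")


# "cafe" with an e acute, as stored in ISO-8859-1 (one byte, 0xE9, for the accent).
legacy_bytes = b"caf\xe9"
print(transcode_to_utf8(legacy_bytes, "iso-8859-1"))  # b'caf\xc3\xa9'
```

The same function could be run once over stored data, or called on each request if the transcoder is built into the pipeline.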


Further reading

Unicode Consortium (http://www.unicode.org)
Tutorial: Character sets & encodings in XHTML, HTML and CSS (http://www.w3.org/International/tutorials/tutorial-char-enc).
FAQ: Who uses Unicode? (http://www.w3.org/International/questions/qa-who-uses-unicode)
Unicode & multilingual web browsers (http://www.w3.org/International/questions/qa-who-uses-unicode)
Unicode & HTML (http://en.wikipedia.org/wiki/Unicode_and_HTML)

Received on Monday, 21 March 2005 22:26:54 UTC