Re: WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to Unicode encoding

From: Martin Duerst <duerst@w3.org>
Date: Tue, 22 Mar 2005 16:05:15 +0900
Message-Id: <6.0.0.20.2.20050322155350.026a7030@localhost>
To: "Deborah Cawkwell" <deborah.cawkwell@bbc.co.uk>, <public-i18n-geo@w3.org>

Hello Deborah,

Great work! Some comments below:

At 07:26 05/03/22, Deborah Cawkwell wrote:
 >
 >WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to
 >Unicode encoding
 >Question: What should you consider when upgrading to Unicode?
 >
 >
 >Background
 >
 >Numerous large organizations are beginning to switch to Unicode. See FAQ:
 >Who uses Unicode?
 >(http://www.w3.org/International/questions/qa-who-uses-unicode).
 >
 >You have heard that using Unicode is a good idea and that there are
 >benefits such as multilingual display and standards compatibility. However,
 >you are not sure what's involved and whether it will work for your site.
 >
 >This FAQ will attempt to list some of the considerations you would need to
 >take into account.

Maybe be even more explicit here that this is only a start!

 >What to consider
 >
 >
 >Character encoding declaration
 >
 >When using Unicode, the encoding should be specified (as with legacy
 >encodings) in the HTTP Content-Type header (eg, Content-Type: text/html;
 >charset=utf-8) and in the HTML head (eg, <meta http-equiv="Content-Type"
 >content="text/html; charset=utf-8"/>). See Tutorial: Character sets &
 >encodings in XHTML, HTML and CSS
 >(http://www.w3.org/International/tutorials/tutorial-char-enc).
 >
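
To illustrate the two declarations described above, here is a minimal
sketch in Python 3. The function name build_response and the page skeleton
are made up for the example; the only point is that the charset named in
the HTTP header, the charset named in the meta element, and the encoding
actually used for the bytes should all agree.

```python
import html


def build_response(body_text: str, charset: str = "utf-8"):
    """Return (headers, payload) for a small HTML page declared as `charset`.

    Hypothetical helper for illustration; not taken from the FAQ draft.
    """
    page = (
        "<html><head>\n"
        # Declaration inside the document (HTML head).
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}"/>\n'
        "<title>Example</title>\n"
        "</head><body>\n"
        f"<p>{html.escape(body_text)}</p>\n"
        "</body></html>\n"
    )
    # The bytes must really be in the encoding that is declared.
    payload = page.encode(charset)
    headers = {
        # Declaration in the HTTP header.
        "Content-Type": f"text/html; charset={charset}",
        "Content-Length": str(len(payload)),
    }
    return headers, payload


headers, payload = build_response("café, 中文")
print(headers["Content-Type"])
```

Note that Content-Length counts bytes, not characters, so it has to be
computed after encoding.
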
 >
 >Text rendering
 >
 >Most recently released web browsers are able to display content encoded
 >using Unicode if a suitable font which supports Unicode is available to the
 >system.
 >
 >Fonts which support Unicode are now commonly available, both commercial and
 >open source, in formats such as TrueType and the more recent OpenType, which
 >both support Unicode. Fonts in these formats provide a mapping from Unicode
 >codepoints to the graphical representation of characters, i.e. glyphs.

Please make sure here that people don't get the impression that
usually a single font covers all of Unicode. You are almost there,
but it may need some tweaks, or an additional sentence, such as
"Applications such as browsers usually cover Unicode by using
several fonts for different scripts and ranges."

 >If using a legacy encoding, ie, a non-Unicode encoding, eg

i.e./e.g. looks repetitious.

 >ISO-8859-1/windows-XXXX, then an operating system or browser either has a
 >font installed for that encoding or it doesn't, therefore either the page
 >displays correctly or no characters display (question marks). With Unicode,
 >the operating system or browser typically has fonts for some, but not all,
 >of the codepoints, so when displaying a Unicode page it's not unusual for
 >some characters to display correctly whilst others appear as empty
 >rectangles.
 >
 >For complex scripts such as Arabic and Thai, rules need to be applied to
 >transform the underlying character sequence to the appropriate glyphs for
 >display. Middle Eastern languages also need support for directionality.
 >Fonts, particularly OpenType fonts, often contain information about the
 >shaping transformations required, but usually some operating system level
 >support is needed to help ensure the correct output from multilingual text
 >rendering engines; these are usually bundled with either the operating
 >system or with the browser.

This is basically explaining why one doesn't need to be concerned
about this issue, yes? If so, it probably can be shorter, and say
something like "most browsers these days also support shaping and
bidirectional display for ... or use the support provided by the
operating system."

Maybe mention here that some mobile phones don't yet support
UTF-8 (but some do, although with a limited range of characters).

 >Multilingual Text Rendering Engines
 >
 >- Windows: Uniscribe
 >- Macintosh: Apple Type Services for Unicode Imaging, which replaced the
 >WorldScript engine for legacy encodings.
 >- Pango - open source
 >- Graphite - (open source renderer from SIL)

What's this for? My assumption is that the reader wants to
convert Web pages and use existing browsers, not to implement
a browser, yes?

 >QUESTION On the web, does it help to provide generic font family fallbacks
 >in CSS, such as serif, sans-serif, etc? Example:
 >body {font-family:Verdana,Arial,Helvetica,sans-serif;}

My recollection is that this depends on the browser. This
should probably be in a separate doc (technique or FAQ).

 >Which Unicode encoding?
 >
 >UTF-8 is the Unicode encoding most commonly used for web pages. Unicode has
 >essentially three encodings: UTF-8, UTF-16, UTF-32.
 >
 >QUESTION Should the Basic Multilingual Plane (BMP) & byte representation be
 >mentioned here?

I don't understand the question. Are BMP and byte representation two
separate issues, or the same issue?
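
In case the question is about how many bytes a character needs inside and
outside the BMP, here is a minimal sketch in Python 3 comparing the three
encodings. The helper name encoded_sizes is made up for the example, and
the little-endian codec variants are used only so that no byte order mark
is counted.

```python
def encoded_sizes(char: str) -> dict:
    """Return the number of bytes `char` occupies in each Unicode encoding."""
    return {
        "utf-8": len(char.encode("utf-8")),
        # The "-le" variants do not prepend a byte order mark.
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


print(encoded_sizes("A"))           # ASCII, inside the BMP
print(encoded_sizes("中"))          # U+4E2D, inside the BMP
print(encoded_sizes("\U0001D11E"))  # musical G clef, outside the BMP
```

Characters in the BMP take 2 bytes in UTF-16; characters outside it take 4
(a surrogate pair). UTF-32 always takes 4, and UTF-8 takes 1 to 4.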

 >Will UTF-8 make web pages heavier to download?
 >
 >Characters that fall in the 'traditional ASCII' space will use 1 byte
 >per character; this is the same as legacy encodings.
 >
 >Same page weight as for legacy encodings:
 >
 >- HTML markup
 >- English
 >
 >Slightly heavier:
 >
 >- Latin languages
 >
 >Characters, eg, e acute, outside the ASCII range are represented by one
 >byte in ISO-8859-1, but typically two bytes in UTF-8, so a small, but
 >acceptable, increase in page size should be expected.
 >
 >Characters outside the 'traditional ASCII' space, such as those used for
 >Chinese, Arabic or Russian, may use 2 or even 3 bytes; however, legacy
 >Chinese encodings already use more than 1 byte per character.
 >
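
The size comparison above can be checked directly. This is a minimal
sketch in Python 3; the sample characters and the legacy encodings paired
with them are my own choices for illustration, not taken from the draft.

```python
# (character, a legacy encoding that can represent it)
SAMPLES = [
    ("e", "iso-8859-1"),   # ASCII letter
    ("é", "iso-8859-1"),   # e acute
    ("Я", "koi8-r"),       # Cyrillic
    ("ش", "iso-8859-6"),   # Arabic
    ("中", "gb2312"),      # Chinese
]


def byte_lengths(char: str, legacy: str) -> tuple:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for one character."""
    return len(char.encode(legacy)), len(char.encode("utf-8"))


for char, legacy in SAMPLES:
    legacy_len, utf8_len = byte_lengths(char, legacy)
    print(f"U+{ord(char):04X} {legacy}: {legacy_len} byte(s), utf-8: {utf8_len} byte(s)")
```

On where 1-byte encoding stops: UTF-8 uses 1 byte only for U+0000 to
U+007F (ASCII), 2 bytes for U+0080 to U+07FF (which covers accented Latin,
Greek, Cyrillic, Hebrew and Arabic), 3 bytes for the rest of the BMP, and
4 bytes beyond it. So the boundary follows code point ranges, and hence
scripts, rather than languages.
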
 >QUESTION With which languages/scripts does 1-byte encoding stop?
 >QUESTION Should this talk about scripts, rather than languages?
 >
 >
 >Does the software you use to produce your pages support Unicode, including
 >input environment, database, programming languages?
 >QUESTION Are there any server issues?

Of course there are. I think most of the answers you got on your
question a while ago (was that on www-international?) were about
server-side issues.

 >What happens to legacy data? Do you transcode it all or do you build a
 >transcoder into the pipeline?
 >QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy data &
 >upgrading to Unicode question)?

I thought that this was what this FAQ was about. After reading it,
my impression is that it's more a FAQ about Unicode support on
browsers. Maybe the question should be changed to indicate it.
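
For the legacy data question itself, the core step of either approach
(converting everything once, or converting in the pipeline) is the same:
decode the bytes using the known legacy encoding, then encode as UTF-8. A
minimal sketch in Python 3 follows; the function name transcode_to_utf8 is
made up for the example, and it assumes the source encoding is known and
correctly labelled.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes from a known legacy encoding to UTF-8.

    Decoding is strict, so bytes that are invalid in `source_encoding`
    raise UnicodeDecodeError instead of being silently replaced.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


legacy = "café".encode("iso-8859-1")   # 4 bytes in ISO-8859-1
converted = transcode_to_utf8(legacy, "iso-8859-1")
print(len(legacy), len(converted))
```

If the data is mislabelled, the decode step may still succeed but produce
the wrong characters, so it is worth spot-checking the output for each
legacy encoding in use.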

Regards,    Martin.

 >Further reading
 >
 >Unicode Consortium (http://www.unicode.org)
 >Tutorial: Character sets & encodings in XHTML, HTML and CSS
 >(http://www.w3.org/International/tutorials/tutorial-char-enc).
 >FAQ: Who uses Unicode?
 >(http://www.w3.org/International/questions/qa-who-uses-unicode)
 >Unicode & multilingual web browsers
 >(http://www.w3.org/International/questions/qa-who-uses-unicode)
 >Unicode & HTML (http://en.wikipedia.org/wiki/Unicode_and_HTML)
 >
Received on Tuesday, 22 March 2005 09:52:10 GMT
