
RE: WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to Unicode encoding

From: Richard Ishida <ishida@w3.org>
Date: Tue, 22 Mar 2005 18:55:35 -0000
To: "'Martin Duerst'" <duerst@w3.org>, "'Deborah Cawkwell'" <deborah.cawkwell@bbc.co.uk>, <public-i18n-geo@w3.org>
Message-Id: <20050322185534.973924F2C1@homer.w3.org>


Could you please transfer your FAQ and Martin's notes to a wiki asap, and
send a note to the group to get them to subscribe to it.


Richard Ishida

> -----Original Message-----
> From: public-i18n-geo-request@w3.org 
> [mailto:public-i18n-geo-request@w3.org] On Behalf Of Martin Duerst
> Sent: 22 March 2005 07:05
> To: Deborah Cawkwell; public-i18n-geo@w3.org
> Subject: Re: WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy
> encoding to Unicode encoding
>
> Hello Deborah,
> Great work! Some comments below:
> At 07:26 05/03/22, Deborah Cawkwell wrote:
>  >
>  >WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to
>  >Unicode encoding
>  >Question: What should you consider when upgrading to Unicode?
>  >
>  >
>  >Background
>  >
>  >Numerous large organizations are beginning to switch to Unicode. See FAQ:
>  >Who uses Unicode?
>  >(http://www.w3.org/International/questions/qa-who-uses-unicode).
>  >
>  >You have heard that using Unicode is a good idea and that there are
>  >benefits such as multilingual display and standards compatibility. However,
>  >you are not sure what's involved and whether it will work for your site.
>  >
>  >This FAQ will attempt to list some of the considerations you would need to
>  >take into account.
> Maybe be even more explicit here that this is only a start!
>  >What to consider
>  >
>  >
>  >Character encoding declaration
>  >
>  >When using Unicode, the encoding should be specified (as with legacy
>  >encodings) in the HTTP Content-Type header (eg, Content-Type: text/html;
>  >charset=utf-8) and HTML head (eg, <meta http-equiv="Content-Type"
>  >content="text/html; charset=utf-8"/>). See Tutorial: Character sets &
>  >encodings in XHTML, HTML and CSS
>  >(http://www.w3.org/International/tutorials/tutorial-char-enc).
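A rough sketch of the declaration paragraph above, in Python. The helper names (content_type_header, meta_declaration, render_page) are invented for illustration and are not part of the FAQ draft; the idea is simply to generate both declarations from one charset value and to encode the page with that same value, so that the three cannot drift apart.

```python
# Hypothetical helpers: keep the HTTP header, the meta element and the
# actual byte encoding of the page in agreement.

def content_type_header(charset="utf-8"):
    """Value for the HTTP Content-Type header of an HTML page."""
    return f"text/html; charset={charset}"


def meta_declaration(charset="utf-8"):
    """XHTML-style meta element carrying the same declaration."""
    return f'<meta http-equiv="Content-Type" content="{content_type_header(charset)}"/>'


def render_page(body, charset="utf-8"):
    """Return (headers, payload bytes), with the payload encoded as declared."""
    html = (
        "<html><head>" + meta_declaration(charset) + "</head>"
        "<body>" + body + "</body></html>"
    )
    headers = {"Content-Type": content_type_header(charset)}
    return headers, html.encode(charset)


# "caf\u00e9" is "cafe" with an e acute, written as an escape to keep this
# listing ASCII-only.
headers, payload = render_page("caf\u00e9")
print(headers["Content-Type"])
print(payload)
```

A real site would set the header in its server or framework configuration; the point here is only that the declared charset and the bytes sent should come from the same setting.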
>  >
>  >
>  >Text rendering
>  >
>  >Most recently released web browsers are able to display content encoded
>  >using Unicode if a suitable font which supports Unicode is available to the
>  >system.
>  >
>  >Fonts which support Unicode are now commonly available, both commercial and
>  >open source, examples being TrueType and the more recent OpenType, which
>  >both support Unicode. These font formats provide a mapping from Unicode
>  >codepoints to the graphical representation of characters, i.e. glyphs.
> Please make sure here that people don't get the impression that usually a
> single font covers all of Unicode. You are almost there, but maybe need some
> tweaks, or an additional sentence, such as "Applications such as browsers
> usually cover Unicode by using several fonts for different scripts and
> ranges."
>  >If using a legacy encoding, ie, a non-Unicode encoding, eg
> i.e./e.g. looks repetitious.
>  >ISO-8859-1/windows-XXXX, then an operating system or browser either has a
>  >font installed for that encoding or it doesn't, therefore either the page
>  >displays correctly or no characters display (question marks). With Unicode,
>  >the operating system or browser has fonts for some, but not all, of the
>  >codepoints, so when displaying a Unicode page, it's not unusual to have
>  >some of the characters display correctly whilst others don't (empty
>  >rectangles) because the browser has access to fonts for some of the
>  >codepoints but not all.
>  >
>  >For complex scripts such as Arabic and Thai, rules need to be applied to
>  >transform the underlying character sequence to the appropriate glyphs for
>  >display. Middle Eastern languages also need support for directionality.
>  >Fonts, particularly OpenType fonts, often contain information about the
>  >shaping transformations required, but usually some operating system level
>  >support is needed to help ensure the correct output from multilingual text
>  >rendering engines; these are usually bundled with either the operating
>  >system or with the browser.
> This is basically explaining why one doesn't need to be concerned about
> this issue, yes? If so, it probably can be shorter, and say something like
> "most browsers these days also support shaping and bidirectional display
> for ... or use the support provided by the operating system."
> Maybe mention here that some mobile phones don't yet support
> UTF-8 (but some do, although with a limited range of characters).
>  >Multilingual Text Rendering Engines
>  >
>  >- Windows: Uniscribe
>  >- Macintosh: Apple Type Services for Unicode Imaging, which replaced the
>  >  WorldScript engine for legacy encodings.
>  >- Pango (open source)
>  >- Graphite (open source renderer from SIL)
> What's this for? My assumption is that the reader wants to 
> convert Web pages and use existing browsers, not to implement 
> a browser, yes?
>  >QUESTION On the web, does it help to provide generic font family fallbacks
>  >in CSS, such as serif, sans-serif, etc? Example:
>  >body {font-family:Verdana,Arial,Helvetica,sans-serif;}
> My recollection is that this depends on the browser. This 
> should probably be in a separate doc (technique or FAQ).
>  >Which Unicode encoding?
>  >
>  >UTF-8 is the Unicode encoding most commonly used for web pages. Unicode has
>  >essentially three encodings: UTF-8, UTF-16, UTF-32.
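To make the three encodings named above concrete, here is a small Python illustration. The sample characters and the encoded_sizes helper are chosen for this sketch only; the "-be" codec variants are used so that Python does not prepend a byte order mark to the output.

```python
# Illustration only: how many bytes the same character needs in each of
# the three Unicode encodings.
samples = {
    "Latin capital A": "A",            # U+0041, ASCII range
    "e acute": "\u00e9",               # U+00E9
    "CJK ideograph": "\u4e2d",         # U+4E2D, inside the BMP
    "musical G clef": "\U0001d11e",    # U+1D11E, outside the BMP
}


def encoded_sizes(ch):
    """Byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for name, ch in samples.items():
    print(f"{name:16} U+{ord(ch):04X} {encoded_sizes(ch)}")
```

UTF-8 varies from one to four bytes per character, UTF-16 uses two bytes inside the BMP and four outside it, and UTF-32 always uses four.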
>  >
>  >QUESTION Should the Basic Multilingual Plane (BMP) & byte representation be
>  >mentioned here?
> I don't understand the question. Are BMP and byte 
> representation two separate issues, or the same issue?
>  >Will UTF-8 make web pages heavier to download?
>  >
>  >Characters that fall in the 'traditional ASCII' space will use 1 byte
>  >per character; this is the same as legacy encodings.
>  >
>  >Same page weight as for legacy encodings:
>  >
>  >- HTML markup
>  >- English
>  >
>  >Slightly heavier:
>  >
>  >- Latin languages
>  >
>  >Characters, eg, e acute, outside the ASCII range are represented by one
>  >byte in ISO-8859-1, but typically two bytes in UTF-8, so a small, but
>  >acceptable, increase in page size should be expected.
>  >
>  >Characters that do not fall into the 'traditional ASCII' space such as
>  >Chinese, Arabic, Russian may use 2 or even 3 bytes; however, legacy
>  >Chinese encodings already use more than 1 byte per character.
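The page-weight comparison above can be checked with a few lines of Python. This is only a sketch: the sample strings, the page_weight helper and the choice of legacy encodings (ISO-8859-1, KOI8-R, GB2312) are assumptions made for illustration, not figures from the FAQ draft.

```python
def page_weight(text, legacy_encoding):
    """Return (bytes in the legacy encoding, bytes in UTF-8) for the same text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))


# Sample fragments, written with escapes to keep this listing ASCII-only.
english = "<p>Hello, world</p>"
french = "<p>\u00e9t\u00e9 pr\u00e9c\u00e9dent</p>"              # four e acutes
russian = "<p>\u043f\u0440\u0438\u0432\u0435\u0442</p>"          # six Cyrillic letters
chinese = "<p>\u4e2d\u6587</p>"                                  # two ideographs

for label, text, legacy in [
    ("English", english, "iso-8859-1"),
    ("French", french, "iso-8859-1"),
    ("Russian", russian, "koi8-r"),
    ("Chinese", chinese, "gb2312"),
]:
    legacy_size, utf8_size = page_weight(text, legacy)
    print(f"{label:8} {legacy}: {legacy_size} bytes, utf-8: {utf8_size} bytes")
```

Markup and English text come out the same size in both encodings, the French sample grows by one byte per accented letter, and the Cyrillic and Chinese samples grow by one byte per non-ASCII character. On a real page the markup, which stays at one byte per character, dilutes the increase.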
>  >
>  >QUESTION With which languages/scripts does 1-byte encoding stop?
>  >QUESTION Should this talk about scripts, rather than languages?
>  >
>  >
>  >Does the software you use to produce your pages support Unicode, including
>  >input environment, database, programming languages?
>  >QUESTION Are there any server issues?
> Of course there are. I think most of the answers you got on 
> your question a while ago (was that on www-international?) 
> were about server-side issues.
>  >What happens to legacy data? Do you transcode it all or do you build a
>  >transcoder into the pipeline?
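The two options in the question above might look roughly like this in Python. Both functions are hypothetical sketches, and they assume the legacy encoding of the data is already known; detecting an unknown encoding is a separate problem that is not addressed here.

```python
import codecs


def transcode_to_utf8(data, legacy_encoding):
    """One-off conversion: decode the legacy bytes, re-encode as UTF-8.

    Decoding is strict by default, so mislabelled data raises
    UnicodeDecodeError instead of being silently corrupted.
    """
    return data.decode(legacy_encoding).encode("utf-8")


def transcode_stream(chunks, legacy_encoding):
    """Pipeline variant: convert an iterable of byte chunks on the fly.

    An incremental decoder is used so that a multi-byte character split
    across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(legacy_encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")
    # Flush; raises if the input ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")


legacy_bytes = "\u00e9t\u00e9".encode("iso-8859-1")
print(transcode_to_utf8(legacy_bytes, "iso-8859-1"))

sjis = "\u65e5\u672c".encode("shift_jis")
print(b"".join(transcode_stream([sjis[:1], sjis[1:]], "shift_jis")))
```

Converting everything once keeps the serving path simple, while the streaming form suits a pipeline where the stored data has to stay in its legacy encoding for now.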
>  >QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy data &
>  >upgrading to Unicode question)?
> I thought that this was what this FAQ was about. After 
> reading it, my impression is that it's more a FAQ about 
> Unicode support on browsers. Maybe the question should be 
> changed to indicate it.
> Regards,    Martin.
>  >Further reading
>  >
>  >Unicode Consortium (http://www.unicode.org)
>  >Tutorial: Character sets & encodings in XHTML, HTML and CSS 
>  >(http://www.w3.org/International/tutorials/tutorial-char-enc).
>  >FAQ: Who uses Unicode?
>  >(http://www.w3.org/International/questions/qa-who-uses-unicode)
>  >Unicode & multilingual web browsers
>  >(http://www.w3.org/International/questions/qa-who-uses-unicode)
>  >Unicode & HTML (http://en.wikipedia.org/wiki/Unicode_and_HTML)
>  >
Received on Tuesday, 22 March 2005 18:55:43 UTC
