- From: Richard Ishida <ishida@w3.org>
- Date: Tue, 22 Mar 2005 18:55:35 -0000
- To: "'Martin Duerst'" <duerst@w3.org>, "'Deborah Cawkwell'" <deborah.cawkwell@bbc.co.uk>, <public-i18n-geo@w3.org>
Deborah,
Could you please transfer your FAQ and Martin's notes to a wiki asap, and
send a note to the group to get them to subscribe to it.
Thanks,
RI
============
Richard Ishida
W3C
contact info:
http://www.w3.org/People/Ishida/
W3C Internationalization:
http://www.w3.org/International/
Publication blog:
http://people.w3.org/rishida/blog/
> -----Original Message-----
> From: public-i18n-geo-request@w3.org
> [mailto:public-i18n-geo-request@w3.org] On Behalf Of Martin Duerst
> Sent: 22 March 2005 07:05
> To: Deborah Cawkwell; public-i18n-geo@w3.org
> Subject: Re: WORK IN PROGRESS: FAQ: Upgrading from
> language-specific legacy encoding to Unicode encoding
>
>
> Hello Deborah,
>
> Great work! some comments below:
>
> At 07:26 05/03/22, Deborah Cawkwell wrote:
> >
> >WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy
> >encoding to Unicode encoding
> >Question: What should you consider when upgrading to Unicode?
> >
> >
> >Background
> >
> >Numerous large organizations are beginning to switch to Unicode.
> >See FAQ: Who uses Unicode?
> >(http://www.w3.org/International/questions/qa-who-uses-unicode).
> >
> >You have heard that using Unicode is a good idea and that there are
> >benefits such as multilingual display and standards compatibility.
> >However, you are not sure what's involved and whether it will work
> >for your site.
> >
> >This FAQ will attempt to list some of the considerations you would
> >need to take into account.
>
> Maybe be even more explicit here that this is only a start!
>
> >What to consider
> >
> >
> >Character encoding declaration
> >
> >When using Unicode, the encoding should be specified (as with legacy
> >encodings) in the HTTP header Content-Type (eg, Content-Type:
> >text/html; charset=utf-8) and in the HTML head (eg,
> ><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />).
> >See Tutorial: Character sets & encodings in XHTML, HTML and CSS
> >(http://www.w3.org/International/tutorials/tutorial-char-enc).
> >
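A minimal sketch of the two declarations described above, in Python.
The helper names (content_type_header, meta_declaration,
build_response) are hypothetical and not taken from the FAQ. The only
point illustrated is that the HTTP header, the meta element, and the
bytes actually sent should all agree on the same encoding.

```python
# Hypothetical helpers; the names are illustrative, not from the FAQ.
CHARSET = "utf-8"


def content_type_header(charset: str = CHARSET) -> str:
    """Return the HTTP header line that declares the page encoding."""
    return f"Content-Type: text/html; charset={charset}"


def meta_declaration(charset: str = CHARSET) -> str:
    """Return the in-document declaration for the HTML head."""
    return (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={charset}" />'
    )


def build_response(body_text: str, charset: str = CHARSET):
    """Return (header line, page bytes) with matching encoding labels.

    body_text is assumed to be already HTML-escaped.
    """
    page = (
        "<html><head>"
        + meta_declaration(charset)
        + "</head><body>"
        + body_text
        + "</body></html>"
    )
    # The bytes on the wire must really be in the declared encoding.
    return content_type_header(charset), page.encode(charset)


header, payload = build_response("caf\u00e9")
print(header)
print(payload)
```

If the label and the actual bytes disagree (for example, a utf-8 label
on ISO-8859-1 bytes), browsers will typically show garbled characters,
so it seems safest to derive both from a single setting.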
> >
> >Text rendering
> >
> >Most recently released web browsers are able to display content
> >encoded using Unicode if a suitable font which supports Unicode is
> >available to the system.
> >
> >Fonts which support Unicode are now commonly available, both
> >commercial and open source, examples being fonts in the TrueType
> >and the more recent OpenType formats, both of which support Unicode.
> >Such fonts provide a mapping from Unicode codepoints to the
> >graphical representation of characters, i.e. glyphs.
>
> please make sure here that people don't get the impression
> that usually a single font covers all of Unicode. You are
> almost there, but maybe need some tweaks, or an additional
> sentence, such as "Applications such as browsers usually
> cover Unicode by using several fonts for different scripts
> and ranges."
>
> >If using a legacy encoding, ie, a non-Unicode encoding, eg
>
> i.e./e.g. looks repetitious.
>
> >ISO-8859-1/windows-XXXX, then an operating system or browser either
> >has a font installed for that encoding or it doesn't, therefore
> >either the page displays correctly or no characters display
> >(question marks). With Unicode, the operating system or browser has
> >fonts for some, but not all, of the codepoints, so when displaying a
> >Unicode page, it's not unusual to have some of the characters
> >display correctly whilst others don't (empty rectangles).
> >
> >For complex scripts such as Arabic and Thai, rules need to be
> >applied to transform the underlying character sequence to the
> >appropriate glyphs for display. Middle Eastern languages also need
> >support for directionality.
> >Fonts, particularly OpenType fonts, often contain information about
> >the shaping transformations required, but usually some operating
> >system level support is needed to help ensure the correct output
> >from multilingual text rendering engines; these are usually bundled
> >with either the operating system or with the browser.
>
> This is basically explaining why one doesn't need to be
> concerned about this issue, yes? If so, it probably can be
> shorter, and say something like "most browsers these days also
> support shaping and bidirectional display for ... or use the
> support provided by the operating system."
>
> Maybe mention here that some mobile phones don't yet support
> UTF-8 (but some do, although with a limited range of characters).
>
> >Multilingual Text Rendering Engines
> >
> >- Windows: Uniscribe
> >- Macintosh: Apple Type Services for Unicode Imaging, which
> >  replaced the WorldScript engine for legacy encodings.
> >- Pango - open source
> >- Graphite - (open source renderer from SIL)
>
> What's this for? My assumption is that the reader wants to
> convert Web pages and use existing browsers, not to implement
> a browser, yes?
>
> >QUESTION On the web, does it help to provide generic font family
> >fallbacks in CSS, such as serif, sans-serif, etc? Example:
> >body {font-family:Verdana,Arial,Helvetica,sans-serif;}
>
> My recollection is that this depends on the browser. This
> should probably be in a separate doc (technique or FAQ).
>
> >Which Unicode encoding?
> >
> >UTF-8 is the Unicode encoding most commonly used for web pages.
> >Unicode has essentially three encodings: UTF-8, UTF-16, UTF-32.
> >
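To make the three encodings concrete, here is a small Python sketch
that encodes the same character in each form and counts the bytes. The
function name encoded_sizes and the sample characters are illustrative
choices, not from the FAQ. The last sample lies outside the Basic
Multilingual Plane, which is where UTF-16 needs a surrogate pair.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding form."""
    # The -le codec variants are used so that no byte order mark is
    # added to the UTF-16 and UTF-32 output.
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


# ASCII letter, e acute, a Chinese character, and U+1D11E (outside the BMP).
for sample in ("A", "\u00e9", "\u4e2d", "\U0001d11e"):
    print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```

UTF-32 is always 4 bytes per character, UTF-16 is 2 bytes inside the
BMP and 4 outside it, and UTF-8 varies from 1 to 4 bytes.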
> >QUESTION Should the Basic Multilingual Plane (BMP) & byte
> >representation be mentioned here?
>
> I don't understand the question. Are BMP and byte
> representation two separate issues, or the same issue?
>
> >Will UTF-8 make web pages heavier to download?
> >
> >Characters that fall in the 'traditional ASCII' space will use 1
> >byte per character; this is the same as legacy encodings.
> >
> >Same page weight as for legacy encodings:
> >
> >- HTML markup
> >- English
> >
> >Slightly heavier
> >- Latin languages
> >
> >Characters, eg, e acute, outside the ASCII range are represented by
> >one byte in ISO-8859-1, but typically two bytes in UTF-8, so a
> >small, but acceptable, increase in page size should be expected.
> >
> >Characters outside the 'traditional ASCII' space, such as those
> >used for Chinese, Arabic or Russian, may use 2 or even 3 bytes in
> >UTF-8. However, legacy Chinese encodings already use more than 1
> >byte per character.
> >
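The byte counts above can be checked with a short Python sketch. In
UTF-8, only U+0000 to U+007F take 1 byte; U+0080 to U+07FF take 2, the
rest of the BMP takes 3, and characters beyond the BMP take 4. The
legacy encodings paired with each sample below (KOI8-R for Russian,
ISO-8859-6 for Arabic, GB2312 for Chinese) are assumptions chosen for
illustration; a given site may use others.

```python
# (character, assumed legacy encoding) pairs; the pairings are illustrative.
SAMPLES = [
    ("a", "iso-8859-1"),       # ASCII letter
    ("\u00e9", "iso-8859-1"),  # e acute
    ("\u042f", "koi8-r"),      # Cyrillic capital letter YA
    ("\u0628", "iso-8859-6"),  # Arabic letter BEH
    ("\u4e2d", "gb2312"),      # a common Chinese character
]


def byte_counts(char: str, legacy: str) -> tuple:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for one character."""
    return len(char.encode(legacy)), len(char.encode("utf-8"))


for char, legacy in SAMPLES:
    legacy_len, utf8_len = byte_counts(char, legacy)
    print(f"U+{ord(char):04X} {legacy}: {legacy_len} byte(s), utf-8: {utf8_len} byte(s)")
```

Since markup is ASCII, the overall growth of a page depends on the
ratio of markup to non-ASCII text, not only on the script used.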
> >QUESTION With which languages/scripts does 1-byte encoding stop?
> >QUESTION Should this talk about scripts, rather than languages?
> >
> >
> >Does the software you use to produce your pages support Unicode,
> >including input environment, database, programming languages?
> >QUESTION Are there any server issues?
>
> Of course there are. I think most of the answers you got on
> your question a while ago (was that on www-international?)
> were about server-side issues.
>
> >What happens to legacy data? Do you transcode it all or do you build
> >a transcoder into the pipeline?
> >QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy
> >data & upgrading to Unicode question)?
>
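For the legacy data question, the core transcoding step can be
sketched in Python as below. The function name transcode_to_utf8 is
hypothetical, and the sketch assumes the source encoding of each piece
of data is already known, which in practice is often the hard part.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode legacy bytes with their known encoding and re-encode as UTF-8."""
    # Decoding is strict by default: bytes that are invalid in
    # source_encoding raise UnicodeDecodeError instead of being
    # silently replaced.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# "cafe" with e acute, stored as ISO-8859-1.
legacy_bytes = b"caf\xe9"
print(transcode_to_utf8(legacy_bytes, "iso-8859-1"))

# The same byte value can mean different characters in different
# legacy encodings: 0x80 is the euro sign in windows-1252.
print(transcode_to_utf8(b"\x80", "windows-1252"))
```

The same function could be run once over stored data or called inside
the publishing pipeline. Note that a wrong source label does not always
raise an error (every byte is valid in ISO-8859-1), so spot checks of
the output are advisable.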
> I thought that this was what this FAQ was about. After
> reading it, my impression is that it's more a FAQ about
> Unicode support on browsers. Maybe the question should be
> changed to indicate it.
>
> Regards, Martin.
>
> >Further reading
> >
> >Unicode Consortium (http://www.unicode.org)
> >Tutorial: Character sets & encodings in XHTML, HTML and CSS
> >(http://www.w3.org/International/tutorials/tutorial-char-enc).
> >FAQ: Who uses Unicode?
> >(http://www.w3.org/International/questions/qa-who-uses-unicode)
> >Unicode & multilingual web browsers
> >(http://www.w3.org/International/questions/qa-who-uses-unicode)
> >Unicode & HTML (http://en.wikipedia.org/wiki/Unicode_and_HTML)
> >
>
>
Received on Tuesday, 22 March 2005 18:55:43 UTC