- From: Richard Ishida <ishida@w3.org>
- Date: Tue, 22 Mar 2005 18:55:35 -0000
- To: "'Martin Duerst'" <duerst@w3.org>, "'Deborah Cawkwell'" <deborah.cawkwell@bbc.co.uk>, <public-i18n-geo@w3.org>
Deborah,

Could you please transfer your FAQ and Martin's notes to a wiki asap, and send a note to the group to get them to subscribe to it.

Thanks,
RI

============
Richard Ishida
W3C

contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/

> -----Original Message-----
> From: public-i18n-geo-request@w3.org
> [mailto:public-i18n-geo-request@w3.org] On Behalf Of Martin Duerst
> Sent: 22 March 2005 07:05
> To: Deborah Cawkwell; public-i18n-geo@w3.org
> Subject: Re: WORK IN PROGRESS: FAQ: Upgrading from
> language-specific legacy encoding to Unicode encoding
>
> Hello Deborah,
>
> Great work! some comments below:
>
> At 07:26 05/03/22, Deborah Cawkwell wrote:
> >
> >WORK IN PROGRESS: FAQ: Upgrading from language-specific legacy encoding to
> >Unicode encoding
> >
> >Question: What should you consider when upgrading to Unicode?
> >
> >Background
> >
> >Numerous large organizations are beginning to switch to Unicode. See FAQ:
> >Who uses Unicode?
> >(http://www.w3.org/International/questions/qa-who-uses-unicode).
> >
> >You have heard that using Unicode is a good idea and that there are
> >benefits such as multilingual display and standards compatibility. However,
> >you are not sure what's involved and whether it will work for your site.
> >
> >This FAQ will attempt to list some of the considerations you would need to
> >take into account.
>
> Maybe be even more explicit here that this is only a start!
>
> >What to consider
> >
> >Character encoding declaration
> >
> >When using Unicode, the encoding should be specified (as with legacy
> >encodings) in the HTTP header content-type (eg, Content-Type: text/html;
> >charset=utf-8) and HTML head (eg, <meta http-equiv="Content-Type"
> >content="text/html; charset=utf-8"/>). See Tutorial: Character sets &
> >encodings in XHTML, HTML and CSS
> >(http://www.w3.org/International/tutorials/tutorial-char-enc).
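
As a rough illustration of the two declarations described in the quoted FAQ text, the following Python sketch reads the charset from a Content-Type header value and from a meta element, so the two can be compared. The helper names (charset_from_header, MetaCharsetParser, charset_from_meta) are hypothetical and are not part of the FAQ; only the Python standard library is used, and real pages may need more robust handling.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Records the charset from the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta element is considered.
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_header(attr.get("content") or "")


def charset_from_meta(html_text):
    """Return the charset declared in the HTML head, or None."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


if __name__ == "__main__":
    header = "text/html; charset=utf-8"
    page = ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=utf-8"/></head><body></body></html>')
    # The two declarations should name the same encoding.
    print(charset_from_header(header), charset_from_meta(page))
```

A mismatch between the two values is worth flagging in a conversion project, since the page and the server would then disagree about the encoding.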
> >
> >Text rendering
> >
> >Most recently released web browsers are able to display content encoded
> >using Unicode if a suitable font which supports Unicode is available to the
> >system.
> >
> >Fonts which support Unicode are now commonly available, both commercial and
> >open source, examples being TrueType and the more recent OpenType, which
> >both support Unicode. These font families provide a mapping from Unicode
> >codepoints to the graphical representation of characters, i.e. glyphs.
>
> please make sure here that people don't get the impression
> that usually a single font covers all of Unicode. You are
> almost there, but maybe need some tweaks, or an additional
> sentence, such as "Applications such as browsers usually
> cover Unicode by using several fonts for different scripts
> and ranges."
>
> >If using a legacy encoding, ie, a non-Unicode encoding, eg
>
> i.e./e.g. looks repetitious.
>
> >ISO-8859-1/windows-XXXX, then an operating system or browser either has a
> >font installed for that encoding or it doesn't, therefore either the page
> >displays correctly or no characters display (question marks). With Unicode,
> >the operating system or browser has fonts for some, but not all, of the
> >codepoints, so when displaying a Unicode page, it's not unusual to have
> >some of the characters display correctly whilst others don't (empty
> >rectangles) because the browser has access to fonts for some of the
> >codepoints but not all.
> >
> >For complex scripts such as Arabic and Thai, rules need to be applied to
> >transform the underlying character sequence to the appropriate glyphs for
> >display. Middle Eastern languages also need support for directionality.
> >Fonts, particularly OpenType fonts, often contain information about the
> >shaping transformations required, but usually some operating system level
> >support is needed to help ensure the correct output from multilingual text
> >rendering engines; these are usually bundled with either the operating
> >system or with the browser.
>
> This is basically explaining why one doesn't need to be
> concerned about this issue, yes? If so, it probably can be
> shorter, and say something like "most browsers these days also
> support shaping and bidirectional display for ... or use the
> support provided by the operating system."
>
> Maybe mention here that some mobile phones don't yet support
> UTF-8 (but some do, although with a limited range of characters).
>
> >Multilingual Text Rendering Engines
> >
> >- Windows: Uniscribe
> >- Macintosh: Apple Type Services for Unicode Imaging, which replaced the
> >WorldScript engine for legacy encodings.
> >- Pango - open source
> >- Graphite - (open source renderer from SIL)
>
> What's this for? My assumption is that the reader wants to
> convert Web pages and use existing browsers, not to implement
> a browser, yes?
>
> >QUESTION On the web, does it help to provide generic font family fallbacks
> >in CSS, such as serif, sans-serif, etc? Example:
> >body {font-family:Verdana,Arial,Helvetica,sans-serif;}
>
> My recollection is that this depends on the browser. This
> should probably be in a separate doc (technique or FAQ).
>
> >Which Unicode encoding?
> >
> >UTF-8 is the Unicode encoding most commonly used for web pages. Unicode has
> >essentially three encodings: UTF-8, UTF-16, UTF-32.
> >
> >QUESTION Should the Basic Multilingual Plane (BMP) & byte representation be
> >mentioned here?
>
> I don't understand the question. Are BMP and byte
> representation two separate issues, or the same issue?
>
> >Will UTF-8 make web pages heavier to download?
> >
> >Characters that fall in the 'traditional ASCII' space will use 1 byte
> >per character; this is the same as legacy encodings.
> >
> >Same page weight as for legacy encodings:
> >
> >- HTML markup
> >- English
> >
> >Slightly heavier
> >- Latin languages
> >
> >Characters, eg, e acute, outside the ASCII range are represented by one
> >byte in ISO-8859-1, but typically two bytes in UTF-8, so a small, but
> >acceptable, increase in page size should be expected.
> >
> >Characters that do not fall into the 'traditional ASCII' space such as
> >Chinese, Arabic, Russian may use 2 or even 3 bytes, however, Chinese
> >encodings already use more than 1 byte per character with legacy encodings.
> >
> >QUESTION With which languages/scripts does 1-byte encoding stop?
> >QUESTION Should this talk about scripts, rather than languages?
> >
> >Does the software you use to produce your pages support Unicode, including
> >input environment, database, programming languages?
> >QUESTION Are there any server issues?
>
> Of course there are. I think most of the answers you got on
> your question a while ago (was that on www-international?)
> were about server-side issues.
>
> >What happens to legacy data? Do you transcode it all or do you build a
> >transcoder into the pipeline?
> >QUESTION Maybe another FAQ (based on the I18N IG responses to Legacy data &
> >upgrading to Unicode question)?
>
> I thought that this was what this FAQ was about. After
> reading it, my impression is that it's more a FAQ about
> Unicode support on browsers. Maybe the question should be
> changed to indicate it.
>
> Regards, Martin.
>
> >Further reading
> >
> >Unicode Consortium (http://www.unicode.org)
> >Tutorial: Character sets & encodings in XHTML, HTML and CSS
> >(http://www.w3.org/International/tutorials/tutorial-char-enc).
> >FAQ: Who uses Unicode?
> >(http://www.w3.org/International/questions/qa-who-uses-unicode)
> >Unicode & multilingual web browsers
> >(http://www.w3.org/International/questions/qa-who-uses-unicode)
> >Unicode & HTML (http://en.wikipedia.org/wiki/Unicode_and_HTML)
> >
> >http://www.bbc.co.uk/
> >
> >This e-mail (and any attachments) is confidential and may contain
> >personal views which are not the views of the BBC unless specifically
> >stated.
> >If you have received it in error, please delete it from your system.
> >Do not use, copy or disclose the information in any way nor act in
> >reliance on it and notify the sender immediately. Please note that the
> >BBC monitors e-mails sent or received.
> >Further communication will signify your consent to this.
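
The page-weight points in the quoted FAQ (one byte for ASCII, two for e acute, two or three for other scripts) can be checked with a few lines of Python. This is only a sketch: the function name byte_sizes and the sample strings are made up for illustration, and the choice of cp1251 and gb2312 as the legacy encodings for Russian and Chinese is an assumption, since the FAQ names only ISO-8859-1 and windows-XXXX.

```python
def byte_sizes(text, legacy_encoding):
    """Return (bytes in the legacy encoding, bytes in UTF-8) for text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))


# Hypothetical samples; the legacy encodings for Russian and Chinese
# are assumptions chosen for illustration.
SAMPLES = [
    ("HTML markup and English", "<p>Hello</p>", "iso-8859-1"),
    ("Latin with e acute", "caf\u00e9", "iso-8859-1"),
    ("Russian", "\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251"),
    ("Chinese", "\u4e2d\u6587", "gb2312"),
]

if __name__ == "__main__":
    for label, text, legacy in SAMPLES:
        legacy_len, utf8_len = byte_sizes(text, legacy)
        print(f"{label}: {legacy} {legacy_len} bytes, UTF-8 {utf8_len} bytes")
```

For a real page most of the bytes are usually markup, which stays at one byte per character, so the overall increase tends to be smaller than these per-string figures suggest.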
Received on Tuesday, 22 March 2005 18:55:43 UTC
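
On the legacy data question raised in the quoted FAQ (transcode everything once, or build a transcoder into the pipeline), both options come down to the same basic step: decode the bytes using the known legacy encoding, then encode them as UTF-8. A minimal Python sketch follows, assuming the legacy encoding of each file is already known; the function names and the file paths passed in are hypothetical.

```python
def transcode_bytes(data, legacy_encoding):
    """Decode bytes from a known legacy encoding and re-encode as UTF-8.

    Errors are strict on purpose: a wrong guess about the legacy
    encoding should fail loudly rather than silently corrupt text.
    """
    return data.decode(legacy_encoding, errors="strict").encode("utf-8")


def transcode_file(src_path, dst_path, legacy_encoding):
    """One-off conversion of a single file (paths are hypothetical)."""
    with open(src_path, "rb") as src:
        converted = transcode_bytes(src.read(), legacy_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(converted)


if __name__ == "__main__":
    # "caf" followed by e acute, as stored in ISO-8859-1 (one byte, 0xE9).
    legacy = b"caf\xe9"
    print(transcode_bytes(legacy, "iso-8859-1"))
```

Note that this only converts the bytes. Any encoding declaration inside the file, such as a meta element, would still need to be updated separately.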