- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Wed, 09 Apr 2008 18:26:27 -0700
- To: I18N <www-international@w3.org>
All,

Several email threads have appeared on this list with regard to the Unicode Migration article currently up for review [1]. This email lists my specific responses (as the editor). I've sent the updated document to Richard for posting tomorrow.

1. David Clarke (and others) noted that I failed to remove a comment by Richard from the section on the BOM. I removed the comment. Further, I added text responding to the comment.

2. David Clarke: a typo involving the entity for the Euro character. The entity was generated by an editor tool somewhere. I fixed it.

3. Frank Ellermann wrote a number of comments in Msg076. Responses follow:

a. s/might be/are/ DONE.

b. Add "or its proper subset US-ASCII", because that avoids any potential problems with a text/xml Content-Type. NOT DONE. I did add the note about Content-Type defaulting to US-ASCII. However, the point here is to use UTF-8 and NOT some other encoding; character entities are less desirable than real characters. Said warning reads:

Note that the HTTP Content-Type <code>text/xml</code> defaults to US-ASCII (for this reason, <code>application/*+xml</code> types are usually preferred) and you'll still need to specify the charset if you use the <code>text/xml</code> Content-Type.

c. HTTP or MIME. DONE.

d. Regarding external and internal encoding declarations, Frank wrote:
--
| the external encoding specification may duplicate one that's part
| of the byte sequence - that's a good thing

Dubious, it can be a pain when the info differs. Maybe "usually a good thing" or similar (often, generally, typically, dunno, but definitely not always).
--
... but only when the info differs. When the info *duplicates* one that's part of the byte sequence--that's a good thing. Announcing the encoding in the file is good because off-line tools can still make sense of the byte sequence.
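As an editor's aside (not part of the article text), the in-file/in-protocol duplication can be sketched in a few lines of Python; the header and document values below are invented for illustration:

```python
import re

# Sketch: the *external* declaration (HTTP Content-Type charset) and the
# *internal* one (XML prolog encoding pseudo-attribute) announce the same
# encoding, so network software and off-line tools agree.
http_header = "Content-Type: application/xml; charset=utf-8"
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?>\n<doc>caf\xc3\xa9</doc>'

# External declaration: the charset parameter of the Content-Type.
external = http_header.split("charset=")[1].strip().lower()

# Internal declaration: the encoding pseudo-attribute in the prolog.
internal = re.search(rb'encoding="([^"]+)"', xml_bytes).group(1).decode("ascii").lower()

# When the two duplicate each other, everyone agrees; when they differ,
# the protocol value usually takes precedence.
print(external == internal)  # True: the declarations duplicate each other
```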
Announcing it in the protocol is good because often this takes precedence (or other Bad Things happen if you don't set your server to emit the *correct* encoding declaration---like when it emits the wrong one).

e. Frank writes:
--
| users commonly change the browser encoding

Why would they still do this? This sounds as if written in 1996.
--
(laughing) In my day job users often do this---sometimes they are even obliged to do this in order to use a particular Web property (my day job--not to mention this article--would be to eliminate the need for them to do it, nu?). Either way, they do change it (and sometimes they do it by setting their browser default). However, I did tone down "commonly" to "sometimes". Besides: I see this question on various Web programming lists all the time.

f. Frank writes:
--
| such as UTF-8, EUC-KR, ISO 2022-JP

US-ASCII is another prominent example allowing validation.
--
Yes, but it is a special case. Ignoring EBCDIC for a moment, a pure ASCII sequence with no control characters validly matches nearly all multibyte encodings. Detecting ASCII is sometimes useful, but not necessarily as an example here.

g. Permalink to emoji: DONE.

h. Frank asked:
--
| ISO-8859-1 | Western European | 10% | 100% |

Interesting, where did you find 10% as a "typical expansion"?
--
It's a rough approximation. Actual expansion amounts depend heavily on the language and the particular source text. In fact, most languages experience much smaller expansion. For example, "Was ist Unicode?" on the Unicode site expands by less than 2%. So... I changed the intro to the table to say:
--
The exact amount of expansion depends on the language and particular text involved. Expansions for some common encodings might be as much as:
--

i. Frank writes:
--
| representing completely different character sets from ASCII.

Maybe s/completely/completely or slightly/ for the old ISO 646 variants of US-ASCII. What's an example for "completely"?
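Another editor's aside, illustrating the "special case" point in (f): using Python's standard-library codec names, a pure US-ASCII byte sequence decodes validly under nearly any encoding (so detecting it tells you little), EBCDIC really is the odd one out, and genuinely non-ASCII bytes are what make validation meaningful:

```python
# A pure US-ASCII byte sequence "validates" under many encodings at once,
# because ASCII is a common subset of nearly all of them.
ascii_bytes = b"Was ist Unicode?"
for codec in ("us-ascii", "utf-8", "euc-kr", "iso2022_jp", "windows-1252"):
    assert ascii_bytes.decode(codec) == "Was ist Unicode?"

# EBCDIC (here code page 500) is the exception set aside above: it encodes
# the same characters with entirely different bytes.
assert "Was ist Unicode?".encode("cp500") != ascii_bytes

# Non-ASCII bytes, by contrast, allow real validation: a truncated UTF-8
# sequence for U+20AC EURO SIGN is rejected outright.
try:
    b"\xe2\x82".decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 detected")
```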
--
Completely is probably off the mark, since virtually all character encodings also encode the ASCII set (even if they "do it funny"). Changed "completely" to "very" (to maintain the important emphasis).

j. UTF-7 reference. Changed to RFC 2152. Then washed hands.

4. Observant readers will note that I omitted responding to Frank's comment on windows-1252 vs. Latin-1, which spawned a long-ish thread. Personally, I think that calling it an "extension" is justifiable. However, since I have had to write the same things about encoding "superset" pairs many times (I even have slides about it in the Internationalization Tutorial), I think it worthwhile to point it out again. I could probably defend the original wording, but I have separated the example into its own paragraph, which reads:
--
<p>For example, the name <code>ISO-8859-1</code> is often used to describe data that actually uses the encoding <code>windows-1252</code>. This latter encoding (Microsoft Windows code page 1252) is very similar to ISO 8859-1 but assigns graphic characters to the range of bytes between 0x80 and 0x9F. Many Web applications (such as browsers, search engines, etc.) treat content bearing the ISO 8859-1 label as using the windows-1252 encoding instead, since, for all practical purposes, windows-1252 is a "superset" of ISO 8859-1. Other applications, such as encoding converters (like iconv or ICU), are quite literal, and you must specify the right encoding name in order to get the proper results.</p>
--

If you have any other comments, or replies to these comments, please send them to this list.

Best Regards,

Addison

[1] http://www.w3.org/International/articles/unicode-migration/

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture. It is not a feature.
Received on Thursday, 10 April 2008 01:26:02 UTC