Unicode Migration comment responses...

From: Addison Phillips <addison@yahoo-inc.com>
Date: Wed, 09 Apr 2008 18:26:27 -0700
Message-ID: <47FD6CC3.40909@yahoo-inc.com>
To: I18N <www-international@w3.org>


Several email threads have appeared on this list with regard to the 
Unicode Migration article currently up for review [1]. This email lists 
my specific responses (as the editor). I've sent the updated document to 
Richard for posting tomorrow.

1. David Clarke (and others) noted that I failed to remove a comment by 
Richard from the section on BOM. I removed the comment. Further, I added 
text responding to the comment.

2. David Clarke. Typo involving the entity for the Euro character. The 
entity was generated by an editor tool somewhere. I fixed it.

3. Frank Ellermann wrote a number of comments in Msg076. Responses follow:

   a. s/might be/are/  DONE.

   b. add "or its proper subset US-ASCII", because that avoids any
potential problems with a text/xml Content-Type.

   NOT DONE. I did add the note about Content-Type defaulting to 
US-ASCII. However, the point here is to use UTF-8 and NOT some other 
encoding. Character entities are less desirable than real characters.

Said warning reads:

Note that the HTTP Content-Type <code>text/xml</code> defaults to 
US-ASCII (for this reason, <code>application/*+xml</code> is usually 
preferred) and you'll still need to specify the charset if you use the 
<code>text/xml</code> Content-Type.
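
As a rough illustration of that default, here is a minimal Python sketch. The function name `effective_charset` is hypothetical, and the rule it encodes (a missing charset on `text/xml` means US-ASCII) is the RFC 3023 behavior the note describes; other media types are simply reported as "not declared at the protocol level".

```python
from email.message import Message
from typing import Optional


def effective_charset(content_type: str) -> Optional[str]:
    """Return the charset implied by a Content-Type header value.

    Hypothetical helper: returns the explicit charset parameter if present,
    "us-ascii" for a bare text/xml, and None otherwise (meaning the
    encoding must be found some other way, e.g. inside the document).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        # No charset parameter: text/xml falls back to US-ASCII.
        return "us-ascii"
    return None


print(effective_charset("text/xml"))                  # us-ascii
print(effective_charset("text/xml; charset=UTF-8"))   # utf-8
print(effective_charset("application/xml"))           # None
```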

    c. HTTP or MIME.  DONE.

    d. Regarding external and internal encoding declarations, Frank wrote:

| the external encoding specification may duplicate one that's part
| of the byte sequence - that's a good thing

Dubious, it can be a pain when the info differs.  Maybe "usually
a good thing" or similar (often, generally, typically, dunno, but
definitely not always).

... but only when the info differs. When the info *duplicates* one 
that's part of the byte sequence, that's a good thing. Announcing the 
encoding in the file is good because off-line tools can still make sense 
of the byte sequence. Announcing it in the protocol is good because 
it often takes precedence (and other Bad Things happen if your server 
is not set to emit the *correct* encoding declaration, such as emitting 
the wrong one).
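
A minimal Python sketch of checking that the two declarations agree follows. The helper names are hypothetical, the in-document scan is a crude regular expression over the first 1024 bytes rather than a real parser, and names are compared literally, so aliases such as `latin1` and `iso-8859-1` would be reported as a mismatch.

```python
import re
from typing import Optional

# Matches encoding="..." in an XML declaration or charset=... in an
# HTML <meta> element. An illustrative approximation, not a parser.
_IN_DOC = re.compile(
    rb'(?:encoding|charset)\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def in_document_charset(data: bytes) -> Optional[str]:
    """Return the encoding name declared inside the document, if any."""
    match = _IN_DOC.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declarations_agree(http_charset: Optional[str], data: bytes) -> bool:
    """True only when both declarations exist and name the same encoding."""
    internal = in_document_charset(data)
    if http_charset is None or internal is None:
        return False
    return http_charset.strip().lower() == internal


page = b'<?xml version="1.0" encoding="UTF-8"?><doc/>'
print(declarations_agree("utf-8", page))        # True
print(declarations_agree("ISO-8859-1", page))   # False
```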

    e. Frank writes:

| users commonly change the browser encoding

Why would they still do this ?  This sounds as if written in 1996.

(laughing) In my day job users often do this---sometimes they are even 
obliged to do this in order to use a particular Web property (my day 
job--not to mention this article--would be to eliminate the need for 
them to do it, nu?). Either way, they do change it (and sometimes they 
do it by setting their browser default). However, I did tone down 
"commonly" to "sometimes".

Besides: I see this question on various Web programming lists all the time.

    f. Frank writes:

| such as UTF-8, EUC-KR, ISO 2022-JP

US-ASCII is another prominent example allowing validation.

Yes, but it is a special case. Ignoring EBCDIC for a moment, a pure 
ASCII sequence with no control characters validly matches nearly all 
multibyte encodings. Detecting ASCII is sometimes useful, but not 
necessarily as an example here.
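
To make the point concrete, here is a small Python sketch that tries to validate a byte sequence against several multibyte encodings. The function name and the candidate list are illustrative assumptions; the codec names are the ones Python's standard library uses.

```python
from typing import List, Sequence

# Candidate encodings, using Python's codec names (an illustrative list).
CANDIDATES = ("utf-8", "euc_kr", "iso2022_jp")


def valid_encodings(data: bytes, candidates: Sequence[str] = CANDIDATES) -> List[str]:
    """Return the candidate encodings under which data decodes cleanly."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


# Pure ASCII validates as every candidate, so it identifies none of them.
print(valid_encodings(b"Was ist Unicode?"))

# Korean text stored as EUC-KR fails UTF-8 validation.
print(valid_encodings("\ud55c\uae00".encode("euc_kr")))
```

Validation is only informative when the data contains bytes outside the ASCII range, which is why ASCII makes a poor example of a detectable encoding.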

    g. Permalink to emoji: DONE.

    h. Frank asked:

| ISO-8859-1 | Western European |  10% | 100% |

Interesting, where did you find 10% as a "typical expansion" ?

It's a rough approximation. Actual expansion amounts depend heavily on 
the language and the particular source text. Actually, most languages 
experience much smaller expansion. For example "Was ist Unicode?" on the 
Unicode site expands by less than 2%. So... I changed the intro to the 
table to say:

The exact amount of expansion depends on the language and particular 
text involved. Expansions for some common encodings might be as much as:
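
As a rough check on figures like these, the expansion for a given text can be measured directly. The sketch below is an illustration only: the function name `utf8_expansion` is hypothetical and the sample strings are made up for the example, not taken from the article.

```python
def utf8_expansion(text: str, legacy_encoding: str) -> float:
    """Fractional size increase when text moves from a legacy encoding to UTF-8."""
    legacy_size = len(text.encode(legacy_encoding))
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - legacy_size) / legacy_size


# Pure ASCII text does not grow at all.
print(utf8_expansion("Was ist Unicode?", "iso-8859-1"))      # 0.0

# "Gruesse" with u-umlaut and sharp s: 5 bytes become 7.
print(utf8_expansion("Gr\u00fc\u00dfe", "iso-8859-1"))

# Worst case for ISO-8859-1: every character is non-ASCII, so size doubles.
print(utf8_expansion("\u00e4\u00f6\u00fc", "iso-8859-1"))    # 1.0
```

Real text sits between the two extremes, and for most Western European languages it sits much closer to the first.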

    i. Frank writes:

| representing completely different character sets from ASCII.

Maybe s/completely/completely or slightly/ for the old ISO 646
variants of US-ASCII.  What's an example for "completely" ?

Completely is probably off the mark, since virtually all character 
encodings also encode the ASCII set (even if they "do it funny"). 
Changed "completely" to "very" (to maintain the important emphasis).

    j. UTF-7 reference. Changed to RFC 2152. Then washed hands.

4. Observant readers will note that I omitted responding to Frank's 
comment on windows-1252 vs. Latin-1, which spawned a long-ish thread. 
Personally, I think that calling it an "extension" is justifiable. 
However, since I have had to write the same things about encoding 
"superset" pairs many times (I even have slides about it in the 
Internationalization Tutorial), I think it worthwhile to point it out 
again. I could probably defend the original wording, but I have 
separated the example into its own paragraph that reads:

<p>For example, the name <code>ISO-8859-1</code> is
often used to describe data that actually uses the encoding 
<code>windows-1252</code>. This latter encoding (Microsoft Windows code 
page 1252) is very similar to ISO 8859-1 but assigns graphic characters 
to the range of bytes between 0x80 and 0x9F. Many Web applications (such 
as browsers, search engines, etc.) treat content bearing the ISO 8859-1 
label as using the windows-1252 encoding instead, since, for all 
practical purposes, windows-1252 is a "superset" of ISO 8859-1. Other 
applications, such as encoding converters (like iconv or ICU) are pretty 
literal, and you must specify the right encoding name in order to get 
the proper results.</p>
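
The difference is easy to see with a literal converter. The Python sketch below decodes the same three bytes both ways; the `decode_as_label` helper and its `lenient` flag are hypothetical, meant only to mimic the browser-style relabeling described above.

```python
# Three bytes from the 0x80-0x9F range: in windows-1252 they are a left
# curly quote, the euro sign, and a right curly quote.
data = bytes([0x93, 0x80, 0x94])


def decode_as_label(data: bytes, label: str, lenient: bool) -> str:
    """Decode data by its label; if lenient, treat ISO-8859-1 as windows-1252."""
    if lenient and label.lower() in ("iso-8859-1", "latin-1", "latin1"):
        label = "windows-1252"
    return data.decode(label)


# Literal handling yields invisible C1 control characters.
literal = decode_as_label(data, "ISO-8859-1", lenient=False)
print([hex(ord(ch)) for ch in literal])   # ['0x93', '0x80', '0x94']

# Browser-style handling yields the graphic characters the author meant.
relabeled = decode_as_label(data, "ISO-8859-1", lenient=True)
print([hex(ord(ch)) for ch in relabeled])  # ['0x201c', '0x20ac', '0x201d']
```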

Please send any other comments, or replies to these comments, to 
this list.

Best Regards,


[1] http://www.w3.org/International/articles/unicode-migration/

Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
Received on Thursday, 10 April 2008 01:26:02 UTC
