Re: Unicode Migration comment responses...

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 10 Apr 2008 09:30:00 +0200
To: www-international@w3.org
Message-ID: <ftkfgi$7co$1@ger.gmane.org>

Addison Phillips wrote:
> I've sent the updated document to Richard for posting tomorrow.


 [text/xml and US-ASCII]
> However, the point here is to use UTF-8 and NOT some other 
> encoding. Character entities are less desirable than real 
> characters.

Okay.  Of course it also depends on the platform, the tools,
what the user intends to do (read or edit), and what's more
user friendly in the case of unsupported characters.  

Admittedly my preferences are odd (= "all I need to know is
the hex. codepoint number for anything above u+00FF"), and
arguably RFC 5198 says "not good enough, hex. is not NFC".  
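
A minimal sketch of that preference, assuming XML/HTML-style
hexadecimal character references; the function name and the
`limit` parameter are illustrative only and come from no
particular tool:

```python
def to_hex_refs(text: str, limit: int = 0xFF) -> str:
    """Replace each character above `limit` with a hex character reference.

    Characters at or below `limit` (U+00FF by default) are kept as-is.
    """
    return "".join(
        ch if ord(ch) <= limit else f"&#x{ord(ch):X};"
        for ch in text
    )


# U+00EF (i with diaeresis) stays literal, U+20AC (euro sign) becomes a reference.
print(to_hex_refs("na\u00efve \u20ac 5"))
```

The result is readable only to someone who knows the codepoint
numbers, which is exactly the trade-off discussed above.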

 [about the wonders of "redundant" charset declarations]
> Announcing it in the protocol is good because often this
> takes precedence (or other Bad Things happen if you don't
> set your server to emit the *correct* encoding declaration
> ---like it emits the wrong one).

The three servers I use at the moment are not "my" servers.
One of them happily says Latin-1 for any text/html, no
matter what it really is.  As it happens it is either ASCII
or windows-1252, never Latin-1.  I don't see that the HTTP
or HTML5 WGs intend to improve this situation, putting it
mildly.  The concept "your server" is already broken, it's
the same idea as "your UA".
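
The ASCII / windows-1252 / Latin-1 distinction can be roughly
checked from the bytes alone.  This is only a heuristic sketch:
it assumes that bytes 0x80-0x9F (C1 controls in ISO-8859-1,
mostly printable characters in windows-1252) never occur in
genuine Latin-1 text, and the function name is made up for
illustration:

```python
def classify_single_byte(data: bytes) -> str:
    """Guess which single-byte label fits `data` (heuristic only)."""
    if all(b < 0x80 for b in data):
        # Pure 7-bit content is valid under all three labels.
        return "ascii"
    if any(0x80 <= b <= 0x9F for b in data):
        # This range holds C1 controls in ISO-8859-1, so printable
        # text using it is almost certainly windows-1252.
        return "windows-1252"
    # Only bytes 0xA0-0xFF above ASCII: identical in both encodings.
    return "latin-1-compatible"


print(classify_single_byte(b"plain text"))
print(classify_single_byte("\u20ac 5".encode("cp1252")))  # euro sign is byte 0x80
print(classify_single_byte("caf\u00e9".encode("latin-1")))
```

A page in the third class is labelled correctly by accident; a
page in the second class is mislabelled by a server that always
says Latin-1.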

 [Latin-1 vs. UTF-8]
> The exact amount of expansion depends on the language and
> particular text involved. Expansions for some common 
> encodings might be as much as:

Right, I was curious if 10% was based on some reproducible
results for the languages allegedly covered by Latin-1.
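
A reproducible figure could be obtained along these lines.  This
is a sketch that assumes the input text is representable in
Latin-1; the sample string is an arbitrary illustration, not a
corpus, and the function name is hypothetical:

```python
def utf8_expansion(text: str) -> float:
    """Return the relative size increase when moving from Latin-1 to UTF-8.

    0.10 means the UTF-8 form is 10% larger.  Raises UnicodeEncodeError
    if `text` contains characters outside Latin-1.
    """
    latin1 = text.encode("latin-1")
    utf8 = text.encode("utf-8")
    if not latin1:
        return 0.0  # empty input: nothing to expand
    return len(utf8) / len(latin1) - 1.0


# German sample with three non-ASCII letters (u-umlaut twice, sharp s once).
sample = "Gr\u00fc\u00dfe aus M\u00fcnchen"
print(f"{utf8_expansion(sample):.1%}")
```

Each Latin-1 character above 0x7F takes two bytes in UTF-8, so
the expansion equals the share of such characters in the text;
running this over real corpora per language would show whether
10% is a typical value or an upper bound.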

> j. UTF-7 reference. Changed to RFC 2152. Then washed hands.

Not fair, UTF-7 was an important step in 1994, at that time
UTF-8 was apparently still "work in progress".  An IETF AD
volunteered to sponsor an Internet draft deprecating UTF-7,
maybe try it.  My attempt failed; I ended up talking
about UTF-32, UTF-16, UTF-8, CESU-8, UTF-1, BOCU-1, "UTF-4",
UTF-EBCDIC, UNICODE-1.1, and "UTF-5" - excluding SCSU isn't
the trick to get a remotely comprehensible text.

Received on Thursday, 10 April 2008 07:28:07 UTC