Re: For review: Character encodings for beginners from Frank Ellermann on 2007-12-06 (www-international@w3.org from October to December 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 6 Dec 2007 23:46:27 +0100
To: www-international@w3.org
Message-ID: <fj9u03$6k4$1@ger.gmane.org>

Richard Ishida wrote:

> Please send any comments to www-international@w3.org.

| Note however that that byte may represent either é or щ,
| depending on the context.

... "represent é, щ, other characters, or no character at
all depending on the charset".

You'd need a definition of the shorthand "charset" first;
maybe your wording "context" allows you to skip this here.
My point:  There are more possibilities than only é or щ.
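
To illustrate the point, here is a minimal Python sketch that
decodes the single byte 0xE9 under several charsets.  The
helper name decode_byte and the choice of charsets are
assumptions made for this example; they are not taken from
the article under review.

```python
def decode_byte(byte_value, charset):
    """Return the character one byte stands for in a charset, or None."""
    try:
        return bytes([byte_value]).decode(charset)
    except UnicodeDecodeError:
        # The byte is not a complete, valid character in this charset.
        return None


# The same byte, read under different charset labels.
for charset in ("iso-8859-1", "iso-8859-5", "koi8-r", "ascii", "utf-8"):
    print(charset, repr(decode_byte(0xE9, charset)))
```

Under ISO-8859-1 the byte is é, under ISO-8859-5 it is щ,
under KOI8-R it is yet another letter, and under US-ASCII or
as a lone byte in UTF-8 it is no character at all.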

| Most Web pages use the UTF-8 encoding for Unicode text.

..."pages and Internet protocols use"...  RFC 2277 (BCP 18)
is the LAW; it even sets a decent deadline for UTF-8 to
replace all legacy charsets in protocols (not before 2048).

Are you sure about "most Web pages" (as of today)?

| some more complicated decoding is needed

UTF-8 isn't complicated, it's brilliant.  It's just not
obvious to most human readers (including some folks who
like modulo 16 better than modulo 64).
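
The "modulo 64" remark can be made concrete with a small
Python sketch: every UTF-8 continuation byte carries six
bits, i.e. one base-64 digit of the code point.  The function
name utf8_encode is hypothetical, and this is only a sketch
for illustration, not a replacement for a real encoder.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 using base-64 digits."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])              # ASCII is a single byte
    if cp < 0x800:
        lead, count = 0xC0, 1           # 110xxxxx + 1 continuation byte
    elif cp < 0x10000:
        lead, count = 0xE0, 2           # 1110xxxx + 2 continuation bytes
    else:
        lead, count = 0xF0, 3           # 11110xxx + 3 continuation bytes
    tail = []
    for _ in range(count):
        cp, digit = divmod(cp, 64)      # peel off the lowest base-64 digit
        tail.append(0x80 | digit)       # continuation byte: 10xxxxxx
    # Whatever is left of cp fits into the free bits of the lead byte.
    return bytes([lead | cp] + tail[::-1])


print(utf8_encode(0xE9).hex())          # the code point of é
```

Read in hexadecimal (modulo 16) the relation between code
point and bytes looks arbitrary; read in base 64 it is just
positional notation with a marker on each byte.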

Maybe pick only UTF-16BE as contrast, and omit UTF-32BE.
The implicit message should be "there is UTF-8, anything
else is doomed" (not before 2048, as noted above).

The "how does this affect me" section is rather short, 
and using s/Unicode/UTF-8/ everywhere does not simplify
everything.  E.g. I'm sure that most of my fonts can 
handle windows-1252, and for the reasons noted in your
text I'm also sure that they can't handle all of UTF-8.

For Web pages the encoding is almost irrelevant: authors
can insert any Unicode code point as an NCR (numeric
character reference) anyway, and in that case picking
ASCII or windows-1252 can be a valid choice.
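
As a sketch of the NCR approach in Python: text can be
stored as pure ASCII with every other character written as a
numeric character reference, and a consumer that resolves
the references gets the original text back.  The function
name to_ascii_with_ncrs is an assumption for this example.

```python
import html


def to_ascii_with_ncrs(text):
    """Return text as ASCII, with non-ASCII characters as decimal NCRs."""
    # 'xmlcharrefreplace' is a standard codec error handler that
    # writes each unencodable character as &#NNNN;.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


source = "caf\u00e9 \u0449"              # "café щ"
markup = to_ascii_with_ncrs(source)
print(markup)
print(html.unescape(markup) == source)   # resolving the NCRs restores the text
```

Markup-significant characters such as & and < still need
their own escaping; the sketch covers only the charset side.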

But whatever authors pick, they MUST declare it, that's
the important point, addressed earlier in your text.

One of the trickier issues is text/plain: depending on
the platform it might not be possible to declare what
it is, and then UTF-8 has some nice properties that allow
guessing correctly.  But maybe talking about "guessing"
would be at odds with the goals of your text (?)
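
One such property, sketched in Python: text in a legacy
8-bit charset that contains non-ASCII bytes is rarely a
well-formed UTF-8 sequence, so a strict decode works as a
heuristic.  The function name looks_like_utf8 is made up for
this example, and the result is a guess, not a proof.

```python
def looks_like_utf8(data):
    """Heuristic: True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")            # strict decoding rejects bad sequences
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # UTF-8 input
print(looks_like_utf8("caf\u00e9".encode("cp1252")))   # windows-1252 input
print(looks_like_utf8(b"plain ASCII"))                 # ASCII is valid UTF-8 too
```

Pure ASCII passes trivially, and a short legacy sequence can
pass by accident, which is why this stays a guess.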

 Frank

Received on Thursday, 6 December 2007 22:44:56 UTC