Re: Charset "iso-10646-1" from Terje Bless on 2001-08-31 (www-html@w3.org from August 2001)

From: Terje Bless <link@pobox.com>
Date: Fri, 31 Aug 2001 22:02:27 +0200
To: Masayasu Ishikawa <mimasa@w3.org>
cc: www-html@w3.org
Message-ID: <20010831220748-r01010800-55602a8d-0910-010c@localhost>

On 01.09.01 at 03:42, Masayasu Ishikawa <mimasa@w3.org> wrote:

>[ www-html only ]

BTW, does www-html-editor auto-CC www-html?


>Terje Bless <link@pobox.com> wrote:
>
>>>I wonder where you came up with iso-10646-1.
>>
>>It's a common misconception. Character Encoding issues are _hard_ and
>>most people don't understand them.
>
>That may be true.  For example,
>
>>Since the ISO-8859-* series has been well worked into the collective
>>subconscious, if a spec uses a similar looking string (such as
>>"ISO-10646") anywhere in relation to charset issues, a lot of people will
>>immediately assume it is a charset name in the same vein as the
>>ISO-8859-* encodings.
>
>That assumption happens to be correct, as charset name "ISO-10646"
>does exist as an alias of "ISO-10646-Unicode-Latin1", as opposed to
>"ISO-10646-1".

Arrrgh! I _knew_ I shouldn't have opened my mouth on charset[0] issues; I
somehow always end up filling it squarely with my foot. :-)


>ISO-8859-* encodings are also not easy to understand

Actually, ISO-8859-* are somewhat bearable since the ISO released the specs
(I dunno when, but I only found them very recently) and there are a couple
of good resources dealing with them.


>>This has cropped up periodically and should probably be mentioned to the
>>HTML WG; a small explanatory note, strategically placed, could avoid a
>>lot of confusion.
>
>I would say confusion about "iso-10646-1" is only the tip of an iceberg
>and I don't think a "small" note can avoid "a lot of" confusion.

Let me rephrase/expand on that. On a pragmatic level, what is desirable is
to get people to put the string "UTF-8" in whatever place they put such
meta-information because that is the encoding they are actually using.

Those still using legacy Character Encodings from the ISO-8859-* series
(and their cohorts) can and will continue to do so until they suddenly "see
the light" and start to use "Unicode".

At this point they will likely be using "UTF-8" as the physical encoding,
but look at the spec and decide that they should put "ISO10646" or some
arbitrary mutation thereof instead. A note to the effect that they are most
likely to want "UTF-8", strategically placed, might stave off the questions
such as the one that prompted my message.

The goal isn't to impart a full understanding of charset issues; it's to
get them to put in the right "magic ingredient" to make it "work".
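
To make the "magic ingredient" concrete, here is a minimal sketch in
Python of what I mean: encode the text as UTF-8 and put the string "utf-8"
in the places where the meta-information goes. The helper name
label_as_utf8 and the sample text are made up for illustration; the META
form is the http-equiv style from HTML 4.

```python
def label_as_utf8(body_text):
    """Encode text as UTF-8 and build matching labels (hypothetical helper)."""
    # The bytes that actually go over the wire.
    payload = body_text.encode("utf-8")
    # HTTP header label: the charset parameter names the encoding in use.
    content_type = "text/html; charset=utf-8"
    # In-document label, HTML 4 META http-equiv style.
    meta = '<meta http-equiv="Content-Type" content="' + content_type + '">'
    return payload, content_type, meta


payload, content_type, meta = label_as_utf8("na\u00efve caf\u00e9")
print(content_type)
print(meta)
```

The point is only that the label and the bytes agree; nothing here
requires understanding ISO 10646 as such.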


As Martin said in his reply, the specs -- which the original questioner
cited! -- make no mention of the string "ISO-10646-1" anywhere. Where /did/
he get that string? Most likely, by extrapolating from ISO-8859-1 because
that's what's etched into the collective subconscious by now.

[ from original message ]
>I found charset=iso-10646-1  on W3C website 
>(http://www.w3.org/TR/html4/intro/intro.html#h-2.3.1).

Hmmm...


You may also want to go to <URL:http://www.w3.org/TR/html4/charset.html>
and try a visual scan of the text for "UTF-8". Can't find it? Use your
browser's "Find..." function; there is exactly _one_ reference to it, and
it's buried several paragraphs in and in a paragraph listing a myriad
legacy charsets (including ISO-8859-1, -5, and Shift_JIS).

This section is titled "HTML Document Representation". Obvious?

Fair enough, the detailed Table of Contents gives the sub-heading "Character
sets, character encodings, and entities", but that only helps if you
actually read the detailed ToC (and pay close attention!). I actually read
that beast cover to cover when it came out -- which I also did with HTML
4.0 and HTML 3.2; both of which are in a three-ring binder on my desk --
and I had to look hard to find the section dealing with charset issues.

To put it this way: It's no great surprise to me that Joe R. Webduhsigner
can't find the information he so sorely needs.


Looking closer to home, XHTML 1.0 only mentioned the issue in Appendix C:

# C.1 Processing Instructions
# 
# Be aware that processing instructions are rendered on some user agents. 
# However, also note that when the XML declaration is not included in a
# document, the document can only use the default character encodings
# UTF-8 or UTF-16.
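
A rough sketch of what that last sentence implies for whoever reads the
document, in Python. This is a simplification and not the full
autodetection procedure from the XML spec: it only handles byte order
marks and ASCII-compatible encodings, and the function name
assumed_xml_encoding is made up for illustration.

```python
import re


def assumed_xml_encoding(data):
    """Guess the encoding of an XML document from its first bytes (sketch)."""
    # A UTF-16 byte order mark (either byte order) signals UTF-16.
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        return "UTF-16"
    # Skip a UTF-8 byte order mark if one is present.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # Look for an encoding pseudo-attribute in an XML declaration.
    match = re.match(
        rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', data
    )
    if match:
        return match.group(1).decode("ascii")
    # No declaration, or no encoding named in it: fall back to UTF-8.
    return "UTF-8"


print(assumed_xml_encoding(b"<html/>"))
print(assumed_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'))
```

In other words, leave the declaration out and anything other than UTF-8
or UTF-16 has to be labelled some other way, or it will be misread.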

XHTML 1.1 doesn't mention it, and neither does XHTML M12N or XHTML Basic.




[0] - And I'm using that term deliberately. :-)
Received on Friday, 31 August 2001 16:08:47 UTC