Charmod NONconformance and superset encodings (was: Re: Joint meeting at TPAC from HTML and i18n core WG minutes 2007-11-09) from Martin Duerst on 2007-11-20 (public-i18n-core@w3.org from October to December 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 20 Nov 2007 14:42:03 +0900
To: Felix Sasaki <fsasaki@w3.org>, public-html@w3.org, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20071113125321.09cb9b50@localhost>
Dear I18N WG, HTML WG,

I have looked at the minutes of your recent meeting.
There are a couple of topics I want to comment on; I have
created separate threads for them.

At 02:06 07/11/10, Felix Sasaki wrote:
>
>... are at http://www.w3.org/2007/11/09-i18n-minutes.html and below as text.


>   <Hixie> "When a user agent would otherwise use the ISO-8859-1
>   encoding, it must instead use the Windows-1252 encoding."
>
>   Henri: that part is a violation of charmod

I strongly agree.

>   Addison doesn't consider that a violation of charmod

I'm not sure how this could NOT be a violation. I'm also quite
sure that most if not all of the authors of Charmod Fundamentals
would agree. The principle that charset labels must mean what they
say is fundamental to "4.4.2 Character encoding identification"
(http://www.w3.org/TR/charmod/#sec-EncodingIdent), and besides
taking center stage in C025 (http://www.w3.org/TR/charmod/#C025,
for senders) and C030 (http://www.w3.org/TR/charmod/#C030, for
receivers), it shines through in other criteria in that section.

>   Addison: There are superset encodings and they're often tagged with
>   the subset encodings.

The character model does not use the concepts of subset encoding
or superset encoding. And the fact that such a practice exists
can't be used to judge whether it conforms to Charmod or not.
We all know that sloppy tagging is a reality, and not only in the
'charset' area, but using that to claim conformance to Charmod
in such a roundabout way is a bad idea.

>   ... using the superset interpretation doesn't conflict with using
>   the subset interpretation

How not? If you meant that you can tag something as windows-1252
even if it doesn't contain any graphic characters only available
in windows-1252, that would be okay; I don't remember any
requirement in Charmod to label as tightly as possible.
But the reverse doesn't fly, because in many cases there is
more than one possible superset encoding. As iso-8859-1 is clearly
a superset encoding of US-ASCII, by your argument, we would
conclude that it's okay to interpret stuff labeled as US-ASCII as
iso-8859-1. But by the same logic, we can also argue that it is okay
to interpret stuff labeled as US-ASCII as iso-8859-2, and so on.
It's easy to see that this doesn't make sense.
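
The ambiguity described above can be illustrated with a minimal
sketch in Python. It relies only on the codecs shipped with Python's
standard library; the variable names are made up for illustration.
Pure US-ASCII bytes decode identically under both candidate
"supersets", but a single byte outside US-ASCII shows that the
candidates disagree with each other, so the label US-ASCII gives no
basis for picking one of them.

```python
# Two encodings that both contain US-ASCII as a subset.
candidates = ["iso-8859-1", "iso-8859-2"]

# Pure US-ASCII data: every candidate yields the same text.
ascii_bytes = b"charset"
ascii_results = {name: ascii_bytes.decode(name) for name in candidates}

# One byte outside US-ASCII: the candidates yield different characters.
high_byte = b"\xa1"
high_results = {name: high_byte.decode(name) for name in candidates}

for name in candidates:
    print(name, repr(ascii_results[name]), repr(high_results[name]))

# Under the label the data actually carries, the byte is simply invalid.
try:
    high_byte.decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False
```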

I could also to some extent understand this if you used these terms
with regard to undefined codepoints. We all know Unicode has some
as-yet undefined codepoints, and we all understand that by using
charset labels such as "UTF-8" or "UTF-16", etc., we include future
additions of characters to Unicode, even if this creates the risk
that some characters may not render properly on all platforms.
Also, many of us know that windows-1252 still has some unused,
reserved code positions (see e.g. the light bluegreen squares at
http://en.wikipedia.org/wiki/Windows-1252). At some point, it had
even more. Extending the above reasoning for Unicode, I think it's
fair to argue that we expect the label windows-1252 to remain usable
if Microsoft (who created and controls windows-1252)
assigns some of the codepoints that are currently still open.
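
As a rough illustration of the unused positions mentioned above, the
following sketch asks Python's standard-library "cp1252" codec which
byte values in the range 0x80-0x9F it refuses to decode. This
reflects only the mapping table that Python ships; other
implementations may treat these bytes differently, so the result
should be read as one implementation's view, not as a definition of
windows-1252.

```python
# Collect the byte values in 0x80-0x9F that Python's cp1252 codec
# treats as undefined (decoding them raises UnicodeDecodeError).
unassigned = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        unassigned.append(value)

print([hex(value) for value in unassigned])
```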

However, iso-8859-1 does NOT have any codepoints reserved for
future assignments. Like any other member of the 8859 series,
it is designed to leave the C1 area (byte values 0x80-0x9F)
for non-graphic characters. It is rather unclear whether they
are assigned to any specific control characters, whether one
should just consider them mapped to the corresponding codepoints
in Unicode (some of which are still unassigned), or whether they
can be freely used with some other collection of control characters.
The whole issue is mostly irrelevant because in actual iso-8859-1 data,
these codepoints for control characters are virtually never used.
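
One of the interpretations listed above, mapping the C1 bytes to the
corresponding Unicode codepoints, happens to be what Python's
standard-library "iso-8859-1" codec does. The sketch below only
demonstrates that one implementation choice; it does not settle which
interpretation is the right one.

```python
import unicodedata

# Decode every byte of the C1 area (0x80-0x9F) as iso-8859-1.
c1_text = bytes(range(0x80, 0xA0)).decode("iso-8859-1")

# Each byte maps to the Unicode codepoint with the same numeric value.
codepoints = [ord(ch) for ch in c1_text]

# All of them fall in Unicode general category "Cc" (control),
# i.e. none of them is a graphic character.
categories = {unicodedata.category(ch) for ch in c1_text}

print([hex(cp) for cp in codepoints])
print(categories)
```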

What is very clear in any case is that iso-8859-1 does NOT, for
example, have the Euro symbol at position 0x80, and so on.
So requiring the use of Windows-1252 to interpret data labeled
as iso-8859-1 is squarely in violation of the relevant conditions
in Charmod. In my view, any other conclusion would mean that
Charmod Fundamentals isn't worth the paper it's (occasionally)
printed on, or the electrons used to send it.
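
The 0x80 example can be checked directly. This is a minimal sketch
using Python's standard-library codecs (the variable names are
illustrative): the same byte decodes to a C1 control codepoint under
the label iso-8859-1 and to the Euro sign under windows-1252, and the
Euro sign cannot be encoded in iso-8859-1 at all.

```python
data = b"\x80"

# Interpreted according to the label iso-8859-1: codepoint U+0080.
as_latin1 = data.decode("iso-8859-1")

# Interpreted as windows-1252 instead: U+20AC EURO SIGN.
as_cp1252 = data.decode("windows-1252")

print(hex(ord(as_latin1)), hex(ord(as_cp1252)))

# The Euro sign has no representation in iso-8859-1.
try:
    "\u20ac".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```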

>   ... We're not proposing a substantive change, just providing more
>   justification for what you're doing.

I very much hope this gets reexamined. There may be various ways
to work this out, but just claiming that there is no violation of
Charmod in this case is a very bad start.


Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 20 November 2007 06:14:08 UTC