
Re: Accept-Charset support

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Fri, 13 Dec 1996 12:42:01 +0100 (MET)
To: Larry Masinter <masinter@parc.xerox.com>
cc: Drazen.Kacar@public.srce.hr, Chris.Lilley@sophia.inria.fr, www-international@w3.org, Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com, bobj@netscape.com, wjs@netscape.com, erik@netscape.com, Ed_Batutis/CAM/Lotus@crd.lotus.com
Message-ID: <Pine.SUN.3.95.961213110151.245C-100000@enoshima>
On Thu, 12 Dec 1996, Larry Masinter wrote:

> # However, what I (and others) object against very strongly is
> # the combination of RFC 1522 with the exception for ISO-8859-1.
> # If everybody has to use RFC 1522, there must not be an
> # exception for ISO-8859-1. ISO-8859-1 and Western Europe is
> # not really anything special. If we choose RFC 1522, then
> # everybody should use it, there should not be any exceptions.
> The exception for ISO-8859-1 for warning messages in HTTP is based on
> the fact that there is an exception for ISO-8859-1 for text documents,
> and that it made no sense for the protocol to be inconsistent.

Larry - What you say here is a very weak excuse at best.
I would like to comment specifically about two points:
- The exception for ISO-8859-1
- The "inconsistency" of the protocol

With regard to ISO-8859-1 as a default, Francois has already
commented on this in very strong words. In the terms I usually
use to describe it, the reality is this: ISO-8859-1 comes in as a
default only after a long list of other things, which are, in
(rough) order of priority:
- "charset" parameter in Content-Type header
- "charset" parameter in META element
- User settings
- Heuristic guessing by the client
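
The resolution order above can be sketched as follows (a minimal
sketch of my own; the function and parameter names are hypothetical
and not taken from any browser or spec):

```python
def resolve_charset(http_charset=None, meta_charset=None,
                    user_setting=None, guess=None):
    """Pick a charset in the (rough) priority order described above.

    All parameter names are hypothetical; real browsers differ in
    detail. Only when every other source is absent does ISO-8859-1
    apply as the fallback "default".
    """
    for candidate in (http_charset, meta_charset, user_setting, guess):
        if candidate is not None:
            return candidate
    return "ISO-8859-1"  # the much-discussed fallback "default"
```

The point of the sketch is simply that the "default" sits at the very
bottom of the chain, which is why it so rarely matters in practice.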

As a practical consequence (because all major browsers have a
"Document Encoding" submenu), the default is close to worthless.
It also means that correct labeling is preferable to relying
on the "default". In Section 3.7.1 (as of version 06),
the HTTP draft acknowledges this fact:

Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess." Senders
wishing to defeat this behavior MAY include a charset parameter even
when the charset is ISO-8859-1 and SHOULD do so when it is known that it
will not confuse the recipient.

Actually, the first sentence describes the situation rather
complacently. *All* widely deployed HTTP/1.0 software
interprets a Content-Type header without "charset" as "use
other means to determine the encoding" (see above). This is
a bare necessity, because currently maybe 99% of the
documents not in ISO-8859-1 are not labelled.
This in turn is a result of i18n not having been considered
from the beginning when the web was created; the originators
just took what was in front of their noses, ISO-8859-1.

Now for your claim that "it made no sense for the protocol to be
inconsistent". I greatly value protocol consistency, but let's
be fair: one could only claim consistency if, for the actual
documents, the following held:

- It would be allowed to send ISO-8859-1 in 8-bit without labeling,
	but all other encodings would have to use QP to avoid
	using the 8th bit.
- Having a default of ISO-8859-1 would actually work in practice,
	and not be virtually worthless as described above.
- Having a default of ISO-8859-1 for document content would
	have proven to be a good solution, and not a coincidence
	of history that is creating countless headaches.
- Dealing with HTTP headers and with document content would not,
	in many implementations, occur at very different places
	in the code. (In reality they do, so the supposed savings
	in implementation complexity are nonexistent.)

Claiming that the way HTTP now deals with document encodings
and the way HTTP 1.1 proposes to deal with the encodings of
warning headers has anything to do with consistency is more
of a joke than serious engineering.

> It is a historical fact that the web's origin at CERN in western
> Europe gives it a western-European bias. This is perhaps unfortunate
> (if you're not a western European),

I am a western European. I grew up and live in central western Europe,
in the same small country where the Web originated. I acknowledge the
fact that the Web has its roots in western Europe, and I am even proud
of it. Nevertheless, I think the Web is global, and its protocols
and formats have to be designed with the whole world in mind, and
the historical existence of some western-European bias in some
of its components should in no way be used as an argument to
perpetuate and spread such a bias to places where it is absolutely
not necessary. Historical coincidence should also not be used
as an argument to repeat the errors of the past.

> but your proposal that the default
> be UTF-8 doesn't actually advantage much of the world that currently
> has different encodings as their default.

I have proposed various alternatives, all of which are strongly
preferable to the current proposal. I would be happy if the
current HTTP 1.1 spec said:
- Use RFC 1522 for everything (except ASCII).
- Keep the 8-bit channel open for future use.
We could then specify the use of raw UTF-8 in the future,
once there is consensus that a reasonable base of
implementations can handle it.

I would also be happy if the current spec said:
- Use RFC 1522 for whatever you want to encode.
- Use raw UTF-8 in the 8-bit channel.
This would allow much of the world that currently has different
encodings at their command to use these encodings.
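
For concreteness, here is a minimal sketch of the RFC 1522 "Q"
encoded-word mechanism under discussion (my own simplified code,
not from any implementation; it ignores the RFC's 75-character
line limit and encodes conservatively):

```python
def q_encode(text, charset="UTF-8"):
    """Wrap `text` in an RFC 1522 encoded-word using the Q
    (quoted-printable-like) encoding. Simplified sketch: spaces
    become '_'; '=', '?', '_' and all non-ASCII bytes are encoded
    as '=XX'; the RFC's line-length rules are ignored."""
    out = []
    for byte in text.encode(charset):
        if byte == 0x20:
            out.append("_")                      # space -> underscore
        elif 0x21 <= byte <= 0x7E and byte not in (0x3D, 0x3F, 0x5F):
            out.append(chr(byte))                # printable ASCII as-is
        else:
            out.append("=%02X" % byte)           # everything else hex-encoded
    return "=?%s?Q?%s?=" % (charset, "".join(out))
```

Encoding the empty string yields exactly the fixed "=?UTF-8?Q?" plus
"?=" framing, i.e. the 12-byte overhead discussed in this thread.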

At this point, it is worth considering some additional points.
First, if we specified RFC1522 + UTF-8, the only people that
would be affected would be those implementing ISO-8859-1 only.
As I believe in the global web, I hope that this is a small
minority. They would have the choice between RFC1522 and
UTF-8. Because ISO-8859-1 is a subset of Unicode, the conversion
from ISO-8859-1 to UTF-8 is trivial, easier than the conversion
to RFC1522, and easier than the conversion from anything else
(except plain ASCII) to UTF-8. It's really not a big burden
on ISO-8859-1 implementers.
In addition, as it currently stands, the number of warnings
is limited (6 in version 06), and they will probably end up
as constants in the program or in some file. Thus the server
does not have to implement a conversion algorithm; the
programmer just has to ensure that (s)he includes the correct
string constants. With a list mapping HTML entities to
C constant notation, doing the conversion by hand is a matter
of a few seconds for the 6 warnings currently defined.
Such a list could look as follows:

HTML	Hex (Latin-1)	C (Latin-1)	C (UTF-8)

&uuml;	FC		\374		\303\274

General			\xyz		\30x\2yz

I'm ready to provide a full list.
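
To illustrate how trivial the conversion behind that table is (a
sketch of my own, not taken from any server): each byte below 0x80
passes through unchanged, and each byte with octal value \xyz becomes
the two-byte pair \30x \2yz, exactly the general rule given above.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8. Because Latin-1 maps
    byte-for-code-point onto the first 256 Unicode positions, each
    byte >= 0x80 simply becomes a two-byte UTF-8 sequence."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                   # ASCII passes through
        else:
            out.append(0xC0 | (b >> 6))     # lead byte: \30x (0xC2 or 0xC3)
            out.append(0x80 | (b & 0x3F))   # trail byte: \2yz
    return bytes(out)
```

For &uuml; this turns \374 (0xFC) into \303\274 (0xC3 0xBC), matching
the table row above.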

As we have seen, the problems on the server side when removing
the use of raw ISO-8859-1 are extremely marginal.

Now let's have a look at the client side. The argument that
conversion from UTF-8 to ISO-8859-1 is easier than from RFC1522
applies in this case, too. In addition, one should mention that
any browser that correctly supports numeric character references
has to have some notion of ISO 10646/Unicode. Decoding UTF-8 to
this is very easy. As we know, the main browsers will support
UTF-8 for document content in their next version, and they
should not have any problems doing the same for warnings, either.
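
A minimal sketch (my own) of the decoding step just described,
covering the one- and two-byte UTF-8 sequences, which already
suffice for the entire Latin-1 range (no error recovery):

```python
def utf8_to_codepoints(data: bytes):
    """Decode UTF-8 bytes to ISO 10646 code points. Minimal sketch:
    handles 1- and 2-byte sequences only (U+0000..U+07FF), which
    cover all of ISO-8859-1."""
    points, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # 1-byte sequence (ASCII)
            points.append(b)
            i += 1
        else:                               # 2-byte lead 110xxxxx
            points.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
    return points
```

A browser that already resolves numeric character references can feed
these code points straight into the same display path.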

> You're proposing that recipients apply heuristics to decide if the
> warning messages are in UTF-8 or ISO-8859-1. This seems like a bad
> idea, to make something that's deterministic into something that's
> heuristic.

This was one of my proposals, and not the most preferred one.
I added it because I thought that one argument against removing
the ISO-8859-1 default would be that it was already in use.
As this seems not to be the case, there is no need for such
heuristics. Let's make it deterministically RFC1522 plus an 8-bit
extension in the future, or, even better, RFC1522 plus
deterministic UTF-8.

> The 12-byte overhead for the "=?UTF-8?Q?" and "?=" suffix
> in the warning message isn't so big,

Francois has already commented on this. I would have assumed
that the whole HTTP group had enough expertise in
QP, RFC1522, and i18n to weed out such strange claims earlier.

>and isn't really "Clogging up the
> 8-bit channel".

It's not RFC1522 and QP that is clogging up the 8-bit channel.
It's the ISO-8859-1 "default".

> Perhaps by the time Unicode is widespread -- in the next 3-5 years --

Unicode/UTF-8 support will be widespread next year. And I hope
we still can make a small change to the HTTP 1.1 draft so that
it will not look obviously outdated next year.

> we'll have a new version of HTTP 2.x or HTTP-NG. I would certainly
> propose that in the future, new versions of HTTP default to UTF-8.

If we remove that ISO-8859-1 "default" from the warnings, we can
even have UTF-8 as a default, now, or in the future if you think
there is not enough consensus now.

Larry - Unless you have any stringent arguments for keeping
the ISO-8859-1 "default" for warnings in HTTP 1.1 headers,
we should really get serious about removing that "default"
from the spec. I have made several proposals for a different
specification, and if you tell me which of the proposals
would be most acceptable to you, I will gladly work out
the wording. If you tell me how to subscribe to the HTTP
list, I will also propose and defend this change there.
It may be late for this change, but it can be done
quickly, and it is worth doing for the future of HTTP.

Regards,	Martin.
Received on Friday, 13 December 1996 06:51:53 GMT
