Re: Default charsets for text media types [i20] from Frank Ellermann on 2008-03-26 (ietf-http-wg@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 26 Mar 2008 18:04:59 +0100
To: ietf-http-wg@w3.org
Message-ID: <fsdvj6$oep$1@ger.gmane.org>

Martin Dürst wrote:

> [I was a co-author, but that was a long time ago.]

For anyone reconstructing the history it is still
interesting how exactly the Web and the Internet at large
ended up with Unicode, later UTF-8, now net-utf8 (in
essence NFC).

My private theory of how this all happened is that Harald
considered ISO 2022 a hopeless case after seriously
trying to make it work, and you consider anything where
it is not intuitively possible to use "ü" as broken by
design.

 [old browsers hating charset parameters]
> My understanding is that this problem was corrected in
> version 3 or so of Netscape and IE, or anyway in a
> timeframe that makes it irrelevant for our current
> spec.

+1  IIRC IBM WebExplorer still had issues with it, but
HTTP/1.0 browsers not supporting Host: header fields
are irrelevant today.  And we are not updating gopher
type h, simple HTTP (0.9), or similar relics.

>| In the case where a document is accessed from a 
>| hyperlink in an origin HTML document, a CHARSET
>| attribute is added to the attribute list of 
>| elements with link semantics (A and LINK)
[...]

> [not sure how much this is implemented or in use; it's
> not directly a HTTP issue]

Yes, no HTTP issue.  I use
charset="PC-Multilingual-850+euro" in some links, but
I'm not aware of a spider or browser doing anything with
this info.  That does not necessarily make it a waste of
time - I'm also not aware of UAs looking at, say,
hreflang=, but I could add some CSS magic for it later.

>|    <META HTTP-EQUIV="Content-Type"
>|     CONTENT="text/html; charset=ISO-2022-JP">
>| 
>|   This is not foolproof, but will work if the encoding
>|   scheme is such that ASCII-valued octets stand for 
>|   ASCII characters only at least until the META element
>|   is parsed.

> [This is very, very widely used. As far as it's HTML,
>  it's nothing HTTP should be concerned, but it is highly
>  relevant for HTTP because it is dead straight against
>  any default on the charset parameter in HTTP.]

Wait a moment, it is dead straight against any default that
is *NOT* ASCII, or rather against any default that does not
contain ASCII as a proper subset.
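
A minimal sketch of that point, assuming Python's standard
codecs: an ASCII-only scan for the META element works only
if ASCII-valued octets stand for ASCII characters in the
document's encoding.  The helper name prescan_charset, its
regular expression and the 1024-octet window are
hypothetical illustrations, not taken from any spec.

```python
import re

# Hypothetical helper: scan raw octets as if they were ASCII
# and pick up the first charset=... label.
META_RE = re.compile(rb'charset=["\']?([A-Za-z0-9._:+-]+)', re.IGNORECASE)


def prescan_charset(octets):
    """Return the charset label found by an ASCII-only scan, or None."""
    match = META_RE.search(octets[:1024])  # window size is an assumption
    return match.group(1).decode("ascii") if match else None


HTML = ('<META HTTP-EQUIV="Content-Type" '
        'CONTENT="text/html; charset=ISO-2022-JP">')

# In encodings that contain ASCII, the META element is
# octet-for-octet ASCII, so the scan finds the label.
for enc in ("ascii", "latin-1", "utf-8", "iso2022_jp", "cp850"):
    assert prescan_charset(HTML.encode(enc)) == "ISO-2022-JP"

# UTF-16 interleaves extra octets, so the same scan finds nothing.
assert prescan_charset(HTML.encode("utf-16")) is None
```

The last assertion is the BOCU-1 style of problem in
miniature: once the octets of the META element are no
longer plain ASCII, the in-document label cannot be found
without already knowing the encoding.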

For the [i20] question it only tells us that we cannot pick,
say, BOCU-1 as the new default, even if that's MIME compatible.

Arguably it also tells us that the "default" does not mean
much for HTTP.  It is interesting for HTTP header fields.

For the text/* [i20] issue we might be free to pick ASCII
instead of Latin-1 if that's better for MIME compatibility,
especially for text/plain, naturally for text/xml, and no
problem for text/html.

>| see [NICOL2] for some details and a proposal.

What was NICOL2?  Was that your heuristic to "sniff" UTF-8?
The main problem I have with the "Latin-1 default" is that
it blocks a future "UTF-8 default" (talking about HTTP/1.1).
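
What [NICOL2] actually proposed is not stated here.  As a
generic illustration only of what "sniffing" UTF-8 could
look like, here is a rough sketch; the function name
sniff_text_charset and the fallback order are my
assumptions, not anybody's published heuristic.

```python
def sniff_text_charset(octets):
    """Guess a charset label for unlabelled text (illustrative only)."""
    try:
        octets.decode("ascii")
        return "us-ascii"      # pure ASCII: every candidate default agrees
    except UnicodeDecodeError:
        pass
    try:
        octets.decode("utf-8", errors="strict")
        return "utf-8"         # non-ASCII octets that form valid UTF-8
    except UnicodeDecodeError:
        return "iso-8859-1"    # any octet string is valid Latin-1


utf8_octets = "ü".encode("utf-8")      # two octets: C3 BC
latin1_octets = "ü".encode("latin-1")  # one octet: FC

assert sniff_text_charset(b"plain text") == "us-ascii"
assert sniff_text_charset(utf8_octets) == "utf-8"
assert sniff_text_charset(latin1_octets) == "iso-8859-1"

# The same two octets are also perfectly valid Latin-1, just wrong:
assert utf8_octets.decode("latin-1") == "Ã¼"
```

The last line shows why a fixed Latin-1 default gets in
the way: it never fails to decode, so it can never signal
that the text was really UTF-8.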

 Frank

Received on Wednesday, 26 March 2008 17:03:27 UTC