Re: Unknown text/* subtypes [i20]

Roy T. Fielding wrote:

> Servers, OTOH, send text/* content with the assumption that it will be
> treated as iso-8859-1 (or at least some safe superset of US-ASCII).

That could be US-ASCII itself, windows-1252, UTF-8, and a bunch of
similar windows-xxxx, iso-8859-x, or other "unknown-ascii" supersets
(not counting UTF-1, UTF-7, or weirder charsets for obvious reasons).

> None of these implementations assume that a missing charset means
> US-ASCII.  We cannot "pass the buck" to MIME because we are still
> not MIME-compliant and never will be (see Content-Encoding).

All sound ASCII supersets have one thing in common: the 128 US-ASCII
octets, i.e. the range U+0000 up to U+007F.
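To illustrate that shared subset, here is a minimal Python sketch (the
charset names are spelled as Python's codec registry knows them; the
helper name is mine, not anything standardized): any octet sequence
confined to 0x00-0x7F decodes to the same text under every one of
these supersets.

```python
# Charsets under discussion that are sound US-ASCII supersets
# (names as understood by Python's codec registry).
SUPERSETS = ["ascii", "windows-1252", "iso-8859-1", "utf-8"]

def is_pure_ascii(data: bytes) -> bool:
    """True if every octet falls in the shared US-ASCII range 0x00-0x7F."""
    return all(b < 0x80 for b in data)

sample = b"Content-Type: text/plain"
assert is_pure_ascii(sample)

# For pure-ASCII input, all of these charsets decode identically.
decoded = {cs: sample.decode(cs) for cs in SUPERSETS}
assert len(set(decoded.values())) == 1
```

So as long as content really stays inside that range, the choice of
default among these supersets is invisible to the recipient.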

> iso-8859-1 is still the most interoperable default *with* the
> addition of safe sniffing only when the charset is left unlabeled
> or when charset="iso-8859-1".

Any of these US-ASCII supersets could serve as the default.  The
problems caused by an explicit iso-8859-1 label where that is not
true cannot get worse, and unfortunately also not better, with
another default.

> In other words, it is safe to sniff for charsets in the first ten
> or so characters, and also to switch to other US-ASCII supersets
> after reading something like the <meta http-equiv="content-type"

A US-ASCII, windows-1252, or UTF-8 default would not change that.
And US-ASCII is the best approximation of "unknown-ascii" we have
at the moment.
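For concreteness, a rough sketch of what such "safe sniffing" on the
leading bytes could look like (this is only an illustration of the
idea from the thread, not any spec's algorithm; the function name and
the default parameter are my own):

```python
import re

def sniff_charset(head: bytes, default: str = "us-ascii") -> str:
    """Guess a charset from the leading bytes of a text/* entity.

    A UTF-8 BOM or an ASCII-compatible in-band label (an HTML
    <meta http-equiv="content-type"> or an XML encoding declaration)
    may override the default; otherwise the default stands.
    """
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # e.g. <meta http-equiv="content-type"
    #        content="text/html; charset=iso-8859-1">
    m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    # e.g. <?xml version="1.0" encoding="utf-8"?>
    m = re.search(rb'encoding\s*=\s*["\']([A-Za-z0-9_.:-]+)["\']', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return default
```

The key property is that the sniffing itself only inspects octets in
the shared US-ASCII range (plus the BOM), so it works the same no
matter which of the candidate supersets turns out to be in effect.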

If you decide that it's not good enough, the "charset list" already
discussed registering "unknown-ascii" in addition to the existing
"unknown-8bit" some months ago.  And we could make sure that this
default pseudo-charset by definition won't cover UTF-1, UTF-7, or
similar abominations.

But I think a US-ASCII default does precisely what you want, and
I fail to see how this could break existing HTTP implementations.

 Frank

Received on Tuesday, 12 February 2008 22:18:22 UTC