Re: Unknown text/* subtypes from Frank Ellermann on 2007-12-28 (www-archive@w3.org from January 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 28 Dec 2007 03:26:46 +0100
To: ietf-types@alvestrand.no
Cc: ietf-http-wg@w3.org
Message-ID: <fl1mtc$d82$1@ger.gmane.org>

Martin Duerst wrote:

> The new version of the HTTP spec, 2616bis, should definitely
> drop the iso-8859-1 default, but NOT in favor of "unknown
> text is ASCII".  It should just say that there is no default.

A MIME entity with "default ASCII" using any 1xxx xxxx octets
(high bit set) is erroneous.  With "default ASCII" 2616bis would
be consistent with MIME, which is good.  We have no
"unknown-7bit" charset for
unidentified "ASCII compatible" encodings (for octets 0..127),
and the "default ASCII" is an emulation for such dubious cases,
same idea as in mail.
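
To make the 1xxx xxxx point concrete, here is a minimal Python
sketch.  The function name is hypothetical and only illustrates
the octet test, it is not taken from MIME or 2616bis:

```python
def has_high_bit_octets(body: bytes) -> bool:
    """Return True if any octet in the body is outside 0..127.

    Hypothetical helper: under a "default ASCII" assumption such a
    body is erroneous unless a charset is declared somewhere.
    """
    # Iterating over bytes yields integers in the range 0..255.
    return any(octet > 127 for octet in body)
```

A pure ASCII body passes, while the same text with one accented
letter in UTF-8 or Latin-1 does not.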

Years later (after 2616bis) it might be possible to upgrade
"default ASCII" to UTF-8; Latin-1 was a dead end.  As soon as
we're back to "default ASCII", just let RFC 2277 finish it off.

> There is a big difference between these two, especially for 
> document formats that contain internal 'charset' information.
> A default of US-ASCII makes document-internal 'charset'
> information useless (because the external information wins).

Right, that must not happen.  IMO a "default" is an assumption
made only if no better info is available.  For HTTP it also
limits what can be used in *headers* (no message/rfc822 vs.
message/global abstractions necessary, HTTP isn't UTF8SMTP).

The *body* contains octets, only 0..127 can be interpreted as
ASCII, anything else needs an explicit declaration somewhere -
"internal" would be fine for many users who can't change the
"external" declaration.

That's actually the same issue as today with an external
"default Latin-1": the internal UTF-8 / KOI8-R / windows-1252
(etc.) declaration wins if there is no explicit statement from
the server.  Otherwise my non-ASCII Web pages wouldn't validate,
but they do.
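
The order of precedence I have in mind could be sketched like
this in Python.  The function name, the 1024-octet scan window,
and the simplified regular expressions are my assumptions for
illustration; real HTML charset sniffing is more involved:

```python
import re

# Simplified pattern for an internal declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=koi8-r">
_INTERNAL = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

# Simplified pattern for the charset parameter of a Content-Type header.
_EXTERNAL = re.compile(r"charset\s*=\s*\"?([^\s;\"]+)", re.IGNORECASE)


def pick_charset(content_type: str, body: bytes, default: str = "us-ascii") -> str:
    """Pick a charset: external statement, then internal, then default."""
    # 1. An explicit statement from the server wins.
    match = _EXTERNAL.search(content_type)
    if match:
        return match.group(1).lower()
    # 2. Otherwise look for an internal declaration near the start
    #    of the body (the scan window is an arbitrary assumption).
    match = _INTERNAL.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Only now does the "default" apply, as a last assumption.
    return default
```

With this order the default never overrides an internal
declaration, whether that default is Latin-1 or ASCII.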

> One reason for the problems with text/xml was that the
> original MIME default of US-ASCII was enforced. This made
> it impossible to serve XML documents with internal 'charset'
> information only as text/xml.

The odd text/xml case is different: there's a MUST somewhere
in the text/xml spec.  But nobody treats text/html as "default
Latin-1" while ignoring the internal declaration.  The W3C
validator even enforces its very own UTF-8 default for HTML 2,
where it really should be Latin-1; maybe we could report this
as a bug :-)

 Frank

Received on Saturday, 12 January 2008 18:01:29 UTC