Re: Unknown text/* subtypes [i20] from Roy T. Fielding on 2008-02-15 (ietf-http-wg@w3.org from January to March 2008)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Thu, 14 Feb 2008 19:01:57 -0800
To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: ietf-http-wg@w3.org
Message-Id: <70126EB9-128E-4A02-8A3F-057048523E8E@gbiv.com>

It looks like 1/2 of your response is about small changes to the
text that is being deleted, and another 1/4 about the bits left
after the last change, and only the last 1/4 about my proposed
rewrite.  That is really confusing.

I'm just going to skip ahead to the comments on my proposed text...

On Feb 14, 2008, at 9:18 AM, Frank Ellermann wrote:
> Roy T. Fielding wrote:
>
> ACK.  Potential issues in your version:
>
>> : When a media type is registered with a default charset value
>> : of "US-ASCII", it MAY be used to label data transmitted via
>> : HTTP in the "iso-8859-1" charset (a superset of US-ASCII)
>> : without including an explicit charset parameter on the media
>> : type.
>
> For 2616bis that should not be a valid option (MAY), it should be
> a *violation* of a new SHOULD for the stated historical reason.
> Going from MAY to SHOULD NOT is possible, nothing breaks.

That would change the protocol such that all currently compliant
HTTP senders that transmit text messages in "iso-8859-1" without
a charset parameter would be violating a SHOULD requirement.
My proposal states the fact that such messages do occur in practice
and alters the MIME requirement for HTTP to accommodate them.
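
As a rough illustration of the first variance (a sketch only, not text
from the draft): a recipient that finds no charset parameter on a
text/* media type received via HTTP falls back to "iso-8859-1" rather
than the MIME default of "US-ASCII".  The function name and the
shortcut of treating every text/* subtype as registered with a
US-ASCII default are assumptions made for the example.

```python
from typing import Optional


def http_effective_charset(content_type: str) -> Optional[str]:
    """Return the charset a recipient would assume for an HTTP body.

    Assumption for this sketch: every text/* subtype is treated as if
    it were registered with a default charset of US-ASCII.
    """
    media_type, _, params = content_type.partition(";")
    for param in params.split(";"):
        name, _, value = param.partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "charset" and value:
            # An explicit charset parameter always wins.
            return value.lower()
    if media_type.strip().lower().startswith("text/"):
        # HTTP variance: unlabeled text is assumed to be iso-8859-1.
        return "iso-8859-1"
    # Non-text types have no default charset in this sketch.
    return None
```

Under these assumptions, "text/plain" yields "iso-8859-1", while
"text/html; charset=UTF-8" yields "utf-8".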

>> : In addition, when a media type registered with a default
>> : charset value of "US-ASCII" is received via HTTP without a
>> : charset parameter or with a charset value of "iso-8859-1",
>> : the recipient MAY inspect the data for indications of a
>> : different character encoding
> [...]
>
> That is convoluted.  Certainly it "MAY" try to determine the
> charset by sniffing if there is no charset, arguably it "must"
> (lower case) do this for the (non-HTTP) purpose of displaying
> a document.  And it "MAY" do this whenever it wishes, the case
> of an erroneous iso-8859-1 IMO does not justify an HTTP "MAY".

If browsers are willing to implement that, fine.

>> : if the encoding can be determined within the first 16 octets
>> : of data and interpreted consistently thereafter.
>
> Please no arbitrary magic numbers like "16" in a standard, let
> alone in a standard where the complete "sniffing" business is
> off topic.

It is more important that it works (or that we find out it doesn't).
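
For concreteness, here is one way a recipient might limit detection
to the first 16 octets.  This is a sketch under assumptions: the
proposed text does not say which indications to look for, so the
example checks only for Unicode byte order marks, and the names
SNIFF_WINDOW and sniff_charset are invented for illustration.

```python
import codecs

# The 16-octet limit comes from the proposed text; the name is hypothetical.
SNIFF_WINDOW = 16

# UTF-32 marks are listed before UTF-16 because the UTF-32-LE mark
# begins with the same two octets as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_charset(body: bytes, declared: str = "iso-8859-1") -> str:
    """Return a charset detected in the first 16 octets, else `declared`.

    Assumption: a byte order mark is the only indication examined.
    """
    head = body[:SNIFF_WINDOW]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # Nothing found inside the window: keep the declared or default value.
    return declared
```

Whatever is decided inside that window is then applied to the whole
body, which is the "interpreted consistently thereafter" half of the
requirement.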

>> : Note: The first variance is due to a significant portion of
>> : early HTTP user agents not parsing media type parameters and
>> : instead relying on a then-common default encoding of iso-8859-1.
>> : As a result, early server implementations avoided the use of
>> : charset parameters and user agents evolved to "sniff" for new
>> : character encodings as the Web expanded beyond iso-8859-1
>> : content.
>
> Yes, and (as you noted in another article) servers have no time
> for any sniffing on their side for dynamic content.  But that
> does not justify a "variance" going as far as an option (MAY),
> violating a SHOULD NOT is good enough for this historical case.

Sorry, that decision was made in 1994 and is now way out of scope.

> I don't see why 2616bis should try to overrule text/xml defaults
> with a MAY, as HTTP certainly does not try to tell clients what,
> say, an image/x-icon might be, or how to display it.

Then you don't know (or don't care) what the MIME specs say.  I do.
It was an intentional decision based on the needs of different
protocols.  The other alternative would be to define a separate
media type registration system, which was considered more harmful
than simply stating the differences and noting the requirements
for translating an HTTP-compliant message to a MIME-compliant
message.

>> : The second variance is due to a certain popular user agent that
>> : employed an unsafe encoding detection and switching algorithm
>> : within documents that might contain user-provided data (see
>> : Section security.sniffing), the most common workaround for
>> : which is to supply a specific charset parameter even when the
>> : actual character encoding is unknown.
>
> No.  Plausible reasons why servers might intentionally lie with
> "iso-8859-1" do not belong in an Internet standard.  If a UA is
> broken it needs to be fixed.  Servers could also try their luck
> with the registered "unknown-8bit" instead of lying, this is out
> of scope for HTTP.

Then get back to us when you have fixed that user agent.  Sending
any charset that is invalid/unknown to that user agent will fail to
trigger the one safe path that allows us to work around its stupid
bugs.  All I am trying to explain in that paragraph is why the
theory of "servers should just leave the charset empty" is never
going to happen in the foreseeable future.  Otherwise, I am happy
to mark the issue as WONTFIX and let the browsers deal with their
own bugs.

....Roy
Received on Friday, 15 February 2008 03:02:06 UTC