Re: Unknown text/* subtypes [i20] from Frank Ellermann on 2008-02-15 (ietf-http-wg@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 15 Feb 2008 17:32:08 +0100
To: ietf-http-wg@w3.org
Message-ID: <fp4en0$ol5$1@ger.gmane.org>
Roy T. Fielding wrote:
 
> It looks like 1/2 of your response is about small changes to the
> text that is being deleted, and another 1/4 about the bits left
> after the last change, and only the last 1/4 about my proposed
> rewrite.  That is really confusing.

Well, I was confused: I found your long "p3" text and, seeing that
it went on and on, thought that was your proposal, so I commented
inline as well as I could.  About an hour later I arrived at the cut
where you started your real/new proposal.

After that I marked the first part as "p3" keeping ">|" as quote
indicator, and used ">:" for your real/new proposal.  Whatever
"p3" actually is, it is what you wrote, and took about 3/4 of my
reply, yielding a complete proposal based on "p3".

The last 1/4 was what you said above.

>> For 2616bis that should be no valid option (MAY), it should be
>> a *violation* of a new SHOULD for the stated historical reason.
>> Going from MAY to SHOULD NOT is possible, nothing breaks.
 
> That would change the protocol such that all currently compliant
> HTTP senders that transmit text messages in "iso-8859-1" without
> a charset parameter would be violating a SHOULD requirement.

Yes, that is the point of getting rid of "default Latin-1", which
was quite popular in this long "unknown text/* subtypes" thread,
and before.   IIRC Martin proposed to get rid of it back in 2006,
admittedly that was about 2617bis, and at that time I screamed.

The HTML5 WG apparently wants to replace Latin-1 by windows-1252
to some degree, I didn't check the details, only what I saw when
trying to figure out interesting points in the "HTML5 diff" draft.

> My proposal states the fact that such messages do occur in
> practice

Sure, but that's not more desirable.  As you explained earlier it
used to be a kind of hack 15 years ago, and one of the early RFCs 
already said "we'll do Unicode a.s.a.p.".

>>>: if the encoding can be determined within the first 16 octets
>>>: of data and interpreted consistently thereafter.

>> Please no arbitrary magic numbers like "16" in a standard, let
>> alone in a standard where the complete "sniffing" business is
>> off topic.
 
> It is more important that it works (or that we find out it 
> doesn't).

IMO it is a design principle: stay away from magic numbers unless
you really must have them, because then folks will ask "why 16 if
512 is a much better buffer size ?"  That could be ignored as a
stupid question - unless it turns out that 512 is really better.

HTTP in its role as "TP" doesn't need to sniff, so why specify it
at all here ?  Whatever browsers do, it also has to work with
other protocols (or other URI schemes to cover the file: case).
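
To make the magic-number concern concrete, here is a minimal Python
sketch of a sniffer whose window size is a parameter.  It is not taken
from the proposal, which only says "within the first 16 octets"; the
function name sniff_encoding and the two heuristics (byte order mark,
XML encoding declaration) are assumptions chosen for illustration.

```python
import codecs
import re

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumed heuristic: an ASCII-compatible XML declaration naming its encoding.
XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def sniff_encoding(body: bytes, window: int):
    """Guess an encoding from the first `window` octets, or return None."""
    head = body[:window]
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    match = XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return None


body = b'<?xml version="1.0" encoding="koi8-r"?><doc/>'
print(sniff_encoding(body, 16))   # the declaration does not fit in 16 octets
print(sniff_encoding(body, 512))  # a larger window sees the declaration
```

Under these assumptions a BOM always fits in 16 octets, but an XML
encoding declaration does not, so the answer depends on the window.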

>> (as you noted in another article) servers have no time for any
>> sniffing on their side for dynamical content.  But that does 
>> not justify a "variance" going as far as an option (MAY),
>> violating a SHOULD NOT is good enough for this historical case.
 
> Sorry, that decision was made in 1994 and is now way out of scope.

It's 2008 now, the old browsers choking on an explicit charset did
HTTP/1.0 without a Host: header field, they are gone.  I tried to
use "IBM Webexplorer" in this millennium a few times, they really
are hopeless.  

A decision in 1994, before UTF-8 existed, is not necessarily good
enough today.  UTF-7 was published in 1994, it was cute, but we all
agree to deprecate it somehow for today.  

>> I don't see why 2616bis should try to overrule text/xml defaults
>> with a MAY, as HTTP certainly does not try to tell clients what
>> a say image/x-icon might be, and how to display it.
 
> Then you don't know (or don't care) what the MIME specs say.

I know that image/vnd.microsoft.icon is a registered MIME type, and
I think that 2616bis doesn't need to talk about it.

>> Plausible reasons why servers might intentionally lie with
>> "iso-8859-1" do not belong in an Internet standard.  If an UA is
>> broken it needs to be fixed.  Servers could also try their luck
>> with the registered "unknown-8bit" instead of lying, this is out
>> of scope for HTTP.
 
> Then get back to us when you have fixed that user agent.

If we are talking about IE6 among others, I get monthly fixes for
security issues in this beast.  If something in IE6 is so horribly
wrong that it affects what 2616bis will say in 2009 I'd like to know
precisely what it is.

What you propose boils down to "iso-8859-1" means "dunno", and it
is required to disable some unclear (for me) bug in IE versions.
Likewise no charset means "iso-8859-1" and it is required for old
Mosaic derivatives back in 1994.

And servers wishing to announce a real Latin-1 text can pick what
they like better, but no client (apart from IE and Mosaic with
their own problems) can actually believe them.

Do you see how *odd* that sounds, when I extract it from your 2119
prose ?  Is "iso-8859-1" = "dunno" really what you want in 2616bis ?
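
Spelled out as client logic, the reading above might look like the
following Python sketch.  It encodes only my paraphrase of the
proposal, not its actual wording; the function name effective_charset
is an assumption, and None stands for "dunno".

```python
from email.message import Message


def effective_charset(content_type: str):
    """Return the charset a client could rely on, or None for "dunno"."""
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lowercased, None when absent
    if declared is None:
        # Historical HTTP default for text/* without a charset parameter.
        declared = "iso-8859-1"
    if declared == "iso-8859-1":
        # Under the paraphrased reading this label cannot be trusted.
        return None
    return declared


print(effective_charset("text/plain"))
print(effective_charset("text/plain; charset=ISO-8859-1"))
print(effective_charset("text/plain; charset=utf-8"))
```

In this sketch a server sending genuine Latin-1 text has no way to
produce a result other than None, whether it labels the text or not.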

If HTTP servers wish to use this hack, why can't they limit this
effort depending on the User-Agent ?  
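
A rough Python sketch of what limiting the hack by User-Agent could
look like on the server side.  The LEGACY_UA pattern is purely
hypothetical, since the affected browsers are not identified precisely
in this thread, and content_type_for is an illustrative name.

```python
import re

# Hypothetical pattern for agents assumed to choke on a charset parameter.
LEGACY_UA = re.compile(r"Mosaic|WebExplorer", re.IGNORECASE)


def content_type_for(user_agent: str, charset: str = "iso-8859-1") -> str:
    """Omit the charset parameter only for agents matching LEGACY_UA."""
    if LEGACY_UA.search(user_agent or ""):
        return "text/html"
    return f"text/html; charset={charset}"


print(content_type_for("NCSA_Mosaic/2.0 (Windows x86)"))
print(content_type_for("Mozilla/5.0"))
```

A server doing this would presumably also want to send
"Vary: User-Agent" so that caches do not mix the two variants.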

 Frank
Received on Friday, 15 February 2008 16:30:52 UTC