Re: IRIs, IDNAbis, and HTTP

Martin Duerst wrote:
 
> It's all those protocol fields where you need human-readable
> text. The Subject: of an email very clearly qualifies as
> text, so it's not only body.

Using RFC 2047 for <unstructured> header field bodies is
straightforward, no conflict with RFC 2277.  Apparently HTTP
has no unstructured fields, so that might be irrelevant.

Brian has already mentioned comments, using RFC 2047 "as is"
(no 2231 parameter-folding) should be no issue for 2616bis
comments.  I've no idea why anybody would want 4646 language
tags in a comment, but that part of RFC 2231 would also be
straightforward.  And if the other side has no clue what an
encoded comment with language tags is, nothing will break -
or rather I'm not aware of scenarios where comments could be
critical.
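
For illustration, a minimal sketch of what such an encoded comment
could look like, using the RFC 2047 "B" encoding with the optional
RFC 2231 language suffix (charset*language).  The helper names
encode_word and decode_word are hypothetical, only UTF-8 with "B" is
handled, and the 75 character limit per encoded-word is not enforced:

```python
import base64
import re


def encode_word(text: str, lang: str = "") -> str:
    """Build one encoded-word, optionally with an RFC 2231 language tag."""
    charset = "UTF-8" + ("*" + lang if lang else "")
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?{charset}?B?{payload}?="


# charset, optional "*language", then the base64 payload ("B" only).
_WORD = re.compile(r"=\?([^?*]+)(?:\*([^?]+))?\?[Bb]\?([A-Za-z0-9+/=]*)\?=")


def decode_word(word: str):
    """Return (text, language); language is None if no tag was present."""
    m = _WORD.fullmatch(word)
    if m is None:
        return word, None  # not an encoded-word, pass through unchanged
    charset, lang, payload = m.groups()
    return base64.b64decode(payload).decode(charset), lang


# A comment is just the encoded-word wrapped in parentheses.
comment = "(" + encode_word("Grüße", "de") + ")"
```

A receiver that knows nothing about encoded-words simply sees an
ASCII comment it can skip, which is why nothing should break.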

> Can you show how allowing e.g. new HTTP header fields to
> use UTF-8 would break anything in the installed base?

I'm sure that the mentioned HTTP/1.0 browser didn't support
UTF-8, and I'm sure that I have not yet tested an IE6 plugin
claiming to offer some kind of IDN support.  Hard to prove a
negative, but FF2 didn't like non-UTF-8 <ipath> in your test
suite, <ihost> and UTF-8 worked, from that I *guess* UTF-8 can
work for this browser.  

2616bis is supposed to work with any server and UA since the
times of RFC 2068, that's why I think we can at best get rid
of the "default Latin-1", but not simply replace it by UTF-8.

Your heuristics to distinguish Latin-1 and UTF-8 depend on
finding 0x80..0x9F (=> potential UTF-8 trail byte, likely no
C1), 0xC0..0xC1 (=> can't be UTF-8), or 0xFE..0xFF (ditto),
and further fine tuning opportunities for the STD 63 UTF-8.

But it's not guaranteed to work for short strings in a header
field, and there's no way to put it into old servers and UAs
supporting only Latin-1.  
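
A rough sketch of this kind of guess, assuming Python and a
hypothetical helper guess_charset.  It does not test the individual
byte ranges listed above; it leans on a strict UTF-8 decoder, which
already rejects 0xC0..0xC1, 0xF5..0xFF, and misplaced trail bytes:

```python
def guess_charset(raw: bytes) -> str:
    """Guess 'ascii', 'utf-8', or 'latin-1' for a header field value."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding fails on overlong forms, invalid lead bytes,
        # and trail bytes that do not follow a lead byte.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


# The short-string problem: two Latin-1 characters whose bytes happen
# to form one valid UTF-8 sequence are guessed as UTF-8.
ambiguous = "Ã©".encode("latin-1")  # bytes 0xC3 0xA9
```

Here guess_charset(ambiguous) answers "utf-8" although the sender
meant Latin-1, so a short field value can be misread either way.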

> There is a big difference between MIME (ASCII+RFC2045) and
> HTTP (iso-8859-1+RFC2045).

That's why I'd prefer to get rid of "default Latin-1", going
from US-ASCII to UTF-8 later (after 2616bis) is hopefully
simpler than from Latin-1 to UTF-8.  Using a BOM for this
magic is dubious, 0xEFBBBF is valid Latin-1.  
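
A small Python check of the BOM point (the variable names are only
for illustration): the same three bytes decode without error under
both interpretations, so their presence alone proves nothing.

```python
BOM = b"\xef\xbb\xbf"

# As UTF-8 the three bytes are the single code point U+FEFF.
as_utf8 = BOM.decode("utf-8")

# As Latin-1 they are three ordinary printable characters.
as_latin1 = BOM.decode("latin-1")
```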

> I have yet to see a case where the absence of language 
> information in a header is a problem in practice. Do you
> know of any?

No.  It would need to be something that's displayed, parsed,
voice output, dunno, anything where the language or script
helps.  Excluding eur-EU, mis-EA, zxx-DG, und-IC, etc. <eg>

 Frank

Received on Monday, 17 March 2008 05:19:36 UTC