Re: PROPOSAL: i74: Encoding for non-ASCII headers

Hello Mark, Roy, others,

At 03:32 08/03/27, Roy T. Fielding wrote:
>
>On Mar 25, 2008, at 6:01 PM, Mark Nottingham wrote:
>> On 26/03/2008, at 11:40 AM, Roy T. Fielding wrote:
>>
>>>> A secondary issue is what encoding should be used in those cases  
>>>> where it is reasonable to allow it. I'm not sure what the value of  
>>>> requiring that it be the same everywhere is; some payloads (e.g.,  
>>>> IRIs, e-mail addresses) have well-defined "natural" encodings  
>>>> into ASCII that are more appropriate.
>>>
>>> Unless we are going to change the protocol, the answer to that  
>>> question
>>> is ISO-8859-1 or RFC2047.  If we are going to change the protocol,  
>>> then
>>> the answer would be raw UTF-8 (HTTP doesn't care about the content of
>>> TEXT as long as the encoding is a superset of ASCII, so the only
>>> compatibility issue here is understanding the intent of the sender).
>>
>> What do you mean by ISO-8859-1 *or* RFC2047 here?
>
>I meant the existing requirements as stated in 2616 for TEXT (but only
>for those rules that really should be TEXT).
>
>> Even if RFC2047 encoding is in effect, the actual character set in  
>> use is a subset of ISO-8859-1; no characters outside of that are  
>> actually on the wire, it's just an encoding of them into ASCII.

Mark, I think you have to be explicit about when you mean the bytes
on the wire, when you mean the characters they represent, and
when you mean the process or method of encoding. It's usually
pretty clear from context, but when reading your earlier emails,
I got as confused as Roy.
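
To make this concrete, here is the distinction in miniature, as a
small Python sketch (the example value is made up):

    # Three views of the same RFC 2047 encoded-word:
    #   bytes on the wire:  =?iso-8859-1?q?caf=E9?=   (pure ASCII)
    #   characters meant:   "café"
    #   encoding process:   RFC 2047 Q-encoding of iso-8859-1 into ASCII
    from email.header import decode_header
    fragment, charset = decode_header('=?iso-8859-1?q?caf=E9?=')[0]
    fragment.decode(charset)   # 'café'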


>> This is why I question whether it's realistic to require RFC2047,  
>> given that some applications -- e.g., headers that might want to  
>> carry an IRI -- are already using an encoding that's not RFC2047.
>
>HTTP does not carry IRIs in the headers. There are many implementation
>reasons why an IRI must be encoded as a URI before being used in a
>protocol context (security, minimization of duplicates, decoding once
>at the source instead of at every recipient, ...).

Just an aside:
I don't really understand the "decoding once at the source" argument;
it might argue for encoding IRIs to URIs before putting them e.g.
in an href attribute, but it doesn't apply that much in an HTTP
context (in many cases, the server actually decodes the URI back
to bytes, which may be looked up directly as UTF-8 in the file system).
If it refers to chains of proxies, I don't think these are usually
long enough to make this a serious issue.
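
As an illustration of that round trip, here is a minimal Python
sketch (iri_to_uri is a hypothetical helper, not a full RFC 3987
implementation):

    # Minimal IRI -> URI conversion: percent-encode the UTF-8 bytes of
    # every non-ASCII character, leave ASCII alone.
    def iri_to_uri(iri):
        return ''.join(
            ch if ord(ch) < 0x80
            else ''.join('%%%02X' % b for b in ch.encode('utf-8'))
            for ch in iri)

    uri = iri_to_uri('http://example.org/r\u00e9sum\u00e9')
    # -> 'http://example.org/r%C3%A9sum%C3%A9'

    # Many servers undo this right away: percent-decode to bytes and
    # look the result up as UTF-8 in the file system.
    from urllib.parse import unquote_to_bytes
    unquote_to_bytes(uri.rsplit('/', 1)[1])   # b'r\xc3\xa9sum\xc3\xa9'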


>> Of course, you can say that they're not carrying non-ASCII  
>> characters, because it's just a URI, but I'd say that's just a way  
>> of squinting at the problem, and RFC2047 is yet another way of  
>> squinting; it looks like it's just ASCII as well.
>
>RFC2047 was chosen because the only fields we have that use TEXT
>are actually defined by MIME and hence are already required to use
>that encoding in their normal contexts.

The choice of adopting the basic email/MIME mechanisms was a good one.
In many areas, that was followed up by careful selection and some
changes (e.g. for line breaks, for entity boundaries, ...). I very
much wish a bit more restriction had been applied to the jungle of
escaping conventions in mail headers (an issue completely independent
of internationalization).


>>>> Mind you, personally I'm not religious about this; I just think  
>>>> that if we mandate RFC2047 encoding be used in new headers that  
>>>> need an encoding, we're going to be ignored, for potentially good  
>>>> reasons.
>>>
>>> What good reasons?

Mostly, ease of implementation. See below.


>>> In this case, we are not mandating anything.
>>> We are simply passing through the one and only defined i18n solution
>>> for HTTP/1.1 because it was the only solution available in 1994.

"The only solution available in 1994" is a far stretch. Unicode
was available then. But I agree that the choice, in 1994 terms,
wasn't as bad as an idea as it looks now.


>>> If email clients can (and do) implement it, then so can WWW clients.

Email clients implement RFC 2047. They do NOT implement the strange
mixture of iso-8859-1 and RFC 2047 sketched out in the HTTP spec.
Many libraries are available for RFC 2047; the same cannot be said
for that mixture. Correctly encoding or decoding an
iso-8859-1/RFC 2047 mixture is not trivial, even assuming that an
RFC 2047 library is available.
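
To make the difficulty concrete, here is roughly what a decoder for
such a mixed value has to do, as a Python sketch (real values have
more corner cases, e.g. encoded-words adjacent to quoted strings):

    # Decode a TEXT value that may mix raw iso-8859-1 with RFC 2047
    # encoded-words.
    from email.header import decode_header

    def decode_mixed_text(raw):
        parts = []
        for fragment, charset in decode_header(raw):
            if isinstance(fragment, bytes):
                # Unencoded fragments default to iso-8859-1 per RFC 2616.
                fragment = fragment.decode(charset or 'iso-8859-1')
            parts.append(fragment)
        return ''.join(parts)

    decode_mixed_text('caf\xe9 =?utf-8?q?na=C3=AFve?=')   # 'café naïve'

With raw UTF-8, the whole job would be a single bytes-to-string decode.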


>> That if another problem domain, for whatever reason, decides to  
>> mint a header that uses BCP137 instead of RFC2047, that it also  
>> violates HTTP? This seems a stretch to me... I'd put forth that the  
>> requirement is spurious.
>>
>>> People who want to fix that should start queueing for HTTP/1.2.
>>
>> Please explain how removing the requirement that only RFC2047 be  
>> used to encode non-ISO-8859-1 characters in new headers requires a  
>> version bump.
>
>Because there are no existing implementations that do what you  
>suggested.
>We could spend a ridiculous amount of time trying to draft a set of
>rules by which both existing compliant messages and newly encoded
>messages can be parsed correctly, but since we have NO NEED for such
>a mechanism and NO IMPLEMENTATIONS to guide us, we might as well be
>drafting HTTP/1.2.

There is the question of what to say about new headers.
I think it's clear that the possibility of new headers should
be left open in HTTP/1.1, and that we don't want to require HTTP/1.2
for that. I think it's also clear that new headers, by definition,
are not implemented yet. So the question arises: if somebody defines
a new header, what should they do in terms of internationalization
if they need it?

Now my understanding is that when it comes to specific (in particular
new) headers, the average HTTP/1.1 implementation doesn't do much
of anything. Indeed, Roy wrote above:
   HTTP doesn't care about the content of TEXT as long as the encoding
   is a superset of ASCII,...
On the other hand, those components (plugins, modules, scripts, you
name it) that actually get written to deal with new header functionality
have to do the actual work of encoding/decoding characters. It is
very clear that for such applications, using UTF-8 is much easier
than using the ugly mixture of iso-8859-1 and RFC 2047.
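
To illustrate the difference in sender-side effort, a rough Python
sketch (line folding and length limits are left aside):

    # With raw UTF-8, encoding a new header value is one line:
    value = 'Gr\u00fc\u00dfe'
    wire = value.encode('utf-8')

    # With the RFC 2616 TEXT rules, the sender must first try
    # iso-8859-1 and fall back to an RFC 2047 encoded-word:
    from email.header import Header

    def encode_text(value):
        try:
            return value.encode('iso-8859-1')
        except UnicodeEncodeError:
            return Header(value, 'utf-8').encode().encode('ascii')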

Now Roy has argued, as far as I understand it, that we can't
possibly allow new headers to use UTF-8, "[u]nless we are going
to change the protocol". If this is about the protocol as written
in RFC 2616, this is probably true (although see below), but what
we are supposed to be most interested in is the protocol as implemented.
Just a few days ago, we have had what's probably the first
report of some actual iso-8859-1 data (some Spanish with
a wrong accent). But how much is that in terms of implementations?

In other words, how much is actually going to break if we allow
new headers to use UTF-8, and they go ahead and use it? How does
that minor potential breakage compare to the implementation
benefits for new headers using UTF-8? Is it worth bumping the
protocol version to 1.2 just for this minor fix? Is it a
mistake to admit that we have learned a bit between 1994
and 2008? Please answer for yourself.

On top of that, Mark was looking at whether we can prohibit
e.g. something along the lines of BCP 137. These things,
including RFC 2047, BCP 137-style conventions, IRI->URI conversion,
punycode, ..., are called ASCII-compatible encodings (ACE for
short), because they encode something (a large set of characters)
into something that to the lower level looks just like ASCII.
Well, it is also possible to see UTF-8 as a Latin-1-compatible
encoding (L1CE, anybody?). The lower layer (HTTP) may still see
it as iso-8859-1 bytes, but the higher layer will interpret it
as UTF-8.
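
In code terms (a small Python sketch):

    # One sequence of bytes, two layers: HTTP can keep treating it as
    # iso-8859-1 while the application layer reads it as UTF-8.
    wire = 'r\u00e9sum\u00e9'.encode('utf-8')       # b'r\xc3\xa9sum\xc3\xa9'
    lower = wire.decode('iso-8859-1')               # HTTP's view: 'rÃ©sumÃ©'
    higher = lower.encode('iso-8859-1').decode('utf-8')   # 'résumé' again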

>>>> 2) Constrain TEXT to contain only characters from iso-8859-1.
>>>
>>> No, that breaks compliant senders.
>>
>> How? Are you saying that senders are already sending text that  
>> contains non-8859-1 characters (post-encoding)?
>
>I thought you meant disallow 2047 encoding.

I agree with Roy here. Disallowing 2047 and leaving only iso-8859-1
would indeed be a bad idea.


Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Thursday, 27 March 2008 05:34:49 UTC