Re: PROPOSAL: i74: Encoding for non-ASCII headers

On Mar 25, 2008, at 6:01 PM, Mark Nottingham wrote:
> On 26/03/2008, at 11:40 AM, Roy T. Fielding wrote:
>
>>> A secondary issue is what encoding should be used in those cases  
>>> where it is reasonable to allow it. I'm not sure what the value of  
>>> requiring that it be the same everywhere is; some payloads (e.g.,  
>>> IRIs, e-mail addresses) have well-defined "natural" encodings  
>>> into ASCII that are more appropriate.
>>
>> Unless we are going to change the protocol, the answer to that question
>> is ISO-8859-1 or RFC2047.  If we are going to change the protocol, then
>> the answer would be raw UTF-8 (HTTP doesn't care about the content of
>> TEXT as long as the encoding is a superset of ASCII, so the only
>> compatibility issue here is understanding the intent of the sender).
>
> What do you mean by ISO-8859-1 *or* RFC2047 here?

I meant the existing requirements as stated in 2616 for TEXT (but only
for those rules that really should be TEXT).
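
To illustrate the "intent of the sender" point quoted above: the same
bytes on the wire read differently depending on which superset of ASCII
the recipient assumes. The following is only a rough Python sketch; the
function name is made up for illustration and nothing here is taken
from any HTTP implementation.

```python
def as_seen_by_latin1_recipient(text: str) -> str:
    """Encode text as raw UTF-8, then decode those bytes as ISO-8859-1.

    Hypothetical helper: models a sender that emits UTF-8 and a
    recipient that assumes the ISO-8859-1 default for TEXT.
    """
    return text.encode("utf-8").decode("iso-8859-1")


# ASCII survives unchanged; anything else is misread without an error.
print(as_seen_by_latin1_recipient("cafe"))
print(as_seen_by_latin1_recipient("caf\u00e9"))
```

Nothing fails at the parsing level, which is why the compatibility
issue is one of interpretation rather than syntax.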

> Even if RFC2047 encoding is in effect, the actual character set in  
> use is a subset of ISO-8859-1; no characters outside of that are  
> actually on the wire, it's just an encoding of them into ASCII.
>
> This is why I question whether it's realistic to require RFC2047,  
> given that some applications -- e.g., headers that might want to  
> carry an IRI -- are already using an encoding that's not RFC2047.

HTTP does not carry IRIs in the headers. There are many implementation
reasons why an IRI must be encoded as a URI before being used in a
protocol context (security, minimization of duplicates, decoding once
at the source instead of at every recipient, ...).
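
As a rough sketch of that encode-once-at-the-source step, assuming the
common mapping of percent-encoding the UTF-8 octets of every non-ASCII
character (the function name and safe-character set below are
illustrative choices, and the host is percent-encoded rather than
converted with IDNA):

```python
from urllib.parse import quote

# Characters left untouched: URI reserved and unreserved punctuation,
# plus "%" so that existing percent-escapes are not encoded twice.
_URI_SAFE = ":/?#[]@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-encode everything outside URI syntax.

    quote() encodes non-ASCII characters as UTF-8 octets by default.
    """
    return quote(iri, safe=_URI_SAFE)


print(iri_to_uri("http://example.com/caf\u00e9?q=\u65e5"))
# prints http://example.com/caf%C3%A9?q=%E6%97%A5
```

The result is plain ASCII, so a recipient never has to guess at a
character encoding in order to compare or dereference it.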

> Of course, you can say that they're not carrying non-ASCII  
> characters, because it's just a URI, but I'd say that's just a way  
> of squinting at the problem, and RFC2047 is yet another way of  
> squinting; it looks like it's just ASCII as well.

RFC2047 was chosen because the only fields we have that use TEXT
are actually defined by MIME and hence are already required to use
that encoding in their normal contexts.
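
For readers who have not seen the encoding, here is a small sketch of
an RFC2047 encoded-word round trip using Python's standard email
package. It is a mail library used purely for illustration, not an HTTP
API, and the helper names are invented for this example.

```python
from email.header import Header, decode_header, make_header


def encode_text(value: str, charset: str = "utf-8") -> str:
    """Hypothetical helper: wrap non-ASCII text as RFC 2047 encoded-words."""
    return Header(value, charset).encode()


def decode_text(value: str) -> str:
    """Hypothetical helper: turn encoded-words back into a Unicode string."""
    return str(make_header(decode_header(value)))


encoded = encode_text("R\u00e9sum\u00e9")
print(encoded)               # ASCII only, of the form =?charset?encoding?data?=
print(decode_text(encoded))  # the original text again
```

Only ASCII ever appears on the wire; the charset label inside the
encoded-word is what tells the recipient how to recover the text.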

>>> Mind you, personally I'm not religious about this; I just think  
>>> that if we mandate RFC2047 encoding be used in new headers that  
>>> need an encoding, we're going to be ignored, for potentially good  
>>> reasons.
>>
>> What good reasons?  In this case, we are not mandating anything.
>> We are simply passing through the one and only defined i18n solution
>> for HTTP/1.1 because it was the only solution available in 1994.
>> If email clients can (and do) implement it, then so can WWW clients.
>
> See above. Specifically, what impact does the requirement to use  
> RFC2047 have on other encodings -- is it saying that serialising an  
> IRI as a URI in an HTTP header is non-conformant?

No.  A URI is not TEXT, nor is it an IRI.  An IRI can be translated
to the actual URI it represents in a protocol context.  IRI is just
one interpretation of the encoding of a URI.

> That if another problem domain, for whatever reason, decides to  
> mint a header that uses BCP137 instead of RFC2047, that it also  
> violates HTTP? This seems a stretch to me... I'd put forth that the  
> requirement is spurious.
>
>> People who want to fix that should start queueing for HTTP/1.2.
>
> Please explain how removing the requirement that only RFC2047 be  
> used to encode non-ISO-8859-1 characters in new headers requires a  
> version bump.

Because there are no existing implementations that do what you suggested.
We could spend a ridiculous amount of time trying to draft a set of
rules by which both existing compliant messages and newly encoded
messages can be parsed correctly, but since we have NO NEED for such
a mechanism and NO IMPLEMENTATIONS to guide us, we might as well be
drafting HTTP/1.2.

>>> 2) Constrain TEXT to contain only characters from iso-8859-1.
>>
>> No, that breaks compliant senders.
>
> How? Are you saying that senders are already sending text that  
> contains non-8859-1 characters (post-encoding)?

I thought you meant disallow 2047 encoding.

>>> 3) Add advice that, for a particular context of use, other  
>>> characters MAY be encoded (whether that's strictly RFC2047, or  
>>> more fine-grained advice TBD) by specifying it in that context.
>>> 4) Add new issues for dealing with specific circumstances (e.g.,  
>>> From, Content-Disposition, Warning) as necessary. If the outcome  
>>> of #3 is to require RFC2047, this is relatively straightforward.
>>
>> There is no great need that has been established to support any
>> changes to the allowed TEXT encoding other than to separate the
>> rules that don't actually allow that encoding.  IMO, changes to
>> HTTP/1.1 must be motivated by actual implementations.
>
> Could be. Again, my main concern here is to take the blanket  
> requirement away and make it more focused.

Then we can just rephrase the requirement so that it isn't blanketed.

....Roy

Received on Wednesday, 26 March 2008 18:33:25 UTC