Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

On 20/08/2007, at 5:54 PM, Martin Duerst wrote:
> What do you mean by "encodings *like* 2047"? And why do you think
> that UTF-8 may break existing implementations?

See previous message.

> UTF-8 has virtually
> the same footprint in terms of bytes as ISO-8859-1: All bytes
> above 0x7F may be used. Implementations that have to deal with
> ISO-8859-1 usually do this by just being 8-bit-transparent;
> that works for UTF-8, too.

If UTF-8 were a subset of ISO-8859-1, it would work; but I don't think  
that's the case (not that I'm an expert in this area, by any means).
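
To make the byte-level question concrete, here is a minimal Python sketch. The sample string and the helper names (pass_through, is_valid_utf8) are illustrative assumptions, not taken from any spec or implementation. It suggests that an 8-bit-transparent component is unaffected, that a component decoding as ISO-8859-1 accepts UTF-8 bytes but reads different characters, and that ISO-8859-1 bytes are not always valid UTF-8.

```python
# Illustration only: the same text as bytes in each encoding.
text = "caf\u00e9"  # "café"

latin1_bytes = text.encode("iso-8859-1")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")         # two bytes for the accented letter


def pass_through(raw: bytes) -> bytes:
    """Hypothetical 8-bit-transparent component: bytes in, same bytes out."""
    return bytes(raw)


# A component that decodes header bytes as ISO-8859-1 never fails on UTF-8
# input, but it sees different characters than the sender intended.
misread = utf8_bytes.decode("iso-8859-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes, utf8_bytes)
    print(repr(misread))
    print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
```

So the two encodings share the same range of byte values, but neither is a subset of the other at the character level; whether that matters depends on whether a given implementation only forwards the bytes or also interprets them.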

> If your opinion is that UTF-8 cannot be allowed at all, then
> that's going to be a problem for cases where it's already in
> use, see e.g. earlier posts in the http list.

Sorry, I looked and didn't find anything. Can you give references?

> It's easy to say "may break existing implementations",
> but in over 10 years of being involved in the Web, I haven't
> heard about that happening. If you or anybody have, please
> speak up.

I'm wary of changing fundamental things in HTTP like the charset for  
headers because nobody's heard of it breaking things. My experience  
has been that the people who show up in this forum and the others  
that we circulate in are the tip of a much larger iceberg of  
implementations, and we make such changes at our peril.

In particular, performance-sensitive implementations like proxies  
often rely on particular data structures that are mandated by the  
specs for optimisation. As you point out, they'll probably store them  
as bytes, but they may not. Client and server APIs have been built  
that make (again, reasonable) assumptions about headers' character sets.

In both cases, changing the spec may break those implementations.  
Yes, it's just "may", but the burden of proof should be on those who  
want to make the change. To that end, if UTF-8 were widely used in  
HTTP headers already, I wouldn't be so uncomfortable, but AFAIK  
such uses are few and far between.

>> For example, the encoding
>> specified in 3987 is appropriate for URIs.
>
> As one of the authors of RFC 3987, I know what you mean, but
> "the encoding specified in 3987" wouldn't be enough for a spec.
> Also, it's not really very well suited to the job, because
> %hh-encoding is used to escape any bytes, not only UTF-8,
> and there is a considerable length increase for some scripts.

I know. I didn't say it was a great solution, just one that works.
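
For a rough sense of the length increase mentioned above, a small sketch using Python's standard urllib.parse.quote. The sample strings and the helper name percent_encoded_length are assumptions chosen for illustration; this is not the full IRI-to-URI mapping from RFC 3987, only the %hh step applied to UTF-8 bytes.

```python
from urllib.parse import quote


def percent_encoded_length(text: str) -> int:
    """Length of text after %hh-escaping its UTF-8 bytes (nothing left 'safe')."""
    return len(quote(text, safe=""))


# Illustrative samples: plain ASCII, Latin script with one accent, Japanese.
samples = {
    "ascii": "hello",
    "latin": "caf\u00e9",
    "japanese": "\u65e5\u672c\u8a9e",
}

if __name__ == "__main__":
    for name, sample in samples.items():
        raw_len = len(sample.encode("utf-8"))
        print(name, "utf-8 bytes:", raw_len,
              "percent-encoded chars:", percent_encoded_length(sample))
```

Every non-ASCII byte becomes three characters, so text in scripts that need three UTF-8 bytes per character grows to nine characters per character on the wire.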

>> However, it *has* to be
>> explicit; I've heard some people read this requirement and think that
>> they need to check *every* header for 2047 encoding.
>
> I have read it that way, too. If it can be safely argued that
> it was never intended that way, and that no harm is produced
> if this is restricted, then I'd be fine with restricting it,
> because checking everything for 2047 is indeed tough, but
> I'd really like to make sure that this isn't creating problems.
> (i.e. that a careful examination of the various headers makes
> some reasonably conservative assumptions).

I think that's a great first step.

Taking a glance at the header registry
<http://www.iana.org/assignments/message-headers/perm-headers.html>,
there are about 30 headers that might conceivably need non-ASCII
content. Of those, the  
vast majority have URIs as a payload, which means that if you want to  
change the encoding, you'll need to change their syntax pretty  
fundamentally -- an incompatible change.

Of the remaining ones, I see (with ones that may IMO be a problem  
marked with a *):
   * WWW-Authenticate, Proxy-Authenticate -- as discussed elsewhere
   * Authorization, Proxy-Authorization -- as discussed elsewhere
   * Content-Disposition -- can contain a filename
   - Cookie, Cookie2, Set-Cookie -- can contain user data, but it's  
     opaque, so servers can choose their own encoding strategy
   - Server, Via, User-Agent -- not usually presented to users
   - Warning -- can contain warning text, not usually presented to users
   - From -- an e-mail address (and possibly a candidate for  
     deprecation)

My earlier proposal would have the effect of clarifying the use of  
RFC2047 encoding in those marked with "*" above. It sounds like you  
want to change the syntax of those headers to allow UTF-8, no?
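
For readers less familiar with RFC 2047, a minimal sketch of what an encoded-word looks like and what "checking a header for 2047" involves. The function names and the regular expression are illustrative assumptions; the sketch handles only the basic =?charset?B|Q?payload?= form and ignores details such as language tags, length limits and folding of adjacent encoded-words.

```python
import base64
import quopri
import re

# Simplified pattern for an RFC 2047 encoded-word: =?charset?B|Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def encode_word(text: str, charset: str = "utf-8") -> str:
    """Encode text as a single 'B' (base64) encoded-word."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def contains_encoded_word(value: str) -> bool:
    """The per-header check a recipient would have to run."""
    return ENCODED_WORD.search(value) is not None


def decode_words(value: str) -> str:
    """Replace every encoded-word in value with its decoded text."""
    def _decode(match: re.Match) -> str:
        charset, kind, payload = match.groups()
        if kind.upper() == "B":
            raw = base64.b64decode(payload)
        else:
            # 'Q' encoding: =XX is a hex-escaped byte, '_' stands for a space.
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset)

    return ENCODED_WORD.sub(_decode, value)


if __name__ == "__main__":
    word = encode_word("caf\u00e9")
    print(word)
    print(contains_encoded_word(word), contains_encoded_word("plain text"))
    print(decode_words(word + " report"))
```

The encoded form is pure ASCII on the wire, which is why it is compatible with existing header syntax; the cost is that recipients have to know which headers to run the check on.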

If so, the problem that I have with that is that no existing  
conformance-minded implementer could have looked at RFC2616 and said  
to themselves, "oh, I should use/expect UTF-8 there," while such a  
statement could quite easily have been made about RFC2047. Now, if  
everybody has gone off and used UTF-8 despite the RFC, I could see an  
argument for changing the spec to match reality, but I don't think  
that's the case here.

If there were a huge win here -- e.g., lots of improvements in i18n  
and efficiency -- again, it might be persuasive. But, looking at the  
list above, the benefits of such a change don't seem overwhelming.

Again, I'm not an i18n expert, so please educate me.

>> So, I think this means:
>>
>> 1) Change
>>   "Words of *TEXT MAY contain characters from character sets other
>> than ISO-8859-1 [22] only when encoded according to the rules of RFC
>> 2047 [14]."
>> to
>>   "Words of *TEXT MUST NOT contain characters from character sets
>> other than ISO-8859-1 [22]."
>> and,
>>
>> 2) Identify headers that may have non-8859
>
> There are many parts to ISO-8859, not just ISO-8859-1.

Sorry, just using shorthand.

>> content and explicitly say
>> how to encode them (IRI, 2047, whatever; the existing ones will have
>> to be 2047, I believe), modifying their BNF to suit.
>>
>> 3) When we document extensibility, require new headers to nominate
>> any encoding explicitly.
>
> If that includes UTF-8, I'd be fine with it. If it excludes
> UTF-8, I think that would be a problem.

In my view, the proxy case and the API case above both rule that out;  
we can't assume that deployed software is set up to produce, accept  
or process UTF-8.


> Regards,    Martin.

Cheers,

--
Mark Nottingham     http://www.mnot.net/

Received on Monday, 20 August 2007 10:03:55 UTC