Re: IRIs, IDNAbis, and HTTP

At 18:49 08/03/14, Mark Nottingham wrote:
>
>Personally, I am *very* -1 on doing this.

I'm not making this proposal because I think UTF-8 should rule
the world, but because over the years I have received quite
a few requests along the lines of: "if we define this new HTTP
header, can't we just say it uses UTF-8?".


>Changing the allowable characters in a protocol element is a *big*  
>change,

First, most HTTP implementations treat this stuff as
bytes-in, bytes-out, nothing else. Second, the allowable characters
aren't changed; it's just a different way of encoding them.

>and there is not an interoperability gain to doing so.

There is an interoperability gain with the rest of the infrastructure
(APIs, ...). This is serious. UTF-8 is widely used these days; telling
a developer that something is in UTF-8 means it can get implemented
quickly. That's not at all the case for an ugly and not particularly
well-defined mish-mash of iso-8859-1 and RFC 2047.
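To make the contrast concrete, here is a small sketch using Python's
stdlib email.header module (the header value "Grüße" is a made-up
example) of what an RFC 2047 encoded-word looks like next to raw UTF-8:

```python
from email.header import Header

value = "Grüße"  # hypothetical non-ASCII header value

# RFC 2047 encoded-word form, as used in mail headers:
encoded = Header(value, charset="utf-8").encode()
print(encoded)              # e.g. =?utf-8?b?R3LDvMOfZQ==?=

# Raw UTF-8 bytes, as proposed for new HTTP headers:
raw = value.encode("utf-8")
print(raw)                  # b'Gr\xc3\xbc\xc3\x9fe'
```

The encoded-word form needs a dedicated parser on both ends; the raw
form is just what every UTF-8-aware API already produces and consumes.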

>There is also not a functionality gain; it is possible (if not pretty)  
>to serialise other characters into HTTP headers.

In the old days, people were somehow used to character encoding
stuff being complicated. But these days there are much more
straightforward solutions, and tolerance and understanding for this
kind of cruft is getting lower and lower. The chances that something
gets implemented if it's just UTF-8, but not if it's some ugly cruft,
are considerable. At the end of the day, that results in a net
functionality gain.


>Furthermore, HTTP headers for the most part don't carry user-visible  
>data,

Yes indeed. For the most part, that's the way it should be. But in
some cases, the complications involved may have been an additional
reason why it isn't done. That's not the way it should be, and we
can fix it.


>Before we go too far down this path again, folks should refresh  
>themselves with the last round of discussion, starting at:
>   http://lists.w3.org/Archives/Public/ietf-http-wg/2007JulSep/0323.html

I have reread that.

I don't think there is anything seriously prohibiting the use of raw
UTF-8 in new headers (when more than US-ASCII is needed). The great
majority of implementations will just treat it as bytes-in, bytes-out.
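As a rough sketch (the header name X-Example-Title is made up, not
from any spec), a bytes-in, bytes-out implementation never needs to
know what encoding the value is in:

```python
# Serialize a hypothetical new header with a raw UTF-8 value.
name, value = "X-Example-Title", "Grüße"
wire = name.encode("ascii") + b": " + value.encode("utf-8") + b"\r\n"

# A bytes-in, bytes-out receiver just splits on the colon and passes
# the value bytes through untouched; only a consumer that actually
# cares about this header ever decodes them as UTF-8.
raw_name, raw_value = wire.rstrip(b"\r\n").split(b": ", 1)
assert raw_value.decode("utf-8") == value
```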

An occasional rare implementation may actually interpret the data
as iso-8859-1 and convert it to some internal representation, and
do the reverse on the way back out, but that shouldn't really hurt.
(Character encoding converters usually map the C1 area in iso-8859-1
to the C1 area in e.g. Unicode, if they don't just assume windows-1252.)
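This is easy to check: decoding arbitrary bytes as iso-8859-1 and
re-encoding them is lossless, because iso-8859-1 maps every byte value
0x00-0xFF (C1 range included) to the same Unicode code point. A quick
Python sketch:

```python
utf8_bytes = "Grüße".encode("utf-8")   # b'Gr\xc3\xbc\xc3\x9fe'

# An implementation that (wrongly) interprets the value as iso-8859-1
# internally and converts back on the way out still preserves the
# bytes exactly.
round_tripped = utf8_bytes.decode("iso-8859-1").encode("iso-8859-1")
assert round_tripped == utf8_bytes

# Displayed as iso-8859-1 it looks like mojibake on screen, but the
# underlying UTF-8 data survives the round trip intact.
```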

In an even rarer case, that data may then actually be displayed,
resulting in some kind of garbage on screen. But for these cases,
we could just claim that we are doing an encoding on top of
iso-8859-1. Eventually, these few implementations will catch up.

For those implementations that really care about the new header,
things will be a lot, lot easier than if we stayed with the
iso-8859-1+RFC 2047 cruft.

Regards,   Martin.

>Cheers,
>
>
>On 14/03/2008, at 8:36 PM, Julian Reschke wrote:
>
>>
>> Martin Duerst wrote:
>>> ...
>>> It may be difficult to fix the truely horrible RFC 2047 on top of
>>> iso-8859-1 mess for existing headers. But in order to move in the
>>> right direction, it would be a very good idea to allow newly defined
>>> headers to specify that they just use UTF-8.
>>> ...
>>
>> That's related to <http://www3.tools.ietf.org/wg/httpbis/trac/ticket/74>.
>>
>> This really sounds like the simplest way to do it -- if this applies  
>> only to new headers, is there even a remote chance of breaking  
>> something?
>>
>> BR, Julian
>>
>
>
>--
>Mark Nottingham     http://www.mnot.net/
>
>


#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp    

Received on Monday, 17 March 2008 05:21:00 UTC