Re: IRIs, IDNAbis, and HTTP from Julian Reschke on 2008-03-13 (ietf-http-wg@w3.org from January to March 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 13 Mar 2008 16:28:48 +0100
To: Brian Smith <brian@briansmith.org>
CC: ietf-http-wg@w3.org
Message-ID: <47D94830.9060509@gmx.de>

Brian Smith wrote:
>>> "For existing protocols or protocols that move data
>>> from existing datastores, support of other charsets,
>>> or even using a default other than UTF-8, may be a
>>> requirement. This is acceptable, but UTF-8 support
>>> MUST be possible."
>> All nice in theory, but it hasn't been done in RFC2616.
> 
> The purpose of HTTPbis is to fix problems with RFC2616. That is one of
> the problems that needs to be fixed.

What exactly is the problem that needs to be fixed?

>>>> HTTP is no "new" protocol, like mail or news:  2821bis and 2822upd 
>>>> and FWIW RFC.usefor-usefor don't "violate" any IETF 
>>>> policy.  But atom and xmpp were new, a different situation.
>>> RFC 2277 applies to any updates to an existing protocol, as 
>>> far as I can tell.
>> I don't see how it could apply to that.
> 
> Please read what I quoted above. HTTP is an existing protocol, so it can
> have a default charset other than UTF-8, but "UTF-8 support MUST be
> possible." 

<http://tools.ietf.org/html/rfc2277#section-2>:

    Internationalization is for humans. This means that protocols are not
    subject to internationalization; text strings are. Where protocol
    elements look like text tokens, such as in many IETF application
    layer protocols, protocols MUST specify which parts are protocol and
    which are text. [WR 2.2.1.1]

    Names are a problem, because people feel strongly about them, many of
    them are mostly for local usage, and all of them tend to leak out of
    the local context at times. RFC 1958 [RFC 1958] recommends US-ASCII
    for all globally visible names.

    This document does not mandate a policy on name internationalization,
    but requires that all protocols describe whether names are
    internationalized or US-ASCII.

    NOTE: In the protocol stack for any given application, there is
    usually one or a few layers that need to address these problems.

    It would, for instance, not be appropriate to define language tags
    for Ethernet frames. But it is the responsibility of the WGs to
    ensure that whenever responsibility for internationalization is left
    to "another layer", those responsible for that layer are in fact
    aware that they HAVE that responsibility.

So HTTP uses US-ASCII for names; that includes method names, header 
names, relation names, whatever. No problem so far.

Where HTTP headers transport I18Nable text, they should be using the TEXT 
BNF rule, which allows transport of any Unicode character (by way of 
RFC2047 encoded-words), although in a really ugly way.
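
To illustrate what that "ugly way" looks like on the wire, here is a minimal sketch of an RFC2047 encoded-word round trip using Python's standard email.header module. The sample title is a made-up value, not something from this thread, and the exact encoded form (base64 vs. quoted-printable) is left to the library.

```python
from email.header import Header, decode_header

# Hypothetical human-readable text with non-ASCII characters.
title = "Résumé"

# Encode as an RFC 2047 encoded-word; the result contains only ASCII.
encoded = Header(title, charset="utf-8").encode()

# decode_header yields (value, charset) pairs; encoded parts come back as bytes.
decoded = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded)
)

print(encoded)  # an ASCII token of the form =?utf-8?...?...?=
print(decoded)
```

The point is only that the text survives a pure-ASCII header value; whether a given header's grammar actually permits encoded-words is a separate question.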

>>> that HTTPbis should explain how to encode UTF-8 text in newly 
>>> registered header fields. The de-facto mechanism for this, used by 
>>> Atom and WebDAV, is percent-encoded UTF-8.
>> Note: one instance in WebDAV Delta-V, one in AtomPub.
>>
>> Are you saying httpbis should recommend that for new headers? 
>> I'm not against it, but it sounds like something for an 
>> update to the HTTP header registry.
> 
> HTTPbis should at least standardize a mechanism for new headers to
> support Unicode text. Percent-encoded UTF-8 is one possibility. Or--just
> thinking off the top of my head--HTTPbis could allow new headers to
> encode UTF-8 text directly in quoted-strings, by starting the quoted
> string with the BOM (<EF><BB><BF>, which is "ï»¿" in Latin-1).

It already allows that through RFC2047-style encoding.
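
For comparison, the two alternatives floated in the quoted text can be sketched as follows. This is an illustration only, assuming Python's standard urllib.parse; the sample title is hypothetical, and nothing here is specified by HTTP itself.

```python
from urllib.parse import quote, unquote

# Hypothetical human-readable text with non-ASCII characters.
title = "Résumé"

# Percent-encoded UTF-8: each non-ASCII byte becomes %XX.
percent_encoded = quote(title, safe="")
print(percent_encoded)  # R%C3%A9sum%C3%A9

# Decoding reverses the process (UTF-8 is the default).
restored = unquote(percent_encoded)

# The UTF-8 BOM is the three bytes EF BB BF ...
bom_bytes = "\ufeff".encode("utf-8")

# ... which a Latin-1 reader would see as three separate characters.
bom_as_latin1 = bom_bytes.decode("latin-1")
print(bom_bytes.hex(), bom_as_latin1)
```

Percent-encoding keeps the header value in plain ASCII, whereas the BOM idea puts raw UTF-8 octets into the quoted-string and relies on the recipient recognizing the three-byte prefix.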

> But, it is totally unacceptable to add the Link header with a
> non-Unicode-capable title subfield, it is unacceptable to specify any
> new headers that have any human-oriented text that is not Unicode
> enabled, and any existing headers that have human-oriented text should
> be revised (in the most backwards-compatible way possible) to support
> Unicode text.

The title subfield is Unicode-capable. Read the grammar, plus RFC2047.

(No, I don't like that encoding either, but that's what we have.)

> ...

BR, Julian
Received on Thursday, 13 March 2008 15:29:48 UTC