Re: PROPOSAL: i74: Encoding for non-ASCII headers from Frank Ellermann on 2008-03-25 (ietf-http-wg@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Tue, 25 Mar 2008 14:01:25 +0100
To: ietf-http-wg@w3.org
Message-ID: <fsasuf$l95$1@ger.gmane.org>

Julian Reschke wrote:

>> If there is a chance that these values have to be displayed in
>> HTML pages or used in XML files the NCR form &#xnnnnnn; might
>> work "as is", for \u'nnnnnn' something needs to determine a
>> corresponding UTF-16, hex. NCR, or UTF-8.

> Not sure I understand this.

> 1) Even if you want to use a value in HTML or XML, you will
> need to decode first, then re-encode, otherwise you'll end up
> with something like "&amp;xnnnnnnn;".

Not for "work as is": there, decoding hex. NCRs is the job of
the browser, and in the XML case no decoding is needed.  If you
want something better than "as is", e.g. because of Unicode
security considerations, both notations are fine.

Both forms offer a way to protect literal text that looks like
an encoding.  Your proposal &amp; is okay for HTML and XML, maybe
the &#x26; in RFC 5137 is more general.  For the \u form RFC 5137
mentions \\u as protection, in essence that is "double all
backslashes" (when I had to do this "manually" at a shell prompt
it made me nervous... ;-)

In both cases you'd have to explain what you want and how this
works, for NCRs that might be simpler (YMMV).
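
A minimal sketch of the protection idea, assuming the rules are
exactly "replace & with &#x26;" for NCRs and "double all
backslashes" for the \u form.  The function names are made up
for illustration, and the decoders are simple single-pass regex
substitutions, not a complete implementation of RFC 5137:

```python
import re

# Hex NCR with 2..6 hex digits, e.g. &#xE9;
HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]{2,6});")
# Either a doubled backslash or a \u'nnnn' escape, tried in that order.
BACKSLASH_U = re.compile(r"\\\\|\\u'([0-9A-Fa-f]{4,6})'")


def _char_or_original(match):
    """Return the decoded character, or the matched text if out of range."""
    code_point = int(match.group(1), 16)
    return chr(code_point) if code_point <= 0x10FFFF else match.group(0)


def protect_ncr(text):
    """Protect literal ampersands so they are not read as NCRs."""
    return text.replace("&", "&#x26;")


def protect_backslash_u(text):
    """Protect literal backslashes by doubling them."""
    return text.replace("\\", "\\\\")


def decode_ncr(text):
    """Decode hex NCRs in one pass; decoded output is not rescanned."""
    return HEX_NCR.sub(_char_or_original, text)


def decode_backslash_u(text):
    """Decode \\u'nnnn' escapes and undo doubled backslashes in one pass."""
    def repl(match):
        if match.group(1) is None:  # matched a doubled backslash
            return "\\"
        return _char_or_original(match)
    return BACKSLASH_U.sub(repl, text)


# Literal text that happens to look like both encodings.
literal = "AT&T &#xE9; \\u'00E9'"
assert decode_ncr(protect_ncr(literal)) == literal
assert decode_backslash_u(protect_backslash_u(literal)) == literal
```

Because each decoder makes a single pass, the & produced from
&#x26; is not decoded a second time, which is what makes the
protection work in this sketch.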

> the only difference between the two formats (BCP137, 5.1 and
> 5.2) is how they are embedded.

Yes, both specify Unicode code points with hex. digits minus
leading zeros: \u uses 4..6 digits, NCRs use 2..6 digits.  Both
forms have a clear trailing terminator as recommended in Charmod.
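
For illustration, a small sketch of encoders for the two forms,
using the digit counts given above (4..6 for \u'...', 2..6 for
hex. NCRs).  The function names are hypothetical, and leaving
all ASCII characters untouched is an assumption of this sketch:

```python
import re

# Digit counts as described above: 4..6 for \u'...', 2..6 for &#x...;
BACKSLASH_U = re.compile(r"\\u'[0-9A-Fa-f]{4,6}'")
HEX_NCR = re.compile(r"&#x[0-9A-Fa-f]{2,6};")


def to_backslash_u(text):
    """Replace each non-ASCII character with \\u'nnnn' (padded to 4 digits)."""
    return "".join(
        c if ord(c) < 128 else "\\u'%04X'" % ord(c) for c in text
    )


def to_hex_ncr(text):
    """Replace each non-ASCII character with a hex NCR (at least 2 digits)."""
    return "".join(
        c if ord(c) < 128 else "&#x%02X;" % ord(c) for c in text
    )


print(to_backslash_u("caf\u00e9"))  # caf\u'00E9'
print(to_hex_ncr("caf\u00e9"))      # caf&#xE9;
```

Both outputs end in a fixed terminator (' or ;), so a decoder
never has to guess where the hex digits stop.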

Mark Nottingham wrote:

| BCP137 does note the ugliness factor WRT NCRs.

Yes, a matter of taste, in that case John's taste.  Pick what
you like better, for implementors both forms are trivial if the
protection is clear (if it is at all needed, I'm not sure when).

 Frank

Received on Tuesday, 25 March 2008 13:00:02 UTC