Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range from Matthew Kerwin on 2016-12-14 (ietf-http-wg@w3.org from October to December 2016)

From: Matthew Kerwin <matthew@kerwin.net.au>
Date: Wed, 14 Dec 2016 21:53:45 +1000
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Alexey Melnikov <alexey.melnikov@isode.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, Kari Hurtta <hurtta-ietf@elmme-mailer.org>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Message-ID: <CACweHNBYf-UuxsKNxYakt22rgku9xEP4YK4yL2R+=vMf_uB2Vg@mail.gmail.com>

On 14 December 2016 at 20:46, Julian Reschke <julian.reschke@gmx.de> wrote:

> On 2016-12-14 11:38, Alexey Melnikov wrote:
>
>> ...
>>
>>> Has this ever been used in a protocol?
>>>
>> Some:
>> https://datatracker.ietf.org/doc/rfc5137/referencedby/
>>
>
> Actually, one.
>
>> This was also extensively used in other RFCs without referencing the BCP.
>>
>
> Example?
>
> The reason why I'm asking is because the notation
>
>  \u'HHHH' or \u'HHHHHH'
>
> strikes me as:
>
> 1) verbose
>
> 2) potentially problematic because of the use of the single quote (which
> might require extra escaping in some contexts)
>
>
Yes.

BCP 137 says that "forms that use explicit string delimiters are generally
preferred over other alternatives. In many contexts, symmetric paired
delimiters are easier to recognize and understand than visually unrelated
ones." So brackets are good.

And while it advises against using Perl's \x{NNNN...} syntax (because of
potential ambiguities with two-digit hex codes), it doesn't say anything at
all about \u{N...}

Curly braces cost 15+14 bits in HPACK, parentheses 10+10 (incidentally
cheaper than single quotes, which are 11+11). It's also convenient that
little 'u' is one bit cheaper than little 'x'.
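
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The bit lengths are my transcription of the static Huffman table in
RFC 7541 Appendix B, limited to the characters discussed here; the names
HPACK_BITS and delimiter_cost are made up for illustration.

```python
# Huffman code lengths in bits, transcribed from RFC 7541 Appendix B.
# Only the characters relevant to this thread are listed.
HPACK_BITS = {
    "{": 15, "}": 14,
    "(": 10, ")": 10,
    "'": 11,
    "u": 6, "x": 7,
}


def delimiter_cost(opening: str, closing: str) -> int:
    """Return the bits spent on one opening/closing delimiter pair."""
    return HPACK_BITS[opening] + HPACK_BITS[closing]


costs = {
    "braces": delimiter_cost("{", "}"),
    "parentheses": delimiter_cost("(", ")"),
    "single quotes": delimiter_cost("'", "'"),
}

for name, bits in costs.items():
    print(f"{name}: {bits} bits")
```

Parentheses come out cheapest of the three pairs, and 'u' saves one bit over
'x' on every escape.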

I don't think parentheses are at too much risk of needing escaping, so it
seems like the solution that goes with BCP 137, and compresses alright with
HPACK, is:

    %x5c.75.28 1*6HEXDIGIT %x29
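
To make that concrete, a decoder for this form could look roughly like the
Python sketch below. It is only an illustration: the names ESCAPE and
decode_escapes are invented here, accepting both upper- and lower-case hex
digits is an assumption, and so is rejecting surrogates and values above
U+10FFFF. It does not deal with how a literal backslash would be written.

```python
import re

# \u( 1*6HEXDIG )  mirrors  %x5c.75.28 1*6HEXDIGIT %x29
# Assumption: hex digits are accepted in either case.
ESCAPE = re.compile(r"\\u\(([0-9A-Fa-f]{1,6})\)")


def decode_escapes(text: str) -> str:
    """Replace every \\u(H...) escape in text with the character it names."""

    def replace(match: "re.Match[str]") -> str:
        codepoint = int(match.group(1), 16)
        # Assumption: surrogates and out-of-range values are errors.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {match.group(0)}")
        return chr(codepoint)

    return ESCAPE.sub(replace, text)


print(decode_escapes(r"Stra\u(df)e"))
```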

It's still a little bit clunky for things like "Stra\u(df)e", but not so
bad for emoji "\u(1f602)" and somewhere in between for Hiragana
"\u(3053)\u(3093)\u(306b)\u(3064)".
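
For comparison, producing those strings is a few lines of Python. This is a
sketch only: encode_escapes is a made-up name, escaping everything outside
ASCII is my assumption about what a sender would do, and it leaves literal
backslashes alone, so the output is not guaranteed to round-trip.

```python
def encode_escapes(text: str) -> str:
    """Write every non-ASCII character as \\u(H...) with lower-case hex."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII passes through unchanged
        else:
            out.append("\\u(%x)" % ord(ch))
    return "".join(out)


for sample in ("Stra\u00dfe", "\U0001F602", "\u3053\u3093\u306b\u3064"):
    escaped = encode_escapes(sample)
    print(len(sample), "chars ->", len(escaped), "octets:", escaped)
```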

Cheers

> Best regards, Julian
>
> PS: and, as a nit, it's strange that the syntax uses delimiters but
> doesn't allow sequences of 1 to 3 HEXDIGs...
>
>
Having just written "\u(df)" I kind of understand; it really feels like
I'm describing an octet rather than a codepoint. I don't think there's a
*technical* reason, though.  Is it alright to see "\u(9)" or an equivalent
in text?
-- 
  Matthew Kerwin
  http://matthew.kerwin.net.au/

Received on Wednesday, 14 December 2016 11:54:19 UTC