Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range from Matthew Kerwin on 2016-12-15 (ietf-http-wg@w3.org from October to December 2016)

From: Matthew Kerwin <matthew@kerwin.net.au>
Date: Thu, 15 Dec 2016 11:57:40 +1000
To: Kari Hurtta <hurtta-ietf@elmme-mailer.org>
Cc: Julian Reschke <julian.reschke@gmx.de>, Alexey Melnikov <alexey.melnikov@isode.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Message-ID: <CACweHNDbv9dDXqjpU61HvfpgZ6Dt4S-CG=GjwOZcwaZh6LEirQ@mail.gmail.com>

On 15 December 2016 at 03:39, Kari Hurtta <hurtta-ietf@elmme-mailer.org>
wrote:

> Matthew Kerwin <matthew@kerwin.net.au>: (Wed Dec 14 13:53:45 2016)
> > It says that "forms that use explicit string delimiters are generally
> > preferred over other alternatives. In many contexts, symmetric paired
> > delimiters are easier to recognize and understand than visually unrelated
> > ones." So brackets are good.
> >
> > And while it advises against using Perl's \x{NNNN...} syntax (because of
> > potential ambiguities with two-digit hex codes), it doesn't say anything
> > at all about \u{N...}
> >
>

I should have noted here that Ruby uses this \u{N...} syntax, including
the lower limit of one hexadecimal digit.  This is a valid string literal
in Ruby:

"\u{df}\u{9}\u{1f602}"



> > Curly braces cost 14+15 bits in HPACK, parentheses 10+10 (incidentally
> > cheaper than single quotes, which are 11+11). It's also convenient that
> > little 'u' is one bit cheaper than little 'x'.
> >
> > I don't think parentheses are at too much risk of needing escaping, so it
> > seems like the solution that goes with BCP 137, and compresses alright
> > with HPACK, is:
> >
> >     %x5c.75.28 1*6HEXDIGIT %x29
> >
> > It's still a little bit clunky for things like "Stra\u(df)e", but not so
> > bad for emoji "\u(1f602)" and somewhere in between for Hiragana
> > "\u(3053)\u(3093)\u(306b)\u(3064)".
>
>
> I think that this is best suggestion so far.
>
> But can this also be shorter ?
>
>      %x5c.28 1*6HEXDIGIT %x29
>
> Makes
>
>         \(3064)
>
>
> { Yes, it is not visible that this is hexadecimal. }
>
>
There is precedent, although I'm not sure if it's a good precedent: the
"content" attribute in CSS uses:

    %x5c 1*6HEXDIGIT

...which is both undelimited (which I oppose) and without an explicit
hexadecimal indicator (about which I'm mostly ambivalent.)
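
To illustrate the concern about the undelimited form, here is a minimal
Python sketch. The pattern names and the decode() helper are hypothetical,
invented for this example. It also ignores the fact that CSS itself lets a
single whitespace character terminate the escape.

```python
import re

# Undelimited form, as in the CSS-like rule above: backslash + 1 to 6 hex digits.
UNDELIMITED = re.compile(r"\\([0-9A-Fa-f]{1,6})")
# Delimited candidate from this thread: backslash, "u", "(", 1 to 6 hex digits, ")".
DELIMITED = re.compile(r"\\u\(([0-9A-Fa-f]{1,6})\)")


def decode(pattern, text):
    """Replace every escape matched by `pattern` with the code point it names."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


# The intended text is "Straße": U+00DF followed by a literal "e".
# Without a delimiter, the trailing "e" is read as a third hex digit.
greedy = decode(UNDELIMITED, r"Stra\dfe")
delimited = decode(DELIMITED, r"Stra\u(df)e")

print(ascii(greedy))     # the escape was read as U+0DFE, and the "e" is gone
print(ascii(delimited))  # U+00DF followed by "e", as intended
```

With the closing delimiter the parser knows where the digits stop; without
it, any following character that happens to be a hex digit is swallowed.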



>
> Although
>
>         EmbeddedUnicodeChar =  %x5C.75.27 4*6HEXDIG %x27
>
> works for me.
>
>
I suppose it comes down to a question of which data we want to target for
optimisation, and then taking measurements and evaluating them.

It sounds like Julian thinks «%x5c.75 DELIM 1*6HEXDIGIT DELIM» "\u(abc)" is
verbose, and we don't have many opinions yet on «%x5c.28 1*6HEXDIGIT %x29»
"\(abc)".
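
For comparison, the two candidate productions can be sketched as Python
decoders. This is only an illustration under stated assumptions: the names
WITH_U, WITHOUT_U and unescape() are made up here, and rejecting surrogates
and values above U+10FFFF is my assumption, not something the draft or this
thread has decided.

```python
import re

# Hypothetical patterns for the two candidates discussed above.
WITH_U = re.compile(r"\\u\(([0-9A-Fa-f]{1,6})\)")    # %x5c.75.28 1*6HEXDIGIT %x29
WITHOUT_U = re.compile(r"\\\(([0-9A-Fa-f]{1,6})\)")  # %x5c.28 1*6HEXDIGIT %x29


def unescape(text, pattern):
    """Decode every escape in `text` that matches `pattern`."""
    def repl(match):
        cp = int(match.group(1), 16)
        # Assumption: six hex digits can exceed U+10FFFF, and surrogate
        # code points are not characters, so both are rejected here.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {match.group(0)}")
        return chr(cp)
    return pattern.sub(repl, text)


print(ascii(unescape(r"\u(abc)", WITH_U)))    # U+0ABC via the longer form
print(ascii(unescape(r"\(abc)", WITHOUT_U)))  # U+0ABC via the shorter form
```

Both forms decode identically; the only difference is the one octet saved
per escape and the loss of the visible hexadecimal hint.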

I'm not sure at what point this decision becomes so minor that it's just
paint on a bike shed. :)



> > Cheers
> >
> >
> >
> > > Best regards, Julian
> > >
> > > PS: and, as a nit, it's strange that the syntax uses delimiters but
> > > doesn't allow sequences of 1 to 3 HEXDIGs...
> > >
> > >
> > Having just written "\u(df)" I kind of understand; it really feels like
> > I'm describing an octet rather than a codepoint. I don't think there's a
> > *technical* reason, though.
>
> Yes.
>
> >                              Is it alright to see "\u(9)" or an
> > equivalent in text?
>
>         Or is that "\(9)" alright if 'u' is also dropped.
>
> If that wanted to be avoid, that means
>
>         %x5c.75.28 3*6HEXDIGIT %x29
>
> or
>
>         %x5c.28 3*6HEXDIGIT %x29
>
> on my newest suggestion.
>
>
Left-padding with zeroes to make three digits screams "octal" at me,
even when they're not all octal digits, which elicits an even stronger
Pavlovian response. I think it has to be either 1*6 or 4*6, and I lean
towards 1*6.
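
To make the minimum-digit options concrete, here is a small Python sketch
of an encoder with a configurable lower bound. The function names and the
min_digits parameter are hypothetical, and the sketch assumes the \u(...)
form. It does not escape a literal backslash, which a real serialiser
would have to handle.

```python
def escape_codepoint(cp, min_digits=1):
    """Render one code point as \\u(HEX), zero-padded to `min_digits`."""
    return "\\u(" + format(cp, "x").zfill(min_digits) + ")"


def escape_non_ascii(text, min_digits=1):
    """Escape every character outside ASCII; leave ASCII untouched."""
    return "".join(
        ch if ord(ch) < 0x80 else escape_codepoint(ord(ch), min_digits)
        for ch in text
    )


for n in (1, 3, 4):  # lower bounds for the 1*6, 3*6 and 4*6 variants
    print(n, escape_codepoint(0x9, n), escape_non_ascii("Stra\u00dfe", n))
```

With a lower bound of 3 the tab comes out as \u(009) and the sharp s as
\u(0df), which is the zero-padded shape described above. Code points that
already need four or more digits look the same under every variant.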

Cheers
-- 
  Matthew Kerwin
  http://matthew.kerwin.net.au/

Received on Thursday, 15 December 2016 01:58:14 UTC