Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range from Matthew Kerwin on 2016-12-14 (ietf-http-wg@w3.org from October to December 2016)

From: Matthew Kerwin <matthew@kerwin.net.au>
Date: Wed, 14 Dec 2016 18:46:04 +1000
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: Kari Hurtta <hurtta-ietf@elmme-mailer.org>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Message-ID: <CACweHNDKgWQewZHb=Kz3_2=41M58sY5472Q5OwpqPLxorvkzHQ@mail.gmail.com>

On 14 December 2016 at 17:42, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:

> --------
> In message <201612140628.uBE6SO3L025885@shell.siilo.fmi.fi>, Kari Hurtta writes:
>
> >I think that one escape sequence is more sane than something like
> >\uD834\uDD1E  for one unicode codepoint.
> >
> >> Any suggestions ?
> >
> >Ilari Liusvaara told that 10FFFD is the last codepoint. So 6
> >hex digits is sufficient.
>
> I'm totally agnostic on this one, but would lean on doing it
> like JSON according to Occams Razor.
>
>
There's a recursive argument to be made here about accepting one of JSON's
flaws, so why not another? (In my experience surrogate pairs are never not
a problem.)
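
To make the surrogate issue concrete, here is a small Python sketch of the
arithmetic a receiver has to do to turn the \uD834\uDD1E pair quoted above
back into a single codepoint. The function name is made up for illustration;
the formula is the standard UTF-16 one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD834, 0xDD1E)
print(f"U+{code_point:04X}")  # U+1D11E
```

Every parser also has to decide what to do with a lone or reversed surrogate,
which is where most of the trouble comes from.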



> If we do something different, does the HPACK-Huffman efficiency matter ?
>
> >       ( "\" "X" 6*HEXDIG )
>
> HPACK: 19 + 8 + 6 * 5.625-ish = 61-ish bits
> (lowercase 'x' would save a bit)
>
> >        ( "\" "X" 1*6HEXDIG "#" )
>
> HPACK: 19 + 8 + 3-ish * 5.625-ish + 8 = 51-ish bits
> (lowercase 'x' would save a bit)
>
> >        ( "\" "#" 1*6HEXDIG "#" )
>
> HPACK: 19 + 12 + 3-ish * 5.625-ish + 12 = 60-ish bits
>
>         ( "\" "u" 4*HEXDIG )
>
> HPACK: 19 + 6 + 4 * 5.625-ish = 47-ish bits
>
>
Unless we're using a different ABNF, doesn't "X" match both %x58 and %x78?
I'm sure this isn't intended to be case-insensitive.

If efficiency matters we're probably better off using a sentinel character
like "%" that encodes down much smaller. (Maybe we should be more like
%DOS% here... ;)

If we're looking for inspiration elsewhere, why not C99?

      "\" %x75 1*4HEXDIG
    / "\" %x55 1*6HEXDIG  ; C99 accepts 1*8
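
A rough Python sketch of a decoder for that shape of escape, purely as an
illustration. It assumes fixed widths (exactly 4 digits after "u", exactly 6
after "U") rather than the 1*4 / 1*6 ranges, because a variable-width escape
with no terminator is ambiguous when the next character is itself a hex
digit. It also assumes surrogates and values above 10FFFF are rejected, and
that any other backslash is left alone; none of that is settled anywhere.
The names are hypothetical.

```python
import re

# "\u" + exactly 4 hex digits, or "\U" + exactly 6 hex digits (assumed widths).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{6})")


def decode_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXX escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate U+{code_point:04X} is not a character")
        if code_point > 0x10FFFF:
            raise ValueError(f"U+{code_point:06X} is outside the Unicode range")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


print(decode_escapes(r"caf\u00e9"))              # café
print(decode_escapes(r"\U01D11E") == "\U0001D11E")  # True
```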

Even removing the '1*' from the repetition rules (so exactly 4 or 6 hex
digits) still gives ~47 bits for most uses and ~60 for astral plane emoji.
(That's better than \#XXXXXX# for large codepoints, by the way.)
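
For anyone who wants to check the arithmetic, a rough Python sketch follows.
The code lengths are copied from the Huffman table in RFC 7541 Appendix B for
just the characters used here, and are worth re-checking against the RFC. It
assumes lowercase hex digits, each equally likely, which is where the
5.625-ish average comes from. The helper names are invented for this example.

```python
# HPACK Huffman code lengths in bits (subset of RFC 7541 Appendix B).
CODE_BITS = {
    "\\": 19, "#": 12, "%": 6,
    "X": 8, "x": 7, "U": 7, "u": 6,
    "0": 5, "1": 5, "2": 5,
    "3": 6, "4": 6, "5": 6, "6": 6, "7": 6, "8": 6, "9": 6,
    "a": 5, "b": 6, "c": 5, "d": 6, "e": 5, "f": 6,
}

HEX_DIGITS = "0123456789abcdef"
# Average cost of one hex digit, assuming all 16 are equally likely.
AVG_HEX_BITS = sum(CODE_BITS[d] for d in HEX_DIGITS) / len(HEX_DIGITS)


def huffman_bits(text: str) -> int:
    """Exact Huffman-coded size of a literal string, in bits."""
    return sum(CODE_BITS[ch] for ch in text)


def expected_bits(prefix: str, digits: int, suffix: str = "") -> float:
    """Expected size of prefix + `digits` random hex digits + suffix."""
    return huffman_bits(prefix) + digits * AVG_HEX_BITS + huffman_bits(suffix)


print(AVG_HEX_BITS)                    # 5.625
print(expected_bits("\\u", 4))         # 47.5
print(expected_bits("\\U", 6))         # 59.75
print(expected_bits("\\#", 6, "#"))    # 76.75
```

The backslash alone costs 19 bits against 6 for "%", which is the whole case
for a different sentinel.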

Cheers
-- 
  Matthew Kerwin
  http://matthew.kerwin.net.au/

Received on Wednesday, 14 December 2016 08:46:37 UTC