On 14 December 2016 at 17:42, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> --------
> In message <201612140628.uBE6SO3L025885@shell.siilo.fmi.fi>, Kari Hurtta
> writes:
>
> >I think that one escape sequence is more sane than something like
> >\uD834\uDD1E for one unicode codepoint.
> >
> >> Any suggestions ?
> >
> >Ilari Liusvaara told that 10FFFD is the last codepoint. So 6
> >hex digits is sufficient.
>
> I'm totally agnostic on this one, but would lean on doing it
> like JSON according to Occams Razor.
>

There's a recursive argument to be made here about accepting one of JSON's
flaws, so why not another? (In my experience surrogate pairs are never not
a problem.)

> If we do something different, does the HPACK-Huffman efficiency matter ?
>
> > ( "\" "X" 6*HEXDIG )
>
> HPACK: 19 + 8 + 6 * 5.625-ish = 61-ish bits
> (lowercase 'x' would save a bit)
>
> > ( "\" "X" 1*6HEXDIG "#" )
>
> HPACK: 19 + 8 + 3-ish * 5.625-ish + 8 = 51-ish bits
> (lowercase 'x' would save a bit)
>
> > ( "\" "#" 1*6HEXDIG "#" )
>
> HPACK: 19 + 12 + 3-ish * 5.625-ish + 12 = 60-ish bits
>
> ( "\" "u" 4*HEXDIG )
>
> HPACK: 19 + 6 + 4 * 5.625-ish = 47-ish bits
>

Unless we're using a different ABNF, doesn't "X" match both %x58 and %x78?
(Quoted strings are case-insensitive in RFC 5234.) I'm sure that isn't the
intent here.

If efficiency matters we're probably better off using a sentinel character
like "%" that encodes down much smaller (6 bits in the HPACK Huffman table,
against 19 for "\"). (Maybe we should be more like %DOS% here... ;)

If we're looking for inspiration elsewhere, why not C99?

    "\" %x75 1*4HEXDIG
  / "\" %x55 1*6HEXDIG   ; C99 itself requires exactly 8 here
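
For what it's worth, here is a rough sketch of how a receiver might decode
those two forms. It is only an illustration under my own assumptions: the
names ESCAPE and decode_escapes are made up, and rejecting surrogates and
anything above U+10FFFF is my choice, not something the grammar says.

```python
import re

# Hypothetical matcher for the two C99-style forms. %x75 and %x55 are
# case-sensitive in ABNF, so "u" and "U" are matched exactly; HEXDIG is
# accepted in either case here.
ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{1,4})|U([0-9A-Fa-f]{1,6}))")


def decode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXX escapes with the code point they name."""

    def repl(match):
        code_point = int(match.group(1) or match.group(2), 16)
        # Assumption: surrogates and out-of-range values are errors.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("invalid code point in %r" % match.group(0))
        return chr(code_point)

    return ESCAPE.sub(repl, text)
```

Note that the variable-length forms match greedily, so in "\u41b" all
three hex digits are consumed even if the sender meant "\u41" followed by
a literal "b". A fixed digit count avoids that ambiguity.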

Even removing the '1*' from the repetition rules (i.e. fixed 4HEXDIG and
6HEXDIG) still gives ~47 bits for most uses and ~60 for astral plane emoji.
(That's better than \#XXXXXX# for large codepoints, by the way.)
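
If anyone wants to check the arithmetic, here is a small sketch of how I
get those figures. The code lengths are as I recall them from the Huffman
table in RFC 7541 Appendix B, so please verify them before relying on the
output; the function names are made up for this example, and the average
assumes lowercase hex digits, as the 5.625 figure above does.

```python
# HPACK Huffman code lengths in bits (assumed from RFC 7541 Appendix B).
HUFFMAN_BITS = {"\\": 19, "#": 12, "%": 6, "u": 6, "U": 7, "x": 7, "X": 8}
HUFFMAN_BITS.update({d: 5 for d in "012ace"})
HUFFMAN_BITS.update({d: 6 for d in "3456789bdf"})

HEX_DIGITS = "0123456789abcdef"


def huffman_bits(s):
    """Exact encoded length of s in bits, before padding to an octet."""
    return sum(HUFFMAN_BITS[ch] for ch in s)


def average_bits(prefix, ndigits, suffix=""):
    """Expected length of prefix + ndigits uniformly random hex + suffix."""
    avg_hex = sum(HUFFMAN_BITS[d] for d in HEX_DIGITS) / len(HEX_DIGITS)
    return huffman_bits(prefix + suffix) + ndigits * avg_hex


for label, args in [
    ("\\uXXXX", ("\\u", 4)),
    ("\\UXXXXXX", ("\\U", 6)),
    ("\\XXXXXXX", ("\\X", 6)),
    ("\\#XXXXXX#", ("\\#", 6, "#")),
]:
    print(label, average_bits(*args))
```

Real code points are not uniformly distributed hex, so these are averages
only; huffman_bits() gives the exact cost of any particular escape.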
Cheers
--
Matthew Kerwin
http://matthew.kerwin.net.au/