
Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range

From: Kari Hurtta <hurtta-ietf@elmme-mailer.org>
Date: Wed, 14 Dec 2016 19:39:58 +0200 (EET)
Message-Id: <201612141739.uBEHdwiq024972@shell.siilo.fmi.fi>
To: Matthew Kerwin <matthew@kerwin.net.au>
CC: Julian Reschke <julian.reschke@gmx.de>, Alexey Melnikov <alexey.melnikov@isode.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, Kari Hurtta <hurtta-ietf@elmme-mailer.org>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Matthew Kerwin <matthew@kerwin.net.au>: (Wed Dec 14 13:53:45 2016)
> It says that "forms that use explicit string delimiters are generally
> preferred over other alternatives. In many contexts, symmetric paired
> delimiters are easier to recognize and understand than visually unrelated
> ones." So brackets are good.
> 
> And while it advises against using Perl's \x{NNNN...} syntax (because of
> potential ambiguities with two-digit hex codes), it doesn't say anything at
> all about \u{N...}
> 
> Curly braces cost 14+15 bits in HPACK, parentheses 10+10 (incidentally
> cheaper than single quotes, which are 11+11). It's also convenient that
> little 'u' is one bit cheaper than little 'x'.
> 
> I don't think parentheses are at too much risk of needing escaping, so it
> seems like the solution that goes with BCP 137, and compresses alright with
> HPACK, is:
> 
>     %x5c.75.28 1*6HEXDIGIT %x29
> 
> It's still a little bit clunky for things like "Stra\u(df)e", but not so
> bad for emoji "\u(1f602)" and somewhere in between for Hiragana "
> \u(3053)\u(3093)\u(306b)\u(3064)".


I think that this is the best suggestion so far.

But could it also be made shorter?

     %x5c.28 1*6HEXDIG %x29

That gives

	\(3064)


{ Yes, it is then no longer visible that the digits are hexadecimal. }
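
A minimal sketch of how a receiver might decode this shorter form, assuming
the grammar above (backslash, "(", one to six hex digits, ")"). The names
SHORT_ESCAPE and decode_short_escapes are hypothetical, and leaving
out-of-range values and surrogates undecoded is my assumption, not something
the proposal specifies:

```python
import re

# Hypothetical pattern for the shorter form: backslash, "(", 1-6 hex digits, ")".
SHORT_ESCAPE = re.compile(r"\\\(([0-9A-Fa-f]{1,6})\)")


def decode_short_escapes(text):
    # Replace each backslash-parenthesis escape with the code point it names.
    def replace(match):
        code_point = int(match.group(1), 16)
        # Assumption: values above U+10FFFF and surrogates stay undecoded.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return SHORT_ESCAPE.sub(replace, text)
```

Since the decoder sees nothing but the delimiters, any literal "\(" in a
value would need its own escaping rule, which is not covered here.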


Although

	EmbeddedUnicodeChar =  %x5C.75.27 4*6HEXDIG %x27

works for me.
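
To put rough numbers on the three spellings, here is a small sketch that
adds up HPACK Huffman code lengths. The lengths are transcribed by hand from
RFC 7541 Appendix B for only the characters needed here, so treat the table
as an assumption to verify against the RFC; hpack_bits is a hypothetical
helper name:

```python
# Huffman code lengths in bits for a handful of characters, transcribed from
# RFC 7541 Appendix B (assumption: copied correctly; verify before relying on it).
HPACK_BITS = {
    "\\": 19, "u": 6, "x": 7,
    "(": 10, ")": 10, "'": 11, "{": 15, "}": 14,
    "0": 5, "1": 5, "2": 5, "3": 6, "4": 6,
    "5": 6, "6": 6, "7": 6, "8": 6, "9": 6,
    "a": 5, "b": 6, "c": 5, "d": 6, "e": 5, "f": 6,
}


def hpack_bits(text):
    # Sum of the per-character code lengths, ignoring final padding to an octet.
    return sum(HPACK_BITS[ch] for ch in text)


for form in (r"\u'3064'", r"\u(3064)", r"\(3064)"):
    print(form, hpack_bits(form))
```

With these lengths, dropping the 'u' saves 6 bits per escape, and
parentheses save 2 bits per escape over single quotes.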
 
> Cheers
> 
> 
> 
> > Best regards, Julian
> >
> > PS: and, as a nit, it's strange that the syntax uses delimiters but
> > doesn't allow sequences of 1 to 3 HEXDIGs...
> >
> >
> Having just written "\u(df)" I kind of understand; it really feels like
> I'm describing an octet rather than a codepoint. I don't think there's a
> *technical* reason, though.  

Yes.

>                              Is it alright to see "\u(9)" or an equivalent
> in text?

	Or is "\(9)" alright if the 'u' is also dropped?

If that is to be avoided, it means

	%x5c.75.28 3*6HEXDIG %x29

or

	%x5c.28 3*6HEXDIG %x29

with my newest suggestion.
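
As a rough illustration of what the lower bound changes, here is a sketch
that builds a matcher for either variant with a configurable minimum digit
count. The function name escape_pattern and its parameters are hypothetical:

```python
import re


def escape_pattern(min_digits, with_u=True):
    # Build a regex for backslash, optional "u", "(", min_digits..6 hex digits, ")".
    prefix = r"\\u\(" if with_u else r"\\\("
    return re.compile(prefix + r"[0-9A-Fa-f]{%d,6}\)" % min_digits)


one_to_six = escape_pattern(1)
three_to_six = escape_pattern(3)

# With a minimum of one digit, "\u(9)" is accepted as written.
print(bool(one_to_six.fullmatch(r"\u(9)")))
# With a minimum of three digits, it has to be padded to "\u(009)".
print(bool(three_to_six.fullmatch(r"\u(9)")))
print(bool(three_to_six.fullmatch(r"\u(009)")))
```

So a minimum of three digits costs senders zero-padding for code points
below U+0100, such as "\u(0df)" instead of "\u(df)".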


> -- 
>   Matthew Kerwin
>   http://matthew.kerwin.net.au/

/ Kari Hurtta
Received on Wednesday, 14 December 2016 17:45:01 UTC
