Re: draft-ietf-httpbis-header-structure-00, unicode range

On Tue, Dec 13, 2016 at 09:28:47PM +0000, Poul-Henning Kamp wrote:
> --------
> In message <20161213173327.C1F7D1714B@welho-filter2.welho.com>, Kari Hurtta writes:
> 
> >2.  Definition of HTTP Header Common Structure
> >https://tools.ietf.org/html/draft-ietf-httpbis-header-structure-00#section-2
> >
> >|     unicode_string = * unicode_codepoint
> >|             # XXX: Is there a place to import this from ?
> >|             # Unrestricted unicode, because there is no sane
> >|             # way to restrict or otherwise make unicode "safe".
> >
> >What is range of unicode_codepoint ?
> 
> As far as I know, UNICODE does not have a firm upper end, but
> everybody _expects_ 32 bits to be enough for everybody.

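Actually, it does: U+10FFFF is the last code point in Unicode. U+10FFFD
is the last one that can hold a character (it is allocated as part of
PUA); U+10FFFE and U+10FFFF are noncharacters.

IIRC, Unicode has exactly 1,111,998 code points available for characters
(most of those are unallocated): 1,114,112 in total, minus 2,048
surrogates and 66 noncharacters.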
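
For reference, the 1,111,998 figure can be reproduced from the size of
the code space by subtracting surrogates and noncharacters. A small
Python sketch of that arithmetic (the constant names are mine, chosen
for illustration, not taken from any spec):

```python
# Code space is U+0000..U+10FFFF: 17 planes of 65,536 code points each.
MAX_CODE_POINT = 0x10FFFF
TOTAL = MAX_CODE_POINT + 1

# Surrogate code points U+D800..U+DFFF are reserved for UTF-16.
SURROGATES = 0xDFFF - 0xD800 + 1

# Noncharacters: U+FDD0..U+FDEF, plus the last two code points
# (xxFFFE and xxFFFF) of each of the 17 planes.
NONCHARACTERS = (0xFDEF - 0xFDD0 + 1) + 2 * 17

SCALAR_VALUES = TOTAL - SURROGATES           # what UTF-8/16/32 can encode
ASSIGNABLE = SCALAR_VALUES - NONCHARACTERS   # usable for characters

print(TOTAL, SURROGATES, NONCHARACTERS, SCALAR_VALUES, ASSIGNABLE)
```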
 
> Since section two is the abstract datamodel, that's the best we can
> do there.
> 
> >3.  HTTP/1 Serialization of HTTP Header Common Structure
> >https://tools.ietf.org/html/draft-ietf-httpbis-header-structure-00#section-3
> >[...]
> >Or are unicode values > 0xFFFF
> >encoded with surrogates  (values 0xd800 - 0xdfff) ?
> >( UCS-2 or UTF-16 is used )
> 
> That was the plan.
> 
> Not a particularly good plan, as evidenced by the fact that I forgot
> to write that, and that JSON has seen interop issues with parsers
> missing that detail.

Also, note that the surrogate mechanism can only encode code points up
to the end of plane 16, i.e. U+10FFFF (that's the reason why Unicode
only has 17 planes!)
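
To illustrate the limit: a surrogate pair carries 10 + 10 = 20 bits on
top of an offset of 0x10000, so the largest value it can express is
0x10FFFF. A minimal Python sketch of the standard UTF-16 arithmetic
(the function names are hypothetical, made up for this example):

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not encodable as a surrogate pair")
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair decodes to the last code point of plane 16.
print(hex(from_surrogates(0xDBFF, 0xDFFF)))
```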

And I suppose that the surrogates MUST be paired properly (JSON actually
does not require this).
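
What "paired properly" would mean in practice, as a rough Python
sketch: every high surrogate must be immediately followed by a low
surrogate, and no low surrogate may appear on its own. The function
name and the choice of a list of 16-bit code units as input are my
assumptions, not anything the draft specifies:

```python
def surrogates_well_paired(units):
    """Return True if a sequence of UTF-16 code units has no lone surrogates."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= n or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True
```

A parser that wants to enforce the MUST would reject any string for
which this check fails, rather than passing lone surrogates through.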


-Ilari

Received on Tuesday, 13 December 2016 21:43:05 UTC