Re: If not JSON, what then ?

phk,

I'm very happy to see the discussion of efficient binary encoding of
HTTP headers coming back around. This is an area that I had explored
fairly extensively early in the process of designing HTTP/2 with the
"Binary-optimized Header Encoding" I-D's (see
https://tools.ietf.org/html/draft-snell-httpbis-bohe-13). While HPACK
won out with regards to being the header compression scheme used for
HTTP/2, there is still quite a bit in the BOHE drafts that could be
useful here.

- James

On Mon, Aug 1, 2016 at 12:43 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
>
> A number of inputs came up in Stockholm which inform the process:
> Mark's earlier attempt to classify header syntax into groups, and the
> desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).
>
> My personal intuition was that we should find a binary serialization
> (like CBOR), and base64 it into HTTP1-2, i.e.: design for the future
> and shoe-horn into the present.  But no obvious binary serialization
> seems to exist, CBOR was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
>
> Inspired by Mark's HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack:  Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
>
> This document chronicles my deliberations, and the strawman I came
> up with:  Not only does it seem possible, it has some very interesting
> possibilities down the road.
>
> Disclaimer:  ABNF may not be perfect.
>
> Structure of headers
> ====================
>
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
>
>         header: ordered sequence of named dictionaries
>
> The "ordered" constraint arises in two ways:  We have explicitly
> ordered headers like {Content|Transfer}-Encoding and we have headers
> which have order by their q=%f parameters.
>
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
>
>      Accept: audio/*; q=0.2, audio/basic
>
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
>
>         [
>             [ "audio/*", { "q": 0.2 }],
>             [ "audio/basic", {}]
>         ]
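As an illustration of the flat case, the mapping from the header syntax to the definition list above takes only a few lines. This is a hypothetical Python sketch (function name invented); values are kept as strings except for simple numerics:

```python
# Hypothetical sketch: parse a flat "ordered sequence of named
# dictionaries" header value into the definition-list form shown above.

def parse_flat(value):
    result = []
    for element in value.split(","):          # ',' is the list separator
        parts = [p.strip() for p in element.split(";")]  # ';' the dict separator
        name, params = parts[0], {}
        for p in parts[1:]:
            if "=" in p:
                k, v = p.split("=", 1)
                try:
                    params[k] = float(v) if "." in v else int(v)
                except ValueError:
                    params[k] = v
            else:
                params[p] = {}                # bare parameter
        result.append([name, params])
    return result

print(parse_flat("audio/*; q=0.2, audio/basic"))
# [['audio/*', {'q': 0.2}], ['audio/basic', {}]]
```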
>
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
>
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
>
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
>
> So let us use '<...>' to mark the recursion, since <> are shorter than
> [] and {} in HPACK/huffman.
>
> Here is a two level example:
>
>         foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
>
> Parsed into JSON that would be:
>
>         [
>             [
>                 "foo",
>                 {
>                     "p1": 1,
>                     "p4": {},
>                     "p3": [
>                         [
>                             "x1",
>                             {}
>                         ],
>                         [
>                             "x2",
>                             {}
>                         ],
>                         [
>                             "x3",
>                             {
>                                 "y2": 2,
>                                 "y1": 1
>                             }
>                         ]
>                     ],
>                     "p2": "abc"
>                 }
>             ],
>             [
>                 "bar",
>                 {}
>             ]
>         ]
>
> (NB shuffled dictionary elements to show that JSON dicts are unordered)
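A hypothetical recursive-descent sketch of this model (all names invented; values restricted to tokens and integers for brevity). It accepts both the bare form and, anticipating the next step, the explicit '<...>'-wrapped form:

```python
# Sketch of the "common structure" model: an ordered sequence of named
# dictionaries, with '<...>' marking a nested sequence.

def _atom(s, i):
    # Read a bare token up to the next structural character.
    j = i
    while j < len(s) and s[j] not in ",;=<>":
        j += 1
    return s[i:j].strip(), j

def _seq(s, i):
    # Parse an ordered sequence of named dictionaries.
    seq = []
    while i < len(s) and s[i] != ">":
        name, i = _atom(s, i)
        params = {}
        while i < len(s) and s[i] == ";":
            key, i = _atom(s, i + 1)
            if i < len(s) and s[i] == "=":
                if s[i + 1] == "<":                # nested sequence
                    params[key], i = _seq(s, i + 2)
                    i += 1                         # consume the closing '>'
                else:
                    val, i = _atom(s, i + 1)
                    params[key] = int(val) if val.isdigit() else val
            else:
                params[key] = {}                   # bare parameter, e.g. "p4"
        seq.append([name, params])
        if i < len(s) and s[i] == ",":
            i += 1
    return seq, i

def parse_common(value):
    value = value.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]                        # explicit wrapper
    return _seq(value, 0)[0]

print(parse_common("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar"))
```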
>
> And now comes the recursion economy:
>
> First we wrap the entire *new* header in <...>:
>
>         foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
>
> This way, the first character of the header tells us that this header
> has "common structure".
>
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
>
> Next, we add a field to the IANA HTTP header registry (one can do
> that I hope ?) classifying their "angle-bracket status":
>
>  A) not angle-brackets -- incompatible structure; use a topical parser
>         Range
>
>  B) implicit angle-brackets -- Has common structure but is not <> enclosed
>         Accept
>         Content-Encoding
>         Transfer-Encoding
>
>  C) explicit angle-brackets -- Has common structure and is <> enclosed
>         all new headers go here
>
>  D) unknown status.
>         As it says on the tin.
>
> Using this as a whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
>
> And then for the final trick:   We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
>
>         Date: < 1469734833 >
>
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
>
> Next:
>
> Data Types
> ==========
>
> I think we need these fundamental data types, and subtypes:
>
> 1)   Unicode strings
>
> 2)      ascii-string (maybe)
>
> 3)      binary blob
>
> 4)   Token
>
> 5)   Qualified-token
>
> 6)   Number
>
> 7)      integer
>
> 8)   Timestamp
>
> In addition to these subtypes, schemas can constrain types
> further, for instance integer ranges, string lengths etc.
> more on this below.
>
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x, in a way that is not
> going to break anything important and HPACK should not hate
> them either.
>
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover that here.
>
> 1) Unicode string
> -----------------
>
> The first question is do we mean "unrestricted unicode" or do
> we want to try to sanitize it.
>
> An example of sanitization is RFC 7230's "quoted-string", which bans
> control characters except forward horizontal white-space (=TAB).
>
> Another is I-JSON (RFC7493)'s:
>
>    MUST NOT include code points that identify Surrogates or
>    Noncharacters as defined by UNICODE.
>
> As far as I can tell, that means that you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode.  Not cool IMO.
>
> Imposing an RFC 7230-like restriction on unicode gets totally
> rococo:  What does "forward horizontal white-space" mean on
> a line which uses both left-to-right and right-to-left alphabets ?
> What does it mean in alphabets which write vertically ?
>
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
>
> Encoding unicode strings in HTTP1+2 requires new syntax and
> for any number of reasons, I would like to minimize that
> and {re-|ab-}use quoted-strings.
>
> RFC7230 does not specify what %80-%FF means in quoted-string, but
> hints that it might be ISO8859.
>
> Now we want it to become UTF-8.
>
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
> encoding:  It takes 68 bits.
>
> Encoding the BOM as '\ufeff' helps but still takes an unreasonable
> 48 bits in HPACK/huffman.
>
> In both H1 and H2 defining a new "\U" escape seems better.
>
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.
>
> (We should not restrict which codepoints may/should use \u%%%% until
> we have studied whether \u%%%% may HPACK/huffman better than "raw" UTF-8
> in Asian codepages.)
>
> The heuristic for parsing a quoted-string then becomes:
>
>         1) If the quoted-string's first two characters are "\U"
>                 -> UTF-8
>
>         2)  If the quoted-string contains "\u%%%%" escape anywhere
>                 -> UTF-8
>
>         3)  If the quoted-string contains only %09-%7E
>                 -> UTF-8 (actually: ASCII)
>
>         4)  If the quoted-string contains any %7F-%8F
>                 -> UTF-8
>
>         5)  If header definition explicitly says ISO-8859
>                 -> ISO8859
>
>         6)  else
>                 -> UTF-8
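The six steps above can be sketched directly in Python, operating on the raw bytes between the quotes (escapes still present). The function name and the `header_says_latin1` flag are invented here for illustration, and step 2 is approximated by looking for the `\u` escape prefix:

```python
# Sketch of the six-step charset heuristic for quoted-string contents.

def guess_charset(raw: bytes, header_says_latin1: bool = False) -> str:
    if raw[:2] == b"\\U":                         # 1) explicit "\U" marker
        return "utf-8"
    if b"\\u" in raw:                             # 2) a \u%%%% escape anywhere
        return "utf-8"
    if all(0x09 <= b <= 0x7E for b in raw):       # 3) only %09-%7E (ASCII)
        return "utf-8"
    if any(0x7F <= b <= 0x8F for b in raw):       # 4) any %7F-%8F byte
        return "utf-8"
    if header_says_latin1:                        # 5) header definition says so
        return "iso-8859-1"
    return "utf-8"                                # 6) default
```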
>
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these or if they are even a good idea.
>
> The "pro" argument is that if we insist they are also English text,
> we have something the entire world stands a chance of understanding.
>
> The "contra" argument is that some people will be upset about that.
>
> If we want them, they're quoted-strings from RFC723x without %7F-%FF.
>
> It is probably better to derive them via schema from unicode strings.
>
> 3) Binary blobs
> ---------------
>
> Fitting binary blobs from crypto into RFC7230 should squeeze into
> quoted-string as well, since we cannot put any kinds of markers or
> escapes on tokens without breaking things.
>
> Proposal:
>
>         Quoted-string with "\#" as first two chars indicates base64
>         encoded binary blob.
>
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable) rather than
> munge up the content of the base64.
>
> Base64 is chosen because it is the densest well-known encoding which
> works well with HPACK/huffman:  The b64 characters on average emit
> 6.46 bits.
>
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64 ?  But in languages which can, they should
> probably become native byte-strings.
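A minimal sketch of this proposal (function names invented); the value is what would sit between the quotes, with the backslash written literally:

```python
import base64

# Hypothetical encode/decode for the "\#"-prefixed base64 binary blob.

def blob_encode(data: bytes) -> str:
    # '#' is not in the base64 alphabet, so a stripped escape stays visible.
    return "\\#" + base64.b64encode(data).decode("ascii")

def blob_decode(value: str) -> bytes:
    if not value.startswith("\\#"):
        raise ValueError("not a binary blob")
    return base64.b64decode(value[2:])

print(blob_encode(b"\x00\x01\xfe"))   # \#AAH+
```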
>
> 4) Token
> --------
>
> As we know it from RFC7230:
>
>    tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
>     "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>    token = 1*tchar
>
> 5) Qualified Token
> ------------------
>
>    qualified_token = token 0*1("/" token)
>
> All keys in all dictionaries are of this type.  (In JSON/python...
> the keys are strings)
>
> Schemas can restrict this further.
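The token and qualified_token grammars above transcribe directly to regexes; a sketch (constant names invented):

```python
import re

# Straight transcription of RFC 7230 tchar/token and qualified_token.
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"
TOKEN = re.compile(r"{t}+\Z".format(t=TCHAR))
# qualified_token = token 0*1("/" token): at most one slash-separated part.
QUALIFIED_TOKEN = re.compile(r"{t}+(?:/{t}+)?\Z".format(t=TCHAR))

print(bool(TOKEN.match("gzip")))                   # True
print(bool(QUALIFIED_TOKEN.match("audio/basic")))  # True
print(bool(QUALIFIED_TOKEN.match("a/b/c")))        # False
```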
>
> 6 Numbers
> ---------
>
> These are signed decimal numbers which may have a fraction.
>
> In HTTP1+2 we want them always in "%f" format and we want them to
> fit in IEEE 754 64-bit floating point, which leads to the following
> definition:
>
>         0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT )        n+m < 15
>
> (15 digits fit in IEEE754 64 binary floating point.)
>
> These numbers can (also) be used for millisecond-resolution absolute
> UNIX-epoch relative timestamps for the foreseeable future.
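A sketch validator for this grammar (names invented): optional sign, at least one integer digit, optional fraction, and at most 15 digits in total so the value fits exactly in an IEEE 754 binary64:

```python
import re

# Shape check, then the n+m < 15 digit-count restriction from the grammar.
NUMBER = re.compile(r"-?\d+(?:\.\d*)?\Z")

def valid_number(s: str) -> bool:
    if not NUMBER.match(s):
        return False
    return sum(c.isdigit() for c in s) <= 15

print(valid_number("1469734833.069"))     # True  (13 digits)
print(valid_number("1234567890.123456"))  # False (16 digits)
```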
>
> 7) Integers
> -----------
>
>         0*1"-" 1*15DIGIT
>
> Same restriction as above to fit into IEEE 754.
>
> Range can & should be restricted by schemas as necessary.
>
> 8 Timestamps
> ------------
>
> I propose we do these as subtype of Numbers, as UNIX-epoch relative
> time.  That is somewhat human-hostile and is leap-second-challenged.
>
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between a RFC7231 IMF-fixdate or a
> Number-Date.
>
> Without guidance from a schema it is less efficient to determine
> whether it is an IMF-fixdate, since the weekday part looks like a
> token, but it is not impossible.
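When the schema says a timestamp is coming, telling the two apart is a one-character test, since an IMF-fixdate always starts with a weekday name. A sketch using the Python standard library (function name invented):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime  # handles IMF-fixdate

def parse_timestamp(value):
    value = value.strip()
    if value[:1].isalpha():          # "Sun, 06 Nov 1994 08:49:37 GMT"
        return parsedate_to_datetime(value)
    return datetime.fromtimestamp(float(value), timezone.utc)  # Number-Date

print(parse_timestamp("1469734833"))
print(parse_timestamp("Sun, 06 Nov 1994 08:49:37 GMT"))
```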
>
>
> Schemas
> =======
>
> There needs to be an "ABNF"-parallel to specify what is mandatory and
> allowed for these headers in "common structure".
>
> Ideally this should be in machine-readable format, so that
> validation tools and parser-code can be produced without
> (too much) human intervention.  I'm tempted to say we should
> make the schemas JSON, but then we need to write JSON schemas
> for our schemas :-/
>
> Since schemas basically restrict what you are allowed to
> express, we need to examine and think about what restrictions
> we want to be able to impose, before we design the schema.
>
> This is the least thought about part of this document, since
> the train is now in Lund:
>
> Unicode strings:
> ----------------
>
> * Limit by (UTF-8) encoded length.
>         Ie: a resource restriction, not a typographical restriction.
>
> * Limit by codepoints
>         Example: Allow only "0-9" and "a-f"
>         The specification of code-points should be list of codepoint
>         ranges.  (Ascii strings could be defined this way)
>
> * Limit by allowed strings
>         ie: Allow only "North", "South", "East" and "West"
>
> Tokens
> ------
>
> * Limit by codepoints
>         Example: Allow only "A-Z"
>
> * Limit by length
>         Example: Max 7 characters
>
> * Limit by pattern
>         Example: "A-Z" "a-z" "-" "0-9" "0-9"
>         (use ABNF to specify ?)
>
> * Limit by well known set
>         Example: Token must be ISO3166-1 country code
>         Example: Token must be in IANA FooBar registry
>
> Qualified Tokens
> ----------------
>
> * Limit each of the two component tokens as above.
>
> Binary Blob
> -----------
>
> * Limit by length in bytes
>         Example: 128 bytes
>         Example: 16-64 or 80 bytes
>
> Number
> ------
>
> * Limit resolution
>         Example: exactly 3 decimal digits
>
> * Limit range
>         Example: [2.716 ... 3.1415]
>
> Integer
> -------
>
> * Limit range
>         Example [0 ... 65535]
>
> Timestamp
> ---------
>
> (I can't think of usable restrictions here)
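To make the restriction lists above concrete, here is a hypothetical sketch of what machine-readable schema entries could look like; the schema shape, names, and checker are all invented for illustration:

```python
# Invented schema entries carrying two of the restrictions listed above:
# limit-by-allowed-strings and limit-by-integer-range.
SCHEMAS = {
    "compass": {"type": "token", "allowed": {"North", "South", "East", "West"}},
    "port":    {"type": "integer", "min": 0, "max": 65535},
}

def check(schema_name, value):
    s = SCHEMAS[schema_name]
    if s["type"] == "token":
        return value in s["allowed"]
    if s["type"] == "integer":
        return s["min"] <= value <= s["max"]
    return False

print(check("compass", "North"))   # True
print(check("port", 70000))        # False
```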
>
>
> Aaand... I'm in Copenhagen...
>
> Let me know if any of this looks usable...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
>

Received on Monday, 1 August 2016 15:06:00 UTC