From: James M Snell <jasnell@gmail.com>
Date: Mon, 1 Aug 2016 07:57:26 -0700
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
phk,

I'm very happy to see the discussion of efficient binary encoding of
HTTP headers coming back around. This is an area that I explored
fairly extensively early in the process of designing HTTP/2, with the
"Binary-optimized Header Encoding" I-Ds (see
https://tools.ietf.org/html/draft-snell-httpbis-bohe-13). While HPACK
won out as the header compression scheme for HTTP/2, there is still
quite a bit in the BOHE drafts that could be useful here.

- James

On Mon, Aug 1, 2016 at 12:43 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
>
> A number of inputs came up in Stockholm which inform the process:
> Mark's earlier attempt to classify header syntax into groups, and
> the desire for an efficient binary encoding in HTTP[3-6] (or
> HPACK++).
>
> My personal intuition was that we should find a binary serialization
> (like CBOR), and base64 it into HTTP1-2. Ie: design for the future
> and shoe-horn into the present. But no obvious binary serialization
> seems to exist, CBOR was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
>
> Inspired by Mark's HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack: Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
>
> This document chronicles my deliberations, and the strawman I came
> up with: Not only does it seem possible, it has some very interesting
> possibilities down the road.
>
> Disclaimer: ABNF may not be perfect.
>
> Structure of headers
> ====================
>
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
>
>     header: ordered sequence of named dictionaries
>
> The "ordered" constraint arises in two ways: We have explicitly
> ordered headers like {Content|Transfer}-Encoding, and we have headers
> which are ordered by their q=%f parameters.
>
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
>
>     Accept: audio/*; q=0.2, audio/basic
>
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
>
>     [
>         [ "audio/*", { "q": 0.2 } ],
>         [ "audio/basic", {} ]
>     ]
>
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
>
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
>
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
>
> So let us use '<...>' to mark the recursion, since <> are shorter
> than [] and {} in HPACK/huffman.
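>
> (A rough python sketch of a parser for this structure, recursion
> included -- purely illustrative: the names are invented, and
> quoted-strings, escapes and number typing are all ignored. It is
> exercised on the two-level example that follows:)
>
>     def parse_common(s, i=0):
>         # Parse an ordered sequence of named dictionaries starting
>         # at s[i]; returns (deflist, index of first unconsumed char).
>         # A header wrapped in <...> should have the outer pair
>         # stripped before calling this.
>         result = []
>         while i < len(s):
>             name, i = parse_atom(s, i)
>             params = {}
>             while i < len(s) and s[i] == ";":
>                 key, i = parse_atom(s, i + 1)
>                 if i < len(s) and s[i] == "=":
>                     params[key], i = parse_value(s, i + 1)
>                 else:
>                     params[key] = {}          # bare parameter
>             result.append([name, params])
>             if i < len(s) and s[i] == ",":
>                 i += 1                        # next list element
>             else:
>                 break
>         return result, i
>
>     def parse_value(s, i):
>         if s[i] == "<":                       # the recursion
>             sub, i = parse_common(s, i + 1)
>             assert s[i] == ">"
>             return sub, i + 1
>         return parse_atom(s, i)
>
>     def parse_atom(s, i):
>         j = i
>         while j < len(s) and s[j] not in ",;=<>":
>             j += 1
>         return s[i:j].strip(), j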
>
> Here is a two level example:
>
>     foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
>
> Parsed into JSON that would be:
>
>     [
>         [
>             "foo",
>             {
>                 "p1": 1,
>                 "p4": {},
>                 "p3": [
>                     [ "x1", {} ],
>                     [ "x2", {} ],
>                     [ "x3", { "y2": 2, "y1": 1 } ]
>                 ],
>                 "p2": "abc"
>             }
>         ],
>         [ "bar", {} ]
>     ]
>
> (NB: dictionary elements shuffled to show that JSON dicts are
> unordered)
>
> And now comes the recursion economy:
>
> First we wrap the entire *new* header in <...>:
>
>     foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
>
> This way, the first character of the header tells us that this
> header has "common structure".
>
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
>
> Next, we add a field to the IANA HTTP header registry (one can do
> that, I hope?) classifying their "angle-bracket status":
>
> A) not angle-brackets -- incompatible structure, use topical parser
>        Range
>
> B) implicit angle-brackets -- has common structure but is not <>
>    enclosed
>        Accept
>        Content-Encoding
>        Transfer-Encoding
>
> C) explicit angle-brackets -- has common structure and is <> enclosed
>        all new headers go here
>
> D) unknown status
>        As it says on the tin.
>
> Using this as a whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
>
> And then for the final trick: We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
>
>     Date: < 1469734833 >
>
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
>
> Next:
>
> Data Types
> ==========
>
> I think we need these fundamental data types, and subtypes:
>
> 1) Unicode strings
>
> 2) ascii-string (maybe)
>
> 3) binary blob
>
> 4) Token
>
> 5) Qualified-token
>
> 6) Number
>
> 7) integer
>
> 8) Timestamp
>
> In addition, schemas can constrain these types further, for
> instance integer ranges, string lengths etc.; more on this below.
>
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x in a way that is not
> going to break anything important, and HPACK should not hate
> them either.
>
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover it here.
>
> 1) Unicode string
> -----------------
>
> The first question is: do we mean "unrestricted unicode", or do
> we want to try to sanitize it?
>
> An example of sanitation is RFC7230's "quoted-string", which bans
> control characters except forward horizontal white-space (=TAB).
>
> Another is I-JSON (RFC7493)'s:
>
>     MUST NOT include code points that identify Surrogates or
>     Noncharacters as defined by UNICODE.
>
> As far as I can tell, that means you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode. Not cool, IMO.
>
> Imposing an RFC7230-like restriction on unicode gets totally
> rococo: What does "forward horizontal white-space" mean on a line
> which uses both left-to-right and right-to-left alphabets? What
> does it mean in alphabets which write vertically?
>
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
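>
> (To make "let the schema do it" concrete, a hypothetical schema-side
> check in python -- the schema supplies the allowed codepoint ranges,
> and the parser itself stays unicode-ignorant:)
>
>     def check_unicode(value, allowed_ranges):
>         # allowed_ranges: [(lo, hi), ...] codepoint pairs from the schema
>         return all(any(lo <= ord(c) <= hi for lo, hi in allowed_ranges)
>                    for c in value)
>
>     hexdigits = [(0x30, 0x39), (0x61, 0x66)]   # "0-9" and "a-f"
>     check_unicode("deadbeef", hexdigits)       # -> True
>     check_unicode("DEADBEEF", hexdigits)       # -> False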
>
> Encoding unicode strings in HTTP1+2 requires new syntax, and for
> any number of reasons I would like to minimize that and
> {re-|ab-}use quoted-strings.
>
> RFC7230 does not specify what %80-%FF means in quoted-string, but
> hints that it might be ISO8859.
>
> Now we want it to become UTF-8.
>
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM, is quite pessimal in HPACK's huffman
> encoding: It takes 68 bits.
>
> Encoding the BOM as '\ufeff' helps, but still takes an unreasonable
> 48 bits in HPACK/huffman.
>
> In both H1 and H2, defining a new "\U" escape seems better.
>
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in. I suggest "\u%%%%" like JSON.
>
> (We should not restrict which codepoints may/should use \u%%%%
> until we have studied whether \u%%%% may HPACK/huffman better than
> "raw" UTF-8 in asian codepages.)
>
> The heuristic for parsing a quoted-string then becomes:
>
> 1) If the quoted-string's first two characters are "\U"
>        -> UTF-8
>
> 2) If the quoted-string contains a "\u%%%%" escape anywhere
>        -> UTF-8
>
> 3) If the quoted-string contains only %09-%7E
>        -> UTF-8 (actually: ASCII)
>
> 4) If the quoted-string contains any %7F-%8F
>        -> UTF-8
>
> 5) If the header definition explicitly says ISO-8859
>        -> ISO8859
>
> 6) else
>        -> UTF-8
>
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these, or if they are even a good idea.
>
> The "pro" argument is that if we insist they are also english text,
> we have something the entire world stands a chance to understand.
>
> The "contra" argument is that some people will be upset about that.
>
> If we want them, they're quoted-strings from RFC723x without
> %7F-%FF.
>
> It is probably better to derive them from unicode strings via
> schema.
>
> 3) Binary blobs
> ---------------
>
> Fitting binary blobs from crypto into RFC7230 should squeeze into
> quoted-string as well, since we cannot put any kinds of markers or
> escapes on tokens without breaking things.
>
> Proposal:
>
>     Quoted-string with "\#" as first two chars indicates base64
>     encoded binary blob.
>
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable), rather than
> munge up the content of the base64.
>
> Base64 is chosen because it is the densest well-known encoding
> which works well with HPACK/huffman: The b64 characters on average
> emit 6.46 bits.
>
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64? But in languages which can, they should
> probably become native byte-strings.
>
> 4) Token
> --------
>
> As we know it from RFC7230:
>
>     tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" /
>             "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>     token = 1*tchar
>
> 5) Qualified Token
> ------------------
>
>     qualified_token = token 0*1("/" token)
>
> All keys in all dictionaries are of this type. (In JSON/python...
> the keys are strings.)
>
> Schemas can restrict this further.
>
> 6) Numbers
> ----------
>
> These are signed decimal numbers which may have a fraction.
>
> In HTTP1+2 we want them always in "%f" format, and we want them to
> fit in IEEE754 64 bit floating point, which leads to the following
> definition:
>
>     0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT)      n+m < 15
>
> (15 digits fit in IEEE754 64 bit binary floating point.)
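>
> (Again purely as an illustrative sketch, that definition checked
> mechanically in python -- function and pattern names invented:)
>
>     import re
>
>     NUMBER = re.compile(r'-?([0-9]+)(?:\.([0-9]*))?$')
>
>     def is_common_number(s):
>         # At most 15 digits in total, so every accepted value fits
>         # exactly in an IEEE754 64 bit float.
>         m = NUMBER.match(s)
>         return (m is not None and
>                 len(m.group(1)) + len(m.group(2) or "") <= 15)
>
>     is_common_number("-3.1415")             # -> True
>     is_common_number("1234567890123456")    # -> False: 16 digits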
>
> These numbers can (also) be used as millisecond-resolution,
> absolute, UNIX-epoch relative timestamps for all foreseeable
> future.
>
> 7) Integers
> -----------
>
>     0*1"-" 1*15DIGIT
>
> Same restriction as above, to fit into IEEE 754.
>
> Range can & should be restricted by schemas as necessary.
>
> 8) Timestamps
> -------------
>
> I propose we do these as a subtype of Numbers, as UNIX-epoch
> relative time. That is somewhat human-hostile, and is
> leap-second-challenged.
>
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between an RFC7231 IMF-fixdate and a
> Number-Date.
>
> Without guidance from a schema it becomes inefficient to determine
> if it is an IMF-fixdate, since the week-day part looks like a
> token, but it is not impossible.
>
>
> Schemas
> =======
>
> There needs to be an "ABNF"-parallel to specify what is mandatory
> and allowed for these headers in "common structure".
>
> Ideally this should be in a machine-readable format, so that
> validation tools and parser-code can be produced without (too much)
> human intervention. I'm tempted to say we should make the schemas
> JSON, but then we need to write JSON schemas for our schemas :-/
>
> Since schemas basically restrict what you are allowed to express,
> we need to examine and think about what restrictions we want to be
> able to impose, before we design the schema.
>
> This is the least thought about part of this document, since the
> train is now in Lund:
>
> Unicode strings
> ---------------
>
> * Limit by (UTF-8) encoded length.
>       Ie: a resource restriction, not a typographical restriction.
>
> * Limit by codepoints
>       Example: Allow only "0-9" and "a-f"
>       The specification of code-points should be a list of
>       codepoint ranges. (Ascii strings could be defined this way.)
>
> * Limit by allowed strings
>       Ie: Allow only "North", "South", "East" and "West"
>
> Tokens
> ------
>
> * Limit by codepoints
>       Example: Allow only "A-Z"
>
> * Limit by length
>       Example: Max 7 characters
>
> * Limit by pattern
>       Example: "A-Z" "a-z" "-" "0-9" "0-9"
>       (use ABNF to specify?)
>
> * Limit by well-known set
>       Example: Token must be an ISO3166-1 country code
>       Example: Token must be in the IANA FooBar registry
>
> Qualified Tokens
> ----------------
>
> * Limit each of the two component tokens as above.
>
> Binary Blob
> -----------
>
> * Limit by length in bytes
>       Example: 128 bytes
>       Example: 16-64 or 80 bytes
>
> Number
> ------
>
> * Limit resolution
>       Example: exactly 3 decimal digits
>
> * Limit range
>       Example: [2.716 ... 3.1415]
>
> Integer
> -------
>
> * Limit range
>       Example: [0 ... 65535]
>
> Timestamp
> ---------
>
> (I can't think of usable restrictions here.)
>
>
> Aaand... I'm in Copenhagen...
>
> Let me know if any of this looks usable...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by
> incompetence.
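
To make the "machine-readable schema" idea above concrete, here is one
shape it could take in python. This is entirely hypothetical -- the
proposal deliberately leaves the schema design open, and every name
below is invented:

    # A schema for a made-up header: an ordered list of qualified
    # tokens, each with an optional q parameter restricted to a range.
    EXAMPLE_SCHEMA = {
        "params": {
            "q": {"type": "number", "min": 0.0, "max": 1.0},
        },
    }

    def validate(deflist, schema):
        # deflist is the parsed [[name, {params}], ...] structure
        for name, params in deflist:
            for key, value in params.items():
                rule = schema["params"].get(key)
                if rule is None:
                    return False        # parameter not in schema
                if rule["type"] == "number":
                    v = float(value)
                    if not rule["min"] <= v <= rule["max"]:
                        return False    # out of schema range
        return True

    validate([["audio/*", {"q": "0.2"}], ["audio/basic", {}]],
             EXAMPLE_SCHEMA)            # -> True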
Received on Monday, 1 August 2016 15:06:00 UTC