From: Poul-Henning Kamp <phk@phk.freebsd.dk>
Date: Mon, 01 Aug 2016 07:43:34 +0000
To: HTTP Working Group <ietf-http-wg@w3.org>
Based on discussions in email and at the workshop in Stockholm, JSON
doesn't seem like a good fit for HTTP headers.

A number of inputs came up in Stockholm which inform the process:
Mark's earlier attempt to classify header syntax into groups, and the
desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).

My personal intuition was that we should find a binary serialization
(like CBOR) and base64 it into HTTP1+2. Ie: design for the future and
shoe-horn into the present.

But no obvious binary serialization seems to exist, CBOR was panned by
a number of people in the WS as too complicated, and gag-reflexes were
triggered by ASN.1.

Inspired by Mark's HTTP-header classification, I spent the train-trip
back home to Denmark pondering the opposite attack: Is there a common
data structure which (many) existing headers would fit into, and which
could serve our needs going forward?

This document chronicles my deliberations, and the strawman I came up
with: Not only does it seem possible, it has some very interesting
possibilities down the road.

Disclaimer: ABNF may not be perfect.

Structure of headers
====================

I surveyed current headers, and a very large fraction of them fit into
this data structure:

    header: ordered sequence of named dictionaries

The "ordered" constraint arises in two ways: We have explicitly
ordered headers like {Content|Transfer}-Encoding, and we have headers
which have order by their q=%f parameters.

If we unserialize this model from RFC723x definitions, then ',' is the
list separator and ';' the dictionary indicator and separator:

    Accept: audio/*; q=0.2, audio/basic

The "ordered ... named" combination does not map directly to most
contemporary object models (JSON, python, ...) where dictionary order
is undefined, so a definition list is required to represent this in
JSON:

    [
        [ "audio/*", { "q": 0.2 } ],
        [ "audio/basic", {} ]
    ]

It looks tempting to find a way to make the toplevel JSON a dictionary
too, but given the use of wildcards in many of the keys ("text/*"),
and the q=%f ordering, that would not be helpful.

Next we want to give people the ability to have deeper structure, and
we can either do that recursively (ie: nested ordered sequences of
dictionaries) or restrict the deeper levels to dictionaries only.

That is probably a matter of taste more than anything, but the
recursive design will probably appeal aesthetically to more than just
me, and as we shall see shortly, it comes with certain economies.

So let us use '<...>' to mark the recursion, since <> are shorter than
[] and {} in HPACK/huffman. Here is a two-level example:

    foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar

Parsed into JSON that would be:

    [
        [ "foo", {
            "p1": 1,
            "p4": {},
            "p3": [
                [ "x1", {} ],
                [ "x2", {} ],
                [ "x3", { "y2": 2, "y1": 1 } ]
            ],
            "p2": "abc"
          }
        ],
        [ "bar", {} ]
    ]

(NB: dictionary elements shuffled to show that JSON dicts are
unordered.)

And now comes the recursion economy: First we wrap the entire *new*
header in <...>:

    foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>

This way, the first character of the header tells us that this header
has "common structure".

That explicit "common structure" signal means that privately defined
headers can use "common structure" as well, and middleware and
frameworks will automatically Do The Right Thing with them.
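To make the strawman concrete, here is a minimal parser sketch in
Python for this "common structure" (my illustration, not part of the
proposal: the function names are made up, quoted-strings and escapes
are ignored, and tokens/numbers are the only atoms):

    # Sketch of a parser for:  header = ordered sequence of named
    # dictionaries.  ',' separates list elements, ';' separates
    # dictionary entries, '<...>' marks recursion.  '/' is admitted
    # into tokens so "audio/basic" parses as one item.

    TCHAR = set("!#$%&'*+-.^_`|~/0123456789"
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

    def skip_ws(s, i):
        while i < len(s) and s[i] == ' ':
            i += 1
        return i

    def parse_item(s, i):
        """A token, or a number if it looks like one."""
        j = i
        while j < len(s) and s[j] in TCHAR:
            j += 1
        word = s[i:j]
        try:
            return (float(word) if '.' in word else int(word)), j
        except ValueError:
            return word, j

    def parse_value(s, i):
        if i < len(s) and s[i] == '<':      # recursion: nested list
            return parse_list(s, i + 1, nested=True)
        return parse_item(s, i)

    def parse_list(s, i=0, nested=False):
        elems = []
        while i < len(s):
            i = skip_ws(s, i)
            name, i = parse_item(s, i)
            params = {}
            while i < len(s) and s[i] == ';':
                key, i = parse_item(s, i + 1)
                if i < len(s) and s[i] == '=':
                    params[key], i = parse_value(s, i + 1)
                else:
                    params[key] = {}        # bare parameter, e.g. ";p4"
            elems.append([name, params])
            i = skip_ws(s, i)
            if nested and i < len(s) and s[i] == '>':
                return elems, i + 1         # hand control back to caller
            if i < len(s) and s[i] == ',':
                i += 1
        return (elems, i) if nested else elems

    def parse_header(value):
        """Strip explicit angle-brackets, then parse."""
        v = value.strip()
        if v.startswith('<') and v.endswith('>'):
            v = v[1:-1]
        return parse_list(v)

    print(parse_header("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar"))

Run on the two-level example above, it produces the definition-list
structure shown in the JSON rendering.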
Next, we add a field to the IANA HTTP header registry (one can do
that, I hope?) classifying each header's "angle-bracket status":

    A) not angle-brackets
       -- incompatible structure, use topical parser
       Range

    B) implicit angle-brackets
       -- Has common structure, but is not <> enclosed
       Accept
       Content-Encoding
       Transfer-Encoding

    C) explicit angle-brackets
       -- Has common structure and is <> enclosed
       all new headers go here

    D) unknown status
       -- As it says on the tin.

Using this as a whitelist, and given suitable schemas, a good number
of existing headers can go into the common parser.

And then for the final trick: We can now define new variants of
existing headers which "sidegrade" them into the common parser:

    Date: < 1469734833 >

This obviously needs a signal/negotiation so we know the other side
can grok them (HTTP2: SETTINGS, HTTP1: TE?)

Next:

Data Types
==========

I think we need these fundamental data types, and subtypes:

    1) Unicode strings
    2) Ascii strings (maybe)
    3) Binary blobs
    4) Tokens
    5) Qualified tokens
    6) Numbers
    7) Integers
    8) Timestamps

In addition to these subtypes, schemas can constrain types further,
for instance integer ranges, string lengths etc.; more on this below.

I will talk about each type in turn, but it goes without saying that
we need to fit them all into RFC723x in a way that is not going to
break anything important, and HPACK should not hate them either. In
HTTP3+ they should be serialized intelligently, but that should be
trivial and I will not cover it here.

1) Unicode strings
------------------

The first question is: do we mean "unrestricted unicode", or do we
want to try to sanitize it?

An example of sanitation is RFC7230's "quoted-string", which bans
control characters except forward horizontal white-space (=TAB).

Another is I-JSON (RFC7493)'s:

    MUST NOT include code points that identify Surrogates or
    Noncharacters as defined by UNICODE.

As far as I can tell, that means you have to keep a full UNICODE
table handy at all times, and update it whenever additions are made
to unicode. Not cool, IMO.

Imposing an RFC7230-like restriction on unicode gets totally rococo:
What does "forward horizontal white-space" mean on a line which uses
both left-to-right and right-to-left alphabets? What does it mean in
alphabets which write vertically?

Let us absolve the parser from such intimate unicode scholarship and
simply say that the data type "unicode string" is what it says, and
use the schemas to sanitize its individual uses.

Encoding unicode strings in HTTP1+2 requires new syntax, and for any
number of reasons I would like to minimize that and {re-|ab-}use
quoted-strings.

RFC7230 does not specify what %80-%FF means in quoted-string, but
hints that it might be ISO8859. Now we want it to become UTF-8.

My proposal at the workshop, to make the first three characters
inside the quotes a UTF-8 BOM, is quite pessimal in HPACK's huffman
encoding: It takes 68 bits. Encoding the BOM as '\ufeff' helps, but
still takes an unreasonable 48 bits in HPACK/huffman. In both H1 and
H2, defining a new "\U" escape seems better.

Since we want to carry unrestricted unicode, we also need escapes to
put the <%20 codepoints back in. I suggest "\u%%%%" like JSON.

(We should not restrict which codepoints may/should use \u%%%% until
we have studied whether \u%%%% may HPACK/huffman better than "raw"
UTF-8 in asian codepages.)
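For illustration, a sketch of the sending side under my reading of
this proposal: raw UTF-8 in a quoted-string, "\u%%%%" escapes to put
the <%20 codepoints back in, and a "\U" prefix when non-ASCII content
is present. The exact escape policy is an assumption (the paragraph
above explicitly leaves it open), and encode_unicode_string is a
made-up name:

    def encode_unicode_string(s):
        body = ""
        for ch in s:
            if ord(ch) < 0x20:
                body += "\\u%04x" % ord(ch)   # put <%20 codepoints back in
            elif ch in '"\\':
                body += "\\" + ch             # ordinary quoted-pair
            else:
                body += ch                    # raw UTF-8 on the wire
        if any(ord(ch) > 0x7e for ch in body):
            body = "\\U" + body               # explicit UTF-8 signal
        return '"' + body + '"'

    print(encode_unicode_string("gr\u00f8d\tmed fl\u00f8de"))
    # -> "\Ugrød\u0009med fløde"  (wire bytes are UTF-8)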
The heuristic for parsing a quoted-string then becomes:

    1) If the quoted-string's first two characters are "\U" -> UTF-8
    2) If the quoted-string contains a "\u%%%%" escape anywhere -> UTF-8
    3) If the quoted-string contains only %09-%7E -> UTF-8 (actually: ASCII)
    4) If the quoted-string contains any %7F-%8F -> UTF-8
    5) If the header definition explicitly says ISO-8859 -> ISO8859
    6) else -> UTF-8

2) Ascii strings
----------------

I'm not sure if we need these, or if they are even a good idea.

The "pro" argument is if we insist they are also english text, so we
have something the entire world stands a chance of understanding. The
"contra" argument is that some people will be upset about that.

If we want them, they're quoted-strings from RFC723x without %7F-%FF.
It is probably better to derive them from unicode strings via schemas.

3) Binary blobs
---------------

Fitting binary blobs from crypto into RFC7230 should squeeze into
quoted-string as well, since we cannot put any kind of markers or
escapes on tokens without breaking things.

Proposal: A quoted-string with "\#" as the first two characters
indicates a base64 encoded binary blob.

I chose "\#" because "#" is not in the base64 set, so if some
nonconforming implementation eliminates the "unnecessary escape", it
will be clearly visible (and likely recoverable), rather than munge
up the content of the base64.

Base64 is chosen because it is the densest well-known encoding which
works well with HPACK/huffman: The b64 characters on average emit
6.46 bits.

I have no idea how these blobs would look when parsed into JSON,
probably as base64? But in languages which can, they should probably
become native byte-strings.

4) Tokens
---------

As we know them from RFC7230:

    tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" /
            "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
    token = 1*tchar

5) Qualified tokens
-------------------

    qualified_token = token 0*1("/" token)

All keys in all dictionaries are of this type. (In JSON/python/...
the keys are strings.) Schemas can restrict this further.

6) Numbers
----------

These are signed decimal numbers which may have a fraction. In
HTTP1+2 we want them always on "%f" format, and we want them to fit
in IEEE754 64-bit floating point, which leads to the following
definition:

    0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT)    ; n+m < 15

(15 digits fit in IEEE754 64-bit binary floating point.)

These numbers can (also) be used for millisecond-resolution absolute
UNIX-epoch-relative timestamps for all foreseeable future.

7) Integers
-----------

    0*1"-" 1*15DIGIT

Same restriction as above, to fit into IEEE754. Range can & should be
restricted by schemas as necessary.

8) Timestamps
-------------

I propose we do these as a subtype of Numbers, as UNIX-epoch-relative
time. That is somewhat human-hostile and leap-second-challenged.

If you know from the schema that a timestamp is coming, the parser
can easily tell the difference between an RFC7231 IMF-fixdate and a
Number-Date. Without guidance from a schema it becomes inefficient to
determine if it is an IMF-fixdate, since the week-day part looks like
a token, but it is not impossible.
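Before moving on to schemas: a sketch of the receiving side,
combining the quoted-string heuristic above with the "\#" blob rule.
classify_quoted_string is a made-up name, and resolving "\u%%%%"
escapes is omitted from the sketch:

    import base64

    def classify_quoted_string(content, schema_says_iso8859=False):
        """content: the bytes between the quotes, quoted-pairs undone."""
        if content.startswith(b'\\#'):                  # binary blob marker
            return 'blob', base64.b64decode(content[2:])
        if content.startswith(b'\\U'):                  # rule 1
            return 'unicode', content[2:].decode('utf-8')
        if b'\\u' in content:                           # rule 2
            return 'unicode', content.decode('utf-8')
        if all(0x09 <= b <= 0x7e for b in content):     # rule 3
            return 'unicode', content.decode('ascii')
        if any(0x7f <= b <= 0x8f for b in content):     # rule 4
            return 'unicode', content.decode('utf-8')
        if schema_says_iso8859:                         # rule 5
            return 'unicode', content.decode('iso-8859-1')
        return 'unicode', content.decode('utf-8')       # rule 6

    print(classify_quoted_string(b'\\#SGVsbG8='))   # -> ('blob', b'Hello')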
Schemas
=======

There needs to be an "ABNF"-parallel to specify what is mandatory and
allowed for these headers in "common structure". Ideally this should
be in a machine-readable format, so that validation tools and parser
code can be produced without (too much) human intervention.

I'm tempted to say we should make the schemas JSON, but then we would
need to write JSON schemas for our schemas :-/

Since schemas basically restrict what you are allowed to express, we
need to examine and think about what restrictions we want to be able
to impose before we design the schema.

This is the least thought-about part of this document, since the
train is now in Lund:

Unicode strings
---------------

* Limit by (UTF-8) encoded length.
  Ie: a resource restriction, not a typographical restriction.

* Limit by codepoints
  Example: Allow only "0-9" and "a-f"
  The specification of codepoints should be a list of codepoint
  ranges. (Ascii strings could be defined this way.)

* Limit by allowed strings
  Ie: Allow only "North", "South", "East" and "West"

Tokens
------

* Limit by codepoints
  Example: Allow only "A-Z"

* Limit by length
  Example: Max 7 characters

* Limit by pattern
  Example: "A-Z" "a-z" "-" "0-9" "0-9"
  (use ABNF to specify?)

* Limit by well-known set
  Example: Token must be an ISO3166-1 country code
  Example: Token must be in IANA FooBar registry

Qualified Tokens
----------------

* Limit each of the two component tokens as above.

Binary Blob
-----------

* Limit by length in bytes
  Example: 128 bytes
  Example: 16-64 or 80 bytes

Number
------

* Limit resolution
  Example: exactly 3 decimal digits

* Limit range
  Example: [2.716 ... 3.1415]

Integer
-------

* Limit range
  Example: [0 ... 65535]

Timestamp
---------

(I can't think of usable restrictions here.)

Aaand... I'm in Copenhagen...

Let me know if any of this looks usable...

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.