Re: If not JSON, what then ? from Sam Johnston on 2016-08-18 (ietf-http-wg@w3.org from July to September 2016)

From: Sam Johnston <samj@samj.net>
Date: Thu, 18 Aug 2016 14:43:19 +0100
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CAKTR03-7J8WV8dZr8Y5p_YFsWBBr5gv47dRXXKvUpSByjXCu7g@mail.gmail.com>
Shame to have missed this discussion as it starts to look like one we had a
few years ago around using headers directly rather than trying to embed
another envelope format in them (à la SOAP in the body):

https://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html

I actually wrote drafts for "Category" and "Attribute" headers at the time;
the latter is yet to see the light of day but the former is here:
https://tools.ietf.org/html/draft-johnston-http-category-header-02

My view is that you should be able to have e.g. a photo with headers
containing attributes like title & summary, categories like landscape, and
links to e.g. author (which Mark has already standardised in RFC5988 Web
Linking).

We got caught up with things like unicode, client library support, etc. at
the time, but I expect some of these things have been resolved in the
interim.

Sam


On Mon, Aug 1, 2016 at 8:43 AM, Poul-Henning Kamp <phk@phk.freebsd.dk>
wrote:

> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
>
> A number of inputs came up in Stockholm which inform the process:
> Mark's earlier attempt to classify header syntax into groups and the
> desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).
>
> My personal intuition was that we should find a binary serialization
> (like CBOR), and base64 it into HTTP1-2:  Ie: design for the future
> and shoe-horn into the present.  But no obvious binary serialization
> seems to exist, CBOR was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
>
> Inspired by Mark's HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack:  Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
>
> This document chronicles my deliberations, and the strawman I came
> up with:  Not only does it seem possible, it has some very interesting
> possibilities down the road.
>
> Disclaimer:  ABNF may not be perfect.
>
> Structure of headers
> ====================
>
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
>
>         header: ordered sequence of named dictionaries
>
> The "ordered" constraint arises in two ways:  We have explicitly
> ordered headers like {Content|Transfer}-Encoding and we have headers
> which have order by their q=%f parameters.
>
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
>
>      Accept: audio/*; q=0.2, audio/basic
>
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
>
>         [
>             [ "audio/*", { "q": 0.2 }],
>             [ "audio/basic", {}]
>         ]
>
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
>
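A minimal Python sketch of the one-level model above, for readers who want to poke at it. It assumes no quoted-strings and no nesting, and the helper names (parse_flat, by_preference) are made up for illustration. It also shows why the list order plus q matters: a missing q is treated as 1, as RFC 7231 specifies for Accept.

```python
import re

def parse_flat(value):
    """Split 'name; k=v, name2' into [[name, {params}], ...], keeping list order."""
    items = []
    for element in value.split(","):
        name, *params = [part.strip() for part in element.split(";")]
        parsed = {}
        for param in params:
            key, _, raw = param.partition("=")
            raw = raw.strip()
            # Only the plain decimal form used by q-values is converted here.
            is_number = re.fullmatch(r"\d+(\.\d*)?", raw) is not None
            parsed[key.strip()] = float(raw) if is_number else raw
        items.append([name, parsed])
    return items

def by_preference(items):
    """Order names by descending q; sorted() is stable, so ties keep list order."""
    ranked = sorted(items, key=lambda item: -item[1].get("q", 1.0))
    return [name for name, _ in ranked]

accept = parse_flat("audio/*; q=0.2, audio/basic")
print(accept)
print(by_preference(accept))
```

The outer container has to stay a list: a dictionary keyed on "audio/*" would lose the position information that the stable sort relies on.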
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
>
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
>
> So let us use '<...>' to mark the recursion, since <> are shorter than
> [] and {} in HPACK/huffman.
>
> Here is a two level example:
>
>         foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
>
> Parsed into JSON that would be:
>
>         [
>             [
>                 "foo",
>                 {
>                     "p1": 1,
>                     "p4": {},
>                     "p3": [
>                         [
>                             "x1",
>                             {}
>                         ],
>                         [
>                             "x2",
>                             {}
>                         ],
>                         [
>                             "x3",
>                             {
>                                 "y2": 2,
>                                 "y1": 1
>                             }
>                         ]
>                     ],
>                     "p2": "abc"
>                 }
>             ],
>             [
>                 "bar",
>                 {}
>             ]
>         ]
>
> (NB shuffled dictionary elements to show that JSON dicts are unordered)
>
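The two-level example can be reproduced with a small recursive-descent parser. This is only a sketch under my own assumptions: no quoted-strings, values are either a bare token/number or a '<...>' nested sequence, and a parameter without '=' maps to an empty dict as in the JSON above. The names parse_nested and _Parser are hypothetical.

```python
import re

_ATOM = re.compile(r"[^,;=<>\s]+")  # anything up to a structural character

def _convert(raw):
    """Turn plain decimal text into int/float; leave everything else a string."""
    if re.fullmatch(r"-?\d+", raw):
        return int(raw)
    if re.fullmatch(r"-?\d+\.\d*", raw):
        return float(raw)
    return raw

class _Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        """Skip blanks and return the next character ('' at end of input)."""
        while self.pos < len(self.text) and self.text[self.pos] in " \t":
            self.pos += 1
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def atom(self):
        self.peek()
        match = _ATOM.match(self.text, self.pos)
        if not match:
            raise ValueError(f"expected a token at offset {self.pos}")
        self.pos = match.end()
        return match.group()

    def sequence(self):
        items = [self.element()]
        while self.peek() == ",":
            self.pos += 1
            items.append(self.element())
        return items

    def element(self):
        name, params = self.atom(), {}
        while self.peek() == ";":
            self.pos += 1
            key = self.atom()
            if self.peek() != "=":
                params[key] = {}          # bare parameter such as ';p4'
                continue
            self.pos += 1
            if self.peek() == "<":        # recursion: nested ordered sequence
                self.pos += 1
                params[key] = self.sequence()
                if self.peek() != ">":
                    raise ValueError(f"expected '>' at offset {self.pos}")
                self.pos += 1
            else:
                params[key] = _convert(self.atom())
        return [name, params]

def parse_nested(value):
    parser = _Parser(value)
    result = parser.sequence()
    if parser.peek():
        raise ValueError(f"trailing data at offset {parser.pos}")
    return result

print(parse_nested("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar"))
```

The same element() routine serves both levels, which is the economy the recursive design buys: one parser, any depth.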
> And now comes the recursion economy:
>
> First we wrap the entire *new* header in <...>:
>
>         foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
>
> This way, the first character of the header tells us that this header
> has "common structure".
>
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
>
> Next, we add a field to the IANA HTTP header registry (one can do
> that I hope ?) classifying their "angle-bracket status":
>
>  A) not angle-brackets -- incompatible structure, use topical parser
>         Range
>
>  B) implicit angle-brackets -- Has common structure but is not <> enclosed
>         Accept
>         Content-Encoding
>         Transfer-Encoding
>
>  C) explicit angle-brackets -- Has common structure and <> enclosed
>         all new headers go here
>
>  D) unknown status.
>         As it says on the tin.
>
> Using this as a whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
>
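To make the whitelist idea concrete, here is how a framework might pick a parser. This is a sketch, not a proposal: the table is a hypothetical excerpt holding only the headers named above, and choose_parser and its return labels are invented for illustration.

```python
# Hypothetical excerpt of the proposed registry column (A/B as listed above).
ANGLE_BRACKET_STATUS = {
    "range": "A",
    "accept": "B",
    "content-encoding": "B",
    "transfer-encoding": "B",
}

def choose_parser(name, value):
    """Return 'common', 'topical' or 'unknown' for a header field."""
    if value.lstrip().startswith("<"):
        return "common"       # C: explicitly <> enclosed, no registry lookup needed
    status = ANGLE_BRACKET_STATUS.get(name.lower(), "D")
    if status == "B":
        return "common"       # B: common structure, just not enclosed
    if status == "A":
        return "topical"      # A: needs its own header-specific parser
    return "unknown"          # D: unknown status

print(choose_parser("Accept", "audio/*; q=0.2, audio/basic"))
print(choose_parser("X-Private", "<foo;p1=1, bar>"))
```

Note that the first-character check comes before the registry lookup, so privately defined headers get the common parser without being registered anywhere.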
> And then for the final trick:   We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
>
>         Date: < 1469734833 >
>
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
>
> Next:
>
> Data Types
> ==========
>
> I think we need these fundamental data types, and subtypes:
>
> 1)   Unicode strings
>
> 2)      ascii-string (maybe)
>
> 3)      binary blob
>
> 4)   Token
>
> 5)   Qualified-token
>
> 6)   Number
>
> 7)      integer
>
> 8)   Timestamp
>
> In addition to these subtypes, schemas can constrain types
> further, for instance integer ranges, string lengths etc.
> More on this below.
>
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x, in a way that is not
> going to break anything important and HPACK should not hate
> them either.
>
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover that here.
>
> 1) Unicode string
> -----------------
>
> The first question is do we mean "unrestricted unicode" or do
> we want to try to sanitize it.
>
> An example of sanitization is RFC7230's "quoted-string" which bans
> control characters except forward horizontal white-space (=TAB).
>
> Another is I-JSON (RFC7493)'s:
>
>    MUST NOT include code points that identify Surrogates or
>    Noncharacters as defined by UNICODE.
>
> As far as I can tell, that means that you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode.  Not cool IMO.
>
> Imposing an RFC7230-like restriction on unicode gets totally
> rococo:  What does "forward horizontal white-space" mean on
> a line which uses both left-to-right and right-to-left alphabets ?
> What does it mean in alphabets which write vertically ?
>
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
>
> Encoding unicode strings in HTTP1+2 requires new syntax and
> for any number of reasons, I would like to minimize that
> and {re-|ab-}use quoted-strings.
>
> RFC7230 does not specify what %80-%FF means in quoted-string, but
> hints that it might be ISO8859.
>
> Now we want it to become UTF-8.
>
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
> encoding:  It takes 68 bits.
>
> Encoding the BOM as '\ufeff' helps but still takes an unreasonable
> 48 bits in HPACK/huffman.
>
> In both H1 and H2 defining a new "\U" escape seems better.
>
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.
>
> (We should not restrict which codepoints may/should use \u%%%% until
> we have studied if \u%%%% may HPACK/huffman better than "raw" UTF-8
> in asian codepages.)
>
> The heuristic for parsing a quoted-string then becomes:
>
>         1) If the quoted-string's first two characters are "\U"
>                 -> UTF-8
>
>         2)  If the quoted-string contains "\u%%%%" escape anywhere
>                 -> UTF-8
>
>         3)  If the quoted-string contains only %09-%7E
>                 -> UTF-8 (actually: ASCII)
>
>         4)  If the quoted-string contains any %7F-%8F
>                 -> UTF-8
>
>         5)  If header definition explicitly says ISO-8859
>                 -> ISO8859
>
>         6)  else
>                 -> UTF-8
>
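The six rules translate almost line for line into code. A sketch, assuming the input is the raw bytes between the quotes, that "\u%%%%" means four hex digits as in JSON, and that the per-header ISO-8859 flag comes from somewhere else (a schema or the header definition). The function name and the returned labels are mine.

```python
import re

def guess_charset(content: bytes, header_says_iso8859: bool = False) -> str:
    """Apply the quoted-string heuristic, rules 1 to 6, in order."""
    if content.startswith(b"\\U"):                      # 1) explicit \U marker
        return "UTF-8"
    if re.search(rb"\\u[0-9A-Fa-f]{4}", content):       # 2) any \u%%%% escape
        return "UTF-8"
    if all(0x09 <= byte <= 0x7E for byte in content):   # 3) plain ASCII
        return "UTF-8"
    if any(0x7F <= byte <= 0x8F for byte in content):   # 4) range as given above
        return "UTF-8"
    if header_says_iso8859:                             # 5) header definition wins
        return "ISO-8859"
    return "UTF-8"                                      # 6) default

print(guess_charset(b"caf\xe9", header_says_iso8859=True))
print(guess_charset(b"\\Ucaf\xc3\xa9", header_says_iso8859=True))
```

Because rule 5 is checked so late, a legacy header only falls back to ISO-8859 when none of the UTF-8 signals fire.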
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these or if they are even a good idea.
>
> The "pro" argument is if we insist they are also english text
> so we have something the entire world stands a chance to understand.
>
> The "contra" argument is that some people will be upset about that.
>
> If we want them, they're quoted-strings from RFC723x without %7F-%FF.
>
> It is probably better to define them by schema, as a restriction of
> unicode strings.
>
> 3) Binary blobs
> ---------------
>
> Binary blobs from crypto will have to squeeze into RFC7230's
> quoted-string as well, since we cannot put any kinds of markers or
> escapes on tokens without breaking things.
>
> Proposal:
>
>         Quoted-string with "\#" as first two chars indicates base64
>         encoded binary blob.
>
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable) rather than
> munge up the content of the base64.
>
> Base64 is chosen because it is the densest well known encoding which
> works well with HPACK/huffman:  The b64 characters on average emit
> 6.46 bits.
>
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64 ?  But in languages which can, they should
> probably become native byte-strings.
>
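A sketch of the "\#" convention using Python's standard base64 module, assuming the input is the text between the quotes and that standard (not URL-safe) base64 is meant. The recovery branch for a bare "#" is my reading of "likely recoverable"; encode_blob and decode_blob are hypothetical names.

```python
import base64

def encode_blob(data: bytes) -> str:
    """Return the quoted-string content for a binary blob."""
    return "\\#" + base64.b64encode(data).decode("ascii")

def decode_blob(content: str) -> bytes:
    """Decode quoted-string content that carries the \\# blob marker."""
    if content.startswith("\\#"):
        payload = content[2:]
    elif content.startswith("#"):
        # Some hop removed the "unnecessary escape"; '#' is not a base64
        # character, so the damage is visible and the payload is intact.
        payload = content[1:]
    else:
        raise ValueError("not a binary blob")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(payload, validate=True)

wire = encode_blob(b"hello")
print(wire)
print(decode_blob(wire))
```

In Python the result is a native byte-string, which matches the closing remark above about languages that have one.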
> 4) Token
> --------
>
> As we know it from RFC7230:
>
>    tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
>     "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>    token = 1*tchar
>
> 5) Qualified Token
> ------------------
>
>    qualified_token = token 0*1("/" token)
>
> All keys in all dictionaries are of this type.  (In JSON/python...
> the keys are strings)
>
> Schemas can restrict this further.
>
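Both grammars are small enough to check with a regular expression. A sketch; the character class is copied from the tchar rule above, and the function names are mine.

```python
import re

# tchar from RFC7230: the listed punctuation plus DIGIT and ALPHA.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_TOKEN_RE = re.compile(_TOKEN)
_QUALIFIED_RE = re.compile(_TOKEN + r"(?:/" + _TOKEN + r")?")  # token 0*1("/" token)

def is_token(text: str) -> bool:
    return _TOKEN_RE.fullmatch(text) is not None

def is_qualified_token(text: str) -> bool:
    return _QUALIFIED_RE.fullmatch(text) is not None

for candidate in ("q", "text/*", "audio/basic", "a/b/c", "a b"):
    print(candidate, is_qualified_token(candidate))
```

Since "*" is a tchar, wildcard keys such as "text/*" are ordinary qualified tokens and need no special case.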
> 6) Numbers
> ----------
>
> These are signed decimal numbers which may have a fraction.
>
> In HTTP1+2 we want them always in "%f" format and we want them to
> fit in IEEE754 64 bit floating point, which leads to the following
> definition:
>
>         0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT )        n+m < 15
>
> (15 digits fit in IEEE754 64 bit binary floating point.)
>
> These numbers can (also) be used for millisecond resolution absolute
> UNIX-epoch relative timestamps for all foreseeable future.
>
> 7) Integers
> -----------
>
>         0*1"-" 1*15DIGIT
>
> Same restriction as above to fit into IEEE 754.
>
> Range can & should be restricted by schemas as necessary.
>
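The two definitions can be enforced like this. A sketch, reading "n+m < 15" as "at most 15 digits in total, counting the mandatory first one"; parse_number and parse_integer are hypothetical names.

```python
import re

_NUMBER = re.compile(r"-?(\d+)(?:\.(\d*))?")
_INTEGER = re.compile(r"-?\d{1,15}")

def parse_number(text: str) -> float:
    """Accept 0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT) with at most 15 digits."""
    match = _NUMBER.fullmatch(text)
    if not match:
        raise ValueError(f"not a number: {text!r}")
    digits = len(match.group(1)) + len(match.group(2) or "")
    if digits > 15:
        raise ValueError(f"more than 15 digits: {text!r}")
    return float(text)

def parse_integer(text: str) -> int:
    """Accept 0*1"-" 1*15DIGIT."""
    if not _INTEGER.fullmatch(text):
        raise ValueError(f"not an integer: {text!r}")
    return int(text)

# A millisecond-resolution UNIX timestamp needs only 13 digits today.
stamp = parse_number("1469734833.123")
print(stamp, "%.15g" % stamp)
```

The 15-digit cap is what makes the round trip safe: any decimal with up to 15 significant digits survives conversion to a 64 bit float and back.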
> 8) Timestamps
> -------------
>
> I propose we do these as subtype of Numbers, as UNIX-epoch relative
> time.  That is somewhat human-hostile and is leap-second-challenged.
>
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between an RFC7231 IMF-fixdate and a
> Number-Date.
>
> Without guidance from a schema it becomes inefficient to determine
> if it is an IMF-fixdate, since the week day part looks like a token,
> but it is not impossible.
>
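With schema guidance the distinction is a one-character lookahead. A sketch, assuming the schema has already told us a timestamp is expected; it leans on email.utils.parsedate_to_datetime from the Python standard library for the IMF-fixdate branch, and parse_timestamp is a made-up name.

```python
from email.utils import parsedate_to_datetime

def parse_timestamp(value: str) -> float:
    """Return UNIX-epoch seconds from a Number-Date or an IMF-fixdate."""
    value = value.strip()
    if value[:1].isdigit() or value[:1] == "-":
        return float(value)                  # Number-Date
    # IMF-fixdate starts with a week day name, e.g. "Sun, 06 Nov 1994 ..."
    return parsedate_to_datetime(value).timestamp()

print(parse_timestamp("1469734833"))
print(parse_timestamp("Sun, 06 Nov 1994 08:49:37 GMT"))
```

Without the schema, the leading "Sun" would first have to be ruled out as an ordinary token, which is the inefficiency mentioned above.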
>
> Schemas
> =======
>
> There needs to be an "ABNF"-parallel to specify what is mandatory and
> allowed for these headers in "common structure".
>
> Ideally this should be in machine-readable format, so that
> validation tools and parser-code can be produced without
> (too much) human intervention.  I'm tempted to say we should
> make the schemas JSON, but then we need to write JSON schemas
> for our schemas :-/
>
> Since schemas basically restrict what you are allowed to
> express, we need to examine and think about what restrictions
> we want to be able to impose, before we design the schema.
>
> This is the least thought about part of this document, since
> the train is now in Lund:
>
> Unicode strings:
> ----------------
>
> * Limit by (UTF-8) encoded length.
>         Ie: a resource restriction, not a typographical restriction.
>
> * Limit by codepoints
>         Example: Allow only "0-9" and "a-f"
>         The specification of code-points should be list of codepoint
>         ranges.  (Ascii strings could be defined this way)
>
> * Limit by allowed strings
>         ie: Allow only "North", "South", "East" and "West"
>
> Tokens
> ------
>
> * Limit by codepoints
>         Example: Allow only "A-Z"
>
> * Limit by length
>         Example: Max 7 characters
>
> * Limit by pattern
>         Example: "A-Z" "a-z" "-" "0-9" "0-9"
>         (use ABNF to specify ?)
>
> * Limit by well known set
>         Example: Token must be ISO3166-1 country code
>         Example: Token must be in IANA FooBar registry
>
> Qualified Tokens
> ----------------
>
> * Limit each of the two component tokens as above.
>
> Binary Blob
> -----------
>
> * Limit by length in bytes
>         Example: 128 bytes
>         Example: 16-64 or 80 bytes
>
> Number
> ------
>
> * Limit resolution
>         Example: exactly 3 decimal digits
>
> * Limit range
>         Example: [2.716 ... 3.1415]
>
> Integer
> -------
>
> * Limit range
>         Example: [0 ... 65535]
>
> Timestamp
> ---------
>
> (I can't think of usable restrictions here)
>
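To get a feel for what a machine-readable schema might have to express, here is a toy validator covering a few of the restrictions listed above. Everything about it is an assumption on my part: the schema is a plain Python dict, and the keys ("range", "allowed", "max_bytes", "codepoints", "max_length", "lengths") are invented, not part of the proposal.

```python
def check(value, schema) -> bool:
    """Return True if an already-parsed value satisfies a toy schema dict."""
    kind = schema["type"]
    if kind == "integer":
        low, high = schema["range"]
        return isinstance(value, int) and low <= value <= high
    if kind == "string":
        if "allowed" in schema and value not in schema["allowed"]:
            return False
        limit = schema.get("max_bytes")      # UTF-8 encoded length, not characters
        return limit is None or len(value.encode("utf-8")) <= limit
    if kind == "token":
        ranges = schema.get("codepoints")    # list of (first, last) character pairs
        if ranges and not all(any(lo <= ch <= hi for lo, hi in ranges) for ch in value):
            return False
        return len(value) <= schema.get("max_length", len(value))
    if kind == "blob":
        # e.g. "16-64 or 80 bytes" becomes [(16, 64), (80, 80)]
        return any(lo <= len(value) <= hi for lo, hi in schema["lengths"])
    raise ValueError(f"unknown schema type: {kind}")

port = {"type": "integer", "range": (0, 65535)}
compass = {"type": "string", "allowed": {"North", "South", "East", "West"}}
print(check(8080, port), check(70000, port), check("North", compass))
```

Pattern limits and "well known set" lookups (ISO3166-1, IANA registries) are left out; they need either ABNF or external data.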
>
> Aaand... I'm in Copenhagen...
>
> Let me know if any of this looks usable...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
>
>
Received on Thursday, 18 August 2016 13:44:09 UTC