Re: If not JSON, what then ? from Mark Nottingham on 2016-08-02 (ietf-http-wg@w3.org from July to September 2016)

From: Mark Nottingham <mnot@mnot.net>
Date: Tue, 2 Aug 2016 13:33:39 +0200
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <12ED69B4-C924-475E-9432-B8FEB4B9DF80@mnot.net>

Hey PHK,

Sorry for the delay, been in (and still remain in) transit hell (hi from FRA!).

Overall I like this.

A few thoughts come to mind:

1) Using the first character of the field-value as a signal that the encoding is in use is interesting. I was thinking of indicating it with a suffix on the header field name (e.g., Date-J). Either is viable, but I don't think it's a good idea to reuse existing header field names and rely on that signal to differentiate the value type; that seems like it would cause a lot of interop problems to me. Defining a new header field (whether it's Date-J or Date2 or whatever) seems much safer to me.

2) Regardless of #1, using < as your indicator character is going to collide with the existing syntax of the Link header.

3) I really, really wonder whether we need recursion beyond one level; e.g., I can see a list of dicts, or a dict of dicts, but beyond that seems like a lot of complexity to support. Fields like Accept that have complex structure turn out not to be implemented (qvalues are commonly ignored); having an ordered list would work much better (and defining new header fields as per #1 means we have an opportunity to do this!).

4) I agree with the sentiment that non-ascii strings in header field values are comparatively rare (since most headers are not intended for display), so while we should accommodate them, they shouldn't be the default.

5) I like the idea of 'implicit angle brackets' to retrofit some existing headers. Depending on the parse algorithm we define, we could potentially fit a fair number of existing headers into this, although deriving the specific data types of things like parameter arguments is going to be difficult (or maybe impossible). Needs some investigation before we know whether this would be viable.

Cheers,




> On 1 Aug 2016, at 9:43 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> 
> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
> 
> A number of inputs came up in Stockholm which inform the process:
> Mark's earlier attempt to classify header syntax into groups and the
> desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).
> 
> My personal intuition was that we should find a binary serialization
> (like CBOR), and base64 it into HTTP1-2:  Ie: design for the future
> and shoe-horn into the present.  But no obvious binary serialization
> seems to exist, CBOR was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
> 
> Inspired by Mark's HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack:  Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
> 
> This document chronicles my deliberations, and the strawman I came
> up with:  Not only does it seem possible, it has some very interesting
> possibilities down the road.
> 
> Disclaimer:  ABNF may not be perfect.
> 
> Structure of headers
> ====================
> 
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
> 
> 	header: ordered sequence of named dictionaries
> 
> The "ordered" constraint arises in two ways:  We have explicitly
> ordered headers like {Content|Transfer}-Encoding and we have headers
> which have order by their q=%f parameters.
> 
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
> 
>     Accept: audio/*; q=0.2, audio/basic
> 
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
> 
> 	[
> 	    [ "audio/*", { "q": 0.2 }],
> 	    [ "audio/basic", {}]
> 	]
> 
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
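To make the q=%f ordering concrete, here is a minimal Python sketch that takes the list-of-pairs form shown above and orders it by q. The name order_by_q is made up for illustration, and treating a missing q as 1 follows the RFC 7231 quality-value rule rather than anything stated in this proposal.

```python
def order_by_q(parsed):
    """Order [[name, {params}], ...] by descending q value.

    A missing q is treated as 1 (assumption taken from RFC 7231).
    sorted() is stable, so elements with equal q keep their wire order.
    """
    return sorted(parsed, key=lambda item: -float(item[1].get("q", 1.0)))


accept = [
    ["audio/*", {"q": 0.2}],
    ["audio/basic", {}],
]
preferred = order_by_q(accept)
```

The list form is what makes this possible: a top-level dictionary would have no wire order to fall back on when q values tie.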
> 
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
> 
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
> 
> So let us use '<...>' to mark the recursion, since <> are shorter than
> [] and {} in HPACK/huffman.
> 
> Here is a two level example:
> 
> 	foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
> 
> Parsed into JSON that would be:
> 
> 	[
> 	    [
> 		"foo",
> 		{
> 		    "p1": 1,
> 		    "p4": {},
> 		    "p3": [
> 			[
> 			    "x1",
> 			    {}
> 			],
> 			[
> 			    "x2",
> 			    {}
> 			],
> 			[
> 			    "x3",
> 			    {
> 				"y2": 2,
> 				"y1": 1
> 			    }
> 			]
> 		    ],
> 		    "p2": "abc"
> 	        }
> 	    ],
> 	    [
> 		"bar",
> 		{}
> 	    ]
> 	]
> 
> (NB shuffled dictionary elements to show that JSON dicts are unordered)
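A rough recursive-descent sketch of how the two-level example could be parsed into that structure. Everything here is an assumption made for illustration: the function names, the rule that a parameter without "=" becomes an empty dict (copied from the p4 entry above), and the number detection. It ignores quoted-strings entirely, so it is nowhere near a complete parser.

```python
import re

# Anything that is not a separator, bracket, "=" or whitespace.
_WORD = re.compile(r"[^,;=<>\s]+")


def _skip_ws(s, i):
    while i < len(s) and s[i] in " \t":
        i += 1
    return i


def _word(s, i):
    i = _skip_ws(s, i)
    m = _WORD.match(s, i)
    if not m:
        raise ValueError(f"expected a token at offset {i}")
    return m.group(), _skip_ws(s, m.end())


def _scalar(raw):
    if re.fullmatch(r"-?[0-9]+", raw):
        return int(raw)
    if re.fullmatch(r"-?[0-9]+\.[0-9]*", raw):
        return float(raw)
    return raw


def _parse_list(s, i):
    """Parse 'name;k=v;k2, name2 ...' and return (items, next_offset)."""
    items = []
    while True:
        name, i = _word(s, i)
        params = {}
        while i < len(s) and s[i] == ";":
            key, i = _word(s, i + 1)
            value = {}                      # bare parameter, like p4
            if i < len(s) and s[i] == "=":
                i = _skip_ws(s, i + 1)
                if i < len(s) and s[i] == "<":
                    value, i = _parse_list(s, i + 1)   # recursion
                    if i >= len(s) or s[i] != ">":
                        raise ValueError(f"expected '>' at offset {i}")
                    i = _skip_ws(s, i + 1)
                else:
                    raw, i = _word(s, i)
                    value = _scalar(raw)
            params[key] = value
        items.append([name, params])
        if i < len(s) and s[i] == ",":
            i += 1
            continue
        return items, i


def parse_structure(value):
    items, i = _parse_list(value, 0)
    if i != len(value):
        raise ValueError(f"trailing data at offset {i}")
    return items


parsed = parse_structure("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar")
```

Note that the same function handles both the outer list and the nested one, which is the point of the recursive design.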
> 
> And now comes the recursion economy:
> 
> First we wrap the entire *new* header in <...>:
> 
> 	foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
> 
> This way, the first character of the header tells us that this header
> has "common structure".
> 
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
> 
> Next, we add a field to the IANA HTTP header registry (one can do
> that I hope ?) classifying their "angle-bracket status":
> 
> A) not angle-brackets -- incompatible structure, use topical parser
> 	Range
> 
> B) implicit angle-brackets -- Has common structure but is not <> enclosed
> 	Accept
> 	Content-Encoding
> 	Transfer-Encoding
> 
> C) explicit angle-brackets -- Has common structure and <> enclosed
> 	all new headers go here
> 
> D) unknown status.
> 	As it says on the tin.
> 
> Using this as a whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
> 
> And then for the final trick:   We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
> 
> 	Date: < 1469734833 >
> 
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
> 
> Next:
> 
> Data Types
> ==========
> 
> I think we need these fundamental data types, and subtypes:
> 
> 1)   Unicode strings
> 
> 2)	ascii-string (maybe)
> 
> 3)	binary blob
> 
> 4)   Token
> 
> 5)   Qualified-token
> 
> 6)   Number
> 
> 7)      integer
> 
> 8)   Timestamp
> 
> In addition to these subtypes, schemas can constrain types
> further, for instance integer ranges, string lengths etc.
> more on this below.
> 
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x, in a way that is not
> going to break anything important and HPACK should not hate
> them either.
> 
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover that here.
> 
> 1) Unicode string
> -----------------
> 
> The first question is do we mean "unrestricted unicode" or do
> we want to try to sanitize it.
> 
> An example of sanitization is RFC7230's "quoted-string" which bans
> control characters except forward horizontal white-space (=TAB).
> 
> Another is I-JSON (RFC7493)'s:
> 
>   MUST NOT include code points that identify Surrogates or
>   Noncharacters as defined by UNICODE.
> 
> As far as I can tell, that means that you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode.  Not cool IMO.
> 
> Imposing an RFC7230-like restriction on unicode gets totally
> rococo:  What does "forward horizontal white-space" mean on
> a line which uses both left-to-right and right-to-left alphabets ?
> What does it mean in alphabets which write vertically ?
> 
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
> 
> Encoding unicode strings in HTTP1+2 requires new syntax and
> for any number of reasons, I would like to minimize that
> and {re-|ab-}use quoted-strings.
> 
> RFC7230 does not specify what %80-%FF mean in quoted-string, but
> hints that it might be ISO8859.
> 
> Now we want it to become UTF-8.
> 
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
> encoding:  It takes 68 bits.
> 
> Encoding the BOM as '\ufeff' helps but still takes an unreasonable
> 48 bits in HPACK/huffman.
> 
> In both H1 and H2 defining a new "\U" escape seems better.
> 
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.
> 
> (We should not restrict which codepoints may/should use \u%%%% until
> we have studied whether \u%%%% may HPACK/huffman better than "raw"
> UTF-8 in asian codepages.)
> 
> The heuristic for parsing a quoted-string then becomes:
> 
> 	1) If the quoted-string first two characters are "\U"
> 		-> UTF-8
> 
> 	2)  If the quoted-string contains "\u%%%%" escape anywhere
> 		-> UTF-8
> 
> 	3)  If the quoted-string contains only %09-%7E
> 		-> UTF-8 (actually: ASCII)
> 
> 	4)  If the quoted-string contains any %7F-%8F
> 		-> UTF-8
> 
> 	5)  If header definition explicitly says ISO-8859
> 		-> ISO8859
> 
> 	6)  else
> 		-> UTF-8
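The six steps above, transcribed into a small Python sketch. Assumptions that are not in the proposal: the function name guess_charset, reading "%%%%" as four hex digits, mapping "ISO8859" to iso-8859-1, and operating on the raw bytes between the quotes. It does not try to tell an escaped backslash followed by "u" from a real \u escape.

```python
import re

# "\u" followed by four hex digits (assumed meaning of \u%%%%).
_U_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}")


def guess_charset(raw: bytes, header_says_iso8859: bool = False) -> str:
    """Apply the heuristic to the bytes inside a quoted-string."""
    if raw.startswith(b"\\U"):                      # step 1
        return "utf-8"
    if _U_ESCAPE.search(raw):                       # step 2
        return "utf-8"
    if all(0x09 <= b <= 0x7E for b in raw):         # step 3 (really ASCII)
        return "utf-8"
    if any(0x7F <= b <= 0x8F for b in raw):         # step 4
        return "utf-8"
    if header_says_iso8859:                         # step 5
        return "iso-8859-1"
    return "utf-8"                                  # step 6
```

Only step 5 can ever produce something other than UTF-8, so legacy headers have to opt in through their definition.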
> 
> 2) Ascii strings
> ----------------
> 
> I'm not sure if we need these or if they are even a good idea.
> 
> The "pro" argument is if we insist they are also english text
> so we have something the entire world stands a chance to understand.
> 
> The "contra" argument is that some people will be upset about that.
> 
> If we want them, they're quoted-strings from RFC723x without %7F-%FF.
> 
> It is probably better to derive them from unicode strings via the schema.
> 
> 3) Binary blobs
> ---------------
> 
> Binary blobs from crypto should squeeze into RFC7230 quoted-string
> as well, since we cannot put any kinds of markers or escapes on
> tokens without breaking things.
> 
> Proposal:
> 
> 	Quoted-string with "\#" as first two chars indicates base64
> 	encoded binary blob.
> 
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable) rather than
> munge up the content of the base64.
> 
> Base64 is chosen because it is the densest well known encoding which
> works well with HPACK/huffman:  The b64 characters on average emit
> 6.46 bits.
> 
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64 ?  But in languages which can, they should
> probably become native byte-strings.
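A small sketch of the "\#" convention in Python, mapping blobs to native byte-strings. The function names are hypothetical, and the use of the standard padded base64 alphabet is an assumption, since the proposal does not say which variant is meant.

```python
import base64


def encode_blob(blob: bytes) -> str:
    """Return the quoted-string content (without the quotes) for a blob."""
    return "\\#" + base64.b64encode(blob).decode("ascii")


def decode_quoted(content: str):
    """Return bytes for a '\\#' blob, otherwise the string unchanged."""
    if content.startswith("\\#"):
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(content[2:], validate=True)
    return content


wire = encode_blob(b"\x00\x01\xff")
```

If an intermediary strips the backslash, the content starts with a bare "#", which is not a base64 character, so the damage is detectable.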
> 
> 4) Token
> --------
> 
> As we know it from RFC7230:
> 
>   tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
>    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>   token = 1*tchar
> 
> 5) Qualified Token
> ------------------
> 
>   qualified_token = token 0*1("/" token)
> 
> All keys in all dictionaries are of this type.  (In JSON/python...
> the keys are strings)
> 
> Schemas can restrict this further.
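The token and qualified_token rules translate fairly directly into a regular expression. A Python sketch, with the name is_qualified_token made up for illustration:

```python
import re

# tchar from RFC7230, written as a character class.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# qualified_token = token 0*1("/" token)
_QUALIFIED = re.compile(rf"{_TOKEN}(?:/{_TOKEN})?")


def is_qualified_token(text: str) -> bool:
    return _QUALIFIED.fullmatch(text) is not None
```

Both "q" and "audio/*" pass, which is why one type can serve for media types and for plain parameter names.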
> 
> 6) Numbers
> ----------
> 
> These are signed decimal numbers which may have a fraction.
> 
> In HTTP1+2 we want them always in "%f" format and we want them to
> fit in IEEE754 64 bit floating point, which leads to the following
> definition:
> 
> 	0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT )	n+m < 15
> 
> (15 digits fit in IEEE754 64 bit binary floating point.)
> 
> These numbers can (also) be used for millisecond resolution absolute
> UNIX-epoch relative timestamps for the foreseeable future.
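One way to read the n+m < 15 rule, sketched in Python: one mandatory leading digit, n further integer digits and m fraction digits, at most 15 digits in total. The function name parse_number and the choice to raise ValueError are assumptions made here.

```python
import re

# Optional sign, at least one integer digit, optional "." plus fraction.
_NUMBER = re.compile(r"-?([0-9]+)(?:\.([0-9]*))?")


def parse_number(text: str) -> float:
    match = _NUMBER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a number: {text!r}")
    n = len(match.group(1)) - 1          # integer digits after the first
    m = len(match.group(2) or "")        # fraction digits
    if n + m >= 15:
        raise ValueError(f"too many digits: {text!r}")
    return float(text)


# A millisecond-resolution timestamp uses 13 of the 15 digits.
stamp = parse_number("1469734833.123")
```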
> 
> 7) Integers
> -----------
> 
> 	0*1"-" 1*15 DIGIT
> 
> Same restriction as above to fit into IEEE 754.
> 
> Range can & should be restricted by schemas as necessary.
> 
> 8) Timestamps
> -------------
> 
> I propose we do these as subtype of Numbers, as UNIX-epoch relative
> time.  That is somewhat human-hostile and is leap-second-challenged.
> 
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between a RFC7231 IMF-fixdate or a
> Number-Date.
> 
> Without guidance from a schema it becomes inefficient to determine
> if it is an IMF-fixdate, since the week day part looks like a token,
> but it is not impossible.
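With a schema saying "timestamp", telling the two forms apart can be as cheap as looking at the first character. A Python sketch, assuming the hypothetical name parse_timestamp and leaning on the standard library's email.utils date parser for the IMF-fixdate branch:

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_timestamp(value: str) -> float:
    """Return UNIX-epoch seconds for a Number-Date or an IMF-fixdate."""
    value = value.strip()
    if value[:1].isdigit() or value[:1] == "-":
        return float(value)                  # Number-Date
    parsed = parsedate_to_datetime(value)    # IMF-fixdate starts with a day name
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


old_style = parse_timestamp("Thu, 28 Jul 2016 19:40:33 GMT")
new_style = parse_timestamp("1469734833")
```

Both calls describe the same instant, the one used in the Date example earlier in the proposal.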
> 
> 
> Schemas
> =======
> 
> There needs to be an "ABNF"-parallel to specify what is mandatory and
> allowed for these headers in "common structure".
> 
> Ideally this should be in machine-readable format, so that
> validation tools and parser-code can be produced without
> (too much) human intervention.  I'm tempted to say we should
> make the schemas JSON, but then we need to write JSON schemas
> for our schemas :-/
> 
> Since schemas basically restrict what you are allowed to
> express, we need to examine and think about what restrictions
> we want to be able to impose, before we design the schema.
> 
> This is the least thought about part of this document, since
> the train is now in Lund:
> 
> Unicode strings:
> ----------------
> 
> * Limit by (UTF-8) encoded length.
> 	Ie: a resource restriction, not a typographical restriction.
> 
> * Limit by codepoints
> 	Example: Allow only "0-9" and "a-f"
> 	The specification of code-points should be list of codepoint
> 	ranges.  (Ascii strings could be defined this way)
> 
> * Limit by allowed strings
> 	ie: Allow only "North", "South", "East" and "West"
> 
> Tokens
> ------
> 
> * Limit by codepoints
> 	Example: Allow only "A-Z"
> 
> * Limit by length
> 	Example: Max 7 characters
> 
> * Limit by pattern
> 	Example: "A-Z" "a-z" "-" "0-9" "0-9"
> 	(use ABNF to specify ?)
> 
> * Limit by well known set
> 	Example: Token must be ISO3166-1 country code
> 	Example: Token must be in IANA FooBar registry
> 
> Qualified Tokens
> ----------------
> 
> * Limit each of the two component tokens as above.
> 	
> Binary Blob
> -----------
> 
> * Limit by length in bytes
> 	Example: 128 bytes
> 	Example: 16-64 or 80 bytes
> 
> Number
> ------
> 
> * Limit resolution
> 	Example: exactly 3 decimal digits
> 
> * Limit range
> 	Example: [2.716 ... 3.1415]
> 
> Integer
> -------
> 
> * Limit range
> 	Example [0 ... 65535]
> 
> Timestamp
> ---------
> 
> (I can't think of usable restrictions here)
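To give a feel for what a machine-readable form of a few of these restrictions might look like, here is a Python sketch using plain dicts. The schema keys ("type", "range", "max_bytes", "allowed", "lengths") and the function name conforms are invented for this example; no schema format is defined in the proposal.

```python
def conforms(value, schema) -> bool:
    """Return True if value satisfies the (hypothetical) schema dict."""
    kind = schema["type"]
    if kind == "integer":
        low, high = schema["range"]
        return isinstance(value, int) and low <= value <= high
    if kind == "unicode":
        # Length limit counts UTF-8 encoded bytes, not characters.
        if len(value.encode("utf-8")) > schema.get("max_bytes", float("inf")):
            return False
        allowed = schema.get("allowed")
        return allowed is None or value in allowed
    if kind == "blob":
        # A list of (min, max) byte-length ranges, e.g. 16-64 or 80.
        return any(low <= len(value) <= high for low, high in schema["lengths"])
    raise ValueError(f"unknown schema type: {kind!r}")


port = {"type": "integer", "range": (0, 65535)}
compass = {"type": "unicode", "allowed": {"North", "South", "East", "West"}}
digest = {"type": "blob", "lengths": [(16, 64), (80, 80)]}
```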
> 
> 
> Aaand... I'm in Copenhagen...
> 
> Let me know if any of this looks usable...
> 
> -- 
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
> 

--
Mark Nottingham   https://www.mnot.net/
Received on Tuesday, 2 August 2016 11:36:26 UTC