From: Poul-Henning Kamp <phk@phk.freebsd.dk>
Date: Mon, 01 Aug 2016 07:43:34 +0000
To: HTTP Working Group <ietf-http-wg@w3.org>
Based on discussions in email and at the workshop in Stockholm, JSON
doesn't seem like a good fit for HTTP headers.

A number of inputs came up in Stockholm which inform the process:
Mark's earlier attempt to classify header syntax into groups, and the
desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).

My personal intuition was that we should find a binary serialization
(like CBOR) and base64 it into HTTP1+2. Ie: design for the future and
shoe-horn into the present.

But no obvious binary serialization seems to exist, CBOR was panned by
a number of people in the WS as too complicated, and gag-reflexes were
triggered by ASN.1.

Inspired by Mark's HTTP-header classification, I spent the train-trip
back home to Denmark pondering the opposite attack: Is there a common
data structure which (many) existing headers would fit into, and which
could serve our needs going forward?

This document chronicles my deliberations, and the strawman I came up
with: Not only does it seem possible, it has some very interesting
possibilities down the road.

Disclaimer: ABNF may not be perfect.

Structure of headers
====================

I surveyed current headers, and a very large fraction of them fit into
this data structure:

    header: ordered sequence of named dictionaries

The "ordered" constraint arises in two ways: We have explicitly
ordered headers like {Content|Transfer}-Encoding, and we have headers
which have order by their q=%f parameters.

If we unserialize this model from RFC723x definitions, then ',' is the
list separator and ';' the dictionary indicator and separator:

    Accept: audio/*; q=0.2, audio/basic

The "ordered ... named" combination does not map directly to most
contemporary object models (JSON, python, ...) where dictionary order
is undefined, so a definition list is required to represent this in
JSON:

    [
        [ "audio/*", { "q": 0.2 } ],
        [ "audio/basic", {} ]
    ]

It looks tempting to find a way to make the toplevel JSON a dictionary
too, but given the use of wildcards in many of the keys ("text/*"),
and the q=%f ordering, that would not be helpful.

Next we want to give people the ability to have deeper structure, and
we can either do that recursively (ie: nested ordered sequences of
dictionaries) or restrict the deeper levels to dictionaries only.

That is probably a matter of taste more than anything, but the
recursive design will probably appeal aesthetically to more than just
me, and as we shall see shortly, it comes with certain economies.

So let us use '<...>' to mark the recursion, since <> are shorter than
[] and {} in HPACK/huffman. Here is a two-level example:

    foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar

Parsed into JSON that would be:

    [
        [ "foo", {
            "p1": 1,
            "p4": {},
            "p3": [
                [ "x1", {} ],
                [ "x2", {} ],
                [ "x3", { "y2": 2, "y1": 1 } ]
            ],
            "p2": "abc"
          }
        ],
        [ "bar", {} ]
    ]

(NB: dictionary elements shuffled to show that JSON dicts are
unordered.)

And now comes the recursion economy: First we wrap the entire *new*
header in <...>:

    foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>

This way, the first character of the header tells us that this header
has "common structure".

That explicit "common structure" signal means that privately defined
headers can use "common structure" as well, and middleware and
frameworks will automatically Do The Right Thing with them.
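To make the strawman concrete, here is a minimal parser sketch in
Python for this "common structure" (my illustration, not part of the
proposal: the function names are made up, quoted-strings and escapes
are ignored, and tokens/numbers are the only atoms):

    # Sketch of a parser for:  header = ordered sequence of named
    # dictionaries.  ',' separates list elements, ';' separates
    # dictionary entries, '<...>' marks recursion.  '/' is admitted
    # into tokens so "audio/basic" parses as one item.

    TCHAR = set("!#$%&'*+-.^_`|~/0123456789"
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

    def skip_ws(s, i):
        while i < len(s) and s[i] == ' ':
            i += 1
        return i

    def parse_item(s, i):
        """A token, or a number if it looks like one."""
        j = i
        while j < len(s) and s[j] in TCHAR:
            j += 1
        word = s[i:j]
        try:
            return (float(word) if '.' in word else int(word)), j
        except ValueError:
            return word, j

    def parse_value(s, i):
        if i < len(s) and s[i] == '<':      # recursion: nested list
            return parse_list(s, i + 1, nested=True)
        return parse_item(s, i)

    def parse_list(s, i=0, nested=False):
        elems = []
        while i < len(s):
            i = skip_ws(s, i)
            name, i = parse_item(s, i)
            params = {}
            while i < len(s) and s[i] == ';':
                key, i = parse_item(s, i + 1)
                if i < len(s) and s[i] == '=':
                    params[key], i = parse_value(s, i + 1)
                else:
                    params[key] = {}        # bare parameter, e.g. ";p4"
            elems.append([name, params])
            i = skip_ws(s, i)
            if nested and i < len(s) and s[i] == '>':
                return elems, i + 1         # hand control back to caller
            if i < len(s) and s[i] == ',':
                i += 1
        return (elems, i) if nested else elems

    def parse_header(value):
        """Strip explicit angle-brackets, then parse."""
        v = value.strip()
        if v.startswith('<') and v.endswith('>'):
            v = v[1:-1]
        return parse_list(v)

    print(parse_header("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar"))

Run on the two-level example above, it produces the definition-list
structure shown in the JSON rendering.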
Next, we add a field to the IANA HTTP header registry (one can do
that, I hope?) classifying each header's "angle-bracket status":

    A) not angle-brackets
       -- incompatible structure, use topical parser
       Range

    B) implicit angle-brackets
       -- Has common structure, but is not <> enclosed
       Accept
       Content-Encoding
       Transfer-Encoding

    C) explicit angle-brackets
       -- Has common structure and is <> enclosed
       all new headers go here

    D) unknown status
       -- As it says on the tin.

Using this as a whitelist, and given suitable schemas, a good number
of existing headers can go into the common parser.

And then for the final trick: We can now define new variants of
existing headers which "sidegrade" them into the common parser:

    Date: < 1469734833 >

This obviously needs a signal/negotiation so we know the other side
can grok them (HTTP2: SETTINGS, HTTP1: TE?)

Next:

Data Types
==========

I think we need these fundamental data types, and subtypes:

    1) Unicode strings
    2) Ascii strings (maybe)
    3) Binary blobs
    4) Tokens
    5) Qualified tokens
    6) Numbers
    7) Integers
    8) Timestamps

In addition to these subtypes, schemas can constrain types further,
for instance integer ranges, string lengths etc.; more on this below.

I will talk about each type in turn, but it goes without saying that
we need to fit them all into RFC723x in a way that is not going to
break anything important, and HPACK should not hate them either. In
HTTP3+ they should be serialized intelligently, but that should be
trivial and I will not cover it here.

1) Unicode strings
------------------

The first question is: do we mean "unrestricted unicode", or do we
want to try to sanitize it?

An example of sanitation is RFC7230's "quoted-string", which bans
control characters except forward horizontal white-space (=TAB).

Another is I-JSON (RFC7493)'s:

    MUST NOT include code points that identify Surrogates or
    Noncharacters as defined by UNICODE.

As far as I can tell, that means you have to keep a full UNICODE
table handy at all times, and update it whenever additions are made
to unicode. Not cool, IMO.

Imposing an RFC7230-like restriction on unicode gets totally rococo:
What does "forward horizontal white-space" mean on a line which uses
both left-to-right and right-to-left alphabets? What does it mean in
alphabets which write vertically?

Let us absolve the parser from such intimate unicode scholarship and
simply say that the data type "unicode string" is what it says, and
use the schemas to sanitize its individual uses.

Encoding unicode strings in HTTP1+2 requires new syntax, and for any
number of reasons I would like to minimize that and {re-|ab-}use
quoted-strings.

RFC7230 does not specify what %80-%FF means in quoted-string, but
hints that it might be ISO8859. Now we want it to become UTF-8.

My proposal at the workshop, to make the first three characters
inside the quotes a UTF-8 BOM, is quite pessimal in HPACK's huffman
encoding: It takes 68 bits. Encoding the BOM as '\ufeff' helps, but
still takes an unreasonable 48 bits in HPACK/huffman. In both H1 and
H2, defining a new "\U" escape seems better.

Since we want to carry unrestricted unicode, we also need escapes to
put the <%20 codepoints back in. I suggest "\u%%%%" like JSON.

(We should not restrict which codepoints may/should use \u%%%% until
we have studied whether \u%%%% may HPACK/huffman better than "raw"
UTF-8 in asian codepages.)
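For illustration, a sketch of the sending side under my reading of
this proposal: raw UTF-8 in a quoted-string, "\u%%%%" escapes to put
the <%20 codepoints back in, and a "\U" prefix when non-ASCII content
is present. The exact escape policy is an assumption (the paragraph
above explicitly leaves it open), and encode_unicode_string is a
made-up name:

    def encode_unicode_string(s):
        body = ""
        for ch in s:
            if ord(ch) < 0x20:
                body += "\\u%04x" % ord(ch)   # put <%20 codepoints back in
            elif ch in '"\\':
                body += "\\" + ch             # ordinary quoted-pair
            else:
                body += ch                    # raw UTF-8 on the wire
        if any(ord(ch) > 0x7e for ch in body):
            body = "\\U" + body               # explicit UTF-8 signal
        return '"' + body + '"'

    print(encode_unicode_string("gr\u00f8d\tmed fl\u00f8de"))
    # -> "\Ugrød\u0009med fløde"  (wire bytes are UTF-8)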
The heuristic for parsing a quoted-string then becomes:

    1) If the quoted-string's first two characters are "\U" -> UTF-8
    2) If the quoted-string contains a "\u%%%%" escape anywhere -> UTF-8
    3) If the quoted-string contains only %09-%7E -> UTF-8 (actually: ASCII)
    4) If the quoted-string contains any %7F-%8F -> UTF-8
    5) If the header definition explicitly says ISO-8859 -> ISO8859
    6) else -> UTF-8

2) Ascii strings
----------------

I'm not sure if we need these, or if they are even a good idea.

The "pro" argument is if we insist they are also english text, so we
have something the entire world stands a chance of understanding. The
"contra" argument is that some people will be upset about that.

If we want them, they're quoted-strings from RFC723x without %7F-%FF.
It is probably better to derive them from unicode strings via schemas.

3) Binary blobs
---------------

Fitting binary blobs from crypto into RFC7230 should squeeze into
quoted-string as well, since we cannot put any kind of markers or
escapes on tokens without breaking things.

Proposal: A quoted-string with "\#" as the first two characters
indicates a base64 encoded binary blob.

I chose "\#" because "#" is not in the base64 set, so if some
nonconforming implementation eliminates the "unnecessary escape", it
will be clearly visible (and likely recoverable), rather than munge
up the content of the base64.

Base64 is chosen because it is the densest well-known encoding which
works well with HPACK/huffman: The b64 characters on average emit
6.46 bits.

I have no idea how these blobs would look when parsed into JSON,
probably as base64? But in languages which can, they should probably
become native byte-strings.

4) Tokens
---------

As we know them from RFC7230:

    tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" /
            "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
    token = 1*tchar

5) Qualified tokens
-------------------

    qualified_token = token 0*1("/" token)

All keys in all dictionaries are of this type. (In JSON/python/...
the keys are strings.) Schemas can restrict this further.

6) Numbers
----------

These are signed decimal numbers which may have a fraction. In
HTTP1+2 we want them always on "%f" format, and we want them to fit
in IEEE754 64-bit floating point, which leads to the following
definition:

    0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT)    ; n+m < 15

(15 digits fit in IEEE754 64-bit binary floating point.)

These numbers can (also) be used for millisecond-resolution absolute
UNIX-epoch-relative timestamps for all foreseeable future.

7) Integers
-----------

    0*1"-" 1*15DIGIT

Same restriction as above, to fit into IEEE754. Range can & should be
restricted by schemas as necessary.

8) Timestamps
-------------

I propose we do these as a subtype of Numbers, as UNIX-epoch-relative
time. That is somewhat human-hostile and leap-second-challenged.

If you know from the schema that a timestamp is coming, the parser
can easily tell the difference between an RFC7231 IMF-fixdate and a
Number-Date. Without guidance from a schema it becomes inefficient to
determine if it is an IMF-fixdate, since the week-day part looks like
a token, but it is not impossible.
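Before moving on to schemas: a sketch of the receiving side,
combining the quoted-string heuristic above with the "\#" blob rule.
classify_quoted_string is a made-up name, and resolving "\u%%%%"
escapes is omitted from the sketch:

    import base64

    def classify_quoted_string(content, schema_says_iso8859=False):
        """content: the bytes between the quotes, quoted-pairs undone."""
        if content.startswith(b'\\#'):                  # binary blob marker
            return 'blob', base64.b64decode(content[2:])
        if content.startswith(b'\\U'):                  # rule 1
            return 'unicode', content[2:].decode('utf-8')
        if b'\\u' in content:                           # rule 2
            return 'unicode', content.decode('utf-8')
        if all(0x09 <= b <= 0x7e for b in content):     # rule 3
            return 'unicode', content.decode('ascii')
        if any(0x7f <= b <= 0x8f for b in content):     # rule 4
            return 'unicode', content.decode('utf-8')
        if schema_says_iso8859:                         # rule 5
            return 'unicode', content.decode('iso-8859-1')
        return 'unicode', content.decode('utf-8')       # rule 6

    print(classify_quoted_string(b'\\#SGVsbG8='))   # -> ('blob', b'Hello')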
Schemas
=======

There needs to be an "ABNF"-parallel to specify what is mandatory and
allowed for these headers in "common structure". Ideally this should
be in a machine-readable format, so that validation tools and parser
code can be produced without (too much) human intervention.

I'm tempted to say we should make the schemas JSON, but then we would
need to write JSON schemas for our schemas :-/

Since schemas basically restrict what you are allowed to express, we
need to examine and think about what restrictions we want to be able
to impose before we design the schema.

This is the least thought-about part of this document, since the
train is now in Lund:

Unicode strings
---------------

* Limit by (UTF-8) encoded length.
  Ie: a resource restriction, not a typographical restriction.

* Limit by codepoints
  Example: Allow only "0-9" and "a-f"
  The specification of codepoints should be a list of codepoint
  ranges. (Ascii strings could be defined this way.)

* Limit by allowed strings
  Ie: Allow only "North", "South", "East" and "West"

Tokens
------

* Limit by codepoints
  Example: Allow only "A-Z"

* Limit by length
  Example: Max 7 characters

* Limit by pattern
  Example: "A-Z" "a-z" "-" "0-9" "0-9"
  (use ABNF to specify?)

* Limit by well-known set
  Example: Token must be an ISO3166-1 country code
  Example: Token must be in IANA FooBar registry

Qualified Tokens
----------------

* Limit each of the two component tokens as above.

Binary Blob
-----------

* Limit by length in bytes
  Example: 128 bytes
  Example: 16-64 or 80 bytes

Number
------

* Limit resolution
  Example: exactly 3 decimal digits

* Limit range
  Example: [2.716 ... 3.1415]

Integer
-------

* Limit range
  Example: [0 ... 65535]

Timestamp
---------

(I can't think of usable restrictions here.)

Aaand... I'm in Copenhagen...

Let me know if any of this looks usable...

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.