Re: JFV and Common Structure specifications from Poul-Henning Kamp on 2016-11-23 (ietf-http-wg@w3.org from October to December 2016)

From: Poul-Henning Kamp <phk@phk.freebsd.dk>
Date: Wed, 23 Nov 2016 20:58:35 +0000
To: Julian Reschke <julian.reschke@gmx.de>
cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>, Patrick McManus <mcmanus@ducksong.com>
Message-ID: <89874.1479934715@critter.freebsd.dk>
--------
In message <b7b1ab21-ca9f-2ba0-4005-779848875470@gmx.de>, Julian Reschke writes:

>> In difference from JSON, there is nothing in CS which cares either way.
>
>How does this work with the hope of using common parsers? Do they need 
>to be configurable with respect to that?

I see it more as a matter of API design, but yes, "blind" parsing
is impossible, because our header definitions are ambiguous and/or
incomplete [1].

Let me try to tackle the general version of this issue, instead
of just the specific one you refer to.

First, I want to make it absolutely clear that when we talk about
"parsers" in this context, we are really talking about deserializers
for a particular serialization of CS.

Right now we only have the HTTP/1 serialization, but we should keep
HPACKng/H3 firmly in the picture, because they are a large part of
the reason to even consider CS in the first place.

We normally see parsers/deserializers as something which produces
an exact and unique output stream, but they don't have to be, and
in the case of CS, because it is synthesized from existing and loose
definitions, they cannot be (see [1]).

If we take a hypothetical HPACKng/H3 serialization it is almost
certain that it could/would communicate type information, because
that is where the money is for improved compression:

	number={2345}, ascii_string={foo}, identifier={bla}

The HTTP/1.1 serialization allows us to derive some type distinctions,
for instance string vs. identifier, but other types we cannot
distinguish, in particular number vs. identifier.
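
To make that limit concrete, here is a minimal Python sketch.  It
assumes, purely for illustration, that strings are double-quoted on
the wire and everything else is a bare token; classify_h1() and the
numeric pattern are hypothetical, not taken from any CS draft.

```python
import re

# Hypothetical notion of "looks like a number" on the wire.
_NUMERIC_LOOKING = re.compile(r"-?[0-9]+(\.[0-9]+)?")

def classify_h1(raw):
    """Return what the HTTP/1 text alone reveals about an element."""
    # Assumption: strings are the only double-quoted elements.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return "string"
    # A bare, numeric-looking token could be either type;
    # nothing in the text itself settles the question.
    if _NUMERIC_LOOKING.fullmatch(raw):
        return "number-or-identifier"
    return "identifier"
```

The middle branch is the interesting one: the deserializer has to
get the answer from somewhere other than the bytes it was handed.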

In practice there are only two ways the H1 deserializer/parser can
find out if {2345} is a number or an identifier:

A)  It can have a schema for the header under consideration,
    which says what it is.

B)  The "schema" instead lives in code which calls the parser,
    using either the gimme_number() or the gimme_identifier()
    entry points.

Given the state of the art in "schemas", you would almost certainly
implement A) as a layer of header-specific subroutines calling B).
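
A rough Python sketch of that layering, under stated assumptions:
the H1Element class, the SCHEMA table and the two "Example-*" header
names are all made up for illustration, and only the two entry-point
names echo the ones mentioned above.

```python
class H1Element:
    """One not-yet-typed element from the HTTP/1 serialization."""

    def __init__(self, raw):
        self.raw = raw

    def gimme_number(self):
        # The caller asserts that this element is a number.
        return int(self.raw, 10)

    def gimme_identifier(self):
        # The caller asserts that this element is an identifier.
        return self.raw

# B) is just "call the entry point you need".
# A) is a per-header table layered on top of B).
SCHEMA = {
    "example-count": H1Element.gimme_number,      # hypothetical header
    "example-name": H1Element.gimme_identifier,   # hypothetical header
}

def parse_header(name, raw):
    """Look up the header's type and call the matching entry point."""
    return SCHEMA[name.lower()](H1Element(raw))
```

With this, parse_header("Example-Count", "2345") yields the integer
2345, while parse_header("Example-Name", "2345") yields the text
"2345" from the very same bytes.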

I don't think this ambiguity is a problem however, because the
conversions between the ambiguous types are trivial.

If you have an identifier and find out you need a number,
you have briefly postponed the unavoidable call to strtoul(3),
and the conversion is safe and unique.
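
As a sketch of the postponed conversion, here is a Python stand-in
for the strtoul(3) call.  The function name is hypothetical, and
restricting the input to ASCII digits is an assumption made here to
keep the result unambiguous.

```python
def identifier_to_number(identifier):
    """Convert an identifier that turned out to be a number."""
    # Only plain ASCII digits are accepted, so every accepted
    # identifier maps to exactly one non-negative integer.
    if not (identifier.isascii() and identifier.isdigit()):
        raise ValueError("not a number: %r" % (identifier,))
    return int(identifier, 10)
```

Note that the conversion is one-way: "007" and "7" both become 7, so
the original spelling cannot be recovered from the number.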

In the future we will have more exciting corner cases: the H3
deserializer hands over number={0.3} but you want an identifier.

When we define that H3 serialization, and the H1->H3 inter-op rules,
we will have to ensure that an HTTP/1.1 identifier {000.003} doesn't
arrive as {0.3} at the other end.  This is no big deal: the worst case
is that numbers which are not in canonical format are transmitted as
identifiers, with lower compression efficiency.
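
One way such an inter-op rule could look, sketched in Python.  No
canonical number format has been defined anywhere yet, so the pattern
below (no redundant leading or trailing zeros) is only an assumption,
and h3_type_for() is a hypothetical name.

```python
import re

# Assumed canonical form: no leading zeros in the integer part,
# no trailing zeros in the fraction.
_CANONICAL_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]*[1-9])?")

def h3_type_for(token):
    """Pick the type under which an HTTP/1.1 token is forwarded."""
    if _CANONICAL_NUMBER.fullmatch(token):
        # Text -> number -> text would reproduce the same bytes.
        return "number"
    # Anything else keeps its exact spelling as an identifier,
    # at the cost of weaker compression.
    return "identifier"
```

Under this rule {0.3} travels as a number, while {000.003} travels
as an identifier and arrives unchanged.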

And so, to get back to the answer to your question:

I don't know of any HTTP headers today which allow duplicate indices
in their dictionaries.

But I also don't know anything which prevents them from existing
outside my bubble or from being defined by this WG in the future.

As for the question of deep structure/recursion, I would prefer to
have CS be general and let the individual HTTP headers impose the
limitations on what they allow, and I strongly support banning
"official" HTTP headers from using both deep structure and duplicate
dictionary indices.
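
A small Python sketch of that split between a general CS layer and
per-header restrictions.  Everything here is an assumption made for
illustration: a dictionary is modelled as a list of (index, value)
pairs so that duplicates survive deserialization, nesting is a value
which is itself such a list, and the default depth limit of 1 is
arbitrary.

```python
def has_duplicate_indices(pairs):
    """True if any index occurs twice at the top level."""
    indices = [index for index, _ in pairs]
    return len(indices) != len(set(indices))

def nesting_depth(value):
    """Scalars have depth 0; each list of pairs adds one level."""
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for _, v in value), default=0)
    return 0

def acceptable_for_official_header(pairs, max_depth=1):
    """Header-level policy applied after the general CS layer."""
    return (not has_duplicate_indices(pairs)
            and nesting_depth(pairs) <= max_depth)
```

The general layer hands over whatever was sent; the decision to
reject it belongs to the individual header definition.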

However, if the sense of the WG is that "Postel Was Wrong", and
that we should tighten bolts and nuts because he was, CS should
lose both the recursion and unicode_string, both of which are
forward-looking speculative extensions of the vocabulary over the
set of RFC723x headers from which CS is synthesized.

But if that is the case, should we not also deprecate RFC5987 instead
of bis'ing it?

Poul-Henning


[1]  Let me show you one rabbit-hole I passed through along the way:

We probably all tend to read and understand ABNF top down, but look
at RFC723x bottom up, and you find:

	tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
	    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA

	token = 1*tchar

	type = token

	subtype = token

	media-type = type "/" subtype *( OWS ";" OWS parameter )

	Content-Type = media-type

Which means that as far as RFC723x is concerned, these are both OK:

	Content-Type: application/666

	Content-Type: -3.14159/...---...

It would be tempting to assume that if we see the '/' then the two
tokens surrounding it are not numbers, but there is no foundation
to build that assumption on.
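
The grammar above can be checked mechanically.  This Python sketch
transcribes tchar and token into a regular expression and covers only
the type "/" subtype part of media-type (parameters are left out);
the function name is hypothetical.

```python
import re

# tchar, transcribed from the ABNF above.
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"
TOKEN = TCHAR + "+"                      # token = 1*tchar

# type "/" subtype, without the optional parameters.
MEDIA_TYPE = re.compile("(%s)/(%s)" % (TOKEN, TOKEN))

def is_media_type(value):
    """True if value matches type "/" subtype as far as the ABNF goes."""
    return MEDIA_TYPE.fullmatch(value) is not None
```

Both is_media_type("application/666") and
is_media_type("-3.14159/...---...") come out true, so the grammar
alone gives a parser no reason to rule out numbers on either side
of the '/'.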

RFC2046 enumerates seven "initial" types, but doesn't say anything
about the shape of the namespace.

The IANA registry has no guidance about it either and "Ask the
IANA Expert Reviewers" is not a valid parser algorithm.

Nothing anywhere says that the next media type being defined cannot
have the subtype defined as "a number", with the implied or explicit
understanding that 'foo/3' sorts before 'foo/10'.
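
Why the distinction would matter in that case, as a short Python
sketch; the 'foo' media type is the same made-up one as above, and
numeric_key() is a hypothetical helper.

```python
media_types = ["foo/10", "foo/3"]

# Treated as identifiers, the subtypes compare character by character.
as_identifiers = sorted(media_types)

def numeric_key(media_type):
    """Sort key which treats the subtype as a number."""
    type_, subtype = media_type.split("/", 1)
    return (type_, int(subtype))

# Treated as numbers, the subtypes compare by value.
as_numbers = sorted(media_types, key=numeric_key)
```

Here as_identifiers is ['foo/10', 'foo/3'] while as_numbers is
['foo/3', 'foo/10'], so the two readings of the same bytes disagree.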

[2] 

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.
Received on Wednesday, 23 November 2016 20:59:07 UTC