- From: Poul-Henning Kamp <phk@phk.freebsd.dk>
- Date: Wed, 23 Nov 2016 20:58:35 +0000
- To: Julian Reschke <julian.reschke@gmx.de>
- cc: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>, Patrick McManus <mcmanus@ducksong.com>
--------
In message <b7b1ab21-ca9f-2ba0-4005-779848875470@gmx.de>, Julian Reschke writes:

>> In difference from JSON, there is nothing in CS which cares either way.
>
> How does this work with the hope of using common parsers? Do they need
> to be configurable with respect to that?

I see it more as a matter of API design, but yes, "blind" parsing is
impossible, because our header definitions are ambiguous and/or
incomplete [1].

Let me try to tackle the general version of this issue, instead of
just the specific one you refer to.

First, I want to make it absolutely clear that when we talk about
"parsers" in this context, we are really talking about deserializers
for a particular serialization of CS.  Right now we only have the
HTTP/1 serialization, but we should keep HPACKng/H3 firmly in the
picture, because they are a large part of the reason to even consider
CS in the first place.

We normally see parsers/deserializers as something which produces an
exact and unique output stream, but they don't have to, and in the
case of CS, because it is synthesized from existing and loose
definitions, they cannot (see [1]).

If we take a hypothetical HPACKng/H3 serialization, it is almost
certain that it could/would communicate type information, because
that is where the money is for improved compression:

	number={2345}, ascii_string={foo}, identifier={bla}

The HTTP/1.1 serialization allows us to derive some type distinctions,
for instance string vs. identifier, but other types we cannot
distinguish, in particular number vs. identifier.

In practice there are only two ways the H1 deserializer/parser can
find out if {2345} is a number or an identifier:

A) It can have a schema for the header under consideration, which
   says what it is.

B) The "schema" instead lives in code which calls the parser, using
   either the gimme_number() or the gimme_identifier() entry points.
Given the state of the art in "schemas", you would almost certainly
implement A) as a layer of header-specific subroutines calling B).

I don't think this ambiguity is a problem, however, because the
conversions between the ambiguous types are trivial.  If you have an
identifier and find out you need a number, you have briefly postponed
the unavoidable call to strtoul(3), and the conversion is safe and
unique.

In the future we will have more exciting corner cases:  the H3
deserializer hands over number={0.3} but you want an identifier.
When we define that H3 serialization, and the H1->H3 inter-op rules,
we will have to ensure that a HTTP/1.1 identifier {000.003} doesn't
arrive as {0.3} at the other end.  This is no big deal; the worst
case is numbers which are not in canonical format being transmitted
as identifiers, with lower compression efficiency.

And so, to get back to the answer to your question:  I don't know of
any HTTP headers today which allow duplicate indices in their
dictionaries.  But I also don't know anything which prevents them
from existing outside my bubble, or from being defined by this WG in
the future.

As for the question of deep structure/recursion, I would prefer to
have CS be general, and let the individual HTTP headers impose the
limitations on what they allow, and I strongly support that "official"
HTTP headers are banned from using both deep structure and duplicate
dictionary indices.

However, if the sense of the WG is that "Postel Was Wrong", and that
we should tighten bolts and nuts because he was, CS should lose both
the recursion and unicode_string, both of which are forward-looking
speculative extensions of the vocabulary over the set of RFC723x
headers from which CS is synthesized.

But if that is the case, we should also deprecate RFC5987 instead of
bis'ing it?
Poul-Henning

[1] Let me show you one rabbit-hole I passed through along the way:

We probably all tend to read and understand ABNF top down, but look
at RFC723x bottom up, and you find:

	tchar        = "!" / "#" / "$" / "%" / "&" / "'" / "*"
	             / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
	             / DIGIT / ALPHA
	token        = 1*tchar
	type         = token
	subtype      = token
	media-type   = type "/" subtype *( OWS ";" OWS parameter )
	Content-Type = media-type

Which means that as far as RFC723x is concerned, these are both OK:

	Content-Type: application/666
	Content-Type: -3.14159/...---...

It would be tempting to assume that if we see the '/' then the two
tokens surrounding it are not numbers, but there is no foundation to
build that assumption on.

RFC2046 enumerates seven "initial" types, but doesn't say anything
about the shape of the namespace.  The IANA registry has no guidance
about it either, and "Ask the IANA Expert Reviewers" is not a valid
parser algorithm.

Nothing anywhere says that the next media type being defined cannot
have the subtype defined as "a number", with the implied or explicit
understanding that 'foo/3' sorts before 'foo/10'. [2]

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Received on Wednesday, 23 November 2016 20:59:07 UTC