- From: Sampo Syreeni <decoy@iki.fi>
- Date: Sat, 1 Jul 2017 14:11:11 +0300 (EEST)
- To: Jörn Hees <j_hees@cs.uni-kl.de>
- cc: SW-forum Web <semantic-web@w3.org>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
- Message-ID: <alpine.DEB.2.11.1707011139040.6254@lakka.kapsi.fi>
On 2017-06-29, Jörn Hees wrote:
> Canonical N-Triples for example allow you to count predicates this
> easy:
>
> cut -d' ' -f2 < input.nt | sort | uniq -c
Even if N-Triples without spaces can be cut apart easily enough, it is
a huge hassle always to have to be prepared for the eventuality.
Especially in a syntax meant to simplify things from the very start.
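To make the hassle concrete, here is a minimal sketch (the IRIs are
made up for illustration) of how the field-splitting approach behaves
on the spaced and space-free serializations of the same triple:

```python
# Minimal sketch: why field-splitting on spaces is fragile for
# N-Triples. The spaced form yields the predicate as field 2; the
# space-free form does not split at all.

spaced = '<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .'
unspaced = '<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'

def predicate_by_cut(line):
    """Mimic `cut -d' ' -f2`: take the second space-delimited field."""
    fields = line.split(' ')
    return fields[1] if len(fields) > 1 else None

print(predicate_by_cut(spaced))    # <http://ex.org/p>
print(predicate_by_cut(unspaced))  # None: no spaces, nothing to cut
```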
I'd also like to reminisce about a little project of mine
(unfortunately never finished). I once took it upon myself to write a
parser for the PICS vocabulary, in its pre-RDF form. I actually did
finish a do-nothing parser, using the then-current version of GNU
flex. Nothing great here, because it really did nothing but parse, but
I did learn a couple of things on the way.
First, the vocabulary suffered from the very same problem with spaces.
Parsing-wise the problem can be benign, as it was with PICS, and
because of N-Triples' even more regular syntax, most likely it's going
to be benign there as well.
Formally, such problems tend to be caused by defining a syntax one step
too shallow in (A)BNF. What should be terminals only often prove to be
de facto abstract non-terminals, so that the precise usage of white
space in particular is left unspecified. Typically the languages we
want to specify can be easily and naturally augmented with precise
space conventions, as was the case with my fling with PICS. Mostly you
just include an \s* (in extended regex notation) in the syntax
preceding and following "the usual suspects" at the
before-transformation terminal level. Then left-normalize the grammar,
and be done with it. It almost always works.
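A hypothetical sketch of the \s*-augmentation trick, for a toy syntax
of identifiers in a parenthesized list (the alphabet and productions
are my own invention, not from any spec): the punctuation terminals get
optional white space on both sides, while the identifier characters
themselves are left alone.

```python
import re

# Toy grammar: id ::= alpha (alpha | digit)*, id-list ::= "(" id* ")".
ID = r'[A-Za-z][A-Za-z0-9]*'
# "(" and ")" get \s* on both sides; ids stay space-separated inside.
ID_LIST = rf'\s*\(\s*(?:{ID}(?:\s+{ID})*)?\s*\)\s*'

print(bool(re.fullmatch(ID_LIST, '( a b2 c )')))  # True
print(bool(re.fullmatch(ID_LIST, '(a b2 c)')))    # True
print(bool(re.fullmatch(ID_LIST, '(ab2c)')))      # True, but as ONE id
```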
The operative word being almost. In some cases you can't get away with
such a straightforward transformation. One of the most commonly used,
and as such perilous, productions is the one defining an identifier,
and its use in a delimited list:
id ::= alpha (alpha | digit)*
id-list ::= "(" id* ")"
If you want to put in those optional spaces, you'll have to know that
you shouldn't be putting them around the alpha and digit parts of id.
But you should at the same time be putting them around the parenthesis
literals and around any full invocation of id, so that the list of ids
stays space-separated.
Except that the definition of the list is already ambiguous. The
grammar accepts the list body "aa", but it can be parsed in two
separate ways: as one id consisting of two a's in a row, or as one id
"a" followed by another identical id.
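A little enumerator (my own sketch, not from any spec) makes the
ambiguity concrete: with no separator required between list elements,
the body "aa" admits two distinct parses.

```python
# Enumerate the ways to split `body` into ids of the form
# alpha (alpha | digit)*, with no separator required between them.

def parses(body):
    if body == '':
        return [[]]
    out = []
    for i in range(1, len(body) + 1):
        head = body[:i]
        if head[0].isalpha() and head.isalnum():
            out += [[head] + rest for rest in parses(body[i:])]
    return out

print(parses('aa'))  # [['a', 'a'], ['aa']] -- two distinct parses
```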
If we introduce the idea that every left-hand side of a production
sorta-kinda introduces implicit bracketing of the right-hand side by
optional white space, we sorta-kinda get what we meant to say from the
start. At least as long as we keep our grammar in some sense minimal
and concept-like, as we often do in the grammars we use in standards
work. You can't really make that idea crisp and formal, but "everybody
already knows what we mean by it, and goes along with the ploy": my id
above is a whole concept which admits optional, implicit white space on
both edges of its right-hand side, as is id-list, and if we apply that
idea even formally by adding the prerequisite \s* (in regexp notation)
on both sides and then normalize from the regular right-hand sides to
plain BNF, we get the kind of grammar we wanted to have in the first
place.
It's just that we never bothered to spell our little trick out loud,
and the trick can't really be generalized; in particular, it doesn't
survive even the simplest of manipulations formal grammar theory
subjects its sets of productions to. You suddenly can't even transform
id ::= alpha (alpha | digit)*
into
id ::= alpha \1
\1 ::= (alpha | digit)*
because, expanded, the first one would then become
id ::= \s* alpha (alpha | digit)* \s*
and the latter one
id ::= \s* alpha \s* (alpha | digit)* \s*
which is clearly different.
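In regex terms, under the same toy alphabet assumptions as above, the
difference is easy to exhibit: the second form accepts internal space
between the leading alpha and the tail, the first does not.

```python
import re

# The two expanded forms from above, in Python regex notation.
r1 = re.compile(r'\s*[A-Za-z][A-Za-z0-9]*\s*')      # first expansion
r2 = re.compile(r'\s*[A-Za-z]\s*[A-Za-z0-9]*\s*')   # second expansion

s = 'a 1'
print(bool(r1.fullmatch(s)))  # False
print(bool(r2.fullmatch(s)))  # True: the extra \s* lets space leak in
```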
Now, there is ample precedent in the compiler literature for doing
something like this. This isn't just idle speculation, nor are its
sequelae. Namely, the whole, classical idea of using a lower level, more
efficient (?) lexer (recognizing at most regular languages, eagerly) in
conjunction with a higher level, more involved parser (recognizing a
context free language), leads to just this kind of thinking. The higher
level parser language is there thought to define units, easily amenable
to free space on either side, and the lower level lexer is usually
constructed so as to just eat white space before returning a lexeme.
Often in fact even to rely on white space separation of lexemes,
which by definition the parser then never sees. The precise interaction
between the parser and the lexer, especially with regard to white space,
is actually left formally unspecified -- just like in my above example
the idea of implicit bracketing didn't really lead to sane, consistent
derivational rules within a given context free grammar.
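A toy sketch of that classical split, assuming a hypothetical
mini-language of identifiers and parentheses: the lexer eagerly skips
white space before each lexeme, so the parser layer above it never
sees spaces at all, and with them goes the separation.

```python
import re

# One lexeme = optional leading white space, then an id or a paren.
TOKEN = re.compile(r'\s*([A-Za-z][A-Za-z0-9]*|[()])')

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            if text[pos:].strip() == '':
                break  # trailing white space only
            raise SyntaxError(f'bad input at {pos}')
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(lex('( a  b2 )'))  # ['(', 'a', 'b2', ')'] -- spaces are gone
print(lex('(ab2)'))      # ['(', 'ab2', ')'] -- one id, not two
```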
So, that overall picture has tended to leave quite a number of formal
languages out there underspecified. Or at least weirdly specified, by
implementations not really formulated in any of the accepted grammar
formalisms. Not too amenable to formal, unambiguous specification in
even ABNF.
Here I finally get back to my little piece of flex work with the early
PICS vocabulary (I didn't know how to use a second-level parser like
bison back then, so I actually did the recursive work within flex
itself). Not only did the W3C spec not really specify how white space
should be handled, the reduced-to-character-level grammar actually
managed to contain *precisely* one thing not amenable to flex's
LALR(1) framework, plus what I did by hand. I don't remember exactly
how many characters of lookahead it eventually needed for the grammar
to be disambiguated, but it was 5, 6 or 7; no more, no less. In an
error-reporting branch of the thing that was sure never to be used,
which yielded a conflict with one of the most used parsing paths.
That is a 100% certain signal nobody bothered to implement even the most
basic parser for the grammar before setting the latter in W3C sanctioned
stone. It's also something which should *never*, *ever* happen, because
it means that you're not perfectly sure how your code should interpret
its input. An uncertainty about what to accept as input.
Just think of the security implications. Rev back to my simplistic
examples of ambiguity above. Even such simple things as a couple of
spaces here and there can and do wreak havoc when used by a malicious
attacker. It only takes a poorly specified language, leading to a poorly
specified parser, which then accepts an empty space separated pair of
identifiers as one full identifier. Or accepts a quadrillion spaces
between them, for a nicely exploitable buffer overrun. Or my favourite,
runs into an ambiguity in a grammar like PICS, which seems like it
shouldn't have any, so that suddenly a well optimized parser which only
looks one character ahead, could mistake an error report for more
typical content, or vice versa, or altogether lose track of its internal
state, leading to who knows what kind of attack surface as a result.
Long story short, spaces and their modelling in grammars do matter. The
precise minutiae of grammar formalisms matter. I'd just *hate* if
something like N-Triples, meant to simplify N3 and to serve as an
all-round, easiest-possible textual transfer syntax for the Graph
Model, stumbled into something as trivial as this. Because it could,
and it'd be beyond
horrible for anyone even superficially interested in making the Web
Semantic, because there we have the simplest of simple problems, already
well-understood, yet ostensibly overlooked till now. If we can't do even
that right, what hope is there to do anything more interesting/less
trivial?
> One could obviously split at '<' or '>' and depending on BNode take
> the first or second of those, oh and objects become even more fun...
Yes. Depending on your interpretation, you can do whatever you happen to
like. There isn't even an overarching generalization which everybody
agrees upon, nor some minimal intersection. There's ambiguity; the spice
of bad.
> So sorry, but i don't think that is "still easy".
As per the above, it should be easy as peaches. Whoever wrote the spec
should have cast it in a foolproof formal grammar, hardened by
experience and actual implementation. Not prose involving white space
and separations and whatnot, but something translatable to an
automaton either accepting a certain string of octets (themselves well
defined by ISO)...or not.
If you were *really* uptight about it, you would preëmptively implement
both a canonical parser, in some well-known parser framework, and a
directable fuzzer as its dual, so as to derive all the examples and
counter-examples ever needed in discussions like these.
Seems like someone wasn't uptight enough.
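What such a parser-and-fuzzer pair could look like, sketched under toy
assumptions for the little id-list grammar from earlier in this mail
(the generator and the reference recognizer are both my own
illustrations): derive random strings from the grammar, and check that
a reference recognizer accepts every one of them.

```python
import random
import re

ALPHA, DIGIT = 'abc', '01'

def gen_id(rng):
    # id ::= alpha (alpha | digit)*
    return rng.choice(ALPHA) + ''.join(
        rng.choice(ALPHA + DIGIT) for _ in range(rng.randrange(3)))

def gen_id_list(rng):
    # Space-separated ids between literal parens.
    return '(' + ' '.join(gen_id(rng)
                          for _ in range(rng.randrange(4))) + ')'

# The canonical recognizer, as a regex over the same toy alphabet.
REF = re.compile(r'\((?:[a-c][a-c01]*(?: [a-c][a-c01]*)*)?\)')

rng = random.Random(0)
samples = [gen_id_list(rng) for _ in range(1000)]
print(all(REF.fullmatch(s) for s in samples))  # True: they agree
```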
>> It might have been better at the beginning to require white space
>> after the subject, predicate, and object of a triple in N-Triples,
>> but given that that wasn't required I don't see that the costs of
>> requiring them now are worth any minor benefits in human readability
>> that might ensue.
That ain't the problem at all. The problem is that the vocabulary
suddenly isn't well defined for machine consumption. We suddenly don't
know whether the with-spaces and without-spaces versions of the same
precise N-Triples document must both be accepted by conforming
processors; i.e. we suddenly have two different views of what Truth
is, based on how separate people with their separate parsers happen to
want to see the truth.
>> Note that nothing (except the badly written grammar for N-Triples)
>> prevents tools from putting single spaces after subjects, predicates,
>> and objects.
>
> I agree with the cost consideration, but i never actually saw nt/ttl
> serialized without whitespace. Can you point me to a serializer that
> does this?
I'd argue the problem is never with the serializer. It's always with
how the deserializer interprets its data, and how that affects what is
accepted as part of the Ground Truth represented by the data ingested
by any and all software systems. If your deserializer deems N-Triples
data without spaces to be malformed, then as far as your whole
software framework sees the world, anything serialized in the
without-spaces format suddenly became deasserted.
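A hypothetical illustration of two deserializers disagreeing on the
very same statement (the IRIs and the two regex recognizers are
invented for the example): one requires spaces between the terms, one
treats them as optional.

```python
import re

IRI = r'<[^\s<>]+>'
strict = re.compile(rf'{IRI} {IRI} {IRI} \.')        # mandatory spaces
lax = re.compile(rf'{IRI}\s*{IRI}\s*{IRI}\s*\.')     # optional spaces

doc = '<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'

print(bool(strict.fullmatch(doc)))  # False: the triple is "deasserted"
print(bool(lax.fullmatch(doc)))     # True: the triple is in the graph
```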
> IMO this whole discussion is mostly a theoretical one, doesn't really
> lead us anywhere and could easily be concluded: [...]
I'd argue to the contrary, at least on security grounds.
> [...] I wouldn't update the spec prohibiting parsing no whitespace nt,
> as this would be backwards incompatible.
I'd update the spec, allowing white space. That's what errata are for,
and the result would be backwards compatible. At least
intuition-compatible, whatever that means.
> I'd however slightly update the grammar prelude to more explicitly
> allow (and encourage) whitespace after each of the terms (even if not
> totally necessary) by replacing the following sentence: [...]
I'd rewrite the whole thing to actually make it at the same time both
reflect what was meant, as it does now, and to also be (preferably
obviously) part of a well-known formal grammar family.
--
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-40-3255353, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Saturday, 1 July 2017 11:11:45 UTC