- From: Sampo Syreeni <decoy@iki.fi>
- Date: Sat, 1 Jul 2017 14:11:11 +0300 (EEST)
- To: Jörn Hees <j_hees@cs.uni-kl.de>
- cc: SW-forum Web <semantic-web@w3.org>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
- Message-ID: <alpine.DEB.2.11.1707011139040.6254@lakka.kapsi.fi>
On 2017-06-29, Jörn Hees wrote:
> Canonical N-Triples for example allow you to count predicates this
> easy:
>
> cut -d' ' -f2 < input.nt | sort | uniq -c
Even if N-Triples without spaces can be cut apart easily enough, it is
a huge hassle always to have to be prepared for the eventuality.
Especially in a syntax meant to simplify things from the very start.
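To make the hassle concrete, here is a minimal sketch (the IRIs are
made up for illustration) of how the field-splitting approach behaves
on the spaced and space-free serializations of the same triple:

```python
# Minimal sketch: why field-splitting on spaces is fragile for
# N-Triples. The spaced form yields the predicate as field 2; the
# space-free form does not split at all.

spaced = '<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .'
unspaced = '<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'

def predicate_by_cut(line):
    """Mimic `cut -d' ' -f2`: take the second space-delimited field."""
    fields = line.split(' ')
    return fields[1] if len(fields) > 1 else None

print(predicate_by_cut(spaced))    # <http://ex.org/p>
print(predicate_by_cut(unspaced))  # None: no spaces, nothing to cut
```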
I'd also like to reminisce about a little project of mine
(unfortunately never finished). I once took it upon myself to write a
parser for the PICS vocabulary, in its pre-RDF form. I actually did
finish a do-nothing parser, using the then-current version of GNU
flex. Nothing great here, because it really did nothing but parse, but
I did learn a couple of things on the way.
First, the vocabulary suffered from the very same problem with spaces.
Parsing-wise the problem can be benign, as it was with PICS, and
because of N-Triples' even more regular syntax, most likely it's going
to be benign there as well.
Formally, such problems tend to be caused by defining a syntax one step
too shallow in (A)BNF. What should be terminals only often prove to be
de facto abstract non-terminals, so that the precise usage of white
space in particular is left unspecified. Typically the languages we
want to specify can be easily and naturally augmented with precise
space conventions, as was the case with my fling with PICS. Mostly you
just include an \s* (in extended regex notation) in the syntax
preceding and following "the usual suspects" at the
before-transformation terminal level. Then left-normalize the grammar,
and be done with it. It almost always works.
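A hypothetical sketch of the \s*-augmentation trick, for a toy syntax
of identifiers in a parenthesized list (the alphabet and productions
are my own invention, not from any spec): the punctuation terminals get
optional white space on both sides, while the identifier characters
themselves are left alone.

```python
import re

# Toy grammar: id ::= alpha (alpha | digit)*, id-list ::= "(" id* ")".
ID = r'[A-Za-z][A-Za-z0-9]*'
# "(" and ")" get \s* on both sides; ids stay space-separated inside.
ID_LIST = rf'\s*\(\s*(?:{ID}(?:\s+{ID})*)?\s*\)\s*'

print(bool(re.fullmatch(ID_LIST, '( a b2 c )')))  # True
print(bool(re.fullmatch(ID_LIST, '(a b2 c)')))    # True
print(bool(re.fullmatch(ID_LIST, '(ab2c)')))      # True, but as ONE id
```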
The operative word being almost. In some cases you can't get away with
such a straightforward transformation. One of the most commonly used,
and as such perilous, productions is the one defining an identifier,
and its use in a delimited list:
id ::= alpha (alpha | digit)*
id-list ::= "(" id* ")"
If you want to put in those optional spaces, you'll have to know that
you shouldn't be putting them around the alpha and digit parts of id.
But you should at the same time be putting them around the parenthesis
literals and around any full invocation of id, so that the list of ids
stays space-separated.
Except that the definition of the list is already ambiguous. The
grammar accepts the list body "aa", but it can be parsed in two
separate ways: as one id consisting of two a's in a row, or as one id
"a" followed by another identical id.
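A little enumerator (my own sketch, not from any spec) makes the
ambiguity concrete: with no separator required between list elements,
the body "aa" admits two distinct parses.

```python
# Enumerate the ways to split `body` into ids of the form
# alpha (alpha | digit)*, with no separator required between them.

def parses(body):
    if body == '':
        return [[]]
    out = []
    for i in range(1, len(body) + 1):
        head = body[:i]
        if head[0].isalpha() and head.isalnum():
            out += [[head] + rest for rest in parses(body[i:])]
    return out

print(parses('aa'))  # [['a', 'a'], ['aa']] -- two distinct parses
```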
If we introduce the idea that every left-hand side of a production
sorta-kinda introduces implicit bracketing of the right-hand side by
optional white space, we sorta-kinda get what we meant to say from the
start. At least as long as we keep our grammar in some sense minimal
and concept-like, as we often do in the grammars we use in standards
work. You can't really make that idea crisp and formal, but "everybody
already knows what we mean by it, and goes along with the ploy": my id
above is a whole concept which admits optional, implicit white space on
both edges of its right-hand side, as is id-list, and if we apply that
idea even formally by adding the prerequisite \s* (in regexp notation)
on both sides and then normalize from the regular right-hand sides to
plain BNF, we get the kind of grammar we wanted to have in the first
place.
It's just that we never bothered to spell our little trick out loud,
and the trick can't really be generalized; in particular, it doesn't
survive even the simplest of manipulations formal grammar theory
subjects its sets of productions to. You suddenly can't even transform
id ::= alpha (alpha | digit)*
into
id ::= alpha \1
\1 ::= (alpha | digit)*
because, expanded, the first one would then become
id ::= \s* alpha (alpha | digit)* \s*
and the latter one
id ::= \s* alpha \s* (alpha | digit)* \s*
which is clearly different.
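In regex terms, under the same toy alphabet assumptions as above, the
difference is easy to exhibit: the second form accepts internal space
between the leading alpha and the tail, the first does not.

```python
import re

# The two expanded forms from above, in Python regex notation.
r1 = re.compile(r'\s*[A-Za-z][A-Za-z0-9]*\s*')      # first expansion
r2 = re.compile(r'\s*[A-Za-z]\s*[A-Za-z0-9]*\s*')   # second expansion

s = 'a 1'
print(bool(r1.fullmatch(s)))  # False
print(bool(r2.fullmatch(s)))  # True: the extra \s* lets space leak in
```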
Now, there is ample precedent in the compiler literature for doing
something like this. This isn't just idle speculation, nor are its
sequelae. Namely, the whole, classical idea of using a lower level, more
efficient (?) lexer (recognizing at most regular languages, eagerly) in
conjunction with a higher level, more involved parser (recognizing a
context free language), leads to just this kind of thinking. The higher
level parser language is there thought to define units, easily amenable
to free space on either side, and the lower level lexer is usually
constructed so as to just eat white space before returning a lexeme.
Often in fact even to rely on white space separation of lexemes,
which by definition the parser then never sees. The precise interaction
between the parser and the lexer, especially with regard to white space,
is actually left formally unspecified -- just like in my above example
the idea of implicit bracketing didn't really lead to sane, consistent
derivational rules within a given context free grammar.
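A toy sketch of that classical split, assuming a hypothetical
mini-language of identifiers and parentheses: the lexer eagerly skips
white space before each lexeme, so the parser layer above it never
sees spaces at all, and with them goes the separation.

```python
import re

# One lexeme = optional leading white space, then an id or a paren.
TOKEN = re.compile(r'\s*([A-Za-z][A-Za-z0-9]*|[()])')

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            if text[pos:].strip() == '':
                break  # trailing white space only
            raise SyntaxError(f'bad input at {pos}')
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(lex('( a  b2 )'))  # ['(', 'a', 'b2', ')'] -- spaces are gone
print(lex('(ab2)'))      # ['(', 'ab2', ')'] -- one id, not two
```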
So, that overall picture has tended to leave quite a number of formal
languages out there underspecified. Or at least weirdly specified, by
implementations not really formulated in any of the accepted grammar
formalisms. Not too amenable to formal, unambiguous specification in
even ABNF.
Here I finally get back to my little piece of flex work with the early
PICS vocabulary (I didn't know how to use a second-level parser like
bison back then, so I actually did the recursive work within flex
itself). Not only did the W3C spec not really specify how white space
should be handled, the reduced-to-character-level grammar actually
managed to contain *precisely* one thing not amenable to flex's
LALR(1) framework, plus what I did by hand. I don't remember exactly
how many characters of lookahead it eventually needed for the grammar
to be disambiguated, but it was 5, 6 or 7; no more, no less. In an
error-reporting branch of the thing that was sure never to be used,
which yielded a conflict with one of the most used parsing paths.
That is a 100% certain signal nobody bothered to implement even the most
basic parser for the grammar before setting the latter in W3C sanctioned
stone. It's also something which should *never*, *ever* happen, because
it means that you're not perfectly sure how your code should interpret
its input. An uncertainty about what to accept as input.
Just think of the security implications. Rev back to my simplistic
examples of ambiguity above. Even such simple things as a couple of
spaces here and there can and do wreak havoc when used by a malicious
attacker. It only takes a poorly specified language, leading to a poorly
specified parser, which then accepts an empty space separated pair of
identifiers as one full identifier. Or accepts a quadrillion spaces
between them, for a nicely exploitable buffer overrun. Or my favourite,
runs into an ambiguity in a grammar like PICS, which seems like it
shouldn't have any, so that suddenly a well optimized parser which only
looks one character ahead, could mistake an error report for more
typical content, or vice versa, or altogether lose track of its internal
state, leading to who knows what kind of attack surface as a result.
Long story short, spaces and their modelling in grammars do matter. The
precise minutiae of grammar formalisms matter. I'd just *hate* if
something like N-Triples, meant to simplify N3 and to serve as an
all-round, easiest-possible textual transfer syntax for the Graph
Model, stumbled into something as trivial as this. Because it could,
and it'd be beyond
horrible for anyone even superficially interested in making the Web
Semantic, because there we have the simplest of simple problems, already
well-understood, yet ostensibly overlooked till now. If we can't do even
that right, what hope is there to do anything more interesting/less
trivial?
> One could obviously split at '<' or '>' and depending on BNode take
> the first or second of those, oh and objects become even more fun...
Yes. Depending on your interpretation, you can do whatever you happen to
like. There isn't even an overarching generalization which everybody
agrees upon, nor some minimal intersection. There's ambiguity; the spice
of bad.
> So sorry, but i don't think that is "still easy".
As per the above, it should be easy as peaches. Whoever wrote the spec
should have cast it in a foolproof formal grammar, hardened by
experience and actual implementation. Not prose involving white space
and separations and whatnot, but something translatable to an
automaton either accepting a certain string of octets (themselves well
defined by ISO)...or not.
If you were *really* uptight about it, you would preëmptively implement
both a canonical parser, in some well-known parser framework, and a
directable fuzzer as its dual, so as to derive all the examples and
counter-examples ever needed in discussions like these.
Seems like someone wasn't uptight enough.
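What such a parser-and-fuzzer pair could look like, sketched under toy
assumptions for the little id-list grammar from earlier in this mail
(the generator and the reference recognizer are both my own
illustrations): derive random strings from the grammar, and check that
a reference recognizer accepts every one of them.

```python
import random
import re

ALPHA, DIGIT = 'abc', '01'

def gen_id(rng):
    # id ::= alpha (alpha | digit)*
    return rng.choice(ALPHA) + ''.join(
        rng.choice(ALPHA + DIGIT) for _ in range(rng.randrange(3)))

def gen_id_list(rng):
    # Space-separated ids between literal parens.
    return '(' + ' '.join(gen_id(rng)
                          for _ in range(rng.randrange(4))) + ')'

# The canonical recognizer, as a regex over the same toy alphabet.
REF = re.compile(r'\((?:[a-c][a-c01]*(?: [a-c][a-c01]*)*)?\)')

rng = random.Random(0)
samples = [gen_id_list(rng) for _ in range(1000)]
print(all(REF.fullmatch(s) for s in samples))  # True: they agree
```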
>> It might have been better at the beginning to require white space
>> after the subject, predicate, and object of a triple in N-Triples,
>> but given that that wasn't required I don't see that the costs of
>> requiring them now are worth any minor benefits in human readability
>> that might ensue.
That ain't the problem at all. The problem is that the vocabulary
suddenly isn't well defined for machine consumption. We suddenly don't
know whether the with-spaces and without-spaces versions of the same
precise N-Triples document must both be accepted by conforming
processors; i.e. we suddenly have two different views of what Truth
is, based on how separate people with their separate parsers happen to
want to see the truth.
>> Note that nothing (except the badly written grammar for N-Triples)
>> prevents tools from putting single spaces after subjects, predicates,
>> and objects.
>
> I agree with the cost consideration, but i never actually saw nt/ttl
> serialized without whitespace. Can you point me to a serializer that
> does this?
I'd argue the problem is never with the serializer. It's always with
how the deserializer interprets its data, and how that affects what is
accepted as part of the Ground Truth represented by the data ingested
by any and all software systems. If your deserializer deems N-Triples
data without spaces to be malformed, then as far as your whole
software framework sees the world, anything serialized in the
without-spaces format suddenly became deasserted.
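A hypothetical illustration of two deserializers disagreeing on the
very same statement (the IRIs and the two regex recognizers are
invented for the example): one requires spaces between the terms, one
treats them as optional.

```python
import re

IRI = r'<[^\s<>]+>'
strict = re.compile(rf'{IRI} {IRI} {IRI} \.')        # mandatory spaces
lax = re.compile(rf'{IRI}\s*{IRI}\s*{IRI}\s*\.')     # optional spaces

doc = '<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'

print(bool(strict.fullmatch(doc)))  # False: the triple is "deasserted"
print(bool(lax.fullmatch(doc)))     # True: the triple is in the graph
```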
> IMO this whole discussion is mostly a theoretical one, doesn't really
> lead us anywhere and could easily be concluded: [...]
I'd argue to the contrary, at least on security grounds.
> [...] I wouldn't update the spec prohibiting parsing no whitespace nt,
> as this would be backwards incompatible.
I'd update the spec, allowing white space. That's what errata are for,
and the result would be backwards compatible. At least
intuition-compatible, whatever that means.
> I'd however slightly update the grammar prelude to more explicitly
> allow (and encourage) whitespace after each of the terms (even if not
> totally necessary) by replacing the following sentence: [...]
I'd rewrite the whole thing to actually make it at the same time both
reflect what was meant, as it does now, and to also be (preferably
obviously) part of a well-known formal grammar family.
--
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-40-3255353, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Saturday, 1 July 2017 11:11:45 UTC