- From: Sampo Syreeni <decoy@iki.fi>
- Date: Sat, 1 Jul 2017 14:11:11 +0300 (EEST)
- To: Jörn Hees <j_hees@cs.uni-kl.de>
- cc: SW-forum Web <semantic-web@w3.org>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
- Message-ID: <alpine.DEB.2.11.1707011139040.6254@lakka.kapsi.fi>
On 2017-06-29, Jörn Hees wrote:

> Canonical N-Triples for example allow you to count predicates this
> easy:
>
> cut -d' ' -f2 < input.nt | sort | uniq -c

Even if N-Triples without spaces can be easily cut into, it is a huge hassle to always have to be prepared for the eventuality. Especially in a syntax meant to simplify things from the very start.

I'd also like to reminisce on a little project of mine (unfortunately never finished). I once took it upon myself to write a parser for the PICS vocabulary, in its pre-RDF form. I actually did finish a do-nothing parser, using the then-current version of GNU flex. Nothing great here, because it really did nothing but parse, but I did learn a couple of things on the way.

First, the vocabulary suffered from the very same problem with spaces. Parsing-wise that problem can be benign, as it was with PICS, and because of N-Triples's even more regular syntax, most likely it's going to be benign there as well.

Formally, such problems tend to be caused by defining a syntax one step too shallow in (A)BNF. What should be terminals only often prove to be de facto abstract non-terminals, so that the precise usage of especially white space is left unspecified. Typically the languages we want to specify can be easily and naturally augmented with the precise space conventions, as was the case with my fling with PICS. Mostly you just include an \s* (using eregex notation) into the syntax, preceding and following "the usual suspects", at the before-transformation terminal level. Then left-normalize the grammar, and be done with it.

It almost always works. The operative word being almost. In some cases you can't get away with such a straightforward transformation.
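To make the hassle concrete, here is a minimal sketch (with made-up example IRIs) of what naive field splitting does to two encodings of the very same triple -- the space-free form is still one legal triple, but the splitting trick no longer finds a predicate field:

```python
# Two encodings of the same triple: one with single-space separators,
# one with no white space between terms at all.
with_spaces = '<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .'
no_spaces = '<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'

# The spaced form splits into four fields; field 2 is the predicate,
# which is exactly what the cut(1) one-liner relies on.
print(with_spaces.split(' ')[1])   # <http://ex.org/p>

# The space-free form yields a single "field": the whole triple.
print(no_spaces.split(' '))
```

Any consumer built on field splitting silently miscounts (or crashes on) the second form, even though both lines denote the same triple.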
One of the most commonly used, and as such perilous, productions would be the one defining an identifier, and its use in a delimited list:

  id      ::= alpha (alpha | digit)*
  id-list ::= "(" id* ")"

If you want to put in those optional spaces, you'll have to know that you shouldn't be putting them around the alpha and digit parts of id. But you should at the same time be putting them around the parenthesis literals and any full invocation of id, so that the list of ids stays space separated.

Except that the definition of the list is already ambiguous. The grammar accepts the string "aa", but it can be parsed in two separate ways: as one id consisting of two a's in a row, or as one id "a" followed by another identical id.

If we introduce the idea that every left hand side of a production sorta-kinda-introduces implicit bracketing of the right hand side by optional white space, we sorta-kinda get what we meant to say from the start. At least as long as we keep our grammar in some sense minimal and concept-like, as we often do in the grammars we use in standards work. You can't really make that idea crisp and formal, but "everybody knows already what we mean by it, and goes along with the ploy": my id above is a whole concept which admits optional, implicit white space on both edges of its right hand side, as is id-list, and if we apply that idea even semi-formally, by adding the prerequisite \s* (in regexp notation) on both sides and then normalizing from regular right parts to BNF, we get the kind of grammar we wanted to have in the first place.

It's just that we never bothered to spell our little trick aloud, and the trick can't really be generalized; in particular it doesn't survive even the simplest of the manipulations formal grammar theory subjects its sets of productions to.
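The ambiguity is easy to count mechanically. A sketch (my own toy code, not from any spec): a backtracking function that counts the distinct ways a string of alphas can be cut into a sequence of ids under the production above, with no separator required between consecutive ids:

```python
def parses(s):
    """Count distinct segmentations of an all-alpha string into a
    sequence of ids under  id ::= alpha (alpha | digit)*  when
    nothing -- not even a space -- has to separate consecutive ids."""
    if s == '':
        return 1  # the empty tail: one way (end the list here)
    total = 0
    for i in range(1, len(s) + 1):  # let the first id take i characters
        total += parses(s[i:])
    return total

print(parses('aa'))   # 2: one id "aa", or "a" followed by "a"
print(parses('abc'))  # 4: every composition of 3 letters
```

The count doubles with every extra character, which is exactly why "the grammar accepts the string, but in two (or 2^n) separate ways" is not a cosmetic problem.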
You suddenly can't even transform

  id ::= alpha (alpha | digit)*

into

  id ::= alpha \1
  \1 ::= (alpha | digit)*

because, expanded, the first one would then become

  id ::= \s* alpha (alpha | digit)* \s*

and the latter one

  id ::= \s* alpha \s* (alpha | digit)* \s*

which is clearly different.

Now, there is ample precedent in the compiler literature for doing something like this. This isn't just idle speculation, nor are its sequelae. Namely, the whole, classical idea of using a lower level, more efficient (?) lexer (recognizing at most regular languages, eagerly) in conjunction with a higher level, more involved parser (recognizing a context free language) leads to just this kind of thinking. The higher level parser language is there thought to define units, easily amenable to free space on either side, and the lower level lexer is usually constructed so as to just eat white space before returning a lexeme. Often, in fact, even to rely on white space separation of lexemes, which by definition the parser then never sees.

The precise interaction between the parser and the lexer, especially with regard to white space, is actually left formally unspecified -- just like in my above example the idea of implicit bracketing didn't really lead to sane, consistent derivational rules within a given context free grammar. So, that overall picture has tended to leave quite a number of formal languages out there underspecified. Or at least weirdly specified, by implementations not really formulated in any of the accepted grammar formalisms. Not too amenable to formal, unambiguous specification in even ABNF.

Here I finally get back to my little piece of flex work (I didn't know how to use a second level parser like bison then, so I actually did the recursive work within flex itself) with the early PICS vocabulary.
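A toy sketch of that classical lexer, in the mold the compiler literature assumes (my own illustration, not any particular tool's output): eat any white space, then return the longest match for one token class. The parser above it never sees the spaces at all, and the lexer's eager longest-match rule is what quietly resolves the "aa" vs "a a" ambiguity:

```python
import re

# One token per match: optional leading white space, then an id or a paren.
TOKEN = re.compile(r'\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<lpar>\()|(?P<rpar>\)))')

def lex(text):
    """Tokenize text, silently swallowing inter-token white space."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            break  # trailing space or garbage; a real lexer would report it
        pos = m.end()
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
    return tokens

print(lex('(a a)'))  # two ids: the space separated them
print(lex('(aa)'))   # one id: longest match ate both characters
```

Note that the white-space handling lives entirely in that leading \s* inside the token pattern -- which is precisely the part of the parser/lexer contract the surrounding grammar formalism never states.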
Not only did the W3C spec not really specify how white space should be handled, the reduced-to-character-level grammar actually managed to contain *precisely* one thing not amenable to flex's LALR(1) framework, plus what I did by hand. I don't remember exactly how many characters of lookahead it eventually needed in order for the grammar to be disambiguated, but it was 5, 6 or 7; no more, no less. In a sure-not-to-be-used error reporting branch of the thing, yielding a conflict with one of the most used parsing paths.

That is a 100% certain signal that nobody bothered to implement even the most basic parser for the grammar before setting the latter in W3C-sanctioned stone. It's also something which should *never*, *ever* happen, because it means that you're not perfectly sure how your code should interpret its input. An uncertainty about what to accept as input. Just think of the security implications.

Rev back to my simplistic examples of ambiguity above. Even such simple things as a couple of spaces here and there can and do wreak havoc when used by a malicious attacker. It only takes a poorly specified language, leading to a poorly specified parser, which then accepts an empty-space-separated pair of identifiers as one full identifier. Or accepts a quadrillion spaces between them, for a nicely exploitable buffer overrun. Or, my favourite, it runs into an ambiguity in a grammar like PICS, which seems like it shouldn't have any, so that suddenly a well optimized parser which only looks one character ahead could mistake an error report for more typical content, or vice versa, or altogether lose track of its internal state, leading to who knows what kind of attack surface as a result.

Long story short, spaces and their modelling in grammars do matter. The precise minutiae of grammar formalisms matter.
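For the N-Triples case, making the white space question explicit rather than accidental is a few lines of work. A hypothetical sketch (IRI terms only, made-up example IRIs; not the spec's grammar) of a deserializer whose inter-term \s* is stated right in the pattern, so that both the spaced and the space-free encodings land in the same Ground Truth:

```python
import re

# One triple of IRIs, with explicitly optional white space between terms.
TRIPLE = re.compile(
    r'\s*<(?P<s>[^>]*)>'
    r'\s*<(?P<p>[^>]*)>'
    r'\s*<(?P<o>[^>]*)>'
    r'\s*\.\s*$')

def predicate(line):
    """Return the predicate IRI of one triple, spaces or no spaces."""
    m = TRIPLE.match(line)
    if m is None:
        raise ValueError('malformed triple: %r' % line)
    return m.group('p')

print(predicate('<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .'))
print(predicate('<http://ex.org/s><http://ex.org/p><http://ex.org/o>.'))
# both lines yield the same predicate
```

The point is not this particular regex, but that every place where white space is or isn't allowed is written down, so two implementers can't end up with two different answers.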
I'd just *hate* it if something like N-Triples, meant to simplify N3 and to serve as an all-round, most-easiest textual transfer syntax for the Graph Model, stumbled into something as trivial as this. Because it could, and it'd be beyond horrible for anyone even superficially interested in making the Web Semantic, because here we have the simplest of simple problems, already well understood, yet ostensibly overlooked till now. If we can't do even that right, what hope is there to do anything more interesting/less trivial?

> One could obviously split at '<' or '>' and depending on BNode take
> the first or second of those, oh and objects become even more fun...

Yes. Depending on your interpretation, you can do whatever you happen to like. There isn't even an overarching generalization which everybody agrees upon, nor some minimal intersection. There's ambiguity; the spice of bad.

> So sorry, but i don't think that is "still easy".

As per the above, it should be easy as peaches. Whoever wrote the spec should have cast it in a foolproof formal grammar, hardened by experience and actual implementation. Not prose involving white space and separations and whatnot, but something translatable to an automaton which either accepts a certain string of octets (as by themselves well defined by ISO)...or not. If you were *really* uptight about it, you would preëmptively implement both a canonical parser, in some well-known parser framework, and a directable fuzzer as its dual, so as to derive all the examples and counter-examples ever needed in discussions like these. Seems like someone wasn't uptight enough.

>> It might have been better at the beginning to require white space
>> after the subject, predicate, and object of a triple in N-Triples,
>> but given that that wasn't required I don't see that the costs of
>> requiring them now are worth any minor benefits in human readability
>> that might ensue.

That ain't the problem at all.
The problem is that the vocabulary suddenly isn't well defined for machine consumption. We suddenly don't know whether the with-spaces and without-spaces versions of the same precise N-Triples document must both be accepted by conforming processors; i.e. we suddenly have two different views of what Truth is, based on how separate people with their separate parsers happen to want to see the truth.

>> Note that nothing (except the badly written grammar for N-Triples)
>> prevents tools from putting single spaces after subjects, predicates,
>> and objects.
>
> I agree with the cost consideration, but i never actually saw nt/ttl
> serialized without whitespace. Can you point me to a serializer that
> does this?

I'd argue the problem is never with the serializer. It's always with how the deserializer interprets its data, and how that affects what is accepted, and how, as part of the Ground Truth represented by the data ingested by any and all software systems. If your deserializer deems N-Triples data without spaces to be malformed, then as far as your whole software framework sees the world, anything expressed using without-spaces formatting suddenly became deasserted.

> IMO this whole discussion is mostly a theoretical one, doesn't really
> lead us anywhere and could easily be concluded: [...]

I'd argue to the contrary, at least on security grounds.

> [...] I wouldn't update the spec prohibiting parsing no whitespace nt,
> as this would be backwards incompatible.

I'd update the spec, allowing white space. That's what errata are for, and the result would be backwards compatible. At least intuition-compatible, whatever that means.

> I'd however slightly update the grammar prelude to more explicitly
> allow (and encourage) whitespace after each of the terms (even if not
> totally necessary) by replacing the following sentence: [...]
I'd rewrite the whole thing to actually make it at the same time both reflect what was meant, as it does now, and also be (preferably obviously) part of a well-known formal grammar family.

-- 
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-40-3255353, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Saturday, 1 July 2017 11:11:45 UTC