- From: Alex Hall <alexhall@revelytix.com>
- Date: Fri, 18 May 2012 09:58:14 -0400
- To: Sandro Hawke <sandro@w3.org>
- Cc: Richard Cyganiak <richard@cyganiak.de>, public-rdf-wg <public-rdf-wg@w3.org>
- Message-ID: <CAFq2biyTD8wb8iZZN5_qygHEDNPHJY5rREc7t6gZkXOVeuxrNQ@mail.gmail.com>
On Fri, May 18, 2012 at 9:52 AM, Sandro Hawke <sandro@w3.org> wrote:

> On Fri, 2012-05-18 at 09:30 -0400, Alex Hall wrote:
> > On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak
> > <richard@cyganiak.de> wrote:
> >         Sandro,
> >
> >         -1 to a “loose Turtle”.
> >
> >         If a conforming Turtle parser were allowed to accept a
> >         document containing <http://example.org/a|b>, then what next?
> >         This is not a valid IRI. So it is not allowed in an RDF graph.
> >         A Turtle parser is rarely a stand-alone system -- it's a
> >         component in a larger system. Once the Turtle parser tries
> >         passing on the pseudo-IRI to the next component, then a number
> >         of things can happen:
> >
> >         The next component might reject it outright.
> >
> >         Or the next component accepts it and stores the pseudo-IRI.
> >         Then the user can do their thing. Then when the user tries to
> >         save their work, the serializer checks IRIs and rejects it,
> >         taking down the app with an error message. (This is Jena's
> >         default behaviour, or at least was the last time I checked.)
> >
> >         Or maybe the entire system works, except that now we have a
> >         situation where certain RDF “graphs” can be loaded and saved
> >         in Turtle but not in other syntaxes. This will cause major
> >         headaches for users, who will end up messing around with
> >         format converters in order to get broken data into a format
> >         that doesn't complain about the data being broken.
> >
> >         Or maybe the system accepts the IRI and puts it into its
> >         store, but then you can't delete it from the store any more
> >         because the SPARQL Update part of the system is stricter and
> >         rejects DELETE DATA commands containing broken IRIs.
> >
> >         Given the complexity of RDF-based systems, and the many
> >         interacting components and specifications involved, this kind
> >         of error handling cannot be introduced for a single syntax. It
> >         has to be done centrally so that all involved components and
> >         specifications can behave in a consistent way. Defining
> >         algorithms for error recovery for broken RDF data may well be
> >         a good idea, but I don't think this should be part of a 1.1
> >         update to RDF, and I don't think we are chartered to do it.
> >
> > +1 to all these sentiments - in practice, letting an invalid IRI into
> > an RDF system will likely screw things up later down the line when
> > validation is eventually applied.
> >
> > However, the Turtle grammar already allows the creation of invalid
> > IRIs. The main purpose of the IRIREF rule is to disallow characters
> > that are illegal everywhere in an IRI, but you can still construct an
> > invalid IRI by either:
> > 1. Using legal characters in an illegal order, e.g. <a#b#c>
> > 2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb>
> > (which is the escaped form of <a|b>)
> >
> > This was illustrated in a message to public-rdf-comments
> > (http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
> > which pointed out a positive parser test in the test suite which
> > contained an IRI with escaped control characters. An implementation
> > that parsed the resulting, unescaped IRI using an IRI library was
> > reporting an error for this case. The consensus on the list was that
> > parsing with an IRI library is a perfectly appropriate thing to do,
> > and that the test case should be changed or removed.
> >
> > In light of this, I think the Turtle document should give guidance
> > that the Turtle grammar alone is not sufficient to reject invalid
> > IRIs, and that conforming parsers MAY (or even SHOULD) do additional
> > validation against the grammar from RFC3987.
>
> That makes sense. So what's the point of the IRIREF pattern being
> something more complex than /<[^ \t\n\r>]*>/ ? (Or even /<[^>]*>/, or
> -- if you have the nongreedy operator -- just /<.*?>/)

I think the grammar as written is a happy compromise of rejecting input
that is obviously not an IRI since it contains illegal characters,
without introducing the full-blown complexity of RFC3987. Keeping in
mind that not all environments will have access to an IRI library, I
don't think it's appropriate to allow absolutely everything within the
<> brackets.

-Alex

>
> -- Sandro
>
> > -Alex
> >
> >         Best,
> >         Richard
> >
> >         On 17 May 2012, at 21:35, Sandro Hawke wrote:
> >
> >         > What should/may/must a Turtle parser do with a turtle
> >         > document like this:
> >         >
> >         > <http://example.org/a> <http://example.org/a>
> >         > <http://example.org/a|b>.
> >         >
> >         > By the grammar, this is not a Turtle document, because of
> >         > the '|' character in a URI. I don't think, however, that
> >         > people writing Turtle parsers will want to enforce this.
> >         > If they come across some Turtle document that's got a URI
> >         > like this -- they can still parse it just fine, so they
> >         > probably will.
> >         >
> >         > The language tokens like IRIREF and PNAME are defined in
> >         > the grammar with these vast regexps (if you macro-expand
> >         > what's there, now), but actually much simpler ones will
> >         > produce the same result in practice -- they'll just
> >         > tolerate some files that are not, strictly-speaking,
> >         > Turtle. (I'm pretty sure -- maybe there are some corner
> >         > cases with missing whitespace where these regexps will
> >         > give you a different result than something more like
> >         > any-character-up-until-a-delimiter.)
> >         >
> >         > I'm not sure anything has to change, but I think at very
> >         > least the conformance clause should be clear about whether
> >         > it's okay to accept a turtle document like my example
> >         > above.
> >         >
> >         > It might be nice to have "strict" and "loose" parsers,
> >         > especially if we can define loose parsers in a way that
> >         > makes them simpler to implement, run faster, and never
> >         > parse anything differently from a strict parser.
> >         >
> >         > Of course, then I'm not quite sure the point of the strict
> >         > parsers.
> >         >
> >         > -- Sandro
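
For illustration only, here is a rough Python sketch of the point made above: a grammar-level IRIREF check rejects a literal '|' but still admits <a\u007Cb> and <a#b#c>, while the looser /<[^>]*>/ pattern admits all three. The IRIREF regex is an approximation of the Turtle token rule, not a copy of the draft grammar, and the names unescape and looks_invalid are hypothetical helpers. looks_invalid is a crude stand-in for real RFC3987 validation, which would normally come from an IRI library.

```python
import re

# Assumption: approximates Turtle's IRIREF token. It forbids control
# characters, space and <>"{}|^`\ inside the brackets, but allows
# \uXXXX and \UXXXXXXXX escapes.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)

# The looser pattern floated in the thread: anything up to the next '>'.
LOOSE = re.compile(r'<[^>]*>')

ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')


def grammar_accepts(token: str) -> bool:
    return IRIREF.fullmatch(token) is not None


def loose_accepts(token: str) -> bool:
    return LOOSE.fullmatch(token) is not None


def unescape(token: str) -> str:
    """Strip the angle brackets and decode the Unicode escapes."""
    def decode(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return ESCAPE.sub(decode, token[1:-1])


def looks_invalid(iri: str) -> bool:
    """Hypothetical post-parse check; NOT a full RFC3987 validator."""
    has_bad_char = any(c in '<>"{}|^`\\' or ord(c) <= 0x20 for c in iri)
    return has_bad_char or iri.count('#') > 1


for token in [r'<http://example.org/a|b>', r'<a\u007Cb>', '<a#b#c>',
              '<http://example.org/a>']:
    strict = grammar_accepts(token)
    iri = unescape(token)
    print(token, 'grammar:', strict, 'loose:', loose_accepts(token),
          'invalid after unescape:', looks_invalid(iri))
```

Under these assumptions the grammar catches only the first token. The second and third pass the token rule and are only caught by the extra validation step, which is why the guidance suggested above (parsers MAY or SHOULD validate beyond the grammar) matters.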
Received on Friday, 18 May 2012 13:59:05 UTC