Re: turtle conformance clause / strict-vs-loose parsing from Alex Hall on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Alex Hall <alexhall@revelytix.com>
Date: Fri, 18 May 2012 09:30:44 -0400
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Sandro Hawke <sandro@w3.org>, public-rdf-wg <public-rdf-wg@w3.org>
Message-ID: <CAFq2biyaDnPUxbtKu5a0sFwTbJNfALUPvafYhG8vVXn5vcot8Q@mail.gmail.com>
On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak <richard@cyganiak.de> wrote:

> Sandro,
>
> -1 to a “loose Turtle”.
>
> If a conforming Turtle parser were allowed to accept a document containing
> <http://example.org/a|b>, then what next? This is not a valid IRI. So it
> is not allowed in an RDF graph. A Turtle parser is rarely a stand-alone
> system — it's a component in a larger system. Once the Turtle parser tries
> passing on the pseudo-IRI to the next component, then a number of things
> can happen:
>
> The next component might reject it outright.
>
> Or the next component accepts it and stores the pseudo-IRI. Then the user
> can do their thing. Then when the user tries to save their work, the
> serializer checks IRIs and rejects it, taking down the app with an error
> message. (This is Jena's default behaviour, or at least was the last time I
> checked.)
>
> Or maybe the entire system works, except that now we have a situation
> where certain RDF “graphs” can be loaded and saved in Turtle but not in
> other syntaxes. This will cause major headaches for users, who will end up
> messing around with format converters in order to get broken data into a
> format that doesn't complain about the data being broken.
>
> Or maybe the system accepts the IRI and puts it into its store, but then
> you can't delete it from the store any more because the SPARQL Update part
> of the system is stricter and rejects DELETE DATA commands containing
> broken IRIs.
>
> Given the complexity of RDF-based systems, and the many interacting
> components and specifications involved, this kind of error handling cannot
> be introduced for a single syntax. It has to be done centrally so that all
> involved components and specifications can behave in a consistent way.
> Defining algorithms for error recovery for broken RDF data may well be a
> good idea, but I don't think this should be part of a 1.1 update to RDF,
> and I don't think we are chartered to do it.
>
>
+1 to all these sentiments - in practice, letting an invalid IRI into an
RDF system will likely screw things up later down the line when validation
is eventually applied.

However, the Turtle grammar already allows the creation of invalid IRIs.
The main purpose of the IRIREF rule is to disallow characters that are
illegal everywhere in an IRI, but you can still construct an invalid IRI by
either:
1. Using legal characters in an illegal order, e.g. <a#b#c>
2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb> (which is
the escaped form of <a|b>)
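
As a rough illustration of the two cases above, here is a minimal Python sketch. It assumes the IRIREF production reads '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>' and that UCHAR is \uXXXX or \UXXXXXXXX; the names IRIREF, UCHAR and unescape are made up for this example and do not come from any particular parser.

```python
import re

# Assumed IRIREF production: '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

def unescape(token):
    """Strip the angle brackets and decode \\uXXXX / \\UXXXXXXXX escapes."""
    body = token[1:-1]
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)

# A literal '|' is rejected by the token rule, but the escaped form and
# the double-'#' form both pass it.
for token in ['<a|b>', r'<a\u007Cb>', '<a#b#c>']:
    accepted = IRIREF.fullmatch(token) is not None
    print(token, accepted, unescape(token) if accepted else None)
```

Under these assumptions, the token rule rejects <a|b> but accepts <a\u007Cb>, which unescapes to the same string a|b, and it also accepts <a#b#c>.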

This was illustrated in a message to public-rdf-comments (
http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
which pointed out a positive parser test in the test suite which contained
an IRI with escaped control characters. An implementation that parsed the
resulting, unescaped IRI using an IRI library was reporting an error for
this case. The consensus on the list was that parsing with an IRI library
is a perfectly appropriate thing to do, and that the test case should be
changed or removed.

In light of this, I think the Turtle document should give guidance that the
Turtle grammar alone is not sufficient to reject invalid IRIs, and that
conforming parsers MAY (or even SHOULD) do additional validation against
the grammar from RFC3987.
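
To make the suggestion concrete, a post-parse check could look something like the sketch below. It is deliberately partial: it only looks for characters that RFC 3987 allows nowhere in an IRI and for a second '#', and it is not a substitute for a full RFC 3987 parser. The names FORBIDDEN and iri_problems are hypothetical.

```python
import re

# Characters assumed here to be illegal everywhere in an IRI:
# controls, space, and the delimiters < > " { } | ^ ` \
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def iri_problems(iri):
    """Return a list of problems found in an already-unescaped IRI string.

    Partial check only; an empty list does not prove the IRI is valid.
    """
    problems = []
    if FORBIDDEN.search(iri):
        problems.append('forbidden character')
    if iri.count('#') > 1:
        # The fragment part of an IRI cannot itself contain '#'.
        problems.append("more than one '#'")
    return problems

for iri in ['http://example.org/a', 'a|b', 'a#b#c']:
    print(repr(iri), iri_problems(iri))
```

A parser that follows the MAY/SHOULD guidance could run a check of this kind on each IRI after unescaping and report an error, or a warning, when the list is non-empty.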

-Alex



> Best,
> Richard
>
>
> On 17 May 2012, at 21:35, Sandro Hawke wrote:
>
> > What should/may/must a Turtle parser do with a turtle document like
> > this:
> >
> > <http://example.org/a> <http://example.org/a> <http://example.org/a|b>.
> >
> > By the grammar, this is not a Turtle document, because of the '|'
> > character in a URI.   I don't think, however, that people writing Turtle
> > parsers will want to enforce this.  If they come across some Turtle
> > document that's got a URI like this -- they can still parse it just
> > fine, so they probably will.
> >
> > The language tokens like IRIREF and PNAME are defined in the grammar
> > with these vast regexps (if you macro-expand what's there, now), but
> > actually much simpler ones will produce the same result in practice --
> > they'll just tolerate some files that are not, strictly-speaking,
> > Turtle.  (I'm pretty sure -- maybe there are some corner cases with
> > missing whitespace where these regexps will give you a different result
> > than something more like any-character-up-until-a-delimiter.)
> >
> > I'm not sure anything has to change, but I think at the very least the
> > conformance clause should be clear about whether it's okay to accept a
> > turtle document like my example above.
> >
> > It might be nice to have "strict" and "loose" parsers, especially if we
> > can define loose parsers in a way that makes them simpler to implement,
> > run faster, and never parse anything differently from a strict parser.
> >
> > Of course, then I'm not quite sure of the point of the strict parsers.
> >
> >  -- Sandro
> >
>
Received on Friday, 18 May 2012 13:31:41 UTC