Re: turtle conformance clause / strict-vs-loose parsing from Alex Hall on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Alex Hall <alexhall@revelytix.com>
Date: Fri, 18 May 2012 09:58:14 -0400
To: Sandro Hawke <sandro@w3.org>
Cc: Richard Cyganiak <richard@cyganiak.de>, public-rdf-wg <public-rdf-wg@w3.org>
Message-ID: <CAFq2biyTD8wb8iZZN5_qygHEDNPHJY5rREc7t6gZkXOVeuxrNQ@mail.gmail.com>
On Fri, May 18, 2012 at 9:52 AM, Sandro Hawke <sandro@w3.org> wrote:

> On Fri, 2012-05-18 at 09:30 -0400, Alex Hall wrote:
> > On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak
> > <richard@cyganiak.de> wrote:
> >         Sandro,
> >
> >         -1 to a “loose Turtle”.
> >
> >         If a conforming Turtle parser were allowed to accept a
> >         document containing <http://example.org/a|b>, then what next?
> >         This is not a valid IRI. So it is not allowed in an RDF graph.
> >         A Turtle parser is rarely a stand-alone system — it's a
> >         component in a larger system. Once the Turtle parser tries
> >         passing on the pseudo-IRI to the next component, then a number
> >         of things can happen:
> >
> >         The next component might reject it outright.
> >
> >         Or the next component accepts it and stores the pseudo-IRI.
> >         Then the user can do their thing. Then when the user tries to
> >         save their work, the serializer checks IRIs and rejects it,
> >         taking down the app with an error message. (This is Jena's
> >         default behaviour, or at least was the last time I checked.)
> >
> >         Or maybe the entire system works, except that now we have a
> >         situation where certain RDF “graphs” can be loaded and saved
> >         in Turtle but not in other syntaxes. This will cause major
> >         headaches for users, who will end up messing around with
> >         format converters in order to get broken data into a format
> >         that doesn't complain about the data being broken.
> >
> >         Or maybe the system accepts the IRI and puts it into its
> >         store, but then you can't delete it from the store any more
> >         because the SPARQL Update part of the system is stricter and
> >         rejects DELETE DATA commands containing broken IRIs.
> >
> >         Given the complexity of RDF-based systems, and the many
> >         interacting components and specifications involved, this kind
> >         of error handling cannot be introduced for a single syntax. It
> >         has to be done centrally so that all involved components and
> >         specifications can behave in a consistent way. Defining
> >         algorithms for error recovery for broken RDF data may well be
> >         a good idea, but I don't think this should be part of a 1.1
> >         update to RDF, and I don't think we are chartered to do it.
> >
> >
> >
> > +1 to all these sentiments - in practice, letting an invalid IRI into
> > an RDF system will likely screw things up later down the line when
> > validation is eventually applied.
> >
> >
> > However, the Turtle grammar already allows the creation of invalid
> > IRIs. The main purpose of the IRIREF rule is to disallow characters
> > that are illegal everywhere in an IRI, but you can still construct an
> > invalid IRI by either:
> > 1. Using legal characters in an illegal order, e.g. <a#b#c>
> > 2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb>
> > (which is the escaped form of <a|b>)
> >
> >
> > This was illustrated in a message to public-rdf-comments
> > (http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
> > which pointed out a positive parser test in the test suite which
> > contained an IRI with escaped control characters. An implementation
> > that parsed the resulting, unescaped IRI using an IRI library was
> > reporting an error for this case. The consensus on the list was that
> > parsing with an IRI library is a perfectly appropriate thing to do,
> > and that the test case should be changed or removed.
> >
> >
> > In light of this, I think the Turtle document should give guidance
> > that the Turtle grammar alone is not sufficient to reject invalid
> > IRIs, and that conforming parsers MAY (or even SHOULD) do additional
> > validation against the grammar from RFC3987.
>
> That makes sense.   So what's the point of the IRIREF pattern being
> something more complex than /<[^ \t\n\r>]*>/ ?    (Or even /<[^>]*>/, or
> -- if you have the nongreedy operator --  just /<.*?>/)
>

I think the grammar as written is a happy compromise: it rejects input
that is obviously not an IRI because it contains illegal characters, without
introducing the full-blown complexity of RFC3987. Keeping in mind that not
all environments will have access to an IRI library, I don't think it's
appropriate to allow absolutely everything within the <> brackets.
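
To make the trade-off concrete, here is a minimal Python sketch comparing
the two kinds of pattern discussed in this thread. It rests on assumptions:
the strict pattern is IRIREF as it appears in the later Turtle 1.1
Recommendation (the draft under discussion may have differed), the loose
pattern is the first alternative Sandro suggested, and the names
STRICT_IRIREF, LOOSE_IRIREF, unescape and has_forbidden_char are
hypothetical, chosen only for illustration.

```python
import re

# Assumption: UCHAR and IRIREF as in the Turtle 1.1 Recommendation:
#   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
FORBIDDEN = r'\x00-\x20<>"{}|^`\\'
STRICT_IRIREF = re.compile(r'<(?:[^' + FORBIDDEN + r']|' + UCHAR + r')*>')

# The "loose" alternative: anything up to whitespace or '>'.
LOOSE_IRIREF = re.compile(r'<[^ \t\n\r>]*>')


def unescape(iriref):
    """Strip the angle brackets and resolve \\uXXXX / \\UXXXXXXXX escapes."""
    body = iriref[1:-1]
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)


def has_forbidden_char(iri):
    """True if the unescaped IRI contains a character IRIREF excludes."""
    return re.search('[' + FORBIDDEN + ']', iri) is not None


if __name__ == "__main__":
    for token in ['<http://example.org/a|b>', r'<a\u007Cb>', '<a#b#c>']:
        strict_ok = STRICT_IRIREF.fullmatch(token) is not None
        loose_ok = LOOSE_IRIREF.fullmatch(token) is not None
        leaked = strict_ok and has_forbidden_char(unescape(token))
        print(token, strict_ok, loose_ok, leaked)
```

The strict pattern rejects the literal '|' but still lets the escaped form
and <a#b#c> through, which is why a further check against RFC3987 after
unescaping would still be needed to catch every invalid IRI.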

-Alex



>
>    -- Sandro
>
> >
> > -Alex
> >
> >
> >
> >         Best,
> >         Richard
> >
> >
> >         On 17 May 2012, at 21:35, Sandro Hawke wrote:
> >
> >         > What should/may/must a Turtle parser do with a turtle document
> >         > like this:
> >         >
> >         > <http://example.org/a> <http://example.org/a> <http://example.org/a|b>.
> >         >
> >         > By the grammar, this is not a Turtle document, because of the '|'
> >         > character in a URI.   I don't think, however, that people writing
> >         > Turtle parsers will want to enforce this.  If they come across some
> >         > Turtle document that's got a URI like this -- they can still parse
> >         > it just fine, so they probably will.
> >         >
> >         > The language tokens like IRIREF and PNAME are defined in the grammar
> >         > with these vast regexps (if you macro-expand what's there, now), but
> >         > actually much simpler ones will produce the same result in practice --
> >         > they'll just tolerate some files that are not, strictly-speaking,
> >         > Turtle.  (I'm pretty sure -- maybe there are some corner cases with
> >         > missing whitespace where these regexps will give you a different
> >         > result than something more like any-character-up-until-a-delimiter.)
> >         >
> >         > I'm not sure anything has to change, but I think at very least the
> >         > conformance clause should be clear about whether it's okay to accept
> >         > a turtle document like my example above.
> >         >
> >         > It might be nice to have "strict" and "loose" parsers, especially if
> >         > we can define loose parsers in a way that makes them simpler to
> >         > implement, run faster, and never parse anything differently from a
> >         > strict parser.
> >         >
> >         > Of course, then I'm not quite sure the point of the strict parsers.
> >         >
> >         >  -- Sandro
>
Received on Friday, 18 May 2012 13:59:05 UTC