Re: turtle conformance clause / strict-vs-loose parsing from Sandro Hawke on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Fri, 18 May 2012 10:24:54 -0400
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: public-rdf-wg@w3.org
Message-ID: <1337351094.17747.62.camel@waldron>
On Fri, 2012-05-18 at 14:06 +0100, Andy Seaborne wrote:
> jison supports flex format.

Yes, I know.  I think I made several references to that in my email,
although I spoke in generalities.   That support appears to be one of
the sources of bugs.    When UCHAR was expanded into IRIREF, the \x5c
got turned into a \\x5cu.
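
To make the escaping problem concrete, here is a minimal sketch of how one extra backslash changes what the IRIREF pattern accepts. It only reproduces the symptom; it is not jison's actual composition code, and the names HEX, PLAIN, and iriref are made up for illustration.

```javascript
// Hypothetical helper names; the character sets follow the IRIREF regexp
// discussed in this thread.
const HEX = "[0-9A-Fa-f]";
const PLAIN = "[^\\x00-\\x20<>\\x5c\\x22\\x7b\\x7d\\x7c^`]";

// Build an IRIREF regexp, given the regexp source used for "a backslash".
function iriref(backslash) {
  const uchar = backslash + "u" + HEX + "{4}|" + backslash + "U" + HEX + "{8}";
  return new RegExp("^<(?:" + PLAIN + "|" + uchar + ")*>$");
}

const intended = iriref("\\x5c");   // regexp source \x5c: one backslash character
const doubled = iriref("\\\\x5c");  // regexp source \\x5c: a backslash, then the text "x5c"

const escaped = "<http://example.org/\\u0041>"; // contains a literal \u0041 escape

console.log(intended.test(escaped)); // true
console.log(doubled.test(escaped));  // false
console.log(doubled.test("<http://example.org/\\x5cu0041>")); // true
```

With the doubled backslash, a legal \uXXXX escape is rejected, while the odd literal text \x5cuXXXX is accepted. IRIs without escapes behave the same under both patterns, which may be why this kind of bug is easy to miss.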

> On 18/05/12 13:08, Sandro Hawke wrote:
> > On Fri, 2012-05-18 at 12:22 +0100, Steve Harris wrote:
> >> Yes, we've actually had this issue in practice (lots of web URLs are not legal URIs). You want to find out as early as possible that you have a problem. For us that's when PUTing Turtle to an RDF store.
> >>
> >> - Steve
> >>
> >> On 2012-05-18, at 11:20, Richard Cyganiak wrote:
> >>
> >>> Sandro,
> >>>
> >>> -1 to a “loose Turtle”.
> >>>
> >>> If a conforming Turtle parser were allowed to accept a document containing <http://example.org/a|b>, then what next? This is not a valid IRI. So it is not allowed in an RDF graph. A Turtle parser is rarely a stand-alone system; it's a component in a larger system. Once the Turtle parser tries passing on the pseudo-IRI to the next component, then a number of things can happen:
> >>>
> >>> The next component might reject it outright.
> >>>
> >>> Or the next component accepts it and stores the pseudo-IRI. Then the user can do their thing. Then when the user tries to save their work, the serializer checks IRIs and rejects it, taking down the app with an error message. (This is Jena's default behaviour, or at least was the last time I checked.)
> >>>
> >>> Or maybe the entire system works, except that now we have a situation where certain RDF “graphs” can be loaded and saved in Turtle but not in other syntaxes. This will cause major headaches for users, who will end up messing around with format converters in order to get broken data into a format that doesn't complain about the data being broken.
> >>>
> >>> Or maybe the system accepts the IRI and puts it into its store, but then you can't delete it from the store any more because the SPARQL Update part of the system is stricter and rejects DELETE DATA commands containing broken IRIs.
> >>>
> >>> Given the complexity of RDF-based systems, and the many interacting components and specifications involved, this kind of error handling cannot be introduced for a single syntax. It has to be done centrally so that all involved components and specifications can behave in a consistent way. Defining algorithms for error recovery for broken RDF data may well be a good idea, but I don't think this should be part of a 1.1 update to RDF, and I don't think we are chartered to do it.
> >
> > Yeah, speaking as a standards person, I absolutely agree with you.
> >
> > But I put on my implementer's hat this week and had two problems:
> >
> > 1.  The regexps seemed to be breaking various tools.
> 
> If their regexp impl is broken, then it's their bug, not our problem.

In theory.   But in practice, if our language is complex enough to break
tools, that still hurts Turtle.   And if that complexity is needless, as
it appears to be in this case, I think we should simplify.

> > But it's very hard
> > to tell if it's the regexps or the tools, because of how big they are
> > (and I'm using Javascript which doesn't allow spaces or comments in
> > regexps).   For example, here's the regexp generated from our grammar
> > for IRIREF, with all the chars that I was particularly worried about
> > done as hex escapes:
> >
> > /^(<([^\x00-\x20<>\x5c\x22\x7b\x7d\x7c^`\x5c]|((\\x5cu([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f]))|(\\x5cU([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f]))))*>)
> >
> > Actually, looking at it now, I see two errors in it, caused by how jison
> > composes named regexps containing backslashes.
> 
> jison documentation says it supports flex format.  No need to expand it, 
> write token rules.  In fact, just copy from yacker.  Run the tool, don't 
> read the machine generated code.

That's certainly what I tried, at first.

> Let's not sanction a lack of interoperability. 

My proposed change to the Turtle spec would increase interoperability.
It would make it easier to implement Turtle; it would make it easier to
reuse IRI validation libraries; it would make it easier to give helpful
error messages/warnings; it could avoid making an arbitrary decision
about which IRI syntax errors are allowed in Turtle and which are not;
and it could allow for a clear and standard way to handle Turtle
documents which happen to violate the IRI syntax rules.
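
As a rough illustration of the kind of split I mean (a sketch under my own assumptions, not proposed spec text): the tokenizer accepts anything between angle brackets, and a separate step reports which characters break the IRI rules. The names LOOSE_IRIREF, ESCAPE_OR_FORBIDDEN, and scanIriref are hypothetical, and a real implementation would presumably hand the string to a proper IRI validation library instead of the small character check used here.

```javascript
// Stage 1 (hypothetical): a loose token, anything between angle brackets.
const LOOSE_IRIREF = /^<([^>]*)>/;

// Stage 2 (hypothetical): skip over \uXXXX and \UXXXXXXXX escapes, and capture
// any character that the strict IRIREF character class would have excluded.
const ESCAPE_OR_FORBIDDEN =
  /\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}|([\x00-\x20<>\\"{}|^`])/g;

function scanIriref(input) {
  const m = LOOSE_IRIREF.exec(input);
  if (m === null) return null; // not an IRIREF token at all

  const iri = m[1];
  const warnings = [];
  for (const hit of iri.matchAll(ESCAPE_OR_FORBIDDEN)) {
    if (hit[1] !== undefined) { // a forbidden character, not an escape
      warnings.push(
        "character " + JSON.stringify(hit[1]) +
        " at offset " + hit.index + " is not allowed in an IRI");
    }
  }
  return { iri, warnings };
}

console.log(scanIriref("<http://example.org/a|b> ."));
```

The caller can then decide whether a non-empty warnings list is a hard error or just a warning, and the message can point at the exact offending character instead of failing with a generic syntax error.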

>  Where's the web in 'semantic web' gone?

I have no idea what this sentence means.

   -- Sandro
Received on Friday, 18 May 2012 14:25:10 UTC