Re: turtle conformance clause / strict-vs-loose parsing from Sandro Hawke on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Fri, 18 May 2012 10:08:48 -0400
To: Alex Hall <alexhall@revelytix.com>
Cc: Richard Cyganiak <richard@cyganiak.de>, public-rdf-wg <public-rdf-wg@w3.org>
Message-ID: <1337350128.17747.48.camel@waldron>
On Fri, 2012-05-18 at 09:58 -0400, Alex Hall wrote:
> On Fri, May 18, 2012 at 9:52 AM, Sandro Hawke <sandro@w3.org> wrote:
>         On Fri, 2012-05-18 at 09:30 -0400, Alex Hall wrote:
>         > On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak
>         > <richard@cyganiak.de> wrote:
>         >         Sandro,
>         >
>         >         -1 to a “loose Turtle”.
>         >
>         >         If a conforming Turtle parser were allowed to accept a
>         >         document containing <http://example.org/a|b>, then what next?
>         >         This is not a valid IRI. So it is not allowed in an RDF graph.
>         >         A Turtle parser is rarely a stand-alone system; it's a
>         >         component in a larger system. Once the Turtle parser tries
>         >         passing on the pseudo-IRI to the next component, then a number
>         >         of things can happen:
>         >
>         >         The next component might reject it outright.
>         >
>         >         Or the next component accepts it and stores the pseudo-IRI.
>         >         Then the user can do their thing. Then when the user tries to
>         >         save their work, the serializer checks IRIs and rejects it,
>         >         taking down the app with an error message. (This is Jena's
>         >         default behaviour, or at least was the last time I checked.)
>         >
>         >         Or maybe the entire system works, except that now we have a
>         >         situation where certain RDF “graphs” can be loaded and saved
>         >         in Turtle but not in other syntaxes. This will cause major
>         >         headaches for users, who will end up messing around with
>         >         format converters in order to get broken data into a format
>         >         that doesn't complain about the data being broken.
>         >
>         >         Or maybe the system accepts the IRI and puts it into its
>         >         store, but then you can't delete it from the store any more
>         >         because the SPARQL Update part of the system is stricter and
>         >         rejects DELETE DATA commands containing broken IRIs.
>         >
>         >         Given the complexity of RDF-based systems, and the many
>         >         interacting components and specifications involved, this kind
>         >         of error handling cannot be introduced for a single syntax. It
>         >         has to be done centrally so that all involved components and
>         >         specifications can behave in a consistent way. Defining
>         >         algorithms for error recovery for broken RDF data may well be
>         >         a good idea, but I don't think this should be part of a 1.1
>         >         update to RDF, and I don't think we are chartered to do it.
>         >
>         >
>         >
>         > +1 to all these sentiments - in practice, letting an invalid IRI
>         > into an RDF system will likely screw things up later down the line
>         > when validation is eventually applied.
>         >
>         >
>         > However, the Turtle grammar already allows the creation of invalid
>         > IRIs. The main purpose of the IRIREF rule is to disallow characters
>         > that are illegal everywhere in an IRI, but you can still construct
>         > an invalid IRI by either:
>         > 1. Using legal characters in an illegal order, e.g. <a#b#c>
>         > 2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb>
>         > (which is the escaped form of <a|b>)
>         >
>         >
>         > This was illustrated in a message to public-rdf-comments
>         > (http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
>         > which pointed out a positive parser test in the test suite which
>         > contained an IRI with escaped control characters. An implementation
>         > that parsed the resulting, unescaped IRI using an IRI library was
>         > reporting an error for this case. The consensus on the list was
>         > that parsing with an IRI library is a perfectly appropriate thing
>         > to do, and that the test case should be changed or removed.
>         >
>         >
>         > In light of this, I think the Turtle document should give guidance
>         > that the Turtle grammar alone is not sufficient to reject invalid
>         > IRIs, and that conforming parsers MAY (or even SHOULD) do
>         > additional validation against the grammar from RFC3987.
>         
>         
>         That makes sense.   So what's the point of the IRIREF pattern being
>         something more complex than /<[^ \t\n\r>]*>/ ?    (Or even /<[^>]*>/,
>         or -- if you have the nongreedy operator --  just /<.*?>/)
> 
> 
> I think the grammar as written is a happy compromise of rejecting
> input that is obviously not an IRI since it contains illegal
> characters, without introducing the full-blown complexity of RFC3987.

So, some bad-grammar IRIs get rejected by the Turtle grammar, and others
are accepted just fine?    What's the criterion we are using for
deciding which are allowed in Turtle and which are not?
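
To make the question concrete, here is a minimal Python sketch comparing a
strict token pattern with the looser one quoted above. The strict pattern is
an assumption: it is modelled on the IRIREF production as published in the
Turtle 1.1 Recommendation, which may differ from the draft under discussion.
The names STRICT_IRIREF, LOOSE_IRIREF, accepts and compare are hypothetical.

```python
import re

# Assumption: modelled on the Turtle 1.1 IRIREF production,
#   '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
STRICT_IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

# The simpler pattern from this thread: anything up to whitespace or '>'.
LOOSE_IRIREF = re.compile(r'<[^ \t\n\r>]*>')

SAMPLES = [
    '<http://example.org/a>',    # fine everywhere
    '<http://example.org/a|b>',  # raw '|' is excluded by the strict pattern
    '<a#b#c>',                   # legal characters in an illegal order
    r'<a\u007Cb>',               # escaped '|' slips through the grammar
]


def accepts(pattern, token):
    """Return True if the whole token matches the pattern."""
    return pattern.fullmatch(token) is not None


def compare(samples=SAMPLES):
    """Map each sample to (strict verdict, loose verdict)."""
    return {s: (accepts(STRICT_IRIREF, s), accepts(LOOSE_IRIREF, s))
            for s in samples}


if __name__ == "__main__":
    for token, (strict, loose) in compare().items():
        print(f"{token:28} strict={strict!s:5} loose={loose}")
```

Under these assumptions the strict pattern rejects only
<http://example.org/a|b>; <a#b#c> and <a\u007Cb> pass both patterns even
though neither denotes a valid IRI.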

> Keeping in mind that not all environments will have access to an IRI 
> library, I don't think it's appropriate to allow absolutely everything 
> within the <> brackets.

I'm suggesting that parsing Turtle will be a lot easier if these
restrictions are handled upstream of the lexer (and
justified/outsourced to the IRI spec).   The current spec allows
upstream handling, but we could make it a lot more obvious, and be more
helpful about it.  The grammar is much scarier than it needs to be. 
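
One way to picture handling these restrictions outside the lexer is the
sketch below: tokenize with a loose pattern, unescape \u and \U sequences,
then check the result in a separate step. This is only an illustration under
stated assumptions. The FORBIDDEN set mirrors the characters excluded by the
Turtle 1.1 IRIREF production, the '#' count is a stand-in for real RFC 3987
validation (which it does not replace), and the names unescape_uchar,
iri_problems and scan_iris are hypothetical.

```python
import re

# Loose token pattern: anything between '<' and the next '>'.
LOOSE_IRIREF = re.compile(r'<([^>]*)>')
UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')
# Assumption: the characters the Turtle 1.1 IRIREF production excludes.
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def _decode(match):
    code_point = int(match.group(1) or match.group(2), 16)
    # Leave out-of-range escapes untouched; the stray backslash is then
    # reported by the FORBIDDEN check below.
    return chr(code_point) if code_point <= 0x10FFFF else match.group(0)


def unescape_uchar(raw):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""
    return UCHAR.sub(_decode, raw)


def iri_problems(raw):
    """Return a list of problems found after unescaping (not full RFC 3987)."""
    iri = unescape_uchar(raw)
    problems = []
    bad = FORBIDDEN.search(iri)
    if bad:
        problems.append(f"forbidden character {bad.group()!r}")
    if iri.count('#') > 1:
        problems.append("more than one '#'")
    return problems


def scan_iris(text):
    """Loosely tokenize IRI references and attach validation results."""
    return [(m.group(1), iri_problems(m.group(1)))
            for m in LOOSE_IRIREF.finditer(text)]


if __name__ == "__main__":
    doc = '<http://example.org/a> <http://example.org/a> <http://example.org/a|b>.'
    for iri, problems in scan_iris(doc):
        print(iri, problems or "ok")
```

With this split the raw <a|b> and the escaped <a\u007Cb> are reported the
same way, because the check runs on the unescaped string rather than on the
token as written.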

   -- Sandro
> 
> -Alex
> 
> 
>  
>         
>            -- Sandro
>         
>         >
>         > -Alex
>         >
>         >
>         >
>         >         Best,
>         >         Richard
>         >
>         >
>         >         On 17 May 2012, at 21:35, Sandro Hawke wrote:
>         >
>         >         > What should/may/must a Turtle parser do with a turtle
>         >         > document like this:
>         >         >
>         >         > <http://example.org/a> <http://example.org/a> <http://example.org/a|b>.
>         >         >
>         >         > By the grammar, this is not a Turtle document, because of
>         >         > the '|' character in a URI.   I don't think, however, that
>         >         > people writing Turtle parsers will want to enforce this.  If
>         >         > they come across some Turtle document that's got a URI like
>         >         > this -- they can still parse it just fine, so they probably
>         >         > will.
>         >         >
>         >         > The language tokens like IRIREF and PNAME are defined in the
>         >         > grammar with these vast regexps (if you macro-expand what's
>         >         > there, now), but actually much simpler ones will produce the
>         >         > same result in practice -- they'll just tolerate some files
>         >         > that are not, strictly speaking, Turtle.  (I'm pretty sure --
>         >         > maybe there are some corner cases with missing whitespace
>         >         > where these regexps will give you a different result than
>         >         > something more like any-character-up-until-a-delimiter.)
>         >         >
>         >         > I'm not sure anything has to change, but I think at very
>         >         > least the conformance clause should be clear about whether
>         >         > it's okay to accept a turtle document like my example above.
>         >         >
>         >         > It might be nice to have "strict" and "loose" parsers,
>         >         > especially if we can define loose parsers in a way that makes
>         >         > them simpler to implement, run faster, and never parse
>         >         > anything differently from a strict parser.
>         >         >
>         >         > Of course, then I'm not quite sure the point of the strict
>         >         > parsers.
>         >         >
>         >         >  -- Sandro
Received on Friday, 18 May 2012 14:09:01 UTC