Re: turtle conformance clause / strict-vs-loose parsing from Andy Seaborne on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Fri, 18 May 2012 14:46:48 +0100
To: public-rdf-wg@w3.org
Message-ID: <4FB652C8.6080604@epimorphics.com>
On 18/05/12 14:30, Alex Hall wrote:
> On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
>
>     Sandro,
>
>     -1 to a “loose Turtle”.
>
>     If a conforming Turtle parser were allowed to accept a document
>     containing <http://example.org/a|b>, then what next? This is not a
>     valid IRI. So it is not allowed in an RDF graph. A Turtle parser is
>     rarely a stand-alone system — it's a component in a larger system.
>     Once the Turtle parser tries passing on the pseudo-IRI to the next
>     component, then a number of things can happen:
>
>     The next component might reject it outright.
>
>     Or the next component accepts it and stores the pseudo-IRI. Then the
>     user can do their thing. Then when the user tries to save their
>     work, the serializer checks IRIs and rejects it, taking down the app
>     with an error message. (This is Jena's default behaviour, or at
>     least was the last time I checked.)
>
>     Or maybe the entire system works, except that now we have a
>     situation where certain RDF “graphs” can be loaded and saved in
>     Turtle but not in other syntaxes. This will cause major headaches
>     for users, who will end up messing around with format converters in
>     order to get broken data into a format that doesn't complain about
>     the data being broken.
>
>     Or maybe the system accepts the IRI and puts it into its store, but
>     then you can't delete it from the store any more because the SPARQL
>     Update part of the system is stricter and rejects DELETE DATA
>     commands containing broken IRIs.
>
>     Given the complexity of RDF-based systems, and the many interacting
>     components and specifications involved, this kind of error handling
>     cannot be introduced for a single syntax. It has to be done
>     centrally so that all involved components and specifications can
>     behave in a consistent way. Defining algorithms for error recovery
>     for broken RDF data may well be a good idea, but I don't think this
>     should be part of a 1.1 update to RDF, and I don't think we are
>     chartered to do it.
>
>
> +1 to all these sentiments - in practice, letting an invalid IRI into an
> RDF system will likely screw things up later down the line when
> validation is eventually applied.
>
> However, the Turtle grammar already allows the creation of invalid IRIs.
> The main purpose of the IRIREF rule is to disallow characters that are
> illegal everywhere in an IRI, but you can still construct an invalid IRI
> by either:
> 1. Using legal characters in an illegal order, e.g. <a#b#c>
> 2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb> (which
> is the escaped form of <a|b>)
>
> This was illustrated in a message to public-rdf-comments
> (http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
> which pointed out a positive parser test in the test suite which
> contained an IRI with escaped control characters. An implementation that
> parsed the resulting, unescaped IRI using an IRI library was reporting
> an error for this case. The consensus on the list was that parsing with
> an IRI library is a perfectly appropriate thing to do, and that the test
> case should be changed or removed.
>
> In light of this, I think the Turtle document should give guidance that
> the Turtle grammar alone is not sufficient to reject invalid IRIs, and
> that conforming parsers MAY (or even SHOULD) do additional validation
> against the grammar from RFC3987.
>
> -Alex
>

Agreed.

The document does sort of say that IRIs must be valid IRIs, as does
rdf-concepts, so it is a matter of how prominently to say it.

Turtle:
[[
6.2 RDF Term Constructors

production  type
IRIREF          IRI

The characters between "<" and ">" are unescaped to form the unicode
string of the IRI.
]]

So it says IRIREF produces an IRI, and hence conformance checking is
done.  It's not prominent though.

Concepts ==>
[[
An IRI (Internationalized Resource Identifier) within an RDF graph is a 
Unicode string [UNICODE] that conforms to the syntax defined in RFC 3987 
[IRI].
]]
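
Read together, the two excerpts imply a two-step check: unescape the IRIREF body, then test the resulting string against RFC 3987. A rough sketch of that in Python follows. The function names are made up for illustration, and the check is only an approximation (the characters IRIREF excludes in raw form, plus a second '#'), not a full RFC 3987 validator; a real parser would hand the string to a proper IRI library.

```python
import re

# Characters the Turtle IRIREF production excludes in raw form;
# RFC 3987 does not allow them unencoded in an IRI either.
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

# Turtle numeric escapes: \uXXXX and \UXXXXXXXX.
_UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')


def unescape_iriref(body: str) -> str:
    """Replace numeric escapes with the characters they denote.

    Hypothetical helper; escapes above U+10FFFF raise ValueError.
    """
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), body)


def looks_like_valid_iri(iri: str) -> bool:
    """Approximate post-unescape check (assumption: NOT full RFC 3987)."""
    if _FORBIDDEN.search(iri):
        return False
    # '#' introduces the fragment and may not appear again inside it.
    return iri.count('#') <= 1


if __name__ == "__main__":
    # Alex's two examples, plus one ordinary IRI for comparison.
    for raw in (r'a\u007Cb', 'a#b#c', 'http://example.org/a'):
        iri = unescape_iriref(raw)
        print(raw, '->', iri, looks_like_valid_iri(iri))
```

Both of Alex's examples match the IRIREF grammar as written, and both are rejected once the check runs on the unescaped string.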

 Andy
Received on Friday, 18 May 2012 13:47:26 UTC