Re: turtle conformance clause / strict-vs-loose parsing from Alex Hall on 2012-05-18 (public-rdf-wg@w3.org from May 2012)

From: Alex Hall <alexhall@revelytix.com>
Date: Fri, 18 May 2012 09:52:50 -0400
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: public-rdf-wg@w3.org
Message-ID: <CAFq2biyM7nb9_bMjufdC93-stbiktcsQ_ovevbU4oFQgc1DMDQ@mail.gmail.com>
On Fri, May 18, 2012 at 9:46 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:

>
>
> On 18/05/12 14:30, Alex Hall wrote:
>
>> On Fri, May 18, 2012 at 6:20 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>
>>    Sandro,
>>
>>    -1 to a “loose Turtle”.
>>
>>    If a conforming Turtle parser were allowed to accept a document
>>    containing <http://example.org/a|b>, then what next? This is not a
>>    valid IRI. So it is not allowed in an RDF graph. A Turtle parser is
>>    rarely a stand-alone system — it's a component in a larger system.
>>    Once the Turtle parser tries passing on the pseudo-IRI to the next
>>    component, then a number of things can happen:
>>
>>    The next component might reject it outright.
>>
>>    Or the next component accepts it and stores the pseudo-IRI. Then the
>>    user can do their thing. Then when the user tries to save their
>>    work, the serializer checks IRIs and rejects it, taking down the app
>>    with an error message. (This is Jena's default behaviour, or at
>>    least was the last time I checked.)
>>
>>    Or maybe the entire system works, except that now we have a
>>    situation where certain RDF “graphs” can be loaded and saved in
>>    Turtle but not in other syntaxes. This will cause major headaches
>>    for users, who will end up messing around with format converters in
>>    order to get broken data into a format that doesn't complain about
>>    the data being broken.
>>
>>    Or maybe the system accepts the IRI and puts it into its store, but
>>    then you can't delete it from the store any more because the SPARQL
>>    Update part of the system is stricter and rejects DELETE DATA
>>    commands containing broken IRIs.
>>
>>    Given the complexity of RDF-based systems, and the many interacting
>>    components and specifications involved, this kind of error handling
>>    cannot be introduced for a single syntax. It has to be done
>>    centrally so that all involved components and specifications can
>>    behave in a consistent way. Defining algorithms for error recovery
>>    for broken RDF data may well be a good idea, but I don't think this
>>    should be part of a 1.1 update to RDF, and I don't think we are
>>    chartered to do it.
>>
>>
>> +1 to all these sentiments - in practice, letting an invalid IRI into an
>> RDF system will likely screw things up later down the line when
>> validation is eventually applied.
>>
>> However, the Turtle grammar already allows the creation of invalid IRIs.
>> The main purpose of the IRIREF rule is to disallow characters that are
>> illegal everywhere in an IRI, but you can still construct an invalid IRI
>> by either:
>> 1. Using legal characters in an illegal order, e.g. <a#b#c>
>> 2. Using Unicode escapes for illegal characters, e.g. <a\u007Cb> (which
>> is the escaped form of <a|b>)
>>
>> This was illustrated in a message to public-rdf-comments
>> (http://lists.w3.org/Archives/Public/public-rdf-comments/2012Mar/0000.html),
>> which pointed out a positive parser test in the test suite which
>> contained an IRI with escaped control characters. An implementation that
>> parsed the resulting, unescaped IRI using an IRI library was reporting
>> an error for this case. The consensus on the list was that parsing with
>> an IRI library is a perfectly appropriate thing to do, and that the test
>> case should be changed or removed.
>>
>> In light of this, I think the Turtle document should give guidance that
>> the Turtle grammar alone is not sufficient to reject invalid IRIs, and
>> that conforming parsers MAY (or even SHOULD) do additional validation
>> against the grammar from RFC3987.
>>
>> -Alex
>>
>>
> Agreed.
>
> The document does sort of say that IRIs must be valid IRIs, as does
> rdf-concepts so it is a matter of how prominently to say it.
>
> Turtle:
> [[
> 6.2 RDF Term Constructors
>
> production      type
> IRIREF          IRI
>
> The characters between "<" and ">" are unescaped¹ to form the unicode
> string of the IRI.
> ]]
>
> so it says IRIREF produces an IRI and hence conformance checking is done.
>  It's prominent though.
>
>
Did you mean to say it's NOT prominent? At any rate, I think it's
debatable. That passage could also be read to mean that the only thing
required to produce an IRI is unescaping the stuff between '<' and '>'.
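
As a rough illustration of the gap between those two readings, here is a
small Python sketch. It is not taken from the Turtle draft or from any
parser. The function names are made up for this example, the character
class is my approximation of what IRIREF excludes, and the final check is
a deliberately crude stand-in for real RFC 3987 validation (it only
rejects forbidden characters and a second '#').

```python
import re

# Assumption: characters IRIREF does not allow to appear literally
# (control characters, space, and <>"{}|^`\).
RAW_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# \uXXXX and \UXXXXXXXX numeric escapes.
UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def matches_iriref(body: str) -> bool:
    """Grammar-level check on the text between '<' and '>'.

    Escapes are skipped, so whatever they decode to is not inspected.
    """
    return RAW_FORBIDDEN.search(UCHAR.sub('', body)) is None

def unescape(body: str) -> str:
    """Replace each numeric escape with the character it denotes."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), body)

def looks_like_valid_iri(iri: str) -> bool:
    """Crude post-unescape check, NOT a full RFC 3987 validator."""
    if RAW_FORBIDDEN.search(iri):
        return False
    # A fragment may not itself contain '#', so at most one '#' overall.
    return iri.count('#') <= 1

for body in (r'a\u007Cb', 'a#b#c', 'http://example.org/a'):
    iri = unescape(body)
    print(body, matches_iriref(body), looks_like_valid_iri(iri))
```

Both `a\u007Cb` and `a#b#c` pass the grammar-level check but fail the
second one, which is the case a parser following only the "unescape and
you are done" reading would let through.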

-Alex



> Concepts ==>
> [[
> An IRI (Internationalized Resource Identifier) within an RDF graph is a
> Unicode string [UNICODE] that conforms to the syntax defined in RFC 3987
> [IRI].
> ]]
>
>        Andy
>
>
Received on Friday, 18 May 2012 13:53:46 UTC