Re: Turtle Bad IRI syntax tests from Andy Seaborne on 2012-11-05 (public-rdf-wg@w3.org from November 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Mon, 05 Nov 2012 11:35:36 +0000
To: public-rdf-wg@w3.org
Message-ID: <5097A488.9040502@epimorphics.com>
On 05/11/12 00:33, Gregg Kellogg wrote:
> On Nov 4, 2012, at 4:11 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>
>> On Nov 4, 2012, at 3:52 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>>

>>> 7 Parsing
>>> Sec 7.2 says:
>>> [[
>>> The characters between "<" and ">" are unescaped¹ to form the unicode
>>> string of the IRI. Relative IRI resolution is performed per section 6.3
>>> IRI References.
>>> ]]
>>> so currently, the process of checking IRIs is inside parsing.
>>>
>>> "IRI" links to rdf-concepts / sec 3.2 which says it must conform to RFC
>>> 3987.  i.e, as the doc is now, I think it says bad IRIs are a parsing error.

And section 6.3 (in the grammar section)

>>> Gregg - what do you do about IRI?   What checking happens?
>>
>> Yeah, as it happens, my implementation is missing the unescape
>> part, which would clearly make illegal IRIs, and should cause my
>> implementation to fail when actually parsing the unescaped value as an
>> IRI. The fact that I can pass pretty much everything and have this bug,
>> indicates that we may need some evaluation tests for otherwise legal \u
>> escapes.

syn-uri-02

----
# x53 is capital S
<http://example/\u0053> <http://example/p> <http://example/o> .
----

(found when going to add a test for this :-)
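
To make the unescape-then-check step concrete, here is a minimal Python
sketch.  It is not RIOT's or RDF.rb's code: the function names are
hypothetical, and the set of rejected characters is an assumption, taken
from the characters the Turtle IRIREF production excludes in raw form.
A full RFC 3987 check would be considerably stricter.

```python
import re

# Matches the numeric escapes \uXXXX and \UXXXXXXXX.
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Assumption: characters rejected after unescaping are the controls and
# space (U+0000-U+0020) plus < > " { } | ^ ` and backslash.
_ILLEGAL = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def unescape_iri(raw: str) -> str:
    """Replace each \\u / \\U escape with the character it denotes."""
    def repl(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return _UCHAR.sub(repl, raw)


def check_iri(raw: str) -> str:
    """Unescape first, then reject the IRI if an illegal character appears."""
    iri = unescape_iri(raw)
    bad = _ILLEGAL.search(iri)
    if bad:
        raise ValueError(f"illegal character {bad.group()!r} in IRI {iri!r}")
    return iri
```

With this sketch, check_iri(r"http://example/\u0053") returns
"http://example/S", while an IRI containing \u0020 is rejected only
after unescaping, which is the case a syntax-only test would miss.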

Producing tests at this stage is cheap, so if anything looks like it's
worth covering and isn't, let's do it.  It gets expensive when we go to
the community for reports because adding tests then is asking people to 
resubmit results.

> Okay, I take that back, the IRIs _are_ unencoded, but the underlying IRI library does not consider IRIs with these characters (i.e., not RFC3987 ipchar) as being invalid. So, this requires an update to the underlying RDF::URI class in the core RDF library.

This is an illustration of a potential problem - lots of systems are not 
going to check IRIs.  Comprehensive and picky IRI libraries are rare 
(kudos to Jeremy for writing the Jena one, which is standalone and an
education in what is legal in IRIs, but not much use to Ruby).  And
sometimes systems don't check because they have to live with bad IRIs.

We may need to mark or split out IRI tests as being "rigorous" (or 
"pedantic") as well as a normal suite.

Another example I have - RIOT/N-Triples currently accepts ''@lang strings
because it's the same tokenizer as Turtle and it currently loses the
distinction between '' and "" early (to be fixed): "be generous with what
you accept".
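
As an illustration of keeping that distinction, a tokenizer can record
which delimiter it saw so that a stricter language can reject it later.
This is a hypothetical Python sketch, not the RIOT tokenizer; the names
are invented for the example and string escapes are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringToken:
    value: str
    quote: str  # delimiter seen in the source: '"' or "'"


def read_string(text: str, pos: int = 0) -> StringToken:
    """Read a short quoted string starting at pos, remembering its delimiter."""
    quote = text[pos]
    if quote not in ('"', "'"):
        raise ValueError("not at the start of a string")
    end = text.find(quote, pos + 1)  # simplification: no escape handling
    if end == -1:
        raise ValueError("unterminated string")
    return StringToken(text[pos + 1:end], quote)


def accept_ntriples_string(token: StringToken) -> str:
    """N-Triples only allows double-quoted strings; Turtle allows both."""
    if token.quote != '"':
        raise ValueError("N-Triples strings must use double quotes")
    return token.value
```

The shared tokenizer stays generous; the strictness lives in the
per-language check.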

>> I think that it's arguable that these should be bad syntax tests, as they are clearly legal WRT the grammar, but they should be illegal when evaluated, so I'd vote to change them to 1) above.
>>
>>> Disclosure:
>>>
>>> RIOT unescapes \u as they are seen, then resolves relative IRIs in the
>>> parser.
>>
>> I'll do the same thing.
>>
>>>>
>>>
>>> The tests named "test-*" are the test from the turtle submission,
>>> cleaned up ... our first decision should be whether we want to carry
>>> them over as is (cleaned up - same issues about illegal characters in
>>> IRIs as [1]/[2]/[3]) or rewritten into the other suite as evaluation tests.
>>>
>>>> test-19.ttl [4] includes illegal characters in IRIs: ", {, |, and }
>>>>
>>>> tests 14-16 either take too long to run to be useful, or are just too stressful of my implementation. I would be happy if they were excluded.
>>>
>>> tests 14,15,16 are all 10K triples -- I don't care whether they are part
>>> of the suite or not.  They cause a slight pause in RIOT while
>>> running but not unacceptable (to me).
>>
>> Well, they cause more than a slight pause for me, and I think that test-14 may lead to a stack overflow, as I pretty much just evaluate the grammar directly, rather than try to do tail-recursion folding.

I think that's why dajobe put them in.
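
A generic illustration of the failure mode, in Python: a walk that
recurses once per statement needs stack depth proportional to the input,
while a loop does not.  This is an assumed, simplified shape, not a
claim about how any particular parser is written or about what test-14
actually contains.

```python
def count_recursive(statements, i=0):
    """Right-recursive walk: one stack frame per statement."""
    if i == len(statements):
        return 0
    return 1 + count_recursive(statements, i + 1)


def count_iterative(statements):
    """Same result with a loop, so stack depth stays constant."""
    n = 0
    for _ in statements:
        n += 1
    return n


triples = ["<s> <p> <o> ."] * 10_000
print(count_iterative(triples))
try:
    count_recursive(triples)
except RecursionError:
    # Expected under CPython's default recursion limit.
    print("recursive walk ran out of stack")
```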

I can well imagine that parsers for browsers and scripting environments
parse to a collection of triples and hand the collection back to the caller.

Rather than tweak the submission test cases, I prefer (+0.5) to put
eval and non-eval tests into the new suite, leaving the submission
(cleaned up) alone.

Maybe put any 10K tests in the "rigorous" category.

	Andy


>>
>> Gregg
>>
>>>>
>>>> Gregg Kellogg
>>>> gregg@greggkellogg.net
>>>>
>>>> [1] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-02.ttl
>>>> [2] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-05.ttl
>>>> [3] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-06.ttl
>>>> [4] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/TurtleSubm/test-29.ttl
>>>>
>>>>
>>>
>>
>>
>
>
Received on Monday, 5 November 2012 11:36:04 UTC