Re: Turtle Bad IRI syntax tests from Gregg Kellogg on 2012-11-05 (public-rdf-wg@w3.org from November 2012)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sun, 4 Nov 2012 19:33:05 -0500
To: Andy Seaborne <andy.seaborne@epimorphics.com>
CC: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
Message-ID: <B879806E-025D-489E-BBB6-399007E66256@greggkellogg.net>
On Nov 4, 2012, at 4:11 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Nov 4, 2012, at 3:52 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
> 
>> Hi Gregg,
>> 
>> On 03/11/12 23:33, Gregg Kellogg wrote:
>>> The following tests in the Turtle Syntax Tests look for a parser error, but I think they're actually correct syntax:
>>> 
>>> syn-bad-uri-02 [1]
>>> # Bad IRI : bad escape
>>> <http://example/\u0020> <http://example/p> <http://example/o> .
>>> 
>>> syn-bad-uri-05 [2]
>>> # Bad IRI : hex 3C is <
>>> <http://example/\u003C> <http://example/p> <http://example/o> .
>>> 
>>> syn-bad-uri-06 [3]
>>> # Bad IRI : hex 3E is >
>>> <http://example/\u003E> <http://example/p> <http://example/o> .
>>> 
>>> The Turtle grammar allows any Unicode escape to be part of the IRI, and does not restrict escapes that decode to characters which would be illegal if written unescaped.
>>> 
>>> [19]	IRIREF	::=	'<' ([^#x00-#x20<>\"{}|^`\] | UCHAR)* '>'
>>> [27]	UCHAR	::=	'\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
>>> 
>>> I think these should be good syntax tests. If that is the case, my processor now passes all of the RIOT Turtle and TurtleSubm tests except the following:
>> 
>> These tests come down to how illegal IRIs are illegal.
>> 
>> I don't care what we decide - providing we decide something.  I'd be 
>> happy with any of:
>> 
>> 1/ Change them to be bad evaluation tests
>> 2/ Remove the tests -- avoid the problem by not pushing things too hard
>> 3/ Bad syntax tests
>> 
>> I do think discouraging illegal IRIs is good because they tend to just 
>> cause problems somewhere down the line if let into an application, e.g. 
>> when passing RDF on to other systems which are strict(er).
>> 
>> It also seems reasonable to have lightweight systems that don't check, 
>> or check lightly.
>> 
>> Checking the text, I found:
>> 
>> 7 Parsing
>> Sec 7.2 says:
>> [[
>> The characters between "<" and ">" are unescaped¹ to form the unicode 
>> string of the IRI. Relative IRI resolution is performed per section 6.3 
>> IRI References.
>> ]]
>> so currently, the process of checking IRIs is inside parsing.
>> 
>> "IRI" links to rdf-concepts / sec 3.2 which says it must conform to RFC 
>> 3987.  I.e., as the doc is now, I think it says bad IRIs are a parsing error.
>> 
>> Gregg - what do you do about IRI?   What checking happens?
> 
> Yeah, as it happens, my implementation is missing the unescape part. Unescaping would clearly produce illegal IRIs, and should cause my implementation to fail when actually parsing the unescaped value as an IRI. The fact that I can pass pretty much everything and still have this bug indicates that we may need some evaluation tests for otherwise legal \u escapes.

Okay, I take that back: the IRIs _are_ unescaped, but the underlying IRI library does not treat IRIs containing these characters (i.e., characters that are not RFC 3987 ipchar) as invalid. So, this requires an update to the underlying RDF::URI class in the core RDF library.
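
To make the two stages concrete, here is a minimal sketch in Python (not the RDF::URI code). It assumes the lexical check follows rules [19] and [27] as quoted in this thread, and it uses the raw-character exclusions from [19] as a stand-in for real RFC 3987 validation, which is stricter. The names check_iriref, unescape, and ILLEGAL are made up for illustration.

```python
import re

# UCHAR per rule [27]: \u + 4 hex digits, or \U + 8 hex digits.
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'

# IRIREF per rule [19]: raw characters outside the excluded set, or UCHAR.
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

# Assumption: reuse the excluded set from [19] as the post-unescape check.
# A real implementation would validate against RFC 3987 instead.
ILLEGAL = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def unescape(body):
    """Replace each \\uXXXX or \\UXXXXXXXX escape with its character."""
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)


def check_iriref(token):
    """Return the unescaped IRI string, or raise ValueError."""
    # Stage 1: purely lexical, this is all the grammar asks for.
    if not IRIREF.fullmatch(token):
        raise ValueError("syntax error: not an IRIREF token")
    # Stage 2: unescape, then look at the characters that result.
    iri = unescape(token[1:-1])
    bad = ILLEGAL.search(iri)
    if bad:
        raise ValueError("illegal character U+%04X in IRI" % ord(bad.group(0)))
    return iri
```

With this sketch, the subjects of syn-bad-uri-02, -05, and -06 all get past stage 1 and are only rejected in stage 2, which is the behaviour option 1/ would test for.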

Gregg

> I think it's debatable whether these should be bad syntax tests: they are clearly legal with respect to the grammar, but they should be illegal when evaluated, so I'd vote for option 1/ above.
> 
>> Disclosure:
>> 
>> RIOT unescapes \u as they are seen, then resolves relative IRIs in the 
>> parser.
> 
> I'll do the same thing.
> 
>>> 
>> 
>> The tests named "test-*" are the tests from the Turtle submission, 
>> cleaned up ... our first decision should be whether we want to carry 
>> them over as is (cleaned up - same issues about illegal characters in 
>> IRIs as [1]/[2]/[3]) or rewritten into the other suite as evaluation tests.
>> 
>>> test-29.ttl [4] includes illegal characters in IRIs: ", {, |, and }
>>> 
>>> tests 14-16 either take too long to run to be useful, or are just too stressful of my implementation. I would be happy if they were excluded.
>> 
>> tests 14,15,16 are all 10K triples -- I don't care whether they are part 
>> of the suite or not.  They cause a slight pause in RIOT while 
>> running but not unacceptable (to me).
> 
> Well, they cause more than a slight pause for me, and I think that test-14 may lead to a stack overflow, as I pretty much just evaluate the grammar directly, rather than try to do tail-recursion folding.
> 
> Gregg
> 
>>> 
>>> Gregg Kellogg
>>> gregg@greggkellogg.net
>>> 
>>> [1] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-02.ttl
>>> [2] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-05.ttl
>>> [3] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-06.ttl
>>> [4] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/TurtleSubm/test-29.ttl
>>> 
>>> 
>> 
> 
>
Received on Monday, 5 November 2012 00:33:42 UTC