Re: Turtle Bad IRI syntax tests from Andy Seaborne on 2012-11-04 (public-rdf-wg@w3.org from November 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Sun, 04 Nov 2012 11:52:28 +0000
To: public-rdf-wg@w3.org
Message-ID: <509656FC.5050900@epimorphics.com>

Hi Gregg,

On 03/11/12 23:33, Gregg Kellogg wrote:
> The following tests in the Turtle Syntax Tests look for a parser error, but I think they're actually correct syntax:
>
> syn-bad-uri-02 [1]
> # Bad IRI : bad escape
> <http://example/\u0020> <http://example/p> <http://example/o> .
>
> syn-bad-uri-05 [2]
> # Bad IRI : hex 3C is <
> <http://example/\u003C> <http://example/p> <http://example/o> .
>
> syn-bad-uri-06 [3]
> # Bad IRI : hex 3E is >
> <http://example/\u003E> <http://example/p> <http://example/o> .
>
> The Turtle Grammar allows any unicode escape to be part of the IRI, and is not restrictive of escapes that match what would be illegal if they are unescaped.
>
> [19]	IRIREF	::=	'<' ([^#x00-#x20<>\"{}|^`\] | UCHAR)* '>'
> [27]	UCHAR	::=	'\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
>
> I think these should be good syntax tests. If that is the case, my processor now passes all of the RIOT Turtle and TurtleSubm tests except the following:

These tests come down to how, and at which stage, illegal IRIs are rejected.

I don't care what we decide - providing we decide something.  I'd be 
happy with any of:

1/ Change them to be bad evaluation tests
2/ Remove the tests -- avoid the problem by not pushing things too hard
3/ Bad syntax tests

I do think discouraging illegal IRIs is good because they tend to just 
cause problems somewhere down the line if let into an application, e.g. 
when passing RDF on to other systems which are strict(er).

It also seems reasonable to have lightweight systems that don't check, 
or check lightly.

Checking the text, I found:

7 Parsing
Sec 7.2 says:
[[
The characters between "<" and ">" are unescaped¹ to form the unicode 
string of the IRI. Relative IRI resolution is performed per section 6.3 
IRI References.
]]
so currently, the process of checking IRIs is inside parsing.

"IRI" links to rdf-concepts / sec 3.2, which says it must conform to RFC 
3987, i.e., as the doc is now, I think it says bad IRIs are a parsing error.

Gregg - what do you do about IRIs?   What checking happens?

Disclosure:

RIOT unescapes \u sequences as they are seen, then resolves relative IRIs 
in the parser.
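
To make the distinction concrete, here is a rough Python sketch (not 
RIOT's code, just an illustration) of the two stages under discussion: 
first match the IRIREF production as quoted above, then unescape and 
look at the resulting string.  The names check_iriref, unescape and 
FORBIDDEN are made up for this example, and the post-unescape check 
only re-applies the grammar's raw character exclusions; it is not full 
RFC 3987 validation.

```python
import re

# UCHAR and IRIREF transcribed from productions [27] and [19] quoted above.
UCHAR = r'(?:\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})'
IRIREF = re.compile(r'<((?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*)>')

# Characters the grammar excludes in raw (unescaped) form inside <...>.
# Assumption for this sketch: treat the same set as illegal after unescaping.
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def unescape(body):
    """Replace each \\uXXXX or \\UXXXXXXXX escape with the character it names.

    A \\U escape above U+10FFFF would raise ValueError; not handled here.
    """
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)


def check_iriref(token):
    """Classify a single <...> token in two stages."""
    m = IRIREF.fullmatch(token)
    if m is None:
        # Stage 1: the token does not match the grammar at all.
        return "syntax error"
    iri = unescape(m.group(1))
    if FORBIDDEN.search(iri):
        # Stage 2: grammar-legal, but the unescaped string has a bad character.
        return "bad IRI after unescaping"
    return "ok"


if __name__ == "__main__":
    for tok in (r"<http://example/\u0020>",   # syn-bad-uri-02
                r"<http://example/\u003C>",   # syn-bad-uri-05
                r"<http://example/\u003E>",   # syn-bad-uri-06
                r"<http://example/p>"):
        print(tok, "->", check_iriref(tok))
```

With this split, the three syn-bad-uri inputs pass stage 1 and fail 
only at stage 2, which is exactly the point where we need to decide 
whether the failure counts as a syntax error or something else.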

>

The tests named "test-*" are the tests from the Turtle submission, 
cleaned up.  Our first decision should be whether we want to carry 
them over as is (cleaned up, with the same issues about illegal characters 
in IRIs as [1]/[2]/[3]) or rewrite them into the other suite as evaluation 
tests.

> test-19.ttl [4] includes illegal characters in IRIs: ", {, |, and }
>
> tests 14-16 either take too long to run to be useful, or are just too stressful of my implementation. I would be happy if they were excluded.

Tests 14, 15 and 16 are all 10K triples -- I don't care whether they are 
part of the suite or not.  They cause a slight pause in RIOT while 
running, but not an unacceptable one (to me).

>
> Gregg Kellogg
> gregg@greggkellogg.net
>
> [1] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-02.ttl
> [2] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-05.ttl
> [3] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-06.ttl
> [4] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/TurtleSubm/test-29.ttl
>
>
Received on Sunday, 4 November 2012 11:52:58 UTC