Re: Turtle Bad IRI syntax tests from Andy Seaborne on 2012-11-07 (public-rdf-wg@w3.org from November 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 07 Nov 2012 09:00:05 +0000
To: public-rdf-wg@w3.org
Message-ID: <509A2315.3020307@epimorphics.com>
On 06/11/12 01:50, Gregg Kellogg wrote:
> Update, after implementing the RFC3987 grammar to test URIs for conformance, I'm now passing all SubM and Turtle tests except for subm-test-29. Tests 14-16 run just fine in normal mode. It ends up that in my typical test running, I turn on processor logging to help debug after the fact. Turning this off allows me to run all these in a fairly short amount of time.

I've commented out test-subm-29.

	Andy

> My next task is to port over the RDFa EARL reporting environment, along with a generic test runner that can be used to create such reports. I'll add that to the mercurial repo.
>
> Gregg Kellogg
> gregg@greggkellogg.net
>

For those following along at home, test-subm-29 is:

--------------------------
## Unusual URIs

## <http://example.org/node> <http://example.org/prop> 
<scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F 
!"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F> 
.


## No \r \t \n space or bad % (%25 instead), no control chars
<http://example.org/node> <http://example.org/prop> 
<scheme:!"$%25&'()*+,-./0123456789:/@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{|}~?#> 
.

## Same - but with no unwise chars
<http://example.org/node> <http://example.org/prop> 
<scheme:!$%25&'()*+,-./0123456789:/@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~?#> 
.
--------------------------

The first, commented out, triple contains many illegal URI chars.
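
To make that concrete, here is a rough sketch (not taken from RIOT or any
other parser in this thread; the function names are made up for
illustration). It decodes only the numeric \uXXXX and \UXXXXXXXX escapes,
not \t, \n or \r, and then lists any C0 controls, space, or DEL left in
the string, none of which may appear literally in a URI:

```python
import re

# Matches only the numeric escapes \uXXXX and \UXXXXXXXX (hypothetical helper).
_UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def unescape_uchar(text):
    """Replace each \\uXXXX or \\UXXXXXXXX escape with the character it names."""
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def control_chars(iri):
    """Return the code points of C0 controls, space, and DEL found in iri."""
    return [ord(ch) for ch in iri if ord(ch) <= 0x20 or ord(ch) == 0x7F]

# A shortened fragment in the style of the commented-out triple above.
fragment = r'scheme:\u0001\u0002\u0003 \u007F'
decoded = unescape_uchar(fragment)
print(control_chars(decoded))
```

The same helper turns the syn-uri-02 subject <http://example/\u0053> into
<http://example/S>, which is a harmless escape.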

The second version is the characters reduced to those legal in RDF URI 
References but not necessarily IRIs.

It includes characters that are "unwise" in RFC 2396 and do not appear in
RFC 3986 (except "[" and "]", which moved to "reserved"; http URIs only
use them for IPv6 addresses).

RFC 2396:
unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

RFC 3986:
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"

which means characters { | } are permitted in (RDF-2004) RDF URI 
References, XML system identifiers, and XML Schema anyURIs but not in
URIs/IRIs.
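
A small sketch of that comparison, using only the two productions quoted
above (the function name is hypothetical, and this is an illustration
rather than a full URI checker). It reports the RFC 2396 "unwise"
characters in a string that RFC 3986 did not move into gen-delims:

```python
# RFC 2396 "unwise" set, copied from the production quoted above.
RFC2396_UNWISE = set('{}|\\^[]`')
# RFC 3986 gen-delims, copied from the production quoted above.
RFC3986_GEN_DELIMS = set(':/?#[]@')

def dropped_unwise_chars(iri):
    """Return, in order of first appearance, the RFC 2396 unwise characters
    in iri that are not RFC 3986 gen-delims."""
    seen = []
    for ch in iri:
        if ch in RFC2396_UNWISE and ch not in RFC3986_GEN_DELIMS and ch not in seen:
            seen.append(ch)
    return seen

# Objects of the second and third triples in test-subm-29.
second = "scheme:!\"$%25&'()*+,-./0123456789:/@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{|}~?#"
third = "scheme:!$%25&'()*+,-./0123456789:/@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~?#"

print(dropped_unwise_chars(second))
print(dropped_unwise_chars(third))
```

For the second triple this picks out { | }, and for the third it finds
nothing.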

	Andy

> On Nov 5, 2012, at 3:35 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>
>>
>>
>> On 05/11/12 00:33, Gregg Kellogg wrote:
>>> On Nov 4, 2012, at 4:11 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>>>
>>>> On Nov 4, 2012, at 3:52 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>>>>
>>
>>>>> 7 Parsing
>>>>> Sec 7.2 says:
>>>>> [[
>>>>> The characters between "<" and ">" are unescaped¹ to form the unicode
>>>>> string of the IRI. Relative IRI resolution is performed per section 6.3
>>>>> IRI References.
>>>>> ]]
>>>>> so currently, the process of checking IRIs is inside parsing.
>>>>>
>>>>> "IRI" links to rdf-concepts / sec 3.2 which says it must conform to RFC
>>>>> 3987.  i.e, as the doc is now, I think it says bad IRIs are a parsing error.
>>
>> And section 6.3 (in the grammar section)
>>
>>>>> Gregg - what do you do about IRI?   What checking happens?
>>>>
>>>> Yeah, as it happens, my implementation is missing the unescape
>>>> part, which would clearly make illegal IRIs, and should cause my
>>>> implementation to fail when actually parsing the unescaped value as an
>>>> IRI. The fact that I can pass pretty much everything and have this bug
>>>> indicates that we may need some evaluation tests for otherwise legal \u
>>>> escapes.
>>
>> syn-uri-02
>>
>> ----
>> # x53 is capital S
>> <http://example/\u0053> <http://example/p> <http://example/o> .
>> ----
>>
>> (found when going to add a test for this :-)
>>
>> Producing tests at this stage is cheap so if anything looks like it's
>> worth covering and isn't, let's do it.  It gets expensive when we go to
>> the community for reports because adding tests then is asking people to
>> resubmit results.
>>
>>> Okay, I take that back, the IRIs _are_ unencoded, but the underlying IRI library does not consider IRIs with these characters (i.e., not RFC3987 ipchar) as being invalid. So, this requires an update to the underlying RDF::URI class in the core RDF library.
>>
>> This is an illustration of a potential problem - lots of systems are not
>> going to check IRIs.  Comprehensive and picky IRI libraries are rare
>> (kudos to Jeremy for writing the Jena one, which is standalone, an
>> education in what is legal in IRIs, but not much use to Ruby).  And
>> sometimes they don't because they have to live with bad IRIs.
>>
>> We may need to mark or split out IRI tests as being "rigorous" (or
>> "pedantic") as well as a normal suite.
>>
>> Another example I have - RIOT/N-Triples currently accepts ''@lang strings
>> because it's the same tokenizer as Turtle and it currently loses the
>> distinction of '' and "" early (to be fixed): "be generous with what you
>> accept".
>>
>>>> I think that it's arguable that these should be bad syntax tests, as they are clearly legal WRT the grammar, but they should be illegal when evaluated, so I'd vote to change them to 1) above.
>>>>
>>>>> Disclosure:
>>>>>
>>>>> RIOT unescapes \u as they are seen, then resolves relative IRIs in the
>>>>> parser.
>>>>
>>>> I'll do the same thing.
>>>>
>>>>>>
>>>>>
>>>>> The tests named "test-*" are the tests from the Turtle submission,
>>>>> cleaned up ... our first decision should be whether we want to carry
>>>>> them over as is (cleaned up - same issues about illegal characters in
>>>>> IRIs as [1]/[2]/[3]) or rewritten into the other suite as evaluation tests.
>>>>>
>>>>>> test-19.ttl [4] includes illegal characters in IRIs: ", {, |, and }
>>>>>>
>>>>>> tests 14-16 either take too long to run to be useful, or are just too stressful of my implementation. I would be happy if they were excluded.
>>>>>
>>>>> tests 14,15,16 are all 10K triples -- I don't care whether they are part
>>>>> of the suite or not.  They cause a slight pause in RIOT while
>>>>> running, but it is not unacceptable (to me).
>>>>
>>>> Well, they cause more than a slight pause for me, and I think that test-14 may lead to a stack overflow, as I pretty much just evaluate the grammar directly, rather than try to do tail-recursion folding.
>>
>> I think that's why dajobe put them in.
>>
>> I can well imagine that parsers for browsers and scripting environments
>> parse to a collection of triples and hand the collection back to the caller.
>>
>> Rather than tweak the submission tests cases, I prefer (+0.5) to put in
>> eval and non-eval tests into the new suite leaving the submission
>> (cleaned up) alone.
>>
>> Maybe put any 10K tests in the "rigorous" category.
>>
>> 	Andy
>>
>>
>>>>
>>>> Gregg
>>>>
>>>>>>
>>>>>> Gregg Kellogg
>>>>>> gregg@greggkellogg.net
>>>>>>
>>>>>> [1] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-02.ttl
>>>>>> [2] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-05.ttl
>>>>>> [3] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/Turtle/syn-bad-uri-06.ttl
>>>>>> [4] http://svn.apache.org/repos/asf/jena/Experimental/riot-reader/testing/RIOT/Lang/TurtleSubm/test-29.ttl
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
Received on Wednesday, 7 November 2012 09:00:36 UTC