Re: test-29: special characters in Turtle IRIs

On Fri, Mar 2, 2012 at 2:19 AM, Henry Story <henry.story@bblfish.net> wrote:

> Pretty much the only positive test that currently fails for me
> consistently across Jena, Sesame and my
> implementation is Test-29.ttl [1], which contains the following statement:
>
> <http://example.org/node> <http://example.org/prop>
> <scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F
> !"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F>
> .
>
> This is causing the Apache Abdera IRI [2] library to barf. It looks like
> they put a lot of energy into this library, and so that's made me wonder
> where the error lies. This can be reproduced like this on the Scala console:
>
> scala> import org.apache.abdera.i18n.iri._
> scala> val iriStr =
> "scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019"
> [line elided for control chars: possibly a scala signature]
> scala> val iriStr2 = "\u001A\u001B\u001C\u001D\u001E\u001F
> !\"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F"
> [line elided for control chars: possibly a scala signature]
> scala> val iri = iriStr + iriStr2
> scala> val i = new IRI(iri)
> org.apache.abdera.i18n.iri.IRISyntaxException:
> org.apache.abdera.i18n.text.InvalidCharacterException: Invalid Character
> 0x1(?)
>        at org.apache.abdera.i18n.iri.IRI.parse(IRI.java:577)
>        at org.apache.abdera.i18n.iri.IRI.<init>(IRI.java:64)
>       ...
>
>
> I looked at http://tools.ietf.org/html/rfc3987 to see what the spec said
> there, but I don't think those characters are
> allowed. Can I remove this from the examples? What should I replace it
> with that would test the spec? Should we move this
> one to a bad-test?
>

This is another case of syntactically valid Turtle that is not valid
RDF. The IRI in question contains Unicode-escaped control characters. All
Unicode escape sequences are allowed in Turtle, but once the parser
unescapes them, the result is a syntactically invalid IRI (and therefore
not valid for RDF): RFC 3987 does not allow control characters in an IRI.
Any conforming IRI parser will reject this particular IRI.
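
To make the two stages concrete, here is a minimal sketch of "unescape, then
check". The object IriCheck and its methods are hypothetical names, not taken
from Jena, Sesame, or Abdera. It handles only \uXXXX escapes, and the set of
rejected characters is my reading of the RFC 3987 grammar rather than a
complete validator.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration; not part of any existing library.
object IriCheck {
  // Matches a \uXXXX escape (backslash, 'u', four hex digits).
  private val UnicodeEscape: Regex = """\\u([0-9A-Fa-f]{4})""".r

  // Printable ASCII characters assumed to have no production in the
  // RFC 3987 grammar when they appear unencoded.
  private val forbidden: Set[Char] =
    Set('<', '>', '"', '{', '}', '|', '\\', '^', '`')

  // Stage 1: replace each \uXXXX escape with the character it denotes.
  def unescape(s: String): String =
    UnicodeEscape.replaceAllIn(s, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))

  // Stage 2: list the code points that make the unescaped string an invalid IRI.
  def invalidCodePoints(iri: String): List[Int] =
    iri.filter(c => c.isControl || c == ' ' || forbidden(c)).toList.map(_.toInt)
}
```

Calling IriCheck.invalidCodePoints(IriCheck.unescape(...)) on a string that
still contains the escapes \u0001 and \u003E reports both characters, while
the escaped form itself contains nothing the check would flag.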

I think it's perfectly reasonable to incorporate an IRI parser into a
Turtle parser, both for validation and for resolving relative IRIs against
@base. For this reason, I think it's good practice to keep the positive
parser tests in the realm of valid RDF, not just syntactically valid
Turtle.
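
As a rough illustration of a parser doing both jobs at once, the sketch below
uses java.net.URI as a stand-in. That class is a URI parser rather than a full
RFC 3987 IRI parser, so treat it as an approximation; BaseResolution and
resolve are hypothetical names.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper for illustration; java.net.URI stands in for an IRI parser.
object BaseResolution {
  // Parses both strings (which validates them) and resolves ref against base.
  // A syntax error in either string is captured as a Failure.
  def resolve(base: String, ref: String): Try[URI] =
    Try(new URI(base).resolve(new URI(ref)))
}
```

With a base of http://example.org/dir/doc, the relative reference node resolves
to http://example.org/dir/node, whereas a reference containing a raw U+0001
fails to parse before resolution is attempted.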

Regards,
Alex



>
>        Henry
>
>
> [1]  http://www.w3.org/TR/turtle/tests/test-29.ttl
> [2]
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.abdera/abdera-i18n/1.1.2/org/apache/abdera/i18n/iri/IRI.java
>     http://abdera.apache.org/
>
>
>
> Social Web Architect
> http://bblfish.net/
>
>
>

Received on Friday, 2 March 2012 14:24:25 UTC