Re: test-29: special characters in Turtle IRIs from Alex Hall on 2012-03-05 (public-rdf-comments@w3.org from March 2012)

From: Alex Hall <alexhall@revelytix.com>
Date: Mon, 5 Mar 2012 10:37:39 -0500
To: Henry Story <henry.story@bblfish.net>
Cc: David Robillard <d@drobilla.net>, public-rdf-comments@w3.org
Message-ID: <CAFq2bizFrJ25VVZ624usDHqcX7ydPF3YtNgPWN2LYwPwStFM2A@mail.gmail.com>
On Sun, Mar 4, 2012 at 4:13 PM, Henry Story <henry.story@bblfish.net> wrote:

>
> On 3 Mar 2012, at 23:20, David Robillard wrote:
>
> > On Fri, 2012-03-02 at 08:19 +0100, Henry Story wrote:
> >> pretty much the only positive test that fails for me at present
> >> consistently across Jena, Sesame and my implementation is
> >> Test-29.ttl [1] which contains the following statement
> >>
> >> <http://example.org/node> <http://example.org/prop>
> <scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F
> !"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F>
> .
> >>
> >> This is causing the Apache Abdera IRI [2] library to barf. It looks
> >> like they put a lot of energy into this library, and so that's made me
> >> wonder where the error lies. This can be reproduced like this on the
> >> Scala console
> >
> > This test always puzzled me a bit, since as far as I can tell \u escapes
> > like this in an IRI are not valid, but a Turtle/SPARQL-specific thing.
> >
> > This is a bit of a devil's advocate question, since I'd rather not
> > implement two escape mechanisms when one will do, but shouldn't percent
> > encoding be used to escape things in URIs/IRIs?  Can other software be
> > expected to actually understand URIs like this, or is it
> > intended/desirable that machine processing would have to happen before
> > they can be 'exported'?
>

Numeric Unicode escape sequences (\uxxxx) and percent-encoding serve two
different purposes.

Percent-encoding sequences (%xx) are part of the IRI/URI specs, and allow
you to encode characters, e.g. into the path section of an IRI, that would
otherwise be illegal in that position. For instance, if you have a file
pathname that contains a space -- "/tmp/foo bar.txt" -- then you must use
percent encoding to turn this into an IRI because spaces are not allowed
anywhere in an IRI. So the resulting IRI would be
"file:/tmp/foo%20bar.txt". Turtle allows percent-encoding sequences in IRIs
and the local part of prefixed names as part of the grammar, but these are
not processed as part of Turtle parsing. Decoding percent-encoded
characters in an IRI turns it into a new IRI:
<http://example.com/foo%62ar.html> is NOT the same IRI as
<http://example.com/foobar.html>. Percent-encoding characters that are
allowed at their position in an IRI is discouraged, so use of the
first IRI would be considered bad practice.
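
As a rough illustration of the two points above, here is a minimal Python sketch. It is not part of Turtle or of any RDF library; it only uses the standard urllib.parse module, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# Encoding a filesystem path for use in a file: IRI; the space becomes %20.
path = "/tmp/foo bar.txt"
file_iri = "file:" + quote(path)

# Two IRIs that differ only by a percent-encoded "b" (%62).
encoded = "http://example.com/foo%62ar.html"
plain = "http://example.com/foobar.html"

# A Turtle parser keeps %xx as-is, so the IRIs compare as different strings.
same_iri = encoded == plain

# Only an application that chooses to decode would see them coincide.
same_after_decoding = unquote(encoded) == plain
```

The comparison is deliberately a plain string comparison, since that is effectively how the two IRIs are told apart once parsing is done.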

Unicode escapes are allowed in IRIs and strings, primarily to allow Turtle
authors to write Unicode characters in other languages/alphabets where they
don't have good keyboard or font support. If I need to write a Japanese
character with my US keyboard, I can either (a) copy-and-paste from some
Unicode table that I've found online, or (b) use a \uxxxx escape sequence.
Unicode escapes are processed as part of Turtle parsing, so the resulting
IRI or string contains the escaped character, not the \uxxxx sequence. If
you use a Unicode escape inside an IRI, the escaped character must be legal
at that position (which is why this test was failing -- many of the escaped
characters, such as the control characters, are illegal in an IRI). Strictly
speaking, Unicode escapes aren't entirely necessary since Turtle supports
Unicode natively. You can't express anything with Unicode escapes that you
couldn't otherwise; it's more a matter of convenience for authors.
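
To make the parsing step concrete, here is a hedged Python sketch of what a parser might do with the text between < and >. The function names are hypothetical, the set of forbidden characters is an assumption taken from the IRIREF production of the Turtle 1.1 grammar, and only numeric escapes (\uXXXX and \UXXXXXXXX) are handled.

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Assumption: characters excluded from IRIREF in the Turtle 1.1 grammar,
# i.e. U+0000 through U+0020 plus < > " { } | ^ ` and backslash.
_FORBIDDEN = set('<>"{}|^`\\') | {chr(c) for c in range(0x00, 0x21)}

def decode_uchar(text):
    """Replace numeric Unicode escapes with the characters they denote."""
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def illegal_chars(iri_body):
    """Return the characters that are not allowed in an IRI after decoding."""
    return [c for c in decode_uchar(iri_body) if c in _FORBIDDEN]
```

For example, decode_uchar(r"http://example.org/\u65E5") yields the IRI with the actual Japanese character, while illegal_chars(r"scheme:\u0001\u003E") reports both U+0001 and ">" as illegal. Percent sequences such as %20 pass through untouched.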

We recognize that the description of character escapes in Turtle has been
confusing, and the editor has been working on new text to clarify the various
types of escapes.


>
> As I understand it, \u encoding is the Turtle encoding of IRIs. The IRIs
> don't have those characters but the UTF-8 equivalent. Depending on the
> type of the document you will encode IRIs in different ways.
>

Correct.


>
> So once the transformation from Turtle to IRIs has been made, %xx-encoded
> numbers do not get interpreted again, but are just the string %xx.


Correct.


> If you transformed that IRI into a URI for consumption by some other
> format you would need to escape the % character somehow.
>

Well, decoding of percent-encoded characters would not occur in IRI to URI
translation -- if a percent-encoded character is illegal at a given
position in an IRI, then it will also be illegal at that position in a URI.
But yes, an application that processes IRIs or URIs, e.g. to translate into
filesystem paths, would need to process the percent encodings. This is
obviously outside the scope of Turtle.
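
A small Python sketch of that distinction, under stated assumptions: iri_to_uri and uri_to_local_path are hypothetical helper names, the conversion simply percent-encodes non-ASCII characters as UTF-8, and internationalized host names are ignored.

```python
from urllib.parse import quote, unquote, urlsplit

# Characters left untouched: "%" (so existing %xx sequences survive) plus
# the URI reserved characters; unreserved ASCII is never quoted by quote().
_SAFE = "%/:?#[]@!$&'()*+,;="

def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8; never decode anything."""
    return quote(iri, safe=_SAFE)

def uri_to_local_path(uri):
    """Application-level step: decode %xx when mapping a file: URI to a path."""
    return unquote(urlsplit(uri).path)
```

Here iri_to_uri leaves an existing %20 alone, and only uri_to_local_path, which stands in for the application, turns "file:/tmp/foo%20bar.txt" back into "/tmp/foo bar.txt".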

-Alex



Received on Monday, 5 March 2012 15:38:30 UTC