- From: Alex Hall <alexhall@revelytix.com>
- Date: Mon, 5 Mar 2012 10:37:39 -0500
- To: Henry Story <henry.story@bblfish.net>
- Cc: David Robillard <d@drobilla.net>, public-rdf-comments@w3.org
- Message-ID: <CAFq2bizFrJ25VVZ624usDHqcX7ydPF3YtNgPWN2LYwPwStFM2A@mail.gmail.com>
On Sun, Mar 4, 2012 at 4:13 PM, Henry Story <henry.story@bblfish.net> wrote:

> On 3 Mar 2012, at 23:20, David Robillard wrote:
>
> > On Fri, 2012-03-02 at 08:19 +0100, Henry Story wrote:
> >> pretty much the only positive test that fails for me at present consistently across Jena, Sesame and my
> >> implementation is Test-29.ttl [1] which contains the following statement
> >>
> >> <http://example.org/node> <http://example.org/prop> <scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F !"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F> .
> >>
> >> This is causing the apache abdera IRI [2] library to barf. It looks like they put a lot of energy into this library, and so that's made me wonder where the error lies. This can be reproduced like this on the scala console
> >
> > This test always puzzled me a bit, since as far as I can tell \u escapes
> > like this in an IRI is not valid, but a Turtle/Sparql specific thing.
> >
> > This is a bit of a devil's advocate question, since I'd rather not
> > implement two escape mechanisms when one will do, but shouldn't percent
> > encoding be used to escape things in URIs/IRIs? Can other software be
> > expected to actually understand URIs like this, or is it
> > intended/desirable that machine processing would have to happen before
> > they can be 'exported'?

Numeric Unicode escape sequences (\uxxxx) and percent-encoding serve two different purposes. Percent-encoding sequences (%xx) are part of the IRI/URI specs, and allow you to encode characters, e.g. into the path section of an IRI, that would otherwise be illegal in that position.
For instance, if you have a file pathname that contains a space -- "/tmp/foo bar.txt" -- then you must use percent encoding to turn this into an IRI because spaces are not allowed anywhere in an IRI. So the resulting IRI would be "file:/tmp/foo%20bar.txt".

Turtle allows percent-encoding sequences in IRIs and the local part of prefixed names as part of the grammar, but these are not processed as part of Turtle parsing. Converting percent-encoded characters in an IRI turns it into a new IRI: <http://example.com/foo%62ar.html> is NOT the same IRI as <http://example.com/foobar.html>. It's discouraged to percent-encode characters that are allowed at their position in an IRI, so use of the first IRI would be considered bad practice.

Unicode escapes are allowed in IRIs and strings, primarily to allow Turtle authors to write Unicode characters in other languages/alphabets where they don't have good keyboard or font support. If I need to write a Japanese character with my US keyboard, I can either (a) copy-and-paste from some Unicode table that I've found online, or (b) use a \uxxxx escape sequence. Unicode escapes are processed as part of Turtle parsing, so the resulting IRI or string contains the escaped character, not the \uxxxx sequence. If you use a Unicode escape inside an IRI, the escaped character must be legal at that position (which is why this test was failing -- the escaped characters were illegal in an IRI).

Strictly speaking, Unicode escapes aren't entirely necessary since Turtle supports Unicode natively. You can't express anything with Unicode escapes that you couldn't otherwise; it's more a matter of convenience for authors.

We recognize that the description of character escapes in Turtle has been confusing, and the editor has been working on new text to clarify the various types of escapes.

> As I understand \u encoding is the turtle encoding of IRIs. The IRIs don't have those characters but the UTF8 equivalent.
> Depending on the type of the document you will encode IRIs in different ways.

Correct.

> So once the transformation from turtle to IRIs has been made %xx encoded numbers do not get
> interpreted again, but are just the string %xx.

Correct.

> If you transformed that IRI into an URI for consumption by some other format you would need to escape the % character somehow.

Well, decoding of percent-encoded characters would not occur in IRI to URI translation -- if a percent-encoded character is illegal at a given position in an IRI, then it will also be illegal at that position in a URI. But yes, an application that processes IRIs or URIs, e.g. to translate into filesystem paths, would need to process the percent encodings. This is obviously outside the scope of Turtle.

-Alex

> Henry
>
> > -dr
>
> Social Web Architect
> http://bblfish.net/
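To make the distinction above concrete, here is a minimal Python sketch of how a Turtle parser might treat the two kinds of escapes. It is an illustration under stated assumptions, not how Jena, Sesame or Abdera actually work: the function names are hypothetical, and the set of characters rejected inside an IRI is borrowed from the IRIREF production of the published Turtle grammar, which may not match the draft under discussion. \uXXXX escapes are decoded while parsing and the result is then checked for legality; %xx sequences are passed through as plain text.

```python
import re
from urllib.parse import quote

# Numeric Unicode escapes as written in Turtle: \uXXXX or \UXXXXXXXX.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Assumption: characters excluded from an IRI reference, taken from the
# IRIREF production of the Turtle grammar (controls, space, <>"{}|^`\).
_ILLEGAL_IN_IRI = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def decode_unicode_escapes(text: str) -> str:
    """Replace each \\uXXXX / \\UXXXXXXXX escape with the character it names."""
    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _UNICODE_ESCAPE.sub(replace, text)


def parse_iri(raw: str) -> str:
    """Hypothetical helper: turn the text between < and > into an IRI.

    Unicode escapes are decoded here; percent-encoded sequences are
    deliberately left untouched, so %62 stays the three characters %62.
    """
    iri = decode_unicode_escapes(raw)
    bad = _ILLEGAL_IN_IRI.search(iri)
    if bad:
        raise ValueError(
            f"character U+{ord(bad.group()):04X} is not allowed in an IRI"
        )
    return iri


def path_to_file_iri(path: str) -> str:
    """Hypothetical helper: percent-encode a path the way the example above does."""
    return "file:" + quote(path)


if __name__ == "__main__":
    print(parse_iri(r"http://example.com/caf\u00E9"))     # escape is decoded
    print(parse_iri("http://example.com/foo%62ar.html"))  # %62 is left alone
    print(path_to_file_iri("/tmp/foo bar.txt"))
    try:
        parse_iri(r"scheme:\u0001")                       # control character
    except ValueError as err:
        print("rejected:", err)
```

The legality check runs after decoding because the escape stands for the character itself, which is why an escaped control character fails in the same way a literal one would.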
Received on Monday, 5 March 2012 15:38:30 UTC