- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Sun, 24 Apr 2011 14:34:34 -0400
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- Cc: RDF-WG <public-rdf-wg@w3.org>
* Andy Seaborne <andy.seaborne@epimorphics.com> [2011-04-24 17:40+0100]
>
> On 23/04/11 20:27, Eric Prud'hommeaux wrote:
> >* Andy Seaborne<andy.seaborne@epimorphics.com> [2011-04-23 17:33+0100]
> >>(resent with note of ISSUE-1 for trackbot)
> >>
> >>RDF-WG ISSUE-1
> >>http://www.w3.org/2011/rdf-wg/track/issues/1
> >>
> >>
> >>I've gathered the differences together into a live document
> >>
> >>http://www.w3.org/2011/rdf-wg/wiki/Diff_SPARQL_Turtle#Relevant_RDF_WG_Decisions
> >>
> >>
> >>And added a new one: Turtle and SPARQL treat \u escape processing
> >>differently because they happen at different times in the parsing process.
> >
> >+1
> >
> >I've had a hard time defending the fact that one can't simply escape
> >characters in PNames (SPARQL's QNames). This comes up in DB dumps, e.g.
> >
> > PREFIX p:<http://foo.example/db/People#>
> > SELECT ?who ?dept WHERE {
> > ?who p:deptName\u002CdeptCity ?dept
> > }
> >
> >SPARQL says \u002C is substituted with ',' *before* parsing (and ','
> >isn't valid in local names).
> >
> >
> >We could potentially simplify the story for Turtle users by adding
> >unicode escape sequences (I called them UCHARs) to qnames. I hacked
> >this up in a grammar called turtleEsc http://w3.org/brief/MjM0 . It
> >validates strings like:
> >
> > @prefix α:<http://foo.example/bar#> .
> > <ab\u00E9xy> \u03B1:p "ab\u0022cd" .
> >
> >and is, IMO, pretty easy to explain to users. The downside is that
> >we lose grammar control over folks adding chars like [<> ] to IRIs
> >(i.e. left to semantic validation) but I believe it's still better
> >than making PNames un-escapable.
>
> Turtle already has a mechanism for in-parsing quoting using \ as in
> "abc\"def\". That form of \u adds another mechanism.
Agreed, the \\[trn'"\] that exist in most programming (C, Java, Perl,
…) and data serialization (XML, JSON, YAML …) languages are redundant
against a general numeric escaping, but that's motivated by the fact
that there are vast swaths of Unicode for which we will never invent
abbreviations, e.g. [!#$%&´()*+,/@[]|¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
and my favorite, '…'. Here's a list of the holes in the nth char in a
localname that are allowed in a SPARQL IRI:
!#$%&´()*+,/@\[\]|[#x7F-#xB6][#xB8-#xBF]#xD7#xF7[#x2000-#x200B]
[#x200E-#x203E][#x2041-#x2069][#x2190-#x2BFF][#x2FF0-#x3000]
[#xD800-#xF8FF][#xD800-#xF8FF][#xFDD0-#xFDEF][#xFFFE-#xFFFF]
(Hmm, we should eliminate surrogates from IRI (and thank UTF-16 for
imposing its encoding liberties on the encoding-agnostic character
sequence). We should also eliminate the Byte Order Marker #xFFFE.)
> Surely it would be better to allow a style of \-escapes in prefixed
> names if we want to escape char in? Or change the prefix name rules
> to allow (internal) ","?
I see that as being a more radical approach (there are many more
characters than ',' which we want to include in IRIs).
> \u is a way to input characters that are not on the local keyboard,
> or the need to input a codepoint in the charset that does not have
> that codepoint available.
I agree with both uses, but I believe that users have come to expect
numeric escapes to get around language lexing constraints and would
find the fact that "ab\u0022 parses as a literal a bit of a shock.
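To make the surprise concrete, here is a minimal Python sketch of
substitution that runs over the raw text before any tokenizing, which
is the SPARQL behaviour described earlier in this thread. The names
substitute_uchars and _UCHAR are hypothetical and not taken from any
spec or library; the pattern assumes only the \uXXXX and \UXXXXXXXX
forms.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
# Hypothetical helper pattern, not taken from any grammar.
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def substitute_uchars(text: str) -> str:
    """Replace every numeric escape in the raw text with its character.

    This runs before tokenizing, so the resulting characters are later
    lexed as if they had been typed directly.
    """
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


# The escaped comma becomes a real comma, which is not valid in a local name.
print(substitute_uchars(r"?who p:deptName\u002CdeptCity ?dept"))
# ?who p:deptName,deptCity ?dept

# The escaped quote becomes a real quote and closes the literal.
print(substitute_uchars(r'"ab\u0022'))
# "ab"
```

With this ordering the escape cannot protect a character from the
lexer, which is why "ab\u0022 ends up as a complete literal.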
> This does not apply to UTF-8, but it does apply to "text/turtle"
> because that's US-ASCII. (please use "text/turtle;charset=utf-8"!).
> Reserving \u for that purpose seems prudent.
>
> The \u mechanism is very general.
>
> <ab\u0020xy>
> <ab xy>
>
> Making it easier to try to put spaces into IRIs seems to me to be a
> bad idea. There is already confusion in this area and the RDF URI
> reference to IRI change isn't going to make it any easier.
>
> You can't rely on the receiving parser to do and complete
> IRI-parsing which is complicated and expensive. How many systems do
> full IRI checking?
I share your pain here. I've run into data which has come out of
libraries which tolerated spaces. That's an issue in the current
Turtle spec (which allows \u0020 and \> in IRIs). That said, I think
it's better to tell implementers that they "MUST ensure the unescaped
IRI does not contain any of (#x00, #x20, #x3c, #x3e)" than to push
numeric parsing onto users (i.e. get their head around "ab\u0022 and
a\u003Ab).
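A rough Python sketch of what that MUST could look like in a parser,
assuming the four characters listed above are the whole forbidden set
and that only \uXXXX and \UXXXXXXXX escapes need expanding (the \>
escape that the current Turtle spec allows is not handled). The names
unescape_iri, FORBIDDEN and _UCHAR are hypothetical.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Assumption: exactly the characters named in the message above.
FORBIDDEN = {"\x00", "\x20", "\x3c", "\x3e"}


def unescape_iri(body: str) -> str:
    """Expand numeric escapes in the text between < and >, then check it.

    Raises ValueError if the unescaped IRI contains a forbidden character,
    whether it was written literally or smuggled in through an escape.
    """
    iri = _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), body)
    bad = sorted(FORBIDDEN.intersection(iri))
    if bad:
        names = ", ".join(f"#x{ord(c):02X}" for c in bad)
        raise ValueError(f"IRI contains forbidden character(s): {names}")
    return iri


for candidate in (r"ab\u00E9xy", r"ab\u0020xy", r"http://example/a\u003Eb"):
    try:
        print(unescape_iri(candidate))
    except ValueError as err:
        print(f"rejected {candidate!r}: {err}")
```

This is a check on four code points after unescaping, not full IRI
validation, so it stays cheap for the receiving parser.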
> Test your local parser with this N-Triples file:
> ---------
> <http://example/> <http://example/[]/g> "foo" .
> <http://example/> <http://example/ /g> "foo" .
> ---------
>
> Related:
>
> I do think it's unfortunate that % is not allowed in the local part
> of prefix names.
>
> The correct fix is to allow % in PN_LOCAL (in Turtle and SPARQL).
>
> Andy
>
--
-ericP
Received on Sunday, 24 April 2011 18:35:04 UTC