Re: [TTL] Differences between SPARQL and Turtle. from Eric Prud'hommeaux on 2011-04-24 (public-rdf-wg@w3.org from April 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sun, 24 Apr 2011 14:34:34 -0400
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <20110424183433.GF3342@w3.org>
* Andy Seaborne <andy.seaborne@epimorphics.com> [2011-04-24 17:40+0100]
> 
> On 23/04/11 20:27, Eric Prud'hommeaux wrote:
> >* Andy Seaborne<andy.seaborne@epimorphics.com>  [2011-04-23 17:33+0100]
> >>(resent with note of ISSUE-1 for trackbot)
> >>
> >>RDF-WG ISSUE-1
> >>http://www.w3.org/2011/rdf-wg/track/issues/1
> >>
> >>
> >>I've gathered the differences together into a live document
> >>
> >>http://www.w3.org/2011/rdf-wg/wiki/Diff_SPARQL_Turtle#Relevant_RDF_WG_Decisions
> >>
> >>
> >>And added a new one: Turtle and SPARQL treat \u escape processing
> >>differently because they happen at different times in the parsing process.
> >
> >+1
> >
> >I've had a hard time defending the fact that one can't simply escape
> >characters in PNames (SPARQL's QNames). This comes up in DB dumps, e.g.
> >
> >   PREFIX p:<http://foo.example/db/People#>
> >   SELECT ?who ?dept WHERE {
> >     ?who p:deptName\u002CdeptCity ?dept
> >   }
> >
> >SPARQL says \u002C is substituted with ',' *before* parsing (and ','
> >isn't valid in local names).
> >
> >
> >We could potentially simplify the story for Turtle users by adding
> >unicode escape sequences (I called them UCHARs) to qnames. I hacked
> >this up in a grammar called turtleEsc http://w3.org/brief/MjM0 . It
> >validates strings like:
> >
> >   @prefix α:<http://foo.example/bar#>  .
> >   <ab\u00E9xy>  \u03B1:p "ab\u0022cd" .
> >
> >and is, IMO, pretty easy to explain to users. The downside is that
> >we lose grammar control over folks adding chars like [<>  ] to IRIs
> >(i.e. left to semantic validation) but I believe it's still better
> >than making PNames un-escapable.
> 
> Turtle already has a mechanism for in-parsing quoting using \ as in
> "abc\"def\". That form of \u adds another mechanism.

Agreed, the \\[trn'"\] that exist in most programming (C, Java, Perl,
…) and data serialization (XML, JSON, YAML …) languages are redundant
against a general numeric escaping, but that's motivated by the fact
that there are vast swaths of Unicode for which we will never invent
abbreviations, e.g. [!#$%&´()*+,/@[]|¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
and my favorite, '…'. Here's a list of the holes in the nth char in a
localname that are allowed in a SPARQL IRI:

  !#$%&´()*+,/@\[\]|[#x7F-#xB6][#xB8-#xBF]#xD7#xF7[#x2000-#x200B]
  [#x200E-#x203E][#x2041-#x2069][#x2190-#x2BFF][#x2FF0-#x3000]
  [#xD800-#xF8FF][#xD800-#xF8FF][#xFDD0-#xFDEF][#xFFFE-#xFFFF]

(Hmm, we should eliminate surrogates from IRI (and thank UTF-16 for
 imposing its encoding liberties on the encoding-agnostic character
 sequence). We should also eliminate the Byte Order Marker #xFFFE .)


> Surely it would be better to allow a style of \-escapes in prefixed
> names if we want to escape chars in them? Or change the prefix name
> rules to allow (internal) ","?

I see that as being a more radical approach (there are many more
characters than ',' which we want to include in IRIs).
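
To make the DB-dump case concrete, here is a minimal Python sketch of
what happens when \u escapes are substituted before parsing. It is an
illustration, not spec text: PN_LOCAL_ASCII is an ASCII-only
approximation of SPARQL's PN_LOCAL production (the real production also
admits many non-ASCII ranges), and the function names are hypothetical.

```python
import re

# \uXXXX and \UXXXXXXXX numeric escapes (called UCHARs in this thread).
UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# ASSUMPTION: ASCII-only approximation of PN_LOCAL, for illustration only.
# First char: letter, digit or '_'; middle chars may also be '.' or '-';
# the last char may not be '.'.
PN_LOCAL_ASCII = re.compile(r"[A-Za-z0-9_](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?")


def substitute_uchars(text):
    """Replace every numeric escape with the character it names."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def is_valid_local_name(local):
    """True if the whole string matches the approximate PN_LOCAL pattern."""
    return PN_LOCAL_ASCII.fullmatch(local) is not None


def rejected_chars(candidates):
    """Characters that cannot sit inside a local name under this approximation."""
    return [c for c in candidates if not is_valid_local_name("a" + c + "b")]
```

With pre-parse substitution, deptName\u002CdeptCity becomes
deptName,deptCity, which is_valid_local_name rejects, and
rejected_chars("!#$%&()*+,/@[]|") returns every character passed in.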


> \u is a way to input characters that are not on the local keyboard,
> or to input a codepoint when the charset in use does not have that
> codepoint available.

I agree with both uses, but I believe that users have come to expect
numeric escapes to get around language lexing constraints and would
find the fact that "ab\u0022 parses as a literal a bit of a shock.
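
The two processing orders can be compared with a small Python sketch.
This is a toy model under stated assumptions, not either spec's lexer:
the string-literal pattern is simplified to double-quoted, single-line
strings, and both function names are hypothetical.

```python
import re

# \uXXXX and \UXXXXXXXX numeric escapes.
UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# ASSUMPTION: simplified literal, a double-quoted run with backslash escapes.
STRING = re.compile(r'"((?:[^"\\\n]|\\.)*)"')


def unescape_uchars(text):
    """Replace every numeric escape with the character it names."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def literal_escape_first(src):
    """Substitute escapes over the raw input, then look for a literal."""
    m = STRING.match(unescape_uchars(src))
    return m.group(1) if m else None


def literal_tokenize_first(src):
    """Find the quoted token first, then unescape inside it."""
    m = STRING.match(src)
    return unescape_uchars(m.group(1)) if m else None
```

Under escape-first processing the input "ab\u0022 is a complete
literal with value ab, because the escape supplies the closing quote.
Under tokenize-first processing the same input is an unterminated
string, while "ab\u0022cd" yields a value with an embedded quote.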


> This does not apply to UTF-8, but it does apply to "text/turtle"
> because that's US-ASCII. (please use "text/turtle;charset=utf-8"!).
> Reserving \u for that purpose seems prudent.
> 
> The \u mechanism is very general.
> 
> <ab\u0020xy>
> <ab xy>
> 
> Making it easier to try to put spaces into IRIs seems to me to be a
> bad idea.  There is already confusion in this area and the RDF URI
> reference to IRI change isn't going to make it any easier.
> 
> You can't rely on the receiving parser to do complete IRI parsing,
> which is complicated and expensive.  How many systems do full IRI
> checking?

I share your pain here. I've run into data which has come out of
libraries which tolerated spaces. That's an issue in the current
Turtle spec (which allows \u0020 and \> in IRIs). That said, I think
it's better to tell implementers that they "MUST ensure the unescaped
IRI does not contain any of (#x00, #x20, #x3c, #x3e)" than to push
numeric parsing onto users (i.e. get their head around "ab\u0022 and
a\u003Ab).
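
A minimal Python sketch of that implementer-side check, assuming the
parser hands over the text between < and > with escapes still in place.
The function names are hypothetical, and only the four code points
named above are tested; this is not full IRI validation.

```python
import re

# Code points the unescaped IRI must not contain: NUL, space, '<', '>'.
FORBIDDEN = {"\x00", "\x20", "\x3c", "\x3e"}

# Numeric escapes plus the \> escape that the current Turtle spec allows.
IRI_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})|\\(>)")


def unescape_iri(body):
    """Resolve escapes in the text found between '<' and '>'."""
    def replace(match):
        if match.group(3):
            return ">"
        return chr(int(match.group(1) or match.group(2), 16))
    return IRI_ESCAPE.sub(replace, body)


def check_iri(body):
    """Return the unescaped IRI, or raise ValueError on a forbidden char."""
    iri = unescape_iri(body)
    bad = sorted(c for c in set(iri) if c in FORBIDDEN)
    if bad:
        listed = ", ".join(f"#x{ord(c):02X}" for c in bad)
        raise ValueError(f"IRI contains forbidden code point(s): {listed}")
    return iri
```

Here check_iri(r"ab\u00E9xy") returns the unescaped string, while
check_iri(r"ab\u0020xy") and check_iri(r"a\>b") raise ValueError.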


> Test your local parser with this N-Triples file:
> ---------
>    <http://example/> <http://example/[]/g> "foo" .
>    <http://example/> <http://example/ /g> "foo" .
> ---------
> 
> Related:
> 
> I do think it's unfortunate that % is not allowed in the local part
> of prefix names.
> 
> The correct fix is to allow % in PN_LOCAL (in Turtle and SPARQL).
> 
>  Andy
> 

-- 
-ericP
Received on Sunday, 24 April 2011 18:35:04 UTC