- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Sun, 24 Apr 2011 14:34:34 -0400
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- Cc: RDF-WG <public-rdf-wg@w3.org>
* Andy Seaborne <andy.seaborne@epimorphics.com> [2011-04-24 17:40+0100]
>
> On 23/04/11 20:27, Eric Prud'hommeaux wrote:
> >* Andy Seaborne<andy.seaborne@epimorphics.com> [2011-04-23 17:33+0100]
> >>(resent with note of ISSUE-1 for trackbot)
> >>
> >>RDF-WG ISSUE-1
> >>http://www.w3.org/2011/rdf-wg/track/issues/1
> >>
> >>
> >>I've gathered the differences together into a live document
> >>
> >>http://www.w3.org/2011/rdf-wg/wiki/Diff_SPARQL_Turtle#Relevant_RDF_WG_Decisions
> >>
> >>
> >>And added a new one: Turtle and SPARQL treat \u escape processing
> >>differently because they happen at different times in the parsing process.
> >
> >+1
> >
> >I've had a hard time defending the fact that one can't simply escape
> >characters in PNames (SPARQL's QNames). This comes up in DB dumps, e.g.
> >
> >  PREFIX p:<http://foo.example/db/People#> .
> >  SELECT ?who ?dept WHERE {
> >    ?who p:deptName\u002CdeptCity ?dept
> >  }
> >
> >SPARQL says \u002C is substituted with ',' *before* parsing (and ','
> >isn't valid in local names).
> >
> >
> >We could potentially simplify the story for Turtle users by adding
> >unicode escape sequences (I called them UCHARs) to qnames. I hacked
> >this up in a grammar called turtleEsc http://w3.org/brief/MjM0 . It
> >validates strings like:
> >
> >  @prefix α:<http://foo.example/bar#> .
> >  <ab\u00E9xy> \u03B1:p "ab\u0022cd" .
> >
> >and is, IMO, pretty easy to explain to users. The downside is that
> >we lose grammar control over folks adding chars like [<> ] to IRIs
> >(i.e. left to semantic validation) but I believe it's still better
> >than making PNames un-escapable.
>
> Turtle already has a mechanism for in-parsing quoting using \ as in
> "abc\"def\". That form of \u adds another mechanism.

Agreed, the \\[trn'"\] escapes that exist in most programming (C, Java,
Perl, …) and data serialization (XML, JSON, YAML, …) languages are
redundant against a general numeric escaping, but the numeric form is
motivated by the fact that there are vast swaths of Unicode for which
we will never invent abbreviations, e.g.
  [!#$%&´()*+,/@[]|¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
and my favorite, '…'. Here's a list of the holes, i.e. the characters
that are allowed in a SPARQL IRI but not as the nth char of a localname:

  !#$%&´()*+,/@\[\]|[#x7F-#xB6][#xB8-#xBF]#xD7#xF7[#x2000-#x200B]
  [#x200E-#x203E][#x2041-#x2069][#x2190-#x2BFF][#x2FF0-#x3000]
  [#xD800-#xF8FF][#xD800-#xF8FF][#xFDD0-#xFDEF][#xFFFE-#xFFFF]

(Hmm, we should eliminate surrogates from IRI (and thank UTF-16 for
imposing its encoding liberties on the encoding-agnostic character
sequence). We should also eliminate the Byte Order Marker #xFFFE .)

> Surely it would be better to allow a style of \-escapes in prefixed
> names if we want to escape char in? Or change the prefix name rules
> to allow (internal) ","?

I see that as being a more radical approach (there are many more
characters than ',' which we want to include in IRIs).

> \u is a way to input characters that are not on the local keyboard,
> or the need to input a codepoint in the charset that does not have
> that codepoint available.

I agree with both uses, but I believe that users have come to expect
numeric escapes to get around language lexing constraints and would
find the fact that "ab\u0022 parses as a literal a bit of a shock.

> This does not apply to UTF-8, but it does apply to "text/turtle"
> because that's US-ASCII. (please use "text/turtle;charset=utf-8"!).
> reserving \u for that purposes seems prudent.
>
> The \u mechanism is very general.
>
> <ab\u0020xy>
> <ab xy>
>
> Making it easier to try to put spaces into IRIs seems to me to be a
> bad idea. There is already confusion in this area and the RDF URI
> reference to IRI change isn't going to make it any easier.
>
> You can't rely on the receiving parser to do and complete
> IRI-parsing which is complicated and expensive. How many systems do
> full IRI checking?

I share your pain here. I've run into data which has come out of
libraries which tolerated spaces. That's an issue in the current Turtle
spec (which allows \u0020 and \> in IRIs). That said, I think it's
better to tell implementers that they "MUST ensure the unescaped IRI
does not contain any of (#x00, #x20, #x3c, #x3e)" than to push numeric
parsing onto users (i.e. get their head around "ab\u0022 and a\u003Ab).

> Test your local parser with this N-Triples file:
> ---------
> <http://example/> <http://example/[]/g> "foo" .
> <http://example/> <http://example/ /g> "foo" .
> ---------
>
> Related:
>
> I do think its unfortunate that % is not allowed in the local part
> of prefix names.
>
> The correct fix is to allow it in % in PN_LOCAL (in Turtle and SPARQL).
>
> Andy
>

--
-ericP
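
To make the implementer-side rule above concrete, here is a minimal
sketch in Python of what "ensure the unescaped IRI does not contain any
of (#x00, #x20, #x3c, #x3e)" could look like. The names unescape_uchars
and check_iri are hypothetical, not taken from any Turtle or SPARQL
library, and the sketch only handles \uXXXX and \UXXXXXXXX (it does not
handle the \> escape that the current Turtle spec allows, nor does it do
full IRI parsing).

```python
import re

# Hypothetical helpers for illustration only; not part of any real parser.
# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Code points named in the proposed requirement: NUL, space, '<', '>'.
FORBIDDEN = {0x00, 0x20, 0x3C, 0x3E}


def unescape_uchars(text):
    """Replace each numeric escape in text with the character it denotes."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _UCHAR.sub(repl, text)


def check_iri(escaped):
    """Unescape the text between '<' and '>' and reject forbidden characters.

    Returns the unescaped IRI string, or raises ValueError listing the
    offending code points.
    """
    iri = unescape_uchars(escaped)
    bad = sorted({ord(ch) for ch in iri if ord(ch) in FORBIDDEN})
    if bad:
        found = ", ".join("#x%02X" % cp for cp in bad)
        raise ValueError("unescaped IRI contains " + found)
    return iri
```

With this, check_iri(r"ab\u00E9xy") returns the unescaped string, while
check_iri(r"ab\u0020xy") and the literal-space IRI from the N-Triples
test above are both rejected. The check runs after unescaping, so the
burden stays with the parser rather than with the user.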
Received on Sunday, 24 April 2011 18:35:04 UTC