Escape sequences (SPARQL and Turtle)

This is addressing the working group note in the query doc (bullets 2 
and 3).

http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammar


==== SPARQL Proposal

== tl;dr

1/ Change SPARQL so that \u escapes happen inside strings, IRIs and 
prefix names only.  Character escapes appear in strings only.  This 
approach is the same design as Turtle.

2/ Suggest to RDF-WG that Turtle and SPARQL be the same - that is, keep the 
Turtle approach with fixes for its inconsistencies, i.e. character 
escapes appear in strings only; no escape \>; add \b and \f.  \u and \U 
can appear in IRIs and prefixed names as well as strings.


The rest of this message is a quite detailed assessment of where we are 
and what the change would mean.  But it does feature the snowman.

 Andy


== Current situation

There are two kinds of escapes:

character escapes -- \t \n \r \b \f \" \' \\

These represent a single codepoint and also turn off any special meaning 
like string delimiter or newline.

unicode escapes -- \u1234 and \U12345678

Unicode escapes allow systems to handle characters outside the range of 
the current input system and output font.  Like our friend the unicode 
snowman \u2603 ☃ (if your font has it) or accented characters \u00E9  é 
or Japanese (\u5E03\u77F3 布石 (fuseki)).

Snowman:
http://www.fileformat.info/info/unicode/char/2603/index.htm

The value is the unicode codepoint, not the hex code of UTF-8 bytes. 
That does not mean the input must be decoded from UTF-8 to codepoints 
first: UTF-8 encodes each codepoint separately, so a system can encode 
the codepoint of a \u or \U escape as UTF-8, insert those bytes into the 
input stream, and it will just work.

It's also a way to write "\u5E03\u77F3" for "布石" and not risk 
corruption (binary/text messing around).
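
To make the codepoint/UTF-8 distinction concrete, here is a minimal Python sketch.  The function name decode_unicode_escapes and the regular expression are my own illustration, not anything from the SPARQL or Turtle documents, and it does no validation beyond matching hex digits (a \U value above 10FFFF would raise an error).

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_unicode_escapes(text):
    # Hypothetical helper: replace each escape with the codepoint it names.
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return UNICODE_ESCAPE.sub(replace, text)

snowman = decode_unicode_escapes(r"\u2603")
print(hex(ord(snowman)))               # 0x2603: the escape names the codepoint
print(snowman.encode("utf-8").hex())   # e29883: the UTF-8 bytes are different
print(decode_unicode_escapes(r"\u5E03\u77F3"))
```

The escape carries 2603, while the bytes that end up in a UTF-8 stream are E2 98 83.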

= SPARQL

Character escapes can occur in strings (" ", ' ', """ """, ''' ''')
They are converted to their real character after parsing, and any 
special meaning of the character is turned off.

Unicode escapes can occur anywhere.  They are dealt with as part of the 
character input stream, so this happens before any parsing takes place.  So 
a unicode escape can stand for anything, anywhere:

ASK \u007B\u007D
A\u0053K\u0020\u007B\u007D

are both seen by the parser as "ASK {}"
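
A rough Python sketch of that stream-level behaviour follows.  The name preprocess is hypothetical and the pass is simplified (no error handling); the point is only that the replacement runs over the whole query text before any tokenizing.

```python
import re

UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def preprocess(query):
    # Stream-level pass: every escape is replaced, wherever it occurs,
    # before the tokenizer or parser sees any of the text.
    return UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), query)

print(preprocess(r"ASK \u007B\u007D"))            # ASK {}
print(preprocess(r"A\u0053K\u0020\u007B\u007D"))  # ASK {}
```

Keywords, whitespace and punctuation can all be spelled with escapes, because the parser never sees the escapes at all.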

Unicode escapes can occur in IRIs and prefix names.

In SPARQL, the only escapes in IRIs are \u and \U.

http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#codepointEscape
http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammarEscapes

= Turtle

Turtle does not have \f or \b character escapes and it adds \>.

Both character escapes and unicode escapes are applied after parsing 
inside strings (short and long) and IRIs but not prefixed names.

There are special rules: \" is only allowed in strings (odd - the " 
character is legal in IRIs), but \' is allowed in an IRI.

\> is only allowed in IRIs (where it's illegal by IRI rules).
But the grammar production does not allow you to type \> in!

"<" ( [^<>\"{}|^`\\] - [#0000-#0020] )* ">"


Two suggestions are pending for Turtle:

T1/ Allow unicode escapes in prefixed names.
T2/ Allow the unicode escapes in prefix names to pass in a wider 
character set than the prefix name production allows.

Two characters of note for T2 are "=" (U+003D) and ":" (U+003A)

The argument for "=" is that it is used in automatic generation of IRIs 
from SQL databases, so there is a case for allowing abbreviated input 
for <http://example/store/id=1234> as ex:id=1234, except that "=" is 
illegal there, so it would be written ex:id\u003D1234.

The argument for ":" is that the Facebook Open Graph Protocol
http://developers.facebook.com/docs/opengraph
uses it inside property names, for example: og:audio:title

<html xmlns:og="http://ogp.me/ns#">
     <head>
         ...
         [REQUIRED TAGS]
         <meta property="og:audio" content="http://example.com/amazing.mp3" />
         <meta property="og:audio:title" content="Amazing Song" />
         <meta property="og:audio:artist" content="Amazing Band" />
         <meta property="og:audio:album" content="Amazing Album" />
         <meta property="og:audio:type" content="application/mp3" />
         ...
   </head>

With property names like these, you can't write either of the following 
in Turtle today:

<http://example/page> og:audio:title "Amazing Song" .
<http://example/page> og:audio\u003Atitle "Amazing Song" .

You can't write og:audio\u003Atitle in SPARQL and have it parse.  The 
\u003A is converted to ":"  and the parser sees:

og:audio:title

which is not a single prefixed name.
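
The difference between the two processing models can be sketched in Python.  This is an illustration only: expand_token_level is a made-up helper, the split on the first ":" stands in for a real prefixed-name tokenizer, and the og prefix binding is the one from the xmlns:og declaration above.

```python
import re

UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_unicode_escapes(text):
    return UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

# Prefix binding taken from the xmlns:og declaration in the HTML example.
PREFIXES = {"og": "http://ogp.me/ns#"}

def expand_token_level(prefixed_name):
    # Hypothetical helper: split on the first literal ":" BEFORE decoding,
    # so an escaped colon stays inside the local part.
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[decode_unicode_escapes(prefix)] + decode_unicode_escapes(local)

raw = r"og:audio\u003Atitle"

# Stream-level decoding (current SPARQL): the parser is handed two colons.
print(decode_unicode_escapes(raw))   # og:audio:title

# Token-level decoding (as in suggestion T2): one prefixed name, one IRI.
print(expand_token_level(raw))       # http://ogp.me/ns#audio:title
```

With stream-level decoding the escape buys nothing, because the ":" is back in the text before the prefixed-name rule is applied.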

Note that even if "=" is escaped in, the result is still required to be 
a legal IRI after prefix name to IRI conversion.

Turtle editors' working draft:
http://dvcs.w3.org/hg/rdf/raw-file/tip/rdf-turtle/index.html#sec-grammar

= The base name idiom

Another way to abbreviate IRIs is to use the base:

@base <http://ogp.me/ns#> .

Downside: you can have only one BASE in SPARQL, and only one active 
@base in Turtle (it can change between blocks of triples).

... <audio:title> ...

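A relative URI cannot start with a segment containing a ":" (RFC 3986), 
so <audio:title> is read as an absolute IRI with scheme "audio" rather 
than being resolved against the base.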

= Many prefixes

@prefix og-audio: <http://ogp.me/ns#audio:> .

...  og-audio:title ...

Downside is that you do have many prefixes.

= Opinion

Of

     og:audio\u003Atitle
and
     <http://ogp.me/ns#audio:title>

I find the <> form quite adequate because the NS is short.

The use of "id=" could equally have been "id_" -- the use of "=" was not 
forced.

== Proposal

There is a desire to make SPARQL and Turtle as much the same as is 
reasonable.

For SPARQL:

Change the Unicode escaping to only happen inside strings, IRIs and 
prefix names (prefix part and local part) and remove it from the input 
character processing.

The practical effect is small (use of \u in comments does not make a 
query illegal) because \u is used only in those places in the deployed 
world.
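
A very rough Python sketch of what the proposed change means for a processor.  The token shapes here are deliberately simplified assumptions, not the real grammar: only <...> IRIs, double-quoted strings and # comments are recognised, prefix names are left out, and decode_in_tokens is a made-up name.

```python
import re

UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_unicode_escapes(text):
    return UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

# Simplified token shapes (assumptions for illustration only):
# an IRI in angle brackets, a double-quoted string, or a "#" comment.
TOKEN = re.compile(r'''
    (?P<iri><[^<>\s]*>)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<comment>\#[^\n]*)
''', re.VERBOSE)

def decode_in_tokens(query):
    def replace(match):
        if match.lastgroup == "comment":
            return match.group(0)   # comments are left untouched
        return decode_unicode_escapes(match.group(0))
    # Text outside the recognised tokens is passed through unchanged.
    return TOKEN.sub(replace, query)

print(decode_in_tokens(r'ASK { <http://example/caf\u00E9> ?p "\u2603" } # \u007D'))
print(decode_in_tokens(r"ASK \u007B\u007D"))
```

The first query has its IRI and string decoded while the comment keeps its \u007D; the second comes back unchanged, so under the proposal it would no longer parse as "ASK {}".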

For Turtle:

Keep the current rule for escapes in strings; add prefix names.  Allow 
only unicode escapes in IRIs.  Fix the grammar rule for IRIs.

Use the same escapes as SPARQL (add \b and \f, remove \>).

Received on Saturday, 19 November 2011 18:02:14 UTC