- From: Andy Seaborne <andy.seaborne@epimorphics.com>
- Date: Sat, 19 Nov 2011 18:01:44 +0000
- To: SPARQL Working Group <public-rdf-dawg@w3.org>
This is addressing the working group note in the query doc (bullets 2
and 3).
http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammar
==== SPARQL Proposal
== tl;dr
1/ Change SPARQL so that \u escapes happen inside strings, IRIs and
prefix names only. Character escapes appear in strings only. This
approach is the same design as Turtle.
2/ Suggest to RDF-WG that Turtle and SPARQL are the same - that is, keep
the Turtle approach with fixes for its inconsistencies, i.e. character
escapes appear in strings only; no escape \>; add \b and \f. \u and \U
can appear in IRIs and prefixed names as well as strings.
The rest of this message is a quite detailed assessment of where we are
and what the change would mean. But it does feature the snowman.
Andy
== Current situation
There are two kinds of escapes:
character escapes -- \t, \n, \r, \b, \f, \", \', \\
These represent a single codepoint and also turn off any special meaning
like string delimiter or newline.
unicode escapes: \u1234 and \U12345678
Unicode escapes allow systems to handle characters outside the range of
the current input system and output font. Like our friend the unicode
snowman \u2603 ☃ (if your font has it) or accented characters \u00E9 é
or Japanese (\u5E03\u77F3 布石 (fuseki)).
Snowman:
http://www.fileformat.info/info/unicode/char/2603/index.htm
The value is the unicode codepoint, not the hex code of UTF-8 bytes.
That does not mean a UTF-8 to codepoint conversion must be done, because
UTF-8 encodes each codepoint separately: a system can encode the
codepoint of a \u or \U escape as UTF-8, insert those bytes into the
input stream, and it will just work.
It's also a way to write "\u5E03\u77F3" for "布石" and not risk
corruption (binary/text messing around).
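To make the codepoint versus UTF-8 bytes distinction concrete, here is a
minimal Python sketch. The helper name decode_unicode_escapes is made up
for this message; it is not taken from either spec or from any library.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UESC = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_unicode_escapes(text):
    """Replace each \\u / \\U escape by the character with that codepoint.

    Hypothetical helper for illustration only. A \\U value above 0x10FFFF
    makes chr() raise ValueError; a real parser would report a syntax error.
    """
    def repl(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return _UESC.sub(repl, text)

snowman = decode_unicode_escapes(r"\u2603")
print(hex(ord(snowman)))        # the codepoint: 0x2603
print(snowman.encode("utf-8"))  # its UTF-8 bytes: b'\xe2\x98\x83'
print(decode_unicode_escapes(r"\u5E03\u77F3") == "布石")  # True
```

The escape carries 2603, while the bytes that end up in a UTF-8 stream
are E2 98 83.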
= SPARQL
Character escapes can occur in strings (" ", ' ', """ """, ''' ''')
They are converted to their real character after parsing, and any
special meaning of the character is turned off.
Unicode escapes can occur anywhere. They are dealt with as part of the
character input stream, so this happens before any parsing takes place.
So a unicode escape can be anything anywhere:
ASK \u007B\u007D
A\u0053K\u0020\u007B\u007D
are both seen by the parser as "ASK {}".
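The pre-parse behaviour can be sketched in Python as a pass over the raw
query text. This is only an illustration under my own assumptions: the
function name preprocess is invented here, and real implementations
usually do this inside the character stream reader rather than on a
whole string.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UESC = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def preprocess(query):
    """Decode every \\u / \\U escape before the parser sees the text."""
    return _UESC.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), query)

# Escapes may stand for keywords, whitespace and punctuation alike.
print(preprocess(r"ASK \u007B\u007D"))            # ASK {}
print(preprocess(r"A\u0053K\u0020\u007B\u007D"))  # ASK {}
```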
Unicode escapes can occur in IRIs and prefix names.
In SPARQL, the only escapes in IRIs are \u and \U.
http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#codepointEscape
http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammarEscapes
= Turtle
Turtle does not have \f or \b character escapes and it adds \>.
Both character escapes and unicode escapes are applied after parsing,
inside strings (short and long) and IRIs but not prefixed names.
There are special rules: \" is only allowed in strings (odd - the "
character is legal in IRIs), but \' is allowed in an IRI;
\> is only allowed in IRIs (where it's illegal by IRI rules).
But the grammar production does not allow you to type \> in!
"<" ( [^<>\"{}|^`\\] - [#0000-#0020] )* ">"
Two suggestions are pending for Turtle:
T1/ Allow unicode escapes in prefixed names.
T2/ Allow the unicode escapes in prefix names to pass in a wider
character set than the prefix name production allows.
Two characters of note for T2 are "=" (U+003D) and ":" (U+003A).
The argument for "=" is that it is used in automatic generation of IRIs
from SQL databases, so there is a case for allowing abbreviated input
for <http://example/store/id=1234> as ex:id=1234, except "=" is illegal,
so ex:id\u003D1234.
The argument for ":" is that the Facebook Open Graph Protocol
http://developers.facebook.com/docs/opengraph
uses it in property names, for example: og:audio:title
<html xmlns:og="http://ogp.me/ns#">
<head>
...
[REQUIRED TAGS]
<meta property="og:audio"
content="http://example.com/amazing.mp3" />
<meta property="og:audio:title" content="Amazing Song" />
<meta property="og:audio:artist" content="Amazing Band" />
<meta property="og:audio:album" content="Amazing Album" />
<meta property="og:audio:type" content="application/mp3" />
...
</head>
then you can't write (Turtle):
<http://example/page> og:audio:title "Amazing Song" .
<http://example/page> og:audio\u003Atitle "Amazing Song" .
You can't write og:audio\u003Atitle in SPARQL and have it parse. The
\u003A is converted to ":" and the parser sees:
og:audio:title
which is not a single prefixed name.
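The difference between decoding before and after tokenizing can be shown
with a toy Python example. Both helpers are hypothetical and much cruder
than a real tokenizer, which would follow the grammar; splitting at the
first ":" is only my stand-in for "recognise the prefixed name first".

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UESC = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def unescape(text):
    """Decode \\u / \\U escapes in a piece of text (illustrative helper)."""
    return _UESC.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def split_prefixed_name(token):
    """Toy stand-in for a tokenizer: split at the first ':' only."""
    prefix, _, local = token.partition(":")
    return prefix, local

raw = r"og:audio\u003Atitle"

# Decode first (SPARQL today): the tokenizer is handed two colons.
early = unescape(raw)
print(early, early.count(":"))   # og:audio:title 2

# Tokenize first, then decode inside the local part only.
prefix, local = split_prefixed_name(raw)
print(prefix, unescape(local))   # og audio:title
```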
Note that even if "=" is escaped in, the result is still required to be
a legal IRI after prefixed name to IRI conversion.
Turtle editors working draft:
http://dvcs.w3.org/hg/rdf/raw-file/tip/rdf-turtle/index.html#sec-grammar
= The base name idiom
Another way to abbreviate IRIs is to use the base:
@base <http://ogp.me/ns#> .
Downside: you can have only one BASE in SPARQL, and only one active
@base in Turtle (it can change between blocks of triples).
... <audio:title> ...
A relative URI cannot start with a segment containing a ":" (RFC 3986).
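The RFC 3986 point can be checked with Python's standard urllib.parse,
used here only as a convenient generic URI resolver, not as a Turtle or
SPARQL implementation:

```python
from urllib.parse import urljoin, urlsplit

BASE = "http://ogp.me/ns#"

# "audio:title" is read as an absolute reference with scheme "audio",
# so it is never resolved against the base.
parts = urlsplit("audio:title")
resolved = urljoin(BASE, "audio:title")

print(parts.scheme)  # audio
print(resolved)      # audio:title
```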
= Many prefixes
@prefix og-audio: <http://ogp.me/ns#audio:> .
... og-audio:title ...
Downside is that you do have many prefixes.
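As a rough sketch of why this works without any escapes, prefixed name
expansion is plain concatenation of the namespace IRI and the local
part. The PREFIXES table and the expand function are illustrative names
of mine, not part of any API.

```python
# Illustrative prefix table; one extra prefix per "sub-namespace".
PREFIXES = {
    "og": "http://ogp.me/ns#",
    "og-audio": "http://ogp.me/ns#audio:",
}

def expand(pname, prefixes=PREFIXES):
    """Expand prefix:local to a full IRI by simple concatenation."""
    prefix, sep, local = pname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError("unknown or missing prefix in %r" % pname)
    return prefixes[prefix] + local

print(expand("og-audio:title"))  # http://ogp.me/ns#audio:title
```

The ":" lives in the namespace IRI, so the local part "title" needs
nothing special.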
= Opinion
Of
og:audio\u003Atitle
and
<http://ogp.me/ns#audio:title>
I find the <> form quite adequate because the NS is short.
The use of "id=" could equally have been "id_" -- the use of "=" was not
forced.
== Proposal
There is a desire to make SPARQL and Turtle as much the same as is
reasonable.
For SPARQL:
Change the Unicode escaping to only happen inside strings, IRIs and
prefix names (prefix part and local part) and remove it from the input
character processing.
The practical effect is small (use of \u in comments does not make a
query illegal) because \u is used only in those places in the deployed
world.
For Turtle:
Keep the current rule for strings; add prefix names. Allow only
unicode escapes in IRIs. Fix the grammar rule for IRIs.
Use the same escapes as SPARQL (add \b and \f, remove \>).
Received on Saturday, 19 November 2011 18:02:14 UTC