Re: Escape sequences (SPARQL and Turtle) from Andy Seaborne on 2011-11-22 (public-rdf-dawg@w3.org from October to December 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Tue, 22 Nov 2011 21:36:07 +0000
To: public-rdf-dawg@w3.org
Message-ID: <4ECC15C7.8060703@epimorphics.com>
My apologies - I was looking at an old Turtle document.

http://www.w3.org/TR/turtle/

allows unicode escapes in prefix names and IRIs (by the grammar) but 
does not say what characters are allowed.  They may include spaces.

 Andy

On 19/11/11 18:01, Andy Seaborne wrote:
> This is addressing the working group note in the query doc (bullets 2
> and 3).
>
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammar
>
>
> ==== SPARQL Proposal
>
> == tl;dr
>
> 1/ Change SPARQL so that \u escapes happen inside strings, IRIs and
> prefix names only. Character escapes appear in string only. This
> approach is the same design as Turtle.
>
> 2/ Suggest to RDF-WG that Turtle and SPARQL be the same - that is, keep the
> Turtle approach with fixes for its inconsistencies, i.e. character
> escapes appear in strings only; no escape \>; add \b and \f. \u and \U
> can appear in IRIs and prefixed names as well as strings.
>
>
> The rest of this message is a quite detailed assessment of where we are
> and what the change would mean. But it does feature the snowman.
>
> Andy
>
>
> == Current situation
>
> There are two kinds of escapes:
>
> character escapes -- \t, \n \r \b \f \" \' \\
>
> These represent a single codepoint and also turn off any special meaning
> like string delimiter or newline.
>
> unicode escapes: \u1234 and \U12345678
>
> Unicode escapes allow systems to handle characters outside the range of
> the current input system and output font. Like our friend the unicode
> snowman \u2603 ☃ (if your font has it) or accented characters \u00E9 é
> or Japanese (\u5E03\u77F3 布石 (fuseki)).
>
> Snowman:
> http://www.fileformat.info/info/unicode/char/2603/index.htm
>
> The value is the unicode codepoint, not the hex code of the UTF-8 bytes.
> That does not mean that UTF-8 to codepoint conversion must be done,
> because UTF-8 encodes each codepoint separately. A system can encode the
> codepoint of a \u or \U escape as UTF-8, insert those bytes into the
> input stream, and it will just work.
>
> It's also a way to write "\u5E03\u77F3" for "布石" and not risk
> corruption (binary/text messing around).
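To make the codepoint point concrete, here is a minimal Python sketch. The function name decode_unicode_escapes is made up for illustration and is not taken from either spec. It maps each \u or \U escape to the codepoint it names, and then shows that producing UTF-8 bytes is a separate encoding step.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text):
    # Hypothetical helper: replace each escape with the character whose
    # codepoint is given by the hex digits. chr() raises ValueError for
    # values above U+10FFFF.
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return UNICODE_ESCAPE.sub(replace, text)


snowman = decode_unicode_escapes(r"\u2603")
print(len(snowman))                    # one codepoint: 1
print(snowman.encode("utf-8").hex())   # its three UTF-8 bytes: e29883
print(decode_unicode_escapes(r"\u5E03\u77F3"))
```

The escape carries 2603, the codepoint; the bytes E2 98 83 only appear once the character is encoded as UTF-8.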
>
> = SPARQL
>
> Character escapes can occur in strings (" ", ' ', """ """, ''' ''').
> They are converted to their real character after parsing, and any
> special meaning of the character is turned off.
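A rough Python sketch of the character escape step, applied to the body of a string token after the parser has found its delimiters. The name unescape_string_body and the error handling are my own assumptions; only the list of escapes comes from the message. Unicode escapes are deliberately left out of this sketch.

```python
# Character escapes listed in the message: \t \n \r \b \f \" \' \\
CHAR_ESCAPES = {
    "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}


def unescape_string_body(body):
    # Hypothetical helper: 'body' is the text between the string delimiters,
    # exactly as written in the query or Turtle document.
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        code = body[i + 1]
        if code not in CHAR_ESCAPES:
            # Assumption: anything else (including \>) is rejected here.
            raise ValueError("unknown character escape: \\" + code)
        out.append(CHAR_ESCAPES[code])
        i += 2
    return "".join(out)


print(unescape_string_body(r"say \"hi\"\n") == 'say "hi"\n')
```

Because the conversion happens after the token has been recognised, an escaped \" never terminates the string.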
>
> Unicode escapes can occur anywhere. They are dealt with as part of the
> character input stream, so decoding happens before any parsing takes
> place. So a unicode escape can stand for any character, anywhere:
>
> ASK \u007B\u007D
> A\u0053K\u0020\u007B\u007D
>
> are both seen by the parser as "ASK {}"
>
> Unicode escapes can occur in IRIs and prefix names.
>
> In SPARQL, the only escapes in IRIs are \u and \U.
>
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#codepointEscape
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammarEscapes
>
> = Turtle
>
> Turtle does not have \f or \b character escapes and it adds \>.
>
> Both character escapes and unicode escapes are applied after parsing
> inside strings (short and long) and IRIs but not prefixed names.
>
> There are special rules: \" is only allowed in strings (odd - the "
> character is legal in IRIs), but \' is allowed in an IRI.
>
> \> is only allowed in IRIs (where ">" is illegal by IRI rules).
> But the grammar production does not allow you to type \> in!
>
> "<" ( [^<>\"{}|^`\\] - [#0000-#0020] )* ">"
>
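One way to check that claim is to transcribe the production quoted above into a Python regular expression. The transcription and the name is_iri_ref are mine, so treat this as a sketch of that one production only, not of the full Turtle grammar.

```python
import re

# Transcription of:  "<" ( [^<>\"{}|^`\\] - [#0000-#0020] )* ">"
# The excluded range #0000-#0020 is folded into the negated class.
IRI_REF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')


def is_iri_ref(text):
    # True only if the whole text matches the production.
    return IRI_REF.fullmatch(text) is not None


print(is_iri_ref("<http://example/store/id=1234>"))   # accepted
print(is_iri_ref(r"<http://example/a\>b>"))           # rejected
```

The second input is rejected because the character class excludes the backslash, so \> can never be typed inside the angle brackets under this rule.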
>
> Two suggestions are pending for Turtle:
>
> T1/ Allow unicode escapes in prefixed names.
> T2/ Allow the unicode escapes in prefix names to pass in a wider
> character set than the prefix name production allows.
>
> Two characters of note for T2 are "=" (U+003D) and ":" (U+003A)
>
> The argument for "=" is that it is used in automatic generation of IRIs
> from SQL databases, so there is a case for allowing abbreviated input
> for <http://example/store/id=1234> as ex:id=1234. But "=" is illegal in
> a prefix name, so it would be written ex:id\u003D1234.
>
> The argument for ":" is that the Facebook Open Graph Protocol
> http://developers.facebook.com/docs/opengraph
>
> uses property names containing ":", for example: og:audio:title
>
> <html xmlns:og="http://ogp.me/ns#">
> <head>
> ...
> [REQUIRED TAGS]
> <meta property="og:audio" content="http://example.com/amazing.mp3" />
> <meta property="og:audio:title" content="Amazing Song" />
> <meta property="og:audio:artist" content="Amazing Band" />
> <meta property="og:audio:album" content="Amazing Album" />
> <meta property="og:audio:type" content="application/mp3" />
> ...
> </head>
>
> then you can't write: (Turtle)
>
> <http://example/page> og:audio:title "Amazing Song" .
> <http://example/page> og:audio\u003Atitle "Amazing Song" .
>
> You can't write og:audio\u003Atitle in SPARQL and have it parse. The
> \u003A is converted to ":" and the parser sees:
>
> og:audio:title
>
> which is not a single prefixed name.
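The difference between decoding before parsing and decoding after parsing can be sketched in Python. Everything here is a simplification I am assuming for illustration: split_prefixed_name treats the first ":" as the separator and rejects a raw ":" in the local part, which mirrors the rule described above rather than the real grammar productions.

```python
import re

UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text):
    # Replace \uXXXX / \UXXXXXXXX with the character they name.
    return UNICODE_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def split_prefixed_name(token):
    # Simplified rule (assumption): one ":" separates prefix and local part,
    # and the local part may not contain a raw ":".
    prefix, sep, local = token.partition(":")
    if not sep or ":" in local:
        raise ValueError("not a single prefixed name: %r" % token)
    return prefix, local


def decode_then_parse(token):
    # Input-stream decoding: escapes are gone before the parser runs.
    return split_prefixed_name(decode_unicode_escapes(token))


def parse_then_decode(token):
    # Token-level decoding: the parser sees the escape, decoding comes after.
    prefix, local = split_prefixed_name(token)
    return decode_unicode_escapes(prefix), decode_unicode_escapes(local)


token = r"og:audio\u003Atitle"
print(parse_then_decode(token))
try:
    decode_then_parse(token)
except ValueError as err:
    print("rejected:", err)
```

With token-level decoding the local part comes out as audio:title; with input-stream decoding the parser is handed og:audio:title and rejects it.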
>
> Note that even if "=" is escaped in, the result is still required to be
> a legal IRI after prefix name to IRI conversion.
>
> Turtle editors' working draft:
> http://dvcs.w3.org/hg/rdf/raw-file/tip/rdf-turtle/index.html#sec-grammar
>
> = The base name idiom
>
> Another way to abbreviate IRIs is to use the base:
>
> @base <http://ogp.me/ns#> .
>
> Downside: you can have only one BASE in SPARQL, and only one active
> @base in Turtle (it can change between blocks of triples).
>
> ... <audio:title> ...
>
> A relative URI cannot start with a segment containing a ":" (RFC 3986).
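Python's standard urllib.parse shows the effect, as a quick check rather than a full RFC 3986 conformance test. The base and reference are the ones from the example above.

```python
from urllib.parse import urljoin, urlsplit

base = "http://ogp.me/ns#"
reference = "audio:title"

# The text before the first ":" is read as a scheme name.
print(urlsplit(reference).scheme)   # audio

# So the reference is treated as absolute and the base is ignored.
print(urljoin(base, reference))     # audio:title
```

So <audio:title> does not resolve against the base at all; it is read as an IRI in its own right with the scheme "audio".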
>
> = Many prefixes
>
> @prefix og-audio: <http://ogp.me/ns#audio:> .
>
> ... og-audio:title ...
>
> Downside is that you do have many prefixes.
>
> = Opinion
>
> Of
>
> og:audio\u003Atitle
> and
> <http://ogp.me/ns#audio:title>
>
> I find the <> form quite adequate because the NS is short.
>
> The use of "id=" could equally have been "id_" -- the use of "=" was not
> forced.
>
> == Proposal
>
> There is a desire to make SPARQL and Turtle as much the same as is
> reasonable.
>
> For SPARQL:
>
> Change the Unicode escaping to only happen inside strings, IRIs and
> prefix names (prefix part and local part) and remove it from the input
> character processing.
>
> The practical effect is small (use of \u in comments does not make a
> query illegal) because \u is used only in those places in the deployed
> world.
>
> For Turtle:
>
> Keep the current rule for strings; add prefix names. In IRIs, allow only
> unicode escapes. Fix the grammar rule for IRIs.
>
> Use the same escapes as SPARQL (add \b and \f, remove \>).
>
Received on Tuesday, 22 November 2011 21:36:42 UTC