- From: Andy Seaborne <andy.seaborne@epimorphics.com>
- Date: Tue, 22 Nov 2011 21:36:07 +0000
- To: public-rdf-dawg@w3.org
My apologies - I was looking at an old Turtle document.

http://www.w3.org/TR/turtle/ allows unicode escapes in prefix names and
IRIs (by the grammar) but does not say what characters are allowed. It
may include spaces.

	Andy

On 19/11/11 18:01, Andy Seaborne wrote:
> This is addressing the working group note in the query doc (bullets 2
> and 3).
>
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammar
>
>
> ==== SPARQL Proposal
>
> == tl;dr
>
> 1/ Change SPARQL so that \u escapes happen inside strings, IRIs and
> prefix names only. Character escapes appear in strings only. This
> approach is the same design as Turtle.
>
> 2/ Suggest to RDF-WG that Turtle and SPARQL are the same - that is, keep
> the Turtle approach with fixes for its inconsistencies, i.e. character
> escapes appear in strings only; no escape \>; add \b and \f. \u and \U
> can appear in IRIs and prefixed names as well as strings.
>
>
> The rest of this message is a quite detailed assessment of where we are
> and what the change would mean. But it does feature the snowman.
>
> Andy
>
>
> == Current situation
>
> There are two kinds of escapes:
>
> character escapes -- \t, \n, \r, \b, \f, \", \', \\
>
> These represent a single codepoint and also turn off any special
> meaning like string delimiter or newline.
>
> unicode escapes : \u1234 and \U12345678
>
> Unicode escapes allow systems to handle characters outside the range of
> the current input system and output font. Like our friend the unicode
> snowman \u2603 ☃ (if your font has it) or accented characters \u00E9 é
> or Japanese (\u5E03\u77F3 布石 (fuseki)).
>
> Snowman:
> http://www.fileformat.info/info/unicode/char/2603/index.htm
>
> The value is the unicode codepoint, not the hex code of UTF-8 bytes.
> That does not mean that UTF-8 to codepoint must be done because UTF-8
> encodes each codepoint separately. A system can encode the codepoint of
> a \u or \U as UTF-8 bytes, insert them into the input stream, and it
> will just work.
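
A minimal Python sketch of the point above, that the hex digits of \u and \U name a codepoint and not UTF-8 bytes. The function name decode_unicode_escapes and the regular expression are illustrative assumptions; they are not taken from the SPARQL or Turtle grammars.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text):
    """Replace each \\u / \\U escape with the character for that codepoint."""
    def to_char(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return ESCAPE.sub(to_char, text)


snowman = decode_unicode_escapes(r"\u2603")
print(snowman == "\u2603")            # True: the escape names codepoint U+2603
print(snowman.encode("utf-8").hex())  # e29883: the UTF-8 bytes are not "2603"
```

Once the codepoint is known, encoding it as UTF-8 and splicing the bytes into the input needs no knowledge of the neighbouring characters, which is why the approach described above works.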
>
> It's also a way to write "\u5E03\u77F3" for "布石" and not risk
> corruption (binary/text messing around).
>
> = SPARQL
>
> Character escapes can occur in strings (" ", ' ', """ """, ''' ''').
> They are converted to their real character after parsing, and any
> special meaning of the character is turned off.
>
> Unicode escapes can occur anywhere. They are dealt with as part of the
> character input stream, so decoding happens before any parsing takes
> place. So a unicode escape can be anything, anywhere:
>
> ASK \u007B\u007D
> A\u0053K\u0020\u007B\u007D
>
> are both seen by the parser as "ASK {}"
>
> Unicode escapes can occur in IRIs and prefix names.
>
> In SPARQL, the only escapes in IRIs are \u and \U.
>
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#codepointEscape
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#grammarEscapes
>
> = Turtle
>
> Turtle does not have \f or \b character escapes and it adds \>.
>
> Both character escapes and unicode escapes are applied after parsing
> inside strings (short and long) and IRIs but not prefixed names.
>
> There are special rules: \" is only allowed in strings (odd - the "
> character is legal in IRIs), but \' is allowed in an IRI.
>
> \> is only allowed in IRIs (where it's illegal by IRI rules)
> But the grammar production does not allow you to type \> in!
>
> "<" ( [^<>\"{}|^`\\] - [#0000-#0020] )* ">"
>
>
> Two suggestions are pending for Turtle:
>
> T1/ Allow unicode escapes in prefixed names.
> T2/ Allow the unicode escapes in prefix names to pass in a wider
> character set than the prefix name production allows.
>
> Two characters of note for T2 are "=" (U+003D) and ":" (U+003A)
>
> The argument for "=" is that it is used in automatic generation of IRIs
> from SQL databases, so there is a case for allowing abbreviated input
> for <http://example/store/id=1234> as ex:id=1234, except "=" is illegal
> there, so ex:id\u003D1234.
>
> The argument for ":" is that the Facebook Open Graph Protocol uses it
> in property names:
> http://developers.facebook.com/docs/opengraph
>
> for example: og:audio:title
>
> <html xmlns:og="http://ogp.me/ns#">
> <head>
> ...
> [REQUIRED TAGS]
> <meta property="og:audio" content="http://example.com/amazing.mp3" />
> <meta property="og:audio:title" content="Amazing Song" />
> <meta property="og:audio:artist" content="Amazing Band" />
> <meta property="og:audio:album" content="Amazing Album" />
> <meta property="og:audio:type" content="application/mp3" />
> ...
> </head>
>
> then you can't write (Turtle):
>
> <http://example/page> og:audio:title "Amazing Song" .
> <http://example/page> og:audio\u003Atitle "Amazing Song" .
>
> You can't write og:audio\u003Atitle in SPARQL and have it parse. The
> \u003A is converted to ":" and the parser sees:
>
> og:audio:title
>
> which is not a single prefixed name.
>
> Note that even if "=" is escaped in, the result is still required to be
> a legal IRI after prefix name to IRI conversion.
>
> Turtle editors working draft:
> http://dvcs.w3.org/hg/rdf/raw-file/tip/rdf-turtle/index.html#sec-grammar
>
> = The base name idiom
>
> Another way to abbreviate IRIs is to use the base:
>
> @base <http://ogp.me/ns#> .
>
> Downside: you can have only one BASE in SPARQL, and only one active
> @base in Turtle (it can change between blocks of triples).
>
> ... <audio:title> ...
>
> A relative URI can not start with a segment containing a ":" (RFC 3986),
> so <audio:title> is read as an absolute IRI with scheme "audio" and is
> not resolved against the base.
>
> = Many prefixes
>
> @prefix og-audio: <http://ogp.me/ns#audio:> .
>
> ... og-audio:title ...
>
> Downside is that you do have many prefixes.
>
> = Opinion
>
> Of
>
> og:audio\u003Atitle
> and
> <http://ogp.me/ns#audio:title>
>
> I find the <> form quite adequate because the NS is short.
>
> The use of "id=" could equally have been "id_" -- the use of "=" was not
> forced.
>
> == Proposal
>
> There is a desire to make SPARQL and Turtle as much the same as is
> reasonable.
>
> For SPARQL:
>
> Change the Unicode escaping to only happen inside strings, IRIs and
> prefix names (prefix part and local part) and remove it from the input
> character processing.
>
> The practical effect is small (use of \u in comments does not make a
> query illegal) because \u is used only in those places in the deployed
> world.
>
> For Turtle:
>
> Keep the current rule for strings; add prefix names. Only allow unicode
> escapes in IRIs. Fix the grammar rule for IRIs.
>
> Use the same escapes as SPARQL (add \b and \f, remove \>).
>
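
A rough Python sketch of the difference between decoding \u escapes on the input stream (current SPARQL) and decoding them inside a token after parsing (the proposal), using the og:audio\u003Atitle case from the quoted message. Everything here is a toy: parse_prefixed_name is a hypothetical stand-in for the real grammar, and the sketch assumes an escape may introduce a ":" that the prefixed name production otherwise forbids (suggestion T2), which the message leaves open.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode(text):
    """Replace each \\u / \\U escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def parse_prefixed_name(token):
    """Toy rule: split 'prefix:local' and reject a raw ':' in the local part."""
    prefix, sep, local = token.partition(":")
    if not sep or ":" in local:
        raise ValueError("not a single prefixed name: " + token)
    return prefix, local


def stream_level(token):
    # Current SPARQL: escapes are decoded before the parser sees anything.
    return parse_prefixed_name(decode(token))


def token_level(token):
    # Proposal: the parser sees the raw token; escapes are decoded afterwards.
    prefix, local = parse_prefixed_name(token)
    return decode(prefix), decode(local)


name = r"og:audio\u003Atitle"
print(token_level(name))   # one prefixed name with a ":" in the local part
try:
    stream_level(name)
except ValueError as err:
    print(err)             # the parser saw og:audio:title instead
```
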
Received on Tuesday, 22 November 2011 21:36:42 UTC