Re: Aligning Turtle and SPARQL escape sequence processing. from Richard Cyganiak on 2011-11-22 (public-rdf-wg@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Tue, 22 Nov 2011 19:48:59 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: RDF-WG <public-rdf-wg@w3.org>
Message-Id: <A83D8A96-8850-4042-9EE1-644704D51C1C@cyganiak.de>

On 22 Nov 2011, at 17:43, Andy Seaborne wrote:
> T1/ Allow unicode escapes in prefixed names.
> 
> Unicode escapes allow systems to handle characters outside the range of the current input system and output font, and they avoid the risk of corruption (binary/text messing around).

I don't think fonts have anything to do with it.

The argument that people can use unicode escapes when they don't know how to type the character is a bit of a canard as well – they copy-paste it from somewhere or pull down the operating system's character palette app.

Corruption – yes, that's an issue, but Turtle and SPARQL are always UTF-8, making it easier to get this right. Those who can't get UTF-8 right won't get unicode escapes right either.

> Any unicode escape sequence is just a way of typing the character - it does not turn off special meanings unlike character escapes.
> 
> Accented characters \u00E9  é or Japanese (\u5E03\u77F3 布石).
> 
> If you have them at all anywhere, having them where the real characters are already legal seems consistent.

1. Turtle, both as currently specified and as implemented, disagrees, and that outweighs the consistency argument in my mind.

2. The proposal allows escaping of *all* characters, but you don't seem to be arguing that unicode escapes should be allowed in the strings “@prefix” or “@base” or in the rdf:type shortcut “a”. So it's not even that consistent.

>  Why one place and not another?

The current situation around escaping in RDF is already a glorious mess. Let me illustrate this with an example, say, querying DBpedia:

    // Special characters in literals…?

    "Éire"      – Works!
    "\u00C9ire" – Works!

    // Ok, easy enough. What about IRIs?

    <http://dbpedia.org/resource/Éire>      – Doesn't work :-(
    <http://dbpedia.org/resource/\u00C9ire> – Doesn't work :-(
    <http://dbpedia.org/resource/%C3%89ire> – Works!

    // Strange… So what about prefixed names?

    dbpedia:%C3%89ire       – Doesn't work :-(
    dbpedia:Éire            – Doesn't work :-(
    dbpedia:\u00C9ire       – Doesn't work :-(
    dbpedia:\u00C3\u0089ire – Doesn't work :-(

    // Oh well, back to IRIs I guess.

Now the proposal adds to that mess by introducing *another* way of writing the same thing, with *no* increase in expressivity. (The results for all the cases above are unaffected by the proposal – the DBpedia IRI simply cannot be written as a prefixed name.)
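
How the spellings in the example relate can be sketched in a few lines of Python. This is an illustration only: it assumes that Python's own \u string escapes and urllib.parse percent-encoding behave like the Turtle/SPARQL escapes for this one name. Since RDF compares IRIs character by character, the percent-encoded form and the form with a literal É are different IRIs.

```python
from urllib.parse import quote, unquote

name = "Éire"

# A unicode escape is only another way of typing the same character.
assert "\u00C9ire" == name

# Percent-encoding works on the UTF-8 bytes of É (0xC3 0x89),
# so it yields a different character sequence, hence a different IRI.
encoded = quote(name)
print(encoded)  # %C3%89ire
assert unquote(encoded) == name

# \u00C3\u0089 spells two other characters (what É looks like when its
# UTF-8 bytes are misread as Latin-1), so it does not name É either.
mojibake = name.encode("utf-8").decode("latin-1")
assert "\u00C3\u0089ire" == mojibake
assert mojibake != name
```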

As it stands, none of the following IRIs can be written as prefixed names – they all have to be written as full IRIs:

   1. <%C3%89ire>
   2. <search?q=eire>
   3. <Galway,_Ireland>
   4. <Éire> if you don't know how to type É but know that you can use \u00C9 instead
   5. <U.S.>
   6. <United%20Kingdom>

The proposal adds a whole bunch of complexity to the story that one needs to tell to explain how the hell prefixed names work, and what we get in return is a solution for the case that matters least – number 4 – while all the others still don't work and require falling back to full IRIs.
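
The six cases can be run against a deliberately simplified stand-in for the local-name rule. The function below is a hypothetical approximation for illustration, not the Turtle or SPARQL grammar: it accepts letters, digits, "_" and "-", allows "." anywhere except at the start or end, and rejects everything else.

```python
def is_simple_local_name(local: str) -> bool:
    """Rough approximation of a prefixed-name local part (an assumption,
    not the real PN_LOCAL production)."""
    def name_char(ch: str) -> bool:
        return ch.isalnum() or ch in "_-"

    if not local:
        return False
    first_ok = local[0].isalnum() or local[0] == "_"
    last_ok = name_char(local[-1])
    middle_ok = all(name_char(ch) or ch == "." for ch in local)
    return first_ok and last_ok and middle_ok


candidates = [
    "%C3%89ire",
    "search?q=eire",
    "Galway,_Ireland",
    "Éire",          # typed directly
    "\\u00C9ire",    # typed as a literal backslash escape
    "U.S.",
    "United%20Kingdom",
]

results = {c: is_simple_local_name(c) for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate:18} {'ok' if ok else 'needs a full IRI'}")
```

Under this approximation only the directly typed "Éire" passes. Allowing the escaped spelling would change the result for that one line and for none of the others.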

Escaping in IRIs and literals is necessary for backwards compatibility and for Oracle's ASCII-Triples. Adding escaping to prefixed names is *not* necessary as there is already a way of escaping them: expand to a full IRI and use unicode escapes there.
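
A minimal sketch of that fallback, assuming a hypothetical prefix table and helper names (PREFIXES, escape_non_ascii and expand are made up for this example): expand the prefixed name to a full IRI, then replace every non-ASCII character with a \u or \U escape.

```python
# Hypothetical prefix table; only the DBpedia namespace from the example above.
PREFIXES = {"dbpedia": "http://dbpedia.org/resource/"}


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def expand(prefix: str, local: str) -> str:
    """Expand prefix:local to a full, ASCII-only IRI reference."""
    return "<" + escape_non_ascii(PREFIXES[prefix] + local) + ">"


print(expand("dbpedia", "Éire"))  # <http://dbpedia.org/resource/\u00C9ire>
```

The output is the escaped spelling of the IRI containing a literal É. It is still not the percent-encoded IRI that DBpedia uses.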

Best,
Richard

Received on Tuesday, 22 November 2011 19:49:30 UTC