Re: unicode escapes in prefix names from Richard Cyganiak on 2011-11-23 (public-rdf-wg@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Wed, 23 Nov 2011 23:38:28 +0000
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Gavin Carothers <gavin@carothers.name>, Andy Seaborne <andy.seaborne@epimorphics.com>, RDF-WG <public-rdf-wg@w3.org>
Message-Id: <D1680264-B27E-4C5C-B3FE-F915CAADFDDF@cyganiak.de>

On 23 Nov 2011, at 22:06, Eric Prud'hommeaux wrote:
>  FILTER (?kinase != kinase:Cyclin_D\u002FCdk4
…
>  FILTER (?kinase != <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4>

Presumably it matters whether the thing being compared against ends in 002FCdk4, 2FCdk4, FCdk4, Cdk4, or dk4? That is no longer visible in the first form (except to the geekiest of geeks), and it leaves quite some potential for confusion and errors.

I do prefer the second form.
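
To make the ambiguity concrete, here is a minimal sketch of how such an escape would have to be decoded. It assumes that \u is always followed by exactly four hex digits (and \U by exactly eight), and that kinase: is bound to <http://www.bootstrep.eu/instances/cyclin-dependent/>, which is only inferred from the two forms quoted above. The function names are made up for illustration and do not come from any SPARQL or Turtle implementation.

```python
import re

# Assumption: \u takes exactly four hex digits, \U exactly eight.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text: str) -> str:
    """Replace each \\uXXXX or \\UXXXXXXXX escape with its code point."""
    def _substitute(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(_substitute, text)


def expand_prefixed_name(pname: str, prefixes: dict) -> str:
    """Expand prefix:local into a full IRI, decoding escapes in the local part."""
    prefix, _, local = pname.partition(":")
    return prefixes[prefix] + decode_unicode_escapes(local)


# Hypothetical prefix binding, inferred from the example in this thread.
PREFIXES = {"kinase": "http://www.bootstrep.eu/instances/cyclin-dependent/"}

if __name__ == "__main__":
    print(expand_prefixed_name(r"kinase:Cyclin_D\u002FCdk4", PREFIXES))
```

Under those assumptions the escape consumes "002F" and nothing else, so the local name ends in Cdk4. A reader has to know the four-digit rule to see that.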

>          && ?kinase != kinase:MECOM)

*That* is actually clear and pretty, but once you have unicode escapes in the local name it isn't.

> I believe it is a win, and that, as more relational data makes it onto the SemWeb, we'll see more specialized domain data with tokens which require either escaping or expansion into long IRIs.

Often these tokens will originally contain characters that are not allowed in IRIs, requiring %-encoding. Now you would end up with prefixed names that contain a mix of %-encoding and \u-escaping. At that point, users are better served by just copying and pasting the entire IRI en bloc.
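
As a rough illustration of that mix, the sketch below takes a hypothetical source token "Cyclin D/Cdk4" (with a space, which is not in the example above), %-encodes what is not allowed in an IRI, and then \u-escapes the slash so the result could sit in a prefixed name under the proposal being discussed. The function name and the choice to escape only "/" are assumptions made for the example.

```python
from urllib.parse import quote


def to_escaped_local_name(token: str) -> str:
    """Hypothetical helper: %-encode a raw token, then \\u-escape the slash."""
    # Step 1: %-encode characters not allowed in an IRI path; keep "/" as-is.
    percent_encoded = quote(token, safe="/")
    # Step 2: "/" cannot appear literally in a local name, so \u-escape it.
    return percent_encoded.replace("/", "\\u002F")


if __name__ == "__main__":
    print(to_escaped_local_name("Cyclin D/Cdk4"))
```

The resulting local name carries two different escaping layers for what started out as a three-word label.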

> Specifically disabling escaping for prefixed names,

I don't propose specifically disabling it. I object to adding it to the list of places where it's specifically enabled.

> which is the only place it's really useful,

You're presuming that “Cyclin_D\u002FCdk4” is a really useful form for reading or writing the string “Cyclin_D/Cdk4”.

> will introduce needless confusion and annoyance. No one has to use the escaped form while they're dinking with the query, but once it's done, layout and readability will count for a lot.

I think you overstate the layout issue. Consistently using full IRIs for instances, e.g., using the full IRI for kinase:MECOM in the example above, will give you consistent layout all right, if consistent layout is what you're after.

And the claim of increased readability is a dubious one. It saves 30 characters, but it obfuscates the identifier, and it makes the queries and data unreadable if you don't know your Unicode escape syntax and code points.

I am curious whom you see as writing those queries that involve unicode escapes in prefixed names? Do you expect the average SPARQL query author (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries? Or do you see some automated tool doing the job? Or something/someone else?

Finally, another example – can you discuss the relative usefulness of these two guys?
<http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html>
rdfwg-hg:rdf-turtle\u002Findex.html

Best,
Richard

Received on Wednesday, 23 November 2011 23:39:01 UTC