Re: unicode escapes in prefix names

* Richard Cyganiak [2011-11-23 23:38+0000]
> On 23 Nov 2011, at 22:06, Eric Prud'hommeaux wrote:
> >  FILTER (?kinase != kinase:Cyclin_D\u002FCdk4
> …
> >  FILTER (?kinase != <>
> Presumably it is essential whether the thing being compared to is 002FCdk4, 2FCdk4, FCdk4, Cdk4, or dk4? Because that's no longer visible in the first form (except to the geeks of the geeks). There is quite some potential for confusion and errors in that.
> I do prefer the second form.

Seen alone, I do as well, but in the context of a larger graph or graph pattern, I prefer seeing a consistent representation that conveys types or roles.

> >          && ?kinase != kinase:MECOM)
> *That* is actually clear and pretty, but once you have unicode escapes in the local name it isn't.
> > I believe it is a win, and that, as more relational data makes it onto the SemWeb, we'll see more specialized domain data with tokens which require either escaping or expansion into long IRIs.
> Often these tokens will originally contain characters that are not allowed in IRIs, requiring %-encoding. Now you would end up with prefixed names that contain a mix of %-encoding and \u-escaping. At that point, users are better served by just copy&pasting the entire IRI en bloc.

Turning that around a bit, '%'s aren't allowed in PNames. So if you *do* want to use PNames (and we've established that you don't and I do), process:package-\u003Earmor is the only way to write process:package->armor .

I guess a dogmatic answer is that those %'d characters are part of an opaque identifier, e.g. <http://生物活性.cn/使用者/史密斯/周知/细胞主动/%7C细胞凋>, and that cell-process:\u00257C细胞凋亡, in addition to allowing me to logically group things by type, gives me a chance to express terms in my local language.

> > Specifically disabling escaping for prefixed names,
> I don't propose specifically disabling it. I object to adding it to the list of places where it's specifically enabled.
> > which is the only place it's really useful,
> You're presuming that “Cyclin_D\u002FCdk4” is a really useful form for reading or writing the string “Cyclin_D/Cdk4”.

Escaping there allows us to use prefixed names where we could not otherwise. Other than that, the *only* value of escaping is for editing unicode queries in ASCII editors (or for folks who want to obscure their text).
Escapes made sense in Turtle because it was specifically ASCII. I'm not convinced they offer value in a UTF-8 language in the modern world, but if we do complicate the language with them, let's use them where users would expect them, specifically, for working within parsing constraints.

> > will introduce needless confusion and annoyance. No one has to use the escaped form while they're dinking with the query, but once it's done, layout and readability will count for a lot.
> I think you overstate the layout issue. Consistently using full IRIs for instances, e.g., using the full IRI for kinase:MECOM in the example above, will give you consistent layout alright, if consistent layout is what you're after.
> And the claim of increased readability is a dubious one. It saves 30 characters, but it obfuscates the identifier, and makes the queries and data unreadable if you don't know your unicode syntax and code points.

Reading the query, I'm not as concerned with back-calculating the actual spelling of the identifiers as I am with the roles of the terms, which the query author has communicated to me in the prefixes. I don't expect anyone to memorize the codepoints; "Cyclin_D mumble Cdk4" is fine as long as people know it's a protein complex.

> I am curious whom you see as writing those queries that involve unicode escapes in prefixed names? Do you expect the average SPARQL query author (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries? Or do you see some automated tool doing the job? Or something/someone else?

I expect that for now it will be people who dink around with a query to get it to work and then take a few extra minutes to format it and comment it, perhaps for collaboration, justification or for later maintenance.

It's pretty nice from a serializer perspective to be able to work within a template (user configuration says that this namespace gets this prefix, perhaps via prefixes used in parsing earlier in the pipeline) and serialize the local name, escaping anything that's illegal in a local name.
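
As a rough sketch of that serializer step (Python, purely illustrative): the prefix table, the example.org namespace and the function names are all made up for this example, and the "legal" character set is a deliberate simplification of the real local-name grammar. It also assumes the \u escapes in local names that this thread is debating.

```python
import re

# Hypothetical prefix table; in practice this would come from user
# configuration or from prefixes seen earlier in the pipeline.
PREFIXES = {
    "kinase": "http://example.org/kinase/",
}

# Simplified stand-in for the local-name grammar (an assumption, not the
# real production): ASCII letters, digits, '_', '-' and any non-ASCII
# character pass through; every other character gets escaped.
_ILLEGAL = re.compile(r"[^A-Za-z0-9_\-\u0080-\U0010FFFF]")

def escape_local(local):
    """Replace each character the simplified grammar rejects with \\uXXXX."""
    return _ILLEGAL.sub(lambda m: "\\u%04X" % ord(m.group(0)), local)

def to_pname(iri, prefixes=PREFIXES):
    """Serialize iri as prefix:local if a namespace matches, else as <iri>."""
    best = None
    for prefix, ns in prefixes.items():
        # Prefer the longest matching namespace.
        if iri.startswith(ns) and (best is None or len(ns) > len(best[1])):
            best = (prefix, ns)
    if best is None:
        return "<%s>" % iri
    prefix, ns = best
    return "%s:%s" % (prefix, escape_local(iri[len(ns):]))

print(to_pname("http://example.org/kinase/Cyclin_D/Cdk4"))
print(to_pname("http://example.org/kinase/MECOM"))
```

With this table, the first call yields kinase:Cyclin_D\u002FCdk4 and the second kinase:MECOM, so both terms keep the same prefix in the serialized query.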

Some day, tools requiring varying levels of expertise may hide some or all of this from users via various semaphores, ranging from, e.g., an emacs mode which displays the escaped characters (as do the HTML emacs modes) to a fancy draggy-droppy query interface. The latter is only relevant to this use case to the degree that it tries to serialize the query with consistent prefixes.
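
The display side of such a tool is small. A minimal sketch (in Python rather than emacs lisp; the function name display_form is invented here, and it assumes only the \uXXXX and \UXXXXXXXX escape forms):

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def display_form(text):
    """Return text with codepoint escapes shown as the characters they name."""
    def unescape(m):
        cp = int(m.group(1) or m.group(2), 16)
        # Leave the escape untouched if it names no valid code point.
        return chr(cp) if cp <= 0x10FFFF else m.group(0)
    return _ESCAPE.sub(unescape, text)

print(display_form("FILTER (?kinase != kinase:Cyclin_D\\u002FCdk4)"))
```

The stored query keeps the escaped form; only what the reader sees changes, here to kinase:Cyclin_D/Cdk4.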

> Finally, another example – can you discuss the relative usefulness of these two guys?
> <>
> rdfwg-hg:rdf-turtle\u002Findex.html

Yeah, that's an interesting CURIE use case. Alone, the former is more readable, but in the context of a bunch of other rdfwg-hg:* assertions, the "\u002F" could be less disruptive than the full URI.

> Best,
> Richard


Received on Thursday, 24 November 2011 15:30:37 UTC