Re: unicode escapes in prefix names

On 24 Nov 2011, at 15:30, Eric Prud'hommeaux wrote:
>>> FILTER (?kinase != kinase:Cyclin_D\u002FCdk4
>> …
>>> FILTER (?kinase != <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4>
>> 
>> Presumably it is essential whether the thing being compared to is 002FCdk4, 2FCdk4, FCdk4, Cdk4, or dk4? Because that's no longer visible in the first form (except to the geeks of the geeks). There is quite some potential for confusion and errors in that.
>> 
>> I do prefer the second form.
> 
> seen alone, i as well, but in the context of a larger graph or graph pattern, i prefer seeing a consistent representation conveying types or roles.

There are more appropriate mechanisms – variable names, comments, and the full IRIs themselves – for conveying types and roles.

>> Often these tokens will originally contain characters that are not allowed in IRIs, requiring %-encoding. Now you would end up with prefixed names that contain a mix of %-encoding and \u-escaping. At that point, users are better served by just copy&pasting the entire IRI en bloc.
> 
> Turning that around a bit, '%'s aren't allowed in PNames. So if you *do* want to use PNames (and we've established that you don't and I do), process:package-\u003Earmor is the only way to write process:package->armor .

No, because “>” isn't allowed in IRIs, so the IRI would contain “package-%3Earmor”, and if you want to write that as a prefixed name it becomes process:package-\u00253Earmor. Read, debug and maintain that… It gets even more fun if it was package->12345, which turns into package-\u00253E12345.
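To make the double encoding concrete, here is a rough Python sketch of the two steps described above: first %-encode the local part, then \u-escape the “%” so the result could sit in a prefixed name. The function names are made up for illustration, and urllib's quote() is stricter than the IRI rules (it would also %-encode non-ASCII characters), so treat this as an approximation rather than a reference implementation.

```python
import re
from urllib.parse import quote, unquote


def to_prefixed_local(local: str) -> str:
    """Hypothetical helper: %-encode, then \\u-escape every '%'."""
    # Step 1: '>' is not allowed in an IRI, so it becomes '%3E'.
    # Note: quote() also encodes non-ASCII text, which IRIs would allow raw.
    percent_encoded = quote(local, safe="")
    # Step 2: '%' is not allowed in a prefixed name, so it becomes '\u0025'.
    return percent_encoded.replace("%", "\\u0025")


def decode_u_escapes(text: str) -> str:
    """Replace each \\uXXXX (exactly four hex digits) with its character."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


def from_prefixed_local(escaped: str) -> str:
    """Hypothetical inverse: undo the \\u-escaping, then the %-encoding."""
    return unquote(decode_u_escapes(escaped))


if __name__ == "__main__":
    print(to_prefixed_local("package->armor"))
    print(to_prefixed_local("package->12345"))
    print(decode_u_escapes("Cyclin_D\\u002FCdk4"))
```

The decoder only ever consumes four hex digits after “\u”, which is exactly the rule a reader has to apply in their head to tell where the escape in “\u00253E12345” ends.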

> I guess a dogmatic answer is that those %'d characters are part of an opaque identifier, e.g. <http://生物活性.cn/使用者/史密斯/周知/细胞主动/%7C细胞凋>, and that cell-process:\u003E7C细胞凋亡, in addition to allowing me to logically group things by type, gives me a chance to express terms in my local language.

Prefixed names are for shortening appropriately designed IRIs. You want to (ab)use them for something else – as a means of inserting documentation into your query, and then find that it doesn't work very well. SPARQL has comments!

>> You're presuming that “Cyclin_D\u002FCdk4” is a really useful form for reading or writing the string “Cyclin_D/Cdk4”.
> 
> Escaping there allows us to use prefixed names where we could not otherwise. Other than that, the *only* value of escaping is for editing unicode queries in ASCII editors (or for folks who want to obscure their text). Escapes made sense in Turtle because it was specifically ASCII.

Turtle was never ASCII, it was always UTF-8.

> I'm not convinced they offer value in a UTF-8 language in the modern world, 

Me neither, but I don't see a case for removing them, and it's reasonable to have them for N-Triples compatibility, so we can just accept their existence as legacy.

> but if we do complicate the language with them, let's use them where users would expect them, specifically, for working within parsing constraints.

You allege that users expect to be able to get around syntax constraints using unicode escapes. I don't think that's well-founded. Most languages don't work that way – you can't get around the syntax constraints imposed on identifiers using unicode escapes in any of XML, SQL, Java, JavaScript, SPARQL 1.0, CSV, ASN.1 or just about any other language I can think of. What makes you believe that users expect to be able to avoid constraints on identifier tokens using unicode escapes in Turtle, when this isn't possible in other languages?

> Reading the query, I'm not as concerned with back-calculating the actual spelling of the identifiers as I am with the roles of the terms which the query author has communicated to me in the prefixes.

I think that being able to read the original identifiers is often more important than having an extra layer of annotation on terms used in the query.

Also, you're presuming that people actually use well-thought-out namespace prefixes that serve as good documentation. That's not always the case – often it's just a: b: c:, or something that made sense to Bob when he wrote the query but doesn't make sense to Alice. Often it's acronyms that are used inconsistently – think of DBpedia, where db:, dbp:, dbp-prop:, dbo:, db-ont:, dbp-ont:, dbpedia:, dbr: can all be found on a regular basis and you have to check the prefix declaration anyway to figure out what the query means. Prefixes that are written without a lot of care add yet another layer of indirection that the query reader has to decipher to make sense of the query.

>> I am curious whom you see as writing those queries that involve unicode escapes in prefixed names? Do you expect the average SPARQL query author (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries? Or do you see some automated tool doing the job? Or something/someone else?
> 
> I expect that for now it will be people who dink around with a query to get it to work and then take a few extra minutes to format it and comment it, perhaps for collaboration, justification or for later maintenance.

You're avoiding the question. Do you expect average SPARQL query authors (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries with unicode escapes? If not, then who is writing them?

> Some day, tools requiring varying levels of expertise may hide users from some to all of this via various semaphores

Yes – if we were at that stage already then this wouldn't be a big issue.

I still don't understand your reasoning at all. If you want to write “Cyclin_D/Cdk4” in a prefixed name, then why are you pushing for a half-assed non-solution like kinase:Cyclin_D\u002FCdk4 instead of an actually useful and readable approach that has precedent, like regex-style kinase:Cyclin_D\/Cdk4?
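For comparison, a regex-style scheme could be sketched in Python roughly like this. It is only an illustration: the function names and the set of escapable characters are my assumptions, not anything a spec defines.

```python
import re

# Assumed set of characters that would need escaping in a local name;
# chosen for illustration only.
ESCAPABLE = "/>%"
_ESCAPE_RE = re.compile("([" + re.escape(ESCAPABLE) + "])")


def backslash_escape(local: str) -> str:
    """Put a backslash in front of each escapable character."""
    return _ESCAPE_RE.sub(lambda m: "\\" + m.group(1), local)


def backslash_unescape(escaped: str) -> str:
    """Drop the backslash and keep the character that follows it."""
    return re.sub(r"\\(.)", lambda m: m.group(1), escaped)


if __name__ == "__main__":
    print(backslash_escape("Cyclin_D/Cdk4"))
    print(backslash_unescape("Cyclin_D\\/Cdk4"))
```

With this style the escaped character itself stays visible in the query text, so nobody has to look up a code point to see what is being compared.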

Best,
Richard

Received on Thursday, 24 November 2011 16:45:57 UTC