Re: unicode escapes in prefix names

* Richard Cyganiak <richard@cyganiak.de> [2011-11-24 16:45+0000]
> On 24 Nov 2011, at 15:30, Eric Prud'hommeaux wrote:
> >>> FILTER (?kinase != kinase:Cyclin_D\u002FCdk4
> >> …
> >>> FILTER (?kinase != <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4>
> >> 
> >> Presumably it is essential whether the thing being compared to is 002FCdk4, 2FCdk4, FCdk4, Cdk4, or dk4? Because that's no longer visible in the first form (except to the geeks of the geeks). There is quite some potential for confusion and errors in that.
> >> 
> >> I do prefer the second form.
> > 
> > seen alone, i as well, but in the context of a larger graph or graph pattern, i prefer seeing a consistent representation conveying types or roles.
> 
> There are more appropriate mechanisms – variable names, comments, and the full IRIs themselves – for conveying types and roles.
> 
> >> Often these tokens will originally contain characters that are not allowed in IRIs, requiring %-encoding. Now you would end up with prefixed names that contain a mix of %-encoding and \u-escaping. At that point, users are better served by just copy&pasting the entire IRI en bloc.
> > 
> > Turning that around a bit, '%'s aren't allowed in PNames. So if you *do* want to use PNames (and we've established that you don't and I do), process:package-\u003Earmor is the only way to write process:package->armor .
> 
> No, because “->” isn't allowed in IRIs, so it would be “package-%3Earmor”, and if you want to write that as a prefixed name it becomes process:package-\u00253Earmor. Read, debug and maintain that… It gets even more fun if it was package->12345, which turns into package-\u00253E12345.
> 
> > I guess a dogmatic answer is that those %'d characters are part of an opaque identifier, e.g. <http://生物活性.cn/使用者/史密斯/周知/细胞主动/%7C细胞凋>, and that cell-process:\u003E7C细胞凋亡, in addition to allowing me to logically group things by type, gives me a chance to express terms in my local language.
> 
> Prefixed names are for shortening appropriately designed IRIs. You want to (ab)use them for something else – as a means of inserting documentation into your query, and then find that it doesn't work very well. SPARQL has comments!

I've not seen anyone rely on comments when they can rely on namespace prefixes. For example
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  
  SELECT DISTINCT ?name
  WHERE { 
      ?x rdf:type foaf:Person . 
      ?x foaf:name ?name
  }
needs no documentation and
  PREFIX foaf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdf: <http://xmlns.com/foaf/0.1/>
  
  SELECT DISTINCT ?name
  WHERE { 
      ?x foaf:type rdf:Person . # for everyone of RDF type FOAF Person
      ?x rdf:name ?name         #     get their FOAF name
  }
is downright antisocial.


> >> You're presuming that “Cyclin_D\u002FCdk4” is a really useful form for reading or writing the string “Cyclin_D/Cdk4”.
> > 
> > Escaping there allows us to use prefixed names where we could not otherwise. Other than that, the *only* value of escaping is for editing unicode queries in ASCII editors (or for folks who want to obscure their text). Escapes made sense in Turtle because it was specifically ASCII.
> 
> Turtle was never ASCII, it was always UTF-8.

Sorry, s/Turtle/N-Triples/

> > I'm not convinced they offer value in a UTF-8 language in the modern world, 
> 
> Me neither, but I don't see a case for removing them, and it's reasonable to have them for N-Triples compatibility, so we can just accept their existence as legacy.
> 
> > but if we do complicate the language with them, let's use them where users would expect them, specifically, for working within parsing constraints.
> 
> You allege that users expect to be able to get around syntax constraints using unicode escapes. I don't think that's well-founded. Most languages don't work that way – you can't get around the syntax constraints imposed on identifiers using unicode escapes in any of XML, SQL, Java, Javascript, SPARQL 1.0, CSV, ASN.1 or just about any other language I can think of. What makes you believe that users expect to be able to avoid constraints on identifier tokens using unicode escapes in Turtle, when this isn't possible in other languages?

Most of these languages have pretty conventional escaping for the parts where someone is dealing with arbitrary text:
  XML: <ex>1 &lt;&#x3C; 3</ex>  <ex title="call me &quot;Bob&#x22;"/>
  SQL: SELECT "call me \"Bob\"" AS title;  SELECT * FROM toy WHERE name LIKE '%\\%%'; -- matches ab%cd
  Java: counter-example (iirc), pre-processes escapes (like SPARQL)
  Javascript (and C, PHP, etc): "call me \"Bob\u0022"
  SPARQL 1.0: like Java
  CSV: "call me ""Bob"""
  ASN.1: isn't this more at the level of BNF or DTD? I expect that specific ASN.1 languages define their own escapes. For instance, X.509 uses ASN.1's DER so e.g. issuerName?XOU=Our CN=Prud\27hommeaux',C=US%
In all of these, you can generate literals to, e.g., match a given input or generate a particular output. In SPARQL, the range of things we must match includes IRIs.
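A minimal sketch of that point, assuming Python's standard library as a stand-in for the languages listed above (json.dumps stands in for a Javascript-style string literal, and the helper name csv_field is made up for illustration). Each format has a conventional way to carry arbitrary text inside a literal:

```python
import csv
import io
import json
from xml.sax.saxutils import escape

TEXT = 'call me "Bob"'

def csv_field(value):
    """Serialize one value as a single CSV field (hypothetical helper)."""
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    # The csv module terminates rows with "\r\n"; drop it to keep just the field.
    return buf.getvalue().rstrip("\r\n")

# CSV doubles the embedded quotes and wraps the field in quotes.
csv_form = csv_field(TEXT)

# A Javascript-style literal backslash-escapes the embedded quotes.
json_form = json.dumps(TEXT)

# XML element content replaces "<" with an entity reference.
xml_form = escape("1 << 3")

print(csv_form)
print(json_form)
print(xml_form)
```

None of these mechanisms covers identifier tokens; they only show the literal case described above.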


> > Reading the query, I'm not as concerned with back-calculating the actual spelling of the identifiers as I am with the roles of the terms which the query author has communicated to me in the prefixes.
> 
> I think that being able to read the original identifiers is often more important than having an extra layer of annotation on terms used in the query.
> 
> Also, you're presuming that people actually use well-thought-out namespace prefixes that serve as good documentation. That's not always the case – often it's just a: b: c:, or something that made sense to Bob when he wrote the query but doesn't make sense to Alice. Often it's acronyms that are used inconsistently – think of DBpedia, where db:, dbp: dbp-prop:, dbo:, db-ont:, dbp-ont:, dbpedia:, dbr: can all be found on a regular basis and you have to check the @prefix declaration anyways to figure out what the query means. Prefixes that are written without a lot of care add yet another layer of indirection that the query reader has to decipher to make sense of the query.

I'm presuming that *some* people use well-thought-out namespace prefixes.


> >> I am curious whom you see as writing those queries that involve unicode escapes in prefixed names? Do you expect the average SPARQL query author (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries? Or do you see some automated tool doing the job? Or something/someone else?
> > 
> > I expect that for now it will be people who dink around with a query to get it to work and then take a few extra minutes to format it and comment it, perhaps for collaboration, justification or for later maintenance.
> 
> You're avoiding the question. Do you expect average SPARQL query authors (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries with unicode escapes? If not, then who is writing them?

Yes, I mean that some SPARQL authors will choose to use escaped prefixed names instead of full IRIs. (I find it trivial in emacs because I can write the character and use a macro to expand it to a \u code.)
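The same expansion can be sketched outside emacs. This is a rough Python illustration, not the macro itself: the function names and the set of characters treated as safe are assumptions made for the example, not the SPARQL or Turtle grammar. Non-ASCII characters are left alone so that local-language terms stay readable.

```python
import re

# Assumption: ASCII characters treated as safe in the local part of a
# prefixed name. A simplification for illustration, not the real grammar.
_SAFE = re.compile(r"[A-Za-z0-9_\-]")

def escape_local_name(local):
    """Expand each unsafe ASCII character into a \\uXXXX escape."""
    out = []
    for ch in local:
        if ord(ch) > 0x7F or _SAFE.fullmatch(ch):
            out.append(ch)  # keep safe ASCII and all non-ASCII as written
        else:
            out.append("\\u%04X" % ord(ch))
    return "".join(out)

def prefixed_name(prefix, local):
    """Join a prefix and an escaped local part (hypothetical helper)."""
    return prefix + ":" + escape_local_name(local)

print(prefixed_name("kinase", "Cyclin_D/Cdk4"))
print(prefixed_name("process", "package-%3Earmor"))
```

Under these assumptions the two calls reproduce the forms discussed earlier in the thread, kinase:Cyclin_D\u002FCdk4 and process:package-\u00253Earmor.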


> > Some day, tools requiring varying levels of expertise may hide users from some to all of this via various semaphores
> 
> Yes – if we were at that stage already then this wouldn't be a big issue.
> 
> I still don't understand your reasoning at all. If you want to write “Cyclin_D/Cdk4” in a prefixed name, then why are you pushing for a half-assed non-solution like kinase:Cyclin_D\u002FCdk4 instead of an actually useful and readable approach that has precedent, like regex-style kinase:Cyclin_D\/Cdk4 ?

Two reasons:
  I pushed a bit for CURIEs. That was killed because we couldn't get 100% coverage of what's escaped and what's not. I still want to be able to use prefixes.
    http://www.w3.org/2005/01/yacker/uploads/SPARQL_CURIE?lang=perl&markup=html#productions

  I think that current SPARQL and Turtle are less intuitive to programmers who are used to writing escapes when they need them. Either get rid of the escapes or make them logical.


> Best,
> Richard

-- 
-ericP

Received on Thursday, 24 November 2011 18:40:15 UTC