Re: unicode escapes in prefix names from Eric Prud'hommeaux on 2011-11-23 (public-rdf-wg@w3.org from November 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 23 Nov 2011 17:06:39 -0500
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Gavin Carothers <gavin@carothers.name>, Andy Seaborne <andy.seaborne@epimorphics.com>, RDF-WG <public-rdf-wg@w3.org>
Message-ID: <20111123220637.GD9496@w3.org>
* Richard Cyganiak <richard@cyganiak.de> [2011-11-23 17:01+0000]
> On 23 Nov 2011, at 15:49, Eric Prud'hommeaux wrote:
> >> In many ways, expanding the prefix and wrapping everything into <…> is a friendlier escaping mechanism than looking up unicode code points.
> > 
> > I see it as momentarily easier to author, but much harder to read
> 
> So you find “\u00C9” easier to read than “É”?
> 
> You find “United\u002520Kingdom” easier to mentally parse than “United%20Kingdom” (ugly as it is)?
> 
> You find “ts16\u003A44\u003A28Z” easier to mentally parse than “ts16:44:28Z”?
> 
> You're arguing that prefixed names with the former forms are easier to use than full IRIs with the latter forms. I don't believe that one second. At best you're moving a turd from one pocket to another.

These are all fragments of examples. Let's peer at Cyclin_D/Cdk4, which is implicated in cell proliferation in human malignancies:

Data:
  @prefix gro: <http://www.bootstrep.eu/ontology/GRO/> .
  @prefix kinase: <http://www.bootstrep.eu/instances/cyclin-dependent/> .
  
  kinase:Cyclin_D\u002FCdk4 a kinase:Compund ;
     rdfs:label "Cyclin_D1/CDK4 complex" .
  protein:Cyclin_D ro:proper_part_of kinase:Cyclin_D\u002FCdk4 .
  gene:Cdk4        ro:proper_part_of kinase:Cyclin_D\u002FCdk4 .
  [ a gro:RegulationOfTranscription ;
    gro:hasParticipant protein:GATA1 , kinase:Cyclin_D\u002FCdk4 ] .

vs.

  @prefix gro: <http://www.bootstrep.eu/ontology/GRO/> .
  @prefix kinase: <http://www.bootstrep.eu/instances/cyclin-dependent/> .
  
  <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> a kinase:Compund ;
     rdfs:label "Cyclin_D1/CDK4 complex" .
  protein:Cyclin_D ro:proper_part_of <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> .
  gene:Cdk4        ro:proper_part_of <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> .
  [ a gro:RegulationOfTranscription ;
    gro:hasParticipant protein:GATA1 , <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> ] .


Writing <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> everywhere obscures the data quite a lot, but what's more likely to need maintenance are queries that, e.g., monitor PubMed articles for interactions between this complex and another:

SELECT ?kinase (COUNT(*) AS ?rank) {
  kinase:Cyclin_D\u002FCdk4 sc:mentioned-in ?article .
  kinase:MECOM sc:mentioned-in ?article .
  ?kinase sc:mentioned-in ?article
  FILTER (?kinase != kinase:Cyclin_D\u002FCdk4
          && ?kinase != kinase:MECOM)
} GROUP BY ?kinase
HAVING (?rank > 2)
ORDER BY DESC(?rank)

vs.

SELECT ?kinase (COUNT(*) AS ?rank) {
  <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4> sc:mentioned-in ?article .
  kinase:MECOM sc:mentioned-in ?article .
  ?kinase sc:mentioned-in ?article
  FILTER (?kinase != <http://www.bootstrep.eu/instances/cyclin-dependent/Cyclin_D/Cdk4>
          && ?kinase != kinase:MECOM)
} GROUP BY ?kinase
HAVING (?rank > 2)
ORDER BY DESC(?rank)

In the data and query, we can see how the prefixes are very helpful in identifying the roles and types of terms in the graph. The latter data and query are needlessly more difficult to parse.
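
To make the equivalence of the two forms concrete, here is a minimal Python sketch of how a parser might expand an escaped prefixed name into a full IRI. It is an illustration only, not code from any existing Turtle or SPARQL implementation: the function names `unescape` and `expand` and the `http://example.org/kinase/` namespace are assumptions made up for this example.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape(text):
    """Replace each \\uXXXX or \\UXXXXXXXX escape with the character it names."""
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def expand(pname, prefixes):
    """Expand 'prefix:local' to a full IRI, un-escaping the local part."""
    prefix, _, local = pname.partition(":")
    return prefixes[prefix] + unescape(local)


# Hypothetical namespace, for illustration only.
PREFIXES = {"kinase": "http://example.org/kinase/"}

print(expand(r"kinase:Cyclin_D\u002FCdk4", PREFIXES))
# prints: http://example.org/kinase/Cyclin_D/Cdk4
```

Under these assumptions, the escaped prefixed name and the long <...> form denote the same IRI; the only difference is what the author has to read and type.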


> > , Debug or maintain;
> 
> I'm not sure that a query littered with unicode escapes is easy to debug or maintain. Surely, if a query doesn't work, one of the things you need to check is whether all the colons, full stops, commas, percent signs and plus signs that the query authors elected to unicode-escape in order to be able to squeeze IRIs into prefixed names are indeed correct, or if they mixed up a \u0025 for a \u002d somewhere. You gain neat layout, shorter tokens and hopefully less duplication, but introduce a new source of potential errors that cannot be found by eyeballing the query but require unicode lookup tables. This is not a debug/maintenance win.

I believe it is a win, and that, as more relational data makes it onto the SemWeb, we'll see more specialized domain data with tokens which require either escaping or expansion into long IRIs. Specifically disabling escaping for prefixed names, which is the only place it's really useful, will introduce needless confusion and annoyance. No one has to use the escaped form while they're dinking with the query, but once it's done, layout and readability will count for a lot.
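
On the worry about mixing up a \u0025 for a \u002d: checking escapes need not involve a printed lookup table. The following Python sketch lists every escape in a piece of query text next to the character it denotes and that character's Unicode name. The helper name `describe_escapes` is hypothetical; the only library call it relies on is the standard `unicodedata.name`.

```python
import re
import unicodedata

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def describe_escapes(text):
    """Return (escape, character, Unicode name) for each escape found in text."""
    found = []
    for m in _UCHAR.finditer(text):
        # chr() raises ValueError for values above U+10FFFF; a real tool
        # would want to report that rather than crash.
        ch = chr(int(m.group(1) or m.group(2), 16))
        found.append((m.group(0), ch, unicodedata.name(ch, "<unnamed>")))
    return found


query_line = r"kinase:Cyclin_D\u002FCdk4 sc:mentioned-in ?article ."
for escape, ch, name in describe_escapes(query_line):
    print(f"{escape} -> {ch!r} ({name})")
# prints: \u002F -> '/' (SOLIDUS)
```

A check like this could run in an editor or a lint step, so the escaped form can be kept for layout while the decoded characters stay one command away.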


> > a large community would exploit escapes in prefixed names.
> 
> My crystal ball disagrees with your crystal ball here.
> 
> For perspective, let's keep in mind that an *actually* large community is currently adopting an RDF-derived syntax that had prefixed names abolished altogether (microdata).
> 
> > I'm not sure I see the reasoning against including escapes in the grammar for prefixed names. It's a minimal grammar delta from allowing them in IRIs and literals (I added <U_CHAR> to <PN_CHARS_BASE> in <http://www.w3.org/2005/01/yacker/uploads/turtleEsc?lang=perl&markup=html#term-turtleEsc-UCHAR>). It doesn't allow any more invalid IRI forms than does <IRI_REF> (and we can always demand implementors validate against [^<>\"{}|^`\\] - [#0000-#20]), and it's trivial for an implementor to call the same un-escaping code for prefixed name components as they call for literals and IRIs.
> 
> I don't dispute that it's an easy enough change for the spec editors and for implementers. I say that it would be a bad change because it doesn't result in benefits for users, authors, or implementers.
> 
> (You claim that some author benefits would result; I say that they would materialize only for a small subset of authors – those who have memorized Unicode tables – while making life more difficult for the rest.)
> 
> > it's closer to syntactic compatibility with SPARQL 1.0 escapes
> 
> Which I argue is a broken design.

I think that keeping this form of backward compatibility does not propagate the design flaws you are trying to fix (given that there will be escaping in IRIs and literals anyway).
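
The implementation point quoted above (one un-escaping routine shared by IRIs and prefixed names, with validation against [^<>\"{}|^`\\] - [#0000-#20]) can be sketched in Python as follows. This is a rough illustration under assumptions, not the yacker grammar or any shipping parser: the function names and the `http://example.org/kinase/` namespace are invented for the example.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Characters the quoted production excludes from an IRI:
# < > " { } | ^ ` \ and everything from U+0000 through U+0020.
_FORBIDDEN = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def unescape(text):
    """Shared un-escaping step, used for both IRI references and prefixed names."""
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def check_iri(iri):
    """Raise ValueError if the un-escaped IRI contains an excluded character."""
    bad = _FORBIDDEN.search(iri)
    if bad:
        raise ValueError(f"character {bad.group(0)!r} not allowed in IRI {iri!r}")
    return iri


def resolve_iri_ref(token):
    """Resolve an '<...>' token: strip the brackets, un-escape, validate."""
    return check_iri(unescape(token[1:-1]))


def resolve_pname(pname, prefixes):
    """Resolve 'prefix:local': un-escape the local part, concatenate, validate."""
    prefix, _, local = pname.partition(":")
    return check_iri(prefixes[prefix] + unescape(local))


# Hypothetical namespace, for illustration only.
PREFIXES = {"kinase": "http://example.org/kinase/"}

print(resolve_pname(r"kinase:Cyclin_D\u002FCdk4", PREFIXES))
# prints: http://example.org/kinase/Cyclin_D/Cdk4
```

In this sketch an escaped space such as kinase:a\u0020b is rejected by the same check that would reject it inside <...>, which is the sense in which prefixed names admit no more invalid IRI forms than <IRI_REF> does.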


> >> Not everyone is a Unicode geek with an obsession for orderly query layout ;-)
> > 
> > I may agree with your second point, but I'm pretty sure that the 7 billionth happy unicode geek was born at the end of October.
> >  http://a57.foxnews.com/static/managed/img/Scitech/396/223/Peru%207%20Billionth%20Person.jpg
> 
> Not sure what you're trying to get at here. I don't think she can tell an É from a \u00C9 yet. Most of the 7 billion should never have to.

Agreed; that was meant as a joke.


> Best,
> Richard

-- 
-ericP
Received on Wednesday, 23 November 2011 22:07:10 UTC