Re: unicode escapes in prefix names

* Richard Cyganiak <richard@cyganiak.de> [2011-11-24 20:43+0000]
> On 24 Nov 2011, at 18:39, Eric Prud'hommeaux wrote:
> >> Prefixed names are for shortening appropriately designed IRIs. You want to (ab)use them for something else – as a means of inserting documentation into your query, and then find that it doesn't work very well. SPARQL has comments!
> > 
> > I've not seen anyone rely on comments when they can rely on namespace prefixes. For example
> >  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> > 
> >  SELECT DISTINCT ?name
> >  WHERE { 
> >      ?x rdf:type foaf:Person . 
> >      ?x foaf:name ?name
> >  }
> > needs no documentation and
> >  PREFIX foaf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >  PREFIX rdf: <http://xmlns.com/foaf/0.1/>
> > 
> >  SELECT DISTINCT ?name
> >  WHERE { 
> >      ?x foaf:type rdf:Person . # for everyone of RDF type FOAF Person
> >      ?x rdf:name ?name         #     get their FOAF name
> >  }
> > is downright antisocial.
> 
> That's a different case – the rdf: and foaf: prefixes are fixed by convention and practically everybody knows them. That's not the kind of namespace we are talking about here – all the terms in these namespaces can already be prefix-abbreviated because they were designed for this.
> 
> We are talking about instance-level namespaces that you currently can't prefix-abbreviate.
> 
> People commonly just write out the full IRIs. They manage. See DBpedia. Comments are available in SPARQL for documentation in cases where it's needed.

In the queries I've seen, people use prefixes whenever possible. In the interest of documentation, here are some non-type, non-predicate prefixed names mined from databases and expressed in literature:
  http://people.csail.mit.edu/pcm/tempISWC/workshops/SWPM2010/InvitedPaper_6.pdf "neurolex:Entorhinal_cortex", "doid:DOID_10652"
  http://www.biomedcentral.com/1471-2105/10/S10/S10 "db:Alzheimer_disease_pathway"
People do use prefixed names for subjects and objects.


> >> You allege that users expect to be able to get around syntax constraints using unicode escapes. I don't think that's well-founded. Most languages don't work that way – you can't get around the syntax constraints imposed on identifiers using unicode escapes in any of XML, SQL, Java, Javascript, SPARQL 1.0, CSV, ASN.1 or just about any other language I can think of. What makes you believe that users expect to be able to avoid constraints on identifier tokens using unicode escapes in Turtle, when this isn't possible in other languages?
> > 
> > Most of these languages have pretty conventional escaping for the parts where someone is dealing with arbitrary text:
> 
> We are not talking about unicode escaping in strings. We are talking about unicode escaping in restricted-syntax tokens. Some languages allow unicode escapes in identifiers, some don't. None, as far as I can see, allows expanding the range of identifier characters using unicode escapes.

If you are talking about language identifiers, I mostly agree. Most conventional languages (apart from Lisp) don't allow the programmer to create variable identifiers with escapes. But that's not the point. In Turtle and SPARQL BGPs, we're talking about representing *data*, which is where most languages (C-like and XML-like) provide escapes to allow the programmer to match arbitrary data without having to abandon convenient syntaxes.


> > In all of these, you can generate literals to e.g match a given input or generate a particular output. In SPARQL, the range of things we must match includes IRIs.
> 
> Yes, and SPARQL allows writing any character in any IRI using as many unicode escapes as you like. What it doesn't do is allow expanding the range of legal characters in restricted-syntax tokens using unicode escapes.

Arbitrary escapes are allowed in non-abbreviated form of IRIs; they are simply not allowed in prefixed names.


> I repeat my question: What makes you believe that users expect to be able to avoid syntax constraints on these tokens using unicode escapes in Turtle, when this isn't possible in other languages?

Here's an example: when I was working with Alan Ruttenberg and Jonathan Rees a long time ago, we had to get around '.'s in prefixed names (not then allowed in SPARQL). Alan's first question was: can't we use the escaping system? (Presumably because this is how he expects escapes to work, given his experience with escapes in other languages.) I had to say that it wouldn't work because the escapes were already substituted at lexing time.
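
To make the lexing-time point concrete, here is a minimal Python sketch. It is not taken from any real SPARQL implementation: PNAME is a deliberately simplified, ASCII-only stand-in for the prefixed-name token, and the function names are made up for illustration.

```python
import re

# Simplified, ASCII-only stand-in for the prefixed-name token (assumption:
# the real SPARQL grammar admits many more characters than this).
PNAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*:[A-Za-z0-9_][A-Za-z0-9_-]*")

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits) codepoint escapes.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def substitute_escapes(text):
    """Replace every codepoint escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def is_prefixed_name(token):
    """Substitute escapes first, then test the result against the token pattern."""
    return PNAME.fullmatch(substitute_escapes(token)) is not None
```

With this ordering, kinease:Cyclin_D\u002FCdk4 has already become kinease:Cyclin_D/Cdk4 by the time the token pattern is applied, so the '/' is rejected exactly as if it had been typed directly. An escape for a character that is already legal in the token (say \u005F for '_') passes, which is why the escape adds no expressivity here.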


> > I'm presuming that *some* people use well-thought-out namespace prefixes.
> 
> What's wrong with expecting these same people to comment their queries?

Is this an argument for getting rid of prefixes altogether?
What's wrong with allowing people to use the obvious tools?


> >> Do you expect average SPARQL query authors (perhaps a domain expert or DBA-type person with some RDF background) to hand-write those queries with unicode escapes? If not, then who is writing them?
> > 
> > Yes, I mean that some SPARQL authors will choose to use escaped prefix names instead of full IRIs. (I find it trivial in emacs because I can write the character and use a macro to expand it to a \u code.)
> 
> Yeah but you're the 1%.

You have to look pretty hard to find an experienced programmer who doesn't understand escapes.


> The average SPARQL author doesn't use emacs macro. The average SPARQL author is a second-year student in India who can't set up their classpath in Eclipse. If we're lucky, in the future the average SPARQL author will be more like the average SQL author – who still doesn't use emacs, and doesn't have a clue what Unicode is.
> 
> >>> Some day, tools requiring varying levels of expertise may hide users from some to all of this via various semaphores
> >> 
> >> Yes – if we were at that stage already then this wouldn't be a big issue.
> >> 
> >> I still don't understand your reasoning at all. If you want to write “Cyclin_D/Cdk4” in a prefixed name, then why are you pushing for a half-assed non-solution like kinease:Cyclin_D\u002FCdk4 instead of an actually useful and readable approach that has precedent, like regex-style kinease:Cyclin_D\/Cdk4 ?
> > 
> > Two reasons:
> >  I pushed a bit for CURIES. That was killed because we couldn't get 100% coverage of what's escaped and what's not. I still want to be able to use prefixes.
> 
> I wasn't asking about CURIEs. CURIEs can't work in SPARQL. I asked about regex-style backslash escaping. That's readable (compared to unicode escapes), useful and has precedent. Why are you not pushing for that?
> 
> >  I think that current SPARQL and Turtle are less intuitive to programmers who are used to writing escapes when they need them.
> 
> You mean yourself?
> 
> > Either get rid of them or make them logical.
> 
> That's what I want too. What you propose isn't logical. Unicode escape sequences are for transmitting a larger set of characters using a smaller set of characters, not for dealing with limited character ranges in a grammar and delimiter collisions.

Entity encoding is used in XML both to represent codepoints which are not intended to be interpreted as markup and to represent arbitrary codepoints which may exceed the range of the document's character encoding.

I am not opposed to offering common escape sequences to make encoding simpler and easier. That strikes me as much less conservative than simply permitting escapes in prefixed names (where they were allowed in SPARQL 1.0, albeit without benefit of extending the expressivity of prefixed names).
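
For comparison, here is a rough Python sketch of what regex-style escapes in the local part of a prefixed name could look like. Everything in it is an assumption made for illustration: the prefix map, the example.org IRI, the set of escapable characters, and the function name expand.

```python
# Hypothetical prefix map; the IRI is made up for illustration.
PREFIXES = {"kinease": "http://example.org/kinease/"}

# Assumed set of punctuation that may follow a backslash in a local name.
ESCAPABLE = set("/.~!$&'()*+,;=?#@%_-")


def expand(pname, prefixes=PREFIXES):
    """Expand prefix:local to a full IRI, dropping the backslash of each escape."""
    prefix, local = pname.split(":", 1)
    out = []
    chars = iter(local)
    for ch in chars:
        if ch == "\\":
            # The character after the backslash is taken literally.
            nxt = next(chars, None)
            if nxt not in ESCAPABLE:
                raise ValueError(f"bad escape in {pname!r}")
            out.append(nxt)
        else:
            out.append(ch)
    return prefixes[prefix] + "".join(out)
```

Under these assumptions, kinease:Cyclin_D\/Cdk4 expands to the hypothetical IRI ending in Cyclin_D/Cdk4, because the escape is resolved when the name is expanded rather than before tokenizing.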

-- 
-ericP

Received on Monday, 28 November 2011 16:54:54 UTC