Re: unicode escapes in prefix names from Gavin Carothers on 2011-11-23 (public-rdf-wg@w3.org from November 2011)

From: Gavin Carothers <gavin@carothers.name>
Date: Tue, 22 Nov 2011 17:20:12 -0800
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, RDF-WG <public-rdf-wg@w3.org>
Message-ID: <CAPqY83w4zTu8p3DC3geVyn=hN5idE=Lpg=LPieQxrb5GSfn9oQ@mail.gmail.com>
On Tue, Nov 22, 2011 at 3:41 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> Hi Andy,
>
> On 22 Nov 2011, at 21:04, Andy Seaborne wrote:
>> With a goal of maximising compatibility between Turtle and SPARQL, maximising compatibility from both heritages is important.
>>
>> SPARQL 1.0 allows \u in prefix names (and in fact uniformly)
>
> Allowing escapes everywhere was a design mistake in SPARQL 1.0 IMHO. I was looking forward to seeing this fixed in 1.1.
>
>> SPARQL is already changing to accommodate Turtle in a major way for implementers
>
> I would argue that SPARQL is changing to avoid a security risk in SPARQL Update:
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Aug/0010.html

Obfuscated comments are not really a security risk. They are perhaps a
typo risk, but if the user has permissions to delete every triple in
the database the issue is not about how they wrote a comment. I do
agree that obfuscated comments are a poor idea, just not that they are
a security issue.
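
For anyone who hasn't followed the DAWG thread, here is a rough Python
sketch of what an "obfuscated comment" looks like. It assumes SPARQL
1.0 style processing, where codepoint escapes are decoded before the
grammar is applied (that is how I read the spec). The helper names and
the naive comment stripper are made up for illustration and are not
taken from any real parser.

```python
import re

# \uXXXX or \UXXXXXXXX, the two codepoint escape forms
UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_uchar(text):
    """Replace every codepoint escape with the character it names."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def live_lines(text):
    """Naive comment stripping: drop everything from '#' to end of line.

    Ignores '#' inside IRIs and strings, which is fine for this demo only.
    """
    stripped = (line.split("#", 1)[0].strip() for line in text.splitlines())
    return [line for line in stripped if line]

update = r"# tidy up later \u000A DELETE WHERE { ?s ?p ?o }"

print(live_lines(update))                # comments stripped first: nothing left
print(live_lines(decode_uchar(update)))  # escapes decoded first: DELETE is live
```

Whether that counts as a security problem or a typo problem is the
disagreement above; the sketch only shows the mechanism.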

>
>> Turtle can make a smaller change to accommodate SPARQL.
>> (smaller because it does not change the design of a Turtle parser as it does to a SPARQL one)
>
> Turtle would thereby acquire the same security risk.
>
> Transmitting potentially sensitive information in a format that supports obfuscation of element boundaries is not a good idea. I'm not aware of a single other format besides SPARQL 1.0 that has this “feature”.
>
> Turtle should not support obfuscation of element boundaries.

Okay, so we should limit the use of escape sequences to specific
tokens. SPARQL 1.0 effectively allows escape sequences in all tokens.
The previous decision was to allow escaping in as many tokens as
seemed reasonable. However, the UCHAR production is not specific
enough to limit which characters can be used in a given token. The
Team Submission isn't much help here either:

<\u003E> is still perfectly fine by its grammar, as is <\u0000>.
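
To make that concrete, a minimal sketch of the kind of check a parser
would need if escapes are allowed inside <...>: decode first, then
test the result against the characters the raw token may not contain.
The excluded set is an assumption borrowed from the SPARQL 1.0 IRI_REF
production, not something either Turtle document states, and the
function names are mine.

```python
import re

UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# Assumed exclusions: < > " { } | ^ ` \ plus everything from U+0000 to U+0020
EXCLUDED = set('<>"{}|^`\\') | {chr(c) for c in range(0x00, 0x21)}

def decode_uchar(text):
    """Replace every codepoint escape with the character it names."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def survives_decoding(iri_body):
    """True if the decoded IRI body contains no excluded character."""
    return not any(ch in EXCLUDED for ch in decode_uchar(iri_body))

for body in (r"\u003E", r"\u0000", r"\u00C9ire"):
    print(body, survives_decoding(body))
```

A grammar that only says "UCHAR is allowed here" accepts all three;
only the post-decoding check tells them apart.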

>
>> More inline - some of your examples are about %-encoding in prefixed names and not about unicode escapes.
>
> They are real-world examples taken from DBpedia, the poster child of RDF datasets. Yes of course they are about %-encoding and the fact that DBpedia uses %-encoded characters in IRIs where it doesn't have to. Such is the state of deployed reality. The examples illustrate the complexity that users already have to deal with regarding a rather simple question – how to deal with the “É” in “Éire” when figuring out how to query DBpedia. This is the reason why any additional complexity should be motivated by benefits to users and document authors, not by dependencies between specs or modest implementation issues.
>
>>> As it stands, none of the following IRIs can be written as prefixed
>>> names – they all have to be written as full IRIs:
>>>
>>> 1.<%C3%89ire>
>>
>> This isn't about encoding.
>
> Right – it's about the complexity that authors already face in this area.

Yeah, it's about IRI normalisation. Depending on which IRI
normalisation one is expecting, <Éire> and <%C3%89ire> could be the
same. I believe DBpedia is "wrong" in storing the %-escaped form.
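
A small Python illustration of the two views, using only the standard
library. Plain string comparison treats the two forms as different
IRIs; comparing after percent-decoding is just one possible
normalisation, chosen here for the example, and the helper name is
made up.

```python
from urllib.parse import quote, unquote

def same_after_percent_decoding(a, b):
    """Compare two IRI strings after decoding %XX sequences as UTF-8."""
    return unquote(a) == unquote(b)

print("Éire" == "%C3%89ire")                             # simple string comparison
print(same_after_percent_decoding("Éire", "%C3%89ire"))  # after decoding
print(quote("Éire"))                                     # what DBpedia stores
```

Decoding everything is not safe as a general rule (a %2F is not the
same as a literal /), so a real normaliser would be pickier than this.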

>
>>> 2.<search?q=eire>
>>> 3.<Galway,_Ireland>
>>> 4.<Éire>  if you don't know how to type É but know that you can use \u00C9 instead
>>
>> Aside from the fact it's relative, why not?
>
> Because xxx:\u00C9ire is not a valid prefixed name (in Turtle – it is in SPARQL 1.0).

xxx:Éire is valid in RDFa 1.0, RDFa 1.1, SPARQL 1.0, XML 1.0, XML 1.1,
and Turtle (TS, WD).
xxx:\u00C9ire is valid in RDFa 1.0, RDFa 1.1, SPARQL 1.0, and Turtle (WD).

It mostly comes down to whether we follow XML in not allowing escaping
in names. But it's a bit more complicated, since we of course DO
already allow escaping in some names (<\u00C9ire>), just not all
names.
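
The two policies can be sketched as a single flag on prefixed-name
expansion. This is only an illustration: the prefix map, the
http://example.org/ namespace and the function names are hypothetical,
and real parsers validate far more than this does.

```python
import re

UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_uchar(text):
    """Replace every codepoint escape with the character it names."""
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def expand(pname, prefixes, allow_escapes):
    """Expand prefix:local to a full IRI string.

    allow_escapes=False mimics the XML-style rule (no escapes in names);
    allow_escapes=True decodes them, as already happens inside <...>.
    """
    prefix, local = pname.split(":", 1)
    if "\\" in local:
        if not allow_escapes:
            raise ValueError("escape not allowed in prefixed name: " + pname)
        local = decode_uchar(local)
    return prefixes[prefix] + local

prefixes = {"xxx": "http://example.org/"}  # hypothetical namespace
print(expand(r"xxx:\u00C9ire", prefixes, allow_escapes=True))
```

With allow_escapes=True the prefixed name and <\u00C9ire> resolved
against the same base end up as the same string; with False the author
has to fall back to the full IRI form.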

>
>>> 5.<U.S.>
>>
>> What have trailing dots got to do with unicode escapes?
>
> They are both examples of stuff that prevents prefixed names from being an *all-purpose* IRI abbreviation mechanism.
>
>>> 6.<United%20Kingdom>
>>
>> use of % - not about unicode escapes.
>
> Ditto. The point is that only a very limited range of IRIs can be abbreviated using prefixed names (I gave six examples where a reasonable person might expect it to work but it doesn't), and users don't actually benefit from a change that makes *one* of those many cases work.

And here we get away from the escaping issue again. Will follow up in
another thread.

>
> Without a benefit to users, I don't see the case for a backwards compatibility breaking change to Turtle.
>
>> My suggestion is not expanding the range of characters that are, or are not, allowed in a prefix name but I'm open to adding %xx.
>
> This would make a second example work, while the four others still don't.
>
> As long as most IRIs can't be usefully abbreviated with prefixed names, it's a fundamental mistake to think of prefixed names as an all-purpose IRI abbreviation mechanism. It just isn't. It's a feature for abbreviating IRIs that have been designed with the feature in mind. (I may be refuting a point here that you didn't make but others did when asking for the same feature.)

Prefixed names are an all-purpose IRI abbreviation mechanism in RDFa,
which thanks to Facebook Open Graph has far more deployed data than
SPARQL does.

>
>>> The proposal adds a whole bunch of complexity to the story that one
>>> needs to tell to explain how the hell prefixed names work, and what
>>> we get in return is a solution for the case that matters least –
>>> number 4 – while all the others still don't work and require falling
>>> back to full IRIs.
>>
>> What about compatibility?
>
> Compatibility? Between what and what? SPARQL and Turtle? That can be achieved by SPARQL 1.1 matching Turtle's (Team Submission) behaviour.

The Team Submission of course has issues as well. Which behaviour? Not
allowing escapes in prefix names? Using QNames not SPARQL PNames? Not
allowing numbers to start prefix names? Our early decisions seem to be
coming unstuck. SPARQL 1.1 isn't really this group's job. Not to
mention that it seems rather obvious that the thing SPARQL 1.1 needs
to be most compatible with is SPARQL 1.0.

Cheers,
Gavin

> Best,
> Richard
>
Received on Wednesday, 23 November 2011 01:20:42 UTC