Re: unicode escapes in prefix names

* Richard Cyganiak <richard@cyganiak.de> [2011-11-23 13:36+0000]
> On 23 Nov 2011, at 01:20, Gavin Carothers wrote:
> >> I would argue that SPARQL is changing to avoid a security risk in SPARQL Update:
> >> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Aug/0010.html
> > 
> > Obfuscated comments are not really a security risk.
> 
> The problem is obfuscated DELETE statements, not obfuscated comments.

I believe this whitepaper describes the security risk: http://xkcd.com/327/


> > SPARQL 1.0 allows for escaping sequences in all tokens
> > effectively. The previous decision was to allow escaping in as many
> > tokens as seemed reasonable.
> 
> I'd say it should be as few tokens as is reasonable.
> 
> >>> This isn't about encoding.
> >> 
> >> Right – it's about the complexity that authors already face in this area.
> > 
> > Yeah, it's about IRI normalisation. Depending on which IRI
> > normalisation one is expecting
> > <Éire> and <%C3%89ire> could be the same.
> 
> Not in RDF.
> 
> [[
> IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of [IRI]. Further normalization must not be performed when comparing IRIs for equality.
> ]]
> http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#section-IRIs
> 
> > I believe DBpedia is "wrong" in storing the % escaped form.
> 
> It's bad practice, but it's what they do and it's what users of our specs have to deal with in the real world.
> 
> [[
> Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of [IRI]. Non-normalized forms that should be avoided include:
> 
>  • Percent-encoding of characters where it is not required by IRI syntax
> ]]
> http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#section-IRIs

I think this is a slight simplification, but a worthy one. Of course, URL minters can mint whatever they want, but the mapping to URI (for that popular GET protocol) *loses* '%'s. So a reason to avoid excessive %-ification is that, when you push it through the standard processing at the far end, say, Apache's mapping to a filename, those lost '%'s don't come back. As an example, <http://example.com/R&D> and <http://example.com/R%26D> map to the same URL (Apache will look for <server root>/R&D).
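
To make the distinction concrete, here is a minimal Python sketch (an illustration, not taken from any spec) contrasting RDF's simple string comparison with what happens once a path is percent-decoded. It uses only the standard library's urllib.parse; the helper names rdf_iri_equal and decoded_path are made up for this example, and the real filename mapping depends on how the server is configured.

```python
from urllib.parse import urlsplit, unquote

def rdf_iri_equal(a: str, b: str) -> bool:
    """IRI equality as quoted from RDF Concepts: plain string comparison, no normalization."""
    return a == b

def decoded_path(url: str) -> str:
    """Percent-decode the path component, roughly what a server does before a file lookup."""
    return unquote(urlsplit(url).path)

plain = "http://example.com/R&D"
escaped = "http://example.com/R%26D"

print(rdf_iri_equal(plain, escaped))                 # False: two different IRIs in RDF
print(decoded_path(plain) == decoded_path(escaped))  # True: both decode to /R&D
```

So the two forms name different RDF nodes but can end up at the same resource once the '%' has been decoded away.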


> >>>> 4.<Éire>  if you don't know how to type É but know that you can use \u00C9 instead
> >>> 
> >>> Aside from the fact it's relative, why not?
> >> 
> >> Because xxx:\u00C9ire is not a valid prefixed name (in Turtle – it is in SPARQL 1.0).
> > 
> > xxx:Éire is valid in RDFa 1.0, RDFa 1.1, SPARQL 1.0, XML 1.0, XML 1.1,
> > and Turtle (TS, WD).
> 
> Sure. Andy made the case that some people can't type “É” and hence have to use \u00C9.

The point is that in SPARQL 1.0, the grammar never "sees" xxx:\u00C9ire, because escapes are resolved before the grammar is applied; it only ever sees xxx:Éire. You can sprinkle escapes where you like, but they are only useful for folks who are editing Unicode in ASCII, which is a small and shrinking use case.
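
A rough Python sketch of that SPARQL 1.0 style of handling, where \uXXXX and \UXXXXXXXX sequences are replaced across the whole query string before any tokenizing happens. The function name preprocess_escapes is hypothetical and this is not a real SPARQL parser; it only illustrates why the grammar never sees the escapes, and why a keyword can be hidden behind one.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def preprocess_escapes(query: str) -> str:
    """Replace codepoint escapes anywhere in the query text, before tokenizing."""
    def decode(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(decode, query)

# The escape in the prefixed name is gone before a grammar would look at it.
print(preprocess_escapes(r"SELECT * WHERE { ?s ?p xxx:\u00C9ire }"))
# prints: SELECT * WHERE { ?s ?p xxx:Éire }

# The same pass also reveals an obfuscated keyword.
print(preprocess_escapes(r"D\u0045LETE WHERE { ?s ?p ?o }"))
# prints: DELETE WHERE { ?s ?p ?o }
```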


> > xxx:\u00C9ire is valid in RDFa 1.0, RDFa 1.1, SPARQL 1.0, and Turtle (WD)
> 
> \u escapes in RDFa? I sincerely hope not! And in Turtle it says so without WG consensus.
> 
> > Comes down mostly to do we follow XML
> 
> (and Java and SQL)
> 
> > in not allowing escaping in
> > names or not? But a bit more complicated by the fact we of course DO
> > already allow escaping in some names (<\u00C9ire>) just not all
> > names.
> 
> Yes, so anyone who needs unicode escapes can already use them in <…>.

I don't find the you-don't-need-escaping-anyway argument relevant to defending the limitations of an escaping mechanism. I've seen short exemplars bandied about, but the ones I realistically deal with are IRIs mapped from protein identifiers which have ':'s in them. I have a nice syntax for writing most of my queries and most of my data, nicely categorized by namespace prefixes, which helps me visually distinguish proteins from mechanisms from drugs. But if I'm unlucky enough to need to reference one with a ':' in it, I'm not allowed to use the obvious escaping syntax? Instead I have to throw all that away and have a big opaque IRI in the middle of some otherwise organized data or query?
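
As a sketch of what such an escaping syntax could look like, here is a small Python expander that treats a backslash followed by any character in the local part as that literal character. Everything in it is an assumption for illustration: the backslash convention is not taken from any of the specs under discussion, and the prot: prefix, its namespace IRI, and the identifier ABC:123 are placeholders rather than real data.

```python
import re

# Hypothetical prefix map; the namespace IRI is a placeholder.
PREFIXES = {"prot": "http://example.org/protein/"}

def expand(pname: str, prefixes: dict) -> str:
    """Expand prefix:local, reading backslash + char in the local part as the literal char."""
    # The first ':' separates prefix from local part; later ones must be escaped.
    prefix, sep, local = pname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"not a known prefixed name: {pname!r}")
    return prefixes[prefix] + re.sub(r"\\(.)", r"\1", local)

print(expand(r"prot:ABC\:123", PREFIXES))
# prints: http://example.org/protein/ABC:123
```

With something like this the name stays short and keeps its prefix; without it the only option is to write out the full IRI in angle brackets.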

This has also come up when teaching SPARQL to the Linked Data class at MIT. It's very hard to defend allowing escapes for an antiquated use case and not for more critical ones.


> > Prefixed names are an all-purpose IRI abbreviation mechanism in RDFa.
> 
> RDFa uses CURIEs which are a different story. CURIEs work where there are delimiters around them. Prefixed names in SPARQL and Turtle don't have delimiters around them. When you have delimiters, it's easy to go wild with what you allow to happen between them.
> 
> > Which thanks to Facebook Open Graph has far more deployed data than
> > SPARQL does.
> 
> Hardly a good example. Facebook requires that the “ogp” prefix be used and ignores prefix declarations. Also, RDFa's design managed to piss off some other group so much that they made their own competing standard, which is now winning the adoption war. The thing they hated most about RDFa was its use of prefixed names. Their standard has no prefixed names at all.
> 
> >> Compatibility? Between what and what? SPARQL and Turtle? That can be achieved by SPARQL 1.1 matching Turtle's (Team Submission) behaviour.
> > 
> > The Team Submission of course has issues as well. Which behaviour? Not
> > allowing escapes in prefix names?
> 
> Yes.
> 
> > Using QNames not SPARQL PNames? Not allowing numbers to start prefix names? 
> 
> No.
> 
> > Our early decisions seem to be coming unstuck.
> 
> I don't recall this group making a decision to allow unicode escapes in prefixed names.
> 
> > SPARQL 1.1 isn't really this group's job. Not to mention it seeming rather obvious that the thing SPARQL 1.1 needs to be most compatible with is SPARQL 1.0
> 
> Unicode escape handling in SPARQL 1.0 is broken. The old SPARQL WG broke it – they copied the general syntax from Turtle, but in the process decided to change the sane escape handling they could have inherited from Turtle to something bizarre that's pretty much unique in the computing world. (They made other departures from Turtle too and some of those were major improvements.) Now the SPARQL 1.1 WG has to deal with that and yes that's their job. Unicode escape handling in Turtle is *not* broken – it's been sane ever since January 2004, so why mess with it.
> 
> Best,
> Richard

-- 
-ericP

Received on Wednesday, 23 November 2011 14:51:06 UTC