Re: PROPOSAL for %-encoding (was: Re: IMPORTANT: remaining issues for closing CR)

* Richard Cyganiak <richard@cyganiak.de> [2012-04-29 21:02+0100]
> Hi Eric,
> 
> On 27 Apr 2012, at 18:01, Eric Prud'hommeaux wrote:
> > Recent changes to the specs in SPARQL and Turtle mean that (once these changes are deployed) this URL can be written in SPARQL/Turtle as <Department/name\=accounting\;city\=Cambridge>.
> 
> No, it can be written as <Department/name=accounting;city=Cambridge>, and that was always the case.
> 
> It can also be abbreviated, assuming an appropriate prefix declaration, as something like department:name\=accounting\;city\=Cambridge. Is this what you meant?

yep, apologies

> > This doesn't make the eyes bleed but it is a minor usability impediment 'cause you can't cut and paste those particular URLs from e.g. query results to another query.
> 
> I don't think there's a copy-paste problem. No matter if query results are displayed as relative IRIs, absolute IRIs or prefixed names, one can always just copy-paste them into a query and they will be valid (assuming appropriate enclosing <> and prefix declarations).
> 
> Prefixed names with backslashes in them are not very pretty, that's true, but that's a minor concern. Lots of languages, including JSON, regular expressions, and most programming languages, require backslash-escaping of certain characters.
> 
> > Another nearby issue is that R2RML users are limited in the separator characters that they can safely use in templates. A user creating a template like "Department/{NAME}-{CITY}" may not first inspect his data to make sure there's no '-' in the NAME column.
> 
> Well, the frequent case is numeric columns, and finding a safe separator for them is trivial.
> 
> All of the RFC 3987 sub-delims are *always* safe:
> 
>    sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
>                   / "*" / "+" / "," / ";" / "="

Agreed, but there is some usability pressure to use separators which don't require escaping in prefixed names, which, as we see below, is impossible:

[166]  PN_LOCAL_ESC  ::=  '\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | '/' | '?' | '#' | '@' | '%' )
sub-delims             =                                "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
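
The overlap between the two productions above can be checked mechanically. A minimal Python sketch, with both character sets transcribed by hand from those two lines (the variable names are illustrative only, and membership in PN_LOCAL_ESC is taken here to mean "needs a backslash in a prefixed name", which is the assumption this message makes):

```python
# Characters that PN_LOCAL_ESC [166] lists after the backslash.
PN_LOCAL_ESC = set("_~.-!$&'()*+,;=/?#@%")

# RFC 3987 sub-delims.
SUB_DELIMS = set("!$&'()*+,;=")

# Sub-delims that would not need a backslash in a prefixed name.
unescaped_ok = SUB_DELIMS - PN_LOCAL_ESC
print(sorted(unescaped_ok))  # prints []

# Characters that need a backslash but are not sub-delims.
print(sorted(PN_LOCAL_ESC - SUB_DELIMS))
```

The first difference is empty, so every separator that is always safe in a template is also one that has to be escaped in a prefixed name.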


> I'd expect that a good data validator would spot templates that are unsafe for the given database. (The spec for data validators doesn't require this. It probably should.)
> 
> > As it stands, users will probably want to use separators which need not be escaped in SPARQL/Turtle.
> 
> First, separators don't *need* to be escaped in SPARQL or Turtle. They only need to be escaped if one wants to abbreviate those IRIs in prefixed names.
> 
> Second, experience indicates that users don't care whether IRIs *in data* can be abbreviated. Slashes are ubiquitous as separators in D2RQ mappings. Conversations about whether these IRIs are easily abbreviated, or how they can be made more easily abbreviated, are not happening at all. Based on this, I think your prediction above is wrong.
> 
> > As it turns out, that leaves no safe characters, i.e. chars which are escaped in {}s but not in SPARQL.
> 
> If your data contains characters that are not allowed in prefixed names, then you'll have to use escapes anyways. So why would you care about a separator character that doesn't need escaping?
> 
> Also, R2RML mapping authors are not restricted to a single character, and finding a multi-character sequence that doesn't need escaping and doesn't occur in the data is actually easy if you care about this. The template "Department/{NAME}---{CITY}" should be safe unless something very strange is going on in your data.
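
To make the separator discussion concrete, here is a rough Python sketch of template expansion. It is not an R2RML implementation: `iri_safe` and `expand` are hypothetical helper names, and `urllib.parse.quote` is used as an approximation of R2RML's IRI-safe form (it leaves ASCII letters, digits, '-', '.', '_' and '~' alone, but unlike R2RML it also encodes non-ASCII characters).

```python
import re
from urllib.parse import quote

def iri_safe(value):
    # Approximation of R2RML's IRI-safe form (assumption, see lead-in).
    return quote(str(value), safe="")

def expand(template, row):
    # Replace each {COLUMN} with the IRI-safe form of that column's value.
    return re.sub(r"\{([^{}]+)\}",
                  lambda m: iri_safe(row[m.group(1)]),
                  template)

row_a = {"NAME": "a-b", "CITY": "c"}
row_b = {"NAME": "a", "CITY": "b-c"}

# '-' is not percent-encoded in values, so two different rows collide.
print(expand("Department/{NAME}-{CITY}", row_a))    # Department/a-b-c
print(expand("Department/{NAME}-{CITY}", row_b))    # Department/a-b-c

# ';' is percent-encoded in values, so it stays unambiguous as a separator.
print(expand("Department/{NAME};{CITY}", row_a))    # Department/a-b;c
print(expand("Department/{NAME};{CITY}",
             {"NAME": "x;y", "CITY": "z"}))         # Department/x%3By;z

# A multi-character separator only helps if it never occurs in the data.
print(expand("Department/{NAME}---{CITY}", row_a))  # Department/a-b---c
```

The '---' variant is distinct for these two rows, but that is a property of the data rather than of the escaping rules, which is the trade-off under discussion.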
> 
> > One solution which may help users is the parameterized escape, e.g. "Department/{-|NAME}-{CITY}", which would also make DM URLs templatable.
> 
> D2RQ has a similar mechanism that allows a choice between different escaping modes, where one optimizes for usability/readability while another one sticks to the usual RFCs. (It doesn't support selection of individual characters for escaping.) I brought this up when we talked about template escaping, and others thought it sounds like a possible R2RML 1.1 feature. I found that compelling. A likely direction for R2RML would be to implement more of RFC 6570 on which the template syntax is based, and which allows for several different %-encoding regimes.
> 
> OTOH, given that R2RML implementations may want to do the encoding and decoding in the database, there's something to be said for keeping the mechanism as simple as possible. The cost of a change in the DM, to align it with RFC 3986+3987, as proposed below, seems smaller to me.

Spec-wise, sure, but we're also trying to guess what will be most appealing to users. You argue that having simple rules is good for them. I argue that slightly more complex rules (escaping '.'s and '-'s) could produce a smoother experience for users. I apparently wasn't paying attention when we decided on the rule for PN_LOCAL_ESC because right now it seems crazy to me to require escaping really common word separators like '_' | '~' | '.' | '-'. I'm not exactly psyched to bring this up in SPARQL and RDF, but I suppose I should. Barring relaxing those rules, the custom escaping rules in DM don't have the desired payoff for e.g. dates in a primary key logns:time-2012-29-04T01:23:45. I guess I could also just blow it all off and accept your proposal unchallenged. I wish I could spin off a parallel universe to see which one ends up curing cancer and paving the roads.
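
For illustration, this is roughly what the date example looks like once it is escaped for a prefixed name. A small Python sketch, assuming every character listed in PN_LOCAL_ESC [166] has to be backslash-escaped wherever it appears in the local part (`escape_local` is a made-up helper, not part of any SPARQL or Turtle library):

```python
import re

# Characters listed in PN_LOCAL_ESC [166], transcribed from the grammar line
# quoted earlier in this message.
_ESC = re.compile(r"([_~.\-!$&'()*+,;=/?#@%])")

def escape_local(local):
    # Put a backslash in front of every PN_LOCAL_ESC character
    # (assumption: all of them must be escaped).
    return _ESC.sub(r"\\\1", local)

print("logns:" + escape_local("time-2012-29-04T01:23:45"))
# logns:time\-2012\-29\-04T01:23:45

print("department:" + escape_local("name=accounting;city=Cambridge"))
# department:name\=accounting\;city\=Cambridge
```

Under that assumption both the '-' style and the '=' / ';' style pick up backslashes, so the choice of DM delimiters does not by itself make prefixed names any cleaner.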


> Best,
> Richard
> 
> 
> 
> > 
> > 
> >> == CHANGE PROPOSAL ==
> >> 
> >> SUMMARY: “Direct Mapping: Change %-escaping rules to be compatible with R2RML. Change two delimiter characters to ones that are safe under these rules.”
> >> 
> >> 
> >> PROPOSAL: In the Direct Mapping spec, do the following changes:
> >> 
> >> 
> >> REMOVE: [[
> >> These identifiers are separated by the punctuation characters '#', '.', '/' and '-'. All SQL identifiers are escaped following URL-encoding HTML form data except that only the above punctuation and the characters not permitted in RDF IRIs are escaped.
> >> ]]
> >> ADD: [[
> >> These identifiers are separated by the punctuation characters '#', ';', '/' and '='. All SQL identifiers are escaped following R2RML's escaping rules.
> >> ]]
> >> 
> >> 
> >> In “Definition percent-encode”, REMOVE the following bullet points:
> >> [[
> >>  • For attribute names, replace each HYPHEN-MINUS character ('-', U+003d) with the string "%3D".
> >>  • For attribute values, replace each FULL STOP character ('.', U+002e) with the string "%2E".
> >> ]]
> >> 
> >> 
> >> In “Definition row node”, replace two bullet points:
> >> REMOVE: [[
> >>  • a HYPHEN-MINUS character '-',
> >>  • if it is not the last column in the foreign key, a FULL STOP character '.'
> >> ]]
> >> ADD: [[
> >>  • an EQUALS SIGN character '=',
> >>  • if it is not the last column in the foreign key, a SEMICOLON character ';'
> >> ]]
> >> 
> >> 
> >> In “Definition reference property IRI”:
> >> REMOVE: [[
> >>  • if it is not the last column in the foreign key, a FULL STOP character '.'
> >> ]]
> >> ADD: [[
> >>  • if it is not the last column in the foreign key, a SEMICOLON character ';'
> >> ]]
> >> 
> >> 
> >> Change all examples in Section 2 accordingly.
> >> 
> >> Change rules [37] and [40] in Appendix A.4 accordingly.
> >> 
> >> Change all DM test cases accordingly.
> >> 
> >> 
> >> 
> >> Best,
> >> Richard
> > 
> > -- 
> > -ericP
> > 
> 

-- 
-ericP

Received on Sunday, 29 April 2012 20:50:10 UTC