working through some details on "Just Works" escapes

* Richard Cyganiak <richard@cyganiak.de> [2011-11-25 14:59+0000]
> On 25 Nov 2011, at 14:24, Andy Seaborne wrote:
> >> - Is an escaped character not in the list its normal value, e.g. \a == a?  I think so.
> > 
> > Some languages do indeed have \X be X for undefined X but (wild claim!) it can get a bit mysterious.  Your example shows this (:-) In C, \a is "audible bell" unicode codepoint U+0007 = BELL, not 'a'
> > 
> > We already have in a string, \t is a tab not a "t"
> > 
> > So I prefer to identify the characters that are allowed without defaulting to pass-through.
> 
> +1, for consistency. It would be weird to have \t be U+0009 in "strings", forbidden in <IRIs>, and “t” in prefixed:names.
> 
> Then there's the question whether "\-" should equal "-" in prefixed:names, and "\_" == "_" and "\." == "." and so forth. Authors are likely to be unsure whether a particular punctuation character needs backslash-escaping or not, so they might be tempted to escape them just in case, and it would be good if it Just Worked anyways. This uncertainty is unlikely to occur for alphanumeric characters.

We can write Just Works into a grammar in a way which communicates the *required* escape chars:

  <identifier_char>: [a-zA-Z0-9-] | \\[~.-!$&'()*+,;=:/?#@%] || \\. # added to permit e.g. "\x", which will be transformed to "x"

(That presumes first longest lexing. Unordered longest lexing would require \\[~.-…@%] || \\[^~.-…@%] .)
We can't, however, use the grammar to validate the unescaped form if the escaping is written into the grammar.
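
To make the Just Works reading concrete, here is a rough Python sketch of an unescaper for that rule. It is only an illustration under assumptions: the plain character class mirrors the simplified [a-zA-Z0-9-] above rather than the real PN_LOCAL production, and the names TOKEN and unescape are made up for this example.

```python
import re

# One token is either a plain identifier character (the simplified class
# from the grammar line above, not the real PN_LOCAL production) or a
# backslash followed by any single character ("Just Works" pass-through).
TOKEN = re.compile(r"[a-zA-Z0-9-]|\\(.)", re.DOTALL)


def unescape(name):
    """Return the unescaped form of a local name.

    Raises ValueError on any character that is neither plain nor escaped,
    for example an unescaped ':' or '@', or a trailing lone backslash.
    """
    out = []
    pos = 0
    while pos < len(name):
        m = TOKEN.match(name, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {name[pos]!r} at offset {pos}"
            )
        # group(1) is set only for the escaped alternative; "\x" yields "x".
        out.append(m.group(1) if m.group(1) is not None else m.group(0))
        pos = m.end()
    return "".join(out)
```

With this, unescape(r"audio\:title") gives "audio:title" and unescape(r"\x") gives "x", while unescape("a@b") raises ValueError. Note that it works on the escaped form only, which is the limitation mentioned above: once the escapes are gone, this code can no longer say whether the result was legal.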

I think Just Works works as long as we never add language features in which the escaped version of a character acquires a special meaning. An example of where that rule wasn't observed is in some regex dialects in which e.g. "\(\)" and "[]" are meta characters (presumably because capture was added to the language and they wanted backward compatibility with old patterns which didn't have a special meaning for "()"s). Anyone who used the Just Works feature may have escaped "()"s just in case, which would break when "\(\)" became meta-characters.

We're largely safe from that antipattern as we already have our reserved set of characters, so even if we add a special meaning for '@' in path expressions, anything with an (unescaped) '@' would be an error anyways.


> And what to do about characters outside of US-ASCII? Surely they would be treated all equally, so either "\É" == "É" for all non-ASCII chars, or "\É" is an error for all non-ASCII chars. But non-ASCII chars include punctuation as well as alphanumerics, so it's hard to draw a consistent boundary between punctuation and alphanumerics.
> 
> > bash : it's different : "\F" is two characters \ and F and so  "\F" == "\\F"
> > 
> > C : unclear : MSDN says it is as a MS extension to std C.
> > C# : no but wider range of characters defined.
> > Java : no
> > Ruby : yes (and a wider range of escapes)
> > Python :   (also has \xXX)
> 
> PHP: "\F" is two characters "\F"
> Regex: "\F" matches one character "F"
> Javascript: "\F" is one character "F"
> JSON: "\F" is not allowed – "\" is only allowed as part of an escape sequence
> MySQL: "\F" is one character "F" (Backslash is not an escape character in standard SQL.)
> 
> It's all over the place.
> 
> Best,
> Richard
> 
> 
> 
> > 
> >> 
> >> Regards,
> >> Dave
> > 
> >  Andy
> > 
> >> 
> >> 
> >> On Nov 25, 2011, at 5:55 AM, Andy Seaborne<andy.seaborne@epimorphics.com>  wrote:
> >> 
> >>> 1/ If we want to have extra characters in prefixed names
> >>> (extra characters means ones not allowed by the current syntax for PN_LOCAL)
> >>> then it seems better to use the character escape mechanism.
> >>> 
> >>> Character escapes turn off the meaning of character in that context (e.g. turning " into a char in the string, not the delimiter).  The current meaning of these characters is to end the prefixed name.
> >>> 
> >>> Using character escapes is also (vaguely) readable.
> >>> 
> >>>    og:audio\:title
> >>>    dbpedia:\%C3\%89ire
> >>>    db:employee.id\=123
> >>>    kinase:Cyclin_D\/Cdk4
> >>> 
> >>> A possible set is:
> >>> 
> >>>   ~.-!$&'()*+,;=:/?#@%
> >>> 
> >>> From RFC 3986
> >>> 
> >>> A/ unreserved extras which have positional restrictions (leading "-" and trailing ".")  ~.-
> >>> 
> >>> B/ sub-delims   !$&'()*+,;=
> >>> 
> >>> C/ gen-delims without []   :/?#@
> >>> 
> >>> D/ %
> >>> 
> >>> The prefixed name is still required to be a valid IRI.
> >>> 
> >>> (I haven't gone through all these chars in detail but they are legal IRI chars and not ones marked "unwise", I think)
> >>> 
> >>> 2/ Variant: Adding %XX as a token rule (so the parser will check it's two hex digits), otherwise have \% in the character escapes as above.
> >>> 
> >>> 
> >>> 3/ Variant: One that we haven't discussed much is #, which is sometimes mentioned as a nuisance.  Unescaped # is also possible without major risk of breaking things.  You'd have to write a comment, with no immediately preceding whitespace, in the middle of a "triples" block.  I don't recall ever seeing such a thing.
> >>> 
> >>>    Andy
> >>> 
> > 
> 
> 

-- 
-ericP

Received on Wednesday, 30 November 2011 14:25:07 UTC