Re: Character escapes in prefix names from Richard Cyganiak on 2011-11-25 (public-rdf-wg@w3.org from November 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 25 Nov 2011 14:59:19 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: David Wood <david@3roundstones.com>, RDF-WG <public-rdf-wg@w3.org>
Message-Id: <A63C8DA0-B079-4890-B490-805349677D7C@cyganiak.de>

On 25 Nov 2011, at 14:24, Andy Seaborne wrote:
>> - Is an escaped character not in the list its normal value, e.g. \a == a?  I think so.
> 
> Some languages do indeed have \X be X for undefined X but (wild claim!) it can get a bit mysterious.  Your example shows this (:-)  In C, \a is "audible bell", Unicode code point U+0007 (BELL), not 'a'.
> 
> We already have that, in a string, \t is a tab, not a "t".
> 
> So I prefer to identify the characters that are allowed without defaulting to pass-through.

+1, for consistency. It would be weird to have \t be U+0009 in "strings", forbidden in <IRIs>, and “t” in prefixed:names.
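
To make the two options concrete, here is a rough sketch in Python. The function names are made up for illustration, and the escapable set is simply the candidate list from Andy's earlier message (quoted at the bottom), not anything the group has agreed on.

```python
import re

# Candidate set from the quoted proposal; an assumption, not a settled grammar.
ESCAPABLE = set("~.-!$&'()*+,;=:/?#@%")

def unescape_strict(text, allowed=ESCAPABLE):
    """Resolve backslash escapes, rejecting any character not in `allowed`."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of input")
        nxt = text[i + 1]
        if nxt not in allowed:
            raise ValueError(f"undefined escape: \\{nxt}")
        out.append(nxt)
        i += 2
    return "".join(out)

def unescape_passthrough(text):
    """Resolve backslash escapes by treating \\X as X for every X."""
    return re.sub(r"\\(.)", lambda m: m.group(1), text, flags=re.S)

print(unescape_strict(r"Cyclin_D\/Cdk4"))   # escape is in the allowed set
print(unescape_passthrough(r"\t"))          # silently yields the letter t
```

With the strict variant, `\t` in a prefixed name is an error rather than a quiet "t", which is the consistency argument above.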

Then there's the question of whether "\-" should equal "-" in prefixed:names, and "\_" == "_" and "\." == "." and so forth. Authors are likely to be unsure whether a particular punctuation character needs backslash-escaping or not, so they might be tempted to escape them just in case, and it would be good if it Just Worked anyway. This uncertainty is unlikely to occur for alphanumeric characters.

And what to do about characters outside of US-ASCII? Surely they would be treated all equally, so either "\É" == "É" for all non-ASCII chars, or "\É" is an error for all non-ASCII chars. But non-ASCII chars include punctuation as well as alphanumerics, so it's hard to draw a consistent boundary between punctuation and alphanumerics.
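
One way to see the boundary problem is to look at Unicode general categories. The sketch below, in Python, assumes a hypothetical rule of "a backslash before anything punctuation-like is the identity", with "punctuation-like" taken to mean general categories P* and S*. That rule is my own illustration, not a proposal from this thread.

```python
import unicodedata

def is_punctuation_like(ch):
    """True if the general category of `ch` is punctuation (P*) or symbol (S*)."""
    return unicodedata.category(ch)[0] in "PS"

# Print the general category next to each sample character.
for ch in ["-", "_", ".", "=", "É", "«", "€"]:
    print(repr(ch), unicodedata.category(ch), is_punctuation_like(ch))
```

Even the ASCII candidates are split across several categories (dash, connector, other punctuation, math symbol), and the non-ASCII ones add more, so any rule along these lines ends up depending on Unicode property tables rather than on a short list an author can remember.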

> bash : it's different : "\F" is two characters \ and F and so  "\F" == "\\F"
> 
> C : unclear : MSDN says it is an MS extension to standard C.
> C# : no but wider range of characters defined.
> Java : no
> Ruby : yes (and a wider range of escapes)
> Python :   (also has \xXX)

PHP: "\F" is two characters "\F"
Regex: "\F" matches one character "F"
Javascript: "\F" is one character "F"
JSON: "\F" is not allowed – "\" is only allowed as part of an escape sequence
MySQL: "\F" is one character "F" (Backslash is not an escape character in standard SQL.)

It's all over the place.
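
The JSON case is easy to check with Python's standard `json` module. The helper name `json_accepts` is made up for this example.

```python
import json

def json_accepts(literal):
    """Return True if `literal` parses as JSON, False on a decode error."""
    try:
        json.loads(literal)
        return True
    except json.JSONDecodeError:
        return False

print(json_accepts(r'"\F"'))  # False: \F is not a defined JSON escape
print(json_accepts(r'"\t"'))  # True: \t is a defined escape (tab)
```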

Best,
Richard



> 
>> 
>> Regards,
>> Dave
> 
> 	Andy
> 
>> 
>> 
>> On Nov 25, 2011, at 5:55 AM, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>> 
>>> 1/ If we want to have extra characters in prefixed names
>>> (extra characters means ones not allowed by the current syntax for PN_LOCAL)
>>> then it seems better to use the character escape mechanism.
>>> 
>>> Character escapes turn off the meaning of a character in that context (e.g. turning " into a char in the string, not the delimiter).  The current meaning of these characters is to end the prefixed name.
>>> 
>>> Using character escapes is also (vaguely) readable.
>>> 
>>>    og:audio\:title
>>>    dbpedia:\%C3\%89ire
>>>    db:employee.id\=123
>>>    kinase:Cyclin_D\/Cdk4
>>> 
>>> A possible set is:
>>> 
>>>   ~.-!$&'()*+,;=:/?#@%
>>> 
>>> From RFC 3986
>>> 
>>> A/ unreserved extras which have positional restrictions (leading "-" and trailing ".")  ~.-
>>> 
>>> B/ sub-delims   !$&'()*+,;=
>>> 
>>> C/ gen-delims without []   :/?#@
>>> 
>>> D/ %
>>> 
>>> The prefixed name is still required to be a valid IRI.
>>> 
>>> (I haven't gone through all these chars in detail but they are legal IRI chars and not ones marked "unwise", I think)
>>> 
>>> 2/ Variant: Adding %XX as a token rule (so the parser will check it's two hex digits), otherwise have \% in the character escapes as above.
>>> 
>>> 
>>> 3/ Variant: One that we haven't discussed much is #, which is sometimes mentioned as a nuisance.  Unescaped # is also possible without major risk of breaking things.  You'd have to write a comment, with no immediately preceding whitespace, in the middle of a "triples" block.  I don't recall ever seeing such a thing.
>>> 
>>>    Andy
>>> 
>
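
For variant 2 in the quoted message (a %XX token rule), the check the parser would perform might look roughly like the Python below. The names `PERCENT_TOKEN` and `percent_signs_well_formed` are hypothetical, and the token shape (a percent sign plus exactly two hex digits) is my reading of "check it's two hex digits", not agreed grammar text.

```python
import re

# Assumed token shape: '%' followed by exactly two hexadecimal digits.
PERCENT_TOKEN = re.compile(r"%[0-9A-Fa-f]{2}")

def percent_signs_well_formed(local):
    """Return True if every '%' in `local` starts a %XX token."""
    i = 0
    while i < len(local):
        if local[i] == "%":
            if not PERCENT_TOKEN.match(local, i):
                return False
            i += 3  # skip the whole %XX token
        else:
            i += 1
    return True

print(percent_signs_well_formed("%C3%89ire"))
print(percent_signs_well_formed("100%"))
```

Under the `\%` alternative, no such check happens in the tokenizer, so a malformed sequence would only be caught, if at all, when the resulting IRI is validated.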

Received on Friday, 25 November 2011 15:00:01 UTC