Re: Character escapes in prefix names

On 25/11/11 13:58, David Wood wrote:
> Good idea.  The normal sort of issues that come up include:
>
> - Is an escaped character not in the list its normal value, e.g. \a == a?  I think so.

Some languages do indeed have \X be X for undefined X but (wild claim!) 
it can get a bit mysterious.  Your example shows this (:-) In C, \a is 
"audible bell" unicode codepoint U+0007 = BELL, not 'a'

We already have in a string, \t is a tab not a "t"

So I prefer to identify the characters that are allowed without 
defaulting to pass-through.  Both are workable though.


> - The above infers that a backslash should always be escaped: \\

'\' isn't legal in IRIs so this not needed.  Or to put it another way, 
it'll cause an error on IRI checking.

Spaces in particular have always been a difficult area for RDF so I do 
have a preference for a design that stops spaces (and other illegal 
characters) getting into data as early as possible.

Some system do not check IRI strings (and some don't when parsing at 
speed).  Keeping illegal characters out seems to me to be more helpful 
than being liberal with \X.  YMMV.  The opposite argument of "rubbish 
data exists, have to work with it" also holds.

> String escapes work this way in several programming languages.

We could quickly survey.


bash : it's different : "\F" is two characters \ and F and so  "\F" == "\\F"

C : unclear : MSDN says it is as a MS extension to std C.

C# : no but wider range of characters defined.

Java : no

Ruby : yes (and a wider range of escapes)

Python :   (also has \xXX)

>
> Regards,
> Dave

	Andy

>
>
> On Nov 25, 2011, at 5:55 AM, Andy Seaborne<andy.seaborne@epimorphics.com>  wrote:
>
>> 1/ If we want to have extra characters in prefixed names
>> (extra characters means ones not allowed by the current syntax for PN_LOCAL)
>> then it seems better to use the character escape mechanism.
>>
>> Character escapes turn off the meaning of character in that context (e.g. turning " into a char in the string, not the delimiter).  The current meaning of these characters is to end the prefixed name.
>>
>> Using character escapes is also (vaguely) readable.
>>
>>     og:audio\:title
>>     dbpedia:\%C3\%89ire
>>     db:employee.id\=123
>>     kinase:Cyclin_D\/Cdk4
>>
>> A possible set is:
>>
>>    ~.-!$&'()*+,;=:/?#@%
>>
>>  From RFC 3986
>>
>> A/ unreserved extras which have positional restrictions (leading "-" and trailing ".")  ~.-
>>
>> B/ sub-delims   !$&'()*+,;=
>>
>> C/ gen-delims without []   :/?#@
>>
>> D/ %
>>
>> The prefixed name is still required to be a valid IRI.
>>
>> (I haven't gone though all these chars in detail but they are legal IRI chars and not ones marked "unwise", I think)
>>
>> 2/ Variant: Adding %XX as a token rule (so the parser will check it's two hex digits), otherwise have \% in the character escapes as above.
>>
>>
>> 3/ Variant: One that we haven't discussed much is #, which is sometimes mentioned as a nuisance.  Unescaped # is also possible without major risk of breaking things.  You'd have to write a comment, with no immediately proceeding whitespace in the middle of a "triples" block.  I don't recall ever seeing such a thing.
>>
>>     Andy
>>

Received on Friday, 25 November 2011 14:25:34 UTC