[Fwd: SPARQL: QuotedIRIref too lax]

Bjoern Hoehrmann wrote:
 > Dear RDF Data Access Working Group,
 >
 >   http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050721/ section 10.1
 > notes "IRIs are ordered by comparing the character strings making up
 > each IRI" it's however not clear how character strings are compared,
 > I would have expected that a `string < string` operator is defined, but
 > section 11.1 only defines such an operator for numeric and dateTime
 > types. Please change the draft such that ordering of IRIs is clear.
 >
 > regards,


The current grammar does have a rather open production for QuotedIRIref 
(anything except space and >).  An IRI reference can be relative.  There is a 
comment referring to RFC 3987 in the grammar.  An implementation is going to 
have to additionally process IRI references anyway to make them absolute. 
Without including the whole of the IRI/URI grammar, we just parse IRIs.
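
As a rough illustration of that extra processing step, here is a minimal Python sketch that resolves a QuotedIRIref token against a base. The function name and the example base are hypothetical, and urllib.parse.urljoin performs RFC 3986 style URI resolution, which is only an approximation of full RFC 3987 IRI handling.

```python
from urllib.parse import urljoin

def resolve_iriref(token, base):
    """Strip the angle brackets from a QuotedIRIref token and resolve it.

    Hypothetical helper: 'base' stands in for whatever base IRI the query
    processor has in scope. No character validation is done here.
    """
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a QuotedIRIref: %r" % token)
    reference = token[1:-1]
    # urljoin leaves absolute references alone and resolves relative ones.
    return urljoin(base, reference)

if __name__ == "__main__":
    base = "http://example.org/data/"  # assumed base, for illustration only
    print(resolve_iriref("<people#andy>", base))
    print(resolve_iriref("<http://example.com/abs>", base))
```

The parser only has to delimit the token; whether the result is a sensible IRI is decided at this later stage.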

RFC 2396 defined "excluded characters" as:
   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
   space       = <US-ASCII coded character 20 hexadecimal>
   delims      = "<" | ">" | "#" | "%" | <">

RFC 3986 defines:
    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
    pct-encoded   = "%" HEXDIG HEXDIG
    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

RFC 3987 adds the characters of the UCS beyond U+007F to unreserved:

ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                   / %xD0000-DFFFD / %xE1000-EFFFD
iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

Private-use characters (iprivate) can appear in the query part of an IRI
but not in the rest of the IRI.

So the characters in an IRI reference are:

         ALPHA / DIGIT / "-" / "." / "_" / "~" /
         ":" / "/" / "?" / "#" / "[" / "]" / "@" /
         "!" / "$" / "&" / "'" / "(" / ")" /
         "*" / "+" / "," / ";" / "=" /
         ucschar /
         iprivate /
         "%"

and the rq23 grammar becomes:

QuotedIRIref  	  ::= '<' IRICHAR* '>'     /* An IRI reference : RFC 3987 */

IRICHAR ::=
         [A-Z] | [a-z] | [0-9] | '-' | '.' | '_' | '~' |
         ':' | '/' | '?' | '#' | '[' | ']' | '@' |
         '!' | '$' | '&' | "'" | '(' | ')' |
         '*' | '+' | ',' | ';' | '=' |
         '%' |
         [#xA0-#xD7FF]     | [#xF900-#xFDCF]   | [#xFDF0-#xFFEF] |
         [#x10000-#x1FFFD] | [#x20000-#x2FFFD] | [#x30000-#x3FFFD] |
         [#x40000-#x4FFFD] | [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
         [#x70000-#x7FFFD] | [#x80000-#x8FFFD] | [#x90000-#x9FFFD] |
         [#xA0000-#xAFFFD] | [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
         [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD] |
         [#xE000-#xF8FF]   | [#xF0000-#xFFFFD] | [#x100000-#x10FFFD]


[I would be very grateful if someone checked this]
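
One way to check the production mechanically is to write the character set out as code and try it on a few tokens. The following Python sketch does that, assuming the character list given above (the RFC 3986 ASCII set, "%", ucschar and iprivate). The names is_irichar and is_quoted_iriref are illustrative, not part of any grammar or library.

```python
import string

# ASCII characters allowed inside the angle brackets, per the list above.
ASCII_IRICHARS = set(
    string.ascii_letters + string.digits
    + "-._~"          # rest of unreserved
    + ":/?#[]@"       # gen-delims
    + "!$&'()*+,;="   # sub-delims
    + "%"             # start of pct-encoded
)

# ucschar from RFC 3987: three BMP ranges, planes 1 to D, then E1000-EFFFD.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) + 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR += [(0xE1000, 0xEFFFD)]

# iprivate from RFC 3987.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def is_irichar(ch):
    """True if the single character ch matches IRICHAR."""
    if ch in ASCII_IRICHARS:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)

def is_quoted_iriref(token):
    """True if token matches '<' IRICHAR* '>'."""
    return (len(token) >= 2 and token[0] == "<" and token[-1] == ">"
            and all(is_irichar(c) for c in token[1:-1]))

if __name__ == "__main__":
    for t in ["<http://example.org/a?b=c#d>", "<a b>", "<r\u00e9sum\u00e9>"]:
        print(repr(t), is_quoted_iriref(t))
```

Like the production itself, this only checks which characters occur. It does not check that "%" is followed by two hex digits or that iprivate is confined to the query part.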



An alternative is to exclude the illegal characters:

That is (RFC3986):
0x00-0x20, 0x7F, '<' '>' "`"

but with RFC3987 it isn't that short:

FDD0-FDEF
FFF0-FFFF
1FFFE, 1FFFF
2FFFE, 2FFFF
etc for 3,4,5,6,7,8,9,A,B,C,D,E,F
10FFFE, 10FFFF,
110000 onwards
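
The excluded ranges above U+007F can also be derived rather than listed by hand. This Python sketch computes the gaps between the ucschar and iprivate ranges quoted earlier, up to U+10FFFF. The helper name gaps is made up for the example, and the result depends entirely on those ranges being transcribed correctly.

```python
# Allowed non-ASCII ranges: ucschar and iprivate as quoted from RFC 3987.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR += [(p << 16, (p << 16) + 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR += [(0xE1000, 0xEFFFD)]
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def gaps(ranges, lo, hi):
    """Return the sub-ranges of [lo, hi] not covered by any range."""
    out = []
    nxt = lo  # lowest code point not yet accounted for
    for a, b in sorted(ranges):
        if a > nxt:
            out.append((nxt, a - 1))
        nxt = max(nxt, b + 1)
    if nxt <= hi:
        out.append((nxt, hi))
    return out

EXCLUDED = gaps(UCSCHAR + IPRIVATE, 0x80, 0x10FFFF)

if __name__ == "__main__":
    for a, b in EXCLUDED:
        print("%X-%X" % (a, b))
```

Run against these ranges, the output includes the entries listed above and also 80-9F, D800-DFFF and DFFFE-E0FFF, which suggests the exclusion form is easier to get wrong by hand than the inclusion form.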

	Andy

Received on Monday, 1 August 2005 12:39:30 UTC