Re: [Fwd: SPARQL: QuotedIRIref too lax] from Dave Beckett on 2005-08-04 (public-rdf-dawg@w3.org from July to September 2005)

From: Dave Beckett <dave.beckett@bristol.ac.uk>
Date: Thu, 04 Aug 2005 14:58:45 +0100
To: andy.seaborne@hp.com
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <1123163925.20354.22.camel@hoth.ilrt.bris.ac.uk>

On Mon, 2005-08-01 at 13:38 +0100, Seaborne, Andy wrote:
> The current grammar does have a rather open production for QuotedIRIref 
> (anything except space and >).  An IRI reference can be relative.  There is a 
> comment referring to RFC 3987 in the grammar.  An implementation is going to 
> have to additionally process IRI references anyway to make them absolute. 
> Without including the whole of the IRI/URI grammar, we just parse IRIs.
> 
> RFC 2396 defined "excluded characters" as:
>    control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>    space = <US-ASCII coded character 20 hexadecimal>
>    delims      = "<" | ">" | "#" | "%" | <">

OK

How do these relate to IRIs?  "#" for example is excluded above, but
included in the 'reserved' token below.

> 
> RFC 3986 defines:
RFC 3986: Uniform Resource Identifier (URI): Generic Syntax

>     pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
>     pct-encoded   = "%" HEXDIG HEXDIG
>     unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>     reserved      = gen-delims / sub-delims
>     gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>     sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>                   / "*" / "+" / "," / ";" / "="

OK

RFC 2234 on ABNF defines:
        ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z
        DIGIT          =  %x30-39 ; 0-9
        HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"


> RFC 3987 adds the characters of the UCS beyond U+007F to unreserved

RFC 3987: Internationalized Resource Identifiers (IRIs)

> ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
> iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
> ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                    / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                    / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                    / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                    / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                    / %xD0000-DFFFD / %xE1000-EFFFD
> iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

ipchar and iprivate are mentioned only in:
   iquery         = *( ipchar / iprivate / "/" / "?" )

which is where "/" comes from in your summary below:

> Private can appear in a query string but not in the rest of the IRI.
> 
> So the characters in an IRI reference are:
> 
> 	ALPHA / DIGIT / "-" / "." / "_" / "~"
>          ":" / "/" / "?" / "#" / "[" / "]" / "@"
>          "!" / "$" / "&" / "'" / "(" / ")"
>          "*" / "+" / "," / ";" / "="
> 	ucschar
> 	iprivate
>          "%"

+ ":" (from ipchar)

> and the rq23 grammar becomes:
> 
> QuotedIRIref  	  ::= '<' IRICHAR* '>'     /* An IRI reference : RFC 3987 */
> 
> IRICHAR ::=
>          [A-Z] | [a-z] | [0-9] | '-' | '.' | '_' | '~' |
>          ':' | '/' | '?' | '#' | '[' | ']' | '@' |

ok, you added ':'

>          '!' | '$' | '&' | ''' | '(' | ')' |
>          '*' | '+' | ',' | ';' | '=' |
>          '%' |
>          [#xA0-#xD7FF]     | [#xF900-#xFDCF]   | [#xFDF0-#xFFEF]   |
>          [#x10000-#x1FFFD] | [#x20000-#x2FFFD] | [#x30000-#x3FFFD] |
>          [#x40000-#x4FFFD] | [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
>          [#x70000-#x7FFFD] | [#x80000-#x8FFFD] | [#x90000-#x9FFFD] |
>          [#xA0000-#xAFFFD] | [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
>          [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD] |
>          [#xE000-#xF8FF]   | [#xF0000-#xFFFFD] | [#x100000-#x10FFFD]
> 
> 
> [I would be very grateful if someone checked this]

looks right.
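
For anyone who wants to test the character set mechanically, here is a
minimal Python sketch. It follows the character summary earlier in this
message (ALPHA / DIGIT / "-" / "." / "_" / "~", gen-delims, sub-delims,
"%", ucschar, iprivate). The names ASCII_ALLOWED, is_irichar and
is_quoted_iriref are made up for illustration and are not part of any
grammar or library. It checks characters only; it does not validate IRI
structure or resolve relative references.

```python
# ASCII characters from the summary: unreserved, gen-delims, sub-delims, "%".
ASCII_ALLOWED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"          # rest of unreserved
    ":/?#[]@"       # gen-delims
    "!$&'()*+,;="   # sub-delims
    "%"             # introduces pct-encoded
)

# ucschar ranges from RFC 3987 (inclusive code points).
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: x0000-xFFFD each.
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
# Plane 14 starts at E1000, not E0000.
UCSCHAR.append((0xE1000, 0xEFFFD))

# iprivate ranges from RFC 3987.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_irichar(ch):
    """True if ch is in the IRICHAR set sketched above."""
    if ch in ASCII_ALLOWED:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)


def is_quoted_iriref(token):
    """True if token has the form '<' IRICHAR* '>'."""
    return (
        len(token) >= 2
        and token[0] == "<"
        and token[-1] == ">"
        and all(is_irichar(c) for c in token[1:-1])
    )
```

With this, is_quoted_iriref("<http://example.org/a#b>") is accepted and
is_quoted_iriref("<a b>") is rejected because of the space.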

> An alternative is to exclude the illegal characters:
> 
> That is (RFC3986):
> 0x00-0x20, 0xFF, '<' '>' "`"
> 
> but with RFC3987 it isn't that short:
> 
> FDD0-FDEF
> FFF0-FFFF
> 1FFFE, 1FFFF
> 2FFFE, 2FFFF
> etc for 3,4,5,6,7,8,9,A,B,C,D,E,F
> 10FFFE, 10FFFF,
> 200000 onwards
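
The excluded list can also be derived rather than written by hand. The
sketch below (Python; ALLOWED and excluded_ranges are illustrative names
only) takes the ucschar and iprivate ranges quoted above and computes the
gaps between them from U+0080 up to U+10FFFF. Under those ranges the gaps
also include 80-9F, D800-DFFF and DFFFE-E0FFF, which the short list does
not spell out.

```python
# ucschar plus iprivate ranges from RFC 3987, sorted by start code point.
ALLOWED = sorted(
    [(0xA0, 0xD7FF), (0xE000, 0xF8FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
    + [(0xE1000, 0xEFFFD), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
)


def excluded_ranges(allowed, start=0x80, end=0x10FFFF):
    """Return the inclusive gaps between sorted allowed ranges."""
    gaps = []
    cursor = start
    for lo, hi in allowed:
        if lo > cursor:
            gaps.append((cursor, lo - 1))
        cursor = max(cursor, hi + 1)
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps


EXCLUDED = excluded_ranges(ALLOWED)

if __name__ == "__main__":
    for lo, hi in EXCLUDED:
        print(f"{lo:X}-{hi:X}")
```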

I'd hope there is some advice about this in a document near the
Character Model for the World Wide Web 1.0: Fundamentals
http://www.w3.org/TR/2005/REC-charmod-20050215/

"Publicly interchanged content SHOULD NOT use codepoints in the private
use area."
http://www.w3.org/TR/charmod/#sec-PrivateUse

which includes "the Private Use Area (PUA) (U+E000-F8FF) and planes 15
and 16 (U+F0000-FFFFD and U+100000-10FFFD)."

There are probably more...
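
If an implementation wanted to warn about private-use code points in an
IRI rather than reject them, a rough Python sketch could look like this.
It uses only the three ranges quoted above; PRIVATE_USE and
private_use_chars are hypothetical names, and whether to warn at all is an
assumption, not something the grammar requires.

```python
# Private-use ranges as quoted from charmod (same as iprivate in RFC 3987).
PRIVATE_USE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def private_use_chars(text):
    """Return (index, code point) pairs for private-use characters in text."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRIVATE_USE):
            hits.append((i, cp))
    return hits
```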

Dave
Received on Thursday, 4 August 2005 13:58:53 UTC