- From: Dave Beckett <dave.beckett@bristol.ac.uk>
- Date: Thu, 04 Aug 2005 14:58:45 +0100
- To: andy.seaborne@hp.com
- Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
On Mon, 2005-08-01 at 13:38 +0100, Seaborne, Andy wrote:
> The current grammar does have a rather open production for QuotedIRIref
> (anything except space and >). An IRI reference can be relative. There is a
> comment referring to RFC 3987 in the grammar. An implementation is going to
> have to additionally process IRI references anyway to make them absolute.
> Without including the whole of the IRI/URI grammar, we just parse IRIs.
>
> RFC 2396 defined "excluded characters" as:
>   control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>   space   = <US-ASCII coded character 20 hexadecimal>
>   delims  = "<" | ">" | "#" | "%" | <">

OK. How do these relate to IRIs? "#", for example, is excluded above
but included in the 'reserved' token below.

>
> RFC 3986 defines:

RFC 3986: Uniform Resource Identifier (URI): Generic Syntax

>   pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
>   pct-encoded = "%" HEXDIG HEXDIG
>   unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
>   reserved    = gen-delims / sub-delims
>   gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>   sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="

OK. RFC 2234 on ABNF defines:

  ALPHA  = %x41-5A / %x61-7A ; A-Z / a-z
  DIGIT  = %x30-39 ; 0-9
  HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

> RFC 3987 adds the characters of the UCS beyond U+007F to unreserved

RFC 3987: Internationalized Resource Identifiers (IRIs)

>   ipchar      = iunreserved / pct-encoded / sub-delims / ":" / "@"
>   iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
>   ucschar     = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>               / %xD0000-DFFFD / %xE1000-EFFFD
>   iprivate    = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

ipchar and iprivate are mentioned only in:

  iquery = *( ipchar / iprivate / "/" / "?" )

which is where "/" comes from in your summary below:

> Private can appear in a query string but not in the rest of the IRI.
>
> So the characters in an IRI reference are:
>
>   ALPHA / DIGIT / "-" / "." / "_" / "~"
>   ":" / "/" / "?" / "#" / "[" / "]" / "@"
>   "!" / "$" / "&" / "'" / "(" / ")"
>   "*" / "+" / "," / ";" / "="
>   ucschar
>   iprivate
>   "%"
>   + ":" (from ipchar)
>
> and the rq23 grammar becomes:
>
> QuotedIRIref ::= '<' IRICHAR* '>' /* An IRI reference : RFC 3987 */
>
> IRICHAR ::=
>   [A-Z] | [a-z] | '=' | '.' | '_' | '~' |
>   ':' | '/' | '?' | '#' | '[' | ']' | '@' |

ok, you added ':'

>   '!' | '$' | '&' | ''' | '(' | ')' |
>   '*' | '+' | ',' | ';' | '=' |
>   '%' |
>   [#xA0-D7FF] | [#xF900-FDCF] | [#xFDF0-FFEF] |
>   [#x10000-#x1FFFD] | [#x20000-#x2FFFD] | [#x30000-#x3FFFD] |
>   [#x40000-#x4FFFD] | [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
>   [#x70000-#x7FFFD] | [#x80000-#x8FFFD] | [#x90000-#x9FFFD] |
>   [#xA0000-#xAFFFD] | [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
>   [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD] |
>   [#xE000-F8FF] | [#xF0000-FFFFD] | [#x100000-#x10FFFD]
>
>
> [I would be very grateful if someone checked this]

Looks right.

> An alternative is to exclude the illegal characters:
>
> That is (RFC3986):
>   0x00-0x20, 0xFF, '<' '>' "`"
>
> but with RFC3987 it isn't that short:
>
>   FDD0-FDEF
>   FFF0-FFFF
>   1FFFE, 1FFFF
>   2FFFE, 2FFFF
>   etc for 3,4,5,6,7,8,9,A,B,C,D,E,F
>   10FFFE, 10FFFF,
>   200000 onwards

I'd hope there is some advice about this in a document near the
Character Model for the World Wide Web 1.0: Fundamentals
http://www.w3.org/TR/2005/REC-charmod-20050215/

  "Publicly interchanged content SHOULD NOT use codepoints in the
  private use area."
  http://www.w3.org/TR/charmod/#sec-PrivateUse

which includes "the Private Use Area (PUA) (U+E000-F8FF) and planes 15
and 16 (U+F0000-FFFFD and U+100000-10FFFD)."

There are probably more...

Dave
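The allow-list approach discussed in this message can be sketched in Python as follows. This is only an illustration built from the RFC 3986 and RFC 3987 character classes quoted above, not the rq23 grammar itself; the names ASCII_ALLOWED, UCSCHAR, IPRIVATE, IRICHAR_RE and is_iri_chars are hypothetical. Like the flat character list in the message, it accepts iprivate anywhere (RFC 3987 permits it only in the query part) and does not check that "%" is followed by two hex digits.

```python
import re

# ASCII characters allowed by RFC 3986, plus "%" for pct-encoded.
ASCII_ALLOWED = (
    "A-Za-z0-9"        # ALPHA / DIGIT
    r"\-._~"           # rest of unreserved
    r":/?#\[\]@"       # gen-delims
    r"!$&'()*+,;="     # sub-delims
    "%"                # introduces a pct-encoded triplet
)

# ucschar from RFC 3987: three BMP ranges, planes 1 to 13 (each up to xFFFD),
# and the tail of plane 14.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

# iprivate from RFC 3987 (assumption: accepted anywhere, as in the flat list).
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _ranges(pairs):
    """Render (lo, hi) code point pairs as regex character-class ranges."""
    return "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in pairs)


# Equivalent of IRICHAR* : zero or more allowed characters.
IRICHAR_RE = re.compile(
    f"[{ASCII_ALLOWED}{_ranges(UCSCHAR)}{_ranges(IPRIVATE)}]*"
)


def is_iri_chars(text):
    """Return True if every character of text is in the allow-list."""
    return IRICHAR_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("http://example.org/path?q=1#frag",
                      "http://example.org/a b",
                      "\uFDD0"):
        print(repr(candidate), is_iri_chars(candidate))
```

Because the ranges are listed explicitly, the noncharacters that the exclusion approach has to enumerate (FDD0-FDEF, xFFFE and xFFFF in each plane) are rejected without being spelled out.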
Received on Thursday, 4 August 2005 13:58:53 UTC