- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Sun, 24 Mar 2013 01:45:06 -0400
- To: Gregg Kellogg <gregg@greggkellogg.net>
- Cc: "public-rdf-comments@w3.org Comments" <public-rdf-comments@w3.org>
* Gregg Kellogg <gregg@greggkellogg.net> [2013-03-23 15:35-0700]
> I've been struggling with the localName_with_PN_CHARS_BASE_character_boundaries.ttl test, which tests the range of characters allowed by PN_CHARS_BASE. Within the grammar, this is defined as follows:
>
> [163s] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
>
> This explicitly includes characters beyond what is allowed in the RFC-3987 [2] ucschar production:
>
> ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>         / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>         / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>         / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>         / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>         / %xD0000-DFFFD / %xE1000-EFFFD
>
> As a result, even though my Turtle processor parses the test, it fails when I try to validate the output, where I ensure that IRIs are also valid. My read of the ucschar production is that a valid IRI does not include %xEFFFE or %xEFFFF, which _are_ included in Turtle (and SPARQL I believe).
>
> (Interestingly, it also excludes some ranges that are included in ucschar, but that is the subject of issue-190 [3]).
>
> Since the horse has probably left the barn, I don't expect PN_CHARS_BASE to change at this point, but tests, such as localName_with_PN_CHARS_BASE_character_boundaries.ttl should probably be limited to be valid IRIs according to RFC-3987, as that spec is normatively referenced.

Thank you for resolving a long-standing mystery.
The Turtle productions come from the SPARQL productions, which come from

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
 -- http://www.w3.org/TR/REC-xml/#NT-NameStartChar

Per <http://www.unicode.org/charts/PDF/UEFF80.pdf>, U+EFFFE and U+EFFFF are intended for process-internal use and will thus never be useful in IRIs. (Relevant only to literals, U+10FFFE and U+10FFFF are also process-internal per <http://www.unicode.org/charts/PDF/U10FF80.pdf>.)

For reasons I couldn't recover, I had U+EFFFD in my own implementation, but changed it to U+EFFFF 'cause the coverage test seemed justified by the production. Given your observation, I've updated the tests (and reverted my code).
  https://dvcs.w3.org/hg/rdf/rev/dad56881f954

I think I could argue that this is an error in the spec and could be fixed without another Last Call. I've raised ISSUE-123 to poke at "PN_CHARS_BASE permits up to U+EFFFF but RFC-3987 stops at U+EFFFD".
  https://www.w3.org/2011/rdf-wg/track/issues/123

I hope this complete agreement with your position is a satisfactory resolution of your comment. If so, please respond with "[RESOLVED]" at the beginning of your subject.

> Gregg Kellogg
> gregg@greggkellogg.net
>
> [1] http://www.w3.org/TR/turtle/#sec-grammar-grammar
> [2] http://www.ietf.org/rfc/rfc3987.txt
> [3] http://www.w3.org/International/track/issues/190

-- 
-ericP
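
The boundary mismatch discussed in this thread can be checked mechanically. What follows is a minimal Python sketch, not part of the original message: the range tables are transcribed from the PN_CHARS_BASE and ucschar productions quoted above, and the helper names (in_ranges, allowed_by_turtle_only) are hypothetical, chosen only for illustration.

```python
# Ranges transcribed from the Turtle PN_CHARS_BASE production quoted above.
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x02FF),
    (0x0370, 0x037D), (0x037F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Ranges transcribed from the RFC 3987 ucschar production quoted above.
UCSCHAR = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def allowed_by_turtle_only(cp):
    """Return True if cp is in PN_CHARS_BASE but not in ucschar.

    ASCII code points are skipped: ASCII letters are permitted in IRIs by
    other RFC 3987 rules (ALPHA), not by ucschar, so comparing them against
    ucschar would report a false mismatch.
    """
    if cp < 0x80:
        return False
    return in_ranges(cp, PN_CHARS_BASE) and not in_ranges(cp, UCSCHAR)


if __name__ == "__main__":
    # The three code points around the upper boundary discussed in the thread.
    for cp in (0xEFFFD, 0xEFFFE, 0xEFFFF):
        print(f"U+{cp:X}: Turtle-only = {allowed_by_turtle_only(cp)}")
```

A test generator could use a check like this to keep boundary tests within the intersection of the two productions, which is the restriction requested in the comment.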
Received on Sunday, 24 March 2013 05:45:36 UTC