Re: PN_CHARS_BASE outside of IRI range

* Gregg Kellogg <gregg@greggkellogg.net> [2013-03-23 15:35-0700]
> I've been struggling with the localName_with_PN_CHARS_BASE_character_boundaries.ttl test, which tests the range of characters allowed by PN_CHARS_BASE. Within the grammar, this is defined as follows:
> 
> [163s] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> 
> This explicitly includes characters beyond what is allowed in the RFC-3987 [2] ucschar production:
> 
> ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
> 
> As a result, even though my Turtle processor parses the test, it fails when I try to validate the output, where I ensure that IRIs are also valid. My read of the ucschar production is that a valid IRI does not include %xEFFFE or %xEFFFF, which _are_ included in Turtle (and SPARQL I believe).
> 
> (Interestingly, it also excludes some ranges that are included in ucschar, but that is the subject of issue-190 [3]).
> 
> Since the horse has probably left the barn, I don't expect PN_CHARS_BASE to change at this point, but tests such as localName_with_PN_CHARS_BASE_character_boundaries.ttl should probably be limited to valid IRIs according to RFC-3987, as that spec is normatively referenced.

Thank you for resolving a long-standing mystery. The Turtle productions come from the SPARQL productions, which in turn come from XML's NameStartChar:

  [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
— http://www.w3.org/TR/REC-xml/#NT-NameStartChar

Per <http://www.unicode.org/charts/PDF/UEFF80.pdf>, U+EFFFE and U+EFFFF are intended for process-internal use and will thus never be useful in IRIs.
(Relevant only to literals, U+10FFFE and U+10FFFF are also process-internal per <http://www.unicode.org/charts/PDF/U10FF80.pdf>.)
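
For anyone wanting to check the boundary themselves, here is a rough Python sketch. The ranges are transcribed by hand from the PN_CHARS_BASE and ucschar productions quoted above; the names (PN_CHARS_BASE, UCSCHAR, in_ranges, check) are just illustrative and don't come from any spec or implementation. It only looks at the three code points under discussion.

```python
# Ranges transcribed from the PN_CHARS_BASE production quoted above.
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x02FF),
    (0x0370, 0x037D), (0x037F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Ranges transcribed from the RFC 3987 ucschar production quoted above.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 0xD each contribute %xN0000-NFFFD.
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
# Plane 0xE starts at E1000 rather than E0000.
UCSCHAR.append((0xE1000, 0xEFFFD))


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def check(cp):
    """Return (allowed by PN_CHARS_BASE, allowed by ucschar) for code point cp."""
    return in_ranges(cp, PN_CHARS_BASE), in_ranges(cp, UCSCHAR)


if __name__ == "__main__":
    for cp in (0xEFFFD, 0xEFFFE, 0xEFFFF):
        pn, ucs = check(cp)
        print(f"U+{cp:05X}  PN_CHARS_BASE={pn}  ucschar={ucs}")
```

U+EFFFD passes both checks, while U+EFFFE and U+EFFFF pass PN_CHARS_BASE but not ucschar, which matches your reading.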

For reasons I couldn't recover, I had U+EFFFD in my own implementation, but changed it to U+EFFFF 'cause the coverage test seemed justified by the production. Given your observation, I've updated the tests (and reverted my code).
  https://dvcs.w3.org/hg/rdf/rev/dad56881f954

I think I could argue that this is an error in the spec and could be fixed without another Last Call. I've raised ISSUE-123 to poke at "PN_CHARS_BASE permits up to U+EFFFF but RFC-3987 stops at U+EFFFD".
  https://www.w3.org/2011/rdf-wg/track/issues/123

I hope this complete agreement with your position is a satisfactory resolution of your comment. If so, please respond with "[RESOLVED]" at the beginning of your subject.


> Gregg Kellogg
> gregg@greggkellogg.net
> 
> [1] http://www.w3.org/TR/turtle/#sec-grammar-grammar
> [2] http://www.ietf.org/rfc/rfc3987.txt
> [3] http://www.w3.org/International/track/issues/190
-- 
-ericP

Received on Sunday, 24 March 2013 05:45:36 UTC