Re: tightening up the Turtle grammar

* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-03-27 08:42+0000]
> Eric,
> 
> Could you spell out exactly what the changes would be?  (Not sure
> everyone is following every link, every time....)

Trim out codepoints which are never legal in IRIs:

HIGHUCS  ::= [#x10000-1FFFD] | [#x20000-2FFFD] | [#x30000-3FFFD]
           | [#x40000-4FFFD] | [#x50000-5FFFD] | [#x60000-6FFFD]
           | [#x70000-7FFFD] | [#x80000-8FFFD] | [#x90000-9FFFD]
           | [#xA0000-AFFFD] | [#xB0000-BFFFD] | [#xC0000-CFFFD]
           | [#xD0000-DFFFD] | [#xE1000-EFFFD]

UCSCHAR ::= [#xA0-D7FF]     | [#xF900-FDCF]   | [#xFDF0-FFEF] | HIGHUCS

IRIREF ::= '<' ([\x21\x23-\x3b\x3d\x3f-\x5b\x5d\x5fa-z\x7e-\x7F] | UCSCHAR)* '>'

PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | HIGHUCS
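
As a rough illustration of the HIGHUCS and UCSCHAR ranges above, here is a
minimal Python sketch of a membership test. The names HIGHUCS_RANGES,
UCSCHAR_RANGES and is_ucschar are hypothetical, chosen only for this
example; the ranges are copied from the two productions and nothing else
in the grammar is modelled.

```python
# Sketch only: ranges transcribed from the HIGHUCS and UCSCHAR productions.
# Planes 1 through 13 each contribute [#xN0000-#xNFFFD].
HIGHUCS_RANGES = [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
# Plane 14 starts at #xE1000 rather than #xE0000.
HIGHUCS_RANGES.append((0xE1000, 0xEFFFD))

UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)] + HIGHUCS_RANGES


def is_ucschar(code_point: int) -> bool:
    """Return True if code_point falls inside one of the UCSCHAR ranges."""
    return any(low <= code_point <= high for low, high in UCSCHAR_RANGES)
```

With this, is_ucschar(0xD800) and is_ucschar(0x1FFFE) are both false, while
is_ucschar(0x10000) is true.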

and for strings, the included region, referencing UCSCHAR a la
       "'" ([\x00-\x09\x0B-\x0C\x0E-\x26\x28-\x5B\x5D-\x7F] | UCSCHAR | ECHAR )* "'"

Add text at the bottom of the grammar, just before section 7.
[[
Note that RDF literals are Unicode strings; they must be composed of
valid Unicode characters. The code points in the Unicode surrogate
code range, U+D800-U+DFFF, are not Unicode characters.

Note that IRIs produced by matching [135s] iri in a Turtle document
are RDF IRIs as defined in RDF Concepts. This effectively constrains
Turtle documents to those which, when treated according to the parsing
rules below, produce valid IRI references per RFC 3987.
]]
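
The surrogate restriction in the first note can be checked mechanically. A
minimal Python sketch, assuming the literal's lexical form is already
available as a Python str (the function name has_surrogate is hypothetical):

```python
def has_surrogate(text: str) -> bool:
    """Return True if text contains a code point in U+D800-U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
```

A Python 3 str is a sequence of code points, so a character outside the BMP
counts as one code point here and does not trip the check; only a lone
surrogate does.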



> 	Thanks
> 	Andy
> 
> On 26/03/13 21:01, Eric Prud'hommeaux wrote:
> >The Turtle spec says that parsing the PNAME_NS and PNAME_LN terminals
> >produces an IRI as defined in RDF Concepts.
> >   http://www.w3.org/TR/turtle/#handle-IRI
> >   http://www.w3.org/TR/turtle/#handle-PNAME_LN
> >   http://www.w3.org/TR/2013/WD-rdf11-concepts-20130115/#dfn-iri
> >RDF Concepts says that IRI is "a Unicode string [UNICODE] that
> >conforms to the syntax defined in RFC 3987 [RFC3987]." In sum, we
> >provide a pretty liberal grammar and then point to a hilariously
> >complex grammar, but don't expect anyone to enforce it.
> >
> >Comments c23 "IRIREF production less restrictive than RFC3987" and c26
> >"PN_CHARS_BASE outside of IRI range" indicate some frustration with our
> >grammar which permits characters which aren't allowed anywhere in IRIs.
> >
> >   <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#c23>
> >   <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#c26>
> >
> >One approach would be to trim the bogus chars off of PN_CHARS_BASE and
> >include a note below the grammar which points directly at 3987 and
> >states that the IRIs constructed by either IRIREF or PNAME_LN are 3987
> >IRIs. This would supplement the note about valid literal ranges
> >proposed to address c27.
> >
> >   <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#c27>
> >   <http://www.w3.org/mid/20130324145153.GN14139@w3.org>
> >
> >I have spoken to those acting as W3C director. They consider this to
> >be a clarification and nothing that would require another LC.
> >
> 

-- 
-ericP

Received on Wednesday, 27 March 2013 13:16:51 UTC