- From: Wouter Beek <w.g.j.beek@vu.nl>
- Date: Sat, 3 Mar 2018 13:55:09 +0100
- To: SW-forum Web <semantic-web@w3.org>
Hi Semantic Web community,

To what extent are Turtle parsers allowed to parse IRIs that do not conform to the official IRI grammar, as defined in RFC 3987, and how should they deal with issues that arise from the under-specification of IRI resolution for such non-RFC IRIs?

Firstly, the Turtle standard does not explicitly state that the IRI terms it uses should follow the RFC 3987 grammar. The RFC 3987 specification is mentioned several times, but only in relation to making relative IRIs absolute w.r.t. a given base URI, and in relation to security issues (Appendix B). As a side note, the SPARQL standard does explicitly mention that its IRI terms should be in line with RFC 3987:

[1] "The `iri' production designates the set of IRIs [RFC3987];"

Secondly, the BNF grammar of the Turtle standard does not follow the RFC grammar at all:

    IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

In practice, many Turtle parsers seem to implement this grammar rule, which allows non-RFC IRIs to be parsed. This implies that Turtle parsers allow non-RDF data to be parsed, since "An IRI within an RDF graph is a Unicode string that conforms to the syntax defined in RFC 3987" (RDF 1.1 Abstract Syntax).

Thirdly, according to the Turtle specification, a Turtle parser must resolve relative IRIs w.r.t. a base IRI, according to RFC 3986. This seems to require at least a partial implementation of the URI grammar defined in RFC 3986 and/or the IRI grammar defined in RFC 3987. Otherwise, a Turtle parser would be unable to make the distinction between relative and absolute IRIs in the first place.

Furthermore, the difference between relative and absolute IRIs is rather subtle. It is not immediately clear whether a simple heuristic exists that can be used to make this distinction reliably (except for implementing the RFC grammars, that is). To give a concrete example: the first path segment of a relative IRI reference is not allowed to contain a colon (but subsequent path segments are allowed to contain colons).
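The colon rule can be made concrete with a rough sketch in Python. It assumes only the Turtle IRIREF production quoted above and the scheme rule from RFC 3986 (ALPHA followed by ALPHA / DIGIT / "+" / "-" / "."). The function name `classify` and the sample strings are made up for illustration, and the check looks only at the scheme and the first path segment, not at the full RFC grammar.

```python
import re

# Body of a Turtle IRIREF (the part between '<' and '>'), transcribed
# from the production quoted above; UCHAR is \uXXXX or \UXXXXXXXX.
UCHAR = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"
IRIREF_BODY = re.compile(r'(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*')

# RFC 3986 scheme followed by its ':' delimiter.
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def classify(ref: str) -> str:
    """Illustrative classification; not a full RFC 3986/3987 validator."""
    if IRIREF_BODY.fullmatch(ref) is None:
        return "not-turtle"   # rejected even by the lax Turtle grammar
    if SCHEME.match(ref):
        return "absolute"     # starts with a syntactically valid scheme
    # No scheme, so this can only be a relative reference. Its first
    # path segment (everything before the first '/', '?' or '#')
    # must not contain a colon.
    first_segment = re.split(r"[/?#]", ref, maxsplit=1)[0]
    if ":" in first_segment:
        return "neither"      # accepted by Turtle, but not an RFC reference
    return "relative"


for sample in ["http://example.org/a", "a/b:c", "1:x", "a b"]:
    print(sample, "->", classify(sample))
# http://example.org/a -> absolute
# a/b:c -> relative
# 1:x -> neither
# a b -> not-turtle
```

Even this partial check needs to know where the scheme ends and where the first path segment ends, so it is already a small fragment of the RFC grammar rather than a grammar-free heuristic.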
This subtlety requires a Turtle parser to understand (i) what the IRI path component of a given IRI string is, (ii) how it is different from the IRI scheme and IRI authority components that precede it, and (iii) how the IRI path component is itself subdivided into different segments. None of this is entirely trivial, and all of this is needed in order to determine whether an IRI like [2] is relative or absolute.

[2] x::

Finally, the Turtle standard does not describe how/whether Turtle parsers should distinguish between absolute and relative non-RFC/non-RDF IRIs. As a consequence, different Turtle parsers implement different ways of resolving such Turtle-specific IRIs. As an example, input file [3] is parsed as triple [4] in N3.js, but as triple [5] in Serd and Rapper. There is no clear right or wrong here, because the Turtle standard does not define whether/how non-RFC/non-RDF IRIs like `_:s` should be resolved.

[3] base <b:b>
    <_:s> <p> <o> .

[4] <b:b_:s> <b:p> <b:o> .

[5] <b:_:s> <b:p> <b:o> .

So, my questions are as follows:

1. Is the distinction between RFC IRIs and non-RFC IRIs in the Turtle standard intentional?

2. If so, what is the main use case for the Turtle standard allowing documents to contain non-RFC IRIs? (I was under the impression that Turtle was only intended for serializing RDF data.)

3. In the absence of a clear definition of absolute and relative non-RFC IRIs, should non-RFC IRIs be resolved at all? Since Turtle parsers must implement the RFC grammars anyway (in order to distinguish relative RFC IRIs from absolute RFC IRIs), they should also be able to distinguish between RFC IRIs (which must be resolved according to the RFC standard) and non-RFC IRIs (which must not be resolved).

---
Cheers!,
Wouter Beek.
Received on Saturday, 3 March 2018 12:56:28 UTC