How to deal with non-RFC IRIs in Turtle? from Wouter Beek on 2018-03-03 (semantic-web@w3.org from March 2018)

From: Wouter Beek <w.g.j.beek@vu.nl>
Date: Sat, 3 Mar 2018 13:55:09 +0100
To: SW-forum Web <semantic-web@w3.org>
Message-ID: <CAEh2WcMpZxxQxAMTOOAuO6sg8T=Uf=cj7sacYqp1wNHj7=mXwA@mail.gmail.com>

Hi Semantic Web community,

To what extent are Turtle parsers allowed to parse IRIs that do not
conform to the official IRI grammar, as defined in RFC 3987, and how
should they deal with the issues that arise from the
under-specification of IRI resolution for such non-RFC IRIs?

Firstly, the Turtle standard does not explicitly state that the IRI
terms it uses should follow the RFC 3987 grammar.  The RFC 3987
specification is mentioned several times, but only in relation to
making relative IRIs absolute w.r.t. a given base URI, and in relation
to security issues (Appendix B).

As a side note, the SPARQL standard does explicitly mention that its
IRI terms should be in line with RFC 3987:

    [1] "The `iri' production designates the set of IRIs [RFC3987];"

Secondly, the BNF grammar of the Turtle standard is far more
permissive than the RFC grammar:

    IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

In practice, many Turtle parsers seem to implement this grammar rule,
which allows non-RFC IRIs to be parsed.  This implies that Turtle
parsers allow non-RDF data to be parsed, since "An IRI within an RDF
graph is a Unicode string that conforms to the syntax defined in RFC
3987" (RDF 1.1 Abstract Syntax).
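
As a rough illustration, the IRIREF rule can be transcribed into a
Python regular expression.  This is only a sketch: the name IRIREF is
reused from the grammar, the UCHAR alternatives follow the Turtle
grammar's \uXXXX and \UXXXXXXXX escapes, and the test strings other
than `_:s` and `x::` are made up for the example.

```python
import re

# Transcription of the Turtle rule
#   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# UCHAR is either \u plus 4 hex digits or \U plus 8 hex digits.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]'   # any character outside the excluded set
    r'|\\u[0-9A-Fa-f]{4}'           # UCHAR, short form
    r'|\\U[0-9A-Fa-f]{8}'           # UCHAR, long form
    r')*>'
)

def is_iriref(token):
    """True if the whole token matches the IRIREF rule."""
    return IRIREF.fullmatch(token) is not None

print(is_iriref("<_:s>"))   # True
print(is_iriref("<x::>"))   # True
print(is_iriref("<a b>"))   # False: a space is excluded
print(is_iriref("<a{b>"))   # False: '{' is excluded
```

The rule only excludes a handful of characters, so a string like
`_:s` passes without any check against the RFC 3987 grammar.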

Thirdly, according to the Turtle specification, a Turtle parser must
resolve relative IRIs w.r.t. a base IRI, according to RFC 3986.  This
seems to require at least a partial implementation of the URI grammar
defined in RFC 3986 and/or the IRI grammar defined in RFC 3987.
Otherwise, a Turtle parser would be unable to make the distinction
between relative and absolute IRIs in the first place.

Furthermore, the difference between relative and absolute IRIs is
rather subtle.  It is not immediately clear whether a simple heuristic
exists that can be used to make this distinction reliably (except for
implementing the RFC grammars, that is).  To give a concrete example:
the first segment of a relative path reference is not allowed to
contain a colon (but subsequent path segments are allowed to contain
colons).  This
subtlety requires a Turtle parser to understand (i) what the IRI path
component of a given IRI string is, (ii) how it is different from the
IRI scheme and IRI authority components that precede it, and (iii) how
the IRI path component is itself subdivided into different segments.
None of this is entirely trivial, and all of this is needed in order
to determine whether an IRI like [2] is relative or absolute.

    [2] x::
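
To make the subtlety concrete, here is a minimal sketch of a
classifier that checks only two things: the scheme syntax from RFC
3986 section 3.1, and the no-colon rule for the first segment of a
relative path.  The function name `classify` and its three labels are
my own, and the sketch does not validate the rest of the string
against the RFC grammar.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")

def classify(ref):
    """Label a reference as 'absolute', 'relative' or 'neither'.

    'absolute' only means that a syntactically valid scheme prefix was
    found; the remainder of the string is not checked.
    """
    if SCHEME_PREFIX.match(ref):
        return "absolute"
    # Without a scheme, the first segment ends at the first '/', '?' or '#'.
    first_segment = re.split(r"[/?#]", ref, maxsplit=1)[0]
    if ":" in first_segment:
        # A relative path may not have a colon in its first segment.
        return "neither"
    return "relative"

print(classify("x::"))    # absolute
print(classify("a/b:c"))  # relative
print(classify("_:s"))    # neither
```

Under these assumptions [2] counts as absolute, because `x` is a valid
scheme and the remaining `:` is its path, while `_:s` is neither
absolute nor relative, because `_` cannot start a scheme.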

Finally, the Turtle standard does not describe whether or how Turtle
parsers should distinguish between absolute and relative
non-RFC/non-RDF IRIs.  As a consequence, different Turtle parsers
resolve such Turtle-specific IRIs in different ways.  For example,
input file [3] is parsed as triple [4] by N3.js, but as triple [5] by
Serd and Rapper.  There is no clear right or wrong here, because the
Turtle standard does not define whether or how non-RFC/non-RDF IRIs
like `_:s` should be resolved.

    [3] base <b:b>
        <_:s> <p> <o> .
    [4] <b:b_:s> <b:p> <b:o> .
    [5] <b:_:s> <b:p> <b:o> .
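
For comparison, the sketch below shows one possible reading: check
the scheme strictly (RFC 3986 section 3.1), treat everything else as
a relative path, and merge it with the base path as in RFC 3986
section 5.2.3.  It is a simplified, hypothetical resolver (the names
`split_scheme` and `resolve` are mine), it assumes a base without
authority and references without query or fragment, and it skips
dot-segment removal.  It says nothing about how any of the parsers
mentioned above are implemented.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_PREFIX = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")

def split_scheme(ref):
    """Return (scheme, rest), with scheme None if no valid scheme prefix."""
    m = SCHEME_PREFIX.match(ref)
    if m is None:
        return None, ref
    return m.group(1), ref[m.end():]

def resolve(base, ref):
    """Simplified resolution: authority-less base, path-only reference."""
    scheme, _ = split_scheme(ref)
    if scheme is not None:
        return ref  # a valid scheme makes the reference absolute
    base_scheme, base_path = split_scheme(base)
    if base_scheme is None:
        raise ValueError("base must be absolute")
    # Merge: keep the base path up to and including its last '/',
    # then append the reference.
    merged = base_path[: base_path.rfind("/") + 1] + ref
    return f"{base_scheme}:{merged}"

print(resolve("b:b", "p"))    # b:p
print(resolve("b:b", "_:s"))  # b:_:s
print(resolve("b:b", "x::"))  # x::
```

Under this reading the output for `_:s` coincides with [5].  A parser
that instead splits references with the looser regular expression
from RFC 3986 Appendix B would take `_` to be a scheme and leave
`_:s` unresolved, which is a third possible outcome.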

So, my questions are as follows:

  1. Is the distinction between RFC IRIs and non-RFC IRIs in the
     Turtle standard intentional?

  2. If so, what is the main use case for the Turtle standard allowing
     documents to contain non-RFC IRIs?  (I was under the impression
     that Turtle was only intended for serializing RDF data.)

  3. In the absence of a clear definition of absolute and relative
     non-RFC IRIs, should non-RFC IRIs be resolved at all?  Since
     Turtle parsers must implement the RFC grammars anyway (in order
     to distinguish relative RFC IRIs from absolute RFC IRIs), they
     should also be able to distinguish between RFC IRIs (which must
     be resolved according to the RFC standard) and non-RFC IRIs
     (which must not be resolved).

---
Cheers!,
Wouter Beek.
Received on Saturday, 3 March 2018 12:56:28 UTC