Re: How to deal with non-RFC IRIs in Turtle? from Andy Seaborne on 2018-03-04 (semantic-web@w3.org from March 2018)

From: Andy Seaborne <andy@seaborne.org>
Date: Sun, 4 Mar 2018 10:43:23 +0000
To: semantic-web@w3.org
Message-ID: <b00e53d4-860c-05ee-74a1-641b022582fa@seaborne.org>

Hi Wouter,

The idea of having the rule

      IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

in the grammar is not that the match is an IRI but that it generates a 
Unicode string that is processed further.  It does exclude many strings 
that are not IRIs. Other validation would also be applied, possibly by 
different software.
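
As a rough illustration of how little that rule checks, here is a sketch 
of it as a Python regular expression. The names (UCHAR, IRIREF, 
is_iriref_token) are made up for this example and are not taken from any 
spec or parser; the pattern only mirrors the token rule above and the 
UCHAR rule from the Turtle grammar, and does no RFC 3987 validation.

```python
import re

# UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'

# IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')


def is_iriref_token(text: str) -> bool:
    """True if text is exactly one IRIREF token.

    This is a token-level check only: it says nothing about whether
    the content between '<' and '>' is an IRI per RFC 3987.
    """
    return IRIREF.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ['<http://example.org/a>', '<_:s>', '<a b>', '<a\\b>']:
        print(candidate, is_iriref_token(candidate))
```

Note that '<_:s>' is accepted by the token rule even though '_:s' is not 
an RFC IRI; rejecting it is left to the later processing steps.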

Having the full IRI grammar was thought (by the SPARQL 1.0 WG, IIRC) to 
make the grammar large, and also having a language-in-a-language makes 
the implementation barrier higher. It would make using parser generator 
tools harder as the token set changes.

IRI validation might also be used elsewhere, so there is the issue of 
having the IRI processing implemented twice.

Both Turtle and SPARQL say it must be an IRI. This includes ones formed 
using prefixed names and any scheme-specific conditions.

Resolving IRIs is not part of the grammar; the grammar emits a Unicode 
string which must be further processed.

The grammar is one step in producing RDF data. In the Turtle spec, 
section 7 is performed on the output of the grammar parsing, including:

7.2 RDF Term Constructors
6.3 IRI References
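
A minimal sketch of that post-grammar step, assuming Python and its 
standard urllib.parse.urljoin for resolution. The helper names 
(unescape_uchar, has_scheme, iriref_to_iri) are hypothetical. urljoin 
only resolves against schemes it knows about (http, https, ...), and 
nothing here validates the result against RFC 3987, so this shows the 
shape of the pipeline rather than a conforming implementation.

```python
import re
from urllib.parse import urljoin

# \uXXXX and \UXXXXXXXX escapes permitted inside an IRIREF token.
_UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

# RFC 3986 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":"
_SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*:')


def unescape_uchar(s: str) -> str:
    """Replace UCHAR escapes with the code points they denote.

    Raises ValueError if a \\U escape is outside the Unicode range.
    """
    return _UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), s)


def has_scheme(s: str) -> bool:
    """True if s starts with something shaped like an RFC 3986 scheme."""
    return _SCHEME.match(s) is not None


def iriref_to_iri(token: str, base: str) -> str:
    """Turn an IRIREF token such as '<p>' into an absolute string.

    Steps: strip '<' and '>', decode UCHAR escapes, then resolve
    against base when no scheme is present. IRI validation (RFC 3987,
    scheme-specific rules) would follow and is omitted here.
    """
    raw = unescape_uchar(token[1:-1])
    if has_scheme(raw):
        return raw
    return urljoin(base, raw)


if __name__ == "__main__":
    print(iriref_to_iri('<p>', 'http://example.org/dir/doc'))
```

With this scheme test, 'x::' counts as absolute (scheme 'x') while '_:s' 
does not, because '_' cannot start a scheme.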

     Andy

On 03/03/18 12:55, Wouter Beek wrote:
> Hi Semantic Web community,
> 
> To what extent are Turtle parsers allowed to parse IRIs that do not
> conform to the official IRI grammar, as defined in RFC 3987, and how
> to deal with issues that arise from the under-specification of IRI
> resolution for such non-RFC IRIs?
> 
> Firstly, the Turtle standard does not explicitly state that the IRI
> terms it uses should follow the RFC 3987 grammar.  The RFC 3987
> specification is mentioned several times, but only in relation to
> making relative IRIs absolute w.r.t. a given base URI, and in relation
> to security issues (Appendix B).
> 
> As a side note, the SPARQL standard does explicitly mention that its
> IRI terms should be in line with RFC 3987:
> 
>      [1] "The `iri' production designates the set of IRIs [RFC3987];"
> 
> Secondly, the BNF grammar of the Turtle standard does not follow the
> RFC grammar at all:
> 
>      IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
> 
> In practice, many Turtle parsers seem to implement this grammar rule,
> which allows non-RFC IRIs to be parsed.  This implies that Turtle
> parsers allow non-RDF data to be parsed, since "An IRI within an RDF
> graph is a Unicode string that conforms to the syntax defined in RFC
> 3987" (RDF 1.1 Abstract Syntax).
> 
> Thirdly, according to the Turtle specification, a Turtle parser must
> resolve relative IRIs w.r.t. a base IRI, according to RFC 3986.  This
> seems to require at least a partial implementation of the URI grammar
> defined in RFC 3986 and/or the IRI grammar defined in RFC 3987.
> Otherwise, a Turtle parser would be unable to make the distinction
> between relative and absolute IRIs in the first place.
> 
> Furthermore, the difference between relative and absolute IRIs is
> rather subtle.  It is not immediately clear whether a simple heuristic
> exists that can be used to make this distinction reliably (except for
> implementing the RFC grammars, that is).  To give a concrete example:
> the first segment of an IRI path is not allowed to contain a colon
> (but subsequent path segments are allowed to contain colons).  This
> subtlety requires a Turtle parser to understand (i) what the IRI path
> component of a given IRI string is, (ii) how it is different from the
> IRI scheme and IRI authority components that precede it, and (iii) how
> the IRI path component is itself subdivided into different segments.
> None of this is entirely trivial, and all of this is needed in order
> to determine whether an IRI like [2] is relative or absolute.
> 
>      [2] x::
> 
> Finally, the Turtle standard does not describe how/whether Turtle
> parsers should distinguish between absolute and relative
> non-RFC/non-RDF IRIs.  As a consequence, different Turtle parsers
> implement different ways of resolving such Turtle-specific IRIs.  As an
> example, input file [3] is parsed as triple [4] in N3.js, but as triple [5] in
> Serd and Rapper.  There is no clear right or wrong here, because the
> Turtle standard does not define whether/how non-RFC/non-RDF IRIs
> like `_:s` should be resolved.
> 
>      [3] base <b:b>
>          <_:s> <p> <o> .
>      [4] <b:b_:s> <b:p> <b:o> .
>      [5] <b:_:s> <b:p> <b:o> .
> 
> So, my questions are as follows:
> 
>    1. Is the distinction between RFC IRIs and non-RFC IRIs in the
>       Turtle standard intentional?
> 
>    2. If so, what is the main use case for the Turtle standard allowing
>       documents to contain non-RFC IRIs?  (I was under the impression
>       that Turtle was only intended for serializing RDF data.)
> 
>    3. In the absence of a clear definition of absolute and relative
>       non-RFC IRIs, should non-RFC IRIs be resolved at all?  Since
>       Turtle parsers must implement the RFC grammars anyway (in order
>       to distinguish relative RFC IRIs from absolute RFC IRIs), they
>       should also be able to distinguish between RFC IRIs (which must
>       be resolved according to the RFC standard) and non-RFC IRIs
>       (which must not be resolved).
> 
> ---
> Cheers!,
> Wouter Beek.
>
Received on Sunday, 4 March 2018 10:43:49 UTC