Re: How to deal with non-RFC IRIs in Turtle? from Eric Prud'hommeaux on 2018-03-04 (semantic-web@w3.org from March 2018)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sun, 4 Mar 2018 16:07:56 -0500
To: Wouter Beek <w.g.j.beek@vu.nl>
Cc: Andy Seaborne <andy@seaborne.org>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <20180304210752.GJ11678@w3.org>
* Wouter Beek <w.g.j.beek@vu.nl> [2018-03-04 13:50+0100]
> Hi Andy,
> 
> Thank you for your helpful response.  You explain that a Turtle parser
> is not required to implement the full RFC URI/IRI grammars, and that
> full IRI parsing may take place later in the RDF ingestion pipeline.  I
> have follow-up questions about both of these points.
> 
> Regarding your first point, it is not entirely clear to me how a
> Turtle parser is able to properly resolve relative IRIs without also
> implementing (a non-trivial part of) the RFC grammars.  Since the
> `IRIREF` rule is clearly insufficient in order to make this
> distinction, I would expect the Turtle standard to make explicit the
> minimal criteria a Turtle parser should implement in order to
> determine whether an IRI is absolute or relative.

I'm sympathetic to your goal of giving Turtle implementers and users
a minimum bar beyond the over-liberal IRIREF terminal, and I'm
interested in suggestions which could help interoperability. The
current spec defines IRIs with a normative reference to RFC3987, which
provides a grammar starting with:

   IRI            = scheme ":" ihier-part [ "?" iquery ]
                         [ "#" ifragment ]

The Turtle grammar is designed to be LL(1) and LALR(1), while the IRI
grammar is not.
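
As a rough illustration of the kind of minimal check being asked for
(a sketch only, not spec text): the absolute/relative distinction can
be made from the scheme production alone, plus the RFC3986 rule that
the first path segment of a relative reference must not contain a
colon. The function name classify and its three labels are made up for
this example, and nothing past the scheme prefix is validated.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# (RFC 3986 production, reused unchanged by RFC 3987)
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def classify(ref: str) -> str:
    """Label the content of an IRIREF as 'absolute', 'relative' or 'invalid'.

    Hypothetical helper: it only looks at the scheme prefix and the first
    path segment; the rest of the string is not checked at all.
    """
    if SCHEME_PREFIX.match(ref):
        return "absolute"
    # No scheme. A relative reference may not have a colon in its first
    # path segment (path-noscheme), since that would look like a scheme.
    first_segment = re.split(r"[/?#]", ref, maxsplit=1)[0]
    if ":" in first_segment:
        return "invalid"
    return "relative"
```

With the terms from Wouter's example [1], classify("p:p") gives
"absolute", while classify("_:s") gives "invalid": "_" cannot start a
scheme, and the colon rules it out as a relative reference.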

The other challenge to deal with is that RDF is defined in terms of
IRIs and it may be awkward to specify a less stringent grammar. We'd
essentially be saying "it MUST resolve to be an absolute IRI but it
MUST MUST MUST conform to this production". We could use conformance
levels for that but I don't think the community would be too happy to
have multiple classes of RDF.

While Turtle (and other specs) answer the question of what is legal,
they rarely (apart from HTML, which probably impacts RDFa parsing) go
into how to handle illegal text.

I can't say exactly how it would get incorporated into the Turtle spec
(there's no current charter to update it), but I'm certainly
interested in ideas for how to set conformance expectations. This
could possibly be entered in the errata. Another approach is to supply
additional tests that more fully test the boundaries of RFC3987.


> As to your second point, I have not often seen an RDF ingestion
> pipeline in which the output of a Turtle parser is handed over to
> another component that performs IRI validation.  From an architectural
> viewpoint, such a setup also does not seem to make that much sense,
> since the Turtle parser may resolve invalid IRIs that it deems relative to
> valid absolute IRIs.  For example, Rapper parses [1], which contains an
> invalid IRI as a subject term, into [2], which contains only valid
> IRIs.  An IRI validator that takes the output from the Rapper parser
> will say that [2] is valid, but the original input [1] is not valid.
> 
>   [1] base <https://example.org/a/>
>       <_:s> <p:p> <o:o> .
>   [2] <https://example.org/a/_:s> <p:p> <o:o> .
> 
> Of course, Rapper could be updated to not resolve the subject term in
> [1], but that would violate the sequential approach too, since it would
> require (partially) implementing the RFC grammars twice: once for the
> Turtle parser, and once for the IRI validator.

I believe the architecture that Andy was alluding to is not using
pipelines but instead using libraries. For instance, many RDF parsers
use some library for creating and manipulating IRIs. In the process,
they have to trap exceptions that arise when IRIs are not well-formed.
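
A minimal sketch of that pattern, using Python's urllib.parse as a
stand-in for an IRI library. This is an assumption for illustration:
urllib.parse is far more lenient than RFC3987 and only rejects a few
malformed inputs, such as an unbalanced IPv6 bracket. The names BadIRI
and resolve_iriref are hypothetical and do not come from any real
Turtle parser.

```python
from urllib.parse import urljoin


class BadIRI(Exception):
    """Hypothetical parser-level error reported to the user."""


def resolve_iriref(text: str, base: str, line: int) -> str:
    """Resolve the content of an IRIREF against the base IRI.

    The library call is wrapped so that its exception is turned into a
    parser error carrying the source line number.
    """
    try:
        return urljoin(base, text)
    except ValueError as err:
        # The library refused the string; surface it as a parse error.
        raise BadIRI(f"line {line}: <{text}>: {err}") from err
```

A parser would call resolve_iriref for each IRIREF token it reads and
let BadIRI propagate (or downgrade it to a warning), so the IRI grammar
lives in the library and is not implemented a second time in the
parser.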


> ---
> Cheers,
> Wouter.
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Sunday, 4 March 2018 21:08:11 UTC