summary of rdf-turtle#37

Hi all,

we decided last week to discuss rdf-turtle#37 
<https://github.com/w3c/rdf-turtle/issues/37> in our next meeting, and I 
committed to sending a summary of the discussion to the mailing list.

In rdf-turtle#37 <https://github.com/w3c/rdf-turtle/issues/37>, I 
pointed out that the following text complies with the Turtle grammar (as 
well as those of N-Triples, TriG and N-Quads), but does not represent a 
valid RDF triple:

     <x:s> <x:p> "foo"^^rdf:langString .

More specifically, the object of this triple does not match the 
definition of a literal in RDF-Concepts (which requires a language tag 
when the datatype is rdf:langString).

The scope of the discussion was then broadened to include a number of 
ill-formed terms that are technically allowed by the Turtle grammar, but 
do not correspond to RDF terms as defined by RDF-Concepts.

     "foo"@abcdefghi  # the language tag does not comply with BCP47
     "foo"@en--xyz   # the base direction is not one of 'ltr' or 'rtl'
     <%>             # the text between pointy brackets is not a valid IRI

(NB the first two were also pointed out in rdf-n-triples#33 
<https://github.com/w3c/rdf-n-triples/issues/33>).
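
As a rough illustration (not taken from the specs or from the issues), 
the checks needed beyond the grammar could be sketched in Python as 
below. The function names are hypothetical, the language-tag pattern is 
only a simplified approximation of BCP47 well-formedness, and the IRI 
check only looks for stray '%' characters rather than doing full IRI 
validation.

```python
import re
from typing import List, Optional

RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# Assumption: simplified stand-in for BCP47 well-formedness, not the full ABNF.
# Primary subtag of 2-8 letters, then subtags of 1-8 alphanumerics.
_LANG_TAG = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")

# A '%' that is not followed by two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def literal_problems(
    lexical: str,
    datatype: Optional[str] = None,
    lang: Optional[str] = None,
    direction: Optional[str] = None,
) -> List[str]:
    """Return the reasons why a parsed literal is not a valid RDF literal."""
    problems = []
    if datatype == RDF_LANG_STRING and lang is None:
        problems.append("rdf:langString datatype without a language tag")
    if lang is not None and not _LANG_TAG.fullmatch(lang):
        problems.append("language tag is not well-formed")
    if direction is not None and direction not in ("ltr", "rtl"):
        problems.append("base direction is not 'ltr' or 'rtl'")
    return problems


def iri_problems(iri: str) -> List[str]:
    """Return the reasons why the text between '<' and '>' is not a valid IRI."""
    problems = []
    if _BAD_PERCENT.search(iri):
        problems.append("'%' not followed by two hexadecimal digits")
    return problems
```

With this sketch, each of the examples above yields a non-empty list of 
problems, while "foo"@en--ltr and <x:s> yield an empty one.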

There are good reasons for keeping the grammar 
<https://www.w3.org/TR/rdf12-turtle/#sec-grammar> of Turtle & co. simple 
enough (see here 
<https://github.com/w3c/rdf-n-triples/issues/33#issuecomment-1610353424> 
and here 
<https://github.com/w3c/rdf-n-triples/issues/33#issuecomment-1610354444> 
for more details),
and defer further validation to the description of the parsing process 
<https://www.w3.org/TR/rdf12-turtle/#sec-grammar>.
This is the spirit of PR n-triples#68 
<https://github.com/w3c/rdf-n-triples/pull/68>, which adds some text in 
the "Parsing" section to this effect.

This leaves open the question of how parsers should behave when they 
encounter such "grammatically valid" documents that result in invalid 
RDF terms:

1. stop parsing and raise an error
2. refrain from emitting invalid triples, raise a warning, but continue 
parsing
3. emit triples containing the invalid terms (with a warning)
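
To make the difference between these three behaviours concrete, here is 
a minimal Python sketch. It is not how any existing parser is 
implemented: Policy, InvalidTermError, apply_policy and the is_valid 
callback are all hypothetical names, and triples are represented as 
plain tuples of strings.

```python
import warnings
from enum import Enum
from typing import Callable, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


class Policy(Enum):
    STOP = 1  # option 1: stop parsing and raise an error
    SKIP = 2  # option 2: drop the invalid triple, warn, continue
    EMIT = 3  # option 3: emit the invalid triple, with a warning


class InvalidTermError(ValueError):
    """Raised under Policy.STOP when a triple contains an invalid term."""


def apply_policy(
    triples: Iterable[Triple],
    is_valid: Callable[[Triple], bool],
    policy: Policy,
) -> Iterator[Triple]:
    """Yield grammatically valid triples, handling invalid terms per policy."""
    for triple in triples:
        if is_valid(triple):
            yield triple
            continue
        if policy is Policy.STOP:
            raise InvalidTermError(f"invalid RDF term in {triple!r}")
        warnings.warn(f"invalid RDF term in {triple!r}")
        if policy is Policy.EMIT:
            yield triple
        # Policy.SKIP: nothing is emitted for this triple.
```

Only under option 3 can the output contain triples that are not valid 
with respect to the abstract syntax; under option 2 the output is always 
valid but may be smaller than the input.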

Option 1 is probably not a good idea: such invalid data exists in the 
wild <https://github.com/apache/jena/issues/2555>, and the fact that the 
document matches the grammar is a reason for parsers not to simply stop. 
Note, however, that this is how some parsers currently behave (e.g. 
Oxigraph, in some of the examples above).

Option 2 is what n-triples#68 
<https://github.com/w3c/rdf-n-triples/pull/68> currently proposes.

Option 3 has the advantage of not losing any information compared to the 
source format, and lets the user deal with the possibly invalid data. The 
drawback is that what it produces is then not guaranteed to be compliant 
with the abstract syntax. This is how Jena works -- and Oxigraph, for 
the "foo"^^rdf:langString case.

    best

Received on Monday, 7 July 2025 13:13:01 UTC