Re: trying to enforce IRI syntax

On 18 May 2012, at 16:40, Sandro Hawke wrote:
> On Fri, 2012-05-18 at 14:03 +0100, Richard Cyganiak wrote:
>> 1) I will formally object to any notion of fixing-up-data-by-trying-to-guess-their-intent that is applied only to a single RDF syntax.
>> 2) A general fixing-up-data-by-trying-to-guess-their-intent framework for all of RDF is out of scope for this WG.
>> 3) Turtle validator. Think about it.
> 
> Hmmm.  What do other RDF syntaxes do with bad-syntax IRIs?
> 
> Quick check of the W3C RDF/XML validator shows it doesn't even warn about
> an IRI with "|" in it.  

Because it checks for RDF/XML 2004 and this was valid in 2004.

> Nor does rapper (v 2.0.0), so it's not just
> older parsers.  The 2010 RDFa distiller is also fine with this IRI.

Same thing.

> The 2012 one is down right now.
> 
> The RDF/XML spec seems silent on it, deferring to RDF Concepts, which I
> read as allowing it [1].

Yup.

> Specifically, it says the string must conform
> to the URI character syntax AFTER disallowed characters (like "|") are
> percent-encoded.   Possibly I'm reading that wrong.

“|” is not a disallowed character in RDF 2004.

> Oh, I see in your new RDF Concepts draft you have a note about how this
> changed.   So "|" was allowed in 2004 in RDF URIReference terms, but
> won't be allowed in RDF 1.1 IRI terms?

Exactly.

>     Ouch.  

Not really. The character was never allowed in URIs or IRIs anyway.

> - Will RDF/XML and RDFa parsers be required to reject documents
> containing bad-syntax IRIs, eg containing a "|"?

Assuming these syntaxes are updated for RDF 1.1, then those are not conforming RDF/XML or RDFa documents.

Neither spec defines what a parser has to do when faced with non-conforming documents, so it's left to the implementer's wisdom.

A conformance checker (=validator) would have to reject such documents.

>    (That does not seem viable.  I doubt anyone would support that.)
> 
> - so why should Turtle parsers have to do that?

Who said that they should have to do that?

> Actually -- thinking about it some more, Turtle doesn't even reject IRIs
> like this, either.  It just makes you type them in using the \u syntax!
> The spec right now allows an IRI containing control characters, etc.   <
> \u00> is a valid Turtle term.

No, it doesn't. Elsewhere in this thread, Andy pointed to the bit that says the thing has to be an IRI, while noting that this isn't as explicit as it probably should be: it's currently too easy to miss.
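
A minimal sketch of that reading, assuming the check is applied after \u and \U escapes are decoded. The names decode_uchar and is_acceptable are hypothetical, and the character class is the one from the IRIREF regexp quoted further down, so this is only a necessary condition and not a full IRI syntax check:

```python
import re

# \uXXXX and \UXXXXXXXX numeric escapes as written in a Turtle IRIREF.
UCHAR = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

# Characters excluded by the IRIREF regexp quoted in this thread:
# U+0000 to U+0020, plus < > " { } | ^ ` and backslash.
BAD_CHARS = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def decode_uchar(body: str) -> str:
    """Replace numeric escapes with the characters they denote.

    A \\U escape above 10FFFF raises ValueError from chr().
    """
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), body)

def is_acceptable(body: str) -> bool:
    """True if the decoded text between < and > has no excluded character."""
    return BAD_CHARS.search(decode_uchar(body)) is None
```

Under this assumption, writing "|" as \u007C does not get it past the check, because the decoded string still contains "|".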

> So, given what I'm seeing so far, my proposal is:
> 
> 1.  Simplify the Turtle grammar, making the IRIREF production be just:
> "<" [^>]* ">".  Also simplify PNAME, etc, in corresponding ways.
> 
> 2.  Somewhere, probably in 6.2, say IRIs (from both IRIREF and PNAME
> productions, after prefix and base expansion and resolution) SHOULD be
> checked against the IRI syntax rules and SHOULD NOT be emitted by the
> parser or generated by the serializer.  
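
For illustration only, the two-step proposal quoted above (loose lexing, then a separate check) might look roughly like this. The name split_iris is hypothetical, prefix and base resolution are left out, and the character class stands in for a real IRI syntax check:

```python
import re

# Step 1 of the proposal: the simplified IRIREF token, "<" [^>]* ">".
LOOSE_IRIREF = re.compile(r'<([^>]*)>')

# Stand-in for step 2: characters excluded by the current IRIREF regexp.
# A real implementation would check against the IRI RFC syntax rules.
BAD_CHARS = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def split_iris(text: str) -> tuple[list[str], list[str]]:
    """Lex loosely, then sort the token bodies into (accepted, rejected)."""
    accepted, rejected = [], []
    for match in LOOSE_IRIREF.finditer(text):
        iri = match.group(1)
        if BAD_CHARS.search(iri):
            rejected.append(iri)
        else:
            accepted.append(iri)
    return accepted, rejected
```

Run on the example document from the start of the thread, this puts http://example.org/a|b in the rejected list. What a parser then does with that list is exactly the point in dispute.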

This mixes format definition and implementation strategy.

The spec already says (not explicitly enough) that these things MUST be IRIs. Whether this is said in the grammar or with some other text is an editorial detail that I'm not interested in.

-1 to making this a SHOULD. This means that conforming Turtle parsers could generate output that's not a valid RDF graph. As multiple implementers have told you in this thread, this would be a Bad Idea.

In my opinion it is sufficient for the spec to define what is and what isn't valid Turtle, and how to get from valid Turtle to a valid RDF graph. Defining error handling is not the spec's business. It would be especially bizarre to single out one class of errors (IRI syntax problems) and define some sort of error recovery or partial conformance or whatever for them, while saying nothing about much more common error classes, like mixed up punctuation.

> This text should be coordinated
> with rdf-concepts, factoring out stuff that applies to other RDF
> syntaxes as well.  (What's there now [2], especially the last Note, is
> good; the question is who is responsible for enforcing this stuff, and
> how does it affect whether something is a conformant turtle serializer
> or parser.)

Who is responsible for enforcing this is an implementation detail. There are many different workable strategies.

A Turtle parser is a system that turns conforming Turtle files into conforming RDF graphs. I don't think we need to say anything about what it does with non-conforming files. We certainly shouldn't do it normatively.

I'm not sure that we need to define what a Turtle serializer is. It follows trivially from the definitions of Turtle document and Turtle parser (regardless of whether these definitions are explicit or not). We certainly don't need to say what it does with broken RDF graphs.

Best,
Richard



> 
>     -- Sandro
> 
> [1]
> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference
> [2]
> http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#section-IRIs
> 
>> Best,
>> Richard
>> 
>> 
>> On 18 May 2012, at 13:08, Sandro Hawke wrote:
>> 
>>> On Fri, 2012-05-18 at 12:22 +0100, Steve Harris wrote:
>>>> Yes, we've actually had this issue in practice (lots of web URLs are not legal URIs), and you want to find out as early as possible that you have a problem. For us that's when PUTing Turtle to an RDF store.
>>>> 
>>>> - Steve
>>>> 
>>>> On 2012-05-18, at 11:20, Richard Cyganiak wrote:
>>>> 
>>>>> Sandro,
>>>>> 
>>>>> -1 to a “loose Turtle”.
>>>>> 
>>>>> If a conforming Turtle parser were allowed to accept a document containing <http://example.org/a|b>, then what next? This is not a valid IRI. So it is not allowed in an RDF graph. A Turtle parser is rarely a stand-alone system — it's a component in a larger system. Once the Turtle parser tries passing on the pseudo-IRI to the next component, then a number of things can happen:
>>>>> 
>>>>> The next component might reject it outright.
>>>>> 
>>>>> Or the next component accepts it and stores the pseudo-IRI. Then the user can do their thing. Then when the user tries to save their work, the serializer checks IRIs and rejects it, taking down the app with an error message. (This is Jena's default behaviour, or at least was the last time I checked.)
>>>>> 
>>>>> Or maybe the entire system works, except that now we have a situation where certain RDF “graphs” can be loaded and saved in Turtle but not in other syntaxes. This will cause major headaches for users, who will end up messing around with format converters in order to get broken data into a format that doesn't complain about the data being broken.
>>>>> 
>>>>> Or maybe the system accepts the IRI and puts it into its store, but then you can't delete it from the store any more because the SPARQL Update part of the system is stricter and rejects DELETE DATA commands containing broken IRIs.
>>>>> 
>>>>> Given the complexity of RDF-based systems, and the many interacting components and specifications involved, this kind of error handling cannot be introduced for a single syntax. It has to be done centrally so that all involved components and specifications can behave in a consistent way. Defining algorithms for error recovery for broken RDF data may well be a good idea, but I don't think this should be part of a 1.1 update to RDF, and I don't think we are chartered to do it.
>>> 
>>> Yeah, speaking as a standards person, I absolutely agree with you.
>>> 
>>> But I put on my implementer's hat this week and had two problems:
>>> 
>>> 1.  The regexps seemed to be breaking various tools.  But it's very hard
>>> to tell if it's the regexps or the tools, because of how big they are
>>> (and I'm using Javascript which doesn't allow spaces or comments in
>>> regexps).   For example, here's the regexp generated from our grammar
>>> for IRIREF, with all the chars that I was particularly worried about
>>> done as hex escapes:
>>> 
>>> /^(<([^\x00-\x20<>\x5c\x22\x7b\x7d\x7c^`\x5c]|((\\x5cu([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f]))|(\\x5cU([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f])([0-9]|[A-F]|[a-f]))))*>)
>>> 
>>> Actually, looking at it now, I see two errors in it, caused by how jison
>>> composes named regexps containing backslashes.
>>> 
>>> So, obviously I could do this more carefully, writing my own tool to do
>>> macro-composition of regexps, trying to find something better, or fixing
>>> jison.    But still, ouch.
>>> 
>>> 2.  If/when I ship my software, it's pretty clear I'm going to get bug
>>> reports from users about these syntax errors in files that look fine to
>>> them, and/or they don't control but want to be able to read anyway.  I
>>> can tell them, "sorry, check the spec", but, ... another pain point.
>>> 
>>> So, how about this:
>>> 
>>> 1.  We make the Turtle grammar much simpler on this.  For example:
>>>       IRIREF ::= /^<[^ \t\n\r>]*>/ 
>>> 
>>> 2.  We say that Turtle parsers MUST check IRIREFs for conforming to the
>>> IRI spec -- maybe we give the regexp for that, or maybe we just refer
>>> them to the right RFCs.  If an IRI fails the check, the parser MAY
>>> transform the IRI into one which would pass the check, using percent
>>> encoding.  The parser MUST NOT emit RDF triples containing IRIs which do
>>> not syntactically conform to IRI and URI RFCs.    (I'd also be fine with
>>> SHOULD NOT; people might, for instance, know the next stage upstream is
>>> going to do this anyway.)
>>> 
>>> 3.  We say the same sort of thing about Turtle generators; they MAY do
>>> percent-encoding if handed bad stuff; they MUST NOT emit bad IRIs.
>>> 
>>> So, we'd be permitting error recovery -- but very well-defined error
>>> recovery.    (And something the browsers do all the time; you can give
>>> them anything for a URL and they'll percent-encode it as needed.)
>>> 
>>> We'd also be using a standard bit of code -- the IRI checker isn't in
>>> any way Turtle-specific -- instead of making the Turtle lexer super
>>> complicated.
>>> 
>>>  -- Sandro
>>> 
>>>>> Best,
>>>>> Richard
>>>>> 
>>>>> 
>>>>> On 17 May 2012, at 21:35, Sandro Hawke wrote:
>>>>> 
>>>>>> What should/may/must a Turtle parser do with a turtle document like
>>>>>> this:
>>>>>> 
>>>>>> <http://example.org/a> <http://example.org/a> <http://example.org/a|b>.
>>>>>> 
>>>>>> By the grammar, this is not a Turtle document, because of the '|'
>>>>>> character in a URI.   I don't think, however, that people writing Turtle
>>>>>> parsers will want to enforce this.  If they come across some Turtle
>>>>>> document that's got a URI like this -- they can still parse it just
>>>>>> fine, so they probably will.
>>>>>> 
>>>>>> The language tokens like IRIREF and PNAME are defined in the grammar
>>>>>> with these vast regexps (if you macro-expand what's there, now), but
>>>>>> actually much simpler ones will produce the same result in practice --
>>>>>> they'll just tolerate some files that are not, strictly-speaking,
>>>>>> Turtle.  (I'm pretty sure -- maybe there are some corner cases with
>>>>>> missing whitespace where these regexps will give you a different result
>>>>>> than something more like any-character-up-until-a-delimiter.)
>>>>>> 
>>>>>> I'm not sure anything has to change, but I think at very least the
>>>>>> conformance clause should be clear about whether it's okay to accept a
>>>>>> turtle document like my example above.
>>>>>> 
>>>>>> It might be nice to have "strict" and "loose" parsers, especially if we
>>>>>> can define loose parsers in a way that makes them simpler to implement,
>>>>>> run faster, and never parse anything differently from a strict parser.
>>>>>> 
>>>>>> Of course, then I'm not quite sure the point of the strict parsers.
>>>>>> 
>>>>>> -- Sandro

Received on Friday, 18 May 2012 16:27:11 UTC