Re: Turtle parsing from Andy Seaborne on 2012-07-19 (public-rdf-comments@w3.org from July 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 19 Jul 2012 16:15:41 +0100
To: public-rdf-comments@w3.org
Message-ID: <5008249D.90908@epimorphics.com>
(personal reply)

On 19/07/12 15:47, Paul Gearon wrote:
> Hi,
>
> I have some questions and comments about the Turtle parsing grammar
> and current tests. I'm looking at the Working Draft found at:
>    http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html
> so please let me know if I have made a mistake with the appropriate document.
>
>
> - The document makes no statement as to whether numeric literals
> should be represented canonically. Given that these can be represented
> as a raw number (e.g. 2.4 instead of
> "2.4"^^<http://www.w3.org/2001/XMLSchema#decimal>), then I would
> expect the canonical form to be appropriate. I suggest that whether or
> not canonicalization is required be documented.

A parser generates RDF terms, and a literal is a lexical form and a
datatype (and maybe a language tag).  Values do not come into it, and
a parser may not be aware of all datatypes.
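
To make the term-level view concrete, here is a minimal sketch of a
literal as a parser might hand it over.  The class name Literal and its
field names are made up for illustration and are not taken from the
Turtle draft or from any particular parser.

```python
from dataclasses import dataclass
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """Hypothetical RDF literal term: just strings, no interpreted value."""
    lexical: str                 # lexical form exactly as written
    datatype: str                # datatype IRI
    lang: Optional[str] = None   # optional language tag


a = Literal("2.3", XSD + "decimal")
b = Literal("2.30", XSD + "decimal")

# Term equality compares lexical form and datatype, not numeric value,
# so these are two different terms although they denote the same number.
print(a == b)  # False
```

Nothing in this model needs to understand xsd:decimal, which is why it
also works for datatypes the parser has never heard of.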

While I think we ought to encourage a value-centric view of the world, 
and canonicalization is good, sometimes it is necessary to preserve 
non-canonical forms - so the spec should not force it.
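
As a sketch of canonicalization as an optional step after parsing, the
function below rewrites an xsd:decimal lexical form.  It assumes the
XSD 1.0 canonical rules (no leading "+", decimal point required, no
redundant leading or trailing zeros); the function name
canonical_decimal is invented here.  It works on the string directly,
so it places no limit on the number of digits.

```python
import re

# Lexical space of xsd:decimal: optional sign, digits, optional fraction.
_DECIMAL = re.compile(r"([+-])?([0-9]*)(?:\.([0-9]*))?")


def canonical_decimal(lexical: str) -> str:
    """Return a canonical form (XSD 1.0 style, an assumption of this sketch)."""
    m = _DECIMAL.fullmatch(lexical)
    if m is None or not (m.group(2) or m.group(3)):
        raise ValueError(f"not an xsd:decimal lexical form: {lexical!r}")
    sign, whole, frac = m.group(1), m.group(2), m.group(3) or ""
    whole = whole.lstrip("0") or "0"   # keep at least one digit before the point
    frac = frac.rstrip("0") or "0"     # keep at least one digit after the point
    is_zero = whole == "0" and frac == "0"
    return ("-" if sign == "-" and not is_zero else "") + whole + "." + frac


print(canonical_decimal("2.30"))       # 2.3
print(canonical_decimal("+0012.500"))  # 12.5
print(canonical_decimal("2"))          # 2.0
```

A system that wants canonical forms can apply something like this to
the parser's output; one that must preserve the original text simply
does not call it.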

> - The test case test-28 (decimal data type - serializing test) appears
> to support the canonicalization of decimals. However,
> "2.3"^^<http://www.w3.org/2001/XMLSchema#decimal> which is in the
> canonical form is being expanded to
> "2.30"^^<http://www.w3.org/2001/XMLSchema#decimal>, which is not
> canonical.

The tests are the old Turtle tests and haven't been updated.

((If anyone has a comprehensive set of tests for Turtle, I'm sure the
WG will be delighted to incorporate it.))

> - The documentation for xsd:decimal requires a minimum of 18 digits.
> There is also the option of setting a maximum number of digits (this
> must be documented). However, test-28 is making a presumption of only
> 18 digits. This seems inappropriate, though testing up to the 18 digit
> minimum is correct.

Agreed.

The test is wrong - if the lexical form is X characters long, then the
literal has that X-character lexical form, unchanged.

(this has been mentioned before)

> - Test case test-30 contains the following IRI:
>
> <scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F
> !"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F>
>
> This contains all of the characters that IRIREF explicitly disallows
> (except the > character), thereby leading the test to fail:
>    ([^#x00-#x20<>\"{}|^`\] | UCHAR)*

Agreed. This is legacy though.

RIOT fails this test, and test-28.  Both tests reflected assumptions
about the parser setup; they are unpassable now.

> It also appears that UCHAR is allowing a back door for the characters
> #x00-#x20. I expect that this cannot be avoided at the level of the
> grammar, but perhaps it should be documented.

Yes and no :-) -- an IRI still has to be an IRI, so even if it passes
the weak syntax restrictions, all the IRI rules (including
scheme-specific ones) apply, and those can't be captured by a regex.
And many systems choose not to do full IRI checking.
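
The back door can be seen with a small sketch.  The pattern below is a
Python transcription of the IRIREF rule quoted above, taking UCHAR to
be \u plus four hex digits or \U plus eight; the names IRIREF, UCHAR
and unescape are chosen for this example only.

```python
import re

# UCHAR: \uXXXX or \UXXXXXXXX (hex digits)
UCHAR = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"

# IRIREF: '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')


def unescape(body: str) -> str:
    """Replace each UCHAR escape with the character it names."""
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)


raw_space = "<http://example/a b>"          # literal space: rejected
escaped = r"<http://example/a\u0020b>"      # escaped space: accepted

print(IRIREF.fullmatch(raw_space) is None)  # True
print(IRIREF.fullmatch(escaped) is not None)  # True
print(repr(unescape(escaped[1:-1])))        # 'http://example/a b'
```

The token-level rule accepts the escaped form, and only a separate IRI
check run after unescaping would notice the space.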

In SPARQL (1.0), there is a simple regex to allow parsing (no spaces!) 
but it is not intended to guarantee valid IRIs.

	Andy

>
> - Production 160s (NIL) is not used. Is this still needed?
>
> Regards,
> Paul Gearon
>
Received on Thursday, 19 July 2012 15:16:31 UTC