Turtle parsing from Paul Gearon on 2012-07-19 (public-rdf-comments@w3.org from July 2012)

From: Paul Gearon <pgearon@revelytix.com>
Date: Thu, 19 Jul 2012 10:47:06 -0400
To: public-rdf-comments@w3.org
Message-ID: <CAOQ8B2FSkiNJDry9E8-7YCaX69MU-BeT+O5iwysCF=SvtWCfkA@mail.gmail.com>

Hi,

I have some questions and comments about the Turtle parsing grammar
and current tests. I'm looking at the Working Draft found at:
  http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html
so please let me know if this is not the appropriate document.


- The document makes no statement as to whether numeric literals
should be represented canonically. Given that these can be written
as a bare number (e.g. 2.4 instead of
"2.4"^^<http://www.w3.org/2001/XMLSchema#decimal>), I would
expect the canonical form to be appropriate. I suggest documenting
whether or not canonicalization is required.

- The test case test-28 (decimal data type - serializing test) appears
to support the canonicalization of decimals. However,
"2.3"^^<http://www.w3.org/2001/XMLSchema#decimal> which is in the
canonical form is being expanded to
"2.30"^^<http://www.w3.org/2001/XMLSchema#decimal>, which is not
canonical.
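
To illustrate the two decimal items above, here is a minimal Python
sketch of canonicalizing an xsd:decimal lexical form. It assumes the
XSD 1.0 canonical representation rules (no leading "+", a required
decimal point, no leading or trailing zeros beyond one digit on each
side of the point). The function name canonical_decimal is
hypothetical and is not taken from the Turtle draft or its test suite.

```python
import re

# Lexical space of xsd:decimal: optional sign, then digits with an
# optional fractional part, or a bare fractional part such as ".5".
_DECIMAL = re.compile(r'([+-]?)([0-9]*)(?:\.([0-9]*))?')


def canonical_decimal(lexical: str) -> str:
    """Return the canonical form of an xsd:decimal lexical value.

    Works on the string directly, so the number of digits is not
    limited by any fixed-precision numeric type.
    """
    match = _DECIMAL.fullmatch(lexical.strip())
    if match is None:
        raise ValueError(f"not an xsd:decimal: {lexical!r}")
    sign, whole, frac = match.group(1), match.group(2), match.group(3) or ""
    if not whole and not frac:
        # Rejects "", "+", "-" and "." which contain no digits at all.
        raise ValueError(f"not an xsd:decimal: {lexical!r}")

    whole = whole.lstrip("0") or "0"   # keep one digit left of the point
    frac = frac.rstrip("0") or "0"     # keep one digit right of the point

    # "+" is never written, and zero carries no sign.
    if sign == "+" or (whole == "0" and frac == "0"):
        sign = ""
    return f"{sign}{whole}.{frac}"


print(canonical_decimal("2.30"))      # 2.3
print(canonical_decimal("2.3"))       # 2.3
print(canonical_decimal("2"))         # 2.0
print(canonical_decimal("+001.500"))  # 1.5
```

Under these assumed rules "2.3" is already canonical, and "2.30"
canonicalizes to "2.3" rather than the other way around.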

- The XML Schema definition of xsd:decimal requires processors to
support a minimum of 18 decimal digits. A processor may also set a
maximum number of digits (which must then be documented). However,
test-28 presumes that exactly 18 digits are supported. This seems
inappropriate, though testing up to the 18 digit minimum is correct.

- Test case test-30 contains the following IRI:

<scheme:\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000B\u000C\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F
!"#$%&'()*+,-./0123456789:/<=\u003E?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007F>

This contains all of the characters that IRIREF explicitly disallows
(except the > character), so the test fails against the production:
  ([^#x00-#x20<>\"{}|^`\] | UCHAR)*

It also appears that UCHAR is allowing a back door for the characters
#x00-#x20. I expect that this cannot be avoided at the level of the
grammar, but perhaps it should be documented.
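
Both observations about IRIREF can be reproduced with a small Python
sketch. The regular expressions below are a transcription of the
IRIREF and UCHAR productions as quoted above, and the helper names
(matches_iriref, disallowed_after_decoding) are hypothetical. This is
an illustration of the grammar only, not of any particular parser.

```python
import re

# UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'

# IRIREF ::= '<' ([^#x00-#x20<>\"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

# The same excluded set, used to inspect an IRI after unescaping.
EXCLUDED = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def matches_iriref(token: str) -> bool:
    """True if the whole token matches the IRIREF production."""
    return IRIREF.fullmatch(token) is not None


def disallowed_after_decoding(token: str) -> list:
    """Decode UCHAR escapes in the IRI body and list any characters
    that the production would have rejected if written literally."""
    body = token[1:-1]  # drop the enclosing '<' and '>'
    decoded = re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), body)
    return EXCLUDED.findall(decoded)


# A literal space, or a "\t" style escape, does not match IRIREF.
print(matches_iriref('<scheme:a b>'))    # False
print(matches_iriref(r'<scheme:\t>'))    # False

# The same space written as a UCHAR escape is accepted by the grammar,
# yet decodes to an excluded character: the "back door".
print(matches_iriref(r'<scheme:\u0020>'))             # True
print(disallowed_after_decoding(r'<scheme:\u0020>'))  # [' ']
```

A check like disallowed_after_decoding would have to run after
tokenizing, which is why this seems hard to express in the grammar
itself.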

- Production 160s (NIL) is not used. Is it still needed?

Regards,
Paul Gearon

Received on Thursday, 19 July 2012 14:47:39 UTC