normalization issues with Turtle spec tests from Ruben Verborgh on 2013-04-06 (public-rdf-comments@w3.org from April 2013)

From: Ruben Verborgh <ruben.verborgh@ugent.be>
Date: Sat, 6 Apr 2013 22:15:43 +0200
To: public-rdf-comments@w3.org
Cc: Gregg Kellogg <gregg@greggkellogg.net>, Eric Prud'hommeaux <eric@w3.org>, gavin@carothers.name
Message-Id: <88859F9C-F113-40C0-BB52-72E377411154@ugent.be>

Dear all,

I have been working on making my JavaScript streaming Turtle parser node-n3 [1] compatible with the CR spec tests [2].
I’ve come across some issues with normalization that I’d like to have your feedback on.

My current test setup is:
1. parse action file, write as N-triples, send to cwm
2. download correct N-triples result, send to cwm
3. compare both cwm outputs string-wise
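
As a rough illustration of step 3, the sketch below compares two N-Triples outputs string-wise. It is not the actual test harness: the helper names, the example IRIs, and the literal values are hypothetical, and sorting the lines is an assumption standing in for whatever ordering cwm produces.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def ntriples_lines(text: str) -> list[str]:
    """Return the non-empty, non-comment lines, sorted so triple order is irrelevant."""
    stripped = (line.strip() for line in text.splitlines())
    return sorted(line for line in stripped if line and not line.startswith("#"))


def same_output(actual: str, expected: str) -> bool:
    """String-wise comparison: lexical forms must match character for character."""
    return ntriples_lines(actual) == ntriples_lines(expected)


# Hypothetical expected result that keeps the uppercase exponent from the source file.
expected = f'<http://example.org/s> <http://example.org/p> "1E3"^^<{XSD}double> .\n'
# Hypothetical output of a parser that normalizes the exponent to lowercase.
actual = f'<http://example.org/s> <http://example.org/p> "1e3"^^<{XSD}double> .\n'

print(same_output(expected, expected))
print(same_output(actual, expected))
```

Because the comparison is purely textual, two literals that denote the same value but differ in lexical form count as a mismatch.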

With this setup, I’m experiencing the following normalization issues:
- The result of bareword_double is a bit inconvenient because it includes an uppercase E to indicate the double’s exponent,
instead of a lowercase e found in other tests such as turtle-subm-19 and turtle-subm-20.
While this is of course not wrong, it is inconvenient for parsers that normalize the exponent (to either lowercase or uppercase).
If I choose to normalize to uppercase, bareword_double fails. If I choose to normalize to lowercase, turtle-subm-19 and turtle-subm-20 fail.

- The result of positive_numeric includes "+1"^^<http://www.w3.org/2001/XMLSchema#integer>.
Although correct, it is more convenient when normalized to "1"^^<http://www.w3.org/2001/XMLSchema#integer>.

- The result of numeric_with_leading_0 includes "01"^^<http://www.w3.org/2001/XMLSchema#integer>.
Although correct, it is more convenient when normalized to "1"^^<http://www.w3.org/2001/XMLSchema#integer>.
(In that case, the result could be shared with positive_numeric.)

- The result of turtle-subm-11 includes leading zeros on two lines, although the test is called “decimal integer canonicalization”.
Given that name, I’d expect canonicalization to be applied and the leading zeros to be removed.

- The Turtle draft spec part about quoted literals [3] points to the RDF 1.1 Concepts and Abstract Syntax [4],
which says that the language tag must be normalized to lowercase.
However, this normalization does not happen in the result of “langtagged_LONG_with_subtag”, which uses @en-UK.
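
The kinds of normalization mentioned in the list above could be sketched as follows. This is only an illustration under stated assumptions, not node-n3's code: the function names are hypothetical, and the choice of lowercase for the exponent is arbitrary.

```python
import re


def normalize_integer(lexical: str) -> str:
    """Drop a leading '+' and leading zeros, e.g. '+1' -> '1' and '01' -> '1'."""
    match = re.fullmatch(r"([+-]?)0*([0-9]+)", lexical)
    if match is None:
        raise ValueError(f"not an integer lexical form: {lexical!r}")
    sign, digits = match.groups()
    if digits == "0":
        return "0"  # zero carries no sign
    return ("-" if sign == "-" else "") + digits


def normalize_exponent(lexical: str) -> str:
    """Lowercase the exponent marker of a double; lowercase is an arbitrary choice here."""
    return lexical.replace("E", "e")


def normalize_language(tag: str) -> str:
    """Lowercase a language tag, e.g. 'en-UK' -> 'en-uk'."""
    return tag.lower()


print(normalize_integer("+1"), normalize_integer("01"))
print(normalize_exponent("1E3"))
print(normalize_language("en-UK"))
```

With these rules, the "+1" and "01" integers map to the same lexical form, which is why the two results could be shared.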

Therefore, I wonder:
- Would it be meaningful to change the test results to make them use normalization?
- If not, are there any suggestions to change my test setup?

Right now, the tests are difficult for parsers that apply normalization,
i.e., you are forced to remember the original lexical form of each literal to get correct results.
This is probably not desirable.

Best regards,

Ruben

PS I expect to have passing EARL reports soon.

[1] https://github.com/RubenVerborgh/node-n3/tree/cr-spec
[2] http://lists.w3.org/Archives/Public/public-rdf-comments/2013Feb/0037.html
[3] http://www.w3.org/TR/turtle/#turtle-literals
[4] http://www.w3.org/TR/2012/WD-rdf11-concepts-20120605/#dfn-literal

Received on Saturday, 6 April 2013 20:16:24 UTC