Re: PROPOSED to RESOLVE ISSUE-127 with Canonical N-Triples

On Oct 18, 2013, at 7:50 AM, Gavin Carothers <gavin@carothers.name> wrote:

> Gregory,
> 
> Thank you again for your comments on N-Triples. 
> 
> This is the second formal response to issue http://www.w3.org/2011/rdf-wg/track/issues/127 
> 
> N-Triples was originally created as part of the RDF Test Cases. As such it included:
> 
> N-Triples is an RDF syntax for expressing RDF test cases and defining the correspondence between RDF/XML and the RDF abstract syntax. RDF/XML [RDF-SYNTAX] is the recommended syntax for applications to exchange RDF information.
> 
> It also did not have a distinct media type, and was recommended only for test cases. As such it did not have any internationalization requirements placed on it. Also the world has changed since 2001 when it was decided that N-Triples should be ASCII and not UTF-8.

I'm sympathetic to arguments based on existing REC language, but the fact is that the ship sailed long ago on the "only for test cases" issue.
 
> RDF Test Cases N-Triples requires the following:
> 
> <http://example.org/> <http://example.org/property> "I\u00F1t\u00EBrn\u00E2ti\u00F4n\u00E0liz\u00E6ti\u00F8n" .
> 
> N-Triples REC track allows and recommends:
> 
> <http://example.org/> <http://example.org/property> "Iñtërnâtiônàlizætiøn". 
> 
> While the first was totally acceptable for a test case format, it is not acceptable for use as a widespread data exchange format.

I don't accept this as a blanket statement. N-Triples has seen widespread use since it was defined in the test cases document. It is implemented in just about every semantic web system I can think of. The lack of UTF-8 support does not seem to have prevented it from becoming *the* serialization of choice for bulk RDF access.

> In order to address internationalization concerns and adopt the practice of existing implementations in the wild, N-Triples is now allowed and recommended to be UTF-8 while continuing to support data using \u and \U escapes. In modern systems "Iñtërnâtiônàlizætiøn" is greatly preferred by users for interoperability and ease of use over "I\u00F1t\u00EBrn\u00E2ti\u00F4n\u00E0liz\u00E6ti\u00F8n".

I would tend to agree that users may prefer that, but I think the logical answer is that Turtle is the obvious serialization for that use. One of the major benefits I see in the original N-Triples format is its great simplicity, stemming from the fact that there are essentially no choices to be made when serializing an RDF graph (modulo whitespace concerns, which I've discussed previously).

Beyond these issues, however, the new N-Triples syntax seems to differ from the test-cases version in matters that are not based on internationalization concerns. For example, it allows mixed-case hex digits in escape codes, where the test-cases version requires purely uppercase hex.
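
To make the "no choices" point concrete, here is a minimal Python sketch of how I read the test-cases escaping rules for a literal's lexical form. The function and table names are my own, and the rules encoded are my reading of the RDF Test Cases string-escape table (printable ASCII written directly, five short escapes, everything else as \uXXXX or \UXXXXXXXX with uppercase hex); treat it as an illustration, not a reference implementation.

```python
# Short escapes that, on my reading, test-cases N-Triples uses for these characters.
SHORT_ESCAPES = {"\t": "\\t", "\n": "\\n", "\r": "\\r", '"': '\\"', "\\": "\\\\"}


def escape_test_cases(text: str) -> str:
    """Escape a lexical form in the style of test-cases N-Triples (ASCII-only output)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif 0x20 <= cp <= 0x7E:
            # Printable ASCII is written directly.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Everything else in the BMP, including control characters: \u + 4 uppercase hex digits.
            out.append("\\u%04X" % cp)
        else:
            # Outside the BMP: \U + 8 uppercase hex digits.
            out.append("\\U%08X" % cp)
    return "".join(out)


print(escape_test_cases("I\u00F1t\u00EBrn"))  # prints I\u00F1t\u00EBrn
```

Every input character maps to exactly one output spelling here, which is the property I would like to keep. Once \u00f1, \u00F1, and a directly encoded ñ are all acceptable, a serializer has to pick among them.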

For what it's worth, I think the best path forward is/was Sandro's suggestion of renaming the new (2013) syntaxes (including N-Triples) with new names that would indicate they were all profiles/subsets of the same language:

http://lists.w3.org/Archives/Public/public-rdf-wg/2013Jul/0171.html

> Your comment also touches on requirements for serializers. The N-Triples REC track document places no conformance constraints on a serializer; instead it defines two classes of documents, a "canonical N-Triples document" and an "N-Triples document". Canonical was added specifically to address your comment regarding the need for a recommended way to write down a given triple while also meeting the new requirements around internationalization. At the same time, a serializer that produces Test Cases N-Triples will produce a conforming N-Triples document.

I wasn't suggesting that a test-cases N-Triples serializer would ever produce non-conformant 2013 N-Triples. I think it's obvious, though, that old serializers will be producing non-canonical 2013 N-Triples.

One canonicalization issue I came across is in the different handling of character escapes between test-cases and 2013 N-Triples. STRING_LITERAL_QUOTE seems to allow direct encoding of values that must be escaped in the test-cases version of N-Triples. This includes characters like 0x08 and 0x0C. To be honest, I'm not entirely sure what the rules for producing canonical N-Triples say about characters like 0x08. The rules include:

- Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022, U+005C, U+000A, U+000D) must use ECHAR not UCHAR.
- Characters must be represented directly and not by UCHAR.

0x08 is not "not allowed in STRING_LITERAL_QUOTE", so the first rule shouldn't apply. However, the second rule says nothing about the choice between direct representation and the use of ECHAR (which can encode 0x08 in 2013 N-Triples, but not test-cases N-Triples). Either way, I believe test-cases N-Triples and canonical 2013 N-Triples will differ on the encoding of these characters, which I would think is a bad choice for ASCII characters in general (the domain of test-cases N-Triples) and these characters specifically. Is 2013 N-Triples really meant to allow the direct encoding of the backspace character (or other control characters) in string literals?

> Please reply to public-rdf-comments@w3.org indicating whether this rationale explains the Working Group's decision to allow and recommend the use of UTF-8 for N-Triples.

It explains it, yes, but I continue to strongly disagree with the motivation and object to the specific changes that have been made to N-Triples in the 2013 draft version. I believe Turtle (or perhaps a line-based profile of it) is a better fit for the user-based use cases you cite, and believe the changes to the 2013 draft version remove many of the benefits of the (widely deployed) test-cases version of N-Triples.

thanks,
.greg

Received on Wednesday, 6 November 2013 01:49:59 UTC