Comments regarding "Turtle and N-Triples Syntaxes for RDF" from Gregory Williams on 2012-05-19 (public-rdf-comments@w3.org from May 2012)

From: Gregory Williams <greg@evilfunhouse.com>
Date: Sat, 19 May 2012 14:22:14 -0400
To: public-rdf-comments@w3.org
Message-Id: <638F9B10-D520-4D14-B6A8-B4FF4CB5FE85@evilfunhouse.com>

Gavin mentioned on #swig the other day that the Turtle/N-Triples document is heading for LC, and was soliciting feedback. I read through the document and think it improves on the previous turtle and n-triples documents by providing a lot of nice detail and examples. I've included comments per-section below.

=== 1 Introduction

"N-Triples is a sub-language of Turtle intended for machines."
Isn't Turtle "intended for machines," too? The introduction should provide a description of the relative benefits of each format.

"The Turtle grammar for triples is a subset of the SPARQL Query Language for RDF [RDF-SPARQL-QUERY] grammar for TriplesBlock."
The link to SPARQL is to the (1.0) REC version, but the grammar link is to the (1.1) LC version. These should be consistent.

"Comments in either language may be given after a # that is not part of another lexical token and continue to the end of the line."
The octothorp is bare, but colored orange (in my browser). In similar descriptions later in the document, turtle characters/tokens are not always colored, and sometimes quoted (with both single and double quotes). Such cases should be made consistent where possible (my preference would be both colored and double quoted, except in situations where the thing being quoted contains double quotes).

=== 2.2 Predicate Lists

"This expresses a series of RDF Triples with that subject ***and a*** each predicate and object allocated to one triple."
Typo.

=== 2.3 Object Lists

"This expresses a series of RDF Triples with that subject and predicate ***and a each*** object allocated to one triple."
Typo.

=== 3.1.1 Prefixed Names in Turtle

"A prefixed name is a prefix label and a local part, separated by a colon ":"."
I would find this a lot easier to read if the first sentence of this section instead explained that a prefixed name is a shortcut syntax for expressing an IRI.

"* reserved character escape sequences, e.g. wgs:lat\-long"
Can't dashes be used unescaped in the local part of a prefixed name? I think this example would be better if it used a character that required escaping.

=== 3.1.2 Relative IRIs

"The "Retrieval URI" identified in 5.1.3, Base "URI from the Retrieval URI", is the URL from which a particular SPARQL query was retrieved."
Is the reference to SPARQL here just a copy-paste error from the SPARQL Query document?

=== 3.2 RDF Literals

Given that the new turtle allows language tags and unicode escapes in mixed case, is there a suggested canonical form? If not, please define one, and consider making the use of the canonical form a 'SHOULD' for serializers.

"If there is no language tag, there may be a datatype IRI, preceeded by ^^."
The link anchor for "datatype IRI" doesn't exist in the linked-to document.

=== 3.2.1 Other Lexical Representations in Turtle

"* Literals delimited by """, which permit up to two "s, as well as \r and \n."
"* Literals delimited by ''', which permit up to two 's, as well as \r and \n."
While it's implied by context, it would be helpful if this text were more explicit about what is permitted for the quoting characters (i.e. that up to two *consecutive* quote characters may appear in the lexical form).
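
To make the "consecutive" reading concrete, here is a rough Python sketch. The regular expression is my own simplification of how I read the long-string production (it ignores the details of which escapes are legal), and the name LONG_DOUBLE_QUOTED is made up for this example.

```python
import re

# Simplified model of a """-delimited literal: each content character may be
# preceded by at most two quote characters, so a run of three quotes can only
# be the closing delimiter. Any backslash escape is accepted (an assumption).
LONG_DOUBLE_QUOTED = re.compile(r'"""(?:(?:"|"")?(?:[^"\\]|\\.))*"""', re.DOTALL)

def is_long_literal(text):
    """Return True if text is a single complete long literal under this model."""
    return LONG_DOUBLE_QUOTED.fullmatch(text) is not None
```

Under this model, a literal containing ""hi"" is accepted, while three quotes in a row inside the content are read as the end of the literal.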

=== 3.2.3 Representing Booleans in Turtle

"Boolean values may be written as either true or false (case-sensitive) and represent RDF literals with the datatype xsd:boolean."
Since xsd:boolean has four valid lexical forms, it would be helpful to clarify that the lexical value of the resulting literal is the same as the boolean keyword used.
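
A small Python sketch of the reading I would expect the text to state. The helper name is hypothetical; the four lexical forms are those of xsd:boolean.

```python
# Lexical forms that XML Schema allows for xsd:boolean, mapped to their values.
XSD_BOOLEAN_LEXICAL = {"true": True, "false": False, "1": True, "0": False}

XSD_BOOLEAN = "http://www.w3.org/2001/XMLSchema#boolean"

def boolean_keyword_to_literal(keyword):
    """Hypothetical helper: map a Turtle boolean keyword to (lexical form, datatype IRI)."""
    if keyword not in ("true", "false"):  # the keywords are case-sensitive
        raise ValueError(f"not a Turtle boolean keyword: {keyword!r}")
    # Assumed reading: the lexical form is the keyword itself, never "1" or "0",
    # even though those denote the same values.
    return (keyword, XSD_BOOLEAN)
```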

=== 3.3 RDF Blank Nodes

"RDF blank nodes in Turtle are expressed as _: followed by a blank node label which is a series of name characters."
This isn't completely true, as the very next (sub-)section explains the use of [] for blank nodes. This section would be clearer if 3.3 introduced the two blank-node forms, and two sub-sections provided the details.

=== 3.3.1 Nesting Unlabeled Blank Nodes in Turtle

"In Turtle, fresh RDF blank nodes are also allocated when matching the production blankNodePropertyList and the terminal ANON."
I don't find this text and link into the grammar to be particularly helpful. It isn't until the second paragraph, and after an example, that this section even mentions that it is discussing a syntactic form for blank nodes using square brackets.

=== 4 Collections in Turtle

I think the example in this section would benefit greatly from a side-by-side comparison with the equivalent triples, the style used in the preceding section.

=== 5.4 Grammar

The following productions are used in the grammar, but are never defined (and seem irrelevant, because the "unsigned" production rules match the signs):
INTEGER_POSITIVE
INTEGER_NEGATIVE
DECIMAL_POSITIVE
DOUBLE_POSITIVE
DECIMAL_NEGATIVE
DOUBLE_NEGATIVE
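
To show why the signed productions look redundant, here is a Python sketch. The two patterns are my assumption about the shape of the "unsigned" terminals (an optional sign followed by digits), not a copy of the grammar.

```python
import re

# Assumed shapes of the "unsigned" terminals: the optional sign is already
# part of the pattern, so separate *_POSITIVE / *_NEGATIVE rules add nothing.
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")

def matches(pattern, text):
    """Return True if the whole of text matches the given terminal."""
    return pattern.fullmatch(text) is not None
```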

=== 6 Parsing

"Some productions change the parser state (base or prefix declarations)."
Since other productions change the parser state beyond base and prefix declarations, the parenthetical should indicate that the list isn't exhaustive (perhaps with an "e.g.").

=== 6.1 Parser State

"Parsing Turtle requires a state of four items:"
This is followed by a list of *five* state items.

"RDF_Term curSubject"
"RDF_Term curPredicate"
Section 6.3 uses language such as "[record] the curSubject and curPredicate" and "[restore] curSubject and curPredicate". This sounds to me like the parser state for curSubject and curPredicate actually involves two stacks of RDF terms, not just two scalar RDF terms. I think the description of parsing would be clearer if this were made explicit, instead of hiding parsing complexity behind words like "record" and "restore".
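
A Python sketch of what I mean by making the stack explicit. The class, its method names, and the blank node labels are all hypothetical; I use one stack of (subject, predicate) pairs, which is equivalent to two parallel stacks.

```python
class ParserState:
    """Hypothetical sketch of curSubject/curPredicate backed by an explicit stack."""

    def __init__(self):
        self.cur_subject = None
        self.cur_predicate = None
        self._saved = []   # stack of (subject, predicate) pairs
        self.triples = []

    def record(self):
        # "record the curSubject and curPredicate" becomes a push
        self._saved.append((self.cur_subject, self.cur_predicate))

    def restore(self):
        # "restore curSubject and curPredicate" becomes a pop
        self.cur_subject, self.cur_predicate = self._saved.pop()

    def emit(self, obj):
        self.triples.append((self.cur_subject, self.cur_predicate, obj))


def demo():
    """Hand-simulate parsing  <s> <p> [ <q> [ <r> "x" ] ] .  with nested blank nodes."""
    state = ParserState()
    state.cur_subject, state.cur_predicate = "<s>", "<p>"
    state.record()                                   # entering the outer [ ... ]
    state.cur_subject, state.cur_predicate = "_:b1", "<q>"
    state.record()                                   # entering the inner [ ... ]
    state.cur_subject, state.cur_predicate = "_:b2", "<r>"
    state.emit('"x"')
    state.restore()                                  # leaving the inner [ ... ]
    state.emit("_:b2")
    state.restore()                                  # leaving the outer [ ... ]
    state.emit("_:b1")
    return state.triples
```

The nesting depth is unbounded, so a single saved pair would not be enough; this is what "record" and "restore" seem to gloss over.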

=== 11 Turtle in HTML

I'm not entirely clear on the value of this section, and believe that it probably doesn't give enough information to safely embed turtle in HTML5. The W3C HTML5 validator, for example, shows that the described technique produces invalid HTML5 when the Turtle includes "</script>" in a literal string.
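
The failure can be sketched in Python as follows. This is a deliberately simplified model of how an HTML parser finds the end of a script element (it ignores the comment-like escaped states of the real tokenizer), and the function name is made up.

```python
import re

# Simplified model: script content ends at the first "</script" that is
# followed by whitespace, "/" or ">", regardless of Turtle quoting.
_SCRIPT_END = re.compile(r"</script(?=[\t\n\f\r />])", re.IGNORECASE)

def script_data_seen_by_html_parser(turtle):
    """Return the part of an embedded Turtle block that survives as script content."""
    # Append the real closing tag so that a match always exists.
    match = _SCRIPT_END.search(turtle + "</script>")
    return turtle[:match.start()]
```

With a literal containing "</script>", the Turtle block is cut off inside the string and the rest of the data leaks into the page.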

=== 11.1 XHTML

"Like JavaScript, Turtle authored for HTML (text/html) can break when used in an XHTML (application/xhtml+xml)."
Should this sentence end with "XHTML ***document***"?

=== 11.3 Parsing Turtle in HTML

"THe HTML lang attribute or XHTML xml:lang attribute have no effect on the parsing of the data blocks."
Case typo in "THe".

=== 12 N-Triples

"These may be seperated by white space (spaces #x20 or tabs #x9)."
I assume "these" here refers to the RDF terms, not the triples?

=== 12.3 Grammar

I'm not happy with the change to make N-Triples a unicode format. This change means that tools interacting with N-Triples will have to be unicode aware, and support the \u style of unicode escapes used in N-Triples. This is a big change from the old N-Triples format, where command line tools such as sort/uniq/cut/join could be used to easily parse and perform simple processing of N-Triples data. With the unicode change, this strategy is now much more likely to not work, as a single value now has many equivalent syntactic forms (e.g. "Spïdermann" vs. "Sp\u00EFdermann"). Moreover, even the unicode escapes now have many equivalent forms, as the HEX production in the grammar has been made case insensitive, accepting [0-9A-Fa-f] instead of the old [0-9A-F] (e.g. "Sp\u00EFdermann" vs. "Sp\u00efdermann"). As mentioned above, this is also an issue with case insensitive language tags. Can you provide a pointer to any discussion that occurred in the WG about the reasoning behind this change?
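
To illustrate the extra work this pushes onto tools, here is a Python sketch of the decoding step needed before two lines can be compared. The function name is hypothetical, and only the \u and \U escapes are handled.

```python
import re

# Matches an escaped backslash (left alone), a \uXXXX escape, or a \UXXXXXXXX escape.
_UCHAR = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def decode_uchar(lexical):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they denote."""
    def repl(match):
        if match.group(0) == "\\\\":
            return match.group(0)  # an escaped backslash is not a Unicode escape
        return chr(int(match.group(1) or match.group(2), 16))
    return _UCHAR.sub(repl, lexical)
```

All three spellings of the example decode to the same string, but a byte-oriented tool such as sort or uniq sees three different values.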

No mention is made of comments in the N-Triples grammar section. They are mentioned in the introduction (section 1), used in the N-Triples example in section 12, and as a change from the test cases format (in section 12.2), but there are no specifics given. If N-Triples comment handling is intended to be identical to that of Turtle, this should be stated explicitly.

"[1] ntriplesDoc ::= (triple)? (EOL triple)* (EOL)?"
This rule seems oddly restrictive. For example, it seems to forbid an N-Triples document with consecutive newline characters. The turtle grammar has a sub-section describing white space handling, but no such section exists for the N-Triples grammar. This makes it tough to know exactly how to interpret this rule.
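
Here is how I am reading the rule, as a Python sketch. Both TRIPLE (any non-empty run of characters on one line) and EOL (exactly one line terminator) are my assumptions standing in for the real productions.

```python
import re

TRIPLE = r"[^\r\n]+"        # stand-in for the triple production (assumption)
EOL = r"(?:\r\n|\r|\n)"     # assumed: exactly one line terminator

# ntriplesDoc ::= (triple)? (EOL triple)* (EOL)?
NTRIPLES_DOC = re.compile(rf"(?:{TRIPLE})?(?:{EOL}{TRIPLE})*(?:{EOL})?")

def accepted(document):
    """Return True if the whole document matches the modelled ntriplesDoc rule."""
    return NTRIPLES_DOC.fullmatch(document) is not None
```

Under these assumptions a document with a blank line between two triples is rejected, which is the restriction I am asking about. If EOL is meant to match a run of line terminators, the rule is fine, but the text should say so.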

=== 13.3 Turtle compared to SPARQL (Informative)

"SPARQL permits variables (?name or $name) in any part of the triple of the form"
This sentence trails off. Was there more to it?

thanks,
gregory williams

Received on Saturday, 19 May 2012 18:22:40 UTC