Re: Comments regarding "Turtle and N-Triples Syntaxes for RDF"

On Tue, Jul 10, 2012 at 7:50 PM, Gavin Carothers <gavin@carothers.name> wrote:
> On Sat, May 19, 2012 at 11:22 AM, Gregory Williams
> <greg@evilfunhouse.com> wrote:
>> Gavin mentioned on #swig the other day that the Turtle/N-Triples document is heading for LC, and was soliciting feedback. I read through the document and think it improves on the previous turtle and n-triples documents by providing a lot of nice detail and examples. I've included comments per-section below.
>
> My apologies for not replying to this email sooner. It seems to have
> gotten lost in the shuffle around N-Triples vs. Turtle; I threw it
> in the N-Triples feedback block to work on after getting Turtle to LC.
> Very sorry!

And I had this on my guilt list; many thanks to Gavin for removing it from there.


>> === 1 Introduction
>>
>> "N-Triples is a sub-language of Turtle intended for machines."
>> Isn't Turtle "intended for machines," too? The introduction should provide a description of the relative benefits of each format.
>
> N-Triples is no longer part of the same document as Turtle.
>
>>
>>
>> "The Turtle grammar for triples is a subset of the SPARQL Query Language for RDF [RDF-SPARQL-QUERY] grammar for TriplesBlock."
>> The link to SPARQL is to the (1.0) REC version, but the grammar link is to the (1.1) LC version. These should be consistent.
>
> These are now consistent in the LC Draft. ... I hope.
>
>>
>>
>> "Comments in either language may be given after a # that is not part of another lexical token and continue to the end of the line."
>> The octothorp is bare, but colored orange (in my browser). In similar descriptions later in the document, turtle characters/tokens are not always colored, and sometimes quoted (with both single and double quotes). Such cases should be made consistent where possible (my preference would be both colored and double quoted, except in situations where the thing being quoted contains double quotes).
>
> The majority of tokens were already single quoted; the remaining
> tokens have been wrapped in single quotes except in areas where it
> would not be clear what was going on. For example, in Quoted Literals
> additional quotes of either the single or double variety were not
> added, as they did not increase readability and just made everything
> more confusing. http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#turtle-literals
> Changed in new Editors Draft
>
>>
>>
>> === 2.2 Predicate Lists
>>
>> "This expresses a series of RDF Triples with that subject ***and a*** each predicate and object allocated to one triple."
>> Typo.
>
> Fixed in Editors Draft
>
>>
>>
>> === 2.3 Object Lists
>>
>> "This expresses a series of RDF Triples with that subject and predicate ***and a each*** object allocated to one triple."
>> Typo.
>
> Fixed in Editors Draft
>
>>
>>
>> === 3.1.1 Prefixed Names in Turtle
>>
>> "A prefixed name is a prefix label and a local part, separated by a colon ":"."
>> I would find this a lot easier to read if the first sentence of this section instead explained that a prefixed name is a shortcut syntax for expressing an IRI.
>
> Prefix names are now simply part of the IRI section in general, other
> language around this was changed as well. I believe there should be
> less confusion about what prefix names are for now.
> http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-iri
>
>>
>>
>> "* reserved character escape sequences, e.g. wgs:lat\-long"
>> Can't dashes be used unescaped in the local part of a prefix name? I think this example would be better if it used a character that required escaping.
>
> Yes :( This is a poor example. A new example is needed. With the
> addition of ':' to the allowed characters in local parts, I no longer
> have any real-world examples easily at hand for using \ escaping. Will
> try and find something.

[[
  PREFIX user: <http://service.example/user?>
  user:email\=gavin\@carothers.name foaf:mbox <mailto:gavin@carothers.name> .
]]

Here's a silly one-triple document:
[[
  _:g foaf:mbox <mailto:gavin@carothers.name> .
  PREFIX mailto: <mailto:>
  _:g foaf:mbox mailto:gavin\@carothers.name .
]]
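
To make the effect of those escapes concrete, here is a minimal sketch
(not taken from any Turtle implementation) of how a prefixed name with
reserved character escapes might expand to an IRI. The function name
expand_pname and the prefix table are made up for illustration; the
escapable characters are the ones I understand PN_LOCAL_ESC to allow.

```python
import re

# Characters that Turtle's PN_LOCAL_ESC production allows after a backslash
# (assumed list; check it against the grammar before relying on it).
RESERVED = "_~.-!$&'()*+,;=/?#@%"
_ESCAPE = re.compile(r"\\([" + re.escape(RESERVED) + r"])")


def expand_pname(pname, prefixes):
    """Expand 'prefix:local' to an IRI string, dropping the escaping backslashes."""
    # Split at the first colon only; the local part may contain further colons.
    prefix, _, local = pname.partition(":")
    return prefixes[prefix] + _ESCAPE.sub(r"\1", local)


# Hypothetical prefix table matching the two examples above.
prefixes = {
    "user": "http://service.example/user?",
    "mailto": "mailto:",
}
print(expand_pname(r"user:email\=gavin\@carothers.name", prefixes))
print(expand_pname(r"mailto:gavin\@carothers.name", prefixes))
```

Under those assumptions the second call yields the same IRI as
<mailto:gavin@carothers.name>, which is why the document above holds
only one triple.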

I tried to fish a dbpedia identifier for the famous gaba(b) receptor
but this timed out:
  select distinct ?o where {_:s ?p ?o FILTER regex(?o, "gaba")} LIMIT 100


>> === 3.1.2 Relative IRIs
>>
>> "The "Retrieval URI" identified in 5.1.3, Base "URI from the Retrieval URI", is the URL from which a particular SPARQL query was retrieved."
>> Is the reference to SPARQL here just a copy-paste error from the SPARQL Query document?
>
> This and a few other bits lifted directly from SPARQL have been
> removed or reworked in the LC Draft.
>
>>
>>
>> === 3.2 RDF Literals
>>
>> Given that the new turtle allows language tags and unicode escapes in mixed case, is there a suggested canonical form? If not, please define one, and consider making the use of the canonical form a 'SHOULD' for serializers.
>
> In theory there is no need for Turtle to specify this as all language
> tags have a canonical form according to RFC 5646 (In fact they don't
> have any other forms). In reality folks do what they want. It is
> likely this will come up as part of N-Triples and the attempt to
> create a canonical form for it, I think Turtle can leave this alone.
>
>>
>>
>> "If there is no language tag, there may be a datatype IRI, preceeded by ^^."
>> The link anchor for "datatype IRI" doesn't exist in the linked-to document.
>
> Ugh, this bug is still in the LC Draft, and the link checkers didn't
> catch it :( Thanks.

I see that fixed in LC
<http://www.w3.org/TR/2012/WD-turtle-20120710/#turtle-literals>.
It points to <http://www.w3.org/TR/2012/WD-rdf11-concepts-20120605/#dfn-datatype>,
but may have a better link in the editor's draft of Concepts.


>> === 3.2.1 Other Lexical Representations in Turtle
>>
>> "* Literals delimited by """, which permit up to two "s, as well as \r and \n."
>> "* Literals delimited by ''', which permit up to two 's, as well as \r and \n."
>> While it's implied by context, it would be helpful if this text were more explicit about the permission of the quoting characters (e.g. it's about permitting up to two *consecutive* quote characters in the lexical form).
>
> This whole section was rewritten for clarity.
>
>>
>>
>> === 3.2.3 Representing Booleans in Turtle
>>
>> "Boolean values may be written as either true or false (case-sensitive) and represent RDF literals with the datatype xsd:boolean."
>> Since xsd:boolean has four valid lexical forms, it would be helpful to clarify that the lexical value of the resulting literal is the same as the boolean keyword used.
>
> "The literal has a lexical form of the "true" or "false", depending on
> which matched the input, and a datatype of xsd:boolean."
>
> Specifics of the lexical form are specified in the Normative Text.
>
>>
>>
>> === 3.3 RDF Blank Nodes
>>
>> "RDF blank nodes in Turtle are expressed as _: followed by a blank node label which is a series of name characters."
>> This isn't completely true, as the very next (sub-)section explains the use of [] for blank nodes. This section would be clearer if 3.3 introduced the two blank-node forms, and two sub-sections provided the details.
>>
>>
>> === 3.3.1 Nesting Unlabeled Blank Nodes in Turtle
>>
>> "In Turtle, fresh RDF blank nodes are also allocated when matching the production blankNodePropertyList and the terminal ANON."
>> I don't find this text and link into the grammar to be particularly helpful. It isn't until the second paragraph, and after an example, that this section even mentions that it is discussing a syntactic form for blank nodes using square brackets.
>
> The sections on blank nodes have been rewritten a number of times. I
> don't think anyone has ever really been all that happy with any of the
> versions. I think this is a complex subject, the current text is at
> least not incorrect (earlier versions were). At this point I think
> we'd really need to see proposed text that was better before trying
> again.
>
>>
>>
>> === 4 Collections in Turtle
>>
>> I think the example in this section would benefit greatly from a side-by-side comparison with the equivalent triples, the style used in the preceding section.
>>
>
> Yes! I in fact have an example that does exactly that ... and for some
> reason known only to my previous self I didn't include it here. Will
> add to Editors Draft.
>
>
>>
>
>> === 5.4 Grammar
>>
>> The following productions are used in the grammar, but are never defined (and seem irrelevant, because the "unsigned" production rules match the signs):
>> INTEGER_POSITIVE
>> INTEGER_NEGATIVE
>> DECIMAL_POSITIVE
>> DOUBLE_POSITIVE
>> DECIMAL_NEGATIVE
>> DOUBLE_NEGATIVE
>
> Long gone from LC Draft.
>
>>
>>
>> === 6 Parsing
>>
>> "Some productions change the parser state (base or prefix declarations)."
>> Since other productions change the parser state beyond base and prefix declarations, the parenthetical should indicate that the list isn't inclusive (perhaps with an "e.g.").

How about "Grammar productions change the parser state and emit triples"?


>> === 6.1 Parser State
>>
>> "Parsing Turtle requires a state of four items:"
>> This is followed by a list of *five* state items.
>>
>>
>> "RDF_Term curSubject"
>> "RDF_Term curPredicate"
>> Section 6.3 uses language such as "[record] the curSubject and curPredicate" and "[restore] curSubject and curPredicate". This sounds to me like the parser state for curSubject and curPredicate actually involve two stacks of RDF terms, not just two scalar RDF terms. I think the description of parsing would be clearer if this were made explicit, instead of hiding parsing complexity behind words like "record" and "restore".


I believe that's required.
<http://www.w3.org/TR/2012/WD-turtle-20120710/#sec-parsing-example>
shows an example where parsing "[ :mbox <mailto:timbl@w3.org> ]"
requires a stack to save and restore the current subject and
predicate.
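
As a rough sketch of what "record" and "restore" could look like with
an explicit stack (this is not the spec's algorithm, just an
illustration): the State class, emit_object, and the use of a Python
list to stand in for a '[ ... ]' property list are all invented here,
and :someone / :knows are placeholder terms.

```python
import itertools


class State:
    """Hypothetical parser state: two current terms plus an explicit save stack."""

    def __init__(self):
        self.cur_subject = None
        self.cur_predicate = None
        self.saved = []      # stack of (cur_subject, cur_predicate) pairs
        self.triples = []
        self._ids = itertools.count()

    def fresh_bnode(self):
        return "_:b%d" % next(self._ids)


def emit_object(state, obj):
    """Emit a triple for obj; a list stands in for a '[ ... ]' property list."""
    if isinstance(obj, list):
        bnode = state.fresh_bnode()
        # "record": push the enclosing subject and predicate
        state.saved.append((state.cur_subject, state.cur_predicate))
        state.cur_subject = bnode
        for predicate, inner in obj:
            state.cur_predicate = predicate
            emit_object(state, inner)
        # "restore": pop them back once the closing ']' is reached
        state.cur_subject, state.cur_predicate = state.saved.pop()
        obj = bnode
    state.triples.append((state.cur_subject, state.cur_predicate, obj))


state = State()
state.cur_subject = ":someone"      # placeholder subject
state.cur_predicate = ":knows"      # placeholder predicate
emit_object(state, [(":mbox", "<mailto:timbl@w3.org>")])
for triple in state.triples:
    print(triple)
```

Because property lists nest, a single saved pair would be overwritten
by an inner '[ ... ]'; the stack is what lets each level get its own
subject and predicate back.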

-- 
-ericP


> On this section I defer to my handsome co-editor Eric. Though I did
> update the four to five in the LC Draft ;)
>
>>
>> === 11 Turtle in HTML
>>
>> I'm not entirely clear on the value of this section, and believe that it probably doesn't give enough information to safely embed turtle in HTML5. The W3C HTML5 validator, for example, shows that the described technique produces invalid HTML5 when the Turtle includes "</script>" in a literal string.
>
> Now in an appendix, does not claim to be perfect, and in general has
> the issue of trying to standardize before there is real practice in
> this area.
>
>>
>>
>> === 11.1 XHTML
>>
>> "Like JavaScript, Turtle authored for HTML (text/html) can break when used in an XHTML (application/xhtml+xml)."
>> Should this sentence end with "XHTML ***document***"?
>
> Typo: the "an" is unneeded. Might or might not be a document, could
> also be magic DOM cloning etc.
>
>>
>>
>> === 11.3 Parsing Turtle in HTML
>>
>> "THe HTML lang attribute or XHTML xml:lang attribute have no effect on the parsing of the data blocks."
>> Case typo in "THe".
>
> Fixed in LC Draft.
>
>>
>>
>> === 12 N-Triples
>>
>> "These may be seperated by white space (spaces #x20 or tabs #x9)."
>> I assume "these" here refer to the RDF terms, not the triples?
>
> N-Triples all gone from Turtle document. Will attempt to address
> issues before FPWD of N-Triples document.
>
>>
>>
>> === 12.3 Grammar
>>
>> I'm not happy with the change to make N-Triples a unicode format. This change means that tools interacting with N-Triples will have to be unicode aware, and support the \u style of unicode escapes used in N-Triples. This is a big change from the old N-Triples format, where command line tools such as sort/uniq/cut/join could be used to easily parse and perform simple processing of N-Triples data. With the unicode change, this strategy is now much more likely to not work, as a single value now has many equivalent syntactic forms (e.g. "Spïdermann" vs. "Sp\u00EFdermann"). Moreover, even the unicode escapes now have many equivalent forms, as the HEX production in the grammar has been made case insensitive, accepting [0-9A-Fa-f] instead of the old [0-9A-F] (e.g. "Sp\u00EFdermann" vs. "Sp\u00efdermann"). As mentioned above, this is also an issue with case insensitive language tags. Can you provide a pointer to any discussion that occurred in the WG about the reasoning behind this change?
>>
>>
>> No mention is made of comments in the N-Triples grammar section. They are mentioned in the introduction (section 1), used in the N-Triples example in section 12, and as a change from the test cases format (in section 12.2), but there are no specifics given. If N-Triples comment handling is intended to be identical to that of Turtle, this should be stated explicitly.
>>
>>
>> "[1]            ntriplesDoc             ::=     (triple)? (EOL triple)* (EOL)?"
>> This rule seems oddly restrictive. For example, it seems to forbid an N-Triples document with consecutive newline characters. The turtle grammar has a sub-section describing white space handling, but no such section exists for the N-Triples grammar. This makes it tough to know exactly how to interpret this rule.
>
> Lots of stuff, here. In general N-Triples was NOT ready for Last Call.
> Will use as input into the FPWD of N-Triples.
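
For what it's worth, line-oriented tools could still be used if the
data is first passed through a small normalizer. A minimal sketch,
assuming only the \uXXXX and \UXXXXXXXX escape forms discussed above
(decode_uchars is a made-up name, and this is not a proposed canonical
form): it decodes escapes to the characters they name, but leaves
escaped backslashes alone and keeps escapes for characters that cannot
appear raw inside a literal.

```python
import re

# Match an escaped backslash first so that "\\u00EF" is not misread as an escape.
_ESC = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _replace(match):
    hex_digits = match.group(1) or match.group(2)
    if hex_digits is None:
        # An escaped backslash: leave it untouched.
        return match.group(0)
    char = chr(int(hex_digits, 16))
    if char in '"\\\n\r':
        # These cannot appear raw in a literal; keep the escape, upper-cased.
        return match.group(0)[:2] + hex_digits.upper()
    return char


def decode_uchars(line):
    """Rewrite one N-Triples line so equivalent escapes compare equal."""
    return _ESC.sub(_replace, line)


print(decode_uchars(r'"Sp\u00EFdermann"') == decode_uchars(r'"Sp\u00efdermann"'))
```

This does not address case differences in language tags or any other
source of equivalent forms; it only shows that the \u variations can be
collapsed before handing the data to sort/uniq.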
>
>>
>>
>> === 13.3 Turtle compared to SPARQL (Informative)
>>
>> "SPARQL permits variables (?name or $name) in any part of the triple of the form"
>> This sentence trails off. Was there more to it?
>
> No, list items sometimes end in a period and sometimes don't :( Fixed
> in Editors Draft.
>
>
> Thanks for your feedback!
>
> Cheers,
> Gavin
>
>>
>>
>>
>>
>> thanks,
>> gregory williams
>>
>>

Received on Wednesday, 11 July 2012 08:41:45 UTC