review of CSV/TSV (ACTION-594)

Below is my review of the CSV/TSV document. I think a few issues need clearing up before publication. My only big issue is that the document specifically says the default encoding for these formats is US-ASCII, but then doesn't discuss the possible need to escape non-ASCII Unicode characters in the serialization. This is especially important for the CSV format, where we rely directly (and only) on the CSV escaping mechanism, which really only covers the escaping of quotes and newlines. The rest of the points are minor/editorial.

thanks,
.greg



Abstract still has an @@. I think we agreed in the last telecon to drop it.

The set of SPARQL 1.1 docs doesn't include the CSV/TSV document.

Some of the example data used in section 1.1 is confusing. Since it's not clear what formatting is being used in the example table, it's not immediately clear what literal value this represents: "String-with-dquote"". From context I assume it's a literal that starts with the character 'S' and ends with a single double-quote character. If this table is meant to be using a Turtle-like encoding (and not a CSV-like encoding), then perhaps that double quote should be backslash-escaped? Or perhaps there should be some text that explains the possibly ambiguous values in the example table.
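
To make the ambiguity concrete, here is a rough sketch of how that literal would look under each of the two encodings the table might be using. It assumes Python's standard csv module for the CSV form; turtle_quote is a hypothetical, simplified helper of mine (it only escapes backslash and double quote), not anything defined in the document.

```python
import csv
import io


def csv_quote(value: str) -> str:
    """Encode one field with the csv module (embedded quotes are doubled)."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([value])
    # Drop the record terminator that writerow() appends.
    return buf.getvalue().rstrip("\r\n")


def turtle_quote(value: str) -> str:
    """Simplified Turtle-like quoting: escape only backslash and double quote."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# The literal I believe the table intends: ends with one double quote.
literal = 'String-with-dquote"'

print(csv_quote(literal))     # "String-with-dquote"""
print(turtle_quote(literal))  # "String-with-dquote\""
```

Neither output matches what the table currently shows, which is why I think the example needs either a different rendering or a sentence of explanation.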

Regarding "Applications reading these formats are advised to cope with both CRLF and LF as end of line markers," should this be using "SHOULD" normative language?

=== Section 3 ===

"the results table is serialized as ... one line for each query solution." I don't think this is true. The CSV spec document does say "Each record is located on a separate line," but also indicates that a CRLF can appear in a double quoted field value:

  "aaa","b CRLF
  bb","ccc" CRLF
  zzz,yyy,xxx

Section 3.2 actually notes this case ("Within quote strings, all characters except ", including new line characters have their exact meaning - newlines do not end a CSV record.")
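
As a quick check of this point, the following sketch feeds the example above to Python's standard csv module (my choice of parser, purely for illustration) and compares the number of physical lines with the number of records.

```python
import csv
import io

# The example from the CSV spec: the second field of the first record
# contains a CRLF inside double quotes.
data = '"aaa","b\r\nbb","ccc"\r\nzzz,yyy,xxx'

# newline="" leaves line endings untranslated so the csv module sees them.
rows = list(csv.reader(io.StringIO(data, newline="")))

physical_lines = len(data.splitlines())

print(physical_lines)  # 3
print(len(rows))       # 2
print(rows[0])         # ['aaa', 'b\r\nbb', 'ccc']
```

Three lines of text, two records: so "one line for each query solution" only holds when no value contains a newline.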

"Values in the results are strings, for URIs and literals, together with numbers when the literals are of numeric XSD datatype." No mention is made of blank nodes.

=== Section 3.1 ===

"Each row has the same number of fields..." Is this meant to say that each row 'MUST' have the same number of fields?

=== Section 3.2 ===

"The entry in each field is the string corresponding to the RDF term value. (c.f. SPARQL STR()) without syntax to denote what kind of term it is. The encoding quoting rules of CSV format must be used." As it's earlier mentioned that the encoding of the CSV file may be US-ASCII, we probably need to mention that simply taking the STR() value and applying CSV escaping may not always be enough to produce a valid CSV file.

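To illustrate the concern in the previous paragraph, here is a minimal sketch, assuming Python's csv module and a made-up value containing both a double quote and a non-ASCII character. The helper name to_csv_line is mine. CSV quoting deals with the quote, but the result still cannot be encoded as US-ASCII.

```python
import csv
import io


def to_csv_line(values):
    """Serialize one record using the csv module's default quoting rules."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(values)
    return buf.getvalue()


# Hypothetical STR() values: the first contains U+00E9 and double quotes.
line = to_csv_line(['caf\u00e9 "au lait"', 'plain'])

# The quotes were escaped by doubling, as CSV requires...
quotes_escaped = '""au lait""' in line

# ...but the non-ASCII character is untouched, so US-ASCII encoding fails.
try:
    line.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(quotes_escaped)  # True
print(ascii_ok)        # False
```

CSV itself has no escape for such characters, so the document would need to say something about the charset (e.g. UTF-8) for results like this one.
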
"((COMMA, code point 44, 0x2C)" has an extra open paren.

=== Section 4.1 ===

"Variables are serialized in SPARQL syntax, using question mark ? character followed by the variable name." Is there a reason we chose to use the '?' in TSV, but not in CSV?

"Each row has the same number of fields..." Again, I think this should probably be using "MUST".

=== Section 4.2 ===

"The SPARQL Results TSV Results Format serializes RDF terms in the results table by using the syntax that SPARQL [RDF-SPARQL-QUERY] [SPARQL11-QUERY] and Turtle [TURTLE] use." Do we need references to both 1.0 and 1.1 versions of SPARQL Query?

"""
literals are enclosed with single quotes "..." or ' ...'
"""
The use of 'single quotes' here, immediately followed by double quotes, is confusing. I assume 'single' is meant to mean either of the quoting forms shown, but not the triple-quote form available in Turtle?

As with CSV, I'm concerned about the use of Unicode in terms when section 2 specifically talks about these formats defaulting to US-ASCII. The TSV encoding at least supports Unicode escaping by default, as it uses Turtle/SPARQL syntax for terms.
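
For what it's worth, here is a rough sketch of what that escaping could look like on the TSV side, using the \uXXXX and \UXXXXXXXX codepoint escapes from Turtle/SPARQL syntax. The function name escape_non_ascii and the sample terms are my own, and this is only one way an implementation might do it.

```python
def escape_non_ascii(term: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    out = []
    for ch in term:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # plain US-ASCII, keep as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)    # Basic Multilingual Plane
        else:
            out.append("\\U%08X" % cp)    # supplementary planes
    return "".join(out)


# Hypothetical term already in Turtle-style syntax.
escaped = escape_non_ascii('"caf\u00e9"@fr')

print(escaped)                   # "caf\u00E9"@fr
print(escaped.encode("ascii"))   # succeeds: the result is pure US-ASCII
```

Nothing comparable is available for CSV, which is why I think the CSV side is the bigger problem.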

Received on Monday, 5 March 2012 16:42:38 UTC