Comments on "URI Fragment Identifiers for the text/csv Media Type" from Richard Cyganiak on 2011-06-03 (public-iri@w3.org from June 2011)

From: Richard Cyganiak <richard.cyganiak@deri.org>
Date: Fri, 03 Jun 2011 15:01:48 -0400
To: Michael Hausenblas <michael.hausenblas@deri.org>, Erik Wilde <dret@berkeley.edu>
Cc: uri@w3.org
Message-Id: <3419F8C1-CA47-442A-AEC6-F20EA93FB330@deri.org>

Hi Erik, hi Michael,

This is a comment on the first draft of “URI Fragment Identifiers for the text/csv Media Type” [1], announced here [2].

Best,
Richard

[1] http://www.ietf.org/id/draft-hausenblas-csv-fragment-00.txt
[2] http://lists.w3.org/Archives/Public/uri/2011Apr/0003.html

Section 2

The draft does not appear to provide a way of addressing the most fundamental part of a CSV file: a cell. I find this confusing, as it seems like a really obvious use case to me. In fact, you say that one use case is “making assertions about a certain value”. How is this possible given the current design?

I guess I'm asking for something like this: #cell:temperature,4 to address the value in the temperature column, row 4.

A less critical but perhaps also interesting feature would be Excel-style cell ranges, such as #cells:temperature,4:temperature,6.
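
To make the cell-addressing suggestion concrete, here is a minimal Python sketch of how a #cell:column,row fragment might be resolved. The #cell syntax is only the proposal made above, not part of the draft; the sample data is made up, and the sketch assumes that a header row is present and that data rows are counted from 1.

```python
import csv
import io

# Hypothetical sample table; the values are made up for illustration.
SAMPLE = """date,temperature,place
2011-01-01,1,Galway
2011-01-02,-1,Galway
2011-01-03,0,Galway
2011-01-01,22,Berkeley
2011-01-02,19,Berkeley
2011-01-03,18,Berkeley
"""


def resolve_cell(csv_text, fragment):
    """Resolve a hypothetical '#cell:<column>,<row>' fragment to one value.

    Assumption: the first line is a header, and data rows count from 1.
    """
    scheme, _, arg = fragment.lstrip("#").partition(":")
    if scheme != "cell":
        raise ValueError("not a cell fragment: " + fragment)
    # Split on the last comma so the column name itself may contain commas.
    column, _, row = arg.rpartition(",")
    row_number = int(row)
    if row_number < 1:
        raise ValueError("row numbers start at 1 in this sketch")
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return rows[row_number - 1][column]


print(resolve_cell(SAMPLE, "#cell:temperature,4"))
```

Whether rows should be counted from 0 or from 1 is exactly the kind of detail the draft would need to pin down.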

Section 2.1

This is quite fuzzy on the question of header detection. As the draft is currently designed, an implementation has to detect whether a header is present or not, otherwise it cannot determine what part of the table exactly is being addressed. So is the header=present thing in the media type the only and canonical way of determining presence of headers?

If that is the case, then what about non-HTTP protocols, e.g., file:///Users/richard/test.csv#row:0?

What does #head address if the media type does not indicate the presence of a header?

(A possible solution might be to make the addressed part independent of the presence of a header. #head would simply address the first row, regardless of whether it's actually a header. Same for #row:0. #col:2 would be place,Galway,Galway,Galway,Berkeley,Berkeley,Berkeley. If the example table had no header, then #col:2 would be Galway,Galway,Galway,Berkeley,Berkeley,Berkeley. And #col:Galway would be the same. And so on.)
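
The header-independent reading described above could be sketched roughly as follows. This is an illustration of the suggestion, not of the draft's actual semantics: the resolve function and the sample data are hypothetical, indexes are assumed to be zero-based, and a non-numeric column name is simply looked up in the first row, whatever that row contains.

```python
import csv
import io

# Hypothetical sample tables; the values are made up for illustration.
WITH_HEADER = """date,temperature,place
2011-01-01,1,Galway
2011-01-02,-1,Galway
2011-01-03,0,Galway
2011-01-01,22,Berkeley
2011-01-02,19,Berkeley
2011-01-03,18,Berkeley
"""
# The same table with its first line removed.
WITHOUT_HEADER = WITH_HEADER.split("\n", 1)[1]


def resolve(csv_text, fragment):
    """Resolve #head, #row:n and #col:x without knowing whether a header exists."""
    table = list(csv.reader(io.StringIO(csv_text)))
    scheme, _, arg = fragment.lstrip("#").partition(":")
    if scheme == "head":
        return table[0]  # always the first row, header or not
    if scheme == "row":
        return table[int(arg)]
    if scheme == "col":
        # A number is a position; anything else is matched against the first row.
        index = int(arg) if arg.isdigit() else table[0].index(arg)
        return [row[index] for row in table]
    raise ValueError("unsupported fragment: " + fragment)


print(resolve(WITH_HEADER, "#col:2"))
print(resolve(WITHOUT_HEADER, "#col:Galway"))
```

With this reading, #head and #row:0 always address the same row, and the result never depends on information that is only available in the media type.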

The first paragraph of 2.1 is poorly written.

Section 2.2

How does the row:n format interact with presence/absence of header? If a header is present, does #row:0 address the same as #head?

A handy feature would be to allow addressing of the last row using #row:-1 (and similarly for the second-to-last row, etc.).

What is addressed by #row:1000 if the table has only 10 rows?
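
One possible answer to the last two points, sketched in Python: negative indexes count from the end, and an index outside the table addresses nothing. The resolve_row helper is hypothetical, and returning None for "nothing" is only one option; the draft could equally well declare such a fragment an error.

```python
def resolve_row(rows, index):
    """Return the row at `index`, or None if the table has no such row.

    Negative indexes count from the end, so -1 is the last row.
    """
    if -len(rows) <= index < len(rows):
        return rows[index]
    return None


# A hypothetical table with 10 rows of two cells each.
table = [["row%d" % n, str(n)] for n in range(10)]

print(resolve_row(table, -1))    # the last row
print(resolve_row(table, 1000))  # beyond the end of the table
```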

What is the use case for the #row:* format? It seems a bit obscure to me and perhaps might better be dropped.

Section 2.3

It appears that the header row, if present, is excluded from #col:xxx addressing. Maybe this can be clarified in the text.

What is addressed by #col:xxx if xxx is neither a number nor a column in the table?

What is addressed by #col:2 if there is a column named "2"?

What is addressed by #col:xxx if no header is present, or if a header is present but not indicated in the media type?

What is addressed by #col:foo if the header contains a duplicate column, like foo,bar,baz,foo?
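
The ambiguities behind these questions can be made visible with a small Python sketch. It does not pick an answer; it only lists every column that a given #col:xxx token could plausibly refer to. The column_candidates helper is hypothetical and assumes zero-based positions.

```python
def column_candidates(header, token):
    """List what '#col:<token>' could mean for a given header row.

    Returns (by_name, by_position): the indexes of all columns whose name
    equals the token, and the token read as a zero-based position
    (None if it is not a valid position).
    """
    by_name = [i for i, name in enumerate(header) if name == token]
    by_position = None
    if token.isdigit() and int(token) < len(header):
        by_position = int(token)
    return by_name, by_position


# Duplicate column name: two candidates by name.
print(column_candidates(["foo", "bar", "baz", "foo"], "foo"))
# A column named "2": the name and the position point at different columns.
print(column_candidates(["1", "2", "3"], "2"))
# Neither a number nor a column name: no candidates at all.
print(column_candidates(["foo", "bar"], "xxx"))
```

A specification would need to say which candidate wins in each of these cases, or that the fragment addresses nothing.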

Section 2.4

I am unconvinced that the slice-based selection is useful as it is described right now. I'd like to understand better what the use case is. Personally, I can see more use cases for selecting entire rows based on a value match, such as this:

#row:name=Alice

I would expect the addressed part to be the entire row, including the value that was used for the match. Excluding the matched column seems a bit strange to me and I just have trouble understanding what the motivation is.
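
As an illustration of the expected behaviour, here is a minimal Python sketch of value-based row selection that returns entire rows, matched column included. The #row:name=value syntax is only the suggestion made above, and the select_rows function and sample data are hypothetical.

```python
import csv
import io

# Hypothetical sample table; the values are made up for illustration.
PEOPLE = """name,age,city
Alice,34,Galway
Bob,29,Berkeley
Alice,51,Berkeley
"""


def select_rows(csv_text, fragment):
    """Return every complete row matching a '#row:<column>=<value>' fragment."""
    scheme, _, condition = fragment.lstrip("#").partition(":")
    if scheme != "row" or "=" not in condition:
        raise ValueError("not a value-based row fragment: " + fragment)
    column, _, value = condition.partition("=")
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep the whole row, including the column used for the match.
    return [row for row in reader if row[column] == value]


for row in select_rows(PEOPLE, "#row:name=Alice"):
    print(row)
```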

Independently from that: The name “slice-based” isn't very appropriate for the current mechanism. “Slice” implies a complete “thin” cut along one dimension. That's how it's used in data warehouse speak, anyways. In that sense, both row-based and column-based selection are slices, but this “slice-based” selection actually is not. More accurate would be “table reduction” or “select+project”, but admittedly these are not very snappy. Perhaps “value-based selection”?

Section 3

URI syntax only allows certain characters. Other characters have to be escaped. CSV cells also allow only certain characters, but a different set, with different escaping rules. I would expect some language here that addresses this. For example, if I have a CSV row:

2011-01-01,1,"Galway, Ireland"

then what exactly would a #where:place=xxx fragment that selects this row look like?
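
One plausible answer, sketched in Python, is to strip the CSV quoting first and then percent-encode the cell value as RFC 3986 requires for characters that are not allowed as-is. Whether the draft intends this, and in particular whether the comma must be encoded or may appear literally, is the open question; the sketch conservatively encodes everything outside the unreserved set.

```python
import csv
from urllib.parse import quote, unquote

line = '2011-01-01,1,"Galway, Ireland"'

# CSV layer: the quotes are CSV escaping and are not part of the cell value.
cells = next(csv.reader([line]))
place = cells[2]

# URI layer: percent-encode everything outside the unreserved set.
fragment = "#where:place=" + quote(place, safe="")
print(fragment)

# Decoding the fragment value recovers the original cell value.
decoded = unquote(fragment.split("=", 1)[1])
print(decoded == place)
```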

Received on Friday, 3 June 2011 19:01:56 UTC