Re: Comments on "URI Fragment Identifiers for the text/csv Media Type" from Michael Hausenblas on 2011-04-28 (uri@w3.org from April 2011)

From: Michael Hausenblas <michael.hausenblas@deri.org>
Date: Thu, 28 Apr 2011 17:35:41 +0100
To: URI IG <uri@w3.org>
Cc: Erik Wilde <dret@berkeley.edu>, Richard Cyganiak <richard.cyganiak@deri.org>
Message-Id: <093520BC-1D3B-4E15-9853-22AA8144C3DB@deri.org>
(forwarding this to list as it seems Richard is not subscribed and  
hence this message didn't show up in the archive).

Thanks a lot for the comments, Richard - I'll follow up in a separate  
mail, soon.

Cheers,
	Michael

On 27 Apr 2011, at 12:05, Richard Cyganiak wrote:

> Hi Erik, hi Michael,
>
> This is a comment on the first draft of “URI Fragment Identifiers  
> for the text/csv Media Type” [1], announced here [2].
>
> Best,
> Richard
>
> [1] http://www.ietf.org/id/draft-hausenblas-csv-fragment-00.txt
> [2] http://lists.w3.org/Archives/Public/uri/2011Apr/0003.html
>
>
> Section 2
>
> The draft does not appear to provide a way of addressing the most
> fundamental part of a CSV file: a cell. I find this confusing, as it
> seems like a really obvious use case to me. In fact, you say that one
> use case is “making assertions about a certain value”. How is this
> possible given the current design?
>
> I guess I'm asking for something like this: #cell:temperature,4 to  
> address the value in the temperature column, row 4.
>
> A less critical but perhaps also interesting feature would be
> Excel-style cell ranges, such as #cells:temperature,4:temperature,6.
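
To make the cell-addressing request concrete, here is a minimal Python sketch of how a #cell:<column>,<row> fragment might be resolved. The #cell: syntax is only the proposal from the comment above, not part of the draft. The function name resolve_cell, the sample data (apart from the temperature and place column names), and the choice to count the header as row 0 are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; only the column names "temperature" and "place"
# come from the thread, the rest is invented for illustration.
SAMPLE = """date,temperature,place
2011-01-01,1,Galway
2011-01-02,-1,Galway
2011-01-03,0,Galway
2011-01-01,6,Berkeley
"""


def resolve_cell(fragment, text):
    """Resolve a '#cell:<column>,<row>' fragment (proposed syntax, not in the draft)."""
    scheme, _, arg = fragment.lstrip("#").partition(":")
    if scheme != "cell":
        raise ValueError("not a cell fragment: " + fragment)
    # Split on the last comma so the row number is always the final component.
    column, _, row = arg.rpartition(",")
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    # Assumption: the header counts as row 0, so row N is the Nth data row.
    return rows[int(row)][header.index(column)]


print(resolve_cell("#cell:temperature,4", SAMPLE))  # prints 6
```

Whether row 4 should count the header or not is exactly the kind of detail the draft would need to pin down; the sketch simply picks one reading.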
>
>
> Section 2.1
>
> This is quite fuzzy on the question of header detection. As the  
> draft is currently designed, an implementation has to detect whether  
> a header is present or not, otherwise it cannot determine what part  
> of the table exactly is being addressed. So is the header=present  
> thing in the media type the only and canonical way of determining  
> presence of headers?
>
> If that is the case, then what about non-HTTP protocols, e.g.,
> file:///Users/richard/test.csv#row:0?
>
> What does #head address if the media type does not indicate the  
> presence of a header?
>
> (A possible solution might be to make the addressed part independent  
> of the presence of a header. #head would simply address the first  
> row, regardless of whether it's actually a header. Same for #row:0.  
> #col:2 would be  
> place,Galway,Galway,Galway,Berkeley,Berkeley,Berkeley. If the  
> example table had no header, then #col:2 would be  
> Galway,Galway,Galway,Berkeley,Berkeley,Berkeley. And #col:Galway  
> would be the same. And so on.)
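
The header-independent reading suggested in the parenthesis above can be sketched in a few lines of Python. This is only an illustration of that suggestion, not behaviour defined by the draft. The helper name resolve_col and the date and temperature values are hypothetical; the place column follows the example quoted above.

```python
def resolve_col(selector, rows):
    """Address a whole column, first row included, whether or not it is a header."""
    if selector.isdigit():
        index = int(selector)
    else:
        # Names are looked up in the first row, whatever that row contains.
        index = rows[0].index(selector)
    return [row[index] for row in rows]


places = ["Galway"] * 3 + ["Berkeley"] * 3
# Hypothetical date and temperature values; only the place column matters here.
data = [["d%d" % i, str(i), place] for i, place in enumerate(places)]
with_header = [["date", "temperature", "place"]] + data
without_header = data

print(",".join(resolve_col("2", with_header)))
# prints place,Galway,Galway,Galway,Berkeley,Berkeley,Berkeley
print(",".join(resolve_col("Galway", without_header)))
# prints Galway,Galway,Galway,Berkeley,Berkeley,Berkeley
```

Under this reading a processor never has to know whether a header is present, at the price of #col:2 including the header cell when there is one.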
>
> The first paragraph of 2.1 is poorly written.
>
>
> Section 2.2
>
> How does the row:n format interact with presence/absence of header?  
> If a header is present, does #row:0 address the same as #head?
>
> A handy feature would be to allow addressing of the last row using
> #row:-1 (and similarly for the second-to-last row, etc.).
>
> What is addressed by #row:1000 if the table has only 10 rows?
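
One possible answer to the two row questions above, sketched in Python: negative numbers count from the end, and an out-of-range number identifies nothing. Both choices are assumptions for the sake of illustration; the draft does not say either way. The name resolve_row and the sample table are hypothetical.

```python
def resolve_row(index, rows):
    """Return the row at `index`, or None if it falls outside the table."""
    # Negative indices count from the end, as #row:-1 would suggest.
    if -len(rows) <= index < len(rows):
        return rows[index]
    # Assumption: an out-of-range index identifies nothing rather than failing.
    return None


table = [["a", "1"], ["b", "2"], ["c", "3"]]  # hypothetical three-row table

print(resolve_row(-1, table))    # prints ['c', '3']
print(resolve_row(1000, table))  # prints None
```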
>
> What is the use case for the #row:* format? It seems a bit obscure  
> to me and perhaps might better be dropped.
>
>
> Section 2.3
>
> It appears that the header row, if present, is excluded from  
> #col:xxx addressing. Maybe this can be clarified in the text.
>
> What is addressed by #col:xxx if xxx is neither a number nor a  
> column in the table?
>
> What is addressed by #col:2 if there is a column named "2"?
>
> What is addressed by #col:xxx if no header is present, or if a  
> header is present but not indicated in the media type?
>
> What is addressed by #col:foo if the header contains a duplicate  
> column, like foo,bar,baz,foo?
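
The ambiguities raised for #col:xxx can be made visible with a small Python sketch that lists every column a selector could plausibly mean, instead of silently picking one. It deliberately does not choose a resolution rule, since that is what the draft would have to decide. The function name candidate_columns is hypothetical.

```python
def candidate_columns(selector, header):
    """Return every column index that `selector` could refer to."""
    # Interpretation 1: the selector is a column name (duplicates all match).
    by_name = [i for i, name in enumerate(header) if name == selector]
    # Interpretation 2: the selector is a zero-based column number.
    by_index = []
    if selector.isdigit() and int(selector) < len(header):
        by_index = [int(selector)]
    return sorted(set(by_name + by_index))


print(candidate_columns("foo", ["foo", "bar", "baz", "foo"]))  # prints [0, 3]
print(candidate_columns("2", ["a", "2", "c"]))                 # prints [1, 2]
print(candidate_columns("xxx", ["a", "b"]))                    # prints []
```

More than one candidate means the fragment is ambiguous, and an empty list means it identifies nothing; in both cases the text would need to say what a processor should do.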
>
>
> Section 2.4
>
> I am unconvinced that the slice-based selection is useful as it is  
> described right now. I'd like to understand better what the use case  
> is. Personally, I can see more use cases for selecting entire rows  
> based on a value match, such as this:
>
>   #row:name=Alice
>
> I would expect the addressed part to be the entire row, including  
> the value that was used for the match. Excluding the matched column  
> seems a bit strange to me and I just have trouble understanding what  
> the motivation is.
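
A minimal Python sketch of the value-match selection described above, returning entire rows including the matched value. The #row:name=Alice syntax is the proposal from the comment, not the draft's mechanism. The function name select_rows, the city column and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical table; only the "name" column and the value "Alice" come from the thread.
TABLE = """name,city
Alice,Galway
Bob,Berkeley
Alice,Berkeley
"""


def select_rows(fragment, text):
    """Return every complete data row whose <column> equals <value>."""
    prefix = "#row:"
    if not fragment.startswith(prefix) or "=" not in fragment:
        raise ValueError("expected '#row:<column>=<value>': " + fragment)
    column, _, value = fragment[len(prefix):].partition("=")
    rows = list(csv.reader(io.StringIO(text)))
    # Assumption: the first row is a header that names the columns.
    index = rows[0].index(column)
    return [row for row in rows[1:] if row[index] == value]


print(select_rows("#row:name=Alice", TABLE))
# prints [['Alice', 'Galway'], ['Alice', 'Berkeley']]
```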
>
> Independently from that: The name “slice-based” isn't very
> appropriate for the current mechanism. “Slice” implies a complete
> “thin” cut along one dimension. That's how it's used in data
> warehouse speak, anyways. In that sense, both row-based and
> column-based selection are slices, but this “slice-based” selection
> actually is not. More accurate would be “table reduction” or
> “select+project”, but admittedly these are not very snappy. Perhaps
> “value-based selection”?
>
>
> Section 3
>
> URI syntax only allows certain characters. Other characters have to
> be escaped. CSV cells also allow only certain characters, but a
> different set, with different escaping rules. I would expect some
> language here that addresses this. For example, if I have a row:
>
>   2011-01-01,1,"Galway, Ireland"
>
> then what exactly would a #where:place=xxx fragment that selects  
> this row look like?
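
One way the two escaping layers might be combined, as a Python sketch: first undo the CSV quoting to get the cell value, then percent-encode that value for the fragment. This is an assumption about what the draft could specify, not something it currently states. The comma and the space are both encoded here so they cannot clash with separators in the fragment syntax.

```python
import csv
from urllib.parse import quote, unquote

line = '2011-01-01,1,"Galway, Ireland"'

# Step 1: CSV unescaping. The csv module strips the surrounding quotes.
cells = next(csv.reader([line]))
place = cells[2]

# Step 2: URI escaping. safe="" percent-encodes every reserved character.
fragment = "#where:place=" + quote(place, safe="")

print(place)     # prints Galway, Ireland
print(fragment)  # prints #where:place=Galway%2C%20Ireland

# A processor would reverse step 2 before comparing against cell values.
decoded = unquote(fragment.split("=", 1)[1])
print(decoded == place)  # prints True
```
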
Received on Thursday, 28 April 2011 16:36:15 UTC