Re: Finding Metadata for CSV Files

On 08 Mar 2014, at 10:25, Jeni Tennison <jeni@jenitennison.com> wrote:

> Hi,
> 
> It feels to me like the ‘Model for Tabular Data and Metadata on the Web’ is getting close to something publishable. The gaps that I’d like to fill are around indicating how an application might discover annotations to create an annotated data model, or might discover groups of tables to create a grouped data model.
> 
> In other words:
> 
>   How does an application find annotations on tables, columns, rows and fields?
>   How does an application find groups of tables and common metadata about them?
> 
> I can think of four possible answers:
> 
>   1. Publish a CSV file with a Link rel=describedby header pointing to a file
>      that provides the annotations, which might also describe other CSV files.
> 
>   2. Publish a package of CSV file(s) and a file that provides the annotations
>      (the Simple Data Format / DSPL model).
> 
>   3. Include a comment line (or something) in a CSV file that points to a file
>      that provides the annotations, which might also describe other CSV files.
> 
>   4. Embed annotations within a CSV file, including pointers to other descriptive
>      documents and CSV files (the Linked CSV / CSV-LD model).
> 
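> (Purely to make 1 and 3 concrete: the syntax and the example.org URLs below are invented just for illustration. For 1, the response for the CSV might carry a header like
> 
>   HTTP/1.1 200 OK
>   Content-Type: text/csv
>   Link: <http://example.org/countries-metadata.json>; rel="describedby"
> 
> and for 3 the CSV itself might start with a comment line such as
> 
>   #describedby http://example.org/countries-metadata.json
>   country,year,population
>   GB,2013,64.1
> 
> so a processor could reach the same metadata document either way.)
> 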
> My current thinking is that we should specify all of the above because:
> 
>   1. is good because the CSV file can remain untouched, but bad because it
>      relies on publisher access to and control of HTTP headers which is 
>      hard in practice
> 
>   2. is good because you get everything in one bundle, but bad because it
>      means duplicating CSV files that belong to multiple packages, making
>      them hard to keep up to date, and limits linking to individual CSV 
>      files (given we lack a good fragment identifier scheme for packages)
> 
>   3. is good because it’s a simple addition to a CSV file, but bad because
>      it means changing existing CSV files and might cause parsing problems
>      for legacy parsers (depending on how the commenting is done)
> 
>   4. is good because it enables embedding of metadata within a file (which
>      means it’s less likely to get out of date) but bad because it means
>      changing CSV files and might cause parsing/processing problems for
>      legacy parsers (depending on how the embedding is done)

I agree that, at least for now and probably in the final version too, we will have to allow for all of these (plus the one Andy just posted on the naming convention). Of course, we will have to establish a priority order, i.e., how to handle the case where the same metadata term appears both in the HTTP header and, say, embedded in the file...
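
Just to make that concrete, I am thinking of something along these lines. This is only a sketch: the function name is made up, and the precedence used in the usage comment (embedded metadata wins over a comment-line reference, which wins over a packaged description, which wins over the HTTP header) is my assumption for illustration, not a proposal:

    # Sketch only: combine metadata found through different discovery
    # mechanisms, letting sources listed earlier override later ones.
    def merge_metadata(*sources):
        """Each source is a dict of metadata terms; earlier sources win."""
        merged = {}
        for source in sources:
            for term, value in source.items():
                merged.setdefault(term, value)
        return merged

    # e.g. merge_metadata(embedded, comment_reference, package, http_header)

Whatever order we end up agreeing on, the important point is that a processor has a single, well-defined merge step.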

_Personally_ I am a little bit wary of an approach that requires modifying the file itself. If we think (do the use cases say so?) that the data is often produced by other tools (Excel or some other data dump), then modifying a possibly large CSV file after the fact seems problematic. The HTTP header and the naming convention approaches have the merit of leaving the file intact...
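
For what it is worth, both of those non-intrusive approaches are also easy to implement on the consumer side. A rough Python sketch (the requests library and the "-metadata.json" suffix for the naming convention are my assumptions, not what Andy actually proposed):

    import requests  # assumed HTTP client; any other would do

    def find_metadata_url(csv_url):
        """Sketch: locate metadata for a CSV file without touching the file."""
        # 1. HTTP header approach: follow a Link rel="describedby" header.
        response = requests.head(csv_url, allow_redirects=True)
        link = response.links.get("describedby")
        if link and "url" in link:
            return link["url"]

        # 2. Naming convention approach; the suffix is a placeholder only.
        candidate = csv_url + "-metadata.json"
        if requests.head(candidate, allow_redirects=True).status_code == 200:
            return candidate

        return None

Either way the CSV file itself stays exactly as the producing tool wrote it.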

Ivan

> 
> (3 could be considered a subset of or related to 4.)
> 
> What do you all think? Any other methods that I’ve missed?
> 
> Jeni
> --  
> Jeni Tennison
> http://www.jenitennison.com/
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf

Received on Sunday, 9 March 2014 11:10:09 UTC