Re: Finding Metadata for CSV Files

Good list.  At least for the moment, I think describing the
possibilities is the way forward.  Eventually we may wish to highlight
one mechanism, but given the different roles involved, saying there is
only one way would create friction in publishing data.

On 08/03/14 09:25, Jeni Tennison wrote:
> Hi,
>
> It feels to me like the ‘Model for Tabular Data and Metadata on the
> Web’ is getting close to something publishable. The gaps that I’d like
> to fill are around indicating how an application might discover
> annotations to create an annotated data model, or might discover groups
> of tables to create a grouped data model.
>
> In other words:
>
>    How does an application find annotations on tables, columns, rows and fields?
>    How does an application find groups of tables and common metadata about them?
>
> I can think of four possible answers:
>
>    1. Publish a CSV file with a Link rel=describedby header pointing to a file
>       that provides the annotations, which might also describe other CSV files.
>
>    2. Publish a package of CSV file(s) and a file that provides the annotations
>       (the Simple Data Format / DSPL model).
>
>    3. Include a comment line (or something) in a CSV file that points to a file
>       that provides the annotations, which might also describe other CSV files.
>
>    4. Embed annotations within a CSV file, including pointers to other descriptive
>       documents and CSV files (the Linked CSV / CSV-LD model).

5. (no advocacy) Naming convention: if there is a "data.csv" then the 
metadata is adjacent under "data.csv.json" or some such.

Similar to a Simple Data Format bundle but in an unpacked form (SDF has 
"datapackage.json" for a whole bundle).

The downside is that if the files ever become separated, the implicit 
association is lost.
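
A rough sketch, in Python for concreteness, of how a consumer might 
probe for such a sidecar file (the ".csv.json" suffix and the URL are 
illustrative assumptions only, not a settled convention):

    import requests

    def find_sidecar_metadata(csv_url):
        # Assumed convention: metadata sits beside the CSV under the
        # same name with ".json" appended (data.csv -> data.csv.json).
        candidate = csv_url + ".json"
        resp = requests.head(candidate, allow_redirects=True)
        return candidate if resp.status_code == 200 else None

    # find_sidecar_metadata("http://example.org/data.csv")
    # -> "http://example.org/data.csv.json" if it exists, else None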

> My current thinking is that we should specify all of the above because:
>
>    1. is good because the CSV file can remain untouched, but bad because it
>       relies on publisher access to and control of HTTP headers which is
>       hard in practice

I agree that relying on control of the HTTP headers is problematic: 
often, whoever controls the server is not whoever produces the CSV files.
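
For comparison, where the header *can* be set, discovery along the 
lines of 1 might look something like this (hypothetical URL; the 
requests library exposes the parsed Link header as response.links):

    import requests

    resp = requests.head("http://example.org/data.csv")
    # requests parses any Link header into a dict keyed by rel
    described_by = resp.links.get("describedby")
    if described_by:
        metadata_url = described_by["url"]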

>    2. is good because you get everything in one bundle, but bad because it
>       means duplicating CSV files that belong to multiple packages, making
>       them hard to keep up to date, and limits linking to individual CSV
>       files (given we lack a good fragment identifier scheme for packages)
>
>    3. is good because it’s a simple addition to a CSV file, but bad because
>       it means changing existing CSV files and might cause parsing problems
>       for legacy parsers (depending on how the commenting is done)
>
>    4. is good because it enables embedding of metadata within a file (which
>       means it’s less likely to get out of date) but bad because it means
>       changing CSV files and might cause parsing/processing problems for
>       legacy parsers (depending on how the embedding is done)
>
> (3 could be considered a subset of or related to 4.)

For 3 and 4, modifying the CSV file at least keeps the link with the 
data, but may be unacceptable: the person/system producing the file may 
not be the person publishing it, and the file format may already be 
prescribed.
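
To make the legacy-parser risk in 3 concrete: a consumer that knows 
the convention would strip the comment line before parsing, while an 
unaware parser sees it as data. A minimal sketch (the "# metadata:" 
syntax is purely an assumption for illustration; CSV has no standard 
comment mechanism):

    import csv, io

    raw = ("# metadata: http://example.org/meta.json\n"
           "name,age\n"
           "Alice,30\n")

    lines = raw.splitlines(keepends=True)
    metadata_url = None
    # A convention-aware consumer peels the pointer off the top...
    if lines and lines[0].startswith("# metadata:"):
        metadata_url = lines[0][len("# metadata:"):].strip()
        lines = lines[1:]
    rows = list(csv.reader(io.StringIO("".join(lines))))
    # ...whereas a legacy parser given the raw text would read the
    # comment line as if it were the header row.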

> What do you all think? Any other methods that I’ve missed?


Andy

>
> Jeni
> --
> Jeni Tennison
> http://www.jenitennison.com/
>
