Finding Metadata for CSV Files

Hi,

It feels to me like the ‘Model for Tabular Data and Metadata on the Web’ is getting close to something publishable. The gaps that I’d like to fill are around indicating how an application might discover annotations to create an annotated data model, or might discover groups of tables to create a grouped data model.

In other words:

  How does an application find annotations on tables, columns, rows and fields?
  How does an application find groups of tables and common metadata about them?

I can think of four possible answers:

  1. Publish a CSV file with a Link rel=describedby header pointing to a file
     that provides the annotations, which might also describe other CSV files.

  2. Publish a package of CSV file(s) and a file that provides the annotations
     (the Simple Data Format / DSPL model).

  3. Include a comment line (or something) in a CSV file that points to a file
     that provides the annotations, which might also describe other CSV files.

  4. Embed annotations within a CSV file, including pointers to other descriptive
     documents and CSV files (the Linked CSV / CSV-LD model).

My current thinking is that we should specify all of the above, because each option's strengths cover the others' weaknesses:

  1. is good because the CSV file can remain untouched, but bad because it
     relies on the publisher having access to and control over HTTP headers,
     which is hard in practice

  2. is good because you get everything in one bundle, but bad because it
     means duplicating CSV files that belong to multiple packages, making
     them hard to keep up to date, and limits linking to individual CSV 
     files (given we lack a good fragment identifier scheme for packages)

  3. is good because it’s a simple addition to a CSV file, but bad because
     it means changing existing CSV files and might cause parsing problems
     for legacy parsers (depending on how the commenting is done)

  4. is good because it enables embedding of metadata within a file (which
     means it’s less likely to get out of date) but bad because it means
     changing CSV files and might cause parsing/processing problems for
     legacy parsers (depending on how the embedding is done)

(3 could be considered a subset of or related to 4.)
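To illustrate the legacy-parser risk with options 3 and 4, here's a sketch of what a metadata-aware consumer might do with a comment line. The "#describedby:" syntax is invented purely for illustration — CSV as commonly deployed (per RFC 4180) has no comment convention at all, which is exactly why an unaware parser would treat such a line as a data row:

```python
import csv
import io

def split_comment(csv_text):
    """Return (metadata_url_or_None, rows) for a CSV that may begin with
    a hypothetical '#describedby: <url>' comment line."""
    lines = csv_text.splitlines()
    url = None
    if lines and lines[0].startswith('#describedby:'):
        url = lines[0][len('#describedby:'):].strip()
        lines = lines[1:]
    rows = list(csv.reader(io.StringIO('\n'.join(lines))))
    return url, rows

data = "#describedby: http://example.org/meta.json\nname,age\nAlice,42\n"
url, rows = split_comment(data)
# url -> 'http://example.org/meta.json'
# rows -> [['name', 'age'], ['Alice', '42']]
```

A legacy parser fed the same input would yield [['#describedby: http://example.org/meta.json'], ['name', 'age'], ['Alice', '42']], i.e. a spurious one-cell first row — tolerable for some consumers, breaking for others.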

What do you all think? Any other methods that I’ve missed?

Jeni
--  
Jeni Tennison
http://www.jenitennison.com/

Received on Saturday, 8 March 2014 09:26:08 UTC