Re: Finding Metadata for CSV Files

On Mar 8, 2014, at 2:25 AM, Jeni Tennison <jeni@jenitennison.com> wrote:
> 
> Hi,
> 
> It feels to me like the ‘Model for Tabular Data and Metadata on the Web’ is getting close to something publishable. The gaps that I’d like to fill are around indicating how an application might discover annotations to create an annotated data model, or might discover groups of tables to create a grouped data model.
> 
> In other words:
> 
>   How does an application find annotations on tables, columns, rows and fields?
>   How does an application find groups of tables and common metadata about them?
> 
> I can think of four possible answers:
> 
>   1. Publish a CSV file with a Link rel=describedby header pointing to a file
>      that provides the annotations, which might also describe other CSV files.

+1 This is essentially what JSON-LD does to annotate ordinary JSON. I also suggested something similar in my CSV-LD proposal.
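To make the mechanism concrete, here is a minimal sketch of how a consumer might recognise the metadata link in option 1. It only parses a Link header value for rel="describedby"; the parsing is deliberately naive (it assumes no commas inside URLs), and the header value shown is an invented example, not anything a spec mandates:

```python
# Sketch: discovering CSV metadata via an HTTP Link header with
# rel="describedby" (option 1). Header parsing only; no fetch is done.
import re

def describedby_links(link_header):
    """Extract target URLs from link-values whose rel is 'describedby'."""
    links = []
    # Naive split on commas between link-values (assumes no commas in URLs)
    for part in link_header.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="?describedby"?', part)
        if m:
            links.append(m.group(1))
    return links

# A response header a CSV publisher might send (illustrative):
header = '<metadata.json>; rel="describedby", <schema.csvm>; rel="schema"'
print(describedby_links(header))  # ['metadata.json']
```

The appeal, as noted above, is that the CSV bytes stay untouched; the cost is that the publisher must control response headers.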

>   2. Publish a package of CSV file(s) and a file that provides the annotations
>      (the Simple Data Format / DSPL model).

+0 I can see why this might be common practice, but it doesn't provide the encapsulation I think is best.
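For comparison, the package approach in option 2 keeps the CSVs untouched but ships them with a single descriptor. A minimal sketch in the Simple Data Format / Data Package style (field names follow the datapackage.json convention; the paths and schema here are invented for illustration):

```json
{
  "name": "example-package",
  "resources": [
    {
      "path": "data/observations.csv",
      "schema": {
        "fields": [
          {"name": "date", "type": "date"},
          {"name": "value", "type": "number"}
        ]
      }
    }
  ]
}
```

Everything travels in one bundle, but as Jeni notes below, a CSV belonging to several packages ends up duplicated.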

>   3. Include a comment line (or something) in a CSV file that points to a file
>      that provides the annotations, which might also describe other CSV files.

+1 Also a practice from JSON-LD, and I suggested identifying a CSV-LD frame/context using a similar mechanism.
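A sketch of what option 3 might look like from the consumer side. The "# describedby:" comment convention is invented here purely for illustration (no such syntax has been agreed); the point is that a legacy parser would trip over the extra line unless it is stripped first:

```python
# Sketch of option 3: a leading comment line in the CSV points at a
# metadata document. The "# describedby:" syntax is hypothetical.
import csv
import io

SAMPLE = """# describedby: http://example.org/metadata.json
name,age
Alice,30
Bob,25
"""

def split_metadata_pointer(text):
    """Return (metadata URL or None, parsed rows) for a CSV string."""
    lines = text.splitlines(keepends=True)
    pointer = None
    if lines and lines[0].startswith("# describedby:"):
        pointer = lines[0][len("# describedby:"):].strip()
        lines = lines[1:]  # a legacy parser would see this as a data row
    rows = list(csv.reader(io.StringIO("".join(lines))))
    return pointer, rows

pointer, rows = split_metadata_pointer(SAMPLE)
print(pointer)   # http://example.org/metadata.json
print(rows[0])   # ['name', 'age']
```

This also shows why the choice of comment syntax matters: whatever is picked has to be something existing parsers can skip or tolerate.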

>   4. Embed annotations within a CSV file, including pointers to other descriptive
>      documents and CSV files (the Linked CSV / CSV-LD model).

+0.5. The annotations could be prepended to the CSV, but that's not ideal.

(Also +0.5 for Andy's #5)

Gregg

> My current thinking is that we should specify all of the above because:
> 
>   1. is good because the CSV file can remain untouched, but bad because it
>      relies on publisher access to and control of HTTP headers which is 
>      hard in practice
> 
>   2. is good because you get everything in one bundle, but bad because it
>      means duplicating CSV files that belong to multiple packages, making
>      them hard to keep up to date, and limits linking to individual CSV 
>      files (given we lack a good fragment identifier scheme for packages)
> 
>   3. is good because it’s a simple addition to a CSV file, but bad because
>      it means changing existing CSV files and might cause parsing problems
>      for legacy parsers (depending on how the commenting is done)
> 
>   4. is good because it enables embedding of metadata within a file (which
>      means it’s less likely to get out of date) but bad because it means
>      changing CSV files and might cause parsing/processing problems for
>      legacy parsers (depending on how the embedding is done)
> 
> (3 could be considered a subset of or related to 4.)
> 
> What do you all think? Any other methods that I’ve missed?
> 
> Jeni
> --  
> Jeni Tennison
> http://www.jenitennison.com/
> 

Received on Monday, 10 March 2014 16:39:01 UTC