Re: Finding Metadata for CSV Files

Good summary. Comments inline.

On 8 March 2014 09:25, Jeni Tennison <jeni@jenitennison.com> wrote:

> Hi,
>
> It feels to me like the 'Model for Tabular Data and Metadata on the Web'
> is getting close to something publishable. The gaps that I'd like to fill
> are around indicating how an application might discover annotations to
> create an annotated data model, or might discover groups of tables to
> create a grouped data model.
>
> In other words:
>
>   How does an application find annotations on tables, columns, rows and
> fields?
>   How does an application find groups of tables and common metadata about
> them?
>
> I can think of four possible answers:
>
>   1. Publish a CSV file with a Link rel=describedby header pointing to a
> file
>      that provides the annotations, which might also describe other CSV
> files.
>

Nice idea, though as you point out it is quite fancy for most people and
involves messing with your HTTP headers.
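
For reference, a consumer supporting option 1 would look for something like
`Link: <metadata.json>; rel="describedby"` in the response headers for the CSV.
A rough sketch of the consumer side (the header value and filename are invented,
and a real parser should follow the Web Linking RFC properly):

```python
import re

def find_describedby(link_header):
    """Pull the rel="describedby" target out of an HTTP Link header value.

    Very rough sketch -- not a full RFC 5988 parser.
    """
    match = re.search(r'<([^>]*)>\s*;\s*rel="?describedby"?', link_header)
    return match.group(1) if match else None

# Hypothetical header a CSV publisher might send alongside data.csv:
header = '<metadata.json>; rel="describedby"'
print(find_describedby(header))  # -> metadata.json
```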


>   2. Publish a package of CSV file(s) and a file that provides the
> annotations
>      (the Simple Data Format / DSPL model).
>
>   3. Include a comment line (or something) in a CSV file that points to a
> file
>      that provides the annotations, which might also describe other CSV
> files.
>

I'm somewhat -1 on both 3 + 4 because I think you don't want to mess with
existing structure, and we'd also like to get people out of the habit of
inlining metadata into CSVs (which breaks them for automated consumption).

I also think we want to aim to "degrade" nicely if we can - i.e. the CSV
should still remain usable even if a given tool doesn't support the spec.

That said, I do appreciate the all-in-one attraction (but if we were doing
that, why not go the whole hog and use JSON ;-) ...)
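
To illustrate the "breaks automated consumption" point: a naive CSV reader has
no notion of comment lines, so a metadata pointer inlined as, say, a
`#`-prefixed first line just comes back as data (the pointer syntax and URL
here are invented for the example):

```python
import csv
import io

# A CSV with a hypothetical inlined metadata pointer on the first line.
raw = '#meta: http://example.org/metadata.json\nname,age\nalice,30\n'

rows = list(csv.reader(io.StringIO(raw)))
# The "comment" is just another row to the parser...
print(rows[0])  # ['#meta: http://example.org/metadata.json']
# ...so the real header row ends up in the wrong place.
print(rows[1])  # ['name', 'age']
```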


>   4. Embed annotations within a CSV file, including pointers to other
> descriptive
>      documents and CSV files (the Linked CSV / CSV-LD model).
>
> My current thinking is that we should specify all of the above because:
>
>   1. is good because the CSV file can remain untouched, but bad because it
>      relies on publisher access to and control of HTTP headers which is
>      hard in practice
>
>   2. is good because you get everything in one bundle, but bad because it
>      means duplicating CSV files that belong to multiple packages, making
>      them hard to keep up to date, and limits linking to individual CSV
>      files (given we lack a good fragment identifier scheme for packages)
>

I'm not quite clear why you have to duplicate - can't you refer to CSVs via
URLs? Also, what about sharing schemas across packages (I don't especially
like it, but it could be possible)? See
https://github.com/dataprotocols/dataprotocols/issues/71
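
On the duplication point, as a sketch (the descriptor shape is loosely
modelled on a Simple Data Format-style package, and the URLs are invented): a
package could point at CSVs by URL rather than bundling copies, so two
packages can share one CSV file:

```python
# Hypothetical package descriptor: resources reference CSVs by URL,
# so nothing needs to be copied into the package itself.
package = {
    "name": "example-package",
    "resources": [
        {"url": "http://example.org/data/populations.csv"},
        {"url": "http://example.org/data/gdp.csv"},
    ],
}

# Consumers fetch each resource from its URL instead of a bundled copy.
for resource in package["resources"]:
    print(resource["url"])
```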

Rufus

Received on Monday, 10 March 2014 16:13:10 UTC