Locating file- and directory-specific metadata (Was: Re: Spec review request: CSV on the Web)

Hi Mark,

On 19 May 2015 at 07:52:01, Mark Nottingham (mnot@mnot.net) wrote:
> > On 18 Apr 2015, at 8:24 pm, Jeni Tennison wrote:
> > The CSV on the Web Working Group would like to request that the TAG review the following  
> > Working Drafts:
> >
> > Model for Tabular Data and Metadata on the Web -
> > http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/
> > Metadata Vocabulary for Tabular Data -
> > http://www.w3.org/TR/2015/WD-tabular-metadata-20150416/
> > Generating JSON from Tabular Data on the Web -
> > http://www.w3.org/TR/2015/WD-csv2json-20150416/
> > Generating RDF from Tabular Data on the Web -
> > http://www.w3.org/TR/2015/WD-csv2rdf-20150416/
>  
> […]
>  
> > 3. The model of access that we’re assuming for CSV and other tabular data files is that  
> > someone will link directly to the CSV file (as currently) and that processors will need  
> > to retrieve a metadata file about that CSV based on the location of the CSV file. Note that  
> > metadata files are file-specific; we wouldn’t expect a single metadata file that includes  
> > information about every CSV file on a particular site.
> >
> > We think that the “correct” way of getting this pointer to a metadata file (given that  
> > there is no scope for embedding information within the CSV file itself) is to use a Link  
> > header that points to the metadata file, and we have specified that here [5].
> >
> > However, we recognise that there are many publishing environments in which it is impossible  
> > for users to set HTTP headers, particularly on an individual file basis. We have therefore  
> > specified two other mechanisms to retrieve metadata files, used only if the URL of the  
> > original CSV file doesn’t include a query string:
> >
> > * appending ‘-metadata.json’ to the end of the URL to get file-specific metadata [6]  
> > * resolving the URL ‘../metadata.json’ against the URL to get directory-level metadata [7]
> >
> > Neither of these feels great: they require users who can’t use Link headers to structure  
> > their URL space in particular ways, and they use string concatenation on URLs which is  
> > horrible. However, we can’t see any better alternative to meet our requirement for what  
> > is in effect a file-specific well known URI.
>  
> More than not "feeling great", they're defined as bad practice in BCP190/RFC7320.

Yes, that was a Britishism. I am aware.

> Having the W3C define a static URI pattern for a metadata file would be a horrible precedent,  
> IMO — one that would likely be used as an excuse for yet more such "conveniences."
>  
> In , I suggested the least-worst option  
> of using .well-known (RFC5785), so that the metadata for e.g., "/foo/bar.json" could  
> be found at "/.well-known/whatever-metadata/foo/bar.json". That's been dismissed  
> by the folks participating in the discussion as insufficient, so I'm getting concerned.  

I thought this worthy of TAG discussion because it illustrates a situation where file- and directory-specific well-known locations would be very useful. This need might arise in other cases, so it would be helpful for the community to have the best-practice way of addressing the requirement spelled out. This is particularly the case if the right solution is .well-known, as .well-known is currently described as being for “site-wide metadata”, not metadata for individual files. Plus it’s always useful to test best-practice guidance against real use cases.

To be more explicit about the requirement, publishers need to be able to publish CSV files and their associated metadata such that the metadata can be found by reusers. If it is hard to publish, they simply won’t bother, and the CSV on the Web work will have little impact. In many cases the publishers of CSV files are not particularly tech-literate, and they are fairly likely to be using shared publishing infrastructure, which is usually not oriented specifically to publishing data.

We have been employing a “GitHub test” to assess the difficulty of publication in a shared hosting environment. We could equally use a “GOV.UK test” or a “WordPress test”; the issues are similar.

There is a significant impact on ease of use for publishers if they have to put metadata files into /.well-known as opposed to the same directory as the CSV file(s):

  * they have to negotiate access to the /.well-known directory (e.g. for CSV files in w3c.github.com/csvw, such as the test suite, we would have to ask the W3C staff to create a new repo that we could have access to)
  * they have to mirror a potentially changing directory structure within that space
  * having the files so separate means they’re likely to go out of sync (e.g. in GitHub, /.well-known would be a completely different repo)

I’m not dismissing this as an approach, just spelling out the usability concerns that I think are behind the Working Group’s pushback on the suggestion.
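
To make the comparison concrete, here is a rough Python sketch of where a processor might look for metadata under each option discussed above. It is only an illustration, not text from any of the drafts: the function name and the example.org URL are made up, "whatever-metadata" is the placeholder from your example rather than a registered name, and dropping the query string for the .well-known case is my assumption.

```python
from urllib.parse import urljoin, urlsplit

# Placeholder prefix taken from the "/.well-known/whatever-metadata/..." example;
# it is not a registered well-known name.
WELL_KNOWN_PREFIX = "/.well-known/whatever-metadata"


def candidate_metadata_urls(csv_url):
    """Return the candidate metadata locations for a CSV URL (hypothetical helper).

    Assumes an absolute http(s) URL without a fragment.
    """
    parts = urlsplit(csv_url)

    # Site-level option: mirror the CSV file's path under /.well-known.
    # Ignoring any query string here is an assumption made for this sketch.
    candidates = {
        "well_known": f"{parts.scheme}://{parts.netloc}{WELL_KNOWN_PREFIX}{parts.path}",
    }

    # The two Working Group fallbacks apply only when there is no query string.
    if not parts.query:
        # File-specific: append '-metadata.json' to the end of the URL.
        candidates["file"] = csv_url + "-metadata.json"
        # Directory-level: resolve '../metadata.json' against the URL.
        candidates["directory"] = urljoin(csv_url, "../metadata.json")

    return candidates


if __name__ == "__main__":
    for kind, url in candidate_metadata_urls(
        "http://example.org/data/2015/tree-ops.csv"
    ).items():
        print(kind, url)
```

For http://example.org/data/2015/tree-ops.csv, the file-specific candidate sits next to the CSV file and the directory-level one is http://example.org/data/metadata.json, whereas the .well-known candidate lives in a separate tree at the site root, which is where the access and synchronisation problems listed above come from.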

I wondered if there might be another option, perhaps using a .well-known subdirectory within the directory holding the CSV files. However, that’s not supported by RFC 5785, which says:

   4. Why aren't per-directory well-known locations defined?

      Allowing every URI path segment to have a well-known location
      (e.g., "/images/.well-known/") would increase the risks of
      colliding with a pre-existing URI on a site, and generally these
      solutions are found not to scale well, because they're too
      "chatty".

I don’t really understand the scalability or chattiness arguments here; perhaps you can expand on them?

Look forward to talking this through on the call next week.

Cheers,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com/

Received on Tuesday, 19 May 2015 17:16:52 UTC