Re: Comment on /TR/tabular-data-model concerning standard file/directory metadata from Jeni Tennison on 2015-03-11 (public-csv-wg@w3.org from March 2015)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 11 Mar 2015 14:56:17 +0000
To: Richard Cyganiak <richard@cyganiak.de>
Cc: public-csv-wg@w3.org
Message-ID: <etPan.55005791.5577f8e1.2fae@jenit.local>
Hi Richard,

Thanks for taking time to look through the CSV on the Web specifications and for your comments.

I’m sympathetic to the objection around the axiom of URI opacity. We’ve tried to come to a pragmatic solution to what we see as major requirements for the discovery of metadata for CSV files.

First, it is unrealistic to expect, per your final suggestion, that people will link to metadata files rather than CSV files. We need a solution that layers on existing practice (which is that publishers link directly to CSV files) rather than one that either requires browsers to change how they work (such that they redirect to the CSV file that people actually want to provide a link to) or has to be understood by users (who might not care at all about the additional metadata that is being provided). That said, we do specify what happens when a tool is pointed at a metadata file rather than directly at a CSV file, so if linking behaviour does change in that direction then that will work.
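
To illustrate the metadata-first case, here is a rough Python sketch of how a tool might find the CSV file once it has been handed a metadata document. It assumes a single-table description whose "url" property names the CSV file, and the function name and example.org URLs are made up for illustration; the draft should be treated as authoritative on the details.

```python
import json
from urllib.parse import urljoin


def csv_url_from_metadata(metadata_text: str, metadata_url: str) -> str:
    """Return the absolute URL of the CSV file a metadata document describes.

    Assumption: the document is a single-table description with a "url"
    property, which may be a relative reference.
    """
    description = json.loads(metadata_text)
    # Resolve a relative reference against the metadata document's own location.
    return urljoin(metadata_url, description["url"])


# Hypothetical metadata document published next to the CSV file it describes.
example = '{"url": "countries.csv"}'
print(csv_url_from_metadata(example, "http://example.org/data/countries.json"))
```

Because the reference is resolved against the metadata document's location, an absolute "url" would let the CSV file live on a completely different host.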

Second, the use case that you describe as unrealistic, in which people are able to create JSON metadata files but cannot publish on a system that supports Link headers, is in fact very common. JSON metadata files could be generated by tooling such as http://data.okfn.org/tools/create, so people can be non-technical and still create JSON files. And even technical people are publishing data using e.g. GitHub Pages or common content management systems, which provide no mechanism, or no easy one, for setting Link headers.

There is no *requirement* that people use the specified locations for metadata: they can place them anywhere else, even on completely separate systems, if they can use the Link header or rely on people navigating first to a metadata document. This does not restrict anyone’s ability to manage their URI space.
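
For publishers who can set headers, a client-side sketch of picking the metadata location out of a Link header might look like the following. This is a deliberately simplified parser, not a full implementation of the Link header grammar; the relation type "describedby", the function name and the example URLs are assumptions made for illustration.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# One link-value: <target> followed by its ;-separated parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]*)*)')
# One parameter: name=value, with optional double quotes around the value.
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*"?([^";,]*)"?')


def metadata_url_from_link_header(header: str, csv_url: str,
                                  rel: str = "describedby") -> Optional[str]:
    """Return the target of the first link whose rel includes `rel`, else None."""
    for target, params in _LINK.findall(header):
        values = {name.lower(): value for name, value in _PARAM.findall(params)}
        # A rel parameter may list several space-separated relation types.
        if rel in values.get("rel", "").lower().split():
            # Relative targets are resolved against the CSV file's URL.
            return urljoin(csv_url, target)
    return None


header = '<a.css>; rel="stylesheet", <http://meta.example.org/c.json>; rel="describedby"'
print(metadata_url_from_link_header(header, "http://example.org/data/countries.csv"))
```

Since the target is an arbitrary URL, the metadata can sit on a separate system from the data, which is the flexibility described above.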

Regarding unnecessary and 404ing requests, we know that this isn’t at all ideal. There are several protocols that do require optimistic requests for files which might or might not exist, such as robots.txt, favicons, sitemaps, and all of the .well-known locations (see RFC 5785). I’d be happy to re-open the issue about stopping at the first discovery of a metadata file (such that those who used Link headers wouldn’t have to deal with unnecessary 404ing requests) and open an issue to give the metadata.json directory file a more distinctive name, to avoid clashes.
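
A minimal Python sketch of what "stopping at the first discovery" could mean in practice follows. The lookup order (Link header target, then a file-specific location, then metadata.json in the same directory), the "-metadata.json" suffix and all function names are assumptions based on this thread rather than normative text; the `exists` callable stands in for an HTTP request so that the sketch runs offline.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urljoin


def candidate_metadata_urls(csv_url: str,
                            linked_url: Optional[str] = None) -> Iterator[str]:
    """Yield possible metadata locations for a CSV file, most specific first."""
    if linked_url is not None:
        yield linked_url                     # advertised through a Link header
    yield csv_url + "-metadata.json"         # file-specific metadata (assumed suffix)
    yield urljoin(csv_url, "metadata.json")  # directory-level metadata


def first_metadata_url(csv_url: str,
                       exists: Callable[[str], bool],
                       linked_url: Optional[str] = None) -> Optional[str]:
    """Probe candidates in order and stop at the first one that exists."""
    for url in candidate_metadata_urls(csv_url, linked_url):
        if exists(url):
            return url
    return None


# Fake "server" for illustration: only the directory-level file is present.
published = {"http://example.org/data/metadata.json"}
print(first_metadata_url("http://example.org/data/countries.csv", published.__contains__))
```

Under this scheme a publisher who supplies a Link header pointing at an existing metadata file triggers one probe and no 404s, because the generator is never advanced past the first hit.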

More generally, if you have an alternative workable solution, given the assumptions and requirements above, we’d be glad to hear it.

Cheers,

Jeni
--  
Jeni Tennison
http://www.jenitennison.com/

On 9 March 2015 at 22:29:20, Richard Cyganiak (richard@cyganiak.de) wrote:
> Dear CSV WG,
>  
> This is a comment on your draft, “Model for Tabular Data and Metadata on the Web”.
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150108/
>  
> Let me first say that the document is great, and I expect it to serve as a solid foundation  
> for future work around CSV.
>  
> There is however an issue that I think should be reconsidered.
>  
> It concerns sections 3.4 and 3.5, “Standard File Metadata” and “Standard Directory  
> Metadata”.
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150108/#standard-file-metadata  
> http://www.w3.org/TR/2015/WD-tabular-data-model-20150108/#standard-directory-metadata  
>  
> The mechanism described there lacks a realistic use case, and is bad for all sorts of reasons,  
> including:
>  
> - It means that conforming processors must make three requests to retrieve a single CSV  
> file, two of which will almost always fail.
>  
> - It does not include any protocol by which client and server can work out in advance that  
> the metadata requests would be futile and hence should be avoided.
>  
> - It violates the axiom of URI opacity.
>  
> - It hobbles the ability of publishers who would like to deploy a different URI design,  
> restricting their ability to manage their URI space the way they like, or to evolve it  
> in the future.
>  
> - It makes setups where data and metadata are published from separate systems (e.g.,  
> data on FTP server, metadata on a CKAN-style data catalogue) unnecessarily complicated  
> and awkward.
>  
> - It gets even worse if a format different from JSON becomes somewhat popular in the future,  
> as now processors will have to do even more requests in search of a file that probably isn’t  
> there.
>  
> - It only addresses an unrealistic use case where the publisher is so untechnical that  
> they can only publish static files from the file system, but is also so technical that  
> they can write JSON by hand.
>  
> - It is equivalent to a proposal to discover the landing page for some-image.gif by going  
> to some-image.gif-landing-page.html. No one should implement such a ridiculous thing.  
>  
> If everyone was designing protocols like this, the web would be a firework of 404s where  
> clients poke blindly at servers…
>  
> The solution, I think, is simple: When metadata is published in a separate file, instead  
> of sending around the URL of the CSV file, one should send around the URL of the metadata  
> file, which contains a pointer to the CSV file.
>  
> Best,
> Richard
>  
>  
> (This is my personal opinion and I do not speak for my employer.)
>
Received on Wednesday, 11 March 2015 14:56:43 UTC