Use of .well-known for CSV metadata: More harm than good

The CSVW working group recently sought the TAG's advice on locating 
metadata associated with a tabular data document (typically CSV) 
retrieved from a given URI:
https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md
Among other mechanisms, the CSVW WG proposed that metadata could be 
retrieved from two standard locations (one per file and one per 
directory) relative to the original tabular data document URI:
http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#standard-file-metadata

   {+url}-metadata.json
   metadata.json

where {+url} is the URL of the CSV document.  For example, given a 
tabular data URL http://example/foo.csv , a CSVW processor would 
automatically look for its associated metadata at the following URLs:

   http://example/foo.csv-metadata.json
   http://example/metadata.json
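
As a rough sketch of how those two default locations follow from the data URL (the function name is hypothetical, and plain string concatenation plus Python's urljoin stand in here for the URI template expansion and relative resolution the draft describes):

```python
from urllib.parse import urljoin


def default_metadata_urls(data_url):
    """Return the per-file and per-directory metadata candidates."""
    # {+url}-metadata.json: append the suffix to the data URL itself.
    per_file = data_url + "-metadata.json"
    # metadata.json: resolve relative to the directory of the data URL.
    per_directory = urljoin(data_url, "metadata.json")
    return [per_file, per_directory]


if __name__ == "__main__":
    for candidate in default_metadata_urls("http://example/foo.csv"):
        print(candidate)
```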

Presumably out of a concern that this would be URI squatting and violate 
RFC7320
http://tools.ietf.org/html/rfc7320#section-3
the TAG's guidance was to use the RFC5785 .well-known mechanism to 
enable sites to specify custom metadata URIs based on templates, rather 
than relying on those standard relative locations.

Although URI squatting is an important issue to guard against, I do not 
believe it actually applies in this case, and use of .well-known would 
cause more harm than good.

What distinguishes this case is that a tabular metadata file must 
*explicitly* reference the associated data document in order for it to 
be used as a CSVW metadata document.  This is a critical point, which 
IMO changes the balance of the situation.  It means that: (a) the URI 
owner has clearly indicated the intent to use that metadata URI for that 
purpose; and (b) it does *not* prevent that URI from instead being used 
for other purposes.   It *does* prevent that URI from simultaneously 
being used for the tabular metadata and for some other purpose, and 
hence it does force the URI owner to choose between using it for tabular 
metadata or for something else.  But even in that case, if the URI owner 
really wants to use that URI for another purpose while *still* providing 
tabular metadata, then the URI owner still has the option of publishing 
the metadata at an arbitrary custom URI, and publicizing that location, 
because the metadata file will explicitly reference the data file 
anyway.  (In other words, although the most common case may be that a 
user would first know the URL of the tabular *data* file, and from that 
seek the associated metadata, it is perfectly acceptable -- and in some 
ways better -- for the user to start with the URL of the metadata file, 
and use that to find the desired data file URL.)  For example, the URI 
owner could publish the metadata at http://example/my-foo-metadata.json 
(which in turn would point to http://example/foo.csv ) and then 
advertise that URL.
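
To illustrate the explicit-reference check, here is a minimal sketch. It assumes the metadata is JSON and names its data file in a "url" property, either at the top level or inside a "tables" list; those property names and the function name are my assumptions for illustration, not a statement of what the draft requires.

```python
import json
from urllib.parse import urljoin


def metadata_describes(metadata_text, metadata_url, data_url):
    """True if the metadata document explicitly points at data_url."""
    doc = json.loads(metadata_text)
    # Assumed shape: a top-level "url", or a "tables" list of objects
    # that each carry their own "url".
    entries = [doc] + list(doc.get("tables", []))
    for entry in entries:
        ref = entry.get("url")
        # Relative references are resolved against the metadata URL.
        if ref is not None and urljoin(metadata_url, ref) == data_url:
            return True
    return False
```

Under those assumptions, a processor that fetched http://example/my-foo-metadata.json would use it for http://example/foo.csv only if this check succeeds, which is why a file that merely happens to sit at one of the default locations does no harm.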

Harms that would be caused by requiring the use of .well-known in this 
case include:

  - A *required* extra web access, nearly *every* time a conforming CSVW 
processor is given a tabular data URL and wishes to find the associated 
metadata -- because surely http://example/.well-known/csvm will be 404 
(and not cacheable) in the vast majority of cases.

  - Greater complexity in all conforming CSVW implementations.

  - Reduced security, because a change to .well-known/csvm could 
completely change the interpretation of a given tabular data file, and 
that change would be far afield from the directory containing the data 
file, and thus may go completely unnoticed by the owner of the data file.
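
The extra access can be seen in a small sketch of the lookup. It assumes the one-pattern-per-line file format and the fall-back to the default patterns described in the forwarded thread below; the fetch callback and the function name are hypothetical, and no real network access is performed.

```python
from urllib.parse import urljoin

DEFAULT_PATTERNS = ["{+url}-metadata.json", "metadata.json"]


def candidate_metadata_urls(data_url, fetch):
    """fetch(url) returns the body text, or None on a 404 (hypothetical)."""
    # The site-wide file must be requested before any candidate is known.
    body = fetch(urljoin(data_url, "/.well-known/csvm"))
    if body:
        patterns = [line.strip() for line in body.splitlines() if line.strip()]
    else:
        patterns = DEFAULT_PATTERNS
    # Expand {+url}, then resolve the result against the data URL.
    return [urljoin(data_url, p.replace("{+url}", data_url)) for p in patterns]


if __name__ == "__main__":
    requested = []

    def always_404(url):
        requested.append(url)
        return None

    print(candidate_metadata_urls("http://example/foo.csv", always_404))
    print(requested)
```

In the 404 case the function returns the same two default locations as before, but only after one additional request has been made.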

In short, I think the benefits of .well-known in this case are dubious, 
and far outweighed by the harms.   I think the TAG's guidance to the 
CSVW group should be amended.

Thanks,
David Booth

-------- Forwarded Message --------
Subject: Re: .well-known
Resent-Date: Thu, 18 Jun 2015 16:56:48 +0000
Resent-From: public-csv-wg@w3.org
Date: Thu, 18 Jun 2015 09:56:15 -0700
From: Gregg Kellogg <gregg@greggkellogg.net>
To: David Booth <david@dbooth.org>
CC: Ivan Herman <ivan@w3.org>, W3C CSV on the Web Working Group 
<public-csv-wg@w3.org>

> On Jun 17, 2015, at 7:43 PM, David Booth <david@dbooth.org> wrote:
>
> On 06/17/2015 02:29 AM, Ivan Herman wrote:
>> David,
>>
>> the .well-known mechanism is the result of a long discussion with the
>> TAG that had difficulties with the principle of baking in URI-schemes
>> like "-metadata.json".
>
> Is there a pointer to that discussion?   It sounds like the TAG concern is URI squatting.  URI squatting is an important concern, but I don't think it applies in this case, because -- if I've understood correctly -- a metadata file *explicitly* references the relevant data file, which in effect means that the URI owner has clearly indicated an intent to use that URI for that purpose.

Hi David, I found a link to the minutes here: 
https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md 
(already added to the issue).

The minutes aren’t particularly illuminating, but the issue raised by 
mnot was definitely concern over squatting. At this point, it seems to 
be settled. I’ve added it to my implementation, and it was quite 
straightforward; although it requires an extra GET, the result can be 
cached for some time (subject to policies, of course).

> HOWEVER, I no longer see any mention of .well-known in the current editor's draft, so maybe my concern is moot:
> http://w3c.github.io/csvw/syntax/#locating-metadata

It’s still in a PR that hasn’t yet been pulled: 
https://github.com/w3c/csvw/pull/605. You likely saw a page based on 
that branch, rather than the gh-pages branch where the ED is available.

It’s awaiting resolution of some minor wording on what “no such file is 
located” means, precisely.

Gregg

> Has the .well-known mechanism now been removed from the algorithm for finding metadata?
>
> Thanks,
> David Booth
>
>> Note that the agreement is to have a default
>> fall-back, ie, if the .well-known file does not exist then the client
>> can fall back to a default value which, actually, reproduces the
>> previous patterns. I think we should go ahead with this approach to
>> cover all points of view.
>>
>> Ivan
>>
>>
>>
>>> On 17 Jun 2015, at 05:20 , David Booth <david@dbooth.org> wrote:
>>>
>>> I'm sorry to ask this question at this point, but is .well-known
>>> *really* needed for this?
>>>
>>> I am concerned that it is just adding complexity and network
>>> accesses for dubious benefit.  AFAICT -- but please correct me if
>>> I've overlooked something -- the only "benefit" that .well-known
>>> adds here is to allow users to use non-standard names for their
>>> metadata files.  And what *real* benefit is that?  It seems to me
>>> to be adding pointless variability.  Are there really cases where
>>> users *cannot* name their metadata files to end with
>>> "-metadata.json"?  If so what are they?
>>>
>>> David Booth
>>>
>>> On 06/16/2015 09:20 PM, Yakov Shafranovich wrote:
>>>> Hmm. I am wondering if we can use the host-meta file instead,
>>>> skipping the registration, as per this:
>>>>
>>>> https://tools.ietf.org/html/rfc6415#section-4.2
>>>>
>>>> On Tue, Jun 16, 2015 at 4:01 PM, Gregg Kellogg
>>>> <gregg@greggkellogg.net> wrote:
>>>>> On Jun 16, 2015, at 12:55 PM, Yakov Shafranovich
>>>>> <yakov-ietf@shaftek.org> wrote:
>>>>>
>>>>> What's the proposed format?
>>>>>
>>>>> It's simply a file with one URI pattern per line. You can see
>>>>> the proposed text here:
>>>>> https://rawgit.com/w3c/csvw/98e728bcfef8d30e68c10f9cd798da0d39c7d172/syntax/index.html#site-wide-location-configuration
>>>>>
>>>>>
>>>>>
>>>>> Gregg
>>>>>
>>>>>
>>>>> On Jun 16, 2015 3:38 PM, "Ivan Herman" <ivan@w3.org> wrote:
>>>>>>
>>>>>> Jeni, Gregg,
>>>>>>
>>>>>> I have just received the green light from our system people
>>>>>> to set up the .well-known csw file. Can you ping me when the
>>>>>> changes are added to the documents and the issue is closed? I
>>>>>> would also need to know if it should contain anything else
>>>>>> than the default.
>>>>>>
>>>>>> I will also take care of the registration when the document
>>>>>> is available.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> ---- Ivan Herman +31 641044153
>>>>>>
>>>>>> (Written on my mobile. Excuses for brevity and frequent
>>>>>> misspellings...)
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> ---- Ivan Herman, W3C Digital Publishing Activity Lead Home:
>> http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID:
>> http://orcid.org/0000-0003-0782-2704
>>
>>
>>
>>
>

Received on Thursday, 18 June 2015 19:15:37 UTC