Re: Use of .well-known for CSV metadata: More harm than good from David Booth on 2015-06-18 (www-tag@w3.org from June 2015)

From: David Booth <david@dbooth.org>
Date: Thu, 18 Jun 2015 15:56:54 -0400
To: Henry Story <henry.story@co-operating.systems>
CC: "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <55832286.7050200@dbooth.org>
On 06/18/2015 03:36 PM, Henry Story wrote:
> The solution that would fit the Linked Data Platform way of doing things
>
>
> 	http://www.w3.org/TR/ldp/
>
> would be to have a Link header with a relation pointing to the right file.

Yes, that is one of the existing mechanisms that CSVW allows, which I 
did not mention.  But the assumption is that many users will not be able 
to set Link headers.
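
For readers who want to see what the Link header route might look like on the client side, here is a minimal sketch. It assumes the link relation is named "describedby", which is not stated anywhere in this thread, and the function name and the simple regex-based parsing are purely illustrative (it does not handle every legal Link header form).

```python
import re
from urllib.parse import urljoin

# Matches one link-value: <target> followed by zero or more ;param segments.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def metadata_from_link_header(header_value, csv_url, rel="describedby"):
    """Return the absolute metadata URL named by a Link header, or None.

    The relation name "describedby" is an assumption for illustration.
    """
    for match in LINK_RE.finditer(header_value):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            # rel may carry several space-separated relation types.
            if name.lower() == "rel" and rel in value.strip('"').split():
                # The target may be relative to the CSV document's URL.
                return urljoin(csv_url, target)
    return None
```

A publisher who cannot set response headers gets nothing from this route, which is the case the other mechanisms are meant to cover.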

David Booth

>
> Henry
>
>> On 18 Jun 2015, at 21:15, David Booth <david@dbooth.org> wrote:
>>
>> The CSVW working group recently sought the TAG's advice on locating metadata associated with a tabular data document (typically CSV) retrieved from a given URI:
>> https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md
>> Among other mechanisms, the CSVW WG proposed that metadata could be retrieved from two standard locations (one per file and one per directory) relative to the original tabular data document URI:
>> http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/#standard-file-metadata
>>
>>   {+url}-metadata.json
>>   metadata.json
>>
>> where {+url} is the URL of the CSV document.  For example, given a tabular data URL http://example/foo.csv , a CSVW processor would automatically look for its associated metadata at the following URLs:
>>
>>   http://example/foo.csv-metadata.json
>>   http://example/metadata.json
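
As a rough illustration of the two standard locations above, the following sketch derives both candidate URLs from a tabular data URL. The function name is hypothetical; the only inputs taken from the message are the two patterns and the example URL.

```python
from urllib.parse import urljoin


def default_metadata_urls(csv_url):
    """Return the per-file and per-directory metadata URLs for csv_url."""
    per_file = csv_url + "-metadata.json"             # {+url}-metadata.json
    per_directory = urljoin(csv_url, "metadata.json")  # metadata.json, resolved against the CSV URL
    return [per_file, per_directory]


if __name__ == "__main__":
    for url in default_metadata_urls("http://example/foo.csv"):
        print(url)
```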
>>
>> Presumably out of a concern that this would be URI squatting and violate RFC7320
>> http://tools.ietf.org/html/rfc7320#section-3
>> the TAG's guidance was to use the RFC5785 .well-known mechanism to enable sites to specify custom metadata URIs based on templates, rather than relying on those standard relative locations.
>>
>> Although URI squatting is an important issue to guard against, I do not believe it actually applies in this case, and use of .well-known would cause more harm than good.
>>
>> What distinguishes this case is that a tabular metadata file must *explicitly* reference the associated data document in order for it to be used as a CSVW metadata document.  This is a critical point, which IMO changes the balance of the situation.  It means that: (a) the URI owner has clearly indicated the intent to use that metadata URI for that purpose; and (b) it does *not* prevent that URI from instead being used for other purposes.   It *does* prevent that URI from simultaneously being used for the tabular metadata and for some other purpose, and hence it does force the URI owner to choose between using it for tabular metadata or for something else.  But even in that case, if the URI owner really wants to use that URI for another purpose while *still* providing tabular metadata, then the URI owner still has the option of publishing the metadata at an arbitrary custom URI, and publicizing that location, because the metadata file will explicitly reference the data file anyway. 
>> (In other words, although the most common case may be that a user would first know the URL of the tabular *data* file, and from that seek the associated metadata, it is perfectly acceptable -- and in some ways better -- for the user to start with the URL of the metadata file, and use that to find the desired data file URL.)  For example, the URI owner could publish the metadata at http://example/my-foo-metadata.json (which in turn would point to http://example/foo.csv ) and then advertise that URL.
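
The "explicit reference" check described above could be sketched roughly as follows. The property names "url" and "tables" are assumptions made for illustration (the thread does not say how a metadata file names its data file), and the function name is hypothetical.

```python
from urllib.parse import urljoin


def metadata_describes(metadata, metadata_url, csv_url):
    """Return True only if the metadata explicitly references csv_url.

    metadata is the parsed JSON document found at metadata_url. A file that
    happens to sit at a standard location but does not point back at the
    data file is simply not treated as metadata for it.
    """
    # Assumed shape: either a single description with "url", or a list
    # of descriptions under "tables".
    tables = metadata.get("tables", [metadata])
    return any(
        "url" in table and urljoin(metadata_url, table["url"]) == csv_url
        for table in tables
    )
```

With the example from the message, metadata published at http://example/my-foo-metadata.json containing a reference to foo.csv would pass this check for http://example/foo.csv, while an unrelated JSON file at the same location would not.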
>>
>> Harms that would be caused by requiring the use of .well-known in this case include:
>>
>> - A *required* extra web access, nearly *every* time a conforming CSVW processor is given a tabular data URL and wishes to find the associated metadata -- because surely http://example/.well-known/csvm will be 404 (and not cacheable) in the vast majority of cases.
>>
>> - Greater complexity in all conforming CSVW implementations.
>>
>> - Reduced security, because a change to .well-known/csvm could completely change the interpretation of a given tabular data file, and that change would be far afield from the directory containing the data file, and thus may go completely unnoticed by the owner of the data file.
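
To make the extra web access concrete, here is a minimal sketch of a lookup that consults .well-known/csvm first and falls back to the standard locations. It assumes the file holds one URI pattern per line, as described elsewhere in this thread, handles only the {+url} variable rather than full URI templates, and takes a caller-supplied fetch function (hypothetical) that returns None for a 404.

```python
from urllib.parse import urljoin

# Fallback patterns, reproducing the standard relative locations.
DEFAULT_TEMPLATES = ["{+url}-metadata.json", "metadata.json"]


def candidate_metadata_urls(csv_url, fetch):
    """Return candidate metadata URLs for csv_url.

    fetch(url) must return the response body as text, or None on a 404.
    """
    well_known = urljoin(csv_url, "/.well-known/csvm")
    body = fetch(well_known)  # the additional request, made before anything else
    if body is None:
        templates = DEFAULT_TEMPLATES
    else:
        templates = [line.strip() for line in body.splitlines() if line.strip()]
    # Expand {+url}, then resolve the result against the CSV document's URL.
    return [urljoin(csv_url, t.replace("{+url}", csv_url)) for t in templates]
```

Note that the site-wide file, when present, decides which documents are tried as metadata for every CSV on the host, which is the "far afield" change the third bullet is concerned about.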
>>
>> In short, I think the benefits of .well-known in this case are dubious, and far outweighed by the harms.   I think the TAG's guidance to the CSVW group should be amended.
>>
>> Thanks,
>> David Booth
>>
>> -------- Forwarded Message --------
>> Subject: Re: .well-known
>> Resent-Date: Thu, 18 Jun 2015 16:56:48 +0000
>> Resent-From: public-csv-wg@w3.org
>> Date: Thu, 18 Jun 2015 09:56:15 -0700
>> From: Gregg Kellogg <gregg@greggkellogg.net>
>> To: David Booth <david@dbooth.org>
>> CC: Ivan Herman <ivan@w3.org>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
>>
>>> On Jun 17, 2015, at 7:43 PM, David Booth <david@dbooth.org> wrote:
>>>
>>> On 06/17/2015 02:29 AM, Ivan Herman wrote:
>>>> David,
>>>>
>>>> the .well-known mechanism is the result of a long discussion with the
>>>> TAG that had difficulties with the principle of baking in URI-schemes
>>>> like "-metadata.json".
>>>
>>> Is there a pointer to that discussion?   It sounds like the TAG concern is URI squatting.  URI squatting is an important concern, but I don't think it applies in this case, because -- if I've understood correctly -- a metadata file *explicitly* references the relevant data file, which in effect means that the URI owner has clearly indicated an intent to use that URI for that purpose.
>>
>> Hi David, I found a link to the minutes here: https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md (already added to the issue).
>>
>> The minutes aren’t particularly illuminating, but the issue raised by mnot was definitely concern over squatting. At this point, it seems to be settled. I’ve implemented it, and it was quite straightforward. Although it requires an extra GET, the result can be cached for some time (subject to policies, of course).
>>
>>> HOWEVER, I no longer see any mention of .well-known in the current editor's draft, so maybe my concern is moot:
>>> http://w3c.github.io/csvw/syntax/#locating-metadata
>>
>> It’s still in a PR that hasn’t yet been pulled: https://github.com/w3c/csvw/pull/605. You likely saw a page based on that branch, rather than the gh-pages branch where the ED is available.
>>
>> It’s awaiting resolution of some minor wording on what “no such file is located” means, precisely.
>>
>> Gregg
>>
>>> Has the .well-known mechanism now been removed from the algorithm for finding metadata?
>>>
>>> Thanks,
>>> David Booth
>>>
>>>> Note that the agreement is to have a default
>>>> fall-back, ie, if the .well-known file does not exist then the client
>>>> can fall back to a default value which, actually, reproduces the
>>>> previous patterns. I think we should go ahead with this approach to
>>>> cover all points of view.
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>>> On 17 Jun 2015, at 05:20 , David Booth <david@dbooth.org> wrote:
>>>>>
>>>>> I'm sorry to ask this question at this point, but is .well-known
>>>>> *really* needed for this?
>>>>>
>>>>> I am concerned that it is just adding complexity and network
>>>>> accesses for dubious benefit.  AFAICT -- but please correct me if
>>>>> I've overlooked something -- the only "benefit" that .well-known
>>>>> adds here is to allow users to use non-standard names for their
>>>>> metadata files.  And what *real* benefit is that?  It seems to me
>>>>> to be adding pointless variability.  Are there really cases where
>>>>> users *cannot* name their metadata files to end with
>>>>> "-metadata.json"?  If so what are they?
>>>>>
>>>>> David Booth
>>>>>
>>>>> On 06/16/2015 09:20 PM, Yakov Shafranovich wrote:
>>>>>> Hmm. I am wondering if we can use the host-meta file instead,
>>>>>> skipping the registration, as per this:
>>>>>>
>>>>>> https://tools.ietf.org/html/rfc6415#section-4.2
>>>>>>
>>>>>> On Tue, Jun 16, 2015 at 4:01 PM, Gregg Kellogg
>>>>>> <gregg@greggkellogg.net> wrote:
>>>>>>> On Jun 16, 2015, at 12:55 PM, Yakov Shafranovich
>>>>>>> <yakov-ietf@shaftek.org> wrote:
>>>>>>>
>>>>>>> What's the proposed format?
>>>>>>>
>>>>>>> It's simply a file with one URI pattern per line. You can see
>>>>>>> the proposed text here:
>>>>>>> https://rawgit.com/w3c/csvw/98e728bcfef8d30e68c10f9cd798da0d39c7d172/syntax/index.html#site-wide-location-configuration
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Gregg
>>>>>>>
>>>>>>>
>>>>>>> On Jun 16, 2015 3:38 PM, "Ivan Herman" <ivan@w3.org> wrote:
>>>>>>>>
>>>>>>>> Jeni, Gregg,
>>>>>>>>
>>>>>>>> I have just received the green light from our system people
>>>>>>>> to set up the .well-known csw file. Can you ping me when the
>>>>>>>> changes are added to the documents and the issue is closed? I
>>>>>>>> would also need to know if it should contain anything else
>>>>>>>> than the default.
>>>>>>>>
>>>>>>>> I will also take care of the registration when the document
>>>>>>>> is available.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Ivan
>>>>>>>>
>>>>>>>> ---- Ivan Herman +31 641044153
>>>>>>>>
>>>>>>>> (Written on my mobile. Excuses for brevity and frequent
>>>>>>>> misspellings...)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> ---- Ivan Herman, W3C Digital Publishing Activity Lead Home:
>>>> http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID:
>>>> http://orcid.org/0000-0003-0782-2704
>>>>
>>>>
>>>>
>>>>
>>>
>>
Received on Thursday, 18 June 2015 19:57:22 UTC