Re: Use of .well-known for CSV metadata: More harm than good from David Booth on 2015-06-19 (www-tag@w3.org from June 2015)

From: David Booth <david@dbooth.org>
Date: Fri, 19 Jun 2015 03:32:49 -0400
To: Mark Nottingham <mnot@mnot.net>
CC: "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <5583C5A1.8080106@dbooth.org>
Hi Mark,

Thanks for your comments. Replies below inline...

On 06/19/2015 12:29 AM, Mark Nottingham wrote:
>
>> On 19 Jun 2015, at 1:28 pm, David Booth <david@dbooth.org> wrote:
>>
>> On 06/18/2015 10:08 PM, Mark Nottingham wrote:
>>> Hi David,
>>>
>>>> On 19 Jun 2015, at 5:15 am, David Booth <david@dbooth.org>
>>>> wrote:
>>>
>>> […]
>>>
>>>> What distinguishes this case is that a tabular metadata file
>>>> must *explicitly* reference the associated data document in
>>>> order for it to be used as a CSVW metadata document.  This is a
>>>> critical point, which IMO changes the balance of the
>>>> situation.
>>>
>>> Where is this specified?
>>
>> Section 5.3:
>> http://w3c.github.io/csvw/syntax/#h-standard-file-metadata "If the
>> metadata file found at this location does not explicitly include a
>> reference to the relevant tabular data file then it MUST be
>> ignored."
>
> Ah, I see - thanks.
>
> I don't think this changes much. Effectively, I see this as saying
> that because we're not willing to make assumptions about the
> structure of a CSV file, we're going to make assumptions about the
> URI space *and* the structure of the JSON that we might retrieve from
> it (unless we're willing to enforce a specific media type for it).

I see it more like saying "If you *choose* to use this mechanism for
associating your JSON metadata with your CSV file then you may do so by
structuring your URI space this way *and* writing your metadata file
this way."  But maybe we're saying the same thing.

>
> It also doesn't address much of the issue at hand — e.g., if a server
> already has a resource at that location, it's cold comfort that it
> won't be accidentally misinterpreted as metadata for the CSV; they
> still either have to move that resource (not "cool"), or use a Link
> header to locate metadata (which we're told are too onerous).

No, they don't.  They also have the option of storing the metadata file
at whatever URI they want, and advertising the metadata URI instead of
(or in addition to) the CSV URI.  A CSVW processor can start with the
metadata URI, and use that to locate the CSV URI, instead of starting
with the CSV URI.
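
To illustrate starting from the metadata URI, here is a rough sketch (not taken from the spec text) of how a processor might derive the CSV URIs from a metadata document it was handed. It assumes the metadata is already parsed JSON, that a table group lists its tables under "tables" while a single-table description carries "url" at the top level, and that relative URLs resolve against the metadata document's own location. The function name is hypothetical.

```python
from urllib.parse import urljoin


def referenced_table_urls(metadata, metadata_url):
    """Return the absolute URLs of the tabular data files named by a metadata document.

    Assumption: `metadata` is parsed JSON, either a table group (with a
    "tables" list) or a single table description (with a top-level "url").
    """
    # Treat a single-table description as a one-element table group.
    tables = metadata.get("tables", [metadata])
    # Resolve each relative "url" against the location of the metadata file.
    return [urljoin(metadata_url, t["url"]) for t in tables if "url" in t]
```

With this, a publisher could advertise only http://example/meta/data.json, and a processor would find the CSV from there without ever consulting .well-known/csvm.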

> And,
> since the Web is so big, I certainly wouldn't rule out a collision
> where it *is* misinterpreted as metadata.

It certainly is possible in theory that someone with a CSV resource at a 
particular URI could completely coincidentally and unintentionally 
create a JSON file with the exact name and exact contents -- including 
the URI of the CSV resource -- required to cause that JSON to be 
misinterpreted as metadata for the CSV file.  But it seems so unlikely 
that virtually any non-zero cost to prevent it would be a waste.

Furthermore, this is *exactly* the same risk that would *already* be 
present if the CSVW processor started with the JSON URI instead of the 
CSV URI: If the JSON *accidentally* looks like CSVW metadata and 
*accidentally* contains the URI of an existing CSV resource, then that 
CSV resource will be misinterpreted, regardless of the content of 
.well-known/csvm , because a CSVW processor must ignore .well-known/csvm 
if it is given CSVW metadata to start with, as described in section 6.1:
http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
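
As a rough sketch of the "explicit reference" rule from section 5.3, the check a processor would apply to a metadata file found at a default location might look like the following. The function name is hypothetical, the metadata is assumed to be parsed JSON, and the comparison is a plain string match on resolved URLs with no URL normalization, which a real implementation would probably need.

```python
from urllib.parse import urljoin


def references_table(metadata, metadata_url, csv_url):
    """Return True if the metadata document explicitly names csv_url.

    Assumption: a table group lists tables under "tables"; a single-table
    description has "url" at the top level. Relative URLs are resolved
    against the metadata document's own location.
    """
    tables = metadata.get("tables", [metadata])
    for table in tables:
        if "url" in table and urljoin(metadata_url, table["url"]) == csv_url:
            return True
    # No explicit reference: per section 5.3 the metadata must be ignored.
    return False
```

For an accidental collision to matter, an unrelated JSON file would have to pass this check for the specific CSV URI in question.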

>
> Earlier, you talked about the downsides:
>
>> - A *required* extra web access, nearly *every* time a conforming
>> CSVW processor is given a tabular data URL and wishes to find the
>> associated metadata -- because surely
>> http://example/.well-known/csvm will be 404 (and not cacheable) in
>> the vast majority of cases.
>
> Why is that bad? HTTP requests can be parallelised, so it's not
> latency. Is the extra request processing *really* that much of an
> overhead (considering we're talking about a comma- or tab- delimited
> file)?

It's not a big cost, but it is an actual cost, and it's being weighed 
against a benefit that IMO is largely theoretical.

>
> As I pointed out earlier, you can specify a default heuristic for 404
> on that resource so that you avoid it being uncacheable.

I doubt many server owners will bother to make that 404 cacheable, given
that they didn't bother to install a .well-known/csvm file.
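
For what it's worth, a default heuristic of the kind Mark describes could be applied on the client side even when the server says nothing about caching. The sketch below is only an illustration under that assumption: the class name, the per-origin keying, and the TTL are all hypothetical, and nothing like this is defined in the CSVW drafts.

```python
from urllib.parse import urlsplit


class WellKnownNegativeCache:
    """Remember, per origin, that .well-known/csvm recently returned 404."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds          # hypothetical default freshness lifetime
        self.clock = clock              # callable returning the current time in seconds
        self._missing_until = {}        # origin -> time at which the 404 goes stale

    @staticmethod
    def _origin(url):
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}"

    def record_404(self, csv_url):
        # One 404 covers every CSV on the same origin, since the
        # .well-known location is per origin.
        self._missing_until[self._origin(csv_url)] = self.clock() + self.ttl

    def known_missing(self, csv_url):
        expiry = self._missing_until.get(self._origin(csv_url))
        return expiry is not None and self.clock() < expiry
```

A processor would call known_missing() before requesting .well-known/csvm and skip the request when it returns True. That reduces the repeated 404s, though it is of course more code again.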

>
>> - Greater complexity in all conforming CSVW implementations.
>
> I don't find this convincing; if we were talking about some involved
> scheme that involved lots of processing and tricky syntax, sure, but
> this is extremely simple, and all of the code to support it
> (libraries for HTTP, Link header parsing and URI Templates) is
> already at hand in most cases.

I agree that it's not a lot of additional complexity -- in fact it's 
quite simple -- but it *is* additional code.

>
>> - Reduced security, because a change to .well-known/csvm could
>> completely change the interpretation of a given tabular data file,
>> and that change would be far afield from the directory containing
>> the data file, and thus may go completely unnoticed by the owner of
>> the data file.
>
> That's true of many aspects of the Web architecture; if someone has
> the ability to modify the origin, the origin's security is
> compromised.
>
> Overall - it's a pretty bad tradeoff IMO. Other TAG members may feel
> differently.

Thanks for your comments!

David Booth

>
> Cheers,
>
> -- Mark Nottingham   https://www.mnot.net/
>
>
>
>
>
Received on Friday, 19 June 2015 07:33:27 UTC