Re: Use of .well-known for CSV metadata: More harm than good from Mark Nottingham on 2015-06-19 (www-tag@w3.org from June 2015)

From: Mark Nottingham <mnot@mnot.net>
Date: Fri, 19 Jun 2015 14:29:02 +1000
To: David Booth <david@dbooth.org>
Cc: "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <84319074-D12D-4E7C-9AAA-5CBBAFBA0A2F@mnot.net>

> On 19 Jun 2015, at 1:28 pm, David Booth <david@dbooth.org> wrote:
> 
> On 06/18/2015 10:08 PM, Mark Nottingham wrote:
>> Hi David,
>> 
>>> On 19 Jun 2015, at 5:15 am, David Booth <david@dbooth.org> wrote:
>> 
>> […]
>> 
>>> What distinguishes this case is that a tabular metadata file must
>>> *explicitly* reference the associated data document in order for it
>>> to be used as a CSVW metadata document.  This is a critical point,
>>> which IMO changes the balance of the situation.
>> 
>> Where is this specified?
> 
> Section 5.3:
> http://w3c.github.io/csvw/syntax/#h-standard-file-metadata
> "If the metadata file found at this location does not explicitly include
> a reference to the relevant tabular data file then it MUST be ignored."

Ah, I see - thanks.
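
For concreteness, a rough sketch of what that "explicit reference" check might look like. It assumes the metadata is JSON with either a top-level "url" or a "tables" array whose members carry "url", and that relative URLs resolve against the metadata document's own location; that is one reading of the draft, not something quoted from it. The function name is made up for illustration.

```python
from urllib.parse import urljoin


def references_table(metadata, metadata_url, csv_url):
    """Return True if the parsed metadata explicitly points at csv_url.

    Assumption: a metadata document describes either one table (top-level
    "url") or a group of tables ("tables": [{"url": ...}, ...]).
    """
    tables = metadata.get("tables", [metadata])
    for table in tables:
        url = table.get("url")
        # Relative references are resolved against the metadata location.
        if url is not None and urljoin(metadata_url, url) == csv_url:
            return True
    return False
```

A processor following the quoted rule would ignore any metadata document for which this returns False.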

I don't think this changes much. Effectively, I see this as saying that because we're not willing to make assumptions about the structure of a CSV file, we're going to make assumptions about the URI space *and* the structure of the JSON that we might retrieve from it (unless we're willing to enforce a specific media type for it). 

It also doesn't address much of the issue at hand. For example, if a server already has a resource at that location, it's cold comfort that it won't be accidentally misinterpreted as metadata for the CSV; they still have to either move that resource (not "cool") or use a Link header to locate metadata (which we're told is too onerous). And, since the Web is so big, I certainly wouldn't rule out a collision where it *is* misinterpreted as metadata.

Earlier, you talked about the downsides:

> - A *required* extra web access, nearly *every* time a conforming CSVW processor is given a tabular data URL and wishes to find the associated metadata -- because surely http://example/.well-known/csvm will be 404 (and not cachable) in the vast majority of cases.

Why is that bad? HTTP requests can be parallelised, so it's not latency. Is the extra request processing *really* that much of an overhead (considering we're talking about a comma- or tab-delimited file)?

As I pointed out earlier, you can specify a default heuristic freshness lifetime for a 404 on that resource, so that the response is cacheable rather than refetched every time.
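
To illustrate both points, here is a minimal sketch, not a real client. The `fetch` callable (returning a `(status, body)` pair), the `NegativeCache` class and the one-hour default are all hypothetical; a real processor would more likely rely on its HTTP library's cache.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

DEFAULT_404_TTL = 3600  # seconds; an arbitrary illustrative default


class NegativeCache:
    """Remembers URLs that recently returned 404."""

    def __init__(self, ttl=DEFAULT_404_TTL):
        self.ttl = ttl
        self._missing = {}  # url -> expiry timestamp

    def remember_missing(self, url, now):
        self._missing[url] = now + self.ttl

    def is_known_missing(self, url, now):
        return self._missing.get(url, 0) > now


def fetch_data_and_metadata(csv_url, fetch, cache, now=time.time):
    """Fetch the CSV and the site-wide metadata concurrently."""
    meta_url = urljoin(csv_url, "/.well-known/csvm")

    def fetch_meta():
        if cache.is_known_missing(meta_url, now()):
            return None  # still fresh: skip the request entirely
        status, body = fetch(meta_url)
        if status == 404:
            cache.remember_missing(meta_url, now())
            return None
        return body

    # Both requests are in flight at once, so the metadata lookup
    # does not add a round trip in front of the data.
    with ThreadPoolExecutor(max_workers=2) as pool:
        data_future = pool.submit(fetch, csv_url)
        meta_future = pool.submit(fetch_meta)
        return data_future.result(), meta_future.result()
```

With something like this, the common 404 case costs one request per origin per freshness lifetime, and that request overlaps with fetching the data.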

> - Greater complexity in all conforming CSVW implementations.

I don't find this convincing; if we were talking about some elaborate scheme that required lots of processing and tricky syntax, sure, but this is extremely simple, and all of the code to support it (libraries for HTTP, Link header parsing and URI Templates) is already at hand in most cases.
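
As a rough indication of the size of the Link header part, here is a deliberately simplistic sketch. It assumes the metadata link uses rel="describedby" (an assumption about the draft), and the regular expressions do not handle every legal header, e.g. commas inside quoted parameter values; a real implementation would use an existing parser.

```python
import re
from urllib.parse import urljoin

# One link-value: <target> followed by its parameters, up to the next comma.
_LINK = re.compile(r'\s*<([^>]*)>([^,]*)')
# One parameter: ;name="value" or ;name=value
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def find_metadata_link(link_header, base_url, rel="describedby"):
    """Return the absolute URL of the first link with the given rel, or None."""
    for link in _LINK.finditer(link_header):
        target, params = link.group(1), link.group(2)
        for param in _PARAM.finditer(params):
            name = param.group(1).lower()
            value = param.group(2) if param.group(2) is not None else param.group(3)
            # rel may hold several space-separated relation types.
            if name == "rel" and rel in value.lower().split():
                return urljoin(base_url, target)
    return None
```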

> - Reduced security, because a change to .well-known/csvm could completely change the interpretation of a given tabular data file, and that change would be far afield from the directory containing the data file, and thus may go completely unnoticed by the owner of the data file.

That's true of many aspects of the Web architecture; if someone has the ability to modify the origin, the origin's security is compromised. 

Overall - it's a pretty bad tradeoff IMO. Other TAG members may feel differently.

Cheers,

--
Mark Nottingham   https://www.mnot.net/

Received on Friday, 19 June 2015 04:29:32 UTC