
Re: Use of .well-known for CSV metadata: More harm than good

From: Mark Nottingham <mnot@mnot.net>
Date: Mon, 22 Jun 2015 13:08:44 +1000
Cc: "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <3CF0BA63-FF81-48FA-AC02-FE2C3D7DEA35@mnot.net>
To: David Booth <david@dbooth.org>

> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>> And,
>> since the Web is so big, I certainly wouldn't rule out a collision
>> where it *is* misinterpreted as metadata.
> 
> It certainly is possible in theory that someone with a CSV resource at a particular URI could completely coincidentally and unintentionally create a JSON file with the exact name and exact contents -- including the URI of the CSV resource -- required to cause that JSON to be misinterpreted as metadata for the CSV file.  But it seems so unlikely that virtually any non-zero cost to prevent it would be a waste.
> 
> Furthermore, this is *exactly* the same risk that would *already* be present if the CSVW processor started with the JSON URI instead of the CSV URI: if the JSON *accidentally* looks like CSVW metadata and *accidentally* contains the URI of an existing CSV resource, then that CSV resource will be misinterpreted, regardless of the content of .well-known/csvm, because a CSVW processor must ignore .well-known/csvm if it is given CSVW metadata to start with, as described in section 6.1:
> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables

Right, and the way we prevent that on the Web is by giving something a distinctive media type.
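
To make that concrete, here's a rough sketch of what a distinctive media type buys a consumer; the "application/csvm+json" value is the type I'd expect CSVW to register, but treat it as an assumption rather than a citation:

  # Rough sketch: decide whether a document is CSVW metadata from the
  # media type the server labels it with, not from its file name. The
  # "application/csvm+json" value is an assumption for illustration.
  import requests

  def looks_like_csvw_metadata(url):
      resp = requests.get(url, timeout=10)
      media_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
      return resp.ok and media_type == "application/csvm+json"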

AFAICT the audience you're designing this for is "CSV downloads that don't have any context (e.g., a direct link, rather than one from HTML) where the author has no ability to set Link headers." Is that correct?
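
For comparison, the context being set aside looks roughly like this on the client side; the rel value "describedby" is my reading of the draft, so treat it as an assumption:

  # Rough sketch of the Link-header path assumed to be unavailable here:
  # the CSV response itself points at its metadata. rel="describedby" is
  # an assumption taken from my reading of the CSVW draft.
  from urllib.parse import urljoin
  import requests

  resp = requests.get("http://example.org/data.csv")   # hypothetical URL
  link = resp.links.get("describedby")                  # parsed Link headers
  metadata_url = urljoin(resp.url, link["url"]) if link else None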


>> Earlier, you talked about the downsides:
>> 
>>> - A *required* extra web access, nearly *every* time a conforming
>>> CSVW processor is given a tabular data URL and wishes to find the
>>> associated metadata -- because surely
>>> http://example/.well-known/csvm will be 404 (and not cachable) in
>>> the vast majority of cases.
>> 
>> Why is that bad? HTTP requests can be parallelised, so it's not
>> latency. Is the extra request processing *really* that much of an
>> overhead (considering we're talking about a comma- or tab- delimited
>> file)?
> 
> It's not a big cost, but it is an actual cost, and it's being weighed against a benefit that IMO is largely theoretical.

In isolation, I agree that's the right technical determination. 

This isn't an isolated problem, however; there are lots of applications trying to stake a claim on various parts of URI space. The main reason I wrote the BCP was that writing protocols on top of HTTP has become popular, and a lot of folks wanted to define "standard" URI paths.

As such, this is really a problem of the commons; your small encroachment might not make a big impact on its own, but in concert with others, especially when the W3C as steward of the Web is seen doing this, it starts to have impact.

In ten years, I really don't want to have a list of "filenames I can't use on my Web site" because you wanted to save the overhead of a single request in 2015, especially when HTTP/2 makes requests really, really cheap.

Is that "theoretical"? I don't know, but I do think it's important. 


>> As I pointed out earlier, you can specify a default heuristic for 404
>> on that resource so that you avoid it being uncacheable.
> 
> I doubt many server owners will bother to make that 404 cachable, given that they didn't bother to install a .well-known/csvm file.

You misunderstand. You can specify a heuristic for how the 404 is interpreted on the *client* side; it tells consumers that if there's a 404 without freshness information, they can assume a specified default freshness lifetime.
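
Concretely, a minimal sketch of what that heuristic looks like in a consumer; the 24-hour default is an assumption standing in for whatever number the spec would pick:

  # Minimal sketch of a client-side heuristic for the .well-known/csvm
  # 404: if the 404 carries no freshness information, remember it for a
  # default period so repeat lookups cost nothing. The 24-hour figure is
  # an assumption, not a number from the spec.
  import time
  import requests

  DEFAULT_404_LIFETIME = 24 * 60 * 60          # assumed default, in seconds
  _negative_cache = {}                         # well-known URL -> expiry time

  def fetch_well_known_csvm(well_known_url):
      expiry = _negative_cache.get(well_known_url)
      if expiry and time.time() < expiry:
          return None                          # cached "not present"; no request
      resp = requests.get(well_known_url, timeout=10)
      if resp.status_code == 404:
          if "Cache-Control" not in resp.headers and "Expires" not in resp.headers:
              _negative_cache[well_known_url] = time.time() + DEFAULT_404_LIFETIME
          return None
      resp.raise_for_status()
      return resp.text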


>>> - Greater complexity in all conforming CSVW implementations.
>> 
>> I don't find this convincing; if we were talking about some involved
>> scheme that involved lots of processing and tricky syntax, sure, but
>> this is extremely simple, and all of the code to support it
>> (libraries for HTTP, Link header parsing and URI Templates) is
>> already at hand in most cases.
> 
> I agree that it's not a lot of additional complexity -- in fact it's quite simple -- but it *is* additional code.

And I find that really unconvincing. If the bar for doing the right thing is so small and still can't be overcome, we're in a really bad place.
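
For what it's worth, here is roughly all of the code in question; the default template list and the use of a URI Template library are assumptions from my reading of the draft:

  # Rough sketch of the whole .well-known/csvm lookup, to show how small
  # the bar is. The default template list is taken from my reading of the
  # draft; treat it as an assumption.
  from urllib.parse import urljoin
  import requests
  from uritemplate import expand               # URI Templates, already at hand

  DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]

  def candidate_metadata_urls(csv_url):
      resp = requests.get(urljoin(csv_url, "/.well-known/csvm"), timeout=10)
      if resp.ok:
          templates = [line for line in resp.text.splitlines() if line.strip()]
      else:
          templates = DEFAULT_TEMPLATES
      # Expand each template with the tabular data URL and resolve the
      # result against it; these are the metadata locations to try.
      return [urljoin(csv_url, expand(t, url=csv_url)) for t in templates]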

Cheers,

--
Mark Nottingham   https://www.mnot.net/