
Re: Use of .well-known for CSV metadata: More harm than good

From: Mark Nottingham <mnot@mnot.net>
Date: Mon, 22 Jun 2015 13:08:44 +1000
Cc: "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <3CF0BA63-FF81-48FA-AC02-FE2C3D7DEA35@mnot.net>
To: David Booth <david@dbooth.org>

> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>> And,
>> since the Web is so big, I certainly wouldn't rule out a collision
>> where it *is* misinterpreted as metadata.
> 
> It certainly is possible in theory that someone with a CSV resource at a particular URI could completely coincidentally and unintentionally create a JSON file with the exact name and exact contents -- including the URI of the CSV resource -- required to cause that JSON to be misinterpreted as metadata for the CSV file.  But it seems so unlikely that virtually any non-zero cost to prevent it would be a waste.
> 
> Furthermore, this is *exactly* the same risk that would *already* be present if the CSVW processor started with the JSON URI instead of the CSV URI: if the JSON *accidentally* looks like CSVW metadata and *accidentally* contains the URI of an existing CSV resource, then that CSV resource will be misinterpreted, regardless of the content of .well-known/csvm, because a CSVW processor must ignore .well-known/csvm if it is given CSVW metadata to start with, as described in section 6.1:
> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables

Right, and the way we prevent that on the Web is by giving something a distinctive media type.
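
To make that concrete, here's a rough sketch of what a distinctive media type buys a consumer; the "application/csvm+json" value is the type I'd expect CSVW to register, but treat it as an assumption rather than a citation:

  # Rough sketch: decide whether a document is CSVW metadata from the
  # media type the server labels it with, not from its file name. The
  # "application/csvm+json" value is an assumption for illustration.
  import requests

  def looks_like_csvw_metadata(url):
      resp = requests.get(url, timeout=10)
      media_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
      return resp.ok and media_type == "application/csvm+json"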

AFAICT the audience you're designing this for is "CSV downloads that don't have any context (e.g., a direct link, rather than one from HTML) where the author has no ability to set Link headers." Is that correct?
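
For comparison, the context being set aside looks roughly like this on the client side; the rel value "describedby" is my reading of the draft, so treat it as an assumption:

  # Rough sketch of the Link-header path assumed to be unavailable here:
  # the CSV response itself points at its metadata. rel="describedby" is
  # an assumption taken from my reading of the CSVW draft.
  from urllib.parse import urljoin
  import requests

  resp = requests.get("http://example.org/data.csv")   # hypothetical URL
  link = resp.links.get("describedby")                  # parsed Link headers
  metadata_url = urljoin(resp.url, link["url"]) if link else None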


>> Earlier, you talked about the downsides:
>> 
>>> - A *required* extra web access, nearly *every* time a conforming
>>> CSVW processor is given a tabular data URL and wishes to find the
>>> associated metadata -- because surely
>>> http://example/.well-known/csvm will be 404 (and not cachable) in
>>> the vast majority of cases.
>> 
>> Why is that bad? HTTP requests can be parallelised, so it's not
>> latency. Is the extra request processing *really* that much of an
>> overhead (considering we're talking about a comma- or tab- delimited
>> file)?
> 
> It's not a big cost, but it is an actual cost, and it's being weighed against a benefit that IMO is largely theoretical.

In isolation, I agree that's the right technical determination. 

This isn't an isolated problem, however; there are lots of applications trying to stake a claim on various parts of URI space. The main reason I wrote the BCP was that writing protocols on top of HTTP has become popular, and a lot of folks wanted to define "standard" URI paths.

As such, this is really a problem of the commons; your small encroachment might not make a big impact on its own, but in concert with others, especially when the W3C as steward of the Web is seen doing this, it starts to have impact.

In ten years, I really don't want to have a list of "filenames I can't use on my Web site" because you wanted to save the overhead of a single request in 2015, especially when HTTP/2 makes requests really, really cheap.

Is that "theoretical"? I don't know, but I do think it's important. 


>> As I pointed out earlier, you can specify a default heuristic for 404
>> on that resource so that you avoid it being uncacheable.
> 
> I doubt many server owners will bother to make that 404 cachable, given that they didn't bother to install a .well-known/csvm file.

You misunderstand. You can specify a heuristic for how the 404 is interpreted on the *client* side; it tells consumers that if there's a 404 without freshness information, they can assume a specified default freshness lifetime.
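
Concretely, a minimal sketch of what that heuristic looks like in a consumer; the 24-hour default is an assumption standing in for whatever number the spec would pick:

  # Minimal sketch of a client-side heuristic for the .well-known/csvm
  # 404: if the 404 carries no freshness information, remember it for a
  # default period so repeat lookups cost nothing. The 24-hour figure is
  # an assumption, not a number from the spec.
  import time
  import requests

  DEFAULT_404_LIFETIME = 24 * 60 * 60          # assumed default, in seconds
  _negative_cache = {}                         # well-known URL -> expiry time

  def fetch_well_known_csvm(well_known_url):
      expiry = _negative_cache.get(well_known_url)
      if expiry and time.time() < expiry:
          return None                          # cached "not present"; no request
      resp = requests.get(well_known_url, timeout=10)
      if resp.status_code == 404:
          if "Cache-Control" not in resp.headers and "Expires" not in resp.headers:
              _negative_cache[well_known_url] = time.time() + DEFAULT_404_LIFETIME
          return None
      resp.raise_for_status()
      return resp.text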


>>> - Greater complexity in all conforming CSVW implementations.
>> 
>> I don't find this convincing; if we were talking about some involved
>> scheme that involved lots of processing and tricky syntax, sure, but
>> this is extremely simple, and all of the code to support it
>> (libraries for HTTP, Link header parsing and URI Templates) is
>> already at hand in most cases.
> 
> I agree that it's not a lot of additional complexity -- in fact it's quite simple -- but it *is* additional code.

And I find that really unconvincing. If the bar for doing the right thing is so small and still can't be overcome, we're in a really bad place.
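
For what it's worth, here is roughly all of the code in question; the default template list and the use of a URI Template library are assumptions from my reading of the draft:

  # Rough sketch of the whole .well-known/csvm lookup, to show how small
  # the bar is. The default template list is taken from my reading of the
  # draft; treat it as an assumption.
  from urllib.parse import urljoin
  import requests
  from uritemplate import expand               # URI Templates, already at hand

  DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]

  def candidate_metadata_urls(csv_url):
      resp = requests.get(urljoin(csv_url, "/.well-known/csvm"), timeout=10)
      if resp.ok:
          templates = [line for line in resp.text.splitlines() if line.strip()]
      else:
          templates = DEFAULT_TEMPLATES
      # Expand each template with the tabular data URL and resolve the
      # result against it; these are the metadata locations to try.
      return [urljoin(csv_url, expand(t, url=csv_url)) for t in templates]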

Cheers,

--
Mark Nottingham   https://www.mnot.net/