Re: Use of .well-known for CSV metadata: More harm than good

On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>
>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>>> And, since the Web is so big, I certainly wouldn't rule out a
>>> collision where it *is* misinterpreted as metadata.
>>
>> It certainly is possible in theory that someone with a CSV resource
>> at a particular URI could completely coincidentally and
>> unintentionally create a JSON file with the exact name and exact
>> contents -- including the URI of the CSV resource -- required to
>> cause that JSON to be misinterpreted as metadata for the CSV file.
>> But it seems so unlikely that virtually any non-zero cost to
>> prevent it would be a waste.

Actually, we really can rule out the possibility that a non-CSVW file 
would accidentally be misinterpreted as a CSVW metadata file.  For a 
non-CSVW file to be accidentally misinterpreted as CSVW metadata for a 
corresponding CSV data file, *all* of the following would have to be 
true of the non-CSVW file:

  - it would have to be in the same directory as the CSV data file;

  - it would have to have the name {+url}-metadata.json or 
metadata.json, where {+url} is the name of the CSV data file;

  - it would have to parse as JSON;

  - it would have to contain a top level JSON property called 
"@context", with a value of either the string 
"http://www.w3.org/ns/csvw" or an array containing that string;

  - it would have to explicitly reference the CSV data file; and

  - when interpreted as CSVW metadata, the schema it describes would 
have to be compatible with the actual schema of the CSV data file.  
Schema compatibility is defined as one would expect: for example, the 
same number of columns and the same column names (where present):
http://w3c.github.io/csvw/metadata/#schema-compatibility

Short of having an infinite number of monkeys typing, that just isn't 
going to happen accidentally.
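
To make the conditions above concrete, here is a minimal sketch in 
Python of the checks a processor might chain together.  It is an 
illustration only, not the algorithm from the spec: the function names 
are hypothetical, the "url" and "tables" property names are assumptions 
about where the metadata references the CSV file, and the 
schema-compatibility check is left out entirely.

```python
import json
from urllib.parse import urljoin

CSVW_CONTEXT = "http://www.w3.org/ns/csvw"


def candidate_metadata_urls(csv_url):
    """Default locations: {+url}-metadata.json, then metadata.json
    in the same directory as the CSV file."""
    return [csv_url + "-metadata.json", urljoin(csv_url, "metadata.json")]


def has_csvw_context(doc):
    """True if the top-level "@context" is the CSVW namespace string
    or an array containing that string."""
    if not isinstance(doc, dict):
        return False
    context = doc.get("@context")
    if context == CSVW_CONTEXT:
        return True
    return isinstance(context, list) and CSVW_CONTEXT in context


def referenced_csv_urls(doc, metadata_url):
    """Collect the CSV URLs the document names, resolved against the
    metadata file's own URL.

    Assumption: references live in a top-level "url" property or in the
    "url" property of each entry of a top-level "tables" array.
    """
    descriptions = [doc]
    tables = doc.get("tables")
    if isinstance(tables, list):
        descriptions.extend(t for t in tables if isinstance(t, dict))
    return {
        urljoin(metadata_url, d["url"])
        for d in descriptions
        if isinstance(d.get("url"), str)
    }


def could_be_metadata_for(text, metadata_url, csv_url):
    """Apply the location, JSON, @context and explicit-reference checks.
    Schema compatibility is NOT checked in this sketch."""
    if metadata_url not in candidate_metadata_urls(csv_url):
        return False
    try:
        doc = json.loads(text)
    except ValueError:  # not JSON at all
        return False
    if not has_csvw_context(doc):
        return False
    return csv_url in referenced_csv_urls(doc, metadata_url)
```

A file that fails any one of these checks is simply not treated as 
metadata for that CSV file, which is why an accidental match would need 
every check to pass at once.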

>>
>> Furthermore, this is *exactly* the same risk that would *already*
>> be present if the CSVW processor started with the JSON URI instead
>> of the CSV URI: If the JSON *accidentally* looks like CSVW metadata
>> and *accidentally* contains the URI of an existing CSV resource,
>> then that CSV resource will be misinterpreted, regardless of the
>> content of .well-known/csvm , because a CSVW processor must ignore
>> .well-known/csvm if it is given CSVW metadata to start with, as
>> described in section 6.1:
>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>
> Right, and the way we prevent that on the Web is by giving something
> a distinctive media type.
>
> AFAICT the audience you're designing this for is "CSV downloads that
> don't have any context (e.g., a direct link, rather than one from
> HTML) where the author has no ability to set Link headers." Is that
> correct?

Yes.

>
>
>>> Earlier, you talked about the downsides:
>>>
>>>> - A *required* extra web access, nearly *every* time a
>>>> conforming CSVW processor is given a tabular data URL and
>>>> wishes to find the associated metadata -- because surely
>>>> http://example/.well-known/csvm will be 404 (and not cacheable)
>>>> in the vast majority of cases.
>>>
>>> Why is that bad? HTTP requests can be parallelised, so it's not
>>> latency. Is the extra request processing *really* that much of
>>> an overhead (considering we're talking about a comma- or tab-
>>> delimited file)?
>>
>> It's not a big cost, but it is an actual cost, and it's being
>> weighed against a benefit that IMO is largely theoretical.
>
> In isolation, I agree that's the right technical determination.
>
> This isn't an isolated problem, however; there are lots of
> applications trying to stake a claim on various parts of URI space.
> The main reason that I wrote the BCP was because writing protocols on
> top of HTTP has become popular, and a lot of folks wanted to define
> "standard" URI paths.
>
> As such, this is really a problem of the commons; your small
> encroachment might not make a big impact on its own, but in concert
> with others — especially when the W3C as steward of the Web is seen
> doing this — it starts to have impact.
>
> In ten years, I really don't want to have a list of "filenames I
> can't use on my Web site" because you wanted to save the overhead of
> a single request in 2015 — especially when HTTP/2 makes requests
> really, really cheap.
>
> Is that "theoretical"? I don't know, but I do think it's important.

I share your concern.  I think we should be vigilant against URI 
squatting.  But although this case may look like URI squatting on the 
surface, I don't think it actually is when you dig into it.  The fact 
that the content of the CSV metadata file must *explicitly* indicate its 
intent to be used as a CSV metadata file changes the situation in a 
critical way, because it means that you are *not* prevented from using 
that filename for a different purpose.  That file will *only* be 
interpreted as a CSV metadata file if the owner explicitly indicates 
that it *should* be interpreted that way.  That's not squatting, that's 
the URI owner rightly exercising his/her choice.

The only case where there's any name conflict at all is if the URI owner 
wishes to use that URI for some other purpose *and* for serving CSV 
metadata, simultaneously.  In that case the URI owner would have to 
choose how to use that particular path.  But 
that's like trying to install two different software packages in the 
same directory: nobody expects to be able to do that, because both 
packages might have a makefile called 'makefile', or some other 
conflict.  Plus, intermingling files from different packages makes a 
mess of the directory.  If someone really wants to use both 
software packages simultaneously, they install them in *different* 
directories.  The same is true of CSV metadata: if you want to publish 
CSV data and metadata, using the standard metadata filename, *and* you 
want to use that same filename for some other purpose, then you will 
have to put one of them in a different directory.  No big deal.  That 
doesn't cause you to have to consult a list of "filenames you can't use 
on your website".

>
>>> As I pointed out earlier, you can specify a default heuristic for
>>> 404 on that resource so that you avoid it being uncacheable.
>>
>> I doubt many server owners will bother to make that 404 cacheable,
>> given that they didn't bother to install a .well-known/csvm file.
>
> You misunderstand. You can specify a heuristic for the 404 to be
> interpreted on the *client* side; it tells consumers that if there's
> a 404 without freshness information, they can assume a specified
> default.

Oh, I see.  Yes, I guess they could.
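
A client-side heuristic of the kind described above could look roughly 
like the following Python sketch.  Everything in it is an assumption 
made for illustration: the class and function names are hypothetical, 
the one-hour default is an arbitrary placeholder rather than a value 
from any spec, headers are taken to be a dict with lower-case names, 
and only Cache-Control is consulted (an Expires header is ignored).

```python
import time

DEFAULT_404_LIFETIME = 3600  # seconds; placeholder value, not from any spec


def explicit_lifetime(headers):
    """Return the lifetime in seconds the server asked for, or None if
    the response carried no freshness information we understand."""
    directives = [
        d.strip().lower() for d in headers.get("cache-control", "").split(",")
    ]
    if "no-store" in directives or "no-cache" in directives:
        return 0  # server explicitly opted out of caching
    for directive in directives:
        name, _, value = directive.partition("=")
        if name == "max-age" and value.isdigit():
            return int(value)
    return None


class NegativeCache:
    """Remembers 404 responses so the same lookup is not repeated."""

    def __init__(self, default_lifetime=DEFAULT_404_LIFETIME, clock=time.time):
        self.default_lifetime = default_lifetime
        self.clock = clock
        self._expires_at = {}

    def record_404(self, url, headers):
        lifetime = explicit_lifetime(headers)
        if lifetime is None:
            # No freshness information: fall back to the specified default.
            lifetime = self.default_lifetime
        self._expires_at[url] = self.clock() + lifetime

    def is_known_missing(self, url):
        expires_at = self._expires_at.get(url)
        return expires_at is not None and self.clock() < expires_at
```

A consumer would call is_known_missing() before requesting 
.well-known/csvm and record_404() after each 404, so the extra request 
is made at most once per default lifetime for a given site.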

>
>>>> - Greater complexity in all conforming CSVW implementations.
>>>
>>> I don't find this convincing; if we were talking about some
>>> involved scheme that involved lots of processing and tricky
>>> syntax, sure, but this is extremely simple, and all of the code
>>> to support it (libraries for HTTP, Link header parsing and URI
>>> Templates) is already at hand in most cases.
>>
>> I agree that it's not a lot of additional complexity -- in fact
>> it's quite simple -- but it *is* additional code.
>
> And I find that really unconvincing. If the bar for doing the right
> thing is so small and still can't be overcome, we're in a really bad
> place.

If it really were a matter of "doing the right thing" then I'd agree. 
But as explained above, in this case I don't think it is.  Please 
consider the above points, and see what you think.

Thanks,
David Booth

Received on Monday, 22 June 2015 06:40:26 UTC