Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE from David Booth on 2015-06-24 (www-tag@w3.org from June 2015)

From: David Booth <david@dbooth.org>
Date: Wed, 24 Jun 2015 00:52:41 -0400
To: Mark Nottingham <mnot@mnot.net>
CC: "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <558A3799.4030102@dbooth.org>
Hi Mark,

On 06/24/2015 12:30 AM, Mark Nottingham wrote:
> David,
>
>> On 24 Jun 2015, at 2:07 pm, David Booth <david@dbooth.org> wrote:
>>
>> The CSVW working group is anxious to close this issue, so I'd like
>> to ask members of the TAG: In light of the discussion below and
>> elsewhere in this thread, does anyone still think that .well-known
>> is necessary in this case to prevent harmful URI squatting?   If
>> so, why?
>>
>> I maintain that this case does not represent harmful URI squatting,
>> because URI owners are not prevented from using the standard CSVW
>> metadata URIs ({+url}-metadata.json or metadata.json) for other
>> purposes, because a file of that name will *only* be interpreted as
>> a CSVW metadata file if it explicitly indicates that it *should* be
>> interpreted that way.  So far only Mark Nottingham has expressed
>> concerns.  (I hope that the explanations below have since allayed
>> Mark's concerns, but I do not yet know if they have.)
>
> We discussed it on a TAG call and came to agreement; furthermore, I'd
> thought that .well-known was acceptable to CSVWG as well. As such, I
> think the relevant question is if TAG members think the issues you
> raise are sufficient to reopen the discussion.

Right.  As far as I could tell from the minutes, it seems that the 
assumption at the time was that harmful URI squatting would result if 
.well-known were not used.  I have now brought new information that 
explains more fully how the CSVW metadata mechanism works, which rather
conclusively shows that this mechanism would *not* cause harmful URI 
squatting without the use of .well-known.  Hence the previous decision 
should be changed.

>
> Again, I'm happy to have that discussion.

What more needs to be discussed?   Are there questions that still need 
to be answered or aspects of the mechanism that are still unclear?  If 
so, I'd like to get them on the table.

Does it need to be discussed on another call?   What information or 
process would be helpful in getting this resolved?

Thanks,
David Booth

>
> Cheers,
>
>
>> Thanks, David Booth
>>
>> On 06/22/2015 02:39 AM, David Booth wrote:
>>> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>>>
>>>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org>
>>>>> wrote:
>>>>>> And, since the Web is so big, I certainly wouldn't rule out
>>>>>> a collision where it *is* misinterpreted as metadata.
>>>>>
>>>>> It certainly is possible in theory that someone with a CSV
>>>>> resource at a particular URI could completely coincidentally
>>>>> and unintentionally create a JSON file with the exact name
>>>>> and exact contents -- including the URI of the CSV resource
>>>>> -- required to cause that JSON to be misinterpreted as
>>>>> metadata for the CSV file. But it seems so unlikely that
>>>>> virtually any non-zero cost to prevent it would be a waste.
>>>
>>> Actually, we really can rule out the possibility that a non-CSVW
>>> file would accidentally be misinterpreted as a CSVW metadata
>>> file.  For a non-CSVW file to be accidentally misinterpreted as
>>> CSVW metadata for a corresponding CSV data file, *all* of the
>>> following would have to be true of the non-CSVW file:
>>>
>>> - it would have to be in the same directory as the CSV data
>>> file;
>>>
>>> - it would have to have the name {+url}-metadata.json or
>>> metadata.json, where {+url} is the name of the CSV data file;
>>>
>>> - it would have to parse as JSON;
>>>
>>> - it would have to contain a top level JSON property called
>>> "@context", with a value of either the string
>>> "http://www.w3.org/ns/csvw" or an array containing that string;
>>>
>>> - it would have to explicitly reference the CSV data file; and
>>>
>>> - when interpreted as CSVW metadata, the schema described must
>>> be compatible with the actual schema of the CSV data file.
>>> Schema compatibility is defined as one would expect, such as the
>>> same number of columns, the same column names (where present),
>>> etc: http://w3c.github.io/csvw/metadata/#schema-compatibility
>>>
>>> Short of having an infinite number of monkeys typing, that just
>>> isn't going to happen accidentally.
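
For illustration only, the checklist above can be sketched in Python roughly
as follows. This is not taken from the CSVW drafts or from any real
processor: the function names are hypothetical, the property names "url",
"tables", "tableSchema", "columns" and "name" are assumed from the CSVW
metadata vocabulary, and the schema-compatibility test is reduced to an
exact comparison of column names, which is much cruder than the rules the
spec defines.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def candidate_metadata_urls(csv_url):
    """Default locations named in the thread: {+url}-metadata.json and
    metadata.json in the same directory as the CSV file."""
    return [csv_url + "-metadata.json", urljoin(csv_url, "metadata.json")]


def looks_like_csvw_metadata(body, metadata_url, csv_url, csv_header):
    """Return True only if every condition in the checklist holds."""
    # Conditions 1 and 2: same directory, and one of the standard names.
    if metadata_url not in candidate_metadata_urls(csv_url):
        return False
    # Condition 3: the body must parse as JSON.
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    if not isinstance(doc, dict):
        return False
    # Condition 4: top-level "@context" is the CSVW namespace string,
    # or an array containing it.
    ctx = doc.get("@context")
    if not (ctx == CSVW_NS or (isinstance(ctx, list) and CSVW_NS in ctx)):
        return False
    # Assumption: the file describes either one table (top-level "url")
    # or a group of tables under "tables".
    tables = doc.get("tables")
    if not isinstance(tables, list):
        tables = [doc]
    for table in tables:
        if not isinstance(table, dict) or not isinstance(table.get("url"), str):
            continue
        # Condition 5: the metadata must explicitly reference the CSV file.
        if urljoin(metadata_url, table["url"]) != csv_url:
            continue
        # Condition 6 (heavily simplified): column names must match the header.
        schema = table.get("tableSchema")
        columns = schema.get("columns", []) if isinstance(schema, dict) else []
        names = [c.get("name") for c in columns if isinstance(c, dict)]
        if names == list(csv_header):
            return True
    return False
```

With a hypothetical CSV file at http://example.org/data/tree.csv, an
arbitrary JSON file that happens to be named metadata.json in the same
directory fails at the "@context" check and is simply not treated as
metadata.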
>>>
>>>>>
>>>>> Furthermore, this is *exactly* the same risk that would
>>>>> *already* be present if the CSVW processor started with the
>>>>> JSON URI instead of the CSV URI: If the JSON *accidentally*
>>>>> looks like CSVW metadata and *accidentally* contains the URI
>>>>> of an existing CSV resource, then that CSV resource will be
>>>>> misinterpreted, regardless of the content of .well-known/csvm,
>>>>> because a CSVW processor must ignore .well-known/csvm if it
>>>>> is given CSVW metadata to start with, as described in section
>>>>> 6.1:
>>>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>>>
>>>> Right, and the way we prevent that on the Web is by giving something
>>>> a distinctive media type.
>>>>
>>>> AFAICT the audience you're designing this for is "CSV downloads
>>>> that don't have any context (e.g., a direct link, rather than
>>>> one from HTML) where the author has no ability to set Link
>>>> headers." Is that correct?
>>>
>>> Yes.
>>>
>>>>
>>>>
>>>>>> Earlier, you talked about the downsides:
>>>>>>
>>>>>>> - A *required* extra web access, nearly *every* time a
>>>>>>> conforming CSVW processor is given a tabular data URL
>>>>>>> and wishes to find the associated metadata -- because
>>>>>>> surely http://example/.well-known/csvm will be 404 (and
>>>>>>> not cachable) in the vast majority of cases.
>>>>>>
>>>>>> Why is that bad? HTTP requests can be parallelised, so it's
>>>>>> not latency. Is the extra request processing *really* that
>>>>>> much of an overhead (considering we're talking about a
>>>>>> comma- or tab- delimited file)?
>>>>>
>>>>> It's not a big cost, but it is an actual cost, and it's
>>>>> being weighed against a benefit that IMO is largely
>>>>> theoretical.
>>>>
>>>> In isolation, I agree that's the right technical
>>>> determination.
>>>>
>>>> This isn't an isolated problem, however; there are lots of
>>>> applications trying to stake a claim on various parts of URI
>>>> space. The main reason that I wrote the BCP was because writing
>>>> protocols on top of HTTP has become popular, and a lot of folks
>>>> wanted to define "standard" URI paths.
>>>>
>>>> As such, this is really a problem of the commons; your small
>>>> encroachment might not make a big impact on its own, but in
>>>> concert with others — especially when the W3C as steward of the
>>>> Web is seen doing this — it starts to have impact.
>>>>
>>>> In ten years, I really don't want to have a list of "filenames
>>>> I can't use on my Web site" because you wanted to save the
>>>> overhead of a single request in 2015 — especially when HTTP/2
>>>> makes requests really, really cheap.
>>>>
>>>> Is that "theoretical"? I don't know, but I do think it's
>>>> important.
>>>
>>> I share your concern.  I think we should be vigilant against URI
>>> squatting.  But although this case may look like URI squatting on
>>> the surface, I don't think it actually is when you dig into it.
>>> The fact that the content of the CSV metadata file must
>>> *explicitly* indicate its intent to be used as a CSV metadata
>>> file changes the situation in a critical way, because it means
>>> that you are *not* prevented from using that filename for a
>>> different purpose.  That file will *only* be interpreted as a CSV
>>> metadata file if the owner explicitly indicates that it *should*
>>> be interpreted that way.  That's not squatting, that's the URI
>>> owner rightly exercising his/her choice.
>>>
>>> The only case where there's any name conflict at all is if the
>>> URI owner wishes to use that URI for some other purpose *and* for
>>> serving CSV metadata, simultaneously.  In that case the URI owner
>>> would have to make a choice about how he/she chooses to use that
>>> particular path.  But that's like trying to install two different
>>> software packages in the same directory: nobody expects to be
>>> able to do that, because both packages might have a Make file
>>> called 'makefile', or some other conflict.  Plus it makes a mess
>>> of the directory having files of different packages intermingled.
>>> If someone really wants to use both software packages
>>> simultaneously, they install them in *different* directories.
>>> The same is true of CSV metadata: if you want to publish CSV data
>>> and metadata, using the standard metadata filename, *and* you
>>> want to use that same filename for some other purpose, then you
>>> will have to put one of them in a different directory.  No big
>>> deal.  That doesn't cause you to have to consult a list of
>>> "filenames you can't use on your website".
>>>
>>>>
>>>>>> As I pointed out earlier, you can specify a default
>>>>>> heuristic for 404 on that resource so that you avoid it
>>>>>> being uncacheable.
>>>>>
>>>>> I doubt many server owners will bother to make that 404
>>>>> cachable, given that they didn't bother to install a
>>>>> .well-known/csvm file.
>>>>
>>>> You misunderstand. You can specify a heuristic for the 404 to
>>>> be interpreted on the *client* side; it tells consumers that if
>>>> there's a 404 without freshness information, they can assume a
>>>> specified default.
>>>
>>> Oh, I see.  Yes, I guess they could.
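
A minimal sketch of that client-side heuristic, assuming a consumer that
keeps its own cache of .well-known/csvm lookups. The function name and the
one-hour default are hypothetical (the thread names no value), and the
header handling is deliberately simplified.

```python
import re

# Hypothetical default, in seconds; no value is given in the thread.
DEFAULT_404_TTL = 3600


def ttl_for_lookup(status, headers, default_404_ttl=DEFAULT_404_TTL):
    """Seconds a cached response may be reused; 0 means always refetch."""
    h = {k.lower(): v for k, v in headers.items()}
    cache_control = h.get("cache-control", "").lower()
    # The server explicitly forbids reuse: honour that.
    if "no-store" in cache_control or "no-cache" in cache_control:
        return 0
    # The server supplied explicit freshness information: use it.
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        return int(match.group(1))
    if "expires" in h:
        # Simplification: date arithmetic is omitted, so be conservative.
        return 0
    # A 404 with no freshness information gets the specified default.
    if status == 404:
        return default_404_ttl
    return 0
```

Under these assumptions a bare 404 is reused for an hour, while any
freshness information the server does send still takes precedence.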
>>>
>>>>
>>>>>>> - Greater complexity in all conforming CSVW
>>>>>>> implementations.
>>>>>>
>>>>>> I don't find this convincing; if we were talking about
>>>>>> some involved scheme that involved lots of processing and
>>>>>> tricky syntax, sure, but this is extremely simple, and all
>>>>>> of the code to support it (libraries for HTTP, Link header
>>>>>> parsing and URI Templates) is already at hand in most
>>>>>> cases.
>>>>>
>>>>> I agree that it's not a lot of additional complexity -- in
>>>>> fact it's quite simple -- but it *is* additional code.
>>>>
>>>> And I find that really unconvincing. If the bar for doing the
>>>> right thing is so small and still can't be overcome, we're in a
>>>> really bad place.
>>>
>>> If it really were a matter of "doing the right thing" then I'd
>>> agree. But as explained above, in this case I don't think it is.
>>> Please consider the above points, and see what you think.
>>>
>>> Thanks, David Booth
>
> -- Mark Nottingham   https://www.mnot.net/
Received on Wednesday, 24 June 2015 04:53:27 UTC