Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE from Mark Nottingham on 2015-06-24 (www-tag@w3.org from June 2015)

From: Mark Nottingham <mnot@mnot.net>
Date: Wed, 24 Jun 2015 14:30:36 +1000
To: David Booth <david@dbooth.org>
Cc: "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <3FD3F2C6-DD53-4A31-8839-1E12FFE038B5@mnot.net>

David,

> On 24 Jun 2015, at 2:07 pm, David Booth <david@dbooth.org> wrote:
> 
> The CSVW working group is anxious to close this issue, so I'd like to ask members of the TAG: In light of the discussion below and elsewhere in this thread, does anyone still think that .well-known is necessary in this case to prevent harmful URI squatting?   If so, why?
> 
> I maintain that this case does not represent harmful URI squatting, because URI owners are not prevented from using the standard CSVW metadata URIs ({+url}-metadata.json or metadata.json) for other purposes, because a file of that name will *only* be interpreted as a CSVW metadata file if it explicitly indicates that it *should* be interpreted that way.  So far only Mark Nottingham has expressed concerns.  (I hope that the explanations below have since allayed Mark's concerns, but I do not yet know if they have.)

We discussed it on a TAG call and came to agreement; furthermore, I'd thought that .well-known was acceptable to CSVWG as well. As such, I think the relevant question is whether TAG members think the issues you raise are sufficient to reopen the discussion.

Again, I'm happy to have that discussion.

Cheers,


> Thanks,
> David Booth
> 
> On 06/22/2015 02:39 AM, David Booth wrote:
>> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>> 
>>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>>>>> And, since the Web is so big, I certainly wouldn't rule out a
>>>>> collision where it *is* misinterpreted as metadata.
>>>> 
>>>> It certainly is possible in theory that someone with a CSV resource
>>>> at a particular URI could completely coincidentally and
>>>> unintentionally create a JSON file with the exact name and exact
>>>> contents -- including the URI of the CSV resource -- required to
>>>> cause that JSON to be misinterpreted as metadata for the CSV file.
>>>> But it seems so unlikely that virtually any non-zero cost to
>>>> prevent it would be a waste.
>> 
>> Actually, we really can rule out the possibility that a non-CSVW file
>> would accidentally be misinterpreted as a CSVW metadata file.  For a
>> non-CSVW file to be accidentally misinterpreted as CSVW metadata for a
>> corresponding CSV data file, *all* of the following would have to be
>> true of the non-CSVW file:
>> 
>>  - it would have to be in the same directory as the CSV data file;
>> 
>>  - it would have to have the name {+url}-metadata.json or metadata.json,
>> where {+url} is the name of the CSV data file;
>> 
>>  - it would have to parse as JSON;
>> 
>>  - it would have to contain a top level JSON property called
>> "@context", with a value of either the string
>> "http://www.w3.org/ns/csvw" or an array containing that string;
>> 
>>  - it would have to explicitly reference the CSV data file; and
>> 
>>  - when interpreted as CSVW metadata, the schema described must be
>> compatible with the actual schema of the CSV data file.  Schema
>> compatibility is defined as one would expect, such as the same number of
>> columns, the same column names (where present), etc:
>> http://w3c.github.io/csvw/metadata/#schema-compatibility
>> 
>> Short of having an infinite number of monkeys typing, that just isn't
>> going to happen accidentally.
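
The conditions listed above can be sketched as a small check. This is only an illustration, not the CSVW processor algorithm: the function names are hypothetical, the lookup of a "url" property (at the top level or inside "tables") is an assumption about the metadata layout, and the final schema-compatibility condition is left out.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def candidate_metadata_urls(csv_url):
    """Default locations named in the thread: {+url}-metadata.json and metadata.json."""
    return [csv_url + "-metadata.json", urljoin(csv_url, "metadata.json")]


def has_csvw_context(doc):
    """True if "@context" is the CSVW namespace string or an array containing it."""
    ctx = doc.get("@context")
    if isinstance(ctx, str):
        return ctx == CSVW_NS
    if isinstance(ctx, list):
        return CSVW_NS in ctx
    return False


def references_csv(doc, metadata_url, csv_url):
    """Assumption: the CSV file is named by a "url" property, either at the
    top level or in each entry of a "tables" array, relative to the metadata."""
    tables = doc.get("tables", [doc])
    if not isinstance(tables, list):
        return False
    for table in tables:
        if isinstance(table, dict) and isinstance(table.get("url"), str):
            if urljoin(metadata_url, table["url"]) == csv_url:
                return True
    return False


def is_csvw_metadata_for(text, metadata_url, csv_url):
    """Apply the checks in order; any failure means "not metadata for this CSV"."""
    try:
        doc = json.loads(text)
    except ValueError:  # does not parse as JSON
        return False
    if not isinstance(doc, dict):
        return False
    return has_csvw_context(doc) and references_csv(doc, metadata_url, csv_url)
```

Under these assumptions, an unrelated metadata.json that happens to sit next to a CSV file fails at the "@context" check and is simply ignored.
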
>> 
>>>> 
>>>> Furthermore, this is *exactly* the same risk that would *already*
>>>> be present if the CSVW processor started with the JSON URI instead
>>>> of the CSV URI: If the JSON *accidentally* looks like CSVW metadata
>>>> and *accidentally* contains the URI of an existing CSV resource,
>>>> then that CSV resource will be misinterpreted, regardless of the
>>>> content of .well-known/csvm , because a CSVW processor must ignore
>>>> .well-known/csvm if it is given CSVW metadata to start with, as
>>>> described in section 6.1:
>>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>> 
>>> Right, and the way we prevent that on the Web is by giving something
>>> a distinctive media type.
>>> 
>>> AFAICT the audience you're designing this for is "CSV downloads that
>>> don't have any context (e.g., a direct link, rather than one from
>>> HTML) where the author has no ability to set Link headers." Is that
>>> correct?
>> 
>> Yes.
>> 
>>> 
>>> 
>>>>> Earlier, you talked about the downsides:
>>>>> 
>>>>>> - A *required* extra web access, nearly *every* time a
>>>>>> conforming CSVW processor is given a tabular data URL and
>>>>>> wishes to find the associated metadata -- because surely
>>>>>> http://example/.well-known/csvm will be 404 (and not cacheable)
>>>>>> in the vast majority of cases.
>>>>> 
>>>>> Why is that bad? HTTP requests can be parallelised, so it's not
>>>>> latency. Is the extra request processing *really* that much of
>>>>> an overhead (considering we're talking about a comma- or tab-
>>>>> delimited file)?
>>>> 
>>>> It's not a big cost, but it is an actual cost, and it's being
>>>> weighed against a benefit that IMO is largely theoretical.
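
The point about parallel requests can be illustrated roughly as follows. This is a sketch only: `first_available` and the `fetch` callable are hypothetical, and `fetch` is assumed to return the response body, or None for a 404, for a given URL.

```python
from concurrent.futures import ThreadPoolExecutor


def first_available(urls, fetch):
    """Request every candidate location at once, then return (url, body)
    for the highest-priority one that exists, or None if none do."""
    if not urls:
        return None
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        # pool.map preserves input order, so priority order is kept.
        bodies = list(pool.map(fetch, urls))
    for url, body in zip(urls, bodies):
        if body is not None:
            return url, body
    return None
```

Because the lookups overlap, the elapsed time is roughly that of the slowest single request rather than the sum of all of them; the extra request still costs server and network resources.
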
>>> 
>>> In isolation, I agree that's the right technical determination.
>>> 
>>> This isn't an isolated problem, however; there are lots of
>>> applications trying to stake a claim on various parts of URI space.
>>> The main reason that I wrote the BCP was because writing protocols on
>>> top of HTTP has become popular, and a lot of folks wanted to define
>>> "standard" URI paths.
>>> 
>>> As such, this is really a problem of the commons; your small
>>> encroachment might not make a big impact on its own, but in concert
>>> with others — especially when the W3C as steward of the Web is seen
>>> doing this — it starts to have impact.
>>> 
>>> In ten years, I really don't want to have a list of "filenames I
>>> can't use on my Web site" because you wanted to save the overhead of
>>> a single request in 2015 — especially when HTTP/2 makes requests
>>> really, really cheap.
>>> 
>>> Is that "theoretical"? I don't know, but I do think it's important.
>> 
>> I share your concern.  I think we should be vigilant against URI
>> squatting.  But although this case may look like URI squatting on the
>> surface, I don't think it actually is when you dig into it.  The fact
>> that the content of the CSV metadata file must *explicitly* indicate its
>> intent to be used as a CSV metadata file changes the situation in a
>> critical way, because it means that you are *not* prevented from using
>> that filename for a different purpose.  That file will *only* be
>> interpreted as a CSV metadata file if the owner explicitly indicates
>> that it *should* be interpreted that way.  That's not squatting, that's
>> the URI owner rightly exercising his/her choice.
>> 
>> The only case where there's any name conflict at all is if the URI owner
>> wishes to use that URI for some other purpose *and* for serving CSV
>> metadata, simultaneously.  In that case the URI owner would have to make
>> a choice about how he/she chooses to use that particular path.  But
>> that's like trying to install two different software packages in the
>> same directory: nobody expects to be able to do that, because both
>> packages might have a Make file called 'makefile', or some other
>> conflict.  Plus it makes a mess of the directory having files of
>> different packages intermingled.   If someone really wants to use both
>> software packages simultaneously, they install them in *different*
>> directories.  The same is true of CSV metadata: if you want to publish
>> CSV data and metadata, using the standard metadata filename, *and* you
>> want to use that same filename for some other purpose, then you will
>> have to put one of them in a different directory.  No big deal.  That
>> doesn't cause you to have to consult a list of "filenames you can't use
>> on your website".
>> 
>>> 
>>>>> As I pointed out earlier, you can specify a default heuristic for
>>>>> 404 on that resource so that you avoid it being uncacheable.
>>>> 
>>>> I doubt many server owners will bother to make that 404 cacheable,
>>>> given that they didn't bother to install a .well-known/csvm file.
>>> 
>>> You misunderstand. You can specify a heuristic for the 404 to be
>>> interpreted on the *client* side; it tells consumers that if there's
>>> a 404 without freshness information, they can assume a specified
>>> default.
>> 
>> Oh, I see.  Yes, I guess they could.
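
A client-side default of this kind might look like the sketch below. The class name, the one-hour default and the injectable clock are all assumptions made for illustration; the thread does not name a value or an API.

```python
import time

DEFAULT_404_TTL = 3600  # seconds; hypothetical default, not taken from any spec


class NegativeCache:
    """Remembers 404 responses so a client need not re-request them."""

    def __init__(self, default_ttl=DEFAULT_404_TTL, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._clock = clock
        self._expires = {}  # url -> time at which the 404 stops being fresh

    def record_404(self, url, max_age=None):
        """Store a 404. If the response carried no freshness information
        (max_age is None), fall back to the specified default."""
        ttl = self._default_ttl if max_age is None else max_age
        self._expires[url] = self._clock() + ttl

    def is_known_missing(self, url):
        """True while a recorded 404 for this URL is still fresh."""
        expires = self._expires.get(url)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires[url]
            return False
        return True
```

A consumer would call `is_known_missing` before requesting the .well-known location for a host and skip the request while the answer is True.
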
>> 
>>> 
>>>>>> - Greater complexity in all conforming CSVW implementations.
>>>>> 
>>>>> I don't find this convincing; if we were talking about some
>>>>> elaborate scheme that involved lots of processing and tricky
>>>>> syntax, sure, but this is extremely simple, and all of the code
>>>>> to support it (libraries for HTTP, Link header parsing and URI
>>>>> Templates) is already at hand in most cases.
>>>> 
>>>> I agree that it's not a lot of additional complexity -- in fact
>>>> it's quite simple -- but it *is* additional code.
>>> 
>>> And I find that really unconvincing. If the bar for doing the right
>>> thing is so small and still can't be overcome, we're in a really bad
>>> place.
>> 
>> If it really were a matter of "doing the right thing" then I'd agree.
>> But as explained above, in this case I don't think it is.  Please
>> consider the above points, and see what you think.
>> 
>> Thanks,
>> David Booth

--
Mark Nottingham   https://www.mnot.net/
Received on Wednesday, 24 June 2015 04:31:13 UTC