Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE from David Booth on 2015-06-24 (www-tag@w3.org from June 2015)

From: David Booth <david@dbooth.org>
Date: Wed, 24 Jun 2015 00:07:28 -0400
To: "www-tag@w3.org List" <www-tag@w3.org>
CC: Mark Nottingham <mnot@mnot.net>
Message-ID: <558A2D00.2040903@dbooth.org>
The CSVW working group is anxious to close this issue, so I'd like to 
ask members of the TAG: In light of the discussion below and elsewhere 
in this thread, does anyone still think that .well-known is necessary in 
this case to prevent harmful URI squatting?  If so, why?

I maintain that this case does not represent harmful URI squatting, 
because URI owners are not prevented from using the standard CSVW 
metadata URIs ({+url}-metadata.json or metadata.json) for other 
purposes: a file of that name will *only* be interpreted as a 
CSVW metadata file if it explicitly indicates that it *should* be 
interpreted that way.  So far only Mark Nottingham has expressed 
concerns.  (I hope that the explanations below have since allayed Mark's 
concerns, but I do not yet know if they have.)
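
For concreteness, here is a minimal sketch of how a client might derive
the two default metadata URIs mentioned above from a CSV URL.  It assumes
that {+url} expands to the full URL of the CSV file and that
metadata.json is resolved relative to that URL; the function name and the
example.org URL are placeholders, not anything defined by the CSVW drafts.

```python
from urllib.parse import urljoin


def candidate_metadata_urls(csv_url):
    """Return the two default metadata locations, in the order listed above.

    Hypothetical helper for illustration only.
    """
    return [
        csv_url + "-metadata.json",         # {+url}-metadata.json
        urljoin(csv_url, "metadata.json"),  # metadata.json in the same directory
    ]


if __name__ == "__main__":
    for url in candidate_metadata_urls("http://example.org/data/tree.csv"):
        print(url)
```

Either URI stays free for any other use: finding a file there is only the
first step, and its content still has to declare itself as CSVW metadata.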

Thanks,
David Booth

On 06/22/2015 02:39 AM, David Booth wrote:
> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>
>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>>>> And, since the Web is so big, I certainly wouldn't rule out a
>>>> collision where it *is* misinterpreted as metadata.
>>>
>>> It certainly is possible in theory that someone with a CSV resource
>>> at a particular URI could completely coincidentally and
>>> unintentionally create a JSON file with the exact name and exact
>>> contents -- including the URI of the CSV resource -- required to
>>> cause that JSON to be misinterpreted as metadata for the CSV file.
>>> But it seems so unlikely that virtually any non-zero cost to
>>> prevent it would be a waste.
>
> Actually, we really can rule out the possibility that a non-CSVW file
> would accidentally be misinterpreted as a CSVW metadata file.  For a
> non-CSVW file to be accidentally misinterpreted as CSVW metadata for a
> corresponding CSV data file, *all* of the following would have to be
> true of the non-CSVW file:
>
>   - it would have to be in the same directory as the CSV data file;
>
>   - it would have to have the name {+url}-metadata.json or
> metadata.json, where {+url} is the name of the CSV data file;
>
>   - it would have to parse as JSON;
>
>   - it would have to contain a top level JSON property called
> "@context", with a value of either the string
> "http://www.w3.org/ns/csvw" or an array containing that string;
>
>   - it would have to explicitly reference the CSV data file; and
>
>   - when interpreted as CSVW metadata, the schema described must be
> compatible with the actual schema of the CSV data file.  Schema
> compatibility is defined as one would expect, such as the same number of
> columns, the same column names (where present), etc:
> http://w3c.github.io/csvw/metadata/#schema-compatibility
>
> Short of having an infinite number of monkeys typing, that just isn't
> going to happen accidentally.
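
The content conditions in the list above can be sketched as a single
check.  This is a simplified reading, not the normative CSVW algorithm.
It assumes that a table description names its CSV file in a "url"
property (possibly inside a "tables" array) and that the schema lists
columns under "tableSchema" / "columns".  The function name is
hypothetical, and the location and filename conditions are taken as
already satisfied by whoever supplies metadata_url.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def would_be_read_as_csvw_metadata(text, metadata_url, csv_url, csv_header):
    """Return True only if every content condition holds (simplified)."""
    # It must parse as JSON, and be an object so it can have properties.
    try:
        doc = json.loads(text)
    except ValueError:
        return False
    if not isinstance(doc, dict):
        return False

    # "@context" must be the CSVW namespace string or an array containing it.
    context = doc.get("@context")
    if context != CSVW_NS and not (isinstance(context, list) and CSVW_NS in context):
        return False

    # It must explicitly reference the CSV file (assumed: via a "url"
    # property, resolved relative to the metadata file's own URL).
    tables = doc["tables"] if isinstance(doc.get("tables"), list) else [doc]
    matches = [
        t for t in tables
        if isinstance(t, dict)
        and isinstance(t.get("url"), str)
        and urljoin(metadata_url, t["url"]) == csv_url
    ]
    if not matches:
        return False

    # Simplified schema compatibility: same number of columns, and the
    # same column names wherever the metadata gives a name.
    schema = matches[0].get("tableSchema")
    columns = schema.get("columns", []) if isinstance(schema, dict) else []
    if not isinstance(columns, list):
        return False
    if not columns:
        return True
    if len(columns) != len(csv_header):
        return False
    for column, header in zip(columns, csv_header):
        if isinstance(column, dict) and "name" in column and column["name"] != header:
            return False
    return True
```

An unrelated metadata.json, such as a package manifest, already fails at
the "@context" test, so it is never treated as CSVW metadata.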
>
>>>
>>> Furthermore, this is *exactly* the same risk that would *already*
>>> be present if the CSVW processor started with the JSON URI instead
>>> of the CSV URI: If the JSON *accidentally* looks like CSVW metadata
>>> and *accidentally* contains the URI of an existing CSV resource,
>>> then that CSV resource will be misinterpreted, regardless of the
>>> content of .well-known/csvm , because a CSVW processor must ignore
>>> .well-known/csvm if it is given CSVW metadata to start with, as
>>> described in section 6.1:
>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>
>> Right, and the way we prevent that on the Web is by giving something
>> a distinctive media type.
>>
>> AFAICT the audience you're designing this for is "CSV downloads that
>> don't have any context (e.g., a direct link, rather than one from
>> HTML) where the author has no ability to set Link headers." Is that
>> correct?
>
> Yes.
>
>>
>>
>>>> Earlier, you talked about the downsides:
>>>>
>>>>> - A *required* extra web access, nearly *every* time a
>>>>> conforming CSVW processor is given a tabular data URL and
>>>>> wishes to find the associated metadata -- because surely
>>>>> http://example/.well-known/csvm will be 404 (and not cacheable)
>>>>> in the vast majority of cases.
>>>>
>>>> Why is that bad? HTTP requests can be parallelised, so it's not
>>>> latency. Is the extra request processing *really* that much of
>>>> an overhead (considering we're talking about a comma- or tab-
>>>> delimited file)?
>>>
>>> It's not a big cost, but it is an actual cost, and it's being
>>> weighed against a benefit that IMO is largely theoretical.
>>
>> In isolation, I agree that's the right technical determination.
>>
>> This isn't an isolated problem, however; there are lots of
>> applications trying to stake a claim on various parts of URI space.
>> The main reason that I wrote the BCP was because writing protocols on
>> top of HTTP has become popular, and a lot of folks wanted to define
>> "standard" URI paths.
>>
>> As such, this is really a problem of the commons; your small
>> encroachment might not make a big impact on its own, but in concert
>> with others — especially when the W3C as steward of the Web is seen
>> doing this — it starts to have impact.
>>
>> In ten years, I really don't want to have a list of "filenames I
>> can't use on my Web site" because you wanted to save the overhead of
>> a single request in 2015 — especially when HTTP/2 makes requests
>> really, really cheap.
>>
>> Is that "theoretical"? I don't know, but I do think it's important.
>
> I share your concern.  I think we should be vigilant against URI
> squatting.  But although this case may look like URI squatting on the
> surface, I don't think it actually is when you dig into it.  The fact
> that the content of the CSV metadata file must *explicitly* indicate its
> intent to be used as a CSV metadata file changes the situation in a
> critical way, because it means that you are *not* prevented from using
> that filename for a different purpose.  That file will *only* be
> interpreted as a CSV metadata file if the owner explicitly indicates
> that it *should* be interpreted that way.  That's not squatting, that's
> the URI owner rightly exercising his/her choice.
>
> The only case where there's any name conflict at all is if the URI owner
> wishes to use that URI for some other purpose *and* for serving CSV
> metadata, simultaneously.  In that case the URI owner would have to make
> a choice about how he/she chooses to use that particular path.  But
> that's like trying to install two different software packages in the
> same directory: nobody expects to be able to do that, because both
> packages might have a makefile called 'makefile', or some other
> conflict.  Plus it makes a mess of the directory having files of
> different packages intermingled.  If someone really wants to use both
> software packages simultaneously, they install them in *different*
> directories.  The same is true of CSV metadata: if you want to publish
> CSV data and metadata, using the standard metadata filename, *and* you
> want to use that same filename for some other purpose, then you will
> have to put one of them in a different directory.  No big deal.  That
> doesn't cause you to have to consult a list of "filenames you can't use
> on your website".
>
>>
>>>> As I pointed out earlier, you can specify a default heuristic for
>>>> 404 on that resource so that you avoid it being uncacheable.
>>>
>>> I doubt many server owners will bother to make that 404 cacheable,
>>> given that they didn't bother to install a .well-known/csvm file.
>>
>> You misunderstand. You can specify a heuristic for the 404 to be
>> interpreted on the *client* side; it tells consumers that if there's
>> a 404 without freshness information, they can assume a specified
>> default.
>
> Oh, I see.  Yes, I guess they could.
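
A minimal sketch of that client-side heuristic, as I understand it: when
a 404 arrives without freshness information, the client remembers it for
a default lifetime fixed in advance.  The one-hour default, the class
name, and the method names are all invented for illustration; nothing in
this thread specifies a value.

```python
import time

DEFAULT_404_TTL = 3600  # seconds; hypothetical default, not from any spec


class NegativeCache:
    """Remember 404 responses so the same lookup is not repeated too soon."""

    def __init__(self, default_ttl=DEFAULT_404_TTL, clock=time.monotonic):
        self._expires = {}
        self._default_ttl = default_ttl
        self._clock = clock

    def record_404(self, url, max_age=None):
        # Use the server's freshness lifetime when it sent one;
        # otherwise fall back to the agreed default.
        ttl = self._default_ttl if max_age is None else max_age
        self._expires[url] = self._clock() + ttl

    def is_known_missing(self, url):
        # True while a remembered 404 for this URL is still fresh.
        expires = self._expires.get(url)
        return expires is not None and self._clock() < expires
```

A processor would call is_known_missing() before requesting
.well-known/csvm and skip the request while the remembered 404 is fresh.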
>
>>
>>>>> - Greater complexity in all conforming CSVW implementations.
>>>>
>>>> I don't find this convincing; if we were talking about some
>>>> involved scheme that involved lots of processing and tricky
>>>> syntax, sure, but this is extremely simple, and all of the code
>>>> to support it (libraries for HTTP, Link header parsing and URI
>>>> Templates) is already at hand in most cases.
>>>
>>> I agree that it's not a lot of additional complexity -- in fact
>>> it's quite simple -- but it *is* additional code.
>>
>> And I find that really unconvincing. If the bar for doing the right
>> thing is so low and still can't be cleared, we're in a really bad
>> place.
>
> If it really were a matter of "doing the right thing" then I'd agree.
> But as explained above, in this case I don't think it is.  Please
> consider the above points, and see what you think.
>
> Thanks,
> David Booth
Received on Wednesday, 24 June 2015 04:08:27 UTC