- From: David Booth <david@dbooth.org>
- Date: Wed, 24 Jun 2015 00:07:28 -0400
- To: "www-tag@w3.org List" <www-tag@w3.org>
- CC: Mark Nottingham <mnot@mnot.net>
The CSVW working group is anxious to close this issue, so I'd like to
ask members of the TAG: In light of the discussion below and elsewhere
in this thread, does anyone still think that .well-known is necessary in
this case to prevent harmful URI squatting? If so, why?

I maintain that this case does not represent harmful URI squatting,
because URI owners are not prevented from using the standard CSVW
metadata URIs ({+url}-metadata.json or metadata.json) for other
purposes, because a file of that name will *only* be interpreted as a
CSVW metadata file if it explicitly indicates that it *should* be
interpreted that way.

So far only Mark Nottingham has expressed concerns. (I hope that the
explanations below have since allayed Mark's concerns, but I do not yet
know if they have.)

Thanks,
David Booth

On 06/22/2015 02:39 AM, David Booth wrote:
> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>
>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>>>> And, since the Web is so big, I certainly wouldn't rule out a
>>>> collision where it *is* misinterpreted as metadata.
>>>
>>> It certainly is possible in theory that someone with a CSV resource
>>> at a particular URI could completely coincidentally and
>>> unintentionally create a JSON file with the exact name and exact
>>> contents -- including the URI of the CSV resource -- required to
>>> cause that JSON to be misinterpreted as metadata for the CSV file.
>>> But it seems so unlikely that virtually any non-zero cost to
>>> prevent it would be a waste.
>
> Actually, we really can rule out the possibility that a non-CSVW file
> would accidentally be misinterpreted as a CSVW metadata file. For a
> non-CSVW file to be accidentally misinterpreted as CSVW metadata for a
> corresponding CSV data file, *all* of the following would have to be
> true of the non-CSVW file:
>
> - it would have to be in the same directory as the CSV data file;
>
> - it would have to have the name {+url}-metadata.json or
> metadata.json, where {+url} is the name of the CSV data file;
>
> - it would have to parse as JSON;
>
> - it would have to contain a top level JSON property called
> "@context", with a value of either the string
> "http://www.w3.org/ns/csvw" or an array containing that string;
>
> - it would have to explicitly reference the CSV data file; and
>
> - when interpreted as CSVW metadata, the schema described must be
> compatible with the actual schema of the CSV data file. Schema
> compatibility is defined as one would expect, such as the same number of
> columns, the same column names (where present), etc:
> http://w3c.github.io/csvw/metadata/#schema-compatibility
>
> Short of having an infinite number of monkeys typing, that just isn't
> going to happen accidentally.
>
>>>
>>> Furthermore, this is *exactly* the same risk that would *already*
>>> be present if the CSVW processor started with the JSON URI instead
>>> of the CSV URI: If the JSON *accidentally* looks like CSVW metadata
>>> and *accidentally* contains the URI of an existing CSV resource,
>>> then that CSV resource will be misinterpreted, regardless of the
>>> content of .well-known/csvm, because a CSVW processor must ignore
>>> .well-known/csvm if it is given CSVW metadata to start with, as
>>> described in section 6.1:
>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>
>> Right, and the way we prevent that on the Web is by giving something
>> a distinctive media type.
>>
>> AFAICT the audience you're designing this for is "CSV downloads that
>> don't have any context (e.g., a direct link, rather than one from
>> HTML) where the author has no ability to set Link headers." Is that
>> correct?
>
> Yes.
>
>>
>>
>>>> Earlier, you talked about the downsides:
>>>>
>>>>> - A *required* extra web access, nearly *every* time a
>>>>> conforming CSVW processor is given a tabular data URL and
>>>>> wishes to find the associated metadata -- because surely
>>>>> http://example/.well-known/csvm will be 404 (and not cachable)
>>>>> in the vast majority of cases.
>>>>
>>>> Why is that bad? HTTP requests can be parallelised, so it's not
>>>> latency. Is the extra request processing *really* that much of
>>>> an overhead (considering we're talking about a comma- or
>>>> tab-delimited file)?
>>>
>>> It's not a big cost, but it is an actual cost, and it's being
>>> weighed against a benefit that IMO is largely theoretical.
>>
>> In isolation, I agree that's the right technical determination.
>>
>> This isn't an isolated problem, however; there are lots of
>> applications trying to stake a claim on various parts of URI space.
>> The main reason that I wrote the BCP was because writing protocols on
>> top of HTTP has become popular, and a lot of folks wanted to define
>> "standard" URI paths.
>>
>> As such, this is really a problem of the commons; your small
>> encroachment might not make a big impact on its own, but in concert
>> with others -- especially when the W3C as steward of the Web is seen
>> doing this -- it starts to have impact.
>>
>> In ten years, I really don't want to have a list of "filenames I
>> can't use on my Web site" because you wanted to save the overhead of
>> a single request in 2015 -- especially when HTTP/2 makes requests
>> really, really cheap.
>>
>> Is that "theoretical"? I don't know, but I do think it's important.
>
> I share your concern. I think we should be vigilant against URI
> squatting. But although this case may look like URI squatting on the
> surface, I don't think it actually is when you dig into it. The fact
> that the content of the CSV metadata file must *explicitly* indicate its
> intent to be used as a CSV metadata file changes the situation in a
> critical way, because it means that you are *not* prevented from using
> that filename for a different purpose. That file will *only* be
> interpreted as a CSV metadata file if the owner explicitly indicates
> that it *should* be interpreted that way. That's not squatting, that's
> the URI owner rightly exercising his/her choice.
>
> The only case where there's any name conflict at all is if the URI owner
> wishes to use that URI for some other purpose *and* for serving CSV
> metadata, simultaneously. In that case the URI owner would have to make
> a choice about how he/she chooses to use that particular path. But
> that's like trying to install two different software packages in the
> same directory: nobody expects to be able to do that, because both
> packages might have a Make file called 'makefile', or some other
> conflict. Plus it makes a mess of the directory having files of
> different packages intermingled. If someone really wants to use both
> software packages simultaneously, they install them in *different*
> directories. The same is true of CSV metadata: if you want to publish
> CSV data and metadata, using the standard metadata filename, *and* you
> want to use that same filename for some other purpose, then you will
> have to put one of them in a different directory. No big deal. That
> doesn't cause you to have to consult a list of "filenames you can't use
> on your website".
>
>>
>>>> As I pointed out earlier, you can specify a default heuristic for
>>>> 404 on that resource so that you avoid it being uncacheable.
>>>
>>> I doubt many server owners will bother to make that 404 cachable,
>>> given that they didn't bother to install a .well-known/csvm file.
>>
>> You misunderstand. You can specify a heuristic for the 404 to be
>> interpreted on the *client* side; it tells consumers that if there's
>> a 404 without freshness information, they can assume a specified
>> default.
>
> Oh, I see. Yes, I guess they could.
>
>>
>>>>> - Greater complexity in all conforming CSVW implementations.
>>>>
>>>> I don't find this convincing; if we were talking about some
>>>> involved scheme that involved lots of processing and tricky
>>>> syntax, sure, but this is extremely simple, and all of the code
>>>> to support it (libraries for HTTP, Link header parsing and URI
>>>> Templates) is already at hand in most cases.
>>>
>>> I agree that it's not a lot of additional complexity -- in fact
>>> it's quite simple -- but it *is* additional code.
>>
>> And I find that really unconvincing. If the bar for doing the right
>> thing is so small and still can't be overcome, we're in a really bad
>> place.
>
> If it really were a matter of "doing the right thing" then I'd agree.
> But as explained above, in this case I don't think it is. Please
> consider the above points, and see what you think.
>
> Thanks,
> David Booth
Received on Wednesday, 24 June 2015 04:08:27 UTC