- From: Mark Nottingham <mnot@mnot.net>
- Date: Wed, 24 Jun 2015 14:30:36 +1000
- To: David Booth <david@dbooth.org>
- Cc: "www-tag@w3.org List" <www-tag@w3.org>
David,

> On 24 Jun 2015, at 2:07 pm, David Booth <david@dbooth.org> wrote:
>
> The CSVW working group is anxious to close this issue, so I'd like to
> ask members of the TAG: In light of the discussion below and elsewhere
> in this thread, does anyone still think that .well-known is necessary
> in this case to prevent harmful URI squatting? If so, why?
>
> I maintain that this case does not represent harmful URI squatting,
> because URI owners are not prevented from using the standard CSVW
> metadata URIs ({+url}-metadata.json or metadata.json) for other
> purposes, because a file of that name will *only* be interpreted as a
> CSVW metadata file if it explicitly indicates that it *should* be
> interpreted that way. So far only Mark Nottingham has expressed
> concerns. (I hope that the explanations below have since allayed
> Mark's concerns, but I do not yet know if they have.)

We discussed it on a TAG call and came to agreement; furthermore, I'd
thought that .well-known was acceptable to CSVWG as well. As such, I
think the relevant question is if TAG members think the issues you raise
are sufficient to reopen the discussion. Again, I'm happy to have that
discussion.

Cheers,

> Thanks,
> David Booth
>
> On 06/22/2015 02:39 AM, David Booth wrote:
>> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>>
>>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org> wrote:
>>>>> And, since the Web is so big, I certainly wouldn't rule out a
>>>>> collisions where it *is* misinterpreted as metadata.
>>>>
>>>> It certainly is possible in theory that someone with a CSV resource
>>>> at a particular URI could completely coincidentally and
>>>> unintentionally create a JSON file with the exact name and exact
>>>> contents -- including the URI of the CSV resource -- required to
>>>> cause that JSON to be misinterpreted as metadata for the CSV file.
>>>> But it seems so unlikely that virtually any non-zero cost to
>>>> prevent it would be a waste.
>>
>> Actually, we really can rule out the possibility that a non-CSVW file
>> would accidentally be misinterpreted as a CSVW metadata file. For a
>> non-CSVW file to be accidentally misinterpreted as a CSVW metadata for a
>> corresponding CSV data file, *all* of the following would have to be
>> true of the non-CSVW file:
>>
>> - it would have to be in the same directory as the CSV data file;
>>
>> - it would have to have the name {+url}-metadata.json or
>> metadata.json, where {+url} is the name of the CSV data file;
>>
>> - it would have to parse as JSON;
>>
>> - it would have to contain a top level JSON property called
>> "@context", with a value of either the string
>> "http://www.w3.org/ns/csvw" or an array containing that string;
>>
>> - it would have to explicitly reference the CSV data file; and
>>
>> - when interpreted as CSVW metadata, the schema described must be
>> compatible with the actual schema of the CSV data file. Schema
>> compatibility is defined as one would expect, such as the same number of
>> columns, the same column names (where present), etc:
>> http://w3c.github.io/csvw/metadata/#schema-compatibility
>>
>> Short of having an infinite number of monkeys typing, that just isn't
>> going to happen accidentally.
>>
>>>>
>>>> Furthermore, this is *exactly* the same risk that would *already*
>>>> be present if the CSVW processor started with the JSON URI instead
>>>> of the CSV URI: If the JSON *accidentally* looks like CSVW metadata
>>>> and *accidentally* contains the URI of an existing CSV resource,
>>>> then that CSV resource will be misinterpreted, regardless of the
>>>> content of .well-known/csvm , because a CSVW processor must ignore
>>>> .well-known/csvm if it is given CSVW metadata to start with, as
>>>> described in section 6.1:
>>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>>
>>> Right, and the way we prevent that on the Web is by giving something
>>> a distinctive media type.
>>>
>>> AFAICT the audience you're designing this for is "CSV downloads that
>>> don't have any context (e.g., a direct link, rather than one from
>>> HTML) where the author has no ability to set Link headers." Is that
>>> correct?
>>
>> Yes.
>>
>>>
>>>
>>>>> Earlier, you talked about the downsides:
>>>>>
>>>>>> - A *required* extra web access, nearly *every* time a
>>>>>> conforming CSVW processor is given a tabular data URL and
>>>>>> wishes to find the associated metadata -- because surely
>>>>>> http://example/.well-known/csvm will be 404 (and not cachable)
>>>>>> in the vast majority of cases.
>>>>>
>>>>> Why is that bad? HTTP requests can be parallelised, so it's not
>>>>> latency. Is the extra request processing *really* that much of
>>>>> an overhead (considering we're talking about a comma- or tab-
>>>>> delimited file)?
>>>>
>>>> It's not a big cost, but it is an actual cost, and it's being
>>>> weighed against a benefit that IMO is largely theoretical.
>>>
>>> In isolation, I agree that's the right technical determination.
>>>
>>> This isn't an isolated problem, however; there are lots of
>>> applications trying to stake a claim on various parts of URI space.
>>> The main reason that I wrote the BCP was because writing protocols on
>>> top of HTTP has become popular, and a lot of folks wanted to define
>>> "standard" URI paths.
>>>
>>> As such, this is really a problem of the commons; your small
>>> encroachment might not make a big impact on its own, but in concert
>>> with others -- especially when the W3C as steward of the Web is seen
>>> doing this -- it starts to have impact.
>>>
>>> In ten years, I really don't want to have a list of "filenames I
>>> can't use on my Web site" because you wanted to save the overhead of
>>> a single request in 2015 -- especially when HTTP/2 makes requests
>>> really, really cheap.
>>>
>>> Is that "theoretical"? I don't know, but I do think it's important.
>>
>> I share your concern.
>> I think we should be vigilant against URI
>> squatting. But although this case may look like URI squatting on the
>> surface, I don't think it actually is when you dig into it. The fact
>> that the content of the CSV metadata file must *explicitly* indicate its
>> intent to be used as a CSV metadata file changes the situation in a
>> critical way, because it means that you are *not* prevented from using
>> that filename for a different purpose. That file will *only* be
>> interpreted as a CSV metadata file if the owner explicitly indicates
>> that it *should* be interpreted that way. That's not squatting, that's
>> the URI owner rightly exercising his/her choice.
>>
>> The only case where there's any name conflict at all is if the URI owner
>> wishes to use that URI for some other purpose *and* for serving CSV
>> metadata, simultaneously. In that case the URI owner would have to make
>> a choice about how he/she chooses to use that particular path. But
>> that's like trying to install two different software packages in the
>> same directory: nobody expects to be able to do that, because both
>> packages might have a Make file called 'makefile', or some other
>> conflict. Plus it makes a mess of the directory having files of
>> different packages intermingled. If someone really wants to use both
>> software packages simultaneously, they install them in *different*
>> directories. The same is true of CSV metadata: if you want to publish
>> CSV data and metadata, using the standard metadata filename, *and* you
>> want to use that same filename for some other purpose, then you will
>> have to put one of them in a different directory. No big deal. That
>> doesn't cause you to have to consult a list of "filenames you can't use
>> on your website".
>>
>>>
>>>>> As I pointed out earlier, you can specify a default heuristic for
>>>>> 404 on that resource so that you avoid it being uncacheable.
>>>>
>>>> I doubt many server owners will bother to make that 404 cachable,
>>>> given that they didn't bother to install a .well-known/csvm file.
>>>
>>> You misunderstand. You can specify a heuristic for the 404 to be
>>> interpreted on the *client* side; it tells consumers that if there's
>>> a 404 without freshness information, they can assume a specified
>>> default.
>>
>> Oh, I see. Yes, I guess they could.
>>
>>>
>>>>>> - Greater complexity in all conforming CSVW implementations.
>>>>>
>>>>> I don't find this convincing; if we were talking about some
>>>>> involved scheme that involved lots of processing and tricky
>>>>> syntax, sure, but this is extremely simple, and all of the code
>>>>> to support it (libraries for HTTP, Link header parsing and URI
>>>>> Templates) is already at hand in most cases.
>>>>
>>>> I agree that it's not a lot of additional complexity -- in fact
>>>> it's quite simple -- but it *is* additional code.
>>>
>>> And I find that really unconvincing. If the bar for doing the right
>>> thing is so small and still can't be overcome, we're in a really bad
>>> place.
>>
>> If it really were a matter of "doing the right thing" then I'd agree.
>> But as explained above, in this case I don't think it is. Please
>> consider the above points, and see what you think.
>>
>> Thanks,
>> David Booth

--
Mark Nottingham   https://www.mnot.net/
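
For readers following the exchange about the 404 on .well-known/csvm: the client-side heuristic Mark describes (assume a specified default when a 404 carries no freshness information) might look roughly like the sketch below. It is only an illustration. The class name, the one-hour default, and the lowercase-keyed header dict are all assumptions made for this example; the thread does not say what default a specification would choose, and only max-age is interpreted here.

```python
import time

# Hypothetical default lifetime for a 404 that carries no freshness
# information. The thread does not name a value; 3600 s is a placeholder.
DEFAULT_404_TTL = 3600


def explicit_max_age(headers):
    """Return the Cache-Control max-age in seconds, or None if absent.

    `headers` is assumed to be a dict with lowercase header names.
    """
    for directive in headers.get("cache-control", "").split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


class NegativeCache:
    """Remembers recent 404s so a client can skip repeating the request."""

    def __init__(self, default_ttl=DEFAULT_404_TTL, clock=time.monotonic):
        self.default_ttl = default_ttl
        self._clock = clock
        self._expires = {}

    def record(self, url, status, headers):
        """Store a 404; apply the default only when the server gave no hint."""
        self._expires.pop(url, None)
        if status != 404:
            return
        ttl = explicit_max_age(headers)
        if ttl is None:
            if "cache-control" in headers or "expires" in headers:
                # The server supplied caching information this sketch does
                # not interpret, so it stays conservative and caches nothing.
                return
            ttl = self.default_ttl
        self._expires[url] = self._clock() + ttl

    def known_missing(self, url):
        """True while a remembered 404 for `url` is still considered fresh."""
        return self._clock() < self._expires.get(url, float("-inf"))
```

A processor would call record() after each lookup of the well-known resource and consult known_missing() before issuing the next one, so the server owner does not need to configure anything for the 404 to be reused.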
Received on Wednesday, 24 June 2015 04:31:13 UTC
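
As an illustration of the checklist David gives in the quoted message (file name, JSON parse, "@context", explicit reference to the CSV file), here is a minimal Python sketch. It is not taken from the CSVW drafts: the function names are made up for this example, the use of "url" and "tables" properties to find the referenced CSV file is an assumption about the metadata layout, and the final schema-compatibility check is left out entirely.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def candidate_metadata_urls(csv_url):
    """Return the two default metadata locations named in the thread."""
    return [
        csv_url + "-metadata.json",         # {+url}-metadata.json
        urljoin(csv_url, "metadata.json"),  # metadata.json, same directory
    ]


def declares_csvw_context(doc):
    """True if top-level @context is the CSVW string or an array holding it."""
    if not isinstance(doc, dict):
        return False
    ctx = doc.get("@context")
    if isinstance(ctx, str):
        return ctx == CSVW_NS
    if isinstance(ctx, list):
        return CSVW_NS in ctx
    return False


def references_csv(doc, metadata_url, csv_url):
    """True if a table description's "url" resolves to csv_url.

    Assumption: the reference is a "url" property, either at the top level
    or on entries of a top-level "tables" array.
    """
    if not isinstance(doc, dict):
        return False
    tables = doc["tables"] if isinstance(doc.get("tables"), list) else [doc]
    for table in tables:
        if isinstance(table, dict) and isinstance(table.get("url"), str):
            if urljoin(metadata_url, table["url"]) == csv_url:
                return True
    return False


def looks_like_csvw_metadata(text, metadata_url, csv_url):
    """Apply the parse, @context, and reference checks (not schema checks)."""
    try:
        doc = json.loads(text)
    except ValueError:
        return False
    return declares_csvw_context(doc) and references_csv(
        doc, metadata_url, csv_url
    )
```

Under these assumptions, an unrelated metadata.json (for example a package manifest) fails the "@context" test at once, which is the point David is making: the file name alone is never enough for the file to be treated as CSVW metadata.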