Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE from David Booth on 2015-07-01 (www-tag@w3.org from July 2015)

From: David Booth <david@dbooth.org>
Date: Tue, 30 Jun 2015 23:17:32 -0400
To: Mark Nottingham <mnot@mnot.net>, Tim Berners-Lee <timbl@w3.org>, Daniel Appelquist <appelquist@gmail.com>, Yan Zhu <yzhu@yahoo-inc.com>, Hadley Beeman <hadley@linkedgov.org>, Peter Linss <peter.linss@hp.com>, Yves Lafon <ylafon@w3.org>, Alex Russell <slightlyoff@google.com>, Travis Leithead <travis.leithead@microsoft.com>
CC: "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <55935BCC.9010000@dbooth.org>

Hi Mark (and Tim, Daniel, Yan, Hadley, Peter, Yves, Alex and Travis),

The CSVW working group appears to still be deferring to the TAG's 
27-May-2015 suggestion[1] to use .well-known for specifying non-standard 
CSV metadata URIs[2].  This means that, unless the WG decides to throw 
me a bone to appease me, they will likely go ahead with a decision that 
was based on the incorrect assumption that *not* using .well-known would 
cause harmful URI squatting -- simply because no member of the TAG has 
yet spoken up to acknowledge this error.  Three other readers of the TAG 
list have acknowledged the error, but thus far no TAG members have.[4][5][6]

I have previously explained[9] in some detail how the CSVW spec's 
standard CSV metadata URI mechanism avoids harmful URI squatting, in 
spite of first appearances.  Harmful URI squatting is caused when URI 
owners are prevented from using their own URIs as they choose. 
Although CSV metadata documents may use standard URI patterns, they 
avoid harmful URI squatting by following the approach of the Self 
Describing Web[7].  This is very similar to the way XML namespaces 
enable XML documents to be self describing.  Where an XML document would 
use an attribute like xmlns="http://example/foo" to further indicate the 
document's type (beyond just being XML), a CSV metadata file uses a JSON 
property "@context": "http://www.w3.org/ns/csvw" to explicitly indicate 
its type.  But that's not all: a CSV metadata file must *also* meet 
several other requirements[9] that prevent a non-CSV-metadata file from 
being accidentally interpreted as a CSV metadata file.  To drive this 
point home: this means that a URI owner is *not* prevented from using a 
standard CSV metadata URI for a completely different purpose of his/her 
choosing.
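
As a rough illustration of the self-description test (a sketch only, not the normative algorithm from the CSVW spec), the check could look like the Python below. It covers only three of the requirements listed in the quoted message further down: the file parses as JSON, it declares the CSVW "@context", and it references the CSV data file. The function names, the simplified lookup of "url" at the top level or inside "tables", and the exact-string URL comparison are assumptions made for this example; schema compatibility is not checked.

```python
import json

CSVW_CONTEXT = "http://www.w3.org/ns/csvw"


def declares_csvw_context(doc):
    """True if the top-level "@context" is the CSVW namespace string,
    or an array containing that string."""
    if not isinstance(doc, dict):
        return False
    ctx = doc.get("@context")
    if isinstance(ctx, str):
        return ctx == CSVW_CONTEXT
    if isinstance(ctx, list):
        return CSVW_CONTEXT in ctx
    return False


def references_csv(doc, csv_url):
    """Simplified assumption: the CSV file is named by a top-level "url"
    property or by a "url" inside a "tables" array, compared as an exact
    string (a real processor would resolve relative URLs first)."""
    tables = doc.get("tables")
    if not isinstance(tables, list):
        tables = []
    urls = [doc.get("url")]
    urls += [t.get("url") for t in tables if isinstance(t, dict)]
    return csv_url in urls


def looks_like_csvw_metadata(text, csv_url):
    """Return False for anything that is not JSON, does not declare the
    CSVW context, or does not point at the given CSV file."""
    try:
        doc = json.loads(text)
    except ValueError:
        return False
    return declares_csvw_context(doc) and references_csv(doc, csv_url)
```

Under these assumptions, an unrelated JSON file that happens to be named metadata.json (say, an application's own configuration) fails the first check and is simply ignored, so its owner remains free to use that name.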

I have also previously explained the actual harms[6] (complexity, extra 
HTTP requests, and security) that would result if the CSVW spec includes 
such an obscure feature that so few sites are likely to use.

I also put out a poll, asking who would actually use the .well-known 
feature if it were adopted.  So far there have been exactly zero responses.

Furthermore, in reviewing RFC5785[3], I notice that this use of .well-known 
actually *violates* RFC5785!  Section 1.1 (Appropriate Use of Well-Known 
URIs) explicitly states:

   "well-known URIs are not intended
    for general information retrieval or establishment of large URI
    namespaces on the Web.  Rather, they are designed to facilitate
    discovery of information on a site **when it isn't practical to use
    other mechanisms**"   [my emphasis]

But in this case, it clearly *is* practical to use other mechanisms. 
The CSVW spec already provides at least three alternate mechanisms for 
associating CSV metadata with CSV data: (a) a Link header, which can 
point from a CSV data URI to its corresponding CSV metadata URI; (b) 
standard CSV metadata URI patterns; and (c) the ability to publicize 
non-standard URIs of CSV metadata documents that link to their 
corresponding CSV data files.  (To clarify that last mechanism, one mode 
of use is for a user to start with a CSV data URI, and from that seek 
the corresponding CSV metadata.  But another mode of use is for a user 
to start with the CSV metadata URI, and from that locate the 
corresponding CSV data.  To facilitate that mode of use, data publishers 
have the option of publicizing their CSV metadata URIs along with their 
CSV data, so that users can easily find the CSV metadata.  Given the CSV 
metadata, a CSVW processor can then automatically locate the corresponding 
CSV data, because the CSV metadata file explicitly links to its 
corresponding data file.)
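
Mechanisms (a) and (b) can be sketched as follows. This is a minimal illustration, not the spec's algorithm: given a CSV data URI and the value of any Link header returned with it, it lists the candidate metadata URIs a processor might try, in order. The link relation name "describedby", the regex-based Link parsing (far from a complete parser), and the function names are assumptions made for this example; the two default patterns, {+url}-metadata.json and metadata.json, are the ones discussed in this thread.

```python
import re
from urllib.parse import urljoin

# Matches one link-value of the form <target>; params
# (simplified: assumes no commas inside parameter values).
LINK_RE = re.compile(r'<([^>]*)>([^,]*)')


def link_header_targets(link_header, rel="describedby"):
    """Return the targets of Link header entries carrying the given rel."""
    targets = []
    for target, params in LINK_RE.findall(link_header or ""):
        m = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        if m and rel in m.group(1).split():
            targets.append(target)
    return targets


def candidate_metadata_urls(csv_url, link_header=None):
    """List candidate metadata URIs for a CSV data URI, in lookup order."""
    # (a) Link header targets, resolved against the CSV URI.
    candidates = [urljoin(csv_url, t) for t in link_header_targets(link_header)]
    # (b) Standard patterns: {+url}-metadata.json, then metadata.json
    #     in the same directory as the CSV file.
    candidates.append(csv_url + "-metadata.json")
    candidates.append(urljoin(csv_url, "metadata.json"))
    return candidates
```

For a CSV file at the made-up address http://example.org/data/tree-ops.csv with no Link header, this yields http://example.org/data/tree-ops.csv-metadata.json followed by http://example.org/data/metadata.json. Mechanism (c) needs no discovery step at all, since the processor starts from the metadata URI and follows its link to the data.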

If any TAG members could please take the time to diligently follow 
through this logic and speak up to right this wrong, please do so *now*, 
before the CSVW working group irrevocably bakes this misguided feature 
into the spec.  I will be happy to assist in any way that I can, such as 
by answering questions or discussing it on a teleconference.

Thanks very much,
David Booth

References
1. https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md

2. https://github.com/w3c/csvw/issues/555#issuecomment-117019654

3. https://tools.ietf.org/html/rfc5785

4. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0023.html

5. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0019.html

6. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0011.html

7. http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html

8. http://w3c.github.io/csvw/metadata/

9. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0026.html

10. https://lists.w3.org/Archives/Public/public-csv-wg/2015Jun/0085.html

On 06/24/2015 12:52 AM, David Booth wrote:
> Hi Mark,
>
> On 06/24/2015 12:30 AM, Mark Nottingham wrote:
>> David,
>>
>>> On 24 Jun 2015, at 2:07 pm, David Booth <david@dbooth.org> wrote:
>>>
>>> The CSVW working group is anxious to close this issue, so I'd like
>>> to ask members of the TAG: In light of the discussion below and
>>> elsewhere in this thread, does anyone still think that .well-known
>>> is necessary in this case to prevent harmful URI squatting?   If
>>> so, why?
>>>
>>> I maintain that this case does not represent harmful URI squatting,
>>> because URI owners are not prevented from using the standard CSVW
>>> metadata URIs ({+url}-metadata.json or metadata.json) for other
>>> purposes, because a file of that name will *only* be interpreted as
>>> a CSVW metadata file if it explicitly indicates that it *should* be
>>> interpreted that way.  So far only Mark Nottingham has expressed
>>> concerns.  (I hope that the explanations below have since allayed
>>> Mark's concerns, but I do not yet know if they have.)
>>
>> We discussed it on a TAG call and came to agreement; furthermore, I'd
>> thought that .well-known was acceptable to CSVWG as well. As such, I
>> think the relevant question is if TAG members think the issues you
>> raise are sufficient to reopen the discussion.
>
> Right.  As far as I could tell from the minutes, it seems that the
> assumption at the time was that harmful URI squatting would result if
> .well-known were not used.  I have now brought new information that
> explains more fully how the CSVW metadata mechanism works, which rather
> conclusively shows that this mechanism would *not* cause harmful URI
> squatting without the use of .well-known.  Hence the previous decision
> should be changed.
>
>>
>> Again, I'm happy to have that discussion.
>
> What more needs to be discussed?   Are there questions that still need
> to be answered or aspects of the mechanism that are still unclear?  If
> so, I'd like to get them on the table.
>
> Does it need to be discussed on another call?   What information or
> process would be helpful in getting this resolved?
>
> Thanks,
> David Booth
>
>>
>> Cheers,
>>
>>
>>> Thanks, David Booth
>>>
>>> On 06/22/2015 02:39 AM, David Booth wrote:
>>>> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>>>>
>>>>>> On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org>
>>>>>> wrote:
>>>>>>> And, since the Web is so big, I certainly wouldn't rule out
>>>>>>> a collision where it *is* misinterpreted as metadata.
>>>>>>
>>>>>> It certainly is possible in theory that someone with a CSV
>>>>>> resource at a particular URI could completely coincidentally
>>>>>> and unintentionally create a JSON file with the exact name
>>>>>> and exact contents -- including the URI of the CSV resource
>>>>>> -- required to cause that JSON to be misinterpreted as
>>>>>> metadata for the CSV file. But it seems so unlikely that
>>>>>> virtually any non-zero cost to prevent it would be a waste.
>>>>
>>>> Actually, we really can rule out the possibility that a non-CSVW
>>>> file would accidentally be misinterpreted as a CSVW metadata
>>>> file.  For a non-CSVW file to be accidentally misinterpreted as
>>>> CSVW metadata for a corresponding CSV data file, *all* of the
>>>> following would have to be true of the non-CSVW file:
>>>>
>>>> - it would have to be in the same directory as the CSV data
>>>> file;
>>>>
>>>> - it would have to have the name {+url}-metadata.json or
>>>> metadata.json , where {+url} is the name of the CSV data file;
>>>>
>>>> - it would have to parse as JSON;
>>>>
>>>> - it would have to contain a top level JSON property called
>>>> "@context", with a value of either the string
>>>> "http://www.w3.org/ns/csvw" or an array containing that string;
>>>>
>>>> - it would have to explicitly reference the CSV data file; and
>>>>
>>>> - when interpreted as CSVW metadata, the schema described must
>>>> be compatible with the actual schema of the CSV data file.
>>>> Schema compatibility is defined as one would expect, such as the
>>>> same number of columns, the same column names (where present),
>>>> etc: http://w3c.github.io/csvw/metadata/#schema-compatibility
>>>>
>>>> Short of having an infinite number of monkeys typing, that just
>>>> isn't going to happen accidentally.
>>>>
>>>>>>
>>>>>> Furthermore, this is *exactly* the same risk that would
>>>>>> *already* be present if the CSVW processor started with the
>>>>>> JSON URI instead of the CSV URI: If the JSON *accidentally*
>>>>>> looks like CSVW metadata and *accidentally* contains the URI
>>>>>> of an existing CSV resource, then that CSV resource will be
>>>>>> misinterpreted, regardless of the content of .well-known/csvm
>>>>>> , because a CSVW processor must ignore .well-known/csvm if it
>>>>>> is given CSVW metadata to start with, as described in section
>>>>>> 6.1:
>>>>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>>>>
>>>>>
>>>>>>
>>>>> Right, and the way we prevent that on the Web is by giving
>>>>> something a distinctive media type.
>>>>>
>>>>> AFAICT the audience you're designing this for is "CSV downloads
>>>>> that don't have any context (e.g., a direct link, rather than
>>>>> one from HTML) where the author has no ability to set Link
>>>>> headers." Is that correct?
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>
>>>>>>> Earlier, you talked about the downsides:
>>>>>>>
>>>>>>>> - A *required* extra web access, nearly *every* time a
>>>>>>>> conforming CSVW processor is given a tabular data URL
>>>>>>>> and wishes to find the associated metadata -- because
>>>>>>>> surely http://example/.well-known/csvm will be 404 (and
>>>>>>>> not cacheable) in the vast majority of cases.
>>>>>>>
>>>>>>> Why is that bad? HTTP requests can be parallelised, so it's
>>>>>>> not latency. Is the extra request processing *really* that
>>>>>>> much of an overhead (considering we're talking about a
>>>>>>> comma- or tab- delimited file)?
>>>>>>
>>>>>> It's not a big cost, but it is an actual cost, and it's
>>>>>> being weighed against a benefit that IMO is largely
>>>>>> theoretical.
>>>>>
>>>>> In isolation, I agree that's the right technical
>>>>> determination.
>>>>>
>>>>> This isn't an isolated problem, however; there are lots of
>>>>> applications trying to stake a claim on various parts of URI
>>>>> space. The main reason that I wrote the BCP was because writing
>>>>> protocols on top of HTTP has become popular, and a lot of folks
>>>>> wanted to define "standard" URI paths.
>>>>>
>>>>> As such, this is really a problem of the commons; your small
>>>>> encroachment might not make a big impact on its own, but in
>>>>> concert with others — especially when the W3C as steward of the
>>>>> Web is seen doing this — it starts to have impact.
>>>>>
>>>>> In ten years, I really don't want to have a list of "filenames
>>>>> I can't use on my Web site" because you wanted to save the
>>>>> overhead of a single request in 2015 — especially when HTTP/2
>>>>> makes requests really, really cheap.
>>>>>
>>>>> Is that "theoretical"? I don't know, but I do think it's
>>>>> important.
>>>>
>>>> I share your concern.  I think we should be vigilant against URI
>>>> squatting.  But although this case may look like URI squatting on
>>>> the surface, I don't think it actually is when you dig into it.
>>>> The fact that the content of the CSV metadata file must
>>>> *explicitly* indicate its intent to be used as a CSV metadata
>>>> file changes the situation in a critical way, because it means
>>>> that you are *not* prevented from using that filename for a
>>>> different purpose.  That file will *only* be interpreted as a CSV
>>>> metadata file if the owner explicitly indicates that it *should*
>>>> be interpreted that way.  That's not squatting, that's the URI
>>>> owner rightly exercising his/her choice.
>>>>
>>>> The only case where there's any name conflict at all is if the
>>>> URI owner wishes to use that URI for some other purpose *and* for
>>>> serving CSV metadata, simultaneously.  In that case the URI owner
>>>> would have to make a choice about how he/she chooses to use that
>>>> particular path.  But that's like trying to install two different
>>>> software packages in the same directory: nobody expects to be
>>>> able to do that, because both packages might have a Make file
>>>> called 'makefile', or some other conflict.  Plus it makes a mess
>>>> of the directory having files of different packages intermingled.
>>>> If someone really wants to use both software packages
>>>> simultaneously, they install them in *different* directories.
>>>> The same is true of CSV metadata: if you want to publish CSV data
>>>> and metadata, using the standard metadata filename, *and* you
>>>> want to use that same filename for some other purpose, then you
>>>> will have to put one of them in a different directory.  No big
>>>> deal.  That doesn't cause you to have to consult a list of
>>>> "filenames you can't use on your website".
>>>>
>>>>>
>>>>>>> As I pointed out earlier, you can specify a default
>>>>>>> heuristic for 404 on that resource so that you avoid it
>>>>>>> being uncacheable.
>>>>>>
>>>>>> I doubt many server owners will bother to make that 404
>>>>>> cacheable, given that they didn't bother to install a
>>>>>> .well-known/csvm file.
>>>>>
>>>>> You misunderstand. You can specify a heuristic for the 404 to
>>>>> be interpreted on the *client* side; it tells consumers that if
>>>>> there's a 404 without freshness information, they can assume a
>>>>> specified default.
>>>>
>>>> Oh, I see.  Yes, I guess they could.
>>>>
>>>>>
>>>>>>>> - Greater complexity in all conforming CSVW
>>>>>>>> implementations.
>>>>>>>
>>>>>>> I don't find this convincing; if we were talking about
>>>>>>> some involved scheme that involved lots of processing and
>>>>>>> tricky syntax, sure, but this is extremely simple, and all
>>>>>>> of the code to support it (libraries for HTTP, Link header
>>>>>>> parsing and URI Templates) is already at hand in most
>>>>>>> cases.
>>>>>>
>>>>>> I agree that it's not a lot of additional complexity -- in
>>>>>> fact it's quite simple -- but it *is* additional code.
>>>>>
>>>>> And I find that really unconvincing. If the bar for doing the
>>>>> right thing is so small and still can't be overcome, we're in a
>>>>> really bad place.
>>>>
>>>> If it really were a matter of "doing the right thing" then I'd
>>>> agree. But as explained above, in this case I don't think it is.
>>>> Please consider the above points, and see what you think.
>>>>
>>>> Thanks, David Booth
>>
>> -- Mark Nottingham   https://www.mnot.net/
Received on Wednesday, 1 July 2015 03:18:05 UTC