Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE

On 1 July 2015 at 05:17, David Booth <david@dbooth.org> wrote:

> Hi Mark (and Tim, Daniel, Yan, Hadley, Peter, Yves, Alex and Travis),
>
> The CSVW working group appears to still be deferring to the TAG's
> 27-May-2015 suggestion[1] to use .well-known for specifying non-standard
> CSV metadata URIs[2].  This means that, unless the WG decides to throw me a
> bone to appease me, they will likely go ahead with a decision that was
> based on the incorrect assumption that *not* using .well-known would cause
> harmful URI squatting -- simply because no member of the TAG has yet spoken
> up to acknowledge this error.  Three other readers of the TAG list have
> acknowledged the error, but thus far no TAG members have.[4][5][6]
>
> I have previously explained[9] in some detail how the CSVW spec's standard
> CSV metadata URI mechanism avoids harmful URI squatting, in spite of first
> appearances.  Harmful URI squatting is caused when URI owners are prevented
> from using their own URIs how they choose. Although CSV metadata documents
> may use standard URI patterns, they avoid harmful URI squatting by
> following the approach of the Self Describing Web[7].  This is very similar
> to the way XML namespaces enable XML documents to be self describing.
> Where an XML document would use an attribute like xmlns="
> http://example/foo" to further indicate the document's type (beyond just
> being XML), a CSV metadata file uses a JSON property "@context": "
> http://www.w3.org/ns/csvw" to explicitly indicate its type.  But that's
> not all: a CSV metadata file must *also* meet several other requirements[9]
> that prevent a non-CSV-metadata file from being accidentally interpreted as
> a CSV metadata file.  To drive this point home: this means that a URI owner
> is *not* prevented from using a standard CSV metadata URI for a completely
> different purpose of his/her choosing.
>
> I have also previously explained the actual harms[6] (complexity, extra
> HTTP requests, and security) that would result if the CSVW spec includes
> such an obscure feature that so few sites are likely to use.
>
> I also put out a poll, asking who would actually use the .well-known
> feature if it were adopted.  So far there have been exactly zero responses.
>
> Furthermore, in reviewing RFC5785, I notice that this use of .well-known
> actually *violates* RFC5785!  Section 1.1 (Appropriate Use of Well-Known
> URIs) explicitly states:
>
>   "well-known URIs are not intended
>    for general information retrieval or establishment of large URI
>    namespaces on the Web.  Rather, they are designed to facilitate
>    discovery of information on a site **when it isn't practical to use
>    other mechanisms**"   [my emphasis]
>
> But in this case, it clearly *is* practical to use other mechanisms. The
> CSVW spec already provides at least three alternate mechanisms for
> associating CSV metadata with CSV data: (a) a Link header, which can point
> from a CSV data URI to its corresponding CSV metadata URI; (b) standard CSV
> metadata URI patterns; and (c) the ability to publicize non-standard URIs
> of CSV metadata documents that link to their corresponding CSV data files.
> (To clarify that last mechanism, one mode of use is for a user to start
> with a CSV data URI, and from that seek the corresponding CSV metadata.
> But another mode of use is for a user to start with the CSV metadata URI,
> and from that locate the corresponding CSV data.  To facilitate that mode
> of use, data publishers have the option of publicizing their CSV metadata
> URIs along with their CSV data, so that users can easily find the CSV
> metadata.  Given the CSV metadata, a CSVW processor can then automatically
> locate the corresponding CSV data, because the CSV metadata file explicitly
> links to its corresponding data file.)
>
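
For anyone skimming, mechanism (b) is easy to picture. Below is a minimal
sketch, assuming the two default patterns quoted further down this thread
({+url}-metadata.json and metadata.json). The function name and the example
URL are mine, not from the spec, and query strings are not handled.

```python
from urllib.parse import urldefrag, urljoin


def candidate_metadata_urls(csv_url):
    """Return the default metadata locations to try for a CSV URL, in order."""
    # Fragments are never sent to the server, so drop one if present.
    base, _fragment = urldefrag(csv_url)
    return [
        base + "-metadata.json",         # {+url}-metadata.json
        urljoin(base, "metadata.json"),  # metadata.json in the same directory
    ]


if __name__ == "__main__":
    for url in candidate_metadata_urls("http://example.org/data/tree.csv"):
        print(url)
```

A processor would try each candidate in turn and still has to check the
content of whatever comes back before treating it as metadata.
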
> If any TAG members could please take the time to diligently follow through
> this logic and speak up to right this wrong, please do so *now*, before the
> CSVW working group irrevocably bakes this misguided feature into the spec.
> I will be happy to assist in any way that I can, such as by answering
> questions or discussing it on a teleconference.
>

It would be great if anyone had a bit of time to look at this in slightly
more detail.

It also occurred to me that we (working with the team at DIG / MIT) have
been using rel="meta" for quite a while now (with good results), and I
wonder whether that will become standardized in Linked Data Platform v.next,
and whether it is relevant here.  This stuff isn't in Linked Data Platform V1
but may be in V2 (about a dozen people are interested in carrying on the
work).

For our work, Link headers are always used where possible.  But some places
(e.g. GitHub) don't give you that kind of access.

Now, looking at CSVW, I wonder whether we should be using rel="describedBy"
instead.  The principle behind it is adding metadata to Linked Data
Platform files, much in the same way UNIX adds inode metadata.
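
To make the Link header approach concrete, here is a rough sketch of how a
client might pick out the metadata link for either relation.  The helper
name and the sample header are made up for illustration.  The parsing is
deliberately simplified: it assumes no "<" appears inside a quoted parameter
value, and it returns targets unresolved, so relative URIs still need
resolving against the request URI.

```python
import re

# "<target>" followed by its parameters, up to the next "<".
_LINK = re.compile(r'<([^>]*)>([^<]*)')
# ;name="quoted value"  or  ;name=token
_PARAM = re.compile(r';\s*([A-Za-z*]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def links_with_rel(header_value, wanted):
    """Return link targets whose rel contains any of the wanted relation types."""
    wanted = {w.lower() for w in wanted}
    found = []
    for target, params in _LINK.findall(header_value):
        for name, quoted, bare in _PARAM.findall(params):
            if name.lower() != "rel":
                continue
            # rel may hold several space-separated types; compare case-insensitively.
            rels = (quoted or bare).lower().split()
            if wanted.intersection(rels):
                found.append(target)
            break  # only the first rel parameter of a link counts
    return found


if __name__ == "__main__":
    header = ('<tree.csv-metadata.json>; rel="describedBy", '
              '</style.css>; rel=stylesheet, <,meta>; rel="meta"')
    print(links_with_rel(header, {"describedby", "meta"}))
```

Accepting both relations in one pass would let a client work with servers
that use either convention while the naming question is still open.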

In our implementations of metadata we do have standard names such as ,meta
(rather than metadata.json).  We are also thinking about rel="acl" for
access control.  Though acl was originally tied together with the acl.

I recognize that a lot of this is a new frontier; in many ways it's not clear
cut, and many people won't be in a position to have a view at all.  But it
is really helpful to hear how others are doing this, and to hear ideas, so
that we can maybe get our ducks in a row! :)


>
> Thanks very much,
> David Booth
>
> References
> 1.
> https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md
>
> 2. https://github.com/w3c/csvw/issues/555#issuecomment-117019654
>
> 3. https://tools.ietf.org/html/rfc5785
>
> 4. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0023.html
>
> 5. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0019.html
>
> 6. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0011.html
>
> 7. http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html
>
> 8. http://w3c.github.io/csvw/metadata/
>
> 9. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0026.html
>
> 10. https://lists.w3.org/Archives/Public/public-csv-wg/2015Jun/0085.html
>
>
> On 06/24/2015 12:52 AM, David Booth wrote:
>
>> Hi Mark,
>>
>> On 06/24/2015 12:30 AM, Mark Nottingham wrote:
>>
>>> David,
>>>
>>>  On 24 Jun 2015, at 2:07 pm, David Booth <david@dbooth.org> wrote:
>>>>
>>>> The CSVW working group is anxious to close this issue, so I'd like
>>>> to ask members of the TAG: In light of the discussion below and
>>>> elsewhere in this thread, does anyone still think that .well-known
>>>> is necessary in this case to prevent harmful URI squatting?   If
>>>> so, why?
>>>>
>>>> I maintain that this case does not represent harmful URI squatting,
>>>> because URI owners are not prevented from using the standard CSVW
>>>> metadata URIs ({+url}-metadata.json or metadata.json) for other
>>>> purposes, because a file of that name will *only* be interpreted as
>>>> a CSVW metadata file if it explicitly indicates that it *should* be
>>>> interpreted that way.  So far only Mark Nottingham has expressed
>>>> concerns.  (I hope that the explanations below have since allayed
>>>> Mark's concerns, but I do not yet know if they have.)
>>>>
>>>
>>> We discussed it on a TAG call and came to agreement; furthermore, I'd
>>> thought that .well-known was acceptable to CSVWG as well. As such, I
>>> think the relevant question is if TAG members think the issues you
>>> raise are sufficient to reopen the discussion.
>>>
>>
>> Right.  As far as I could tell from the minutes, it seems that the
>> assumption at the time was that harmful URI squatting would result if
>> .well-known were not used.  I have now brought new information that
>> explains more fully how the CSVW metadata mechanism works, which  rather
>> conclusively shows that this mechanism would *not* cause harmful URI
>> squatting without the use of .well-known.  Hence the previous decision
>> should be changed.
>>
>>
>>> Again, I'm happy to have that discussion.
>>>
>>
>> What more needs to be discussed?   Are there questions that still need
>> to be answered or aspects of the mechanism that are still unclear?  If
>> so, I'd like to get them on the table.
>>
>> Does it need to be discussed on another call?   What information or
>> process would be helpful in getting this resolved?
>>
>> Thanks,
>> David Booth
>>
>>
>>> Cheers,
>>>
>>>
>>>  Thanks, David Booth
>>>>
>>>> On 06/22/2015 02:39 AM, David Booth wrote:
>>>>
>>>>> On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>>>>>
>>>>>>
>>>>>>  On 19 Jun 2015, at 5:32 pm, David Booth <david@dbooth.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> And, since the Web is so big, I certainly wouldn't rule out
>>>>>>>> a collision where it *is* misinterpreted as metadata.
>>>>>>>>
>>>>>>>
>>>>>>> It certainly is possible in theory that someone with a CSV
>>>>>>> resource at a particular URI could completely coincidentally
>>>>>>> and unintentionally create a JSON file with the exact name
>>>>>>> and exact contents -- including the URI of the CSV resource
>>>>>>> -- required to cause that JSON to be misinterpreted as
>>>>>>> metadata for the CSV file. But it seems so unlikely that
>>>>>>> virtually any non-zero cost to prevent it would be a waste.
>>>>>>>
>>>>>>
>>>>> Actually, we really can rule out the possibility that a non-CSVW
>>>>> file would accidentally be misinterpreted as a CSVW metadata
>>>>> file.  For a non-CSVW file to be accidentally misinterpreted as a
>>>>> CSVW metadata for a corresponding CSV data file, *all* of the
>>>>> following would have to be true of the non-CSVW file:
>>>>>
>>>>> - it would have to be in the same directory as the CSV data
>>>>> file;
>>>>>
>>>>> - it would have to have the name {+url}-metadata.json or
>>>>> metadata.json , where {+url} is the name of the CSV data file;
>>>>>
>>>>> - it would have to parse as JSON;
>>>>>
>>>>> - it would have to contain a top level JSON property called
>>>>> "@context", with a value of either the string
>>>>> "http://www.w3.org/ns/csvw" or an array containing that string;
>>>>>
>>>>> - it would have to explicitly reference the CSV data file; and
>>>>>
>>>>> - when interpreted as CSVW metadata, the schema described must
>>>>> be compatible with the actual schema of the CSV data file.
>>>>> Schema compatibility is defined as one would expect, such as the
>>>>> same number of columns, the same column names (where present),
>>>>> etc: http://w3c.github.io/csvw/metadata/#schema-compatibility
>>>>>
>>>>> Short of having an infinite number of monkeys typing, that just
>>>>> isn't going to happen accidentally.
>>>>>
>>>>>
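
A rough sketch of the content checks in the list above, to show how little
code the "self describing" test needs.  It assumes, from my reading of the
metadata draft, that a table description names its data file in a "url"
property and that a table group lists its tables under "tables".  The
function name is mine, and the schema compatibility check is left out.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def claims_to_describe(metadata_text, metadata_url, csv_url):
    """True only if the document declares itself CSVW metadata for csv_url."""
    try:
        doc = json.loads(metadata_text)       # must parse as JSON
    except ValueError:
        return False
    if not isinstance(doc, dict):
        return False

    # "@context" must be the CSVW namespace string or an array containing it.
    context = doc.get("@context")
    if context != CSVW_NS and not (isinstance(context, list) and CSVW_NS in context):
        return False

    # Must explicitly reference the CSV file (assumed "url" / "tables" layout).
    tables = doc.get("tables", [doc])
    if not isinstance(tables, list):
        return False
    for table in tables:
        if isinstance(table, dict) and isinstance(table.get("url"), str):
            if urljoin(metadata_url, table["url"]) == csv_url:
                return True
    return False


if __name__ == "__main__":
    meta_url = "http://example.org/data/metadata.json"
    csv_url = "http://example.org/data/tree.csv"
    unrelated = '{"name": "my-package", "version": "1.0"}'
    print(claims_to_describe(unrelated, meta_url, csv_url))
```

An unrelated metadata.json, such as the package file in the usage lines,
fails at the "@context" step, so the owner keeps free use of that name.
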
>>>>>>> Furthermore, this is *exactly* the same risk that would
>>>>>>> *already* be present if the CSVW processor started with the
>>>>>>> JSON URI instead of the CSV URI: If the JSON *accidentally*
>>>>>>> looks like CSVW metadata and *accidentally* contains the URI
>>>>>>> of an existing CSV resource, then that CSV resource will be
>>>>>>> misinterpreted, regardless of the content of .well-known/csvm
>>>>>>> , because a CSVW processor must ignore .well-known/csvm if it
>>>>>>> is given CSVW metadata to start with, as described in section
>>>>>>> 6.1:
>>>>>>> http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Right, and the way we prevent that on the Web is by giving something
>>>>>> a distinctive media type.
>>>>>>
>>>>>> AFAICT the audience you're designing this for is "CSV downloads
>>>>>> that don't have any context (e.g., a direct link, rather than
>>>>>> one from HTML) where the author has no ability to set Link
>>>>>> headers." Is that correct?
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>>
>>>>>>  Earlier, you talked about the downsides:
>>>>>>>>
>>>>>>>>  - A *required* extra web access, nearly *every* time a
>>>>>>>>> conforming CSVW processor is given a tabular data URL
>>>>>>>>> and wishes to find the associated metadata -- because
>>>>>>>>> surely http://example/.well-known/csvm will be 404 (and
>>>>>>>>> not cacheable) in the vast majority of cases.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why is that bad? HTTP requests can be parallelised, so it's
>>>>>>>> not latency. Is the extra request processing *really* that
>>>>>>>> much of an overhead (considering we're talking about a
>>>>>>>> comma- or tab- delimited file)?
>>>>>>>>
>>>>>>>
>>>>>>> It's not a big cost, but it is an actual cost, and it's
>>>>>>> being weighed against a benefit that IMO is largely
>>>>>>> theoretical.
>>>>>>>
>>>>>>
>>>>>> In isolation, I agree that's the right technical
>>>>>> determination.
>>>>>>
>>>>>> This isn't an isolated problem, however; there are lots of
>>>>>> applications trying to stake a claim on various parts of URI
>>>>>> space. The main reason that I wrote the BCP was because writing
>>>>>> protocols on top of HTTP has become popular, and a lot of folks
>>>>>> wanted to define "standard" URI paths.
>>>>>>
>>>>>> As such, this is really a problem of the commons; your small
>>>>>> encroachment might not make a big impact on its own, but in
>>>>>> concert with others — especially when the W3C as steward of the
>>>>>> Web is seen doing this — it starts to have impact.
>>>>>>
>>>>>> In ten years, I really don't want to have a list of "filenames
>>>>>> I can't use on my Web site" because you wanted to save the
>>>>>> overhead of a single request in 2015 — especially when HTTP/2
>>>>>> makes requests really, really cheap.
>>>>>>
>>>>>> Is that "theoretical"? I don't know, but I do think it's
>>>>>> important.
>>>>>>
>>>>>
>>>>> I share your concern.  I think we should be vigilant against URI
>>>>> squatting.  But although this case may look like URI squatting on
>>>>> the surface, I don't think it actually is when you dig into it.
>>>>> The fact that the content of the CSV metadata file must
>>>>> *explicitly* indicate its intent to be used as a CSV metadata
>>>>> file changes the situation in a critical way, because it means
>>>>> that you are *not* prevented from using that filename for a
>>>>> different purpose.  That file will *only* be interpreted as a CSV
>>>>> metadata file if the owner explicitly indicates that it *should*
>>>>> be interpreted that way.  That's not squatting, that's the URI
>>>>> owner rightly exercising his/her choice.
>>>>>
>>>>> The only case where there's any name conflict at all is if the
>>>>> URI owner wishes to use that URI for some other purpose *and* for
>>>>> serving CSV metadata, simultaneously.  In that case the URI owner
>>>>> would have to make a choice about how he/she chooses to use that
>>>>> particular path.  But that's like trying to install two different
>>>>> software packages in the same directory: nobody expects to be
>>>>> able to do that, because both packages might have a Make file
>>>>> called 'makefile', or some other conflict.  Plus it makes a mess
>>>>> of the directory having files of different packages intermingled.
>>>>> If someone really wants to use both software packages
>>>>> simultaneously, they install them in *different* directories.
>>>>> The same is true of CSV metadata: if you want to publish CSV data
>>>>> and metadata, using the standard metadata filename, *and* you
>>>>> want to use that same filename for some other purpose, then you
>>>>> will have to put one of them in a different directory.  No big
>>>>> deal.  That doesn't cause you to have to consult a list of
>>>>> "filenames you can't use on your website".
>>>>>
>>>>>
>>>>>>  As I pointed out earlier, you can specify a default
>>>>>>>> heuristic for 404 on that resource so that you avoid it
>>>>>>>> being uncacheable.
>>>>>>>>
>>>>>>>
>>>>>>> I doubt many server owners will bother to make that 404
>>>>>>> cacheable, given that they didn't bother to install a
>>>>>>> .well-known/csvm file.
>>>>>>>
>>>>>>
>>>>>> You misunderstand. You can specify a heuristic for the 404 to
>>>>>> be interpreted on the *client* side; it tells consumers that if
>>>>>> there's a 404 without freshness information, they can assume a
>>>>>> specified default.
>>>>>>
>>>>>
>>>>> Oh, I see.  Yes, I guess they could.
>>>>>
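
The client-side heuristic Mark describes could look something like the
sketch below: the consumer remembers a 404 for a default lifetime when the
response carries no freshness information.  The class name and the one-hour
default are my own assumptions for illustration; nothing in the thread
specifies a value.

```python
import time


class NegativeCache:
    """Remember URLs that returned 404 so they are not re-requested too soon."""

    def __init__(self, default_ttl=3600.0, clock=time.monotonic):
        self._ttl = default_ttl      # assumed default lifetime, in seconds
        self._clock = clock          # injectable so the cache can be tested
        self._expires = {}           # url -> time at which the 404 goes stale

    def remember_missing(self, url, ttl=None):
        """Record a 404; ttl overrides the default when the server gave one."""
        lifetime = self._ttl if ttl is None else ttl
        self._expires[url] = self._clock() + lifetime

    def known_missing(self, url):
        """True while a remembered 404 for url is still considered fresh."""
        expiry = self._expires.get(url)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[url]   # stale entry: forget it and ask again
            return False
        return True


if __name__ == "__main__":
    cache = NegativeCache()
    cache.remember_missing("http://example/.well-known/csvm")
    print(cache.known_missing("http://example/.well-known/csvm"))
```

With something like this in place, the extra request is paid once per site
per lifetime rather than once per CSV file.
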
>>>>>
>>>>>>  - Greater complexity in all conforming CSVW
>>>>>>>>> implementations.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't find this convincing; if we were talking about
>>>>>>>> some involved scheme that involved lots of processing and
>>>>>>>> tricky syntax, sure, but this is extremely simple, and all
>>>>>>>> of the code to support it (libraries for HTTP, Link header
>>>>>>>> parsing and URI Templates) is already at hand in most
>>>>>>>> cases.
>>>>>>>>
>>>>>>>
>>>>>>> I agree that it's not a lot of additional complexity -- in
>>>>>>> fact it's quite simple -- but it *is* additional code.
>>>>>>>
>>>>>>
>>>>>> And I find that really unconvincing. If the bar for doing the
>>>>>> right thing is so small and still can't be overcome, we're in a
>>>>>> really bad place.
>>>>>>
>>>>>
>>>>> If it really were a matter of "doing the right thing" then I'd
>>>>> agree. But as explained above, in this case I don't think it is.
>>>>> Please consider the above points, and see what you think.
>>>>>
>>>>> Thanks, David Booth
>>>>>
>>>>
>>> -- Mark Nottingham   https://www.mnot.net/
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>

Received on Thursday, 2 July 2015 19:41:13 UTC