Re: Use of .well-known for CSV metadata: More harm than good -- OPINIONS PLEASE from David Booth on 2015-07-03 (www-tag@w3.org from July 2015)

From: David Booth <david@dbooth.org>
Date: Thu, 02 Jul 2015 20:42:27 -0400
To: Andrei Sambra <andrei@w3.org>, Mark Nottingham <mnot@mnot.net>, Tim Berners-Lee <timbl@w3.org>, Daniel Appelquist <appelquist@gmail.com>, Yan Zhu <yzhu@yahoo-inc.com>, Hadley Beeman <hadley@linkedgov.org>, Peter Linss <peter.linss@hp.com>, Yves Lafon <ylafon@w3.org>, Alex Russell <slightlyoff@google.com>, Travis Leithead <travis.leithead@microsoft.com>, "www-tag@w3.org List" <www-tag@w3.org>
CC: Melvin Carvalho <melvincarvalho@gmail.com>
Message-ID: <5595DA73.8070700@dbooth.org>
Bad news: this use of .well-known is even worse than I previously 
realized.

Suppose Alice publishes some CSV data and metadata on something like
Dropbox or GitHub.  Initially the site has no /.well-known/csvm file, so
Alice uses the standard CSV metadata naming convention and publishes:

   http://.../foo.csv                 (the CSV data)
   http://.../foo.csv-metadata.json   (the CSV metadata)

Consumers of Alice's data are very happy, because their CSVW processors 
can automatically find the CSV metadata, given only the URL of the CSV 
data.  All is fine until one day the site administrator Bob, for the 
convenience of the *site*'s applications, decides to install a 
/.well-known/csvm file that specifies a NON-standard URI pattern for CSV 
metadata files.   Suddenly, and mysteriously, CSVW processors can no 
longer find the metadata for Alice's CSV data.  Alice, of course, has no 
idea that this has even happened, because she has long since moved on to 
other research projects.  And besides, it isn't her responsibility: it 
was working fine when she left.  The losers are all of the potential 
consumers of Alice's data -- the web community.  Boo.  :(
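
To make the failure concrete, here is a minimal sketch of the lookup a
CSVW processor might perform.  It assumes, as the scenario above does,
that the patterns in a site-wide /.well-known/csvm file replace the
default ones.  The function name, the example.org URL and the
"{+url}.meta" pattern are hypothetical, and only the {+url} variable is
expanded (this is not a full URI Template implementation).

```python
from urllib.parse import urljoin

# Default patterns as named in this thread; "metadata.json" is resolved
# relative to the CSV URL, i.e. in the same directory.
DEFAULT_PATTERNS = ["{+url}-metadata.json", "metadata.json"]


def candidate_metadata_urls(csv_url, site_patterns=None):
    """Return the metadata URLs a processor would try, in order.

    site_patterns stands in for the parsed contents of
    /.well-known/csvm; None means the file is absent (e.g. a 404).
    """
    patterns = DEFAULT_PATTERNS if site_patterns is None else site_patterns
    # Expand {+url}, then resolve the result against the CSV URL.
    return [urljoin(csv_url, p.replace("{+url}", csv_url)) for p in patterns]


csv_url = "http://example.org/alice/foo.csv"

# Before Bob's change: the defaults apply and Alice's file is found.
before = candidate_metadata_urls(csv_url)

# After Bob installs a site-wide file with a non-standard pattern.
after = candidate_metadata_urls(csv_url, ["{+url}.meta"])
```

In this sketch `before` starts with Alice's foo.csv-metadata.json,
while `after` no longer contains it, so her metadata is never requested
even though she changed nothing.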

The problem here is that .well-known is intended for **site-wide** 
policy data, as stipulated in RFC5785, and that is *not* the granularity 
of control that is needed or desired for file- or directory-specific CSV 
metadata.  Section 1.1 Appropriate Use of Well-Known URIs states:
https://tools.ietf.org/html/rfc5785#section-1.1

   " . . . the well-known URI space was created with the expectation
    that it will be used to make **site-wide** policy information
    and other metadata available"   [my emphasis]

To repeat: this use of .well-known: (a) is NOT needed to avoid harmful 
URI squatting; (b) violates section 1.1 of RFC5785 ("Appropriate Use of 
Well-Known URIs"); and (c) clearly causes more harm than good.   PLEASE 
speak up so that this error can be corrected.

Thanks,
David Booth


On 07/02/2015 03:40 PM, Melvin Carvalho wrote:
>
>
> On 1 July 2015 at 05:17, David Booth <david@dbooth.org
> <mailto:david@dbooth.org>> wrote:
>
>     Hi Mark (and Tim, Daniel, Yan, Hadley, Peter, Yves, Alex and Travis),
>
>     The CSVW working group appears to still be deferring to the TAG's
>     27-May-2015 suggestion[1] to use .well-known for specifying
>     non-standard CSV metadata URIs[2].  This means that, unless the WG
>     decides to throw me a bone to appease me, they will likely go ahead
>     with a decision that was based on the incorrect assumption that
>     *not* using .well-known would cause harmful URI squatting -- simply
>     because no member of the TAG has yet spoken up to acknowledge this
>     error.  Three other readers of the TAG list have acknowledged the
>     error, but thus far no TAG members have.[4][5][6]
>
>     I have previously explained[9] in some detail how the CSVW spec's
>     standard CSV metadata URI mechanism avoids harmful URI squatting, in
>     spite of first appearances.  Harmful URI squatting is caused when
>     URI owners are prevented from using their own URIs how they choose.
>     Although CSV metadata documents may use standard URI patterns, they
>     avoid harmful URI squatting by following the approach of the Self
>     Describing Web[7].  This is very similar to the way XML namespaces
>     enable XML documents to be self describing.  Where an XML document
>     would use an attribute like xmlns="http://example/foo" to further
>     indicate the document's type (beyond just being XML), a CSV metadata
>     file uses a JSON property "@context": "http://www.w3.org/ns/csvw" to
>     explicitly indicate its type.  But that's not all: a CSV metadata
>     file must *also* meet several other requirements[9] that prevent a
>     non-CSV-metadata file from being accidentally interpreted as a CSV
>     metadata file.  To drive this point home: this means that a URI
>     owner is *not* prevented from using a standard CSV metadata URI for
>     a completely different purpose of his/her choosing.
>
>     I have also previously explained the actual harms[6] (complexity,
>     extra HTTP requests, and security) that would result if the CSVW
>     spec includes such an obscure feature that so few sites are likely
>     to use.
>
>     I also put out a poll, asking who would actually use the .well-known
>     feature if it were adopted.  So far there have been exactly zero
>     responses.
>
>     Furthermore, in reviewing RFC5785, I notice that this use of
>     .well-known actually *violates* RFC5785!  Section 1.1 (Appropriate
>     Use of Well-Known URIs) explicitly states:
>
>        "well-known URIs are not intended
>         for general information retrieval or establishment of large URI
>         namespaces on the Web.  Rather, they are designed to facilitate
>         discovery of information on a site **when it isn't practical to use
>         other mechanisms**"   [my emphasis]
>
>     But in this case, it clearly *is* practical to use other mechanisms.
>     The CSVW spec already provides at least three alternate mechanisms
>     for associating CSV metadata with CSV data: (a) a Link header, which
>     can point from a CSV data URI to its corresponding CSV metadata URI;
>     (b) standard CSV metadata URI patterns; and (c) the ability to
>     publicize non-standard URIs of CSV metadata documents that link to
>     their corresponding CSV data files.  (To clarify that last
>     mechanism, one mode of use is for a user to start with a CSV data
>     URI, and from that seek the corresponding CSV metadata.  But another
>     mode of use is for a user to start with the CSV metadata URI, and
>     from that locate the corresponding CSV data.  To facilitate that
>     mode of use, data publishers have the option of publicizing their
>     CSV metadata URIs along with their CSV data, so that users can
>     easily find the CSV metadata.  Given the CSV metadata, a CSVW
>     processor can then automatically locate the corresponding CSV data,
>     because the CSV metadata file explicitly links to its corresponding
>     data file.)
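
As an illustration of mechanism (a), the following is a simplified
sketch of how a client could pull a metadata URI out of a Link header.
The rel value "describedby", the function name and the example URLs are
assumptions made for illustration, and the regular expressions handle
only simple headers, not the full Link header grammar.

```python
import re
from urllib.parse import urljoin


def metadata_from_link_header(csv_url, link_header, rel="describedby"):
    """Return the first link target whose rel matches, resolved
    against csv_url, or None if no such link is present."""
    # Each link looks like: <target>; param=value; param="value"
    for match in re.finditer(r'<([^>]*)>([^<]*)', link_header):
        target, params = match.group(1), match.group(2)
        rel_match = re.search(r'\brel\s*=\s*"?([^";,]+)"?', params,
                              re.IGNORECASE)
        if rel_match is None:
            continue
        # A rel parameter may carry several space-separated values.
        if rel.lower() in rel_match.group(1).lower().split():
            return urljoin(csv_url, target)
    return None


csv_url = "http://example.org/alice/foo.csv"
header = '<style.css>; rel="stylesheet", <foo-meta.json>; rel="describedby"'
found = metadata_from_link_header(csv_url, header)
```

A publisher who cannot set response headers has no access to this
mechanism, which is where mechanisms (b) and (c) come in.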
>
>     If any TAG members could please take the time to diligently follow
>     through this logic and speak up to right this wrong, please do so
>     *now*, before the CSVW working group irrevocably bakes this
>     misguided feature into the spec.  I will be happy to assist in any
>     way that I can, such as by answering questions or discussing it on a
>     teleconference.
>
>
> It would be great if anyone had a bit of time to look at this in
> slightly more detail.
>
> It also occurred to me that we (working with the team at DIG / MIT) have
> been using rel="meta" for quite a while now (with good results), and I
> wonder if that will become standardized in Linked Data Platform v.next,
> and is perhaps relevant here.  This stuff isn't in Linked Data Platform
> V1 but may be in V2 (there's been about a dozen people interested in
> carrying on the work).
>
> For our work, link headers are always used where possible.  But some
> places (e.g. GitHub) don't give you that kind of access.
>
> Now, looking at csvw, I wonder should we be using rel="describedBy"
> instead?  The principle behind it is adding meta data to Linked Data
> Platform files, much in the same way UNIX adds inode meta data.
>
> In our implementations of meta data we do have standard names such as
> ,meta (rather than metadata.json).  We also are thinking about
> rel="acl" for access control.  Though acl was originally tied together
> with the acl.
>
> I recognize that a lot of this is a new frontier; in many ways it's not
> clear cut, and many people won't be in a position to have a view at all.
> But it is really helpful to hear how others are doing this, and hear
> ideas, to maybe be able to get our ducks in a row! :)
>
>
>     Thanks very much,
>     David Booth
>
>     References
>     1. https://github.com/w3ctag/meetings/blob/gh-pages/2015/telcons/06-03-csv-minutes.md
>
>     2. https://github.com/w3c/csvw/issues/555#issuecomment-117019654
>
>     3. https://tools.ietf.org/html/rfc5785
>
>     4. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0023.html
>
>     5. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0019.html
>
>     6. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0011.html
>
>     7. http://www.w3.org/2001/tag/doc/selfDescribingDocuments.html
>
>     8. http://w3c.github.io/csvw/metadata/
>
>     9. https://lists.w3.org/Archives/Public/www-tag/2015Jun/0026.html
>
>     10. https://lists.w3.org/Archives/Public/public-csv-wg/2015Jun/0085.html
>
>
>     On 06/24/2015 12:52 AM, David Booth wrote:
>
>         Hi Mark,
>
>         On 06/24/2015 12:30 AM, Mark Nottingham wrote:
>
>             David,
>
>                 On 24 Jun 2015, at 2:07 pm, David Booth
>                 <david@dbooth.org <mailto:david@dbooth.org>> wrote:
>
>                 The CSVW working group is anxious to close this issue,
>                 so I'd like
>                 to ask members of the TAG: In light of the discussion
>                 below and
>                 elsewhere in this thread, does anyone still think that
>                 .well-known
>                 is necessary in this case to prevent harmful URI
>                 squatting?   If
>                 so, why?
>
>                 I maintain that this case does not represent harmful URI
>                 squatting,
>                 because URI owners are not prevented from using the
>                 standard CSVW
>                 metadata URIs ({+url}-metadata.json or metadata.json)
>                 for other
>                 purposes, because a file of that name will *only* be
>                 interpreted as
>                 a CSVW metadata file if it explicitly indicates that it
>                 *should* be
>                 interpreted that way.  So far only Mark Nottingham has
>                 expressed
>                 concerns.  (I hope that the explanations below have
>                 since allayed
>                 Mark's concerns, but I do not yet know if they have.)
>
>
>             We discussed it on a TAG call and came to
>             agreement; furthermore, I'd
>             thought that .well-known was acceptable to CSVWG as well. As
>             such, I
>             think the relevant question is if TAG members think the
>             issues you
>             raise are sufficient to reopen the discussion.
>
>
>         Right.  As far as I could tell from the minutes, it seems that the
>         assumption at the time was that harmful URI squatting would
>         result if
>         .well-known were not used.  I have now brought new information that
>         explains more fully how the CSVW metadata mechanism works,
>         which rather
>         conclusively shows that this mechanism would *not* cause harmful URI
>         squatting without the use of .well-known.  Hence the previous
>         decision
>         should be changed.
>
>
>             Again, I'm happy to have that discussion.
>
>
>         What more needs to be discussed?   Are there questions that
>         still need
>         to be answered or aspects of the mechanism that are still
>         unclear?  If
>         so, I'd like to get them on the table.
>
>         Does it need to be discussed on another call?   What information or
>         process would be helpful in getting this resolved?
>
>         Thanks,
>         David Booth
>
>
>             Cheers,
>
>
>                 Thanks, David Booth
>
>                 On 06/22/2015 02:39 AM, David Booth wrote:
>
>                     On 06/21/2015 11:08 PM, Mark Nottingham wrote:
>
>
>                             On 19 Jun 2015, at 5:32 pm, David Booth
>                             <david@dbooth.org <mailto:david@dbooth.org>>
>                             wrote:
>
>                                 And, since the Web is so big, I
>                                 certainly wouldn't rule out
>                                 a collision where it *is*
>                                 misinterpreted as metadata.
>
>
>                             It certainly is possible in theory that
>                             someone with a CSV
>                             resource at a particular URI could
>                             completely coincidentally
>                             and unintentionally create a JSON file with
>                             the exact name
>                             and exact contents -- including the URI of
>                             the CSV resource
>                             -- required to cause that JSON to be
>                             misinterpreted as
>                             metadata for the CSV file. But it seems so
>                             unlikely that
>                             virtually any non-zero cost to prevent it
>                             would be a waste.
>
>
>                     Actually, we really can rule out the possibility
>                     that a non-CSVW
>                     file would accidentally be misinterpreted as a CSVW
>                     metadata
>                     file.  For a non-CSVW file to be accidentally
>                     misinterpreted as
>                     CSVW metadata for a corresponding CSV data file,
>                     *all* of the
>                     following would have to be true of the non-CSVW file:
>
>                     - it would have to be in the same directory as the
>                     CSV data
>                     file;
>
>                     - it would have to have the name {+url}-metadata.json or
>                     metadata.json, where {+url} is the name of the CSV
>                     data file;
>
>                     - it would have to parse as JSON;
>
>                     - it would have to contain a top level JSON property
>                     called
>                     "@context", with a value of either the string
>                     "http://www.w3.org/ns/csvw" or an array containing
>                     that string;
>
>                     - it would have to explicitly reference the CSV data
>                     file; and
>
>                     - when interpreted as CSVW metadata, the schema
>                     described must
>                     be compatible with the actual schema of the CSV data
>                     file.
>                     Schema compatibility is defined as one would expect,
>                     such as the
>                     same number of columns, the same column names (where
>                     present),
>                     etc:
>                     http://w3c.github.io/csvw/metadata/#schema-compatibility
>
>                     Short of having an infinite number of monkeys
>                     typing, that just
>                     isn't going to happen accidentally.
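
The conditions listed above can be sketched as a single check.  This is
a simplified illustration, not the CSVW algorithm: the function name
and example URLs are hypothetical, the property names ("url", "tables",
"tableSchema", "columns", "name") are assumed from the CSVW metadata
vocabulary, and schema compatibility is reduced to comparing column
counts and names.

```python
import json
from urllib.parse import urljoin

CSVW_NS = "http://www.w3.org/ns/csvw"


def could_be_taken_as_metadata(metadata_url, body, csv_url, csv_header):
    """Return True only if every condition in the list above holds."""
    # Same directory and standard name.
    standard = (csv_url + "-metadata.json", urljoin(csv_url, "metadata.json"))
    if metadata_url not in standard:
        return False
    # Parses as JSON (and is an object).
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    if not isinstance(doc, dict):
        return False
    # Top-level "@context" is the CSVW namespace or an array containing it.
    context = doc.get("@context")
    if context != CSVW_NS and not (isinstance(context, list)
                                   and CSVW_NS in context):
        return False
    # Explicitly references the CSV data file.
    tables = doc["tables"] if isinstance(doc.get("tables"), list) else [doc]
    table = next((t for t in tables
                  if isinstance(t, dict) and isinstance(t.get("url"), str)
                  and urljoin(metadata_url, t["url"]) == csv_url), None)
    if table is None:
        return False
    # Simplified schema compatibility: same column count, and the same
    # names wherever the metadata gives one.
    schema = table.get("tableSchema")
    columns = schema.get("columns", []) if isinstance(schema, dict) else []
    if not isinstance(columns, list):
        return False
    if columns and len(columns) != len(csv_header):
        return False
    return all(not isinstance(col, dict) or col.get("name") in (None, name)
               for col, name in zip(columns, csv_header))


csv_url = "http://example.org/alice/foo.csv"
meta_url = csv_url + "-metadata.json"
unrelated = json.dumps({"name": "my-app", "version": "1.0"})
genuine = json.dumps({"@context": CSVW_NS, "url": "foo.csv",
                      "tableSchema": {"columns": [{"name": "id"},
                                                  {"name": "value"}]}})
```

Under these assumptions an unrelated JSON file at the standard name is
rejected at the "@context" step, which is the point being made: the
owner can still use that filename for something else.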
>
>
>                             Furthermore, this is *exactly* the same risk
>                             that would
>                             *already* be present if the CSVW processor
>                             started with the
>                             JSON URI instead of the CSV URI: If the JSON
>                             *accidentally*
>                             looks like CSVW metadata and *accidentally*
>                             contains the URI
>                             of an existing CSV resource, then that CSV
>                             resource will be
>                             misinterpreted, regardless of the content of
>                             .well-known/csvm,
>                             because a CSVW processor must ignore
>                             .well-known/csvm if it
>                             is given CSVW metadata to start with, as
>                             described in section
>                             6.1:
>                             http://w3c.github.io/csvw/syntax/#h-creating-annotated-tables
>
>
>
>
>                         Right, and the way we prevent that on the
>                         Web is by giving something a distinctive
>                         media type.
>
>                         AFAICT the audience you're designing this for is
>                         "CSV downloads
>                         that don't have any context (e.g., a direct
>                         link, rather than
>                         one from HTML) where the author has no ability
>                         to set Link
>                         headers." Is that correct?
>
>
>                     Yes.
>
>
>
>                                 Earlier, you talked about the downsides:
>
>                                     - A *required* extra web access,
>                                     nearly *every* time a
>                                     conforming CSVW processor is given a
>                                     tabular data URL
>                                     and wishes to find the associated
>                                     metadata -- because
>                                     surely
>                                     http://example/.well-known/csvm will
>                                     be 404 (and
>                                     not cachable) in the vast majority
>                                     of cases.
>
>
>                                 Why is that bad? HTTP requests can be
>                                 parallelised, so it's
>                                 not latency. Is the extra request
>                                 processing *really* that
>                                 much of an overhead (considering we're
>                                 talking about a
>                                 comma- or tab- delimited file)?
>
>
>                             It's not a big cost, but it is an actual
>                             cost, and it's
>                             being weighed against a benefit that IMO is
>                             largely
>                             theoretical.
>
>
>                         In isolation, I agree that's the right technical
>                         determination.
>
>                         This isn't an isolated problem, however; there
>                         are lots of
>                         applications trying to stake a claim on various
>                         parts of URI
>                         space. The main reason that I wrote the BCP was
>                         because writing
>                         protocols on top of HTTP has become popular, and
>                         a lot of folks
>                         wanted to define "standard" URI paths.
>
>                         As such, this is really a problem of the
>                         commons; your small
>                         encroachment might not make a big impact on its
>                         own, but in
>                         concert with others — especially when the W3C as
>                         steward of the
>                         Web is seen doing this — it starts to have impact.
>
>                         In ten years, I really don't want to have a list
>                         of "filenames
>                         I can't use on my Web site" because you wanted
>                         to save the
>                         overhead of a single request in 2015 —
>                         especially when HTTP/2
>                         makes requests really, really cheap.
>
>                         Is that "theoretical"? I don't know, but I do
>                         think it's
>                         important.
>
>
>                     I share your concern.  I think we should be vigilant
>                     against URI
>                     squatting.  But although this case may look like URI
>                     squatting on
>                     the surface, I don't think it actually is when you
>                     dig into it.
>                     The fact that the content of the CSV metadata file must
>                     *explicitly* indicate its intent to be used as a CSV
>                     metadata
>                     file changes the situation in a critical way,
>                     because it means
>                     that you are *not* prevented from using that
>                     filename for a
>                     different purpose.  That file will *only* be
>                     interpreted as a CSV
>                     metadata file if the owner explicitly indicates that
>                     it *should*
>                     be interpreted that way.  That's not squatting,
>                     that's the URI
>                     owner rightly exercising his/her choice.
>
>                     The only case where there's any name conflict at all
>                     is if the
>                     URI owner wishes to use that URI for some other
>                     purpose *and* for
>                     serving CSV metadata, simultaneously.  In that case
>                     the URI owner
>                     would have to make a choice about how he/she chooses
>                     to use that
>                     particular path.  But that's like trying to install
>                     two different
>                     software packages in the same directory: nobody
>                     expects to be
>                     able to do that, because both packages might have a
>                     Make file
>                     called 'makefile', or some other conflict.  Plus it
>                     makes a mess
>                     of the directory having files of different packages
>                     intermingled.
>                     If someone really wants to use both software packages
>                     simultaneously, they install them in *different*
>                     directories.
>                     The same is true of CSV metadata: if you want to
>                     publish CSV data
>                     and metadata, using the standard metadata filename,
>                     *and* you
>                     want to use that same filename for some other
>                     purpose, then you
>                     will have to put one of them in a different
>                     directory.  No big
>                     deal.  That doesn't cause you to have to consult a
>                     list of
>                     "filenames you can't use on your website".
>
>
>                                 As I pointed out earlier, you can
>                                 specify a default
>                                 heuristic for 404 on that resource so
>                                 that you avoid it
>                                 being uncacheable.
>
>
>                             I doubt many server owners will bother to
>                             make that 404
>                             cachable, given that they didn't bother to
>                             install a
>                             .well-known/csvm file.
>
>
>                         You misunderstand. You can specify a heuristic
>                         for the 404 to
>                         be interpreted on the *client* side; it tells
>                         consumers that if
>                         there's a 404 without freshness information,
>                         they can assume a
>                         specified default.
>
>
>                     Oh, I see.  Yes, I guess they could.
>
>
>                                     - Greater complexity in all
>                                     conforming CSVW
>                                     implementations.
>
>
>                                 I don't find this convincing; if we were
>                                 talking about
>                                 some involved scheme that involved lots
>                                 of processing and
>                                 tricky syntax, sure, but this is
>                                 extremely simple, and all
>                                 of the code to support it (libraries for
>                                 HTTP, Link header
>                                 parsing and URI Templates) is already at
>                                 hand in most
>                                 cases.
>
>
>                             I agree that it's not a lot of additional
>                             complexity -- in
>                             fact it's quite simple -- but it *is*
>                             additional code.
>
>
>                         And I find that really unconvincing. If the bar
>                         for doing the
>                         right thing is so small and still can't be
>                         overcome, we're in a
>                         really bad place.
>
>
>                     If it really were a matter of "doing the right
>                     thing" then I'd
>                     agree. But as explained above, in this case I don't
>                     think it is.
>                     Please consider the above points, and see what you
>                     think.
>
>                     Thanks, David Booth
>
>
>             -- Mark Nottingham https://www.mnot.net/
>
Received on Friday, 3 July 2015 00:43:13 UTC