RE: addressable identifier? from Cole, Timothy W on 2017-07-27 (public-publ-wg@w3.org from July 2017)

From: Cole, Timothy W <t-cole3@illinois.edu>
Date: Thu, 27 Jul 2017 19:35:44 +0000
To: Hadrien Gardeur <hadrien.gardeur@feedbooks.com>
CC: Romain <rdeltour@gmail.com>, Laurent Le Meur <laurent.lemeur@edrlab.org>, W3C Publishing Working Group <public-publ-wg@w3.org>
Message-ID: <EECC28A63F2ED74B8420079BBE599453617E043C@CITESMBX6.ad.uillinois.edu>
Matt, sorry I wrote this before seeing your reply...
________________________________
From: Hadrien Gardeur [hadrien.gardeur@feedbooks.com]
Sent: Thursday, July 27, 2017 13:55
To: Cole, Timothy W
Cc: Romain; Laurent Le Meur; W3C Publishing Working Group
Subject: Re: addressable identifier?

On the Web, even in a Semantic Web / RDF context, the idea of canonical has just not held up well in most general contexts, in spite of the fact that you do see canonical as a value for many link rel attributes. Persistence of URL(s) for a Web Publication is critical, but Web resources, like people, in spite of our preferences, may be known by more than a single, canonical identifier (name).  I would argue we may not want to make a single canonical URL a core requirement - though we certainly could encourage it as a best practice.

If we don't, we'll still need a way to figure out that two URIs are basically serving the same manifest for a given publication.

Ideally, yes, but this is an inherent problem on the Web and I'm worried about trying to do this in the specs this group is creating. For example, publishers will continue to use DOIs. Assigning a DOI implicitly mints a URL, e.g., 10.17226/18619 implies the URL https://doi.org/10.17226/18619 . This URL on the publisher's website gets you to the same page: https://www.nap.edu/catalog/18619/ .  Searching the Web for the title of this book returns this URL as the highest-ranked result: https://www.nap.edu/catalog/18619/developing-a-21st-century-global-library-for-mathematics-research , which is also the same page. The publisher (NAS) in this case also published the report simultaneously on arxiv.org, so here's another URL: https://arxiv.org/abs/1404.1905 (2 different publications?).

And users will tend to bookmark this URL: https://www.nap.edu/read/18619/ or this URL: https://arxiv.org/ftp/arxiv/papers/1404/1404.1905.pdf , which is the second-ranked result on Google when you do a title search.

Which URL would be best as the canonical one? Is this one publication or two?
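
To make the difficulty concrete, here is a minimal Python sketch using only the URLs quoted above. The doi_to_url helper reflects the DOI-to-URL mapping mentioned in the message; naive_key is a hypothetical comparison key (host plus path, trailing slash removed) invented purely for illustration, not anything proposed in this thread.

```python
from urllib.parse import urlsplit


def doi_to_url(doi: str) -> str:
    """Return the resolver URL implied by a DOI."""
    return "https://doi.org/" + doi


# URLs quoted in the message for the report under discussion.
urls = [
    doi_to_url("10.17226/18619"),
    "https://www.nap.edu/catalog/18619/",
    "https://www.nap.edu/catalog/18619/developing-a-21st-century-global-library-for-mathematics-research",
    "https://arxiv.org/abs/1404.1905",
    "https://www.nap.edu/read/18619/",
    "https://arxiv.org/ftp/arxiv/papers/1404/1404.1905.pdf",
]


def naive_key(url: str) -> tuple[str, str]:
    """Hypothetical comparison key: lower-cased host plus path without trailing slash."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"))


# Syntactic normalization alone cannot tell that these URLs are related.
distinct = {naive_key(u) for u in urls}
print(len(distinct))
```

Every URL yields a different key, so any claim that they identify the same publication would have to come from metadata the publisher supplies rather than from the URLs themselves.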

I certainly appreciate the desire to avoid URL Pollution (reference earlier email about the TAG publications), but I am not confident that we can anticipate all the real-world scenarios well enough to settle on a definitive definition of canonical URL for generic Web Publications. I'm just trying to consider the scope of our task.

The HTML versus non-HTML manifest URL question raised earlier in this thread could potentially also be dealt with more as a matter of serialization and composition than as a binary, one-or-the-other choice.  For a UA wanting HTML, the URL pointing to the manifest could result in an HTML representation that embeds the manifest, e.g., as a script element with type="application/json" or type="application/ld+json". This representation (if the publisher so desired) could then also include or embed by reference the 'primary' resource, so that to a Web browser user it looks like he or she has received the content of interest right away.  Machines reading this HTML would see the manifest in addition to the primary resource, or could just request the manifest alone through content negotiation (if the publisher wished to offer this option, or if we make a manifest serialized in a specific non-HTML format a minimum requirement).
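
A rough sketch of how a machine client might pull an embedded manifest out of such an HTML representation, using only Python's standard library. The sample page and the manifest keys ("name", "readingOrder") are made up for illustration; no manifest vocabulary has been agreed in this thread.

```python
import json
from html.parser import HTMLParser

# Script types mentioned above as possible carriers for the manifest.
MANIFEST_TYPES = {"application/json", "application/ld+json"}


class ManifestExtractor(HTMLParser):
    """Collect the first JSON data block found in a script element."""

    def __init__(self):
        super().__init__()
        self._in_manifest = False
        self._chunks = []
        self.manifest = None

    def handle_starttag(self, tag, attrs):
        is_manifest = tag == "script" and dict(attrs).get("type") in MANIFEST_TYPES
        if is_manifest and self.manifest is None:
            self._in_manifest = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_manifest:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_manifest:
            self._in_manifest = False
            self.manifest = json.loads("".join(self._chunks))


def extract_manifest(html: str):
    """Return the embedded manifest as a dict, or None if there is none."""
    parser = ManifestExtractor()
    parser.feed(html)
    parser.close()
    return parser.manifest


# Hypothetical page: manifest embedded next to the 'primary' resource.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Example publication</title>
<script type="application/ld+json">
{"name": "Example publication", "readingOrder": ["chapter1.html"]}
</script>
</head>
<body><p>Primary resource content.</p></body>
</html>"""

manifest = extract_manifest(PAGE)
```

A browser user would simply see the body content, while a client running something like extract_manifest would get the manifest from the same response.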

This has been discussed on Github previously: content negotiation is by far the most elegant solution, but we also can't expect that we'll always have that option available.

For the manifest, there's no clear consensus yet between:

  *   external manifest as a JSON representation
  *   embedded manifest in an HTML document (probably using script)

It seems that there are more people leaning towards external manifest, but Florian Rivoal for example has advocated in favour of an embedded manifest in an HTML document.
I'm not sure the choice is defined correctly.  I also like the idea of a URL that allows my software to retrieve a JSON representation of a manifest.  Given that content negotiation is possible, this does not preclude a publisher providing HTML with the manifest embedded to those clients who request the URL with an Accept header that says I want HTML in preference to JSON.  If you want JSON, you should be expected to request it using a client that says it prefers JSON.   Is this really too much to ask?
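
As a rough illustration of the negotiation described above, the following Python sketch picks between a JSON manifest and an HTML page based on the Accept header. It is a simplified reading of HTTP content negotiation: it ignores media-type parameters other than q and specificity ordering, and the choice to fall back to JSON on ties or when no Accept header is sent is an assumption made here, not something agreed in this thread.

```python
def parse_accept(header: str) -> dict[str, float]:
    """Parse an Accept header into {media_type: q}, ignoring other parameters."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def quality(prefs: dict[str, float], media_type: str) -> float:
    """Return the client's q-value for media_type, honoring type/* and */*."""
    main = media_type.split("/")[0]
    for candidate in (media_type, main + "/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def choose_representation(accept: str, offered=("application/json", "text/html")):
    """Pick the offered type the client prefers; ties go to the first offered.

    Returns None when the client accepts none of the offered types.
    """
    prefs = parse_accept(accept) if accept.strip() else {"*/*": 1.0}
    best = max(offered, key=lambda t: quality(prefs, t))
    return best if quality(prefs, best) > 0 else None


# A typical browser-style header prefers HTML; a JSON client asks for JSON.
browser = choose_representation("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
machine = choose_representation("application/json")
```

With this arrangement a single URL serves both audiences: browser gets "text/html" and machine gets "application/json".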

As said, we have options here, I would just not want to be overly prescriptive when not required.

We do have options, but having too many options can always be confusing. We need to agree at least on how the manifest will be accessed (external or embedded).

Hadrien
Received on Thursday, 27 July 2017 19:36:19 UTC