Re: [locators] High-level thoughts from Ivan Herman on 2016-02-01 (public-digipub-ig@w3.org from February 2016)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 1 Feb 2016 14:31:44 +0100
To: Bill Kasdorf <bkasdorf@apexcovantage.com>
Cc: Ben De Meester <ben.demeester@ugent.be>, W3C Digital Publishing IG <public-digipub-ig@w3.org>, Luc Audrain <laudrain@hachette-livre.fr>
Message-Id: <DDD690A5-406C-4479-8FE8-442A33500DD4@w3.org>
Hey Bill,

As you say below, I am uncomfortable taking on this whole issue of ownership in the case of a PWP… The way I see it (and this may be an oversimplification), here is a way to move forward:

- A PWP has, as we agreed upon, an identifier. This is cast in concrete, stored in the (conceptual) manifest for the PWP.

- I have a specific PWP *instance* which has a locator. Let us call this one PWP1. (Let us put aside whether it has an abstract locator as well as a locator for, say, a packed state and an unpacked state, ie, it may have several, closely related locators.) The locator to PWP1 gives me access, via HTTP (or file reading), to the content. PWP1 may have rights attached to it, eg, I can add an annotation to it. The locator to PWP1 is accessible from within the specific state via the manifest, as is the identifier.

- I may make a copy of PWP1, eg, because I install it for a class. Let us call that instance PWP2. By doing so, I may be able to give each student the right to add annotations for the whole class. In my mental model, PWP1 is different from PWP2, ie, it has a different locator. It uses the same identifier, of course, but it is a different instance.

(It is the same action as copying a Web site and installing it on my server. It has the same content, but it is on a different URL and with different rights because I can modify it.)

- Systems *may* add additional metadata to the copies if it is important to trace back where they come from. Eg, when I install a new copy for the classroom, the installation process may add to PWP2 a reference to PWP1. If I create a PWP3, that may have a reference to PWP2 and maybe even to [PWP1,PWP2] (ie, giving the full breadcrumb trail of the changes).
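
To make the mental model above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class name `PwpInstance`, the `derived_from` field, and all identifiers and URLs are hypothetical, not anything we have agreed on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PwpInstance:
    identifier: str                               # shared by every copy of the PWP
    locator: str                                  # specific to this instance
    derived_from: Optional["PwpInstance"] = None  # optional provenance reference


def copy_instance(source: PwpInstance, new_locator: str) -> PwpInstance:
    """Install a copy: same identifier, new locator, reference to the source."""
    return PwpInstance(source.identifier, new_locator, derived_from=source)


def breadcrumb(instance: PwpInstance) -> List[str]:
    """Locators of all ancestors of an instance, oldest first."""
    trail: List[str] = []
    current = instance.derived_from
    while current is not None:
        trail.append(current.locator)
        current = current.derived_from
    return list(reversed(trail))


# Hypothetical identifier and locators, for illustration only.
pwp1 = PwpInstance("urn:example:pwp:42", "https://publisher.example/pwp/")
pwp2 = copy_instance(pwp1, "https://school.example/class-a/pwp/")
pwp3 = copy_instance(pwp2, "https://school.example/class-b/pwp/")
```

Here `breadcrumb(pwp3)` lists the locators of PWP1 and PWP2, while all three instances keep the same identifier. A real system would more likely record this with something like the provenance vocabulary than with an in-memory reference.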

I realize that this is a simplification. But I would prefer to leave the details out of our discussion, just providing the placeholder where systems may use complex information (eg, using the provenance vocabulary). But it should not affect the architecture of PWP-s more than that imho.

Note that Luc was raising similar questions on the call. Maybe these thoughts are relevant to him, too.

WDYT?

Ivan



> On 28 Jan 2016, at 20:53, Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:
> 
> Thanks Ben—not just for doing this, but for taking over the leadership of the TF, which was obviously a really good move! ;)
> 
> I really like the summary (your short version). Very helpful!
> 
> As to an issue that I am really concerned about:
> 
> I agree that FRBR is about identity, not location. However, I think the locator issue is inextricably bound up with the version issue, which is an identity issue. One complication that keeps nagging at me in our discussions is that on the one hand, we want a person to be able to annotate or otherwise alter a PWP while considering it the "same" PWP. That could be an individual; it could be a group of students using the PWP in a course; it could be a group of employees or another work group; etc.
> 
> The point I keep coming back to is that _that means the PWP is no longer the same [in the sense of "identical"] PWP_ even if it "lives" at the same location, i.e. has the same URL.
> 
> Obviously, the "owner" of the original PWP and its locator must control what can or cannot change about that PWP before it is considered a different PWP. And if NO alteration is made to the PWP, then the locator should not be changed (I would even argue for a MUST; but see "Cool URIs don't change" at [1]). And I realize we are now getting on to the very thin ice of rights, but that is for the about-to-exist POE WG to address. We can't get into whether somebody has _permission_ to change the PWP (and if so, how); but I think we need to address what happens if somebody _does_ alter a PWP. I'm guessing there will be significant disagreement on whether adding an annotation is considered a sufficient alteration. (I'm inclined to think the "owner" of the original PWP needs to make that determination, which gets us back into POE land, and I don't mean Baltimore. E.g., a professor takes a PWP from a publisher and makes it available to her class for discussion; the discussion happens by annotating the PWP; thus the professor has created a new PWP but when her students annotate _that_ PWP it stays the same PWP. Guess I just suggested a use case.)
> 
> Bottom line, as I said on the call yesterday, I think we need to separate three issues wrt the stability of a PWP:
> --State (I like the hierarchical concept of an abstract canonical URL onto which the state is designated, Ivan's basic model, with your proviso that a URL has to include the state designation but the content of all the states has to be identical and reachable from any of the others)
> --Owner (see above; gets into the really murky waters of what "ownership" means)
> --Alteration (does _any_ change of a single bit make it a new PWP? That's what EPUB 3's "unique identifier" does, via timestamp. If not, then what changes to a PWP require that it be considered a new PWP?)
> 
> While these aren't fundamentally locator issues, I think they're pretty fundamentally PWP issues. And I know we're uncomfortable getting into some of these swampy areas. But I worry about dodging them and just kicking the can down the road, i.e. sowing confusion and messy usage once PWP is a spec.
> 
> [Side note: in discussions about identity, people often use ISBN as an example of identity and think of it as identifying "the book" in the abstract. It doesn’t. An ISBN is a _product_ identifier for the supply chain. Some sectors of publishing do have identifiers that serve as work identifiers—e.g., the CrossRef DOI for scholarly articles and other publications, which applies to all formats of a given article; EIDR in the entertainment industry—but the ISBN, a book industry identifier, is a product identifier. The book industry is still struggling with the issue of a work identifier; ISTC has just been revived in ISO, but who knows where that will go; as of now there isn't one for books, and besides PWP is for lots of things that aren't books anyhow. This is not to say that a PWP _itself_ couldn't have an ISBN; if it's a PWP of a book, it should. But it should NOT be the same ISBN as the ISBNs of the corresponding hardback, paperback, audiobook, KF8, or . . . ahem . . . PDF; each of those must have its own ISBN, because they're different formats, different products for the supply chain. And when you start tweaking PWPs by changing/altering/adding to their content (see above) . . . argh, technically, they're supposed to have different ISBNs.]
> 
> I realize you know all this, Ben, but I think some folks in the IG may not.
> 
> --Bill Kasdorf
> 
> [1] http://www.w3.org/Provider/Style/URI.html
> 
> 
> From: Ben De Meester [mailto:ben.demeester@ugent.be]
> Sent: Thursday, January 28, 2016 5:11 AM
> To: DPUB mailing list (public-digipub-ig@w3.org)
> Subject: [locators] High-level thoughts
> 
> Hi all,
> 
> Based on the discussion yesterday, I have been musing, and drafted my thoughts below.
> They are insanely long, sorry for that; the short version is that I make the following statements:
> 
> * A PWP locator can be absolute or relative.
> * The relative locator allows linking to resources once you know where the PWP is located
>   * and can be derived using the PWP manifest
> * The absolute locator consists of the relative locator and the PWP locator.
> * The PWP locator is always in a certain state (e.g., locally unpacked, or hosted packed, or ...)
> * However, all instantiations of the PWP link back to the state-less, abstract PWP, via its Canonical URL
> * and that Canonical URL needs to point to at least one instantiation of a PWP.
> * Thus, a PWP can be referenced using its specific instantiation, or via its Canonical URL.
> 
> All of these statements are open to debate of course :).
> 
> Also: @Romain: could you give an update to the current state of the use cases, and how we can help you?
> 
> Greetings,
> Ben
> 
> ## States - scope
> 
> As per the current state of the PWP WD,
> we scope this work specifically to the case where a PWP can have different states (packed/unpacked, protocol/file),
> but otherwise, the contents of the PWP are exactly the same across those states.
> 
> Locating content between PWPs that have different contents (e.g., in another language, or an earlier version)
> is currently out of scope.
> Things such as the FRBR model are out of scope,
> as this is more about identifiers than about locators.
> 
> Also, by locators we mean locators of (entire) PWPs and/or individual resources inside the PWP.
> For more fine-grained locations (e.g., the second paragraph of document X),
> other efforts are going on, e.g., in the annotation working group.
> 
> ## Remark: Absolute vs Relative
> 
> As far as I see it, it is possible to have relative and absolute locators,
> where relative locators will mostly (exclusively?) be used inside the PWP,
> and absolute locators might be used for internal links,
> but probably mostly for external sources linking to the PWP.
> 
> As such, I think of a locator as having two parts:
> [PWP locator]*[resource locator]
> 
> In the case of a relative locator, the [PWP locator] is missing,
> and needs to be derived from context.
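
One way to read the two-part locator above, as a minimal sketch: if both parts are URL-like, then deriving the missing [PWP locator] from context amounts to ordinary relative URL resolution. The helper name `absolute_locator` and all example locations are made up for illustration, and how a packed PWP would be addressed is deliberately left out.

```python
from urllib.parse import urljoin


def absolute_locator(pwp_locator: str, resource_locator: str) -> str:
    """Combine a [PWP locator] with a relative [resource locator]."""
    return urljoin(pwp_locator, resource_locator)


# The same relative locator, resolved against two hypothetical PWP locations.
relative = "chapters/section2.html"
over_http = absolute_locator("https://example.com/books/pwp/", relative)
on_disk = absolute_locator("file:///home/ben/pwp/", relative)

print(over_http)  # https://example.com/books/pwp/chapters/section2.html
print(on_disk)
```

The relative part stays the same when the PWP is moved or copied; only the [PWP locator] taken from context changes.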
> 
> ### Internal links
> 
> Inside the PWP
> > i.e., inside the 'container' that holds all contents of the PWP,
> > for a packed PWP, this is straightforward, i.e., inside the package,
> > for an unpacked PWP,
> > I mean inside the subfolder, whether it is file or protocol state
> 
> `<p>See <a href="[resource locator]">Section 2</a> for more info.</p>`
> 
> Q1: Is this locator the same when
> 
> (* Q1a. section 2 is the same file)
> * Q1b. section 2 is a different file, but within the same PWP
> * Q1c. the PWP is opened protocol/unpacked
> * Q1d. the PWP is opened file/packed
> * Q1e. the PWP is opened protocol/packed
> * Q1f. the PWP is opened file/unpacked
> * Q1g. the PWP is opened in a different protocol (e.g., via http or https or ftp)
> * Q1h. the PWP is moved/copied protocol-wise (e.g., from example.com to books.org)
> * Q1i. the PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to /user/home/bjdmeest/)
> * Q1j. the PWP is packed vs unpacked
> 
> ### External links
> 
> From a (online) website/ (offline) paper/...
> 
> `<p>John et al. describe an <a href="[PWP locator][resource locator]">interesting algorithm</a> for this problem.</p>`
> 
> Q2: Is this locator the same when
> 
> * Q2a. The referring document is actually inside the PWP
> * Q2b. The referred PWP is accessed protocol/unpacked
> * Q2c. The referred PWP is accessed file/packed
> * Q2d. The referred PWP is accessed protocol/packed
> * Q2e. The referred PWP is accessed file/unpacked
> * Q2f. The referred PWP is accessed in a different protocol (e.g., via http or https or ftp)
> * Q2g. the referred PWP is moved/copied protocol-wise (e.g., from example.com to books.org)
> * Q2h. the referred PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to /user/home/bjdmeest/)
> * Q2i. the referred PWP is packed vs unpacked
> 
> ## Idea
> 
> Personally, I see this as two different problems, i.e.,
> the [PWP locator] depends on the protocol the PWP is in,
> whereas the [resource locator] depends on how the packed vs unpacked PWP should be accessed.
> To me, the [resource locator] is more technical, i.e., once you have the PWP,
> you can (probably via the manifest) access and link to the individual resources.
> Given the discussion yesterday, I see the following high-level model, to solve the [PWP locator]:
> 
> 1. Most importantly, a PWP consists of a Canonical URL and some resources.
> 2. The identifiers of a PWP are, e.g., ISBNs, but could coincide with this Canonical URL
> 3. The Canonical URL is the reference to the abstract PWP, whereas different State URLs refer to specific instantiations of that PWP
> 4. The Canonical URL does not need to be on the same online place as the actual PWP (cfr. DOI)
> 5. The State URLs could be, e.g., the packed version on the publisher's website, the unpacked version on the publisher's website
> 6. or the URL of the local copy of the downloaded PWP
> 
> When referencing a publication, the user can reference the Canonical URL or the state URL.
> When referencing the state URL, the Canonical URL could be found, as it is part of the PWP.
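
The Canonical URL / State URL model above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `AbstractPwp`, the state names used as dictionary keys, and all URLs are hypothetical, and nothing here says how the mapping would actually be stored or served.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AbstractPwp:
    canonical_url: str
    # Maps a state name (e.g. "protocol/unpacked") to a State URL.
    state_urls: Dict[str, str] = field(default_factory=dict)

    def resolve(self, preferred_state: Optional[str] = None) -> str:
        """Return a State URL for the abstract PWP.

        Raises LookupError when the Canonical URL has no instantiation,
        i.e. only the abstract PWP exists, but no real content.
        """
        if not self.state_urls:
            raise LookupError("Canonical URL points to no instantiation")
        if preferred_state in self.state_urls:
            return self.state_urls[preferred_state]
        # Fall back to the first registered instantiation.
        return next(iter(self.state_urls.values()))


# Hypothetical URLs, for illustration only.
pwp = AbstractPwp(
    canonical_url="https://canonical.example/pwp/42",
    state_urls={
        "protocol/unpacked": "https://publisher.example/pwp/",
        "protocol/packed": "https://publisher.example/pwp.pack",
    },
)
```

A reference to the Canonical URL can then be turned into a concrete State URL with `pwp.resolve("protocol/packed")`, while each instantiation would carry `canonical_url` so that the reverse lookup is also possible.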
> 
> ### (technical) TODOs
> 
> Systems need to be in place to make sure the Canonical URL can refer to at least one state URL,
> as otherwise only the abstract PWP exists, but no real content.
> 
> It should be specified how a PWP references the Canonical URL.
> 
> It should be specified how to access and link to specific resources in a PWP, via some kind of manifest.
> 
> ### Fun things
> 
> Fun thing #1: the most minimal website can already be a PWP, namely:
> the Canonical URL is also a State URL to the unpacked protocol version of the PWP.
> 
> Fun thing #2: a user can remix the local PWP as much as he likes -- e.g.,
> stripping out all the videos to create a 'slim' PWP and republishing it --
> the remixed PWP could still refer to the 'official' PWP via its Canonical URL,
> and the publisher still keeps authority on 'correct' PWPs,
> as the Canonical URL does not need to refer to the remixed PWP, but only to the authorized PWPs.
> Add in checksums etc., and any user can verify whether a received PWP is the same as the published PWP.
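
As a rough sketch of the checksum idea above: a digest over all resources of a PWP lets anyone compare a received copy with the published one. The function name `pwp_digest`, the choice of SHA-256, and the in-memory dictionary of resources are all assumptions made for illustration; a real design would have to define how the resources are enumerated and where the published digest lives.

```python
import hashlib
from typing import Dict


def pwp_digest(resources: Dict[str, bytes]) -> str:
    """Digest over resource paths and contents, independent of dict order."""
    digest = hashlib.sha256()
    for path in sorted(resources):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content hash
        digest.update(hashlib.sha256(resources[path]).digest())
    return digest.hexdigest()


# Hypothetical contents of a published PWP and two received copies.
published = {"index.html": b"<p>Hello</p>", "media/intro.mp4": b"\x00\x01\x02"}
faithful_copy = {"media/intro.mp4": b"\x00\x01\x02", "index.html": b"<p>Hello</p>"}
slim_remix = {"index.html": b"<p>Hello</p>"}  # videos stripped out

same_as_published = pwp_digest(faithful_copy) == pwp_digest(published)
remix_matches = pwp_digest(slim_remix) == pwp_digest(published)
```

With these inputs the faithful copy matches the published digest and the 'slim' remix does not, even though both could still point to the same Canonical URL.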
> 
> ### Bad things
> 
> Bad thing #1: there is an insane amount of pressure on the Canonical URL.
> If this URL dies, then all instantiations of the PWP are disconnected.
> 
> Ben De Meester
> Researcher Semantic Web
> Ghent University - iMinds - Data Science Lab | Faculty of Engineering and Architecture | Department of Electronics and Information Systems
> Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
> t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL: http://users.ugent.be/~bjdmeest/

----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Monday, 1 February 2016 13:32:10 UTC