Re: [locators] High-level thoughts from Ivan Herman on 2016-02-01 (public-digipub-ig@w3.org from February 2016)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 1 Feb 2016 14:14:00 +0100
To: Ben De Meester <ben.demeester@ugent.be>
Cc: W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-Id: <28E6B1ED-0974-4386-BE39-6516EC8B6580@w3.org>
Hi Ben,

thanks for starting this. My apologies for the late reply, but I had some personal issues last week that had priority…

One general comment, not really on the mail below, but more a couple of warning flags that we should keep in mind.

* We should restrict ourselves, whenever we can, to the Web usage. What this means is that we should try to avoid reinventing the wheel, and always remember that, at the end of the day, we are mostly talking about HTTP(S) URL-s. (I say 'mostly', because the issue of packaging is clearly muddying the waters a bit in this respect.) What this also means is that we will have to 'translate' whatever we do back onto the world and terminology of URL-s. This is also why I always try to look at the issues in a very down-to-earth, operational way: what happens if I issue an HTTP GET on a specific, ehem, locator? What do I get back? What other HTTP verbs are possible on that locator (e.g., PUT)? Etc.

* We agreed to try to keep away from specific design/implementation tools. But, again, we are talking about *Web* Documents. Service Workers are a specific implementation vehicle but I tend to look at them in a slightly more general way. Regardless of the details, what SW-s do is to provide a level of abstraction whereby offline and online use become indistinguishable. *That* abstraction is very useful I believe. If we did not have SW, the fundamental goal of the discussion of PWP (ie, the complete and smooth transition from offline to online, from packed to unpacked) would still require some sort of an abstraction that, the JavaScript details put aside, would lead to the very same principles as SW-s do. The main point is that, seen from a certain level of abstraction, there should be no difference between packed and unpacked, or online vs. offline; a lower level layer should make these differences disappear. I think that kind of abstraction can be helpful, although the devil is obviously in the details (very much so…).
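
To illustrate the kind of abstraction meant here, a minimal Python sketch follows. It says nothing about how Service Workers are actually implemented; the class names `UnpackedPWP` and `PackedPWP`, the helper `pack`, and the file names are all made up for the example. The only point is that a caller using `get()` with a relative path sees the same content in both states.

```python
import io
import zipfile


class UnpackedPWP:
    """Resources held individually (stands in for a directory or a Web server)."""

    def __init__(self, files):
        self._files = dict(files)

    def get(self, relative_path):
        return self._files[relative_path]


class PackedPWP:
    """The same resources, stored inside a zip archive."""

    def __init__(self, zip_bytes):
        self._zip = zipfile.ZipFile(io.BytesIO(zip_bytes))

    def get(self, relative_path):
        return self._zip.read(relative_path)


def pack(files):
    """Build an in-memory zip from a {relative_path: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


# Hypothetical publication content.
files = {
    "index.html": b"<p>Hello</p>",
    "chapters/ch2.html": b"<p>Section 2</p>",
}
unpacked = UnpackedPWP(files)
packed = PackedPWP(pack(files))

# A caller that only uses get() cannot tell the two states apart.
same = all(unpacked.get(path) == packed.get(path) for path in files)
```

The lower layer (here, the two classes) absorbs the packed/unpacked difference; everything above it works with relative paths only.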

More comments below


> On 28 Jan 2016, at 11:11, Ben De Meester <ben.demeester@ugent.be> wrote:
> 
> Hi all,
> 
> Based on the discussion yesterday, I have been musing, and drafted my thought below.
> It is insanely long, sorry for that, the short version is that I make following statements:
> 
> * A PWP locator can be absolute or relative.

+1

I think the way you use the term, below, of a "PWP Locator" is a Locator (yes, essentially a URL!) to the PWP as a whole. I hope that I understand you o.k. on that.

> * The relative locator allows linking to resources once you know where the PWP is located
>   * and can be derived using the PWP manifest

+0.9.

The remaining 0.1 is to say that it is not necessarily the PWP manifest. If I look at it from the Web's point of view, the relative locator within a resource is a relative URL and if the PWP Locator is the URL of the document as a whole and is used to identify the 'root', then I do not need any extra manifest. The combination of the two yields the absolute URL of that resource.

> * The absolute locator consists of the relative locator and the PWP locator.

I would be careful: it sounds as if it was a simple concatenation of the two. I prefer the word 'combine'.
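
A small Python sketch of the difference, using the standard `urllib.parse.urljoin` (which applies the usual relative URL resolution rules). The `example.org` URLs are invented for the illustration.

```python
from urllib.parse import urljoin

# Hypothetical PWP Locator and a relative locator found inside a resource.
pwp_locator = "https://example.org/books/moby-dick/"
relative_locator = "../shared/style.css"

# Plain string concatenation leaves the ".." segment unresolved.
concatenated = pwp_locator + relative_locator

# Proper combination resolves the relative reference against the base.
combined = urljoin(pwp_locator, relative_locator)

print(concatenated)  # https://example.org/books/moby-dick/../shared/style.css
print(combined)      # https://example.org/books/shared/style.css
```

For the simplest cases (a base ending in `/` and a plain relative path) the two results coincide, which is probably why concatenation is a tempting way to describe it.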

> * The PWP locator is always in a certain state (e.g., locally unpacked, or hosted packed, or …)

I am not sure I understand what you mean. To be precise, a "locator" does not have a state. It is, ehem, (like:-) a URL that refers to some resource that I can act upon with an HTTP GET.

The *PWP* may be in a state. The resource referred to by the locator may have a state. But the locator itself doesn't, in my view.


> * However, all instantiations of the PWP link back to the state-less, abstract PWP, via its Canonical URL
> * and that Canonical URL needs to point to at least one instantiation of a PWP.

These two statements are in fact one, and I tend to agree with them. I would use a slightly different terminology: referring back to the model I had on this on the call: there is a Canonical Locator (ie, Canonical URL) for the PWP and, somehow, any state of the PWP should carry this information.

> * Thus, a PWP can be referenced using its specific instantiation, or via its Canonical URL.

Referring back to the previous note that I was disagreeing with, let me try to rephrase to see if we have an agreement.

* There is an absolute, canonical URL that refers to a PWP. Ie, if I do an HTTP GET, what is returned is the PWP in some state
* For each available state there is a separate, absolute URL that refers to the PWP in that particular state. Eg, there is a Locator to identify a .zip version, maybe there is a Locator to identify a .tar.gz version, and again one that simply refers to the information directly accessible on the Web.

Is this what you meant?

> 
> All of these statements are open to debate of course :).
> 
> Also: @Romain: could you give an update to the current state of the use cases, and how we can help you?
> 
> Greetings,
> Ben
> 
> ## States - scope
> 
> As per the current state of the PWP WD,
> we scope this work specifically that a PWP can have different states (packed/unpacked, protocol/file),
> but otherwise, the contents of the PWP is exactly the same across those states.

Not only +1 to this, but I think this is very important, and I refer back to my debate with Romain on this. The various states, that may have their distinct locators/URI-s, contain exactly the same abstract content. From a certain level of abstraction (see my note on Service Workers at the beginning of the mail) they are simply identical. (That is why I maintain that they are perfectly fine targets for content negotiation.)
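
A rough Python sketch of what content negotiation over the states could look like. Everything here is an assumption made for illustration: the `STATES` table, the `example.org` URLs, and the helper names `parse_accept` and `negotiate`. The `Accept` handling is deliberately simplified (q-values only, no wildcards such as `*/*`).

```python
# Hypothetical state URLs behind one canonical URL, keyed by media type.
STATES = {
    "text/html": "https://example.org/books/moby-dick/index.html",
    "application/zip": "https://example.org/books/moby-dick.zip",
}


def parse_accept(header):
    """Return (media_type, q) pairs, highest preference first."""
    preferences = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        preferences.append((parts[0], q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


def negotiate(accept_header, default="text/html"):
    """Pick the state URL matching the client's preferred media type."""
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in STATES:
            return STATES[media_type]
    return STATES[default]


packed_url = negotiate("application/zip")
unpacked_url = negotiate("application/zip;q=0.5, text/html")
fallback_url = negotiate("image/png")
```

A GET on the canonical URL would then return (or redirect to) whichever state the client prefers, while the abstract content stays the same.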

> 
> Locating content between PWPs that have different contents (e.g., in another language, or an earlier version),
> is currently out of scope.

+1

> Things such as the FRBR model are out of scope,
> as this is more about identifiers than about locators.
> 

Not sure about that (see also BillK's email). And I am not sure FRBR is about identifiers. It is about more abstract notions all right, but the FRBR abstractions, or a simplified version thereof, may become useful. Let us not consider this out of scope in this sense.

> Also, by locators we mean locators of (entire) PWPs and/or of individual resources inside the PWP.

I guess the first is what you called 'PWP Locator', right?


> For more fine-grained locations (e.g., the second paragraph of document X),
> other efforts are going on, e.g., in the annotation working group.
> 

Indeed, but more specifically that is what fragment identifiers are for. Ie, our locators are there to refer to (and use in HTTP GET) the individual resources, and they can be combined, eg, with fragment identifiers, for a finer granularity. *At this moment* this is all that we should/would say.
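
As a small illustration in Python (standard library only; the URL and the fragment `para-2` are made up): the locator part is what goes into the HTTP GET, while the fragment is interpreted by the client once the resource has been retrieved.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical canonical URL and a link with a fragment identifier.
canonical = "https://example.org/books/moby-dick/"
link = urljoin(canonical, "chapters/ch2.html#para-2")

# Split the link into the part used for retrieval and the fragment.
resource_url, fragment = urldefrag(link)

print(resource_url)  # https://example.org/books/moby-dick/chapters/ch2.html
print(fragment)      # para-2
```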


> ## Remark: Absolute vs Relative
> 
> As far as I see it, it is possible to have relative and absolute locators,
> where relative locators will mostly (exclusively?) be used inside the PWP,

Note that, so far, there is no restriction within the definition of a PWP whereby for intra-PWP references only relative locators could be used. We may get there, but we do not have it now.

> and absolute locators might be used for internal links,

Can they? Should that be allowed? See my comment above.


> but probably mostly for external sources linking to the PWP.

You mean to identify a resource within a PWP? I am not sure what you mean here.

> 
> As such, I think of a locator as having two parts:
> [PWP locator]*[resource locator]
> 
> In the case of a relative locator, the [PWP locator] is missing,
> and needs to be derived from context.

+1 with the caveat of 'combination' and not concatenation.

> 
> ### Internal links
> 
> Inside the PWP
> > i.e., inside the 'container' that holds all contents of the PWP,
> > for a packed PWP, this is straightforward, i.e., inside the package,
> > for an unpacked PWP,
> > I mean inside the subfolder, whether it is file or protocol state
> 
> `<p>See <a href="[resource locator]">Section 2</a> for more info.</p>`
> 
> Q1: Is this locator the same when
> 
> (* Q1a. section 2 is the same file)
> * Q1b. section 2 is a different file, but within the same PWP
> * Q1c. the PWP is opened protocol/unpacked
> * Q1d. the PWP is opened file/packed
> * Q1e. the PWP is opened protocol/packed
> * Q1f. the PWP is opened file/unpacked
> * Q1g. the PWP is opened in a different protocol (e.g., via http or https or ftp)
> * Q1h. the PWP is moved/copied protocol-wise (e.g., from example.com to books.org)
> * Q1i. the PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to /user/home/bjdmeest/)
> * Q1j. the PWP is packed vs unpacked
> 

+1 to all. The content of a resource must not be forced to change when the state changes.
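
A Python sketch of why this can work: the relative locator stored in the content stays identical, and only the base it is resolved against differs per state. The base URLs below are invented, loosely following the examples in Q1g to Q1i.

```python
from urllib.parse import urljoin

# The locator as written inside the content; it never changes.
resource_locator = "chapters/ch2.html"

# Hypothetical locations of the referring document in different states.
bases = {
    "http": "http://example.com/pwp/index.html",
    "https": "https://books.org/pwp/index.html",
    "file": "file:///usr/home/ben/pwp/index.html",
}

# Each state resolves the same relative locator against its own base.
resolved = {name: urljoin(base, resource_locator) for name, base in bases.items()}

for name, url in resolved.items():
    print(name, url)
```

This covers the protocol and location cases only; the packed cases additionally need some layer that exposes the package content under a base URL, which is exactly the open question.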


> ### External links
> 
> From a (online) website/ (offline) paper/...
> 
> `<p>John et al. describe an <a href="[PWP locator][resource locator]">interesting algorithm</a> for this problem.</p>`
> 
> Q2: Is this locator the same when
> 
> * Q2a. The referring document is actually inside the PWP
> * Q2b. The referred PWP is accessed protocol/unpacked
> * Q2c. The referred PWP is accessed file/packed
> * Q2d. The referred PWP is accessed protocol/packed
> * Q2e. The referred PWP is accessed file/unpacked
> * Q2f. The referred PWP is accessed in a different protocol (e.g., via http or https or ftp)
> * Q2g. the referred PWP is moved/copied protocol-wise (e.g., from example.com to books.org)
> * Q2h. the referred PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to /user/home/bjdmeest/)
> * Q2i. the referred PWP is packed vs unpacked

*If* by PWP locator you mean the canonical PWP Locator (URL) then +1 to all.

If, say, the packed version is used then… not sure. That packed version may or may not exist (eg, somebody may have unpacked the package and stored the unpacked version only).

> 
> ## Idea
> 
> Personally, I see this as two different problems, i.e.,
> the [PWP locator] depends on the protocol the PWP is in,
> whereas the [resource locator] depends on how the packed vs unpacked PWP should be accessed.
> To me, the [resource locator] is more technical, i.e., once you have the PWP,
> you can (probably via the manifest) access and link to the individual resources.
> Given the discussion yesterday, I see the following high-level model, to solve the [PWP locator]:
> 
> 1. Most importantly, a PWP consists of a Canonical URL and some resources.

A PWP does not "consist" of a URL (Canonical or not). A Canonical URL refers to a PWP.

> 2. The identifiers of a PWP are, e.g., ISBN numbers, but could coïncide with this Canonical URL

+1

> 3. The Canonical URL is the reference to the abstract PWP, whereas different State URLs refer to specific instantiations of that PWP

+1. Actually, +1000, this is what I meant by my remark above

> 4. The Canonical URL does not need to be on the same online place as the actual PWP (cfr. DOI)

While true, I am not sure it is a good practice to do so. In this sense, a DOI is more of an identifier; the HTTP version of the DOI *resolves* to the Canonical URL of the PWP, but that is a different matter.

> 5. The State URLs could be, e.g., the packed version on the publishers website, the unpacked version on the publishers website

+1

> 6. or the URL of the local copy of the downloaded PWP
> 

+1

> When referencing a publication, the user can reference the Canonical URL or the state URL.

+1. But the good practice is to refer to the Canonical URL

> When referencing the state URL, the Canonical URL could be found, as it is part of the PWP.

+1

> 
> ### (technical) TODOs
> 
> Systems need to be in place to make sure the Canonical URL can refer to at least one state URL,
> as otherwise only the abstract PWP exists, but no real content.
> 
> It should be specified how a PWP references to the Canonical URL.
> 
> It should be specified how to access and link to specific resources in a PWP, via some kind of manifest.
> 
> ### Fun things
> 
> Fun thing #1: the most minimal website can already be a PWP, namely:
> the Canonical URL is also a State URL to the unpacked protocol version of the PWP.
> 
> Fun thing #2: a user can remix the local PWP as much as he likes -- e.g.,
> stripping out all the videos to create a 'slim' PWP and republishing it --
> the remixed PWP could still refer to the 'official' PWP via its Canonical URL,
> and the publisher still keeps authority on 'correct' PWPs,
> as the Canonical URL does not need to refer to the remixed PWP, but only to the authorized PWPs.
> Add in checksums etc., and any user can verify whether a received PWP is the same as the published PWP.
> 
> ### Bad things
> 
> Bad thing #1: there is an insane amount of pressure on the Canonical URL.
> If this URL dies, then all instantiations of the PWP are disconnected.
> 
> Ben De Meester
> Researcher Semantic Web
> Ghent University - iMinds - Data Science Lab | Faculty of Engineering and Architecture | Department of Electronics and Information Systems
> Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
> t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL: http://users.ugent.be/~bjdmeest/
> 


----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Monday, 1 February 2016 13:14:21 UTC