Re: [locators] High-level thoughts from Ben De Meester on 2016-02-01 (public-digipub-ig@w3.org from February 2016)

From: Ben De Meester <ben.demeester@ugent.be>
Date: Mon, 1 Feb 2016 16:38:46 +0100
To: Ivan Herman <ivan@w3.org>
Cc: W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <CAJ-O9TtRR11g=qox-J5Wb2F2rmRq0WfFPLhYq2VkjrsT_uZiZg@mail.gmail.com>
Hi Ivan, Romain, and Bill,

Thanks a lot for your comments; I'm going to try to address both below :).

Ben De Meester
Researcher Semantic Web
Ghent University - iMinds - Data Science Lab | Faculty of Engineering and
Architecture | Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL:
http://users.ugent.be/~bjdmeest/

2016-02-01 14:14 GMT+01:00 Ivan Herman <ivan@w3.org>:

> Hi Ben,
>
> thanks for starting this. My apologies for the late reply, but I had some
> personal issues last week that had priority…
>
> One general comment, not really on the mail below, but more, sort of,
> warning flags that we should keep in mind.
>
> * We should restrict ourselves, whenever we can, to the Web usage. What
> this means is that we should try to avoid reinventing the wheel, and always
> remember that, at the end of the day, we are mostly talking about HTTP(S)
> URL-s. (I say 'mostly', because the issue of packaging is clearly muddying
> the waters a bit in this respect.) What this also means is that we will
> have to 'translate' whatever we do back on to the world and terminology of
> URL-s. This is also why I always try to look at the issues in a very
> down-to-earth, operational way: what happens if I issue an HTTP GET on a
> specific, ehem, locator. What do I get back? What other HTTP verbs are
> possible on that locator (e.g., PUT). Etc.
>
> * We agreed to try to keep away from specific design/implementation tools.
> But, again, we are talking about **Web** Documents. Service Workers are a
> specific implementation vehicle but I tend to look at them in a slightly
> more general way. Regardless of the details, what SW-s do is to provide a
> level of abstraction whereby offline and online use become
> indistinguishable. **That** abstraction is very useful I believe. If we
> did not have SW, the fundamental goal of the discussion of PWP (ie, the
> complete and smooth transition from offline to online, from packed to
> unpacked) would still require some sort of an abstraction that, the
> Javascript details put aside, would lead to the very same principles as
> SW-s do. The main point is that, seen from a certain level of abstraction,
> there should be no difference between packed and unpacked, or online vs.
> offline; a lower level layer should make these differences disappear. I
> think that kind of abstraction can be helpful, although the devil is
> obviously in the details (very much so…)
>
> More comments below
>
>
> On 28 Jan 2016, at 11:11, Ben De Meester <ben.demeester@ugent.be> wrote:
>
> Hi all,
>
> Based on the discussion yesterday, I have been musing, and drafted my
> thoughts below.
> It is insanely long, sorry for that. The short version is that I make the
> following statements:
>
> * A PWP locator can be absolute or relative.
>
>
> +1
>
> I think the way you use the term, below, of a "PWP Locator" is a Locator
> (yes, essentially a URL!) to the PWP as a whole. I hope that I understand
> you o.k. on that.
>
Yes, indeed :).

>
> * The relative locator allows linking to resources once you know where the
> PWP is located
>
@Romain: indeed, the specifics are to be discussed. But I kind of like
your second idea (if I understand it correctly), that the links to
sub-resources are as if the PWP was published unpackaged. This would make
it quite easy to publish and locate PWP resources (as it's exactly how it
is currently done when a website is published), but would involve
some more discussion on how to accommodate the case where the PWP would
not be published unpackaged.

>   * and can be derived using the PWP manifest
>
>
> +0.9.
>
> The remaining 0.1 is to say that it is not necessarily the PWP manifest.
> If I look at it from the Web's point of view, the relative locator within a
> resource is a relative URL and if the PWP Locator is the URL of the
> document as a whole and is used to identify the 'root', then I do not need
> any extra manifest. The combination of the two yields the absolute URL of
> that resource.
>
+1

>
>

> * The absolute locator consists of the relative locator and the PWP locator.
>
>
> I would be careful: it sounds as if it was a simple concatenation of the
> two. I prefer the word 'combine'.
>
Indeed, let's make clear that simple concatenation might not be enough.
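
To make the difference concrete, here is a minimal sketch using Python's
standard `urllib.parse.urljoin`, which performs ordinary URL reference
resolution. The `example.org` URL, the resource names, and the `combine`
helper are made up for illustration; this is not a proposed PWP mechanism,
only a demonstration that 'combining' and 'concatenating' give different
results.

```python
from urllib.parse import urljoin

# Hypothetical locator of a resource inside a PWP, used as the base URL.
pwp_locator = "https://example.org/books/moby-dick/index.html"

# Relative locators as they might appear inside that resource.
relative_locators = ["chapter2.html", "../shared/style.css", "#sec2"]


def combine(base: str, relative: str) -> str:
    """Resolve a relative locator against the base locator."""
    return urljoin(base, relative)


for rel in relative_locators:
    # Plain string concatenation yields a different (and wrong) URL.
    print("concatenated:", pwp_locator + rel)
    print("combined:    ", combine(pwp_locator, rel))
```

Only the fragment-only reference happens to give the same result either
way; the other two show why the word 'combine' is the safer one.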

>
> * The PWP locator is always in a certain state (e.g., locally unpacked, or
> hosted packed, or …)
>
>
> I am not sure I understand what you mean. To be precise, a "locator" does
> not have a state. It is, ehem, (like:-) a URL that refers to some resource
> that I can act upon with an HTTP GET.
>
> The *PWP* may be in a state. The resource referred to by the locator may
> have a state. But the locator itself doesn't, in my view.
>
Sorry for that, that sentence came out wrong. I meant to say that it is
possible to link to PWPs (and their resources) in certain states (so it
could be possible to refer to the index.html in the packaged publication,
or in the unpackaged publication, and these two locators could be different).

>
>
> * However, all instantiations of the PWP link back to the state-less,
> abstract PWP, via its Canonical URL
> * and that Canonical URL needs to point to at least one instantiation of a
> PWP.
>
>
> These two statements are in fact one, and I tend to agree with them. I
> would use a slightly different terminology: referring back to the model I
> had on this on the call: there is a Canonical Locator (ie, Canonical
> URL) for the PWP and, somehow, any state of the PWP should carry this
> information.
>
Very much agree :)

>
> * Thus, a PWP can be referenced using its specific instantiation, or via
> its Canonical URL.
>
>
> Referring back to the previous note that I was disagreeing with, let me
> try to rephrase to see if we have an agreement.
>
> * There is an absolute, canonical URL that refers to a PWP. Ie, if I do an
> HTTP GET, what is returned is the PWP in some state
> * For each available state there is a separate, absolute URL that refers
> to the PWP in a particular state. Eg, there is a Locator to identify a
> .zip version, maybe there is a locator to identify a .tar.gz version, and
> again one that simply refers to the information directly accessible on the
> Web.
>
> Is this what you meant?
>
Indeed, and I think that this coincides with Romain's remarks.
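
As a rough sketch of that model: one canonical locator, plus one absolute
locator per available state, with the choice made from a simplified Accept
header. Everything here is an assumption for illustration only: the
`example.org` URLs, the media types chosen for each state, and the
`resolve_state` helper. Real content negotiation also handles q-values and
wildcards, which this sketch ignores.

```python
from typing import Dict, Optional

# Hypothetical table: one canonical locator, several state locators.
STATE_LOCATORS: Dict[str, Dict[str, str]] = {
    "https://example.org/books/moby-dick": {
        "application/zip": "https://example.org/books/moby-dick.zip",
        "application/gzip": "https://example.org/books/moby-dick.tar.gz",
        "text/html": "https://example.org/books/moby-dick/index.html",
    }
}


def resolve_state(canonical: str, accept: str) -> Optional[str]:
    """Pick the state locator for the first acceptable media type.

    `accept` is a simplified Accept header: comma-separated media types,
    assumed to be listed in order of preference (parameters are dropped).
    """
    states = STATE_LOCATORS.get(canonical, {})
    for part in accept.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in states:
            return states[media_type]
    return None  # the canonical locator has no matching state


print(resolve_state("https://example.org/books/moby-dick", "application/zip, text/html"))
```

Whatever the selection mechanism turns out to be, the point is that every
state locator hangs off the same canonical locator.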

>
>
> All of these statements are open to debate of course :).
>
> Also: @Romain: could you give an update to the current state of the use
> cases, and how we can help you?
>
> Greetings,
> Ben
>
> ## States - scope
>
> As per the current state of the PWP WD,
> we scope this work specifically to the fact that a PWP can have different
> states (packed/unpacked, protocol/file),
> but otherwise, the contents of the PWP are exactly the same across those
> states.
>
>
> Not only +1 to this, but I think this is very important, and I refer back
> to my debate with Romain on this. The various states, that may have their
> distinct locators/URI-s, contain exactly the same abstract content. From a
> certain level of abstraction (see my note on Service Workers at the
> beginning of the mail) they are simply identical. (That is why I maintain
> that they are perfectly fine targets for content negotiations.)
>
>
> Locating content between PWPs that have different contents (e.g., in
> another language, or an earlier version)
> is currently out of scope.
>
>
> +1
>
> Things such as the FRBR model are out of scope,
> as this is more about identifiers than about locators.
>
>
> Not sure about that (see also BillK's email). And I am not sure FRBR is
> about identifiers. It is about more abstract notions all right, but the
> FRBR abstractions, or a simplified version thereof, may become useful. Let
> us not consider this out of scope in this sense.
>
Alright :)

>
> Also, by locators we mean locators of (entire) PWPs and/or of individual
> resources inside the PWP.
>
>
> I guess the first is what you called 'PWP Locator', right?
>
Indeed :)

>
>
> For more fine-grained locations (e.g., the second paragraph of document X),
> other efforts are going on, e.g., in the annotation working group.
>
>
> Indeed, but more specifically that is what fragment identifiers are for.
> Ie, our locators are there to refer to (and use in HTTP GET) the individual
> resources, and they can be combined, eg, with fragment identifiers, for a
> finer granularity. *At this moment* this is all that we
> should/would say.
>
+1
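
A small sketch of that split, using Python's standard `urllib.parse`: the
locator part is what would be used in an HTTP GET, and the fragment
identifier is handled afterwards by whoever processes the retrieved
resource. The URL, the fragment name `para-2`, and the `split_locator`
helper are hypothetical.

```python
from typing import Tuple
from urllib.parse import urldefrag


def split_locator(link: str) -> Tuple[str, str]:
    """Split a link into (resource locator, fragment identifier)."""
    result = urldefrag(link)
    return result.url, result.fragment


# Hypothetical link to the second paragraph of a resource inside a PWP.
locator, fragment = split_locator(
    "https://example.org/books/moby-dick/chapter2.html#para-2"
)
print(locator)   # the part used for the HTTP GET
print(fragment)  # the finer-grained part, resolved after retrieval
```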

>
>
> ## Remark: Absolute vs Relative
>
> As far as I see it, it is possible to have relative and absolute locators,
> where relative locators will mostly (exclusively?) be used inside the PWP,
>
>
> Note that, so far, there is no restriction within the definition of a PWP
> whereby for intra-PWP references only relative locators could be used. We
> may get there, but we do not have it now.
>
+1

>
> and absolute locators might be used for internal links,
>
>
> Can it? Should it be allowed? See my comment above
>
>
> but probably mostly for external sources linking to the PWP.
>
>
> You mean to identify a resource within a PWP? I am not sure what you mean
> here.
>
Yes, miswording, I meant indeed referring to resources within a PWP

>
>
> As such, I think of a locator as having two parts:
> [PWP locator]*[resource locator]
>
> In the case of a relative locator, the [PWP locator] is missing,
> and needs to be derived from context.
>
>
> +1 with the caveat of 'combination' and not concatenation.
>
+1

>
>
> ### Internal links
>
> Inside the PWP
> > i.e., inside the 'container' that holds all contents of the PWP,
> > for a packed PWP, this is straightforward, i.e., inside the package,
> > for an unpacked PWP,
> > I mean inside the subfolder, whether it is file or protocol state
>
> `<p>See <a href="[resource locator]">Section 2</a> for more info.</p>`
>
> Q1: Is this locator the same when
>
> (* Q1a. section 2 is in the same file)
> * Q1b. section 2 is a different file, but within the same PWP
> * Q1c. the PWP is opened protocol/unpacked
> * Q1d. the PWP is opened file/packed
> * Q1e. the PWP is opened protocol/packed
> * Q1f. the PWP is opened file/unpacked
> * Q1g. the PWP is opened in a different protocol (e.g., via http or https
> or ftp)
> * Q1h. the PWP is moved/copied protocol-wise (e.g., from example.com to
> books.org)
> * Q1i. the PWP is moved/copied file-wise (e.g., from /usr/home/ben/ to
> /user/home/bjdmeest/)
> * Q1j. the PWP is packed vs unpacked
>
>
> +1 to all. The content of a resource must not be forced to change when the
> state changes.
>
+1

>
>
> ### External links
>
> From a (online) website/ (offline) paper/...
>
> <p>John et al. describe an <a href="[PWP locator][resource
> locator]">interesting algorithm</a> for this problem.</p>
>
> Q2: Is this locator the same when
>
> * Q2a. The referring document is actually inside the PWP
> * Q2b. The referred PWP is accessed protocol/unpacked
> * Q2c. The referred PWP is accessed file/packed
> * Q2d. The referred PWP is accessed protocol/packed
> * Q2e. The referred PWP is accessed file/unpacked
> * Q2f. The referred PWP is accessed in a different protocol (e.g., via
> http or https or ftp)
> * Q2g. the referred PWP is moved/copied protocol-wise (e.g., from
> example.com to books.org)
> * Q2h. the referred PWP is moved/copied file-wise (e.g., from
> /usr/home/ben/ to /user/home/bjdmeest/)
> * Q2i. the referred PWP is packed vs unpacked
>
>
> *If* by PWP locator you mean the canonical PWP Locator (URL) then +1 to
> all.
>
> If, say, the packed version is used then… not sure. That packed version
> may or may not exist (eg, somebody may have unpacked the package and stored
> the unpacked version only)
>
I also believe the canonical PWP Locator is the way to go, but I didn't
want to rule out that people *can* use, e.g., the locator to their local
copy. However, that 'local' locator would then not stay the same when,
e.g., the PWP is moved in the file system.
So @Romain, I indeed think the locator should be resilient to packaging and
sharing. The locators should not change when a user downloads a PWP, but
the reading system should be able to map the canonical URL to the local
location without the user noticing.
This is, to me, independent of whether the PWP is downloaded or cached. When
cached, the mapping could be done by something like SW; when downloaded,
the mapping could be done by something like Readium or a browser plug-in.
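
To illustrate the kind of mapping I have in mind, here is a minimal sketch
of what a reading system could do. The registry `LOCAL_COPIES`, the
`to_local` helper, the `example.org` URL, and the local path are all
hypothetical; a Service Worker or a browser plug-in would of course do
this with its own machinery rather than with Python.

```python
from pathlib import PurePosixPath
from typing import Dict
from urllib.parse import urlsplit

# Hypothetical registry kept by a reading system:
# canonical URL of a PWP -> root folder of the downloaded copy.
LOCAL_COPIES: Dict[str, str] = {
    "https://example.org/books/moby-dick/": "/home/ben/library/moby-dick/",
}


def to_local(url: str) -> str:
    """Return the local path for `url` if its PWP was downloaded, else `url`."""
    for canonical, local_root in LOCAL_COPIES.items():
        if url.startswith(canonical):
            # Keep the path relative to the PWP root; drop query and fragment.
            relative = urlsplit(url[len(canonical):]).path
            return str(PurePosixPath(local_root) / relative)
    return url  # not a known PWP: leave the locator untouched


print(to_local("https://example.org/books/moby-dick/chapter2.html#sec2"))
```

If the user moves the local copy, only the registry entry changes; the
locators inside and outside the PWP stay the same.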

>
>
> ## Idea
>
> Personally, I see this as two different problems, i.e.,
> the [PWP locator] depends on the protocol the PWP is in,
> whereas the [resource locator] depends on how the packed vs unpacked
> PWP should be accessed.
> To me, the [resource locator] is more technical, i.e., once you have the
> PWP,
> you can (probably via the manifest) access and link to the individual
> resources.
> Given the discussion yesterday, I see the following high-level model, to
> solve the [PWP locator]:
>
> 1. Most importantly, a PWP consists of a Canonical URL and some resources.
>
>
> A PWP does not "consist" of a URL (Canonical or not). A Canonical URL
> refers to a PWP.
>
Well, I meant to say that, e.g., the local downloaded PWP should still
have some kind of link to its Canonical URL.

>
> 2. The identifiers of a PWP are, e.g., ISBN numbers, but could coïncide
> with this Canonical URL
>
>
> +1
>
> 3. The Canonical URL is the reference to the abstract PWP, whereas
> different State URLs refer to specific instantiations of that PWP
>
>
> +1. Actually, +1000, this is what I meant by my remark above
>
> 4. The Canonical URL does not need to be on the same online place as the
> actual PWP (cfr. DOI)
>
>
> While true, I am not sure it is a good practice to do so. In this sense, a
> DOI is more of an identifier; the http version of the DOI *resolves* to the
> Canonical URL of the PWP, but that is a different matter
>
Indeed, good practice would be to keep them in the same place, but I
wouldn't restrict that (but I also don't think you implied that). E.g., if
a publisher changes name, and thus domain name, the old PWPs could be kept
in their old location so that the old locators could work, but also
're-published' under the new name.

>
> 5. The State URLs could be, e.g., the packed version on the publisher's
> website, the unpacked version on the publisher's website
>
>
> +1
>
> 6. or the URL of the local copy of the downloaded PWP
>
>
> +1
>
> When referencing a publication, the user can reference the Canonical URL
> or the state URL.
>
>
> +1. But the good practice is to refer to the Canonical URL
>
+1, see comment above


>
> When referencing the state URL, the Canonical URL could be found, as it is
> part of the PWP.
>
>
> +1
>
>
> ### (technical) TODOs
>
> Systems need to be in place to make sure the Canonical URL can refer to at
> least one state URL,
> as otherwise only the abstract PWP exists, but no real content.
>
> It should be specified how a PWP refers to its Canonical URL.
>
> It should be specified how to access and link to specific resources in a
> PWP, via some kind of manifest.
>
> ### Fun things
>
> Fun thing #1: the most minimal website can already be a PWP, namely:
> the Canonical URL is also a State URL to the unpacked protocol version of
> the PWP.
>
> Fun thing #2: a user can remix the local PWP as much as he likes -- e.g.,
> stripping out all the videos to create a 'slim' PWP and republishing it --
> the remixed PWP could still refer to the 'official' PWP via its Canonical
> URL,
> and the publisher still keeps authority on 'correct' PWPs,
> as the Canonical URL does not need to refer to the remixed PWP, but only
> to the authorized PWPs.
> Add in checksums etc., and any user can verify whether a received PWP is
> the same as the published PWP.
>
> ### Bad things
>
> Bad thing #1: there is an insane amount of pressure on the Canonical URL.
> If this URL dies, then all instantiations of the PWP are disconnected.
>
> Ben De Meester
> Researcher Semantic Web
> Ghent University - iMinds - Data Science Lab | Faculty of Engineering and
> Architecture | Department of Electronics and Information Systems
> Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
> t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL:
> http://users.ugent.be/~bjdmeest/
>
>
>
> ----
> Ivan Herman, W3C
> Digital Publishing Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
>
>
>
>
>
Received on Monday, 1 February 2016 15:40:08 UTC