- From: Daniel Weck <daniel.weck@gmail.com>
- Date: Wed, 6 Jan 2016 12:43:27 +0000
- To: Ivan Herman <ivan@w3.org>
- Cc: Brady Duga <duga@google.com>, Dave Cramer <Dave.Cramer@hbgusa.com>, Leonard Rosenthol <lrosenth@adobe.com>, Nick Ruffilo <nickruffilo@gmail.com>, Tzviya Siegman <tsiegman@wiley.com>, Charles LaPierre <charlesl@benetech.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Ivan, for security reasons: HTTPS is required, as well as a URL "scope" within the *same* domain / origin as the Service Worker script (by default, this is the location of the SW script itself, but it can be configured to a different path on the server). In other words, a SW script can only intercept (and therefore respond to) URL requests that conform to these restrictions.

To illustrate this principle, here is a basic Service Workers usage example (the script caches resources as they are requested, allowing subsequent fast cache fetches instead of "real" HTTPS connections):

1) The web browser opens chapter1.html (e.g. https://server.com/pwp1/contents/chapter1.html). To simplify, let's assume that there is an active Service Worker for this page with a top-level scope (the SW script being located at https://server.com/service_worker.js).

2) The web browser processes <img src="../images/logo.png" />

3) The web browser resolves the image's relative path against the HTML document's base href, resulting in e.g. https://server.com/pwp1/images/logo.png (note that base@href could potentially be overridden in the HTML head).

4) Because the image URL is within the registered Service Worker scope, the SW script intercepts the image request via its "fetch" event listener, fetches and caches the image file if necessary (or updates the cache with a fresh resource), and generates the appropriate response (binary payload, content type, etc.).

5) The web browser receives logo.png from the cache instead of from the actual HTTPS location.

As you can see, this is a basic "on-demand" processing flow: no attempt is made to proactively cache resources that have not yet been requested by the web browser (i.e. chapters of the PWP that have not been accessed yet). Jake Archibald's "ebook demo" *reader* process goes one step further from a design standpoint, by preemptively caching all the resource URLs from a zip file (i.e. a publication archive previously created by a separate *publisher* process).
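For the record, the flow above can be sketched in a few lines of Service Worker code. This is only an illustrative sketch (the cache name "pwp-cache-v1" and the isInScope helper are my own inventions, not part of the spec or of any of the demos discussed here):

```javascript
// Returns true when a request URL falls within a Service Worker's
// registered scope: same origin, and a path under the scope path.
// (Helper name and shape are illustrative, not part of the SW spec.)
function isInScope(scopeUrl, requestUrl) {
  const scope = new URL(scopeUrl);
  const req = new URL(requestUrl);
  return req.origin === scope.origin && req.pathname.startsWith(scope.pathname);
}

// Cache-as-you-go "fetch" handler (steps 4 and 5 above). Guarded so the
// pure helper can also run outside a Service Worker context.
if (typeof self !== "undefined" && "caches" in self) {
  self.addEventListener("fetch", (event) => {
    event.respondWith(
      caches.open("pwp-cache-v1").then((cache) =>
        cache.match(event.request).then(
          (cached) =>
            cached ||
            fetch(event.request).then((response) => {
              // Cache a copy for subsequent fast fetches.
              cache.put(event.request, response.clone());
              return response;
            })
        )
      )
    );
  });
}
```

Note that the browser performs the scope check itself before dispatching the "fetch" event; the helper just makes the restriction explicit.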
This way, the HTTPS URLs that are requested when reading the publication chapters are totally *non-existent* on the server, yet the responses are resolved by fetching actual content from the cache. See: https://github.com/jakearchibald/ebook-demo/blob/gh-pages/reader-site/sw.js

Jake's *publisher* process is also implemented with Service Workers (although it could alternatively be pure server-side code); its goal is to intercept a particular URL syntax (i.e. a 'fetch' event listener on the "/download-publication" path) in order to build a zip archive response that contains the *entire* publication, as defined in "pub-manifest.json" (i.e. a list of resource URLs, any files that are not deemed "external" to the publication): https://github.com/jakearchibald/ebook-demo/blob/gh-pages/publisher-site/sw.js

The Readium Service Workers experiment does not use an intermediary browser cache to fetch all resources at once from within a zipped EPUB archive. Instead, resource requests are intercepted as they occur, and content is extracted / inflated on demand. The common denominator with Jake's experiment is that publication resource URLs do not actually map to existing files on the server: they just reference the same HTTPS domain as the SW script itself (within the permitted scope), and the Service Worker takes care of building the corresponding payloads (either from the browser cache, or directly from the EPUB archive). In both cases, some URL syntax "trickery" (a path convention) is used to map a full URL request to a resource within the exploded cache, or the zipped EPUB.

I hope this helps clarify possible SW usages (of which there are many).
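To make the path-convention "trickery" concrete, here is a minimal sketch of the mapping step. The "/pwp1/" prefix and the function name are assumptions for illustration only; the actual conventions in Jake's and Readium's code differ:

```javascript
// Maps a virtual publication URL (no corresponding file on the server)
// to a resource key inside the exploded cache or zipped EPUB archive.
// Returns null when the URL is outside the publication's path prefix.
// (The "/pwp1/" convention and function name are illustrative assumptions.)
function resourceKeyForUrl(requestUrl, publicationPrefix) {
  const path = new URL(requestUrl).pathname;
  if (!path.startsWith(publicationPrefix)) return null;
  // e.g. "/pwp1/contents/chapter1.html" -> "contents/chapter1.html"
  return path.slice(publicationPrefix.length);
}
```

Inside a "fetch" event listener, the returned key would then be looked up in the cache (Jake's reader) or used to extract / inflate the corresponding zip entry (Readium).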
Dan

On Wed, Jan 6, 2016 at 10:28 AM, Ivan Herman <ivan@w3.org> wrote:
> Hi Daniel,
>
>> On 6 Jan 2016, at 11:16, Daniel Weck <daniel.weck@gmail.com> wrote:
>>
>> Hi Brady,
>> Service Workers can intercept resource requests via "fetch" event listeners, as long as the URLs originate from within the permitted scope (which is itself an HTTPS URL). So in fact, intercepting requests to "external" resources is not possible (i.e. a different domain, or even just a URL path outside of the registered scope). Note that the "fetch" *API* (not the event type) can of course be used to programmatically emit requests to resources hosted on different domains (via HTTP CORS, just like XMLHttpRequest), and this can indeed be used to populate a cache, or to build a PWP / EPUB zipped package based on some predefined manifest (i.e. a list of well-identified publication resources).
>
> Just for my understanding: does it mean that, for a specific PWP, the (SW-based) RS has to 'register' a number of domains or URLs in its scope in order to be able to catch the requests and cache the content? If so, then in practice we are close to the idea that a GET to a PWP should return (some form of) a manifest with the resources the PWP contains, which should then be "registered" by the RS.
>
> What bothers me a bit is that [1] talks about *a* 'scope URL'. Does it mean that, by default, the URLs that are used by a PWP should all be under the same, fixed scope, and we must have a built-in redirection mechanism to provide access to external resources (using the fetch API)?
>
> This does have to shape our thinking, if this is all true.
>
> Thanks
>
> Ivan
>
> [1] http://www.w3.org/TR/service-workers/#dfn-scope-url
>
>> References:
>>
>> http://www.w3.org/TR/service-workers/#dfn-scope-url
>> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/publisher-site/sw.js
>> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/reader-site/sw.js
>>
>> Regards,
>> Daniel
>>
>> On Tue, Jan 5, 2016 at 4:39 PM, Brady Duga <duga@google.com> wrote:
>>> One thing to note regarding service workers - while they can be used to cache in this simple case of an image on a different server, I don't think they could be used in a more complicated case where resources identify other resources. So, if you make a page of your publication http://louvre.com/monalisa.html, which in turn references http://louvre.com/monalisa.jpg, I don't think it is possible to cache the image. Though, I am not an expert on service workers, so my understanding could be flawed.
>>>
>>> On Tue, Jan 5, 2016 at 7:44 AM, Ivan Herman <ivan@w3.org> wrote:
>>>>
>>>> I think the goal should be somewhere in the middle. I agree that the definition of PWP should be, as much as possible, implementation agnostic, but I agree with Dave that saying "we don't care" is also not appropriate.
>>>>
>>>> We may have to define a PWP Processor in the abstract sense: what a processor is supposed to do to answer different use cases, what its functionalities are, that sort of thing. We may not define it in a normative way in the sense of some formal language or terminology, but we have to understand what can, cannot, should, or should not be done with a PWP. And it is certainly important to know whether the realization of such a PWP processor is possible with today's technologies, what is PWP specific and what can be reused off the shelf, etc.
>>>>
>>>> Ivan
>>>>
>>>>
>>>> On 5 Jan 2016, at 16:24, Cramer, Dave <Dave.Cramer@hbgusa.com> wrote:
>>>>
>>>> On Jan 5, 2016, at 9:41 AM, Leonard Rosenthol <lrosenth@adobe.com> wrote:
>>>>
>>>> Nick – the specifics of how an RS chooses (or not) to cache are out of scope for PWP. They may make sense for some sort of format-specific work (e.g. best practices for PWP with EPUB) but we don’t care about it here.
>>>>
>>>> Remember – PWP is format/packaging and implementation agnostic. (We seemed to all agree to that pre-holidays.)
>>>>
>>>>
>>>> The fact that an existing web technology can solve a critical use case for PWP is on-topic in my opinion, and learning about such things can only help our work. Such technologies may not be a part of the documents we produce, but saying "we don't care about it here" I think sends the wrong message.
>>>>
>>>> Dave
>>>>
>>>> This may contain confidential material. If you are not an intended recipient, please notify the sender, delete immediately, and understand that no disclosure or reliance on the information herein is permitted. Hachette Book Group may monitor email to and from our network.
>>>>
>>>> ----
>>>> Ivan Herman, W3C
>>>> Digital Publishing Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> ORCID ID: http://orcid.org/0000-0003-0782-2704
>
> ----
> Ivan Herman, W3C
> Digital Publishing Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Wednesday, 6 January 2016 12:44:17 UTC