
Re: Musings on PWP Offline/Online Modes

From: Ivan Herman <ivan@w3.org>
Date: Wed, 6 Jan 2016 14:02:40 +0100
Cc: Brady Duga <duga@google.com>, Dave Cramer <Dave.Cramer@hbgusa.com>, Leonard Rosenthol <lrosenth@adobe.com>, Nick Ruffilo <nickruffilo@gmail.com>, Tzviya Siegman <tsiegman@wiley.com>, Charles LaPierre <charlesl@benetech.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-Id: <2B992BCA-D0D1-4033-ADDC-F48E5CD27864@w3.org>
To: Daniel Weck <daniel.weck@gmail.com>

> On 6 Jan 2016, at 13:43, Daniel Weck <daniel.weck@gmail.com> wrote:
> 
> Ivan,
> for security reasons: HTTPS is required, as well as a URL "scope"
> within the *same* domain / origin as the Service Worker script (by
> default, it's the location of the SW script itself, but that can be
> configured to a different path on the server). In other words, a SW
> script can only intercept (and therefore respond) to URL requests that
> conform to these restrictions.
> 
> To illustrate this principle, here is a basic Service Workers usage
> example (the script caches resources as they are being requested, to
> allow for subsequent fast cache fetches instead of "real" HTTPS
> connections):
> 
> 1) web browser opens chapter1.html ( e.g.
> https://server.com/pwp1/contents/chapter1.html )
> To simplify, let's assume that there is an active Service Worker for
> this page, registered from https://server.com/service_worker.js (and
> therefore, by default, with a top-level scope: https://server.com/)
> 
> 2) web browser processes <img src="../images/logo.png" />
> 
> 3) web browser resolves image relative path against HTML document base
> href, resulting in e.g. https://server.com/pwp1/images/logo.png  (note
> that base@href could potentially be overridden in the HTML head)
> 
> 4) because the image URL is within the registered Service Worker
> scope: SW script intercepts image request via the fetch event
> listener, fetches and caches the image file if necessary (or updates
> the cache with a fresh resource), and generates the appropriate
> response (binary payload, content type, etc.).
> 
> 5) web browser receives logo.png from the cache instead of from the
> actual HTTPS location.
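[The five steps above can be sketched as plain functions. This is an editor's sketch, not code from the thread: the function names and the 'pwp-cache' name are invented, and the commented-out fetch listener is only illustrative.]

```javascript
// Step 3: resolve a relative path against the HTML document's base href.
// (The URL constructor is how browsers resolve relative references.)
function resolveAgainstBase(relativePath, baseHref) {
  return new URL(relativePath, baseHref).href;
}

// Step 4 precondition: the request is only handled by the SW if it is
// same-origin and its path falls within the registered scope.
function isInScope(requestUrl, scopeUrl) {
  const req = new URL(requestUrl);
  const scope = new URL(scopeUrl);
  return req.origin === scope.origin && req.pathname.startsWith(scope.pathname);
}

// Inside a real Service Worker, the cache-as-you-go flow of steps 4-5
// would look roughly like this (illustrative only):
//
//   self.addEventListener('fetch', (event) => {
//     event.respondWith(
//       caches.open('pwp-cache').then((cache) =>
//         cache.match(event.request).then((hit) =>
//           hit || fetch(event.request).then((resp) => {
//             cache.put(event.request, resp.clone());
//             return resp;
//           }))));
//   });
```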

Yep. That is clear. But the question is what happens if, in …/chapter1.html, there is a reference to the

https://anotherserver.com/image.png

file? (Note that I use https to avoid the https/http problem.) Does that mean that the service worker script *cannot* take care of that PNG file, i.e., that the web browser has to handle it as usual? I.e., such a file cannot be cached (i.e., cannot be used offline, to come back to our original use case)? My feeling is that the answer is indeed that the SW cannot do this, which would mean that a PWP *cannot* contain resources from different domains.

Another, follow-up question: what if we have some information somewhere in the right scope (in our manifest file, whatever that file's format is) which contains a mapping of the form

https://server.com/pwp1/image.png ->  https://anotherserver.com/image.png

i.e., the first URI does not really refer to any content; instead, an RS should fetch the content from anotherserver. I presume that (a) the SW can catch the URI on …/pwp1/ because it is in its scope, and (b) it can act as a proper script, fetching and caching that data. Is that correct?
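[A minimal sketch of that mapping idea. Everything here is hypothetical: the manifest shape, the function names, and the single-entry map are the editor's inventions, not a defined PWP mechanism.]

```javascript
// Hypothetical manifest-derived mapping: a scoped URL that stands in
// for an external resource on another domain.
const mappings = {
  'https://server.com/pwp1/image.png': 'https://anotherserver.com/image.png'
};

// (a) the scoped URL is within the SW's scope, so its fetch event fires;
// (b) the handler can substitute the external URL for the scoped one.
function resolveMapping(requestUrl, map) {
  return map[requestUrl] || null;
}

// In a real SW fetch handler this could be wired up roughly as
// (illustrative only; CORS would apply to the external fetch):
//
//   self.addEventListener('fetch', (event) => {
//     const external = resolveMapping(event.request.url, mappings);
//     if (external) {
//       event.respondWith(fetch(external, { mode: 'cors' }));
//     }
//   });
```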

Thanks

Ivan


> 
> As you can see, this is a basic "on-demand" processing flow: no
> attempt is made to proactively cache resources that have not yet
> actually been requested by the web browser (i.e. chapters of the PWP
> that have not been accessed yet).
> Jake Archibald's "ebook demo" *reader* process goes one step further
> from a design standpoint, by preemptively caching all the resource
> URLs from a zip file (i.e. a publication archive that is previously
> created by a separate *publisher* process). This way, the HTTPS URLs
> that are requested when reading the publication chapters are totally
> *non-existent* on the server, yet the responses are resolved by
> fetching actual content from the cache. See:
> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/reader-site/sw.js
> 
> Jake's *publisher* process is also implemented with Service Workers
> (although it could alternatively be pure server-side code), and the
> goal is to intercept a particular URL syntax (i.e. 'fetch' event
> listener on "/download-publication" path) in order to build a zip
> archive response that contains the *entire* publication, as defined in
> "pub-manifest.json" (i.e. list of resource URLs, any files that are
> not deemed "external" to the publication):
> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/publisher-site/sw.js
> 
> The Readium Service Workers experiment does not use an intermediary
> browser cache to fetch all resources at once from within a zip EPUB
> archive. Instead, the resource requests are intercepted as they occur,
> and content is extracted / inflated on-demand. The common denominator
> with Jake's experiment is that publication resource URLs do not
> actually map to existing files on the server: they just reference the
> same HTTPS domain as the SW script itself (within the permitted
> scope), and the Service Worker takes care of building the
> corresponding payloads (either from the browser cache, or directly
> from the EPUB archive). In both cases, some URL syntax "trickery"
> (path convention) is used to map a full URL request to a resource
> within the exploded cache, or the zipped EPUB.
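[The "URL syntax trickery" described above might be sketched like this. The '/pub/' path convention and the function name are the editor's inventions; neither Readium nor Jake's demo necessarily uses this exact scheme.]

```javascript
// Parse a publication URL that does NOT correspond to a real file on
// the server: an assumed '/pub/<id>/<entry>' path convention identifies
// the publication and an entry inside the cache or the zipped EPUB.
function parsePublicationUrl(requestUrl) {
  const { pathname } = new URL(requestUrl);
  const match = pathname.match(/^\/pub\/([^/]+)\/(.+)$/);
  if (!match) return null; // not a publication URL; let the browser handle it
  return { publicationId: match[1], entryPath: match[2] };
}

// A SW fetch handler would use the parsed result to build the response
// payload from the cache or by inflating the zip entry (not shown).
```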
> 
> I hope this helps clarify possible SW usages (of which there are many).
> Dan
> 
> 
> 
> On Wed, Jan 6, 2016 at 10:28 AM, Ivan Herman <ivan@w3.org> wrote:
>> Hi Daniel,
>> 
>>> On 6 Jan 2016, at 11:16, Daniel Weck <daniel.weck@gmail.com> wrote:
>>> 
>>> Hi Brady,
>>> Service Workers can intercept resource requests via "fetch" event
>>> listeners, as long as the URLs originate from within the permitted
>>> scope (which is itself an HTTPS URL). So in fact, intercepting
>>> requests to "external" resources is not possible (i.e. different
>>> domain, or even just URL path outside of the registered scope). Note
>>> that the "fetch" *API* (not the event type) can of course be used to
>>> programmatically emit requests to resources hosted on different
> domains (via HTTP CORS, just like XMLHttpRequest), and this can indeed
>>> be used to populate a cache, or to build a PWP / EPUB zipped package
>>> based on some predefined manifest (i.e. list of well-identified
>>> publication resources).
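[The manifest-driven pre-population Daniel describes could be sketched as follows. The manifest shape (a `resources` list with an `external` flag) is assumed for illustration only.]

```javascript
// Hypothetical manifest: a list of well-identified publication
// resources, some flagged as external to the publication.
function internalResources(manifest) {
  return manifest.resources
    .filter((r) => !r.external)
    .map((r) => r.url);
}

// A SW could pre-populate its cache with these using the fetch *API*
// (CORS applies to any cross-origin URLs), e.g. (illustrative only):
//
//   caches.open('pwp-cache').then((cache) =>
//     cache.addAll(internalResources(manifest)));
```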
>> 
>> Just for my understanding: does it mean that, for a specific PWP, the (SW-based) RS has to 'register' a number of domains or URLs in its scope in order to be able to catch the requests and cache the content? If so then, in practice, we are close to the idea that a GET to a PWP should return (some form of) a manifest with the resources the PWP contains, which should then be "registered" by the RS.
>> 
>> What bothers me a bit is that [1] talks about *a* 'scope URL'. Does it mean that, by default, the URLs that are used by a PWP should all be under the same, fixed scope, and that we must have a redirection mechanism built in to provide access to external resources (using the fetch API)?
>> 
>> This does have to shape our thinking, if this is all true.
>> 
>> Thanks
>> 
>> Ivan
>> 
>> [1] http://www.w3.org/TR/service-workers/#dfn-scope-url
>> 
>>> 
>>> References:
>>> 
>>> http://www.w3.org/TR/service-workers/#dfn-scope-url
>>> 
>>> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/publisher-site/sw.js
>>> 
>>> https://github.com/jakearchibald/ebook-demo/blob/gh-pages/reader-site/sw.js
>>> 
>>> Regards,
>>> Daniel
>>> 
>>> On Tue, Jan 5, 2016 at 4:39 PM, Brady Duga <duga@google.com> wrote:
>>>> One thing to note regarding service workers: while they can be used to
>>>> cache in this simple case of an image on a different server, I don't think
>>>> they could be used in a more complicated case where resources identify other
>>>> resources. So, if you make a page of your publication
>>>> http://louvre.com/monalisa.html, which in turn references
>>>> http://louvre.com/monalisa.jpg, I don't think it is possible to cache the
>>>> image. Though I am not an expert on service workers, so my understanding
>>>> could be flawed.
>>>> 
>>>> On Tue, Jan 5, 2016 at 7:44 AM, Ivan Herman <ivan@w3.org> wrote:
>>>>> 
>>>>> I think the goal should be somewhere in the middle. I agree that the
>>>>> definition of PWP should be, as much as possible, implementation agnostic,
>>>>> but I agree with Dave that saying "we don't care" is also not appropriate.
>>>>> 
>>>>> We may have to define a PWP Processor in the abstract sense: what a
>>>>> processor is supposed to do to answer different use cases, what its
>>>>> functionalities are, that sort of thing. We may not define it in a
>>>>> normative way in the sense of some formal language or terminology, but we
>>>>> have to understand what can, cannot, should, or should not be done with a PWP. And
>>>>> it is certainly important to know whether the realization of such a PWP
>>>>> processor is possible with today's technologies, what is PWP specific and
>>>>> what can be reused off the shelf, etc.
>>>>> 
>>>>> Ivan
>>>>> 
>>>>> 
>>>>> On 5 Jan 2016, at 16:24, Cramer, Dave <Dave.Cramer@hbgusa.com> wrote:
>>>>> 
>>>>> On Jan 5, 2016, at 9:41 AM, Leonard Rosenthol <lrosenth@adobe.com> wrote:
>>>>> 
>>>>> Nick – the specifics of how an RS chooses (or not) to cache are out of
>>>>> scope for PWP.  They may make sense for some sort of format-specific work
>>>>> (eg. best practices for PWP with EPUB) but we don’t care about it here.
>>>>> 
>>>>> Remember – PWP is format/packaging and implementation agnostic.   (we
>>>>> seemed to all agree to that pre-holidays)
>>>>> 
>>>>> 
>>>>> The fact that an existing web technology can solve a critical use case for
>>>>> PWP is on-topic in my opinion, and learning about such things can only help
>>>>> our work. Such technologies may not be a part of the documents we produce,
>>>>> but saying "we don't care about it here" I think sends the wrong message.
>>>>> 
>>>>> Dave
>>>>> 
>>>>> 
>>>>> 
>>>>> ----
>>>>> Ivan Herman, W3C
>>>>> Digital Publishing Lead
>>>>> Home: http://www.w3.org/People/Ivan/
>>>>> mobile: +31-641044153
>>>>> ORCID ID: http://orcid.org/0000-0003-0782-2704
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 


----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704





Received on Wednesday, 6 January 2016 13:03:02 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 19:36:22 UTC