Re: [dpub-loc] Draft update from Ben De Meester on 2016-02-16 (public-digipub-ig@w3.org from February 2016)

From: Ben De Meester <ben.demeester@ugent.be>
Date: Tue, 16 Feb 2016 10:09:03 +0100
To: Ivan Herman <ivan@w3.org>
Cc: Leonard Rosenthol <lrosenth@adobe.com>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <CAJ-O9TsL5hxQceVJt2Edt0D5TAK7bbCQFaPRiMJP5YSLz_bYPg@mail.gmail.com>
2016-02-16 9:38 GMT+01:00 Ivan Herman <ivan@w3.org>:

>
> On 16 Feb 2016, at 09:33, Ben De Meester <ben.demeester@ugent.be> wrote:
>
> Hi Ivan, all,
>
> So, if I understand correctly, *M* consists of two parts: the manifest
> (the list of files that *P* comprises, comparable to what we have in,
> e.g., EPUB) *Ma*, and the link set *Mlinks* (i.e., the set *L*, *Lu*, and
> *Lp*).
> *Ma* is part of all states of *P*, and *Mlinks* is (probably) stored
> somewhere outside of *P* (the options for generating and/or storing
> *Mlinks* are manifold: as a JSON-file, from a database, from a web
> service, automatically derived from the .htaccess file, ... I don't think
> there is a need now to specify that, just as we at the moment don't have to
> specify *how* *M* is returned).
> When someone GETs *L*, *Lu*, or *Lp*, *S* returns (the dynamically
> generated) *M*, in some way or another (see e.g., Ivan's suggestions), so
> the PWP processor knows both *Ma* and *Mlinks*.
> From there, the PWP processor knows what to do.
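
A minimal sketch of the structure summarized above, assuming nothing about the eventual serialization of *M*: the class names, field names, and the `metadata_for` helper are hypothetical and only mirror the symbols *Ma*, *Mlinks*, *L*, *Lu*, and *Lp* used in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSet:
    """Mlinks: the three locators of one publication (hypothetical field names)."""
    canonical: str  # L
    unpacked: str   # Lu
    packed: str     # Lp


@dataclass(frozen=True)
class Metadata:
    """M: the manifest Ma plus the link set Mlinks."""
    resources: tuple  # Ma, the list of files that P comprises
    links: LinkSet    # Mlinks


def metadata_for(locator, metadata):
    """Return M when the request targets L, Lu, or Lp of this publication.

    This models the statement that S returns the same M whichever of the
    three locators is dereferenced; how S actually delivers M is left open.
    """
    known = (metadata.links.canonical,
             metadata.links.unpacked,
             metadata.links.packed)
    if locator not in known:
        raise LookupError(f"unknown locator: {locator}")
    return metadata
```

Whichever locator the PWP processor starts from, it ends up with the same object, and therefore with both *Ma* and *Mlinks*.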
>
>
> Yes, I think this is a good summary.
>
>
> Concerning the 'server-modifications' discussion: as far as I see, we have
> two options discussed when trying to GET a resource from a packed PWP (and
> this, in fact, is orthogonal to the 'how to return *M*' discussion):
>
>    - either the server is modified to know about the internals of the
>    package format, and returns the resource to the client (complex server,
>    simple client)
>    - or the server returns the entire package, and the client needs to
>    know the internals of the package to retrieve the resource from the packed
>    PWP (simple server, complex client).
>
> Both have pros and cons, and I have the feeling this is the same problem
> as asking for any kind of data from a knowledge base from the web: either
> you download the entire data dump and retrieve the data on the client side,
> or you set up a query service and the client asks the question directly to
> the server. The end result is the same, the functionalities are the same,
> it's just a matter of where to put the complexity. Maybe, other
> intermediate options are also possible.
> So maybe, this last discussion doesn't have to be answered: complex
> servers can help the client to retrieve resources more efficiently, complex
> clients can handle simple servers, and we'll all live in a hybrid world.
>
>
> At this point, I definitely agree that the PWP spec does not have to
> (formally) specify all the various alternatives, certainly not trying to be
> exhaustive. But, I believe, due diligence requires that we do list *some*
> viable approaches that prove that whatever we are talking about is not
> just hot air:-)
>

Also fully agree, and -- as I assume the main issue here is, e.g., where to
unzip the packed PWP, client-side or server-side -- I think there are
viable approaches aplenty. A quick search returned:
http://stuk.github.io/jszip/ (client-side) and
http://search.cpan.org/~phred/Archive-Zip-1.56/lib/Archive/Zip/MemberRead.pm
(server-side, although Apache modules are not my cup of tea, so I might be
wrong here).
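
To illustrate that the extraction step itself is small wherever it runs, here is a minimal sketch in Python rather than with either library above. It assumes the packed PWP is a plain ZIP archive, which is not decided anywhere in this thread; `read_resource` and `make_demo_package` are hypothetical helper names.

```python
import io
import zipfile


def read_resource(package_bytes, member):
    """Return the bytes of one resource stored in a ZIP-packed publication.

    The same function could run on a "complex server" (which then returns
    only the resource) or in a "complex client" (after downloading the
    whole package); only the place where it is called differs.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(member)


def make_demo_package():
    """Build a tiny in-memory ZIP archive to stand in for a packed PWP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("index.html", "<p>Hello PWP</p>")
    return buffer.getvalue()
```

For example, `read_resource(make_demo_package(), "index.html")` returns the bytes of the stored HTML file without unpacking anything to disk.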

>
> I.
>
>
> Any thoughts?
>
> Greetings,
> Ben
>
> Ben De Meester
> Researcher Semantic Web
> Ghent University - iMinds - Data Science Lab | Faculty of Engineering and
> Architecture | Department of Electronics and Information Systems
> Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
> t: +32 9 331 49 59 | e: ben.demeester@ugent.be | URL:
> http://users.ugent.be/~bjdmeest/
>
> 2016-02-15 11:32 GMT+01:00 Ivan Herman <ivan@w3.org>:
>
>> Leonard,
>>
>> On 12 Feb 2016, at 18:21, Leonard Rosenthol <lrosenth@adobe.com> wrote:
>>
>> I don’t see any bootstrapping required.   Sure, having server-based
>> modifications (of various flavors) would make for a more optimized
>> implementation but IMO it’s optional and not required.  (this also matches
>> what you were saying on the phone the other day.  Have you changed your
>> mind??)
>>
>>
>>
>> we may mutually misunderstand one another, so maybe it is better (and
>> clearer to the others) if I write down this (only) issue we have with my
>> original writeup to see where we really are.
>>
>> My original writeup[1] said:
>>
>> > 1. The PWP Processor has access to the information in M.
>> > 2. As a consequence, M contains the list of states (and their Locators)
>> that are available on S. In other words, the PWP Processor “knows” Lp and
>> Lu, together with their media types.
>>
>> And your comment was:
>>
>> > I can’t agree to the first half of assumption #2. It would imply that M
>> is created AFTER P is already placed on the server (or is authored by the
>> same system that is responsible for hosting P on S). And if M is modified
>> after P is created, then P isn’t actually P, but instead is P’ - which
>> might be fine for the purposes of publishing, but we need to be clear about
>> that.
>>
>> And you proposed to simply remove that sentence, leaving only "the PWP
>> Processor 'knows' *Lp* and *Lu*..."
>>
>> First of all, your comment is correct: there is a problem. But I also
>> believe that we should have clear ways to describe *how* the PWP
>> Processor knows about *Lp* and *Lu*, in case it is not in *M*, and not
>> leaving that question open (which would be the case if that sentence was
>> removed). Without having a clear idea on this, I do not believe our model
>> is credible.
>>
>> My general response is therefore to say: "*M*", ie, the metadata for a
>> specific *P*, is *conceptual* in the sense that it is perfectly all
>> right if the PWP Processor "gathers" the content from different sources.
>> What counts is that, at the end of the day, the PWP Processor gets hold of
>> *all* the data in "*M*" which then indeed includes *Lu* and *Lp*. Ie, we
>> can keep that statement if this fact is made clear in the text *and*
>> there is a way to ensure that this can be set up (without prescribing a
>> singular way of setting it up).
>>
>> We have, in the document, several scenarios listed to get to the metadata
>> (listed at the end of the writeup). What we have to make clear is that the
>> various approaches are *not* mutually exclusive but, if the metadata
>> comes from different sources, the PWP Processor has to combine them. Ie, it
>> is perfectly o.k. if, for example, the result of the GET on the packed data
>> returns the package with the embedded metadata *and* also uses the HTTP
>> Link header for additional metadata (or use the HTTP Alternates header?
>> I am not sure on that one) thereby providing the missing *Lu* and *L*,
>> for example. The processor combines this information into a coherent *M*
>> and it indeed gets the information on the list of states as stated in that
>> sentence.
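
A minimal sketch of that combination step, under stated assumptions: the relation names `packed` and `unpacked` are hypothetical (no such link relations are defined in this thread), the parser handles only simple `<target>; rel=...` values and not parameters containing commas or semicolons, and letting the embedded metadata win on conflict is an arbitrary choice.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def parse_link_header(value):
    """Return {rel: target} for a simple HTTP Link header value."""
    links = {}
    for target, params in _LINK_VALUE.findall(value):
        for param in params.split(";"):
            name, _, val = param.strip().partition("=")
            if name.lower() == "rel":
                # A rel parameter may carry several space-separated relations.
                for rel in val.strip('"').split():
                    links[rel] = target
    return links


def combine_metadata(embedded_locators, link_header):
    """Merge locators embedded in the package with those from a Link header."""
    merged = dict(embedded_locators)
    for rel, target in parse_link_header(link_header).items():
        merged.setdefault(rel, target)  # keep the embedded value on conflict
    return merged
```

With the packed locator taken from the embedded metadata and the other two from the header, the processor ends up with all three locators in one mapping, which is the "coherent *M*" described above.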
>>
>> What is the problem with this? Isn't this acceptable?
>>
>> Maybe the source of our misunderstanding is actually elsewhere: I am not
>> sure what you meant by 'server-based modifications'. What I said on the
>> call is that I would be against imposing a server modification that would
>> require a modification of the code of the server itself. Ie, which would
>> require a new recompilation of Apache, for example, or that it would even
>> require the development and installation of a new "mod" module (to continue
>> using the Apache example) to be developed by the community. However, I
>> believe that a mechanism that may, in some cases, require the modification
>> or, rather, the addition of a new response header to a server response
>> (like in that example) should be acceptable; I would expect all servers
>> to provide such facilities, even if their usage requires some admin rights on
>> the server. I am not saying that should be the *only* way of achieving
>> something, but it should not be a roadblock either.
>>
>> Ivan
>>
>> P.S. For those of you for whom HTTP header setting is a mystery: if you
>> run Apache and you have the right to include a .htaccess file in a
>> directory, adding a Link header on the file "test.html" in a directory
>> means adding something like:
>>
>> <Files "test.html">
>> Header set Link "<http://www.ex.org/test2.html>; rel=canonical"
>> </Files>
>>
>> to that .htaccess file and, voilà!
>>
>>
>> [1]
>> https://github.com/w3c/dpub-pwp-loc/blob/gh-pages/drafts/ivans-musings.md
>>
>>
>> ----
>> Ivan Herman, W3C
>> Digital Publishing Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Tuesday, 16 February 2016 09:10:02 UTC