Re: (Possibly) core issue on identification with EPUB-WEB, packaging, fragments... from Nick Ruffilo on 2015-06-01 (public-digipub-ig@w3.org from June 2015)

From: Nick Ruffilo <nickruffilo@gmail.com>
Date: Mon, 1 Jun 2015 10:11:02 -0400
To: Ivan Herman <ivan@w3.org>
Cc: W3C Digital Publishing IG <public-digipub-ig@w3.org>, Ralph Swick <swick@w3.org>
Message-ID: <CA+Dds58vDSr1CA0A0jVi5mYn1pGOB36k+kPg4bjP5qF6ApxVLQ@mail.gmail.com>
What if we leave it up to the client/server to determine what the root of
the package is and handle it appropriately?

So, an epub-web object (or whatever we call it) might live at:
//my/item/awesome.epub

To address a specific FILE in that, you go to
//my/item/awesome.epub/text/chap2.html

To get to a fragment, you just use # in reference to whatever the fragment
is:
//my/item/awesome.epub/text/chap2.html#first_header
//my/item/awesome.epub#SomeCrazyTextRangeIdentifier
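
As a rough illustration of the addressing scheme above, here is a minimal
Python sketch that splits such a URL into package root, path inside the
package, and fragment. The rule used to find the root (the first path
segment ending in ".epub") and the host example.org are assumptions made
for the example only; a client or server could use any other rule.

```python
from urllib.parse import urlsplit


def split_package_url(url, package_suffix=".epub"):
    """Split a URL into (package root, path inside the package, fragment).

    Assumption: the package root is the first path segment that ends
    with `package_suffix`. This rule is hypothetical, not from any spec.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.endswith(package_suffix):
            root = "/".join(segments[: i + 1])
            inner = "/".join(segments[i + 1:])
            return root, inner, parts.fragment
    # No package segment found: treat the path as an ordinary resource.
    return None, parts.path, parts.fragment


file_in_package = split_package_url(
    "http://example.org/my/item/awesome.epub/text/chap2.html#first_header"
)
whole_package = split_package_url(
    "http://example.org/my/item/awesome.epub#SomeCrazyTextRangeIdentifier"
)
```

For the first URL this yields the root "/my/item/awesome.epub", the inner
path "text/chap2.html" and the fragment "first_header"; for the second the
inner path is empty and only the fragment remains.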

If run on a server, it would be the server's job to extract the appropriate
package files (when thinking about epub, the OPF for example) and provide
that to the client, who can then determine the resources it needs and
request them from the server.

When run LOCALLY, the client will simply extract the package files
directly.  Otherwise there is no duplication of work or resources, etc.
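
To make the "extract the package files" step concrete, here is a small
Python sketch, assuming the package is an EPUB-style zip whose OPF is
named in META-INF/container.xml. The demo package, its file names and the
helper names are invented for the example; the same two helpers could run
on a server or in a local client.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf"
        media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def build_demo_package():
    """Build a tiny in-memory package (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("content.opf", "<package/>")
        archive.writestr("text/chap2.html", '<h1 id="first_header">Two</h1>')
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


def find_opf_path(package):
    """Return the path of the OPF named in META-INF/container.xml."""
    container = ET.fromstring(package.read("META-INF/container.xml"))
    rootfile = container.find(
        f"{CONTAINER_NS}rootfiles/{CONTAINER_NS}rootfile"
    )
    return rootfile.get("full-path")


def read_member(package, inner_path):
    """Return the bytes of one file stored inside the package."""
    return package.read(inner_path)


package = build_demo_package()
opf_path = find_opf_path(package)
chapter = read_member(package, "text/chap2.html")
```

A server would send the result of read_member() as the response body; a
local client would hand it straight to the renderer.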

There was a note about the fragment (things after the #) not being sent to
the server.  If that is truly the case - and not just that the server
ignores it - then we would need a DIFFERENT marker. Which one, I have no
idea...

-Nick

On Mon, Jun 1, 2015 at 6:22 AM, Ivan Herman <ivan@w3.org> wrote:

> Hi all,
>
> my sincere apologies for the length of this mail, but I thought it would be
> worthwhile to get some issues written down to clarify our discussions...
>
> At the F2F meeting I made the claim that the identifier/fragment issue may
> be the most tricky one facing us around EPUB-WEB. I thought it is worth
> writing this down; maybe somebody can prove me wrong and show that this is
> not such a complex issue after all. Actually, what is below is a summary of a
> very short email/personal discussion Markus, Tzviya, and I had on the
> matter after the F2F. (At some point it is probably worth writing down the
> conclusions of this thread somewhere on the wiki.)
>
> With that, here is where I see a real problem.
>
> Let us consider a Packaged Document. The URL of this document is
> http://www.example.org/doc. The document includes, among others, chapter
> 2 in file chap2.html. This has a section whose ID is 'sec' (for the sake of
> simplicity, I consider here the simplest and best known fragment used in an
> HTML file, ie, using the @id attribute on a, say, <h1> element). The
> question arising is: what is the full URI for that section? Or, to be more
> exact, what is the full, *canonical* URI for that section, ie, a URI that
> is independent of whether the document is off-line or on-line?
>
> An Aside: How do URI-s work?
> ----------------------------
>
> Tzviya told me privately that not everyone on the group may know how
> exactly URI-s and fragments work in browsers and on the Web. So maybe just
> a few words may be relevant here. If you know this, my apologies, you can
> just skip this part.
>
> A URL consists of, roughly, two parts:
>
> - A "primary" address that identifies the resource somewhere on the web.
> Say, 'http://xyx.example.com/mydoc'
> - A "fragment", that is added after the '#' sign, which identifies
> something *within* the resource; say, 'mysection'
>
> There are two steps in handling this to take into account:
>
> - There can be *only one fragment id in a URL*, ie, only one occurrence of
> '#'. What is after the '#' is interpreted in accordance with a
> corresponding specification that is bound to the media type of the resource.
>
> - A Web browser interprets the fragment locally. Ie, if it gets '
> http://xyx.example.com/mydoc#mysection' it
>         1. strips the fragment
>         2. it issues a request, through the HTTP protocol, for '/mydoc' to
> the 'http://xyx.example.com' server
>         3. it gets the full resource and then uses the fragment (i.e.,
> 'mysection') to identify something within the returned resource.
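
The three steps above can be sketched in a few lines of Python using the
standard urllib.parse module. The function name plan_request is made up
for the illustration. It also shows why the server never sees the
fragment: it is removed before the request is built.

```python
from urllib.parse import urldefrag, urlsplit


def plan_request(url):
    """Return (host, path sent to the server, fragment kept by the client)."""
    # Step 1: strip the fragment; it stays on the client side.
    without_fragment, fragment = urldefrag(url)
    # Step 2: what remains determines the server and the requested path.
    parts = urlsplit(without_fragment)
    # Step 3 (not shown): once the resource arrives, the client applies
    # `fragment` according to the media type of that resource.
    return parts.netloc, parts.path, fragment


host, path, fragment = plan_request("http://xyx.example.com/mydoc#mysection")
```

Here host is "xyx.example.com", path is "/mydoc" and fragment is
"mysection"; only the first two take part in the HTTP request.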
>
>
> What is the URI with fragment for section 'sec' in a package?
> -------------------------------------------------------------
>
> (For the sake of this discussion I refer to the way the packaging
> specification works in terms of fragments.)
>
> 1. If http://www.example.org/doc refers to a real, physical package on
> the Web, accessing 'sec' in chap2.html, using the current fragment
> specification in the packaging document, would be:
>
> http://www.example.org/doc#url=/chap2.html;fragment=sec
>
> meaning:
>         1. The client retrieves the package http://www.example.org/doc
>         2. Unpackages the package in a local cache (or equivalent)
>         3. It interprets the fragment 'url=/chap2.html;fragment=sec'
> (per the current specification of packaging) by
>                 3.1. identifying the 'part' within the package, yielding
> 'chap2.html'
>                 3.2. recognizing that 'chap2.html' is an HTML file; because
> the client knows how to identify something within an HTML file with a
> fragment, it gets to section 'sec'
>
> It is important to realize that, in this model, the 'unpackaging' is done
> by the client (the browser, i.e., the reading system).
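
As an illustration of step 3, here is a minimal Python sketch that takes
such a URL apart on the client. It assumes the fragment is a plain list of
key=value pairs separated by ';', exactly as in the example above; the
real packaging specification may define more keys or escaping rules that
are not handled here.

```python
from urllib.parse import urldefrag


def parse_package_fragment(url):
    """Return (package URL, part inside the package, fragment in that part).

    Assumption: the fragment looks like 'url=<part>;fragment=<id>'.
    """
    package_url, fragment = urldefrag(url)
    fields = {}
    for pair in fragment.split(";"):
        key, separator, value = pair.partition("=")
        if separator:
            fields[key] = value
    return package_url, fields.get("url"), fields.get("fragment")


package_url, part, inner_fragment = parse_package_fragment(
    "http://www.example.org/doc#url=/chap2.html;fragment=sec"
)
```

The client would fetch package_url, unpack it, open part ('/chap2.html')
and then resolve inner_fragment ('sec') as an ordinary HTML id.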
>
> 2. If the package is just 'virtual', ie, all documents are on the Web,
> then there is of course a much simpler approach. The URL of the section is
>
> http://www.example.org/doc/chap2.html#sec
>
> meaning
>         1. The client retrieves the HTML document
> http://www.example.org/doc/chap2.html
>         2. It knows how to identify something within the HTML file with a
> fragment, ie, it gets to section 'sec'
>
>
> Back to the original question: what is the 'canonical' URI with fragment?
> -------------------------------------------------------------------------
>
> It should be one of the two above. However, both have issues:
>
> A. http://www.example.org/doc/chap2.html#sec
>
> Pro: this is the 'natural', Web way.
>
> Con 1: *if* the document is, in fact, a real package then there are two
> possible approaches to handle this:
>
> Con 1.1: The *server* handles the unpackaging. Ie, it should be in a
> position to analyze the URL it receives, realize that there is a 'package'
> in between and do an unpackaging. What this would mean is that the client
> would have to make requests for all chapters separately, which is not
> optimal (although it can of course be cached).
>
> Con 1.2: The *client* handles unpackaging. This would require a different
> server-client protocol, namely:
>         1. The client issues a request to '
> http://www.example.org/doc/chap2.html'
>         2. The server returns 'http://www.example.org/doc/' as a package
> instead of the original chap2.html file (ie, the server should know that
> this is part of a package through some redirection)
>         3. The client should then unpack and locate the chap2.html file in
> the package
>         4. the fragment should be identified and handled.
>
> Steps 1-2-3 are not the current practice on the Web in terms of Web
> Architecture: a client does not 'decompose' the 'primary' part of a URL
> (beyond separating the server's identification from the part within that
> server). It is unclear whether changing that is viable/acceptable for the
> browsers, and for the overall Web Architecture; it certainly requires a
> discussion with the TAG.
>
> Con 2: If the URL is, in fact, a file:///... type one, this means that,
> for that case, the unpackaging must be done on the client. Ie, there may be
> duplication of functionality with the server and the client, which is not
> optimal.
>
> B.  http://www.example.org/doc#url=/chap2.html;fragment=sec
>
> Pro: this works for a package.
>
> For a document on the Web, it may also work if there is a 'conceptual'
> entity on the Web for the document. I.e., http://www.example.org/doc
> returns some sort of information to the client that this is, in fact, a
> 'virtual' package, and then the client can issue a new request to
> http://www.example.org/doc/chap2.html and take it from there.
>
> (Note that, regardless of the original issue, having a 'conceptual'
> package handle for a document may not be a bad thing!)
>
> Con: The URL form is (much) more complex, and may be in danger of being
> ignored for documents that are on the Web only.
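
To show that forms A and B carry the same information once the package
URL is known, here is a small Python sketch converting between them. It
assumes the 'url=...;fragment=...' syntax from the example and that the
package URL is a plain prefix of the resource URL; both function names are
invented for the illustration.

```python
from urllib.parse import urldefrag


def to_package_form(web_url, package_url):
    """Form A -> form B, given the package URL as a known prefix."""
    resource, fragment = urldefrag(web_url)
    base = package_url.rstrip("/")
    if not resource.startswith(base + "/"):
        raise ValueError("resource is not inside the package")
    inner = resource[len(base):]
    result = f"{base}#url={inner}"
    if fragment:
        result += f";fragment={fragment}"
    return result


def to_web_form(package_form_url):
    """Form B -> form A; expects a 'url=' field in the fragment."""
    base, fragment = urldefrag(package_form_url)
    fields = dict(
        pair.split("=", 1) for pair in fragment.split(";") if "=" in pair
    )
    url = base.rstrip("/") + fields["url"]
    if "fragment" in fields:
        url += "#" + fields["fragment"]
    return url


form_b = to_package_form(
    "http://www.example.org/doc/chap2.html#sec", "http://www.example.org/doc"
)
form_a = to_web_form(form_b)
```

The catch is the second argument of to_package_form(): someone (client or
server) still has to know where the package starts, which is the open
question of this thread.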
>
> Personally, I do not have a clear solution in my head. Hence this mail,
> trying to see how we can move on...
>
> Let me also add another remark, coming originally from Tzviya, just to add
> it to the mix: "We need to think about situations such as multiple authors
> creating one package or peer review (one or many authors + one or many
> editors submit article + data set to journal for review. It undergoes peer
> review by one or many reviewers. Journal rejects the article. Something
> happens to the reviews, and the package is submitted to a second journal),
> and so on. In scenarios like this, the concepts of versioning and
> revisioning are a lot more important. It may be covered by OA. I don’t know
> that we can resolve versioning with an identifier."
>
> (Again, apologies for being so verbose…)
>
> Ivan
>
>
>
>
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704
>
>
>
>
>


-- 
- Nick Ruffilo
@NickRuffilo
http://Aerbook.com
http://ZenOfTechnology.com
Received on Monday, 1 June 2015 14:11:30 UTC