RE: (Possibly) core issue on identification with EPUB-WEB, packaging, fragments... from Siegman, Tzviya - Hoboken on 2015-06-01 (public-digipub-ig@w3.org from June 2015)

From: Siegman, Tzviya - Hoboken <tsiegman@wiley.com>
Date: Mon, 1 Jun 2015 12:42:15 -0400
To: Brady Duga <duga@google.com>, Nick Ruffilo <nickruffilo@gmail.com>
CC: Ivan Herman <ivan@w3.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>, Ralph Swick <swick@w3.org>
Message-ID: <C274A5503C851E43A8ED400AC86E0285146C325AA3@SOM-MB.wiley.com>
So, let’s explore what we mean by “package”. Just because I am collecting requirements doesn’t mean I’m sold on the concept. The more I work on this, the less I think we need a package format. I’m getting increasingly comfortable with the concept of a canonical URL that defines scope in some manner.

What makes a publication canonical is (and I REALLY don’t want to get into the “what is a book” conversation):

1. It is published (by publisher, author, whomever)

2. It has a defined set of content at the time of publication (annotations to come)

3. Additions/deletions are versions/revisions (think back to the print line in paper)

4. ???
If we can capture this in what we are defining, then I think we have done our job.

I think that we’d all like to see publications treated in the same way as websites by browsers. The question then is how to achieve portability and a formal archive status.

Tzviya Siegman
Digital Book Standards & Capabilities Lead
Wiley
201-748-6884
tsiegman@wiley.com

From: Brady Duga [mailto:duga@google.com]
Sent: Monday, June 01, 2015 11:11 AM
To: Nick Ruffilo
Cc: Ivan Herman; W3C Digital Publishing IG; Ralph Swick
Subject: Re: (Possibly) core issue on identification with EPUB-WEB, packaging, fragments...

Well... then every web site today is in a "package". Which seems like confusing terminology to me, but we can go with it. That just changes to "if we use a different packaging format than is already used to package all existing web sites, do we still have a problem?" That is, these cons only seem to apply if we do something new, is that right?

On Mon, Jun 1, 2015 at 8:04 AM Nick Ruffilo <nickruffilo@gmail.com> wrote:
Brady,

No matter what, there is still a package file. The real question is: is that package file a "zip" as it is today, or some other logical grouping? I'm conceptually unable to figure out a situation in which there is not a "package file". Even if it's a directory, then it's just nomenclature... A directory is still a package.



On Mon, Jun 1, 2015 at 10:58 AM, Brady Duga <duga@google.com> wrote:
If there is no package file, do these problems still exist? It seems like that is mentioned in the original email, but I am not sure if any of these cons apply to it.

On Mon, Jun 1, 2015 at 7:12 AM Nick Ruffilo <nickruffilo@gmail.com> wrote:
What if we leave it up to the client/server to determine what the root of the package is and handle it appropriately?

So, an epub-web object (or whatever we call it) might live at: //my/item/awesome.epub

To address a specific FILE in that, you go to //my/item/awesome.epub/text/chap2.html

To get to a fragment, you just use # in reference to whatever the fragment is:
//my/item/awesome.epub/text/chap2.html#first_header
//my/item/awesome.epub#SomeCrazyTextRangeIdentifier
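
As a rough illustration of the addressing scheme above, the sketch below splits such a URL into the package root, the file inside the package, and the fragment. It assumes, purely for illustration, that the package is the first path segment ending in ".epub" and that the object lives on a host called example.org; the function name split_package_url is made up here and is not part of any specification.

```python
from urllib.parse import urlsplit


def split_package_url(url, suffix=".epub"):
    """Split a URL into (package root path, path inside the package, fragment).

    Assumption: the package is the first path segment ending in `suffix`.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.endswith(suffix):
            root = "/".join(segments[: i + 1])
            inner = "/".join(segments[i + 1:])
            return root, inner, parts.fragment
    # No package segment found: treat the whole path as an ordinary resource.
    return None, parts.path, parts.fragment


# Hypothetical host; the paths mirror the examples above.
print(split_package_url("http://example.org/my/item/awesome.epub/text/chap2.html#first_header"))
print(split_package_url("http://example.org/my/item/awesome.epub#SomeCrazyTextRangeIdentifier"))
```

Whether this split happens on the server or on the client is exactly the open question; the code only shows that the split itself is mechanical once the package root is known.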

If run on a server, it would be the server's job to extract the appropriate package files (when thinking about epub, the OPF for example) and provide that to the client, who can then determine the resources it needs and request them from the server.

When run LOCALLY, the client will simply extract the package files directly.  Otherwise there is no duplication of work or resources, etc.

There was a note about the fragment (things after the #) not being sent to the server.  If that is truly the case, and not just that the server ignores it, then we would need a DIFFERENT marker. Which one, I have no idea...

-Nick

On Mon, Jun 1, 2015 at 6:22 AM, Ivan Herman <ivan@w3.org> wrote:
Hi all,

my sincere apologies for the length of this mail, but I thought it would be worthwhile to get some issues written down to clarify our discussions...

At the F2F meeting I made the claim that the identifier/fragment issue may be the trickiest one facing us around EPUB-WEB. I thought it was worth writing this down; maybe somebody can also prove me wrong and show that this is not such a complex issue after all. Actually, what is below is a summary of a very short email/personal discussion Markus, Tzviya, and I had on the matter after the F2F. (At some point it is probably worth writing down the conclusions of this thread somewhere on the wiki.)

With that, here is where I see a real problem.

Let us consider a Packaged Document. The URL of this document is http://www.example.org/doc. The document includes, among others, chapter 2 in file chap2.html. This has a section whose ID is 'sec' (for the sake of simplicity, I consider here the simplest and best known fragment used in an HTML file, ie, using the @id attribute on a, say, <h1> element). The question arising is: what is the full URI for that section? Or, to be more exact, what is the full, *canonical* URI for that section, ie, a URI that is independent of whether the document is off-line or on-line?

An Aside: How do URI-s work?
----------------------------

Tzviya told me privately that not everyone in the group may know exactly how URI-s and fragments work in browsers and on the Web. So maybe a few words are relevant here. If you know this, my apologies, you can just skip this part.

A URL consists of, roughly, two parts:

- A "primary" address that identifies the resource somewhere on the web. Say, 'http://xyx.example.com/mydoc'
- A "fragment", that is added after the '#' sign, which identifies something *within* the resource; say, 'mysection'

There are two steps in handling this to take into account:

- There can be *only one fragment id in a URL*, ie, only one occurrence of '#'. What is after the '#' is interpreted in accordance with a corresponding specification that is bound to the media type of the resource

- A Web browser interprets the fragment locally. Ie, if it gets 'http://xyx.example.com/mydoc#mysection' it
        1. strips the fragment
        2. it issues a request, through the HTTP protocol, for '/mydoc' to the 'http://xyx.example.com' server
        3. it gets the full resource and then uses the fragment (i.e., 'mysection') to identify something within the returned resource.
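
For readers who prefer to see this in code, here is a small sketch of steps 1 and 2 using Python's standard urllib.parse module. It only shows how the URL is taken apart on the client; no request is actually sent, and step 3 (locating 'mysection' in the returned resource) is left out.

```python
from urllib.parse import urldefrag, urlsplit

url = "http://xyx.example.com/mydoc#mysection"

# Step 1: strip the fragment; it stays on the client.
request_url, fragment = urldefrag(url)

# Step 2: the HTTP request is addressed to the host and asks for the path only.
target = urlsplit(request_url)
print(target.netloc, target.path)   # xyx.example.com /mydoc

# Step 3 would use this value to find something inside the returned resource.
print(fragment)                     # mysection
```

The point to retain is that the server never sees 'mysection': anything placed after the '#' can only be acted on by the client.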


What is the URI with fragment for section 'sec' in a package?
-------------------------------------------------------------

(For the sake of this discussion I refer to the way the packaging specification works in terms of fragments.)

1. If http://www.example.org/doc refers to a real, physical package on the Web, the URI for accessing 'sec' in chap2.html, using the current fragment specification in the packaging document, would be:

http://www.example.org/doc#url=/chap2.html;fragment=sec


meaning:
        1. The client retrieves the package http://www.example.org/doc

        2. Unpackages the package in a local cache (or equivalent)
        3. It interprets the fragment 'url=/chap2.html;fragment=sec' (per the current specification of packaging) by
                3.1. identifying the 'part' within the package, yielding 'chap2.html'
                3.2. recognizing that 'chap2.html' is an HTML file; because the client knows how to identify something within such a file with a fragment, it gets to section 'sec'

It is important to realize that, in this model, the 'unpackaging' is done by the client (the browser, i.e., the reading system).
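
Step 3 above can be sketched as follows. This is only an illustration based on the single example URI in this mail: it assumes the fragment is a list of key=value pairs separated by ';', and it ignores any escaping or additional keys that the packaging specification may define. The function name parse_package_fragment is hypothetical.

```python
from urllib.parse import urldefrag


def parse_package_fragment(url):
    """Return (package URL, part path, inner fragment) for a URI such as
    http://www.example.org/doc#url=/chap2.html;fragment=sec

    Assumption: the fragment is a ';'-separated list of key=value pairs.
    """
    package_url, fragment = urldefrag(url)
    fields = {}
    for pair in fragment.split(";"):
        key, sep, value = pair.partition("=")
        if sep:  # skip anything that is not of the form key=value
            fields[key] = value
    return package_url, fields.get("url"), fields.get("fragment")


print(parse_package_fragment("http://www.example.org/doc#url=/chap2.html;fragment=sec"))
```

All of this runs on the client after the package has been retrieved, which is consistent with the fragment never being sent to the server.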

2. If the package is just 'virtual', ie, all documents are on the Web, then there is of course a much simpler approach. The URL of the section is

http://www.example.org/doc/chap2.html#sec


meaning
        1. The client retrieves the HTML document http://www.example.org/doc/chap2.html

        2. It knows how to identify something within the HTML file with a fragment, ie, it gets to section 'sec'
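
To make the contrast between the two cases concrete, the sketch below builds both URI forms from the same three pieces of information (document URL, part, section id). The helper names are made up for this illustration, and the package form simply copies the pattern of the example in case 1 without any escaping.

```python
def to_package_form(doc_url, part, section):
    """Case 1, a real package: doc#url=/part;fragment=section."""
    return f"{doc_url}#url=/{part};fragment={section}"


def to_web_form(doc_url, part, section):
    """Case 2, a 'virtual' package: doc/part#section."""
    return f"{doc_url.rstrip('/')}/{part}#{section}"


doc = "http://www.example.org/doc"
print(to_package_form(doc, "chap2.html", "sec"))
print(to_web_form(doc, "chap2.html", "sec"))
```

Both strings carry the same information; the difference is which side of the '#' the part name sits on, and therefore whether the server or the client has to resolve it.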


Back to the original question: what is the 'canonical' URI with fragment?
-------------------------------------------------------------------------

It should be one of the two above. However, both have issues:

A. http://www.example.org/doc/chap2.html#sec


Pro: this is the 'natural', Web way.

Con 1: *if* the document is, in fact, a real package then there are two possible approaches to handle this:

Con 1.1: The *server* handles the unpackaging. Ie, it should be in a position to analyze the URL it receives, realize that there is a 'package' in between and do an unpackaging. What this would mean is that the client would have to make requests for all chapters separately, which is not optimal (although it can of course be cached).

Con 1.2: The *client* handles unpackaging. This would require a different server-client protocol, namely:
        1. The client issues a request to 'http://www.example.org/doc/chap2.html'
        2. The server returns 'http://www.example.org/doc/' as a package instead of the original chap2.html file (ie, the server should know that this is part of a package through some redirection)
        3. The client should then unpack and locate the chap2.html file in the package
        4. the fragment should be identified and handled.

Steps 1-2-3 are not the current practice on the Web in terms of Web Architecture: a client does not 'decompose' the 'primary' part of a URL (beyond separating the server's identification from the part within that server). It is unclear whether changing that is viable/acceptable for the browsers, and for the overall Web Architecture; it certainly requires a discussion with the TAG.
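
A toy simulation of the Con 1.2 exchange may help to see what would be asked of the client. Everything here is an assumption made for illustration: the package is modelled as an in-memory zip, the "server" is a plain function rather than an HTTP endpoint, and the way the server tells the client where the package root is (here, a returned string) is invented.

```python
import io
import zipfile


def build_package(files):
    """Build an in-memory zip standing in for the package at .../doc/."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


PACKAGE_ROOT = "http://www.example.org/doc/"
PACKAGE_BYTES = build_package({"chap2.html": '<h1 id="sec">Section</h1>'})


def fake_server(url):
    """Step 2: for any URL under the package root, answer with the whole package."""
    if url.startswith(PACKAGE_ROOT):
        return PACKAGE_ROOT, PACKAGE_BYTES
    raise LookupError(url)


def client_fetch(url):
    """Steps 1 and 3: request the chapter, then locate it in the returned package."""
    root, payload = fake_server(url)
    # The unusual part: the client decomposes the 'primary' part of the URL.
    part = url[len(root):]
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return archive.read(part).decode("utf-8")


# Step 4 (handling '#sec') would then proceed as for any HTML document.
print(client_fetch("http://www.example.org/doc/chap2.html"))
```

The line that computes `part` is the step with no counterpart in today's browsers, which is why the TAG would need to be involved.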

Con 2: If the URL is, in fact, a file:///... type one, this means that, for that case, the unpackaging must be done on the client. Ie, there may be duplication of functionality between the server and the client, which is not optimal.

B.  http://www.example.org/doc#url=/chap2.html;fragment=sec


Pro: this works for a package.

For a document on the Web, it may also work if there is a 'conceptual' entity on the Web for the document. I.e., http://www.example.org/doc returns some sort of information to the client that this is, in fact, a 'virtual' package, and then the client can issue a new request to http://www.example.org/doc/chap2.html and take it from there.

(Note that, regardless of the original issue, having a 'conceptual' package handle for a document may not be a bad thing!)

Con: The URL form is (much) more complex, and may be in danger of being ignored for documents that are on the Web only.

Personally, I do not have a clear solution in my head. Hence this mail, trying to see how we can move on...

Let me also add another remark, coming originally from Tzviya, just to add it to the mix: "We need to think about situations such as multiple authors creating one package or peer review (one or many authors + one or many editors submit article + data set to journal for review. It undergoes peer review by one or many reviewers. Journal rejects the article. Something happens to the reviews, and the package is submitted to a second journal), and so on. In scenarios like this, the concepts of versioning and revisioning are a lot more important. It may be covered by OA. I don’t know that we can resolve versioning with an identifier."

(Again, apologies for being so verbose…)

Ivan




----
Ivan Herman, W3C
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/

mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704







--
- Nick Ruffilo
@NickRuffilo
http://Aerbook.com

http://ZenOfTechnology.com




Received on Monday, 1 June 2015 16:42:59 UTC