(Possibly) core issue on identification with EPUB-WEB, packaging, fragments... from Ivan Herman on 2015-06-01 (public-digipub-ig@w3.org from June 2015)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 1 Jun 2015 12:22:46 +0200
To: W3C Digital Publishing IG <public-digipub-ig@w3.org>
Cc: Ralph Swick <swick@w3.org>
Message-Id: <33CBDA95-3FFC-4B25-9F44-E922D79863E7@w3.org>
Hi all,

my sincere apologies for the length of this mail, but I thought it would be worthwhile to get some issues written down to clarify our discussions...

At the F2F meeting I made the claim that the identifier/fragment issue may be the trickiest one facing us around EPUB-WEB. I thought it was worth writing this down; maybe somebody can prove me wrong and show that this is not such a complex issue after all. Actually, what is below is a summary of a very short email/personal discussion Markus, Tzviya, and I had on the matter after the F2F. (At some point it is probably worth writing down the conclusions of this thread somewhere on the wiki.)

With that, here is where I see a real problem.

Let us consider a Packaged Document. The URL of this document is http://www.example.org/doc. The document includes, among others, chapter 2 in file chap2.html. This has a section whose ID is 'sec' (for the sake of simplicity, I consider here the simplest and best known fragment used in an HTML file, ie, using the @id attribute on, say, an <h1> element). The question arising is: what is the full URI for that section? Or, to be more exact, what is the full, *canonical* URI for that section, ie, a URI that is independent of whether the document is off-line or on-line?

An Aside: How do URI-s work?
----------------------------

Tzviya told me privately that not everyone on the group may know how exactly URI-s and fragments work in browsers and on the Web. So maybe just a few words may be relevant here. If you know this, my apologies, you can just skip this part.

A URL consists of, roughly, two parts:

- A "primary" address that identifies the resource somewhere on the web. Say, 'http://xyx.example.com/mydoc'
- A "fragment", that is added after the '#' sign, which identifies something *within* the resource; say, 'mysection'

There are two points to take into account in handling this:

- There can be *only one fragment id in a URL*, ie, only one occurrence of '#'. What comes after the '#' is interpreted in accordance with a corresponding specification that is bound to the media type of the resource

- A Web browser interprets the fragment locally. Ie, if it gets 'http://xyx.example.com/mydoc#mysection' it
 1. strips the fragment
 2. issues a request, through the HTTP protocol, for '/mydoc' to the 'http://xyx.example.com' server
 3. gets the full resource and then uses the fragment (i.e., 'mysection') to identify something within the returned resource.
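
As a small illustration of the three steps above, here is a sketch in Python using the standard urllib.parse module. The helper name split_for_request is made up for this example; it only shows how the part sent to the server is separated from the fragment that stays on the client.

```python
from urllib.parse import urldefrag

def split_for_request(url):
    """Separate what is requested from the server from what stays local.

    Returns (primary_url, fragment); the fragment is never sent over HTTP.
    """
    primary, fragment = urldefrag(url)
    return primary, fragment

primary, fragment = split_for_request("http://xyx.example.com/mydoc#mysection")
print(primary)   # http://xyx.example.com/mydoc
print(fragment)  # mysection
```

Only the first value would be used to build the HTTP request; the second is applied to the returned resource according to its media type.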


What is the URI with fragment for section 'sec' in a package?
-------------------------------------------------------------

(For the sake of this discussion I refer to the way the packaging specification works in terms of fragments.)

1. If http://www.example.org/doc refers to a real, physical package on the Web, the URL for accessing 'sec' in chap2.html, using the current fragment specification in the packaging document, would be:

http://www.example.org/doc#url=/chap2.html;fragment=sec

meaning:
 1. The client retrieves the package http://www.example.org/doc
 2. Unpackages the package in a local cache (or equivalent)
 3. It interprets the fragment 'url=/chap2.html;fragment=sec' (per the current specification of packaging) by
  3.1. identifying the 'part' within the package, yielding 'chap2.html'
  3.2. 'chap2.html' is an HTML file; the client knows how to identify something within the file with a fragment, ie, it gets to section 'sec'

It is important to realize that, in this model, the 'unpackaging' is done by the client (the browser, i.e., the reading system).
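
A rough sketch of step 3 on the client side, in Python. Two things here are assumptions made for the illustration only: the fragment syntax is taken from the single example above (semicolon-separated key=value pairs), not from a full reading of the packaging specification, and the dict called cache is a hypothetical stand-in for the locally unpackaged content.

```python
from urllib.parse import urldefrag

def parse_package_fragment(fragment):
    """Split 'url=/chap2.html;fragment=sec' into a dict of its parameters."""
    params = {}
    for item in fragment.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        params[key] = value
    return params

def locate(url, unpacked):
    """Return (part_path, inner_fragment, part_content) for a package URL.

    `unpacked` is a hypothetical local cache: part path -> content.
    """
    _package_url, fragment = urldefrag(url)
    params = parse_package_fragment(fragment)
    part = params["url"]                      # step 3.1: find the part
    return part, params.get("fragment"), unpacked[part]  # step 3.2 uses the inner fragment

cache = {"/chap2.html": '<h1 id="sec">Section</h1>'}
part, inner, content = locate(
    "http://www.example.org/doc#url=/chap2.html;fragment=sec", cache)
print(part, inner)  # /chap2.html sec
```

Note that the whole of 'url=/chap2.html;fragment=sec' is a single fragment as far as the URL is concerned; the second-level split happens entirely on the client.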

2. If the package is just 'virtual', ie, all documents are on the Web, then there is of course a much simpler approach. The URL of the section is

http://www.example.org/doc/chap2.html#sec

meaning
 1. The client retrieves the HTML document http://www.example.org/doc/chap2.html
 2. It knows how to identify something within the HTML file with a fragment, ie, it gets to section 'sec'


Back to the original question: what is the 'canonical' URI with fragment?
-------------------------------------------------------------------------

It should be one of the two above. However, both have issues:

A. http://www.example.org/doc/chap2.html#sec

Pro: this is the 'natural', Web way.

Con 1: *if* the document is, in fact, a real package then there are two possible approaches to handle this:

Con 1.1: The *server* handles the unpackaging. Ie, it should be in a position to analyze the URL it receives, realize that there is a 'package' in between and do an unpackaging. What this would mean is that the client would have to make requests for all chapters separately, which is not optimal (although it can of course be cached).

Con 1.2: The *client* handles unpackaging. This would require a different server-client protocol, namely:
 1. The client issues a request to 'http://www.example.org/doc/chap2.html'
 2. The server returns 'http://www.example.org/doc/' as a package instead of the original chap2.html file (ie, the server should know that this is part of a package through some redirection)
 3. The client should then unpack and locate the chap2.html file in the package
 4. the fragment should be identified and handled.
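
To make the four steps concrete, here is a toy simulation in Python. Everything in it is hypothetical: server_get stands in for the HTTP exchange, and the package is modelled as a plain dict rather than any real packaging format. The point is only to show where the client has to take apart the 'primary' part of the URL.

```python
from urllib.parse import urldefrag

PACKAGE_URL = "http://www.example.org/doc/"

# Hypothetical server-side package: part name -> content.
PACKAGE = {"chap2.html": '<h1 id="sec">Section</h1>'}

def server_get(url):
    """Steps 1-2: a request below PACKAGE_URL is answered with the whole package."""
    if url.startswith(PACKAGE_URL):
        return PACKAGE_URL, PACKAGE
    raise LookupError(url)

def client_fetch(url):
    """Steps 1-4 as seen from the client."""
    requested, fragment = urldefrag(url)
    package_url, package = server_get(requested)
    # Step 3: the client decomposes the primary part of the URL itself.
    part = requested[len(package_url):]
    content = package[part]
    # Step 4: the fragment is handled locally on the located part.
    return part, fragment, content

print(client_fetch("http://www.example.org/doc/chap2.html#sec")[:2])
# ('chap2.html', 'sec')
```

The line computing `part` is the step that has no counterpart in today's client behaviour.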

Steps 1-2-3 are not the current practice on the Web in terms of Web Architecture: a client does not 'decompose' the 'primary' part of a URL (beyond separating the server's identification from the part within that server). It is unclear whether changing that is viable/acceptable for the browsers, and for the overall Web Architecture; it certainly requires a discussion with the TAG.

Con 2: If the URL is, in fact, a file:///... type one, this means that, for that case, the unpackaging must be done on the client. Ie, there may be duplication of functionality between the server and the client, which is not optimal.

B.  http://www.example.org/doc#url=/chap2.html;fragment=sec

Pro: this works for a package.

For a document on the Web, it may also work if there is a 'conceptual' entity on the Web for the document. I.e., http://www.example.org/doc returns some sort of information to the client that this is, in fact, a 'virtual' package, and then the client can issue a new request to http://www.example.org/doc/chap2.html and take it from there.
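
For the 'virtual' package case, the follow-up request could be derived from the package-style URL roughly as follows. This is a sketch in Python under two assumptions: the fragment syntax is the one used in the example above, and the parts of a virtual package live directly below the package URL. The function name package_form_to_web_form is made up for the illustration.

```python
from urllib.parse import urldefrag

def package_form_to_web_form(url):
    """Turn '.../doc#url=/chap2.html;fragment=sec' into '.../doc/chap2.html#sec'."""
    package_url, fragment = urldefrag(url)
    params = {}
    for item in fragment.split(";"):
        key, _, value = item.partition("=")
        if key:
            params[key] = value
    # Assumption: the part path is relative to the package URL.
    part = params["url"].lstrip("/")
    web_url = package_url.rstrip("/") + "/" + part
    inner = params.get("fragment")
    return web_url + ("#" + inner if inner else "")

print(package_form_to_web_form(
    "http://www.example.org/doc#url=/chap2.html;fragment=sec"))
# http://www.example.org/doc/chap2.html#sec
```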

(Note that, regardless of the original issue, having a 'conceptual' package handle for a document may not be a bad thing!)

Con: The URL form is (much) more complex, and may be in danger of being ignored for documents that are on the Web only.

Personally, I do not have a clear solution in my head. Hence this mail, trying to see how we can move on...

Let me also add another remark, coming originally from Tzviya, just to add it to the mix: "We need to think about situations such as multiple authors creating one package or peer review (one or many authors + one or many editors submit article + data set to journal for review. It undergoes peer review by one or many reviewers. Journal rejects the article. Something happens to the reviews, and the package is submitted to a second journal) and so on. In scenarios like this, the concepts of versioning and revisioning are a lot more important. It may be covered by OA. I don’t know that we can resolve versioning with an identifier."

(Again, apologies for being so verbose…)

Ivan




----
Ivan Herman, W3C
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Monday, 1 June 2015 10:22:55 UTC