Re: (Possibly) core issue on identification with EPUB-WEB, packaging, fragments... from Brady Duga on 2015-06-01 (public-digipub-ig@w3.org from June 2015)

From: Brady Duga <duga@google.com>
Date: Mon, 01 Jun 2015 15:10:34 +0000
To: Nick Ruffilo <nickruffilo@gmail.com>
Cc: Ivan Herman <ivan@w3.org>, W3C Digital Publishing IG <public-digipub-ig@w3.org>, Ralph Swick <swick@w3.org>
Message-ID: <CAH_p_eWcuQ=nxcOWSi=mtXmFrCW5WViebXBrV1JF7i1ZO1fJGQ@mail.gmail.com>
Well... then every web site today is in a "package". Which seems like
confusing terminology to me, but we can go with it. That just changes the question to
"if we use a different packaging format than is already used to package all
existing web sites, do we still have a problem?" That is, these cons only
seem to apply if we do something new, is that right?

On Mon, Jun 1, 2015 at 8:04 AM Nick Ruffilo <nickruffilo@gmail.com> wrote:

> Brady,
>
> No matter what there is still a package file.  The real question is - is
> that package file "zip" as it is today or some other logical grouping.  I'm
> conceptually unable to figure out a situation in which there is not a
> "package file"  Even if it's a directory - then it's just nomenclature...
> A directory is still a package.
>
>
>
> On Mon, Jun 1, 2015 at 10:58 AM, Brady Duga <duga@google.com> wrote:
>
>> If there is no package file, do these problems still exist? It seems like
>> that is mentioned in the original email, but I am not sure if any of these
>> cons apply to it.
>>
>> On Mon, Jun 1, 2015 at 7:12 AM Nick Ruffilo <nickruffilo@gmail.com>
>> wrote:
>>
>>> What if we leave it up to the client/server to determine what the root
>>> of the package is and handle it appropriately?
>>>
>>> So, an epub-web object (or whatever we call it) might live at :
>>> //my/item/awesome.epub
>>>
>>> To address a specific FILE in that, you go to
>>> //my/item/awesome.epub/text/chap2.html
>>>
>>> To get to a fragment, you just use # in reference to whatever the
>>> fragment is:
>>> //my/item/awesome.epub/text/chap2.html#first_header
>>> //my/item/awesome.epub#SomeCrazyTextRangeIdentifier
>>>
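A minimal sketch of how a client or server might find the package root in such a URL. It assumes the root is recognised purely by an ".epub" extension on a path segment, and it uses a hypothetical host (example.org) because the scheme-less "//my/item/..." form above would be parsed with "my" as the host name. The function name is illustrative only.

```python
from urllib.parse import urlsplit

# Assumption: the package root is the first path segment ending in ".epub".
PACKAGE_SUFFIX = ".epub"


def split_package_url(url):
    """Return (package_path, inner_path, fragment) for a URL that may point into a package."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.lower().endswith(PACKAGE_SUFFIX):
            package_path = "/".join(segments[: i + 1])
            inner_path = "/".join(segments[i + 1:])
            return package_path, inner_path, parts.fragment
    # No package root found: treat the whole path as an ordinary resource.
    return None, parts.path, parts.fragment


print(split_package_url(
    "http://example.org/my/item/awesome.epub/text/chap2.html#first_header"
))
```

An empty inner path means the URL addresses the package itself, as in the second fragment example above.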
>>> If run on a server, it would be the server's job to extract the
>>> appropriate package files (when thinking about epub, the OPF for example)
>>> and provide that to the client, who can then determine the resources it
>>> needs and request them from the server.
>>>
>>> When run LOCALLY, the client will simply extract the package files
>>> directly.  Otherwise there is no duplication of work or resources, etc.
>>>
>>> There was a note about the fragment (things after the #) not being sent
>>> to the server.  If that is truly the case - and not just that the server
>>> ignores it - then we would need a DIFFERENT marker. What that would be, I
>>> have no idea...
>>>
>>> -Nick
>>>
>>> On Mon, Jun 1, 2015 at 6:22 AM, Ivan Herman <ivan@w3.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> my sincere apologies for the length of this mail, but I thought it would
>>>> be worthwhile to get some issues written down to clarify our discussions...
>>>>
>>>> On the F2F meeting I made the claim that the identifier/fragment issue
>>>> may be the most tricky one facing us around EPUB-WEB. I thought it is worth
>>>> writing this down; maybe somebody can also prove me wrong and show that this is not
>>>> such a complex issue after all. Actually, what is below is a summary of a
>>>> very short email/personal discussion Markus, Tzviya, and I had on the
>>>> matter after the F2F. (At some point it is probably worth writing down the
>>>> conclusions of this thread somewhere on the wiki.)
>>>>
>>>> With that, here is where I see a real problem.
>>>>
>>>> Let us consider a Packaged Document. The URL of this document is
>>>> http://www.example.org/doc. The document includes, among others,
>>>> chapter 2 in file chap2.html. This has a section whose ID is 'sec' (for the
>>>> sake of simplicity, I consider here the simplest and best known fragment
>>>> used in an HTML file, ie, using the @id attribute on a, say, <h1> element).
>>>> The question arising is: what is the full URI for that section? Or, to be
>>>> more exact, what is the full, *canonical* URI for that section, ie, a URI
>>>> that is independent of whether the document is off-line or on-line?
>>>>
>>>> An Aside: How do URI-s work?
>>>> ----------------------------
>>>>
>>>> Tzviya told me privately that not everyone on the group may know how
>>>> exactly URI-s and fragments work in browsers and on the Web. So maybe just
>>>> a few words may be relevant here. If you know this, my apologies, you can
>>>> just skip this part.
>>>>
>>>> A URL consists of, roughly, two parts:
>>>>
>>>> - A "primary" address that identifies the resource somewhere on the
>>>> web. Say, 'http://xyx.example.com/mydoc'
>>>> - A "fragment", that is added after the '#' sign, which identifies
>>>> something *within* the resource; say, 'mysection'
>>>>
>>>> There are two steps in handling this to take into account:
>>>>
>>>> - There can be *only one fragment id in a URL*, ie, only one occurrence
>>>> of '#'. What is after the '#' is interpreted in accordance with a
>>>> corresponding specification that is bound to the media type of the resource
>>>>
>>>> - A Web browser interprets the fragment locally. Ie, if it gets '
>>>> http://xyx.example.com/mydoc#mysection' it
>>>>         1. strips the fragment
>>>>         2. it issues a request, through the HTTP protocol, for '/mydoc'
>>>> to the 'http://xyx.example.com' server
>>>>         3. it gets the full resource and then uses the fragment (i.e.,
>>>> 'mysection') to identify something within the returned resource.
>>>>
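The three steps above can be sketched with Python's standard library. This only illustrates which part of the URL goes to the server and which part stays on the client; the function name and the returned dictionary keys are assumptions made for the example, and query strings are ignored for simplicity.

```python
from urllib.parse import urldefrag, urlsplit


def plan_request(url):
    """Split a URL into what is sent to the server and what stays on the client."""
    # Step 1: strip the fragment; it is never part of the HTTP request.
    without_fragment, fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    return {
        # Step 2: the request for the path goes to this server.
        "server": f"{parts.scheme}://{parts.netloc}",
        "request_path": parts.path or "/",
        # Step 3: the fragment is applied locally to the returned resource.
        "local_fragment": fragment,
    }


print(plan_request("http://xyx.example.com/mydoc#mysection"))
```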
>>>>
>>>> What is the URI with fragment for section 'sec' in a package?
>>>> -------------------------------------------------------------
>>>>
>>>> (For the sake of this discussion I refer to the way the packaging
>>>> specification works in terms of fragments.)
>>>>
>>>> 1. If http://www.example.org/doc refers to a real, physical package on
>>>> the Web, accessing 'sec' in chap2.html, using the current fragment
>>>> specification in the packaging document, would be:
>>>>
>>>> http://www.example.org/doc#url=/chap2.html;fragment=sec
>>>>
>>>> meaning:
>>>>         1. The client retrieves the package http://www.example.org/doc
>>>>         2. Unpackages the package in a local cache (or equivalent)
>>>>         3. It interprets the fragment 'url=/chap2.html;fragment=sec'
>>>> (per the current specification of packaging) by
>>>>                 3.1. identifying the 'part' within the package,
>>>> yielding 'chap2.html'
>>>>                 3.2. recognizing that 'chap2.html' is an HTML file; the
>>>> client knows how to identify something within an HTML file with a
>>>> fragment, ie, it gets to section 'sec'
>>>>
>>>> It is important to realize that, in this model, the 'unpackaging' is
>>>> done by the client (the browser, i.e., the reading system).
>>>>
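A rough sketch of step 3, done on the client. It only handles the two keys that appear in the example URL ('url' and 'fragment') separated by ';' and is not meant as an implementation of the full fragment syntax of the packaging specification. The function name is hypothetical.

```python
from urllib.parse import urldefrag


def parse_package_fragment(url):
    """Split a packaged-form URL into (package_url, part, inner_fragment)."""
    # The package URL is what the client actually requests from the server.
    package_url, fragment = urldefrag(url)
    fields = {}
    for item in fragment.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    # 'url' names the part inside the package, 'fragment' the target within it.
    return package_url, fields.get("url"), fields.get("fragment")


print(parse_package_fragment(
    "http://www.example.org/doc#url=/chap2.html;fragment=sec"
))
```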
>>>> 2. If the package is just 'virtual', ie, all documents are on the Web,
>>>> then there is of course a much simpler approach. The URL of the section is
>>>>
>>>> http://www.example.org/doc/chap2.html#sec
>>>>
>>>> meaning
>>>>         1. The client retrieves the HTML document
>>>> http://www.example.org/doc/chap2.html
>>>>         2. It knows how to identify something within the HTML file with
>>>> a fragment, ie, it gets to section 'sec'
>>>>
>>>>
>>>> Back to the original question: what is the 'canonical' URI with fragment?
>>>> -------------------------------------------------------------------------
>>>>
>>>> It should be one of the two above. However, both have issues:
>>>>
>>>> A. http://www.example.org/doc/chap2.html#sec
>>>>
>>>> Pro: this is the 'natural', Web way.
>>>>
>>>> Con 1: *if* the document is, in fact, a real package then there are two
>>>> possible approaches to handle this:
>>>>
>>>> Con 1.1: The *server* handles the unpackaging. Ie, it should be in
>>>> position to analyze the URL it receives, realize that there is a 'package'
>>>> in between and do an unpackaging. What this would mean is that the client
>>>> would have to make requests for all chapters separately, which is not
>>>> optimal (although it can of course be cached).
>>>>
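To make Con 1.1 concrete, here is a small sketch of a server-side lookup. It assumes the package is a ZIP archive (as EPUB is today) and that the server already knows which path is the package root; both functions and the demo content are made up for illustration. Each chapter still costs one request, which is the drawback noted above.

```python
import io
import zipfile


def build_demo_package():
    """Create a tiny in-memory ZIP standing in for the package at /doc."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("chap1.html", "<h1 id='intro'>Chapter 1</h1>")
        archive.writestr("chap2.html", "<h1 id='sec'>Chapter 2</h1>")
    return buffer.getvalue()


def serve_from_package(package_bytes, package_root, request_path):
    """Return the bytes of one package member for a request path, or None."""
    prefix = package_root.rstrip("/") + "/"
    if not request_path.startswith(prefix):
        return None  # the request is not inside this package
    member = request_path[len(prefix):]
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        if member not in archive.namelist():
            return None  # no such file in the package
        return archive.read(member)


package = build_demo_package()
print(serve_from_package(package, "/doc", "/doc/chap2.html"))
```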
>>>> Con 1.2: The *client* handles unpackaging. This would require a
>>>> different server-client protocol, namely:
>>>>         1. The client issues a request to '
>>>> http://www.example.org/doc/chap2.html'
>>>>         2. The server returns 'http://www.example.org/doc/' as a
>>>> package instead of the original chap2.html file (ie, the server should know
>>>> that this is part of a package through some redirection)
>>>>         3. The client should then unpack and locate the chap2.html file
>>>> in the package
>>>>         4. the fragment should be identified and handled.
>>>>
>>>> Steps 1-2-3 are not the current practice on the Web in terms of Web
>>>> Architecture: a client does not 'decompose' the 'primary' part of a URL
>>>> (beyond separating the server's identification from the part within that
>>>> server). It is unclear whether changing that is viable/acceptable for the
>>>> browsers, and for the overall Web Architecture; it certainly requires a
>>>> discussion with the TAG.
>>>>
>>>> Con 2: If the URL is, in fact, a file:///... type one, this means that,
>>>> for that case, the unpackaging must be done on the client. Ie, there may be
>>>> duplication of functionality with the server and the client, which is not
>>>> optimal.
>>>>
>>>> B.  http://www.example.org/doc#url=/chap2.html;fragment=sec
>>>>
>>>> Pro: this works for a package.
>>>>
>>>> For a document on the Web, it may also work if there is a 'conceptual'
>>>> entity on the Web for the document. I.e., http://www.example.org/doc
>>>> returns some sort of information to the client that this is, in fact, a
>>>> 'virtual' package, and then the client can issue a new request to
>>>> http://www.example.org/doc/chap2.html and take it from there.
>>>>
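For the 'virtual' package case, the rewrite from form B to form A could look roughly like this. It is a sketch under the assumption that the part path is simply appended to the package URL; the function name is hypothetical, and only the 'url' and 'fragment' keys from the example are handled.

```python
from urllib.parse import urldefrag


def to_web_form(form_b_url):
    """Rewrite 'doc#url=/chap2.html;fragment=sec' style URLs as 'doc/chap2.html#sec'."""
    base, fragment = urldefrag(form_b_url)
    fields = dict(item.split("=", 1) for item in fragment.split(";") if "=" in item)
    part = fields.get("url")
    if part is None:
        return form_b_url  # not a form-B URL; leave it untouched
    # Append the part to the package URL rather than resolving it against the
    # server root, so '/chap2.html' stays inside '/doc'.
    target = base.rstrip("/") + "/" + part.lstrip("/")
    inner = fields.get("fragment")
    return f"{target}#{inner}" if inner else target


print(to_web_form("http://www.example.org/doc#url=/chap2.html;fragment=sec"))
```

Plain string concatenation is used here instead of urllib.parse.urljoin, because urljoin would resolve the leading '/' in '/chap2.html' against the server root and drop '/doc'.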
>>>> (Note that, regardless of the original issue, having a 'conceptual'
>>>> package handle for a document may not be a bad thing!)
>>>>
>>>> Con: The URL form is (much) more complex, and may be in danger of being
>>>> ignored for documents that are on the Web only.
>>>>
>>>> Personally, I do not have a clear solution in my head. Hence this mail,
>>>> trying to see how we can move on...
>>>>
>>>> Let me also add another remark, coming originally from Tzviya, just to
>>>> add it to the mix: "We need to think about situations such as multiple
>>>> authors creating one package or peer review (one or many authors + one or
>>>> many editors submit article + data set to journal for review. It undergoes
>>>> peer review by one or many reviewers. Journal rejects the article.
>>>> Something happens to the reviews, and the package is submitted to a second
>>>> journal) and so on. In scenarios like this, the concepts of versioning and
>>>> revisioning are a lot more important. It may be covered by OA. I don’t know
>>>> that we can resolve versioning with an identifier."
>>>>
>>>> (Again, apologies for being so verbose…)
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>>
>>>> ----
>>>> Ivan Herman, W3C
>>>> Digital Publishing Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> ORCID ID: http://orcid.org/0000-0003-0782-2704
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> - Nick Ruffilo
>>> @NickRuffilo
>>> http://Aerbook.com
>>> http://ZenOfTechnology.com
>>>
>>>
>
>
> --
> - Nick Ruffilo
> @NickRuffilo
> http://Aerbook.com
> http://ZenOfTechnology.com
>
>
Received on Monday, 1 June 2015 15:11:17 UTC