Re: [TF] Virtual Locators based on html or raw text from Ivan Herman on 2021-09-18 (public-epub-wg@w3.org from September 2021)

From: Ivan Herman <ivan@w3.org>
Date: Sat, 18 Sep 2021 08:44:52 +0200
To: Laurent Le Meur <laurent.lemeur@edrlab.org>
Cc: W3C EPUB 3 Working Group <public-epub-wg@w3.org>
Message-Id: <25C563D8-4B37-4FFF-B689-4B63357E49D3@w3.org>
Merci Laurent.

My immediate reaction is that user experience quality should win over implementation difficulty. I do not see anything in Mickaël's text that says "it cannot be done". I believe what we have to discuss, and decide upon, is which approach is best for the users/readers in practice.

Ivan

> On 17 Sep 2021, at 19:28, Laurent Le Meur <laurent.lemeur@edrlab.org> wrote:
> 
> Following our discussion today, here is the feedback from Mickaël Menu, maintainer of Readium Mobile, on the idea of calculating positions ("calculated page numbers") from the raw text rather than from the full HTML content:
> 
> ---
> - It would be much more CPU and IO intensive to compute the positions from this, because we'd have to read each (potentially streamed / encrypted) resource to get its full size, compared to just looking at the ZIP metadata in the EPUB case.
> (nb Laurent: the size of the uncompressed resource is in bytes, not Unicode code points).
> - It might be quite complicated to match a position in the web view.
> - It will likely give a poor experience for image-based resources.
> (nb Laurent: I don't see what could give a good experience). 
> 
> This algorithm feels nice and less dependent on technical stuff like HTML structure, but what does it bring to users compared to the current implementation? We will still have a varying number of positions per screen, and they won't better match physical books.
> 
> IMHO the TF could focus on improving on the downsides of the current implementations, which often depend on the ZIP instead of the resources themselves. This means that, for the same publication, we won't get the same number of positions if the EPUB is exploded, if the compression status of resources changes, or if an EPUB is streamed from the web.
> 
> One redeeming quality of the strategy suggested by the TF is that if the HTML structure changes but not the actual publication text, then we get the same positions. But not sure it matters in practice.
> ---
> 
> Best regards
> Laurent
> It's much more CPU and IO intensive to compute the positions from this, because we have to read (potentially streamed / encrypted)  all the resources, compared to just looking at the ZIP metadata
> Like you said, it might be quite complicated to match a position in the web view
> It will likely give a poor experience for image-based resources
> This algorithm feels nice and less dependent on technical stuff like HTML structure, but what does it bring to users compared to the current implementation? We will still have a varying number of positions per screen, and they won't better match physical books.
> 5:26 <https://readium.slack.com/archives/C2JNGCN6R/p1631892401014900>
> IMHO the TF could focus on improving on the downsides of the current implementations, which often depend on the ZIP instead of the resources themselves. This means that, for the same publication, we won't get the same number of positions if the EPUB is exploded, if the compression status of resources changes, or if an EPUB is streamed from the web.
> 5:27 <https://readium.slack.com/archives/C2JNGCN6R/p1631892476015700>
> (You can probably read that I'm biased for the old Readium positions strategy which was using the "original length" of resources instead of the archive entry length )
> 5:29 <https://readium.slack.com/archives/C2JNGCN6R/p1631892542016500>
> One redeeming quality of the strategy suggested by the TF is that if the HTML structure changes but not the actual publication text, then we get the same positions. But not sure it matters in practice.
> 
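
To make the trade-off in the quoted feedback concrete, here is a minimal Python sketch comparing position counts derived from a resource's byte length with counts derived from its raw text. Everything in it is an illustrative assumption rather than something taken from the thread or from Readium's code: the 1024-unit chunk size, the function names, and the rule that every resource gets at least one position.

```python
import math
from html.parser import HTMLParser

CHUNK = 1024  # assumed size of one position; illustrative only


class _TextExtractor(HTMLParser):
    """Collects character data and ignores tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def raw_text(html: str) -> str:
    """Return the text content of an HTML string, without markup."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def position_count(length: int, chunk: int = CHUNK) -> int:
    # Assumption: every resource gets at least one position,
    # even when its measured length is zero (e.g. an image-only page).
    return max(1, math.ceil(length / chunk))


def positions_from_bytes(resources):
    """Count positions per resource from the UTF-8 byte length of the markup."""
    return [position_count(len(r.encode("utf-8"))) for r in resources]


def positions_from_text(resources):
    """Count positions per resource from the code-point length of the raw text."""
    return [position_count(len(raw_text(r))) for r in resources]
```

With two resources that carry the same 3050 characters of text but different amounts of markup, the byte-based count differs (3 versus 4 positions) while the text-based count is 3 for both. The text-based variant has to read and parse every resource, which is the CPU and IO cost raised in the feedback, and an image-only resource collapses to a single position.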


----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +33 6 52 46 00 43
ORCID ID: https://orcid.org/0000-0003-0782-2704
Received on Saturday, 18 September 2021 06:44:58 UTC