RE: [TF] Virtual Locators based on html or raw text

Hi,

 

I think I have said this many times, but we should first encourage authors and publishers to insert page numbers themselves; when they do, reading systems (RS) must not calculate their own.

 

Next, companies that ingest titles should check whether page numbers have been authored; if not, they should add them to the titles using a common algorithm, assuming they have the rights to make modifications (a sketch of such a check follows below).

Only then should RS fall back to calculated virtual pages.
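
(For illustration, a minimal Python sketch of such an ingestion check, assuming the nav document sits at a fixed path; a real pipeline would locate it via the OPF package document:)

import zipfile
import xml.etree.ElementTree as ET

OPS_NS = "http://www.idpf.org/2007/ops"
XHTML_NS = "http://www.w3.org/1999/xhtml"

def has_authored_page_list(epub_path: str, nav_path: str = "OEBPS/nav.xhtml") -> bool:
    """True if the EPUB nav document contains a <nav epub:type="page-list">."""
    with zipfile.ZipFile(epub_path) as z:
        try:
            nav_xml = z.read(nav_path)
        except KeyError:
            return False  # no nav document at the assumed path
    root = ET.fromstring(nav_xml)
    for nav in root.iter(f"{{{XHTML_NS}}}nav"):
        # epub:type may carry several space-separated values
        if "page-list" in (nav.get(f"{{{OPS_NS}}}type") or "").split():
            return True
    return False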

 

Best

George

 

 

From: Laurent Le Meur <laurent.lemeur@edrlab.org> 
Sent: Saturday, September 18, 2021 2:53 AM
To: Dan Lazin <dlazin@google.com>
Cc: public-epub-wg@w3.org
Subject: Re: [TF] Virtual Locators based on html or raw text

 

Hi Dan, there may be a misunderstanding here. When Mickaël speaks about the "current" implementation, he means the current Readium implementation, i.e. positions calculated as 1024-byte segments.
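
(To make "1024-byte segments" concrete, a rough Python sketch; the actual Readium toolkits are written in Kotlin and Swift, so treat this only as an approximation of the scheme:)

import math

SEGMENT_SIZE = 1024  # bytes of uncompressed resource content

def positions_for_resource(length_in_bytes: int) -> int:
    # One position per started 1024-byte segment, at least one per resource.
    return max(1, math.ceil(length_in_bytes / SEGMENT_SIZE))

def total_positions(resource_lengths: list[int]) -> int:
    return sum(positions_for_resource(n) for n in resource_lengths)

# Example: a 10 000-byte chapter contributes ceil(10000 / 1024) = 10 positions.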

 

We all agree that the major interest of the guidelines we are trying to create is Consistency: consistency in the implementation of positions (aka calculated page numbers) across different reading systems.

 

A consensus must now emerge on the relative weight of two other features: Simplicity and Accuracy.

 

Accuracy is not reachable; we discussed it at length. Images and video clips, MathML equations and other content that occupies a screen surface different from its markup size, typography, fixed-layout structures: all of this goes against the equivalence of print pages with the segments we can calculate. We can approximate print page lengths only when the content is textual and simple (e.g. novels in reflow mode). I think we all agree on that.

 

Simplicity, for reading system developers, is therefore the second major feature we must aim for, after Consistency.

 

Do we all agree on the ordered list? 

1. Consistency

2. Simplicity

3. Accuracy

 

If yes, we can narrow down the possible solutions. 

 

What Mickaël said about content that can be packaged or streamed (Web Publication) is also important: the algorithm must consistently give the same result in both cases, which means saying goodbye to calculations based on compressed content.
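
(A quick Python sketch of that point, using only the standard zipfile module; the size fields shown are real ZIP metadata:)

import zipfile

def entry_sizes(epub_path: str) -> dict[str, tuple[int, int]]:
    """Map each archive entry to (compressed_size, uncompressed_size)."""
    with zipfile.ZipFile(epub_path) as z:
        return {i.filename: (i.compress_size, i.file_size) for i in z.infolist()}

# i.compress_size varies with the tool and compression level used to build
# the EPUB, and does not exist at all for a resource streamed from the web;
# i.file_size is the stable, transport-independent length that a consistent
# position algorithm can rely on in both cases.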

 

Best regards

Laurent





On Sep 17, 2021, at 22:37, Dan Lazin <dlazin@google.com> wrote:

 

Thanks, Laurent!

 

To quickly answer "what does it bring compared to the current implementation for users":

 

1) "Page numbers" that are consistent across different reading systems (as opposed to each RS having its own proprietary calculation)

2) Guaranteed existence of stable page numbers (as opposed to being dependent on the publisher adding a page-list)

 

Remember, the idea is only that the algorithm is a worst-case fallback. The ideal solution is that every RS implements the algorithm, but no RS ever has to use it, because publishers choose to add page-lists to every book instead. Page-lists also provide consistent cross-device (and likely cross-format, digital-to-print) page numbers ... but they're not mandatory. The algorithmic page numbers only come into play if the publisher doesn't add a page-list, so if they don't like the calculated page numbers, it's easy to fix :)
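
(A minimal sketch of that fallback order, in Python with illustrative placeholder types; `Publication` and `calculated_virtual_pages` are assumptions for the example, not a real reading-system API:)

from dataclasses import dataclass
from typing import Optional

@dataclass
class Publication:
    page_list: Optional[list]     # authored page labels, if any
    resource_lengths: list        # uncompressed byte length per resource

def calculated_virtual_pages(pub: Publication, segment: int = 1024) -> list:
    # Worst-case fallback: synthesize sequential labels from byte segments.
    count = sum(max(1, -(-n // segment)) for n in pub.resource_lengths)
    return [str(i + 1) for i in range(count)]

def page_labels(pub: Publication) -> list:
    # Prefer the publisher's authored page-list; only fall back to the algorithm.
    return pub.page_list if pub.page_list else calculated_virtual_pages(pub)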

 





On Sep 17, 2021, at 1:28 PM, Laurent Le Meur <laurent.lemeur@edrlab.org> wrote:

 

Following our discussion today, here is the feedback from Mickaël Menu, maintainer of Readium Mobile, about the idea of calculating positions ("calculated page numbers") from the raw text rather than from the full HTML content:

 

---

- It would be much more CPU- and IO-intensive to compute the positions from this, because we'd have to read each (potentially streamed / encrypted) resource to get its full size, compared to just looking at the ZIP metadata in the EPUB case (a sketch follows this list).

(nb Laurent: the size of the uncompressed resource is in bytes, not Unicode code points.)

- It might be quite complicated to match a position in the web view.

- It will likely give a poor experience for image-based resources.

(nb Laurent: I don't see what could give a good experience.)
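
(For illustration, a rough Python sketch of the raw-text variant under discussion: counting text characters forces a full decompression and parse of every resource, where the current scheme only consults the ZIP directory.)

import zipfile
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Counts the characters of raw text in an (X)HTML resource."""
    def __init__(self):
        super().__init__()
        self.chars = 0
    def handle_data(self, data):
        self.chars += len(data)

def raw_text_length(epub_path: str, resource: str) -> int:
    with zipfile.ZipFile(epub_path) as z:
        markup = z.read(resource).decode("utf-8")  # full read and decompression
    counter = TextCounter()
    counter.feed(markup)
    return counter.chars  # Unicode code points, not bytes (cf. the nb above)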

This algorithm feels nice and less dependent on technical details like HTML structure, but what does it bring to users compared to the current implementation? We will still have a varying number of positions per screen, and they won't match physical books any better.

IMHO the TF could focus on fixing the downsides of the current implementations, which often depend on the ZIP instead of the resources themselves. This means that for the same publication we won't get the same number of positions if the EPUB is exploded, if the compression status of resources changes, or if the EPUB is streamed from the web.

One redeeming quality of the strategy suggested by the TF is that if the HTML structure changes but not the actual publication text, then we get the same positions. But I'm not sure it matters in practice.

(You can probably tell that I'm biased toward the old Readium positions strategy, which used the "original length" of resources instead of the archive entry length.)

---

 

Best regards

Laurent


Received on Saturday, 18 September 2021 14:16:18 UTC