Re: use case: page based scholarly reference? from AUDRAIN LUC on 2015-06-11 (public-digipub-ig@w3.org from June 2015)

From: AUDRAIN LUC <LAUDRAIN@hachette-livre.fr>
Date: Thu, 11 Jun 2015 17:36:01 +0200
To: Bill Kasdorf <bkasdorf@apexcovantage.com>
CC: Ivan Herman <ivan@w3.org>, Robin Berjon <robin@w3.org>, Tzviya Siegman <tsiegman@wiley.com>, Dave Cramer <dauwhe@gmail.com>, W3C Digital Publishing IG <public-digipub-ig@w3.org>
Message-ID: <449BDE26-CEE8-4683-8052-F5E9FDD0000D@hachette-livre.fr>
+1000!

That's exactly what we have been doing at Hachette Livre for all trade books since 2001.
In the RFP I wrote in 2001 for our XML workflow, every print PDF had to be archived along with the XML file of the content, and this XML had to signal each page break location with an rp tag like <rp folio="100"/>.
Then in 2009 we added EPUB generation to this workflow, and these rp tags had to be used to build the NCX page list (now the nav document in EPUB 3) using spans in the HTML documents.
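
A minimal sketch of that rp-to-span step, in Python, may make the idea concrete. It is not the actual Hachette Livre workflow: the sample markup, the function name rp_to_spans, the "page" id prefix and the chapter.xhtml file name are all assumptions made for illustration. Only the <rp folio="..."/> marker comes from the description above. The epub:type attribute is left out to keep the sketch free of namespace handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical source fragment; only the <rp folio="..."/> marker
# is taken from the message, the rest is invented for the example.
SOURCE = (
    '<chapter><p>End of page ninety-nine.<rp folio="100"/>'
    'Start of page one hundred.</p>'
    '<p>More text.<rp folio="101"/>And more.</p></chapter>'
)


def rp_to_spans(root, href="chapter.xhtml"):
    """Rewrite each <rp> marker as a <span> anchor, in document order,
    and return (folio, target) pairs usable as page-list entries."""
    page_list = []
    # Materialise the list first so that renaming tags cannot
    # interfere with the traversal.
    for rp in list(root.iter("rp")):
        folio = rp.attrib.pop("folio")
        rp.tag = "span"
        rp.set("id", f"page{folio}")   # assumed id convention
        rp.set("title", folio)
        page_list.append((folio, f"{href}#page{folio}"))
    return page_list


root = ET.fromstring(SOURCE)
pages = rp_to_spans(root)
for folio, target in pages:
    print(folio, target)
```

The text following each marker is preserved because ElementTree keeps it as the element's tail, so renaming the element does not move or fragment the surrounding content.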

It's mandatory for digital accessibility, but also for many other use cases, like for pupils in a classroom where the teacher may reference paper pages... 

Best,
Luc


> On 11 June 2015, at 17:18, Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:
> 
> First of all, let me be clear: this is NOT about fluid pages, this is simply making a record of where the pages break in the final, published, to-be-cited print pages. It is not at all about being responsive to pagination in a digital environment.
> 
> Think of it this way: people need to know where the divisions between the print pages are. So we put a marker in there. It has nothing to do with where they break online or in their e-reader. Until there is a better option, the scholarly ecosystem (and to a large extent the educational ecosystem) is dependent on page references, as imperfect as they are.
> 
> So these are just markers. I wouldn't consider it a hack; it is essential information. Talk to George Kerscher or Charles Lapierre or anybody from DAISY about how essential these are for print-disabled users. And to repeat: there are at least that gazillion-and-a-half uses of page references created every DAY for indexes, cross references, and citations in the scholarly literature. We can't stop scholarship and we can't stop the still pervasive use of print until we have a better way, so we have to do this.
> 
> A concrete example.
> 
> Here's what the EDUPUB spec (the official profile of EPUB 3 for educational content) strongly recommends [1]:
> 
> <span epub:type="pagebreak" id="pg302" title="302"/>
> 
> That is similar to (and now identical to, with one addition) what I have been putting in all the models I've been creating for publishers for the past few years. (Note that in non-EPUB contexts you do have to use something other than epub:type so admittedly we do have to hold our nose and add @class="pagebreak", again because we don't now see a better option. The ongoing discussion wrt @role is relevant here. Can't wait for that to get resolved.)
> 
> Here's the logic:
> 
> Using <span /> lets it exist pretty much anywhere, because of course the print pages break at arbitrary points that have no relation to the structure of the document. In our _specifications_ (an important rule of mine: you still always need a specification of how to _apply_ the markup for your particular publication or purposes) we recommend that, when a word is hyphenated across a page break, the <span> go immediately before the beginning of that word, so that we don't fragment words, which has other dire consequences.
> 
> The @id records the _actual sequential position_ within the publication. Note that the example from the EDUPUB spec is somewhat misleading in this regard. If the publication has 20 pages of frontmatter numbered in lowercase roman numerals (most common) then this would actually be @id="322".
> 
> The @title records what is in publishing called the "folio": the actual thing that is printed on the page. So page iv of the frontmatter would have @title="iv", and a blank page before a part opener on page 521 would have @title="".
> 
> This enables you to create scripts and other processing that count the actual pages (including frontmatter and blanks), using the @id, but also to know what the teacher (or an index, or a cross reference, or a citation in a scholarly paper or review) means when she says "turn to page 302": she means what is designated by @title="302", not what is designated as @id="322".
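
As an illustration of the kind of script this convention allows, here is a small Python sketch. It is only a sketch under stated assumptions: the sample markup, the function name index_pagebreaks and the "pg" id prefix (borrowed from the EDUPUB example, since an XML id cannot start with a digit) are hypothetical. It reads the sequential position from @id and the printed folio from @title, as described above.

```python
import xml.etree.ElementTree as ET

ID_PREFIX = "pg"  # assumed prefix, as in id="pg302" from the EDUPUB example

# Hypothetical markup: 20 pages of roman-numbered frontmatter, so printed
# page 302 sits at sequential position 322; a blank page has an empty folio.
XHTML = (
    '<body>'
    '<span class="pagebreak" id="pg4" title="iv"/>'
    '<span class="pagebreak" id="pg21" title="1"/>'
    '<span class="pagebreak" id="pg322" title="302"/>'
    '<span class="pagebreak" id="pg540" title=""/>'
    '</body>'
)


def index_pagebreaks(xhtml):
    """Return (folio -> sequential position, list of all positions)."""
    by_folio = {}
    positions = []
    for span in ET.fromstring(xhtml).iter("span"):
        if span.get("class") != "pagebreak":
            continue
        page_id = span.get("id", "")
        if not page_id.startswith(ID_PREFIX):
            continue
        position = int(page_id[len(ID_PREFIX):])
        positions.append(position)
        folio = span.get("title", "")
        if folio:  # blank pages carry no printed folio
            by_folio[folio] = position
    return by_folio, positions


by_folio, positions = index_pagebreaks(XHTML)
print(by_folio["302"])  # sequential position of printed page 302
print(len(positions))   # page-break markers seen, blanks included
```

Looking up by folio answers "turn to page 302", while the position list supports counting that includes frontmatter and blank pages.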
> 
> I hope that's helpful.
> 
> This is what we need to do in the real world. ;-) Call it ugly, call it a hack if you want, but I got clients, they gotta do stuff, we can't wait around.
> 
> --Bill
> 
> [1] https://docs.google.com/document/d/1_Tzeq5xwdwLhSdaHStvAthhOFu9UuT8yFmK787yw420/edit#

> 
> 
> -----Original Message-----
> From: Ivan Herman [mailto:ivan@w3.org] 
> Sent: Thursday, June 11, 2015 5:31 AM
> To: Robin Berjon
> Cc: Tzviya Siegman; Dave Cramer; Bill Kasdorf; W3C Digital Publishing IG
> Subject: Re: use case: page based scholarly reference?
> 
> Right. I believe that, in the long term, the page has to be exchanged for some fragment ID that reflects a real fragment ID in the HTML file but can also be put into print easily.
> 
> Liam and Bill referred to the fake page break signs, which is a hack, and ugly in the sense that where a page break occurs is unpredictable (it will depend on the font and font size, the window size, etc, which are all elements that can be set by the end user in a proper reading environment). So indeed, a convention must be found.
> 
> Back to our original question, though: does this represent a kind of use case that we would have to take into account for fragment IDs? I believe the answer is yes, but I cannot properly formulate it. I think the idea is that it should be easy to find a very flexible way of addressing a logical structure in a human-readable way. The problem is that, unless the original author/publisher does it by defining an @id for a specific section, the fragment IDs that one has to use to say 'second paragraph of chapter entitled XYZ' are hardly readable to the non-initiated…
> 
> Bill do you have a good formulation to add to our collection?
> 
> Ivan
> 
> P.S. An aside: the social problem with this is also that this community is very conservative. Users, i.e., researchers, are extremely wary of leaving the beaten path because they are afraid that this would damage their publication list, i.e., their CV, i.e., their career path. People will continue to publish in the old-style, traditional journals, accepting the PDF, print, and page-oriented publication routes, and will not readily move to more modern-style publications; they need the renown provided by Nature or various Elsevier publications for their own careers. In a time when 'publish or perish' dominates, when younger researchers have immense difficulties getting a stable position somewhere, they cannot be blamed. And the big, traditional publishers are therefore not really under pressure to change. It is complicated...
> 
> 
> 
>> On 10 Jun 2015, at 16:15 , Robin Berjon <robin@w3.org> wrote:
>> 
>> On 10/06/2015 14:02 , Ivan Herman wrote:
>>> I am not sure what this translates into in a requirement for the 
>>> identification part, namely that 'reasonable' units within the 
>>> publication should have an easily identifiable URL, or URL structure 
>>> (note that the examples above actually define ranges and not only one 
>>> page). This may be a page but that is a fluid notion in this case, 
>>> that may not be appropriate for scholarly purposes. But I am a bit 
>>> uncertain how to formulate it before putting it into the use case 
>>> directory…
>> 
>> One question I have reading this is about usability. Imagining some form of resilient linking is used (the example below is from Emphasis 2 [0] but others tend to be the same), if I wanted to anchor a link to a paragraph I'd end up with something that looked like:
>> 
>> D. Ahut, et al., “Sustainable Critical Avalanches of Alpine Fauna,” 
>> Cryptozoology, vol. 42, #p[MMTMMT],h[BcdTcg,1], Mar. 1977
>> 
>> The "#p[MMTMMT],h[BcdTcg,1]" bit can replace pages (you can also do ranges with it) and so long as you're in a digital context in which it is presumably clickable it's fine; but when it shows up in print as it invariably will, well, I'd hate to have to type that back in.
>> 
>> Is this something that should be a consideration, or should references from print to digital be largely considered hopeless anyway (since in practice you just search for the paper's title)?
>> 
>> [0] http://open.blogs.nytimes.com/2011/01/11/emphasis-update-and-source/
>> 
>> --
>> Robin Berjon - http://berjon.com/ - @robinberjon
> 
> 
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/

> mobile: +31-641044153
> ORCID ID: http://orcid.org/0000-0003-0782-2704

> 
> 
> 
>
Received on Thursday, 11 June 2015 15:37:06 UTC