RE: [dpub identifiers] OA's quote selector from Bill Kasdorf on 2015-04-09 (public-digipub-ig@w3.org from April 2015)

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Thu, 9 Apr 2015 17:46:46 +0000
To: Ivan Herman <ivan@w3.org>
CC: "Stein, Ayla" <astein@illinois.edu>, Thierry Michel <tmichel@w3.org>, "W3C Digital Publishing IG" <public-digipub-ig@w3.org>
Message-ID: <CO2PR06MB572C9677FF87787D0888CDCDFFB0@CO2PR06MB572.namprd06.prod.outlook.com>
While others are more qualified to judge the technical aspects, I completely agree with this in principle. It has always been a priority for me to make sure we align with the Open Annotations work. Fundamentally, it is hard to imagine many reasons for pointing to a fragment (which btw is inherently a range: we said "fragment," not "point") that is not some type of annotation in a sense. And in any case, having conflicting ways to identify fragments is a non-starter. Judging by what you've put in the wiki, and of course the discussions in the OA WG, I would have been advocating this approach as well.

And btw I also completely agree with you wrt the need for this sooner rather than later. While I did need to make the case that having a page break marker is a necessity (because there are so many cases where the print is currently the version of record), that is certainly NOT the best way to identify a fragment. And yes, for journals, what you suggest has in fact already happened: the online version is increasingly becoming the version of record. There is even a formal mechanism for identifying the Version of Record from NISO (http://www.niso.org/publications/rp/RP-8-2008.pdf), which is part of a more comprehensive vocabulary of online versions, and CrossRef provides a service that can redirect you from earlier versions to later versions (http://www.crossref.org/crossmark/index.html). Books will take quite a while to get there, but journals are already there.

-----Original Message-----
From: Ivan Herman [mailto:ivan@w3.org] 
Sent: Thursday, April 09, 2015 6:10 AM
To: Bill Kasdorf
Cc: Stein, Ayla; Thierry Michel; W3C Digital Publishing IG
Subject: [dpub identifiers] OA's quote selector

(I have changed the subject line!)

Bill,

To continue this discussion: I think we all agree (me included) that we need a way to refer to pages in cases when the print is the version of record; the current approach taken in EPUB is a way that works (although there might be improvement in not using <a> but <span>, but that is a detail). Let us take that as a given.

However, in a true digerati way:-), I think we should also look ahead. Bill, you used the example of scholarly publishing, where the current practice (mainly in humanities I believe) is to put page numbers into the references. There will come a time when the printed version is *not* the reference and, I believe, that will come to the world of scholarly journal publishing sooner than for books (scholarly or otherwise). We will get to the point when papers are published on the Web *first* (and maybe never in a printed form), and those can be downloaded in some format (today PDF, tomorrow, hopefully, EPUB-WEB). ([1] or [2] are good examples.) In which case the scholarly references will have to be reconsidered because page numbers *really* will not make any sense...

I looked at the OA selector model again, and I do believe that this is maybe one of the most important inputs that we should consider:

https://www.w3.org/dpub/IG/wiki/Task_Forces/identifiers#Open_Annotations_Fragment_Selector
http://www.w3.org/TR/2014/WD-annotation-model-20141211/#fragment-selector

The reason I like it is that it gives a framework for different possible identifiers (selectors) that helps our thinking (yeah, it is cast in an RDF structure that may be a challenge for some at first glance). I have also added a very important selector type to the wiki, namely the "text quote" selector, where a piece of text is identified with:

- an exact text quote
- prefix, ie, a piece of text preceding that quote
- suffix, ie, a piece of text following that quote

Although not mathematically precise, obviously, in practice it is a pretty good way of quoting a text on the page in a robust way, ie, regardless of whether the page is reformatted. (B.t.w., afaik, the hypothes.is annotation system and others use that approach already).
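
To make the exact/prefix/suffix idea concrete, here is a minimal Python sketch of how such a selector might be resolved against the plain text of a page. The class name `TextQuoteSelector` and the function `resolve` are illustrative names only, not taken from the OA model or from any annotation tool, and the sketch assumes the page has already been reduced to a single text string.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TextQuoteSelector:
    """Hypothetical container for the three pieces of a text quote selector."""
    exact: str        # the quoted text itself
    prefix: str = ""  # text immediately preceding the quote
    suffix: str = ""  # text immediately following the quote


def resolve(selector: TextQuoteSelector, text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the first occurrence of `exact`
    whose surrounding text matches `prefix` and `suffix`, or None."""
    start = text.find(selector.exact)
    while start != -1:
        end = start + len(selector.exact)
        # The context check is what disambiguates repeated quotes.
        if text[:start].endswith(selector.prefix) and text[end:].startswith(selector.suffix):
            return start, end
        start = text.find(selector.exact, start + 1)
    return None


page = "The cat sat on the mat. The cat ran off the mat."
selector = TextQuoteSelector(exact="the mat", prefix="off ", suffix=".")
match = resolve(selector, page)
```

Because the match depends only on the text and its immediate context, the same selector keeps working when the page is reflowed or repaginated; it does fail, of course, if the quoted passage or its context is edited, which is the sense in which the method is robust but not mathematically precise.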

I think that such identification methods should somehow be included in any approach we take for EPUB-WEB.

WDYT?

Ivan

[1] https://peerj.com/articles/440/
[2] http://f1000research.com/articles/3-176/v1


> On 08 Apr 2015, at 16:55 , Bill Kasdorf <bkasdorf@apexcovantage.com> wrote:
> 
> This issue is mainly pertinent to publications originally published in print and only later provided in digital form. There are of course millions of such publications in libraries, which is the main domain of the HathiTrust.
> 
> The reason this is important is that there are four primary use cases characteristic of this "print is the version of record" situation:
> 
> --The indexes in print books typically (though not universally) point to arbitrary points in the content: the print page breaks.
> --Cross-references in the text of print books typically refer to print page breaks.
> --Citations in the literature (very important in scholarship) point to print page breaks.
> --The accessibility community strongly advocates the recording of print page breaks in digital versions of print publications, particularly textbooks, so that when the teacher says "turn to page 53" the print-disabled user can find that spot (as can any user of the digital version).
> 
> While most W3C folks would argue that this is a relic of print-based publishing (and it is), and would argue that these should be replaced with real links to meaningful points in the content, not to something as arbitrary as a print page break (which is indisputably better), it unfortunately happens to be a real need while we are in this transitional phase; and all of those millions of old books, and the citations to their pages, do actually exist. So it really does turn out to be useful to have "markers" in a digital file designating where the print page breaks are--accompanied, btw, by an ability to designate _which_ print edition the markers refer to.
> 
> As distasteful as that is to digerati like us. ;-)
> 
> And btw, in the context of EPUB-WEB, for these very reasons (especially the accessibility issue), providing such print page break markers is recommended in the EDUPUB spec, which provides a recommended syntax for the marker. It doesn't attempt to contain the page with a start-and-end-tag pair, because you run into well-formedness issues; instead, it just provides an empty element that says, in effect, "page 53 in the print book starts here."
> 
> --Bill K
> 
> -----Original Message-----
> From: Ivan Herman [mailto:ivan@w3.org]
> Sent: Wednesday, April 08, 2015 4:30 AM
> To: Stein, Ayla
> Cc: Thierry Michel; Bill Kasdorf; W3C Digital Publishing IG
> Subject: Re: [dpub identifiers] Please review updated Identifiers TF wiki
> 
> Thank you Ayla.
> 
> Without going into the details of the proposal, the question it raises 
> to me, as part of the EPUB-WEB discussion, is what is the role (if 
> any) of an identifier that identifies a *page*. Indeed, depending on 
> the style of the online document, a page is
> 
> * a very ephemeral entity and thereby it is not really a suitable 
> target for an identifier (a flowing book, whose pagination is based on 
> user interaction, is the obvious example)
> * a fixed entity, ie, for a fixed layout document
> 
> It strikes me that an identifier approach for an EPUB-WEB document needs to cover the second item, too. AFAIK, CFI can do that only if the fixed layout document is organized as a series of separate files within the package, but that may not cover all the cases (e.g., if a presentation slide show is stored as a portable document, and the 'pagination' is the result of JavaScript running on one single source).
> 
> Whether the approach taken by the HathiTrust document is the right one (as far as I could understand from a cursory look, it assigns a UUID-type URN to each page, which is then combined with the identifier of a 'volume') is a different question. I am not sure this is a general solution, but I guess the more general questions are certainly valid!
> 
> Thanks again
> 
> Ivan
> 
> 
>> On 07 Apr 2015, at 20:21 , Stein, Ayla <astein@illinois.edu> wrote:
>> 
>> Matt's comment about content version reminded me of some ongoing work at the HathiTrust Research Center. One of the problems they're looking into is identifying an object at a specific point in time. Their initial proposal document discusses several different issues regarding identifiers in HTRC and can be accessed here: https://www.ideals.illinois.edu/handle/2142/73147. I've also added it as an attachment to this email.
>> 
>> I know there's also been some work on a prototype for identifying versions, but the draft of that document is not yet available for circulation. While these aren't necessarily solutions that can be implemented here, I think it's of interest and relevance to this discussion.
>> 
>> Thanks,
>> 
>> Ayla
>> 
>> -----Original Message-----
>> From: Ivan Herman [mailto:ivan@w3.org]
>> Sent: Tuesday, March 24, 2015 3:32 AM
>> To: Thierry Michel
>> Cc: Bill Kasdorf; W3C Digital Publishing IG
>> Subject: Re: [dpub identifiers] Please review updated Identifiers TF wiki
>> 
>> 
>>> On 24 Mar 2015, at 09:30 , Ivan Herman <ivan@w3.org> wrote:
>>> 
>>> I have added the media fragment URI to the wiki with a few examples. Thierry, if you want to add something, please do at:
>> 
>> Sorry, pushed the send button too soon:
>> 
>> https://www.w3.org/dpub/IG/wiki/Task_Forces/identifiers#W3C.E2.80.99s_Media_Fragment
>> 
>> Thanks
>> 
>> ivan
>> 
>>> 
>>> 
>>>> On 23 Mar 2015, at 08:20 , Thierry MICHEL <tmichel@w3.org> wrote:
>>>> 
>>>> Bill,
>>>> 
>>>> I would also suggest Media Fragments URI 1.0. It specifies the 
>>>> syntax for constructing media fragment URIs and explains how to handle them when used over the HTTP protocol.
>>>> 
>>>> http://www.w3.org/TR/2012/REC-media-frags-20120925/
>>>> a W3C Recommendation 25 September 2012.
>>>> 
>>>> Best,
>>>> 
>>>> thierry.
>>>> 
>>>> On 22/03/2015 17:51, Bill Kasdorf wrote:
>>>>> Thanks to Tzviya, we have some substantive content for review on 
>>>>> the Identifiers TF wiki at [1].
>>>>> 
>>>>> This initial draft of background information gives brief 
>>>>> descriptions, links, discussion, and examples of three possible 
>>>>> options for consideration as the basis for our initial work on a Fragment Identifier:
>>>>> 
>>>>> --EPUB CFI
>>>>> 
>>>>> --W3C Packaging for the Web Fragment Identifiers
>>>>> 
>>>>> --The Open Annotations Fragment Selector
>>>>> 
>>>>> In addition, there's a placeholder for XPath, and we need to 
>>>>> collect suggestions for other relevant specs or technologies to 
>>>>> take into account, e.g. XPointer.
>>>>> 
>>>>> Please take a look at this before the Monday IG call and suggest 
>>>>> any others we should add. Feel free to add a placeholder (ideally 
>>>>> with a link) if you aren't prepared to add the prose.
>>>>> 
>>>>> And although we now have a good list of participants in this TF, 
>>>>> please add your name if you'd like to participate as well. We will 
>>>>> discuss next steps on the call Monday, which will probably involve 
>>>>> a TF conference call later this week if we can find a time that works for everybody.
>>>>> 
>>>>> --Bill K
>>>>> 
>>>>> [1]
>>>>> https://www.w3.org/dpub/IG/wiki/Task_Forces/identifiers#Background
>>>>> 
>>>>> Bill Kasdorf
>>>>> 
>>>>> Vice President, Apex Content Solutions
>>>>> 
>>>>> Apex CoVantage
>>>>> 
>>>>> W: +1 734-904-6252
>>>>> 
>>>>> M: +1 734-904-6252
>>>>> 
>>>>> @BillKasdorf <http://twitter.com/#!/BillKasdorf>
>>>>> 
>>>>> bkasdorf@apexcovantage.com
>>>>> 
>>>>> ISNI: 0000 0001 1649 0786
>>>>> 
>>>>> https://orcid.org/0000-0001-7002-4786
>>>>> 
>>>>> www.apexcovantage.com <http://www.apexcovantage.com/>
>>>>> 
>>>> 
>>> 
>>> 
>>> ----
>>> Ivan Herman, W3C
>>> Digital Publishing Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> ORCID ID: http://orcid.org/0000-0003-0782-2704
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> <IdentifiersProposal.pdf>
> 
> 


Received on Thursday, 9 April 2015 17:47:17 UTC