Re: Floating Quotable Citations (FQC) from Dan Whaley on 2013-02-22 (public-openannotation@w3.org from February 2013)

From: Dan Whaley <dwhaley@hypothes.is>
Date: Fri, 22 Feb 2013 10:50:41 -0800
To: Robert Sanderson <azaroth42@gmail.com>
Cc: David Cuenca <dacuetu@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
Message-Id: <FF6C8224-BC32-4026-AEA0-86291E752DE1@hypothes.is>

David,

We're working on this same problem, and are experimenting with an approach we call "fuzzy anchoring".

You'll find a short narrative of the strategy here, with links to a repo and a demo:
https://github.com/hypothesis/h/wiki/fuzzy-anchoring

We are now in the process of integrating this into the hypothes.is prototype, including integration with okfn/annotator on the backend.

Our dev list at hypothes.is contains a fair amount of discussion of this topic over the last two months (look for comments by Kristof):
https://list.hypothes.is/archive/dev/

The approach borrows from the work of Sebastian Hellmann
http://svn.aksw.org/papers/2012/WWW_NIF/public.pdf

Instead of exact matching on the prefix/postfix contexts, however, we use a fuzzy match. This reduces the brittleness that hard context anchors show when the document changes within them.
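
To make the idea concrete, here is a minimal sketch in Python. It is not the hypothes.is code; the function names, the 0.6 threshold, and the use of difflib are assumptions chosen for illustration. It finds every exact occurrence of the quoted text and picks the one whose surrounding context is most similar to the stored prefix and postfix, so small edits to the context do not break the anchor. It does not handle edits inside the quote itself.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def anchor(text, prefix, quote, suffix, threshold=0.6):
    """Return (start, end) of the best-matching occurrence of `quote`, or None.

    `threshold` is an illustrative cut-off (an assumption), not a tuned value.
    """
    best_score, best_span = 0.0, None
    start = text.find(quote)
    while start != -1:
        end = start + len(quote)
        # Compare the stored contexts with equally long windows around this hit.
        before = text[max(0, start - len(prefix)):start]
        after = text[end:end + len(suffix)]
        score = (similarity(prefix, before) + similarity(suffix, after)) / 2
        if score > best_score:
            best_score, best_span = score, (start, end)
        start = text.find(quote, start + 1)
    return best_span if best_score >= threshold else None


# The annotation was made on "The cat sat on the mat. The dog sat on the rug."
# with prefix "The dog ", quote "sat", postfix " on the rug".
revised = "A cat sat on a mat. The big dog sat on the red rug."
print(anchor(revised, "The dog ", "sat", " on the rug"))
```

An exact match on "The dog " or " on the rug" would fail against the revised sentence, since both contexts were edited. The fuzzy score still prefers the second "sat" over the first.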

One of the design objectives here was to support cross-format annotation (annotations to the PDF can be surfaced on the HTML version, etc).

Dan



On Feb 22, 2013, at 8:49 AM, Robert Sanderson <azaroth42@gmail.com> wrote:

> It seems like the thing that's the target of the Annotation/citation
> is really very different in the various cases.
> Character counting in a print book would be a nightmare, of course,
> and that's why page references are so obviously important.  On the
> other hand, page references don't exist in some digital copies.
> 
> My suggestion, in Annotation speak, would be to have multiple targets
> with different selectors for the different expressions of the work.
> Then you could use systems appropriate for print with the print
> copies, and systems appropriate for digital with the digital copies,
> but the annotation/citation maintains the same identifier.
> 
> Otherwise just try to record as much information as possible, and let
> future systems sort it out as best they can :)
> 
> Rob
> 
> 
> On Fri, Feb 22, 2013 at 6:52 AM, David Cuenca <dacuetu@gmail.com> wrote:
>> On Fri, Feb 22, 2013 at 12:55 AM, Tom Morris <tfmorris@gmail.com> wrote:
>>> 
>>> PG does all kinds of weird stuff.  They insisted on 7-bit ASCII for ages
>>> after everyone else moved to ISO Latin-1.  They strip all edition
>>> information claiming that they are creating new editions (which means none
>>> of the citations would be any good anyway since you can't match them up with
>>> the correct edition).
>>> 
>>> If you look at the millions of PD books in the Internet Archive,
>>> HathiTrust, Google Books, etc, you'll see that they certainly do include
>>> page information. It's only the few thousand in the quirky Project Gutenberg
>>> which don't (and even PG has that information at the beginning of the
>>> process until they intentionally throw it away).
>> 
>> 
>> It is not only a PG issue; there are many other digital libraries that don't
>> signal page breaks or don't use any standard method to indicate them. Even in
>> Wikisource there are many transcribed texts that do mention the edition but
>> have no information about the pagination. One possible solution could be to
>> have several scoping options (default: whole document, page number, CSS
>> fragment, paragraph+delimiter, etc) and then use a finer text selection on
>> that area (character count or quote selector).
>> 
>> Btw, if anyone has a contact in PG, I'd love to talk with them.
>> 
>> David
>

Received on Friday, 22 February 2013 18:52:12 UTC