Re: Floating Quotable Citations (FQC) from David Cuenca on 2013-02-24 (public-openannotation@w3.org from February 2013)

From: David Cuenca <dacuetu@gmail.com>
Date: Sun, 24 Feb 2013 13:13:22 -0500
To: Paolo Ciccarese <paolo.ciccarese@gmail.com>
Cc: Dan Whaley <dwhaley@hypothes.is>, Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
Message-ID: <CAJBSGSpkPr6q++kjZy15dzLYkyrs_1Uh2EeG4CDn4UrA5GauJg@mail.gmail.com>

Paolo,
Amazing how parallel developments can be :)
I made a post on the hypothes.is list outlining the differences when dealing
with printed text:
https://list.hypothes.is/archive/dev/2013-02/0000109.html

Of course, that assumes that the actor extracting text from the printed
source is human. Over the next few years we might see interesting advances
in Augmented Reality (Google Glass comes to mind), which would mean that we
could have an OCR system extracting the text layer from a book and
anchoring annotations to it using a real-time AR overlay.

David

On Sun, Feb 24, 2013 at 11:20 AM, Paolo Ciccarese <paolo.ciccarese@gmail.com> wrote:

> David,
> in Domeo I do something very similar to what Dan's wiki page outlines.
>
> Domeo deals only with annotation of HTML but I need to be able to have the
> same annotation displayed on the PDF.
> We have been using the system for more than 2 years now, and I perform the
> following operations (ignoring the HTML markup).
>
> Once the user performs the selection, I calculate the prefix, match and
> postfix.
> I set a max number of chars for this step (normally 64 each for prefix and
> postfix).
> Given the potential complexity of the combination HTML+CSS I have some
> rules of thumb on how to select prefix/postfix.
>
> Then I calculate a score that basically adapts according to the length of
> the match.
> If the match is particularly short, I check the combined length of
> prefix+postfix. If those are too short combined (< 64*2),
> I normally recalculate one of the two (e.g. the postfix) to make it longer
> (= 64*2 - length of the prefix).
> That way I end up having enough text to hit/find the match.
>
> I have the option of searching for the text right away and checking
> whether what I find is the same as the current selection.
> If it is not, you can try making the prefix/match/postfix longer or
> change strategy (adding more info).
>
> For instance, you can also store the location, but that can change if the
> document changes structure, and the counting does not work very well with
> HTML.
> If you have a very redundant document, you can keep track of which
> occurrence of that prefix/match/postfix was selected. That helps you until
> the document changes.
> When the document changes you have no guarantee that the selection is
> correct (e.g. a previous occurrence of that pattern may have been erased).
>
> Dan, I am guessing I can share more details on your wiki and we can join
> forces on this topic?
>
> Best,
> Paolo
>
>
>
>
> On Sat, Feb 23, 2013 at 11:25 PM, David Cuenca <dacuetu@gmail.com> wrote:
>
>> On Fri, Feb 22, 2013 at 1:50 PM, Dan Whaley <dwhaley@hypothes.is> wrote:
>>
>>> But instead of exact matching on the prefix/postfix contexts, we use a
>>> fuzzy match to improve somewhat on the brittleness that hard context
>>> anchors have when changes to the document occur within them.
>>>
>>> One of the design objectives here was to support cross-format annotation
>>> (annotations to the PDF can be surfaced on the HTML version, etc).
>>>
>>
>> Dan, that is certainly impressive; it looks like quite a reliable method
>> for annotating mutable digital documents.
>> The advantage of printed material is that changes between the original
>> source and proofread text are close to nil.
>> On the other hand, less data is available than for purely digital
>> documents, so the input text should be kept to a minimum.
>>
>> I'll elaborate on your mailing list, it might be worthwhile.
>>
>> David
>>
>
>
>
> --
> Dr. Paolo Ciccarese
> http://www.paolociccarese.info/
> Biomedical Informatics Research & Development
> Instructor of Neurology at Harvard Medical School
> Assistant in Neuroscience at Mass General Hospital
> Member of the MGH Biomedical Informatics Core
> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
>
> CONFIDENTIALITY NOTICE: This message is intended only for the
> addressee(s), may contain information that is considered
> to be sensitive or confidential and may not be forwarded or disclosed to
> any other party without the permission of the sender.
> If you have received this message in error, please notify the sender
> immediately.
>
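
[Editorial sketch, not part of the original message.] The prefix/match/postfix
steps described in Paolo's quoted message can be illustrated roughly as
follows. This is a minimal sketch on plain text, not Domeo's actual code: the
function names, the dictionary layout and the `short_match` threshold are all
assumptions made for illustration (the message does not say how short
"particularly short" is), and the HTML+CSS rules of thumb are left out.

```python
MAX_CONTEXT = 64  # per-side character limit mentioned in the message


def build_selector(text, start, end, max_context=MAX_CONTEXT, short_match=16):
    """Build prefix/match/postfix for the selection text[start:end].

    short_match is a hypothetical threshold for a "particularly short" match.
    """
    prefix = text[max(0, start - max_context):start]
    match = text[start:end]
    postfix = text[end:end + max_context]
    # Short match with too little combined context (< 64*2): lengthen the
    # postfix so that prefix + postfix reaches 64*2 where the text allows it.
    if len(match) < short_match and len(prefix) + len(postfix) < 2 * max_context:
        postfix = text[end:end + 2 * max_context - len(prefix)]
    return {"prefix": prefix, "match": match, "postfix": postfix}


def find_selector(text, selector):
    """Return every start offset of the match where the full
    prefix+match+postfix pattern occurs exactly in text."""
    needle = selector["prefix"] + selector["match"] + selector["postfix"]
    hits = []
    pos = text.find(needle)
    while pos != -1:
        hits.append(pos + len(selector["prefix"]))
        pos = text.find(needle, pos + 1)
    return hits


# Check right away whether the stored pattern finds the current selection.
doc = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
start = doc.index("dolor")
selector = build_selector(doc, start, start + len("dolor"))
hits = find_selector(doc, selector)
unambiguous = hits == [start]
```

If `find_selector` returns more than one offset (a very redundant document),
the index of the intended hit in that list is the "occurrence" that could be
stored alongside the selector, with the caveat from the message that it stops
being reliable once the document changes.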



-- 
Etiamsi omnes, ego non

Received on Sunday, 24 February 2013 18:14:10 UTC