Robust (and fuzzy) anchors from Dan Whaley on 2013-02-25 (public-openannotation@w3.org from February 2013)

From: Dan Whaley <dwhaley@hypothes.is>
Date: Sun, 24 Feb 2013 23:06:50 -0800
To: Paolo Ciccarese <paolo.ciccarese@gmail.com>
Cc: David Cuenca <dacuetu@gmail.com>, Robert Sanderson <azaroth42@gmail.com>, Kristof Csillag <csillag@nolmecolindor.com>, public-openannotation@w3.org
Message-Id: <975967A3-2827-4FBF-A1AF-C9F6558CB027@hypothes.is>
Paolo,

Thanks for this note.  (TL;DR Yes, we'd be thrilled to collaborate on an approach.)  I've cc'd Kristof, who is our lead developer on this functionality, and who probably has much better insights and comments than I do.

A few questions below:

> 
> Once the user performs the selection I calculate  prefix, match  and postfix. 
> I set a max number of chars for this step (normally 64 for both prefix and postfix).
> Given the potential complexity of the combination HTML+CSS I have some rules of thumb on how to select prefix/postfix.

Why not use the same number of chars every time?  After all, how much context is needed may depend more on changes that happen to the page later, and those can't be anticipated when the anchor is created.
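
To make the fixed-length option concrete, here is a minimal sketch in Python.  The names (Anchor, make_anchor) and the idea of storing the start offset alongside the triplet are assumptions for illustration, not a description of either of our implementations; the default of 64 is the per-side limit from the quoted message.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    """Hypothetical record for a prefix/match/postfix anchor."""
    prefix: str
    exact: str
    suffix: str
    start: int  # char offset of the selection in the original text


def make_anchor(text: str, start: int, end: int, context: int = 64) -> Anchor:
    """Capture the selection plus the same amount of context on each side."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("selection out of range")
    # Slices are clamped at the document boundaries, so the context is
    # simply shorter when the selection sits near the start or the end.
    prefix = text[max(0, start - context):start]
    suffix = text[end:end + context]
    return Anchor(prefix, text[start:end], suffix, start)


text = "The quick brown fox jumps over the lazy dog"
anchor = make_anchor(text, 16, 19, context=4)
print(anchor)
```

With a fixed length there is nothing to tune per selection; the only special case is a selection close to either end of the document.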

> 
> Then I calculate a score that basically adapts according to the length of the match. 
> If the match is particularly short: I check the combined length of prefix+suffix. If those are too short combined (<64*2)
> I normally recalculate one of the two (ex: suffix) in order to be longer (= 64*2 - length of the prefix).
> That way I end up having enough text to hit/find the match.

Do you find that optimizing here improves the performance when run against real world or simulated changes?  I'm thinking particularly of Sebastian's tests against changes in the top 100 Wikipedia articles as a way to sample this.
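
For reference, this is how I read the padding rule quoted above, as a rough Python sketch.  The function name and the SHORT_MATCH threshold are assumptions (the quoted text does not say what counts as "particularly short"), and only the suffix is lengthened, following the example given.

```python
from typing import Tuple

MAX_CTX = 64      # per-side limit from the quoted message
SHORT_MATCH = 16  # hypothetical threshold for a "particularly short" match


def padded_context(text: str, start: int, end: int,
                   max_ctx: int = MAX_CTX,
                   short_match: int = SHORT_MATCH) -> Tuple[str, str]:
    """Return (prefix, suffix), lengthening the suffix for short matches."""
    prefix = text[max(0, start - max_ctx):start]
    suffix = text[end:end + max_ctx]
    is_short = (end - start) < short_match
    if is_short and len(prefix) + len(suffix) < 2 * max_ctx:
        # Make up for a short prefix: suffix length = 64*2 - len(prefix).
        # The mirror case (short suffix near the end of the document)
        # is left out of this sketch.
        suffix = text[end:end + 2 * max_ctx - len(prefix)]
    return prefix, suffix


doc = "ab" + "x" * 300
prefix, suffix = padded_context(doc, 2, 5)
print(len(prefix), len(suffix))
```

A selection in the middle of a long document already has 64 chars on both sides, so the rule only kicks in near the document boundaries.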

> 
> I have the option of trying to search for the text right away and detect if what you find is the same as the current selection.
> If it isn't, you can try to make the prefix/match/postfix longer or change strategy (adding more info).

This is an interesting approach.   Again, I suppose the question would be in general what are the advantages of searching directly on the match vs just searching for the prefix.  Let me rephrase-- I think I can see the advantages, but I'm wondering, when one is reattaching 100 anchors, how much of a performance difference one gets between these approaches.  We've been doing some timings on different stages of the process, which are illuminating.
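
As I understand the check described above, it could look something like the following Python sketch: search for the selection right away and grow the context only until the combined pattern is unique.  The function names, the step size and the returned "unique" flag are assumptions for illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def smallest_unique_context(text, start, end, step=8, max_ctx=64):
    """Grow prefix/suffix in steps until prefix+match+suffix occurs once.

    Returns (prefix, exact, suffix, unique).  unique is False when even
    max_ctx chars per side do not disambiguate the selection, in which
    case the caller has to change strategy (store more information).
    """
    exact = text[start:end]
    ctx = 0
    while True:
        prefix = text[max(0, start - ctx):start]
        suffix = text[end:end + ctx]
        unique = count_occurrences(text, prefix + exact + suffix) == 1
        if unique or ctx >= max_ctx:
            return prefix, exact, suffix, unique
        ctx = min(ctx + step, max_ctx)


sample = "aaa foo bbb foo ccc"
print(smallest_unique_context(sample, 12, 15, step=2))
```

Each growth step is another pass over the document, which is part of why I am curious about the timings when many anchors are involved.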

> 
> For instance you can also store the location, but that can change if the document changes structure and the counting does not work very well with HTML.

But for the fuzzy match, it helps to start somewhere close to where you expect to find it, so generally I think you'll gain performance by storing the original location as a char offset and starting there.
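
A rough Python sketch of what "starting there" could mean: candidate positions are tried in order of distance from the stored offset, so an unchanged or slightly shifted target is found after very few comparisons.  This uses difflib from the standard library as a stand-in scorer; the function name, the search radius and the 0.8 threshold are assumptions, not what any particular implementation does.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def fuzzy_find(text: str, pattern: str, expected: int,
               max_distance: int = 1000,
               threshold: float = 0.8) -> Optional[Tuple[int, float]]:
    """Best approximate match for pattern within max_distance of expected.

    Returns (position, score) or None if nothing reaches the threshold.
    """
    last = len(text) - len(pattern)
    if not pattern or last < 0:
        return None
    matcher = SequenceMatcher(autojunk=False)
    matcher.set_seq2(pattern)  # difflib caches its analysis of seq2
    best = None
    for dist in range(max_distance + 1):
        candidates = (expected,) if dist == 0 else (expected - dist, expected + dist)
        for pos in candidates:
            if not 0 <= pos <= last:
                continue
            matcher.set_seq1(text[pos:pos + len(pattern)])
            score = matcher.ratio()
            if score == 1.0:
                return pos, score  # exact hit near the stored offset: stop
            if best is None or score > best[1]:
                best = (pos, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


changed = "xxxx The quikc brown fox jumps"
print(fuzzy_find(changed, "quick brown fox", expected=0, max_distance=20))
```

The early exit on an exact hit is where the stored offset pays off; without it the whole radius gets scored for every anchor.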

> If you have a very redundant document, you can keep track of the occurrence of that prefix/match/postfix. That helps you until the document changes.

I've been curious about this-- how often, and in what sort of documents, are you finding redundant occurrences of this triplet?
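
For concreteness, tracking the occurrence could be as simple as storing an ordinal, as in this Python sketch (function names are hypothetical).  The usage lines also show the limitation mentioned in the quoted text: once an earlier occurrence disappears, the stored ordinal points at the wrong place.

```python
def occurrence_index(text: str, pattern: str, start: int) -> int:
    """0-based ordinal of the occurrence of pattern that begins at start."""
    index, pos = 0, text.find(pattern)
    while pos != -1 and pos < start:
        index += 1
        pos = text.find(pattern, pos + 1)
    if pos != start:
        raise ValueError("pattern does not occur at start")
    return index


def nth_occurrence(text: str, pattern: str, n: int) -> int:
    """Offset of the n-th (0-based) occurrence, or -1 if there are fewer."""
    pos = text.find(pattern)
    for _ in range(n):
        if pos == -1:
            break
        pos = text.find(pattern, pos + 1)
    return pos


original = "foo bar foo bar foo"
n = occurrence_index(original, "foo", 8)   # the selection is the second "foo"
edited = original[4:]                      # the first "foo " gets deleted
print(n, nth_occurrence(original, "foo", n), nth_occurrence(edited, "foo", n))
```

In the edited text the selected "foo" now starts at offset 4, but the stored ordinal resolves to the one after it.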


> When the document changes you have no guarantee that the selection is correct (a previous occurrence of that pattern is erased).

Yes, there are occasions, even with this strategy, where you must fail catastrophically.

> 
> Dan, I am guessing I can share more details on your wiki and we can join forces on this topic?

100%   We'd be delighted.

> 
> Best,
> Paolo
> 
> 
> 
> On Sat, Feb 23, 2013 at 11:25 PM, David Cuenca <dacuetu@gmail.com> wrote:
> On Fri, Feb 22, 2013 at 1:50 PM, Dan Whaley <dwhaley@hypothes.is> wrote:
> But instead of exact matching on the prefix/postfix contexts, we use a fuzzy match to improve somewhat on the brittleness that hard context anchors have when changes to the document occur within them.
> 
> One of the design objectives here was to support cross-format annotation (annotations to the PDF can be surfaced on the HTML version, etc).
> 
> Dan, that is certainly impressive, it looks like a quite reliable method for annotating mutable digital documents.
> The advantage of printed material is that changes between the original source and proofread text are close to nil.
> On the other hand, data availability is less than on purely digital documents, therefore input text should be kept to a minimum.
> 
> I'll elaborate on your mailing list, it might be worthwhile.
> 
> David
> 
> 
> 
> -- 
> Dr. Paolo Ciccarese
> http://www.paolociccarese.info/
> Biomedical Informatics Research & Development
> Instructor of Neurology at Harvard Medical School
> Assistant in Neuroscience at Mass General Hospital
> Member of the MGH Biomedical Informatics Core
> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
> 
Received on Monday, 25 February 2013 07:07:18 UTC