- From: Paolo Ciccarese <paolo.ciccarese@gmail.com>
- Date: Sun, 24 Feb 2013 11:20:11 -0500
- To: David Cuenca <dacuetu@gmail.com>
- Cc: Dan Whaley <dwhaley@hypothes.is>, Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
- Message-ID: <CAFPX2kBw+u67YyZoJ-tZSUt_8Tzcti1w1cgpFbv0oMK5JipiRw@mail.gmail.com>
David, in Domeo I do something very similar to what Dan's wiki page outlines. Domeo deals only with annotation of HTML, but I need to be able to display the same annotation on the PDF. We have been using the system for more than two years now, and I perform the following operations (ignoring the HTML markup).

Once the user performs the selection I calculate prefix, match and postfix. I set a maximum number of characters for this step (normally 64 for both prefix and postfix). Given the potential complexity of the HTML+CSS combination, I have some rules of thumb for how to select the prefix/postfix.

Then I calculate a score that basically adapts according to the length of the match. If the match is particularly short, I check the combined length of prefix+postfix. If the two combined are too short (< 64*2), I normally recalculate one of them (e.g. the postfix) so that it is longer (= 64*2 - length of the prefix). That way I end up having enough text to find the match.

There is also the option of searching for the text right away and checking whether what you find is the same as the current selection. If it is not, you can try making the prefix/match/postfix longer, or change strategy (adding more information). For instance, you can also store the location, but that can change if the document changes structure, and the counting does not work very well with HTML.

If you have a very redundant document, you can keep track of the occurrence number of that prefix/match/postfix. That helps you until the document changes. When the document changes you have no guarantee that the selection is correct (for example, a previous occurrence of that pattern may have been erased).

Dan, I am guessing I can share more details on your wiki and we can join forces on this topic?
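The steps above can be sketched roughly as follows. This is only an illustration in Python over plain text, not Domeo's actual code: the names `build_selector` and `resolves_to` are hypothetical, the `SHORT_MATCH` threshold is an assumption (the message does not say what counts as a "particularly short" match), and the HTML+CSS rules of thumb and the scoring are left out.

```python
from typing import NamedTuple

MAX_CONTEXT = 64   # per-side context limit mentioned in the message
SHORT_MATCH = 16   # assumed threshold for a "particularly short" match


class Selector(NamedTuple):
    prefix: str
    match: str
    postfix: str


def build_selector(text: str, start: int, end: int) -> Selector:
    """Compute prefix/match/postfix for the selection text[start:end]."""
    prefix = text[max(0, start - MAX_CONTEXT):start]
    match = text[start:end]
    postfix = text[end:end + MAX_CONTEXT]
    # Short match with too little combined context: lengthen the postfix
    # to 64*2 - len(prefix), as far as the document allows.
    if len(match) < SHORT_MATCH and len(prefix) + len(postfix) < 2 * MAX_CONTEXT:
        postfix = text[end:end + 2 * MAX_CONTEXT - len(prefix)]
    return Selector(prefix, match, postfix)


def resolves_to(text: str, sel: Selector, start: int) -> bool:
    """Search right away: does the first exact hit land on the selection?"""
    found = text.find(sel.prefix + sel.match + sel.postfix)
    return found != -1 and found + len(sel.prefix) == start


# A deliberately redundant document: the same four characters repeated.
doc = "abc " * 100
near_start = build_selector(doc, 4, 7)     # prefix is cut short by the document start
mid_document = build_selector(doc, 200, 203)
```

In this toy document the selection near the start gets a lengthened postfix and resolves to itself, while the one in the middle resolves to an earlier identical occurrence. That is the redundant-document case where keeping track of the occurrence number, or adding more information, becomes necessary.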
Best,
Paolo

On Sat, Feb 23, 2013 at 11:25 PM, David Cuenca <dacuetu@gmail.com> wrote:

> On Fri, Feb 22, 2013 at 1:50 PM, Dan Whaley <dwhaley@hypothes.is> wrote:
>
>> But instead of exact matching on the prefix/postfix contexts, we use a
>> fuzzy match to improve somewhat on the brittleness that hard context
>> anchors have when changes to the document occur within them.
>>
>> One of the design objectives here was to support cross-format annotation
>> (annotations to the PDF can be surfaced on the HTML version, etc).
>>
>
> Dan, that is certainly impressive, it looks like a quite reliable method
> for annotating mutable digital documents.
> The advantage of printed material is that changes between the original
> source and proofread text are close to nil.
> On the other hand, data availability is less than on purely digital
> documents, therefore input text should be kept to a minimum.
>
> I'll elaborate on your mailing list, it might be worthwhile.
>
> David
>

--
Dr. Paolo Ciccarese
http://www.paolociccarese.info/
Biomedical Informatics Research & Development
Instructor of Neurology at Harvard Medical School
Assistant in Neuroscience at Mass General Hospital
Member of the MGH Biomedical Informatics Core
+1-857-366-1524 (mobile)
+1-617-768-8744 (office)

CONFIDENTIALITY NOTICE: This message is intended only for the addressee(s), may contain information that is considered to be sensitive or confidential and may not be forwarded or disclosed to any other party without the permission of the sender. If you have received this message in error, please notify the sender immediately.
Received on Sunday, 24 February 2013 16:20:41 UTC