robustness of string matching XPointers

Matthew Wilson and I were thinking about the way we locate strings 
inside DOM elements.  Currently, both Annozilla and Amaya use the 
four-argument variant of the XPointer string-range function.  For a 
quick refresher on string-range, see 
http://www.w3.org/TR/2001/CR-xptr-20010911/#stringrange

Both Amaya and Annozilla opt to use a degenerate form of string-range, 
which might better be described as "string-count".  For the second 
argument, the string to match, they always provide the empty string, "". 
The empty string matches at every character position in the string-value 
of the DOM Element selected by the first argument.  Then, they provide a 
start offset and a length, thus uniquely identifying a substring within 
the DOM Element. 
However, as Matthew pointed out, this method is quite fragile.  Any 
change to the text of the DOM Element before the selected string makes 
the start value in an existing XPointer invalid.  For example, assume we 
have an XPointer that selects the first 'However' in this paragraph 
using the empty-string, four-argument string-range.  Now, if I add 
another sentence to the beginning of the paragraph, I've completely 
invalidated the XPointer.  Even worse, it won't simply be orphaned, but 
will select the wrong text entirely.
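
To make the failure concrete, here is a rough Python sketch that models 
the empty-string form as a plain offset/length lookup on the element's 
string-value.  The function name resolve_by_offset and the sample text 
are made up for illustration, and the 1-based start is an assumption 
based on the string-range definition.  It is not how Amaya or Annozilla 
actually resolve XPointers.

```python
def resolve_by_offset(text, start, length):
    """Model string-range(node, "", start, length) on a plain string.

    Assumption: positions are 1-based, and `text` stands in for the
    string-value of the element selected by the first argument.
    """
    begin = start - 1
    return text[begin:begin + length]


# Hypothetical paragraph text, before and after an edit.
old = "This method works. However, it is fragile."
new = "Here is a new first sentence. " + old

# The stored pointer records where 'However' sat in the OLD text.
start = old.index("However") + 1  # convert to a 1-based position
length = len("However")

selected_old = resolve_by_offset(old, start, length)
selected_new = resolve_by_offset(new, start, length)

print(repr(selected_old))  # the intended word
print(repr(selected_new))  # same offsets, but now a different substring
```

The pointer still resolves against the edited text, so nothing signals 
that it now points at the wrong characters.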

One way of solving this problem is to use the pattern-matching ability 
of string-range.  This would be the two-argument form of string-range 
(omitting start and length), looking like:

    string-range(path to paragraph, "However")

In most cases, this seems more robust against changes, because the 
match moves with the text when content is added or removed before it.
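
As a rough illustration, the two-argument form can be modelled in 
Python as a substring search over the element's string-value.  The 
function name resolve_by_pattern and the sample text are hypothetical, 
and skipping overlapping matches is a simplifying assumption rather 
than something taken from the spec.

```python
def resolve_by_pattern(text, pattern):
    """Model string-range(node, pattern) on a plain string.

    Returns (start, length) pairs with 1-based starts.  Overlapping
    matches are skipped, which is an assumption made for simplicity.
    """
    ranges = []
    i = text.find(pattern)
    while i != -1:
        ranges.append((i + 1, len(pattern)))
        i = text.find(pattern, i + len(pattern))
    return ranges


# Hypothetical paragraph text, before and after an edit.
old = "This method works. However, it is fragile."
new = "Here is a new first sentence. " + old

old_ranges = resolve_by_pattern(old, "However")
new_ranges = resolve_by_pattern(new, "However")

# The start position shifts, but the range still covers 'However'.
for text, ranges in ((old, old_ranges), (new, new_ranges)):
    for start, length in ranges:
        print(start, repr(text[start - 1:start - 1 + length]))
```

The same edit that would break a fixed offset leaves the pattern-based 
pointer selecting the intended word.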

However, there is one problem.  If you change a paragraph by adding a 
similar phrase, you can confuse the XPointer.  For example, if your old 
paragraph was:

    OLD:
    I am a perfectly good paragraph.  Hear me roar.

and you defined a string-range XPointer to the word 'roar':

    string-range(OLD's path, "roar")

and then you change the paragraph to read:

    NEW:
    I am a perfectly good paragraph.  I'm having a roaring good time
    writing this.  Hear me roar.

Now, your old XPointer will return 2 locations--the 'roar' inside 
'roaring' in the new second sentence *and* the original 'roar' in the 
last sentence.  It's impossible to know which one to choose--using the 
simple heuristic of choosing the first would be incorrect in this 
example.
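
The ambiguity is easy to reproduce with a plain substring search over 
the two versions of the paragraph.  This is only a sketch: 
match_positions is a hypothetical helper, and a Python string stands in 
for the string-value of the paragraph element.

```python
import re


def match_positions(text, pattern):
    """Return the 1-based start of each literal occurrence of pattern."""
    return [m.start() + 1 for m in re.finditer(re.escape(pattern), text)]


old = "I am a perfectly good paragraph.  Hear me roar."
new = ("I am a perfectly good paragraph.  I'm having a roaring good time "
       "writing this.  Hear me roar.")

old_matches = match_positions(old, "roar")
new_matches = match_positions(new, "roar")

print(len(old_matches))  # one match in OLD
print(len(new_matches))  # two matches in NEW

# Show a little context for each match in NEW; the first one falls
# inside 'roaring', not on the word the pointer was meant to select.
for pos in new_matches:
    print(pos, repr(new[pos - 1:pos + 6]))
```

Picking the first match would land on 'roaring', which is exactly the 
wrong choice here.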

Nevertheless, it seems to me that the pattern-matching string-range will 
perform better in most cases than the simple string-counting Amaya and 
Annozilla are doing now.  Do other people feel the same, or not?

Additionally, it's worth pointing out that Amaya (as of version 6.2) 
didn't seem to be capable of resolving string-range XPointers that 
required pattern matching.  I don't think Amaya 6.4 can do it 
either--it seems to crash with an **irrecoverable error** whenever I 
try, which can't be a good sign.  Then again, a lot of annotations 
functionality seems to be broken in 6.4...

Doug

Received on Sunday, 17 November 2002 12:57:57 UTC