RE: Rough Draft of Robust Anchoring: the RangeFinder API from Bill Kasdorf on 2015-03-25 (public-annotation@w3.org from March 2015)

From: Bill Kasdorf <bkasdorf@apexcovantage.com>
Date: Wed, 25 Mar 2015 18:57:38 +0000
To: Doug Schepers <schepers@w3.org>, "Liam R. E. Quin" <liam@w3.org>
CC: W3C Public Annotation List <public-annotation@w3.org>
Message-ID: <BLUPR06MB563E8702C4FC1058ECED72EDF0B0@BLUPR06MB563.namprd06.prod.outlook.com>

I will defer to Liam on the datatyping issue, but your examples show clearly the danger of relying on the literal expression of negativity! Way too unpredictable and unreliable.

Liam had cited the specific example of a profit/loss statement; in those, red or parens for negative numbers are common, minus signs not so much. All of these are choices available in Excel, for example, which is what makes them so commonly used.

To avoid going further down a rabbit hole, I think the point here is that you can only query a given set of content for something that is semantically distinct in some way in that content. Some _known_ way. Which means trying to associate anchoring with those kinds of domain-specific semantics is probably futile (unless Liam can point out some way XPath can handle that). Excel may "know" that a number is negative (and represent it per your choice of minus sign, red, or parens), but the same tabular content in a table may just be treated as text (represented one of those ways, or some other way, to indicate that it's negative).
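
To make the "_known_ way" point concrete, here is a minimal Python sketch, not anything from RangeFinder or XPath. The function name parse_amount and the two conventions it recognizes (leading minus sign, surrounding parentheses) are assumptions chosen for illustration. A value shown as negative only by being red is invisible to it, because colour is not in the text at all.

```python
import re
from typing import Optional

def parse_amount(cell: str) -> Optional[float]:
    """Parse a table cell under two assumed conventions for negatives:
    a leading minus sign, or the whole number wrapped in parentheses."""
    text = cell.strip().replace(",", "")  # drop thousands separators
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]
    elif text[:1] in ("-", "\u2212"):  # ASCII hyphen-minus or Unicode minus
        negative, text = True, text[1:]
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return None  # not a number under the conventions we know about
    value = float(text)
    return -value if negative else value

for cell in ["-1,200", "(1,200)", "1200", "forty-two"]:
    print(repr(cell), "->", parse_amount(cell))
```

Any convention outside the ones coded in (red text, a trailing "CR", a column heading that says "Loss") simply comes back as a positive number or as None.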

-----Original Message-----
From: Doug Schepers [mailto:schepers@w3.org] 
Sent: Wednesday, March 25, 2015 2:22 PM
To: Bill Kasdorf; Liam R. E. Quin
Cc: W3C Public Annotation List
Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API

Hi, Bill–

On 3/25/15 10:37 AM, Bill Kasdorf wrote:
> Small but I think relevant point about
>
>> , then you'd look for instances of the minus sign
>
> That's only one of many ways a negative number could be indicated. It 
> is often in parentheses rather than having a minus sign. Or it could 
> be red. Etc.

Huh, I don't know that I've ever seen that. Probably now I'll see it everywhere, in true Baader-Meinhof phenomenology.

How would you indicate that a number is negative, such that a machine could always know that? I suppose you could express it in MathML in some way, but even MathML uses the minus sign as an operator, IIUI. And as of right now, the use of that in HTML is infinitesimal, so you'd have to include a whole set of heuristics in your interpreter to get any kind of reliability, and then you'd have false positives:
  "there were a bunch of people there -23 or so- and I couldn't find her"
  "I counted forty-two (42) chickens in the yard"
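
A rough Python sketch of that false-positive problem follows. The two regular expressions and the name looks_negative are hypothetical heuristics made up for this illustration; they are not part of RangeFinder or any other spec.

```python
import re

# Naive heuristics (assumptions for illustration only):
# a minus sign directly followed by digits, or digits wrapped in parentheses.
MINUS_NUMBER = re.compile(r"-\d+(?:\.\d+)?")
PAREN_NUMBER = re.compile(r"\(\d+(?:\.\d+)?\)")

def looks_negative(text: str) -> list[str]:
    """Return every substring the naive heuristics would flag as negative."""
    return MINUS_NUMBER.findall(text) + PAREN_NUMBER.findall(text)

samples = [
    "Net revenue: -1200",
    "Net revenue: (1200)",
    "there were a bunch of people there -23 or so- and I couldn't find her",
    "I counted forty-two (42) chickens in the yard",
]
for s in samples:
    print(looks_negative(s), "<-", s)
```

The last two sentences are flagged just like the two genuine negative amounts, which is the unreliability in question.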


> If I'm understanding Liam's example correctly, he's properly basing 
> the selection on the value itself (<0) rather than trying to second 
> guess how that value is indicated.

Yes, that was an interesting difference to point out. If that's what's going on, then there has to be some sort of datatyping going on in XPath, which is another key difference. The text-search aspects of RangeFinder are only that: text. RangeFinder doesn't evaluate the semantics of the content, or try to datatype it. It just searches for the strings (modulo case-folding, diacritic-folding, edit distance, and other purely textual operations).
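
To illustrate what "purely textual operations" could look like, here is a small Python sketch. It is not the RangeFinder algorithm or its API; fold, edit_distance, and fuzzy_find are hypothetical helpers, and the fixed-width sliding window is a simplification chosen to keep the example short.

```python
import unicodedata

def fold(text: str) -> str:
    """Case-fold, then strip combining marks (a simple diacritic fold)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_find(haystack: str, needle: str, max_distance: int = 1) -> list[int]:
    """Return offsets (into the folded haystack) of needle-length windows
    whose edit distance to the folded needle is at most max_distance."""
    h, n = fold(haystack), fold(needle)
    return [i for i in range(len(h) - len(n) + 1)
            if edit_distance(h[i:i + len(n)], n) <= max_distance]

print(fuzzy_find("A naïve Café", "cafe", max_distance=0))
print(fuzzy_find("Net revenue: -42", "-42", max_distance=0))
```

The second call finds the characters "-42" but has no notion that they denote a number less than zero; that is the difference from a typed comparison such as `. < 0`.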

I think that a truly general searching webapp would have to be written using multiple APIs, including RangeFinder, the non-DOM text searcher Kristof has suggested, and maybe XPath or other pattern-matching mechanisms. But the scope of RangeFinder shouldn't be expanded to cover all cases; it should just do the bits that it's designed for, which I think will be broadly useful (especially in combination with other APIs).

Regards–
–Doug

> -----Original Message-----
> From: Doug Schepers [mailto:schepers@w3.org]
> Sent: Wednesday, March 25, 2015 1:47 AM
> To: Liam R. E. Quin
> Cc: W3C Public Annotation List
> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
>
> Hi, Liam–
>
> Thanks for the use cases.
>
> I'm sorry for being dense, but I'm not sure how this fits in with the 
> RangeFinder API.
>
> Both of these cases are about using XPath to locate multiple ranges in 
> a single pass, while RangeFinder is an iterative API that 
> incrementally finds a single range at a time, within a particular 
> scope of the document tree, with an optional initial starting point 
> (thus the CSS or XPath selector).
>
> I'm not an expert in XPath, so I'm also not sure how to interpret your 
> examples absent markup examples to apply them to.
>
>
> That being said, here's a quick reaction to the prose aspects of the 
> use cases:
>
> 1) Find (annotate) all cells in which the net revenue is negative:
> in this case, with the RangeFinder API, you'd narrow the scope to the 
> table element, then you'd look for instances of the minus sign, then 
> use regex in JS to see if that is followed by a number. If you were 
> looking for a specific negative number, that would be more 
> straightforward. I considered adding some sort of "wildcard/regex"
> syntax to the search string component, but was discouraged from doing 
> that, for performance reasons; it might still be a worthwhile idea to 
> explore.
>
> 2) Find all students whose tutor is not listed: this sort of operation 
> could be done in a manner similar to the example above (finding 
> instances of the student's name, then looking for related course 
> information in JS by scanning the DOM, assuming you know the DOM 
> structure); but this is not really the point of RangeFinder. It's not 
> intended as a generic pattern matcher, but rather as a 
> narrowly-focused API to find instances of text, or other known ranges, 
> with some ability to apply fuzzy logic around location in the 
> document, text edit distance, and a few other factors.
>
> The functionality you're describing sounds interesting, but it sounds 
> like a different technology; in fact, since you're describing a 
> solution in XPath, is there anything else needed to solve your use 
> case?
>
>
> As a side note regarding XPath, I'm most interested in the 
> robust/fuzzy aspects that I understand were left out of XPath, but 
> which were under consideration; can you share any info on that?
>
> Regards–
> –Doug
>
> On 3/24/15 7:36 PM, Liam R. E. Quin wrote:
>> On Wed, 2015-02-25 at 00:48 -0500, Doug Schepers wrote:
>>> Hi, folks–
>>>
>>> Just a quick note. Rob asked me to move this file, to keep the 
>>> deliverables organized. It's now located at:
>>>
>>> http://w3c.github.io/web-annotation/api/rangefinder/
>>
>> And now at https://specs.webplatform.org/rangefinder/w3c/master/
>>
>> I promised Doug at least a couple of use cases for the XPath 
>> selector. I can write them up in more detail if they're felt to be 
>> reasonable.
>>
>> (1) consider a table such as a profit/loss statement in an annual 
>> report; let's annotate all cells in which the net revenue is 
>> negative. The XPath expression might be something like
>>
>> //table[@id = 'profit-and-loss']//th[. = 'Net Revenue']/following-sibling::td[. < 0]
>>
>> (2) Find all students whose tutor is not listed:
>>
>> //li[@class = 'student'] [ [@class='tutor'] [ 
>> not(//li[@class='tutor']/@id = concat('#', @href)) ] ]
>>
>> These are both fairly complex examples in the spirit of "make the 
>> easy easy and the complex possible". Note that any identifier 
>> pointing at actual text will not be possible with CSS selectors, 
>> although a combination of selectors and byte ranges within a 
>> containing element can be used. But there should also be a checksum 
>> and/or text comparison in case the wrong text is highlighted, of 
>> course.
>>
>> Hope this helps. I have both simpler and more complex examples of 
>> course, if needed.
>>
>> Liam
>>
>>
>>>
>>> Even this is a temporary location, though... I'll be moving it to 
>>> specs.webplatform.org soon, and adding the annotation capability to 
>>> it.
>>>
>>> Feel free to review, but be aware that the URL is transitory.
>>>
>>> Regards–
>>> –Doug
>>>
>>> On 2/24/15 1:33 PM, Doug Schepers wrote:
>>>> Hi, folks–
>>>>
>>>> After talking about Robust Anchoring with many people over the 
>>>> course of the last couple years (!), with encouragement and good 
>>>> criticisms, I've refined my notion of what's needed for a
>>>> client-side API for Robust Anchoring.
>>>>
>>>> I've drawn up a strawman of my current thinking for an API called 
>>>> RangeFinder [1].
>>>>
>>>> It's very rough in places, but I'd appreciate any feedback on the 
>>>> spec as it stands. I'd greatly appreciate any thoughts or opinions 
>>>> on it at this stage.
>>>>
>>>> I'm not sure it's mature enough for this yet, but at some point, 
>>>> I'd like to engage the research and academic communities and the 
>>>> experts who've published on text search algorithms, to polish this 
>>>> up and make it not quite as embarrassing as it is currently. If 
>>>> anyone knows who we should contact in that regard, please chime in. 
>>>> This is a great opportunity to leverage all that research in the 
>>>> service of Web developers and browsers!
>>>>
>>>> [1] http://w3c.github.io/web-annotation/rangefinder-api/
>>>>
>>>> Regards–
>>>> –Doug
>>>>
>>>
>>>
>>
>
Received on Wednesday, 25 March 2015 18:58:07 UTC