Re: Rough Draft of Robust Anchoring: the RangeFinder API from Liam R. E. Quin on 2015-03-25 (public-annotation@w3.org from March 2015)

From: Liam R. E. Quin <liam@w3.org>
Date: Wed, 25 Mar 2015 16:14:10 -0400
To: Bill Kasdorf <bkasdorf@apexcovantage.com>
Cc: Doug Schepers <schepers@w3.org>, W3C Public Annotation List <public-annotation@w3.org>
Message-ID: <1427314450.5626.85.camel@w3.org>
On Wed, 2015-03-25 at 18:57 +0000, Bill Kasdorf wrote:
> Liam had cited the specific example of a profit/loss statement; in 
> those, red or parens for negative numbers are common, minus signs 
> not so much. All of these are choices available in Excel, for 
> example, which is what makes them so commonly used.

To be fair, XPath does not have a built-in capacity for deciding 
(3,000) is -3000, although it *does* have a way for providing such 
smarts (including redness). But I didn't illustrate that. XPath 
extensions look like functions, and in the XML world have an 
associated namespace, browser:redness(), browser:hover() or whatever.

Another issue with numbers is internationalization, with 1,300 meaning 
different things in different parts of the world and to different 
people. Another is representing scientific notation, 3.6×10¹² etc. But 
one could imagine some functions, av:financial(), av:scientific(), 
av:numeric() for this sort of use case (I used "av" for annotated 
value, arbitrarily). XPath 3 gives users mechanisms to write such 
functions themselves, but here, if annotations are important to the 
financial industry, to scientific journals, to places where people add 
and subtract :), there seems to be merit in such facilities, along 
with other hard-to-write but already-understood notations such as 
dates, times, sock sizes and so forth.
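
As a rough illustration of what a function like av:financial() might
do, here is a minimal Python sketch. The name financial() and its
rules are assumptions made for this example only: parentheses or a
leading minus sign mean negative, and a comma is taken to be a
thousands separator. That last rule is exactly the locale-dependent
guess described above, so a real facility would need a locale
parameter.

```python
import re
from decimal import Decimal


def financial(text: str) -> Decimal:
    """Hypothetical stand-in for an av:financial()-style function.

    Turns accounting-style text such as "(3,000)" into a number.
    Assumes "," is a thousands separator and "." a decimal point.
    """
    s = text.strip()
    negative = False
    # Accounting convention: parentheses mark a negative amount.
    if s.startswith("(") and s.endswith(")"):
        negative = True
        s = s[1:-1].strip()
    # Accept ASCII hyphen-minus or U+2212 MINUS SIGN as a leading sign.
    if s[:1] in ("-", "\u2212"):
        negative = True
        s = s[1:].strip()
    # Assumption: commas only group thousands (not true in every locale).
    s = s.replace(",", "")
    if not re.fullmatch(r"\d+(\.\d+)?", s):
        raise ValueError(f"not a recognised financial value: {text!r}")
    value = Decimal(s)
    return -value if negative else value
```

With such a function available, a test like "value < 0" can be
applied to the parsed number rather than to how it happens to be
written. Redness cannot be detected this way, since it lives in the
styling and not in the text.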

> To avoid going further down a rabbit hole, I think the point here is 
> that you can only query a given set of content for something that is 
> semantically distinct in some way in that content. Some _known_ way.


Yes - although even searching for a paragraph containing "rabbit" is 
something that can be done in XPath and not CSS. Including textual 
content in a fragment identifier (e.g. XPointer with the XPath scheme) 
can be much more robust against changes in documents than using (e.g.) 
numeric tumblers, :nth-child and so forth.
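
A small Python sketch of that robustness argument follows. It uses
the standard library's ElementTree, whose XPath subset does not
include contains(), so the text test is written as a helper; in full
XPath the equivalent would be something like
//p[contains(., 'rabbit')]. The sample documents and the helper name
find_by_text are invented for illustration.

```python
import xml.etree.ElementTree as ET


def find_by_text(root, tag, needle):
    """Return the first <tag> element whose text content contains needle."""
    for el in root.iter(tag):
        # itertext() also sees text inside child elements such as <em>.
        if needle in "".join(el.itertext()):
            return el
    return None


# The document as it was when the annotation was made.
before = ET.fromstring(
    "<body><p>Alice was bored.</p><p>A white rabbit ran by.</p></body>"
)
# The same document after an edit added a paragraph and some markup.
after = ET.fromstring(
    "<body><p>New preface.</p><p>Alice was bored.</p>"
    "<p>A white <em>rabbit</em> ran by.</p></body>"
)

# Positional anchor, in the style of :nth-child: "the second p".
positional_before = before.findall("p")[1]
positional_after = after.findall("p")[1]

# Text-based anchor: "the p that mentions a rabbit".
text_after = find_by_text(after, "p", "rabbit")
```

After the edit, the positional anchor lands on a different paragraph,
while the text-based one still reaches the paragraph that mentions
the rabbit.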

SoftQuad's Panorama SGML viewer used to look for the nearest ID-valued 
attribute and store a path from there to the highlight start and end, 
for annotations. This turned out to be very robust in practice because 
IDs (and HTML name attributes) tend to be stable over time with 
respect to the content they contain. But it needed things like XPath's 
full parent and sibling navigation to do that (it predated XPath and 
was based on HyTime and TEI Pointers, out of which background XPath 
arose).
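
The general idea of anchoring to the nearest ID can be sketched in
Python as below. This is not a reconstruction of Panorama's actual
mechanism: the function names, the child-index path format and the
sample markup are all assumptions for the example, and it addresses
whole elements only, not character offsets for a highlight's start
and end.

```python
import xml.etree.ElementTree as ET


def make_anchor(root, target):
    """Return (id, path): the nearest ancestor-or-self id of target and
    the child indexes leading from that element down to target.
    If no id is found, id is None and the path starts at root."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    path = []
    node = target
    while node.get("id") is None:
        parent = parent_of.get(node)
        if parent is None:  # reached the root without seeing an id
            return None, path[::-1]
        path.append(list(parent).index(node))
        node = parent
    return node.get("id"), path[::-1]


def resolve_anchor(root, anchor):
    """Follow an anchor made by make_anchor; return None if it is broken."""
    anchor_id, path = anchor
    node = root
    if anchor_id is not None:
        node = next((e for e in root.iter() if e.get("id") == anchor_id), None)
    for index in path:
        if node is None or index >= len(node):
            return None
        node = node[index]
    return node


doc = ET.fromstring(
    "<body><div id='intro'><p>Hello</p></div>"
    "<div id='results'><h2>Results</h2><p>Net revenue fell.</p></div></body>"
)
target = doc.find("div[@id='results']/p")
anchor = make_anchor(doc, target)

# An edit elsewhere in the document: a new block is added at the top.
doc.insert(0, ET.Element("div"))
found = resolve_anchor(doc, anchor)
```

Because the path starts at the element with id "results" and not at
the document root, the edit at the top of the document does not
disturb it.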

My goal really was to ask whether more complex functionality for 
finding ranges was considered to be as much in scope as a complex 
data model, and to probe that with some examples.

Liam

>  Which means trying to associate anchoring with those kinds of 
> domain-specific semantics is probably futile (unless Liam can point 
> out some way Xpath can handle that). Excel may "know" that a number 
> is negative (and represent it per your choice of minus sign, red, or 
> parens) but the same tabular content in a table may just consider it 
> text (represented one of those ways, or some other way, to indicate 
> that it's negative).
> 
> -----Original Message-----
> From: Doug Schepers [mailto:schepers@w3.org]
> Sent: Wednesday, March 25, 2015 2:22 PM
> To: Bill Kasdorf; Liam R. E. Quin
> Cc: W3C Public Annotation List
> Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
> 
> Hi, Bill–
> 
> On 3/25/15 10:37 AM, Bill Kasdorf wrote:
> > Small but I think relevant point about
> > 
> > > , then you'd look for instances of the minus sign
> > 
> > That's only one of many ways a negative number could be indicated. 
> > It is often in parentheses rather than having a minus sign. Or it 
> > could be red. Etc.
> 
> Huh, I don't know that I've ever seen that. Probably now I'll see it 
> everywhere, in true Baader-Meinhof phenomenology.
> 
> How would you indicate that a number is negative, such that a 
> machine could always know that? I suppose you could express it in 
> MathML in some way, but even MathML uses the minus sign as an 
> operator, IIUC. And as of right now, the use of that in HTML is 
> infinitesimal, so you'd have to include a whole set of heuristics in 
> your interpreter to get any kind of reliability, and then you'd have 
> false positives:
>   "there were a bunch of people there -23 or so- and I couldn't find 
> her"
>   "I counted forty-two (42) chickens in the yard"
> 
> 
> > If I'm understanding Liam's example correctly, he's properly 
> > basing the selection on the value itself (<0) rather than trying 
> > to second guess how that value is indicated.
> 
> Yes, that was an interesting difference to point out. If that's 
> what's going on, then there has to be some sort of datatyping going 
> on in XPath, which is another key difference. The text-search 
> aspects of RangeFinder are only that: text. RangeFinder doesn't 
> evaluate the semantics of the content, or try to datatype it. It 
> just searches for the strings (modulo case-folding, diacritic-
> folding, edit distance, and other purely textual operations).
> 
> I think that a truly general searching webapp would have to be 
> written using multiple APIs, including RangeFinder, the non-DOM text 
> searcher Kristof has suggested, and maybe XPath or other pattern-
> matching mechanisms. But the scope of RangeFinder shouldn't be 
> expanded to cover all cases, it should just do the bits that it's 
> designed for, which I think will be broadly useful (especially in 
> combination with other APIs).
> 
> Regards–
> –Doug
> 
> > -----Original Message-----
> > From: Doug Schepers [mailto:schepers@w3.org]
> > Sent: Wednesday, March 25, 2015 1:47 AM
> > To: Liam R. E. Quin
> > Cc: W3C Public Annotation List
> > Subject: Re: Rough Draft of Robust Anchoring: the RangeFinder API
> > 
> > Hi, Liam–
> > 
> > Thanks for the use cases.
> > 
> > I'm sorry for being dense, but I'm not sure how this fits in with 
> > the RangeFinder API.
> > 
> > Both of these cases are about using XPath to locate multiple 
> > ranges in a single pass, while RangeFinder is an iterative API that
> > incrementally finds a single range at a time, within a particular 
> > scope of the document tree, with an optional initial starting 
> > point (thus the CSS or XPath selector).
> > 
> > I'm not an expert in XPath, so I'm also not sure how to interpret 
> > your examples absent markup examples to apply them to.
> > 
> > 
> > That being said, here's a quick reaction to the prose aspects of 
> > the use cases:
> > 
> > 1) Find (annotate) all cells in which the net revenue is negative: 
> > in this case, with the RangeFinder API, you'd narrow the scope to 
> > the table element, then you'd look for instances of the minus 
> > sign, then use regex in JS to see if that is followed by a number. 
> > If you were looking for a specific negative number, that would be 
> > more straightforward. I considered adding some sort of 
> > "wildcard/regex" syntax to the search string component, but was 
> > discouraged from doing that, for performance reasons; it might 
> > still be a worthwhile idea to explore.
> > 
> > 2) Find all students whose tutor is not listed: this sort of 
> > operation could be done in a manner similar to the example above 
> > (finding instances of the student's name, then looking for related 
> > course information in JS by scanning the DOM, assuming you know 
> > the DOM structure); but this is not really the point of 
> > RangeFinder. It's not intended as a generic pattern matcher, but 
> > rather as a
> > narrowly-focused API to find instances of text, or other known 
> > ranges, with some ability to apply fuzzy logic around location in 
> > the document, text edit distance, and a few other factors.
> > 
> > The functionality you're describing sounds interesting, but it 
> > sounds like a different technology; in fact, since you're 
> > describing a solution in XPath, is there anything else needed to 
> > solve your use case?
> > 
> > 
> > As a side note regarding XPath, I'm most interested in the
> > robust/fuzzy aspects that I understand were left out of XPath, but 
> > which were under consideration; can you share any info on that?
> > 
> > Regards–
> > –Doug
> > 
> > On 3/24/15 7:36 PM, Liam R. E. Quin wrote:
> > > On Wed, 2015-02-25 at 00:48 -0500, Doug Schepers wrote:
> > > > Hi, folks–
> > > > 
> > > > Just a quick note. Rob asked me to move this file, to keep the 
> > > > deliverables organized. It's now located at:
> > > > 
> > > > http://w3c.github.io/web-annotation/api/rangefinder/
> > > 
> > > And now at https://specs.webplatform.org/rangefinder/w3c/master/
> > > 
> > > I promised Doug at least a couple of use cases for the XPath 
> > > selector. I can write them up in more detail if they're felt to 
> > > be reasonable.
> > > 
> > > (1) consider a table such as a profit/loss statement in an 
> > > annual report; let's annotate all cells in which the net revenue 
> > > is negative. The XPath expression might be something like 
> > > //table[@id = 'profit-and-loss']//th[. = 'Net 
> > > Revenue']/following-sibling::td[. < 0]
> > > 
> > > (2) Find all students whose tutor is not listed:
> > > 
> > > //li[@class = 'student'] [ [@class='tutor'] [ 
> > > not(//li[@class='tutor'] /@id= concat('#', @href)) ] ]
> > > 
> > > These are both fairly complex examples in the spirit of "make 
> > > the easy easy and the complex possible". Note that any 
> > > identifier pointing at actual text will not be possible with CSS 
> > > selectors, although a combination of selectors and byte ranges 
> > > within a
> > > containing element can be used. But there should also be a 
> > > checksum and/or text comparison in case the wrong text is 
> > > highlighted, of course.
> > > 
> > > Hope this helps. I have both simpler and more complex examples 
> > > of course, if needed.
> > > 
> > > Liam
> > > 
> > > 
> > > > 
> > > > Even this is a temporary location, though... I'll be moving it 
> > > > to specs.webplatform.org soon, and adding the annotation 
> > > > capability to it.
> > > > 
> > > > Feel free to review, but be aware that the URL is transitory.
> > > > 
> > > > Regards–
> > > > –Doug
> > > > 
> > > > On 2/24/15 1:33 PM, Doug Schepers wrote:
> > > > > Hi, folks–
> > > > > 
> > > > > After talking about Robust Anchoring with many people over 
> > > > > the course of the last couple years (!), with encouragement 
> > > > > and good criticisms, I've refined my notion of what's needed 
> > > > > for a client-side API for Robust Anchoring.
> > > > > 
> > > > > I've drawn up a strawman of my current thinking for an API 
> > > > > called RangeFinder [1].
> > > > > 
> > > > > It's very rough in places, but I'd appreciate any feedback 
> > > > > on the spec as it stands. I'd greatly appreciate any 
> > > > > thoughts or opinions on it at this stage.
> > > > > 
> > > > > I'm not sure it's mature enough for this yet, but at some 
> > > > > point, I'd like to engage the research and academic 
> > > > > communities and the experts who've published on text search 
> > > > > algorithms, to polish this up and make it not quite as 
> > > > > embarrassing as it is currently. If anyone knows who we 
> > > > > should contact in that regard, please chime in. This is a 
> > > > > great opportunity to leverage all that research in the 
> > > > > service of Web developers and browsers!
> > > > > 
> > > > > [1] http://w3c.github.io/web-annotation/rangefinder-api/
> > > > > 
> > > > > Regards–
> > > > > –Doug
> > > > > 
> > > > 
> > > > 
> > > 
> >
Received on Wednesday, 25 March 2015 20:14:19 UTC