Re: Rough Draft of Robust Anchoring: the RangeFinder API

Hi, Liam–

On 3/25/15 4:14 PM, Liam R. E. Quin wrote:
> On Wed, 2015-03-25 at 18:57 +0000, Bill Kasdorf wrote:
>> Liam had cited the specific example of a profit/loss statement; in
>> those, red or parens for negative numbers are common, minus signs
>> not so much. All of these are choices available in Excel, for
>> example, which is what makes them so commonly used.
>
> To be fair, XPath does not have a built-in capacity for deciding
> (3,000) is -3000, although it *does* have a way for providing such
> smarts (including redness). But I didn't illustrate that. XPath
> extensions look like functions, and in the XML world have an
> associated namespace, browser:redness(), browser:hover() or whatever.
>
> Another issue with numbers is internationalization, with 1,300 meaning
> different things in different parts of the world and to different
> people. Another is representing scientific notation, 3.6×10¹² etc. But
> one could imagine some functions, av:financial(), av:scientific(),
> av:numeric() for this sort of use case (I used "av" for annotated
> value, arbitrarily). XPath 3 gives users mechanisms to write such
> functions themselves, but here, if annotations are important to the
> financial industry, to scientific journals, to places where people add
> and subtract :), there seems to be merit in such facilities, along
> with other hard-to-write but already-understood notations such as
> dates, times, sock sizes and so forth.

Interesting.

We are planning to have an (as-yet-ill-defined) "customSelector"
attribute that lets developers define search mechanisms specific to 
their use case, so this could be done as part of that. They won't be as 
performant as the native functionality, but it will allow for more 
complex searching behavior, including whatever datatyping is needed (so 
long as it can be applied via script).
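
As a rough illustration of the kind of script-level datatyping a custom
selector could apply, here is a minimal sketch in Python (chosen purely for
illustration). The names parse_financial and find_values are hypothetical and
not part of any RangeFinder draft, and the number format assumed is en-US
(comma for thousands, period for decimals).

```python
import re
from typing import List, Optional

# Hypothetical helpers: not part of RangeFinder or any draft API.
# Assumption: en-US conventions (',' groups thousands, '.' marks decimals).
_NUM = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?"
_ACCOUNTING = re.compile(r"\((" + _NUM + r")\)")  # e.g. "(3,000)"
_PLAIN = re.compile(r"-?" + _NUM)                  # e.g. "1,300" or "-3000"


def parse_financial(text: str) -> Optional[float]:
    """Return the numeric value of a financial-style token, or None."""
    s = text.strip()
    m = _ACCOUNTING.fullmatch(s)
    if m:
        # Parentheses denote a negative amount.
        return -float(m.group(1).replace(",", ""))
    if _PLAIN.fullmatch(s):
        return float(s.replace(",", ""))
    return None


def find_values(tokens: List[str], target: float) -> List[int]:
    """Return the indexes of tokens whose parsed value equals target."""
    return [i for i, t in enumerate(tokens) if parse_financial(t) == target]
```

With this sketch, find_values(["(3,000)", "3,000", "-3000", "n/a"], -3000)
picks out the first and third tokens, which a plain string search for
"-3000" would not.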

Since we aren't dealing with datatyping natively, we sidestep all these 
issues of finding "types of things" vs strings.

FWIW, the Levenshtein distance between "1,300" and "1.300" (which uses
another common thousands separator), or indeed "1300", is 1; either would
be easily found using RangeFinder with an edit distance allowance.
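
The distances quoted above can be checked with a textbook Levenshtein
implementation. This is only a sketch in Python for illustration; it says
nothing about how RangeFinder itself would compute or apply an edit distance
allowance, and the function name is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


for other in ("1.300", "1300", "1,301", "13,000"):
    print(other, levenshtein("1,300", other))
```

Note that "1,301" is also at distance 1 from "1,300", so an edit distance
allowance finds near-identical strings, not equal values.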


>> To avoid going further down a rabbit hole, I think the point here is
>> that you can only query a given set of content for something that is
>> semantically distinct in some way in that content. Some _known_ way.
>
> Yes - although even searching for a paragraph containing "rabbit" is
> something that can be done in XPath and not CSS. Including textual
> content in a fragment identifier (e.g. XPointer with the XPath scheme)
> can be much more robust against changes in documents than using (e.g.)
> numeric tumblers, :nth-child and so forth.
>
> SoftQuad's Panorama SGML viewer used to look for the nearest ID-valued
> attribute and store a path from there to the highlight start and end,
> for annotations. This turned out to be very robust in practice because
> IDs (and HTML name attributes) tend to be stable over time with
> respect to the content they contain. But it needed things like XPath's
> full parent and sibling navigation to do that (it predated XPath and
> was based on HyTime and TEI Pointers, out of which background XPath
> arose).

Interesting background. I'm not convinced that the ID stability will be
the same on Web documents on the whole; many don't use IDs at all, and
in those that do, the IDs often change because they are generated by CMSes
(e.g. MediaWiki, which automatically derives the ID from the heading text).

But I am interested in other robustness mechanisms that might have been
developed around the time of XPath.


> My goal really was to ask whether more complex functionality for
> finding ranges was considered as in-scope as a complex data model, and
> to probe that with some examples.

Some types of complex behavior may be in scope (again, such things as 
case-folding and diacritic-folding), but not this particular complex 
functionality you're asking about.
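
For concreteness, case-folding and diacritic-folding can be sketched as below,
in Python for illustration only. The fold function is a hypothetical name, and
stripping Unicode combining marks after NFD decomposition is one common
approach, not necessarily what the API would specify.

```python
import unicodedata


def fold(text: str) -> str:
    """Fold case and strip diacritics so that 'Café' and 'CAFE' compare equal."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (accents, diaereses, and so on).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


print(fold("Café") == fold("CAFE"))
print("rabbit" in fold("Down the RÄBBIT hole"))
```

Folding can change string length (for example, "ß" case-folds to "ss"), so
mapping a match in the folded text back to a range in the original needs
extra bookkeeping.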

We're not trying to reinvent XPath. In fact, the current XPath selector 
(and querySelector) parts are probably going to be removed, in favor of 
a simple startRange selector. That way, an author can find the initial 
starting and scoping ranges themselves (using querySelector or an XPath 
selector or whatever they want), and simply feed that in generically. 
This will mean a less complex (and probably more performant) API to
implement.
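
The division of labour described above might look roughly like this. It is a
sketch in Python over a plain string rather than a DOM, and Range and
find_in_range are hypothetical names, not the draft API: the caller works out
the scoping range by whatever means it likes, and the finder only searches
inside it.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for a DOM range: (start, end) character offsets.
Range = Tuple[int, int]


def find_in_range(text: str, query: str,
                  start_range: Optional[Range] = None) -> Optional[Range]:
    """Find query inside start_range, however the caller obtained that range."""
    lo, hi = start_range if start_range is not None else (0, len(text))
    # str.find restricts the search to text[lo:hi] but returns an offset
    # relative to the whole string.
    pos = text.find(query, lo, hi)
    return None if pos == -1 else (pos, pos + len(query))


doc = "Intro: rabbit. Section 2: another rabbit here."
# The caller resolves the scope itself; here, everything from "Section 2" on.
scope = (doc.index("Section 2"), len(doc))
print(find_in_range(doc, "rabbit"))         # search the whole text
print(find_in_range(doc, "rabbit", scope))  # search only within the scope
```

The finder never needs to know whether the scope came from querySelector, an
XPath expression, or anything else.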

Regards–
–Doug

Received on Wednesday, 25 March 2015 22:01:51 UTC