- From: Al Gilman <asgilman@iamdigex.net>
- Date: Mon, 21 Apr 2003 09:41:43 -0400
- To: <uri@w3.org>
- Cc: <net.dred@dred.net>, "'John Cowan'" <cowan@mercury.ccil.org>
At 01:25 PM 2003-04-19, Larry Masinter wrote:
>A line number plus some bits of context would be useful;
>I don't see the use of character counts or general regular
>expressions, at least for the applications I can imagine.

Please do not consider line numbers a candidate form of matchable, here. The domain that 'patch' is applied to is nothing like (much more standardized than) the space of natural language texts that we need this to work with.

At 09:22 AM 2003-04-21, John Cowan wrote:
>The basic advantage of character counts is that they are fast to
>compute and fast to reference, and would be handy for pointing into
>documents like RFCs that are truly and forever frozen.

Yes, but that class of documents represents a tiny fraction of the corpus we need to be able to index into.

What are the mechanisms that introduce variation in character counts, and how often will they develop skew in routine handling of text/* data?

claim: What we do has to work for marked text and plain text. At least in the lexicon application, what we need is something that finds unmarked points in the text inside a markup-language document-instance.

A major domain of application for references into un-XML text is to merge comments on prior texts that originated in email. The text commented on will appear in different wrappings in different sources that were used as the basis for commenting. The original un-flowed message, the flowed digest version of the listserv that the message was addressed to, an HTML document from a web archive of the listserv, a reply to a message that quotes the text in question, Comments posted to a Wiki site initialized with the list archive, etc.
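To illustrate the skew described above, here is a minimal Python sketch. The helper name `strip_wrapping` and the sample strings are assumptions made up for this illustration; they do not come from any proposal in this thread. It shows one sentence in two wrappings (an original message and a quoted, re-wrapped reply), where a raw substring search fails in the reply but the same search succeeds once quote markers and line breaks are normalized away.

```python
import re

def strip_wrapping(text):
    """Remove leading '>' quote markers and collapse all whitespace runs.

    Hypothetical helper: one simple way to make plain text comparable
    across re-wrapped and quoted copies.
    """
    lines = [re.sub(r"^\s*(>\s*)+", "", line) for line in text.splitlines()]
    return " ".join(" ".join(lines).split())

# The same words as first sent ...
original = (
    "Character counts are fast to compute.\n"
    "They are handy for frozen documents.\n"
)

# ... and as they appear in a reply that quotes and re-wraps them.
quoted_reply = (
    "On Monday someone wrote:\n"
    "> Character counts are fast\n"
    "> to compute. They are handy\n"
    "> for frozen documents.\n"
)

phrase = "handy for frozen documents"

# Raw search: found in the original, missed in the reply because a
# line break and a quote marker fall in the middle of the phrase.
raw_in_original = original.find(phrase)
raw_in_reply = quoted_reply.find(phrase)

# Normalized search: the phrase is found in both copies.
found_in_original = phrase in strip_wrapping(original)
found_in_reply = phrase in strip_wrapping(quoted_reply)
```

The line on which the phrase starts and its character offset both differ between the two copies, which is the kind of skew that makes either one a fragile reference.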
Line numbers in something that the commentor has as plain text are unreliably associated with anything that the originator of the text felt was meaningful, outside of poetry where we a) can step up to XML markup or b) use enough [streaming] text context so that the pointing is effective without benefit of reference to line ends.

Line number comparisons will lead to wrong results just too often for us to put any dependency on that mechanism.

> > There's a tradeoff between robustness and the ability to point into
> > frozen documents where the author has not politely scattered lots of
> > ids for you. That's why XPointer supports both ids and Xanadu-style
> > tumblers as the simple cases, along with the super-duper-XPath version
> > that isn't yet ready for prime time.
>
>I don't see what the tradeoff is, actually. The idea for
>pointing into 'frozen' documents without IDs is to supply
>context. This has been common practice in finding locations
>in plain text files for decades (`man patch`).
>
>The 'patch' command has been pretty reliable, in my experience,
>in figuring out where to apply a patch even when the original
>source file has been edited. So why shouldn't text/plain fragment
>identifiers employ some of the same mechanisms?
>
>A line number plus some bits of context would be useful;
>I don't see the use of character counts or general regular
>expressions, at least for the applications I can imagine.

Character counts are only applicable in documents that have been [standardized] fit for digital signature. Yes, they are risky. IMHO too risky. And so are line numbers.

One of the functions that we in W3C/WAI/PF are searching for is the right, widely understood, form of just-fuzzy-enough matching of terms in text so that a glossary entry, for example, can be matched to instances of the defined term without marking the instances. This is of interest both for pronunciation and interpretation aids or hints.
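As a rough sketch of what "just fuzzy enough" might mean for a glossary term, the Python below matches a term case-insensitively, across any whitespace or line break between its words, and with an optional plural "s". These three relaxations are assumptions chosen for illustration, not an agreed definition, and the names `term_pattern` and `find_term` are hypothetical.

```python
import re

def term_pattern(term):
    """Build a tolerant regular expression for a glossary term.

    Assumed relaxations: case is ignored, words may be separated by any
    whitespace run (including line breaks), and the last word may carry
    a plural "s". Terms are assumed to start and end with word characters.
    """
    words = [re.escape(word) for word in term.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"s?\b", re.IGNORECASE)

def find_term(term, text):
    """Return (start, end) spans of every instance of the term in text."""
    return [match.span() for match in term_pattern(term).finditer(text)]

text = (
    "A fragment identifier names part of a resource.\n"
    "Fragment\n"
    "identifiers for text/plain are undefined."
)

hits = find_term("fragment identifier", text)
matched = [text[start:end] for start, end in hits]
# matched holds the two unmarked instances of the term, one of which
# is capitalized, pluralized, and split across a line break.
```

The instances are located without any markup in the text itself, which is the property the glossary application needs.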
One of the things that the third-party tool industry [which includes 'Assistive Technology' as used by the Access Board in the Section 508 regs] needs from the Internet is that XQuery/XPath, VoiceBrowserLexicon, LearningDisabledPictionaryLink, and WhereItSaysReferences[1] into plain text not all go off in different directions on the form and levels of this string-matching functionality.

I believe that whatever we do with "where it says" URI-references should work a lot like what one would do when passing a friend a searchPattern.

One of the terms in the search pattern would be a sequence of words in quotes so as to be matched as a consecutive word sequence. This would be unique in the document of interest. To this one would add a few more terms which are atypically common in the document, after the manner of the 'Robust Hyperlinks' work[2]. These are added to the search pattern until the document of interest reliably appears at the head of the "quality of match ordered" list.

That's the search-pattern version. In the presence of some more efficient key for a plain text scope such as

  ftp://ftp.ietf.org/rfc/rfc2822.txt

...the latter group of uniquifying search keys would be suppressed in deference to the scoping/locating URI.

What we need, once again, as for references to time-range-delimited slices of temporal-media objects, is something that works like standard fieldnames for use in search patterns in search URIs.

If our string-matching-for-whereItSays-location doesn't work much as it does in Internet search services, our users will be royally confused.

Al

[1] http://web.archive.org/web/19990909043116/http://www.access.digex.net/~asgilman/web-access/wis_rfc.html
[2] http://www.cs.berkeley.edu/~phelps/Robust/papers/RobustHyperlinks.html

>Larry
>--
>http://larry.masinter.net
Received on Monday, 21 April 2003 11:35:54 UTC