RE: URI Fragment Identifiers for text/plain from Al Gilman on 2003-04-21 (uri@w3.org from April 2003)

From: Al Gilman <asgilman@iamdigex.net>
Date: Mon, 21 Apr 2003 09:41:43 -0400
To: <uri@w3.org>
Cc: <net.dred@dred.net>, "'John Cowan'" <cowan@mercury.ccil.org>
Message-Id: <5.1.0.14.2.20030420103945.025d21c0@pop.iamdigex.net>
At 01:25 PM 2003-04-19, Larry Masinter wrote:
>A line number plus some bits of context would be useful;
>I don't see the use of character counts or general regular
>expressions, at least for the applications I can imagine.

Please do not consider line numbers a candidate form of matchable context here.

The domain that 'patch' is applied to is nothing like (much more
standardized than) the space of natural language texts that we need this to
work with.

At 09:22 AM 2003-04-21, John Cowan wrote:

>The basic advantage of character counts is that they are fast to
>compute and fast to reference, and would be handy for pointing into
>documents like RFCs that are truly and forever frozen.

Yes, but that class of documents represents a tiny fraction of the
corpus we need to be able to index into.

What are the mechanisms that introduce variation in character counts, and
how often will they develop skew in routine handling of text/* data?

claim:  What we do has to work for marked text and plain text.  At least
in the lexicon application, what we need is something that finds unmarked
points in the text inside a markup-language document-instance.

A major domain of application for references into un-XML text is to merge
comments on prior texts that originated in email.  The text commented on
will appear in different wrappings in the different sources that were used as
the basis for commenting:  the original un-flowed message, the flowed digest
version of the listserv that the message was addressed to, an HTML document
from a web archive of the listserv, a reply to a message that quotes the
text in question, comments posted to a Wiki site initialized with the list
archive, etc.

Line numbers in something that the commenter has as plain text are
unreliably associated with anything that the originator of the text felt was
meaningful, outside of poetry, where we can a) step up to XML markup or b)
use enough [streaming] text context so that the pointing is effective
without benefit of reference to line ends.  Line number comparisons will
lead to wrong results just too often for us to put any dependency on that
mechanism.
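
To make the skew concrete, here is a minimal sketch in Python.  It is only
an illustration, not a proposed syntax:  the helper name line_of and the
sample text are made up for this example.  It looks a phrase up as a stream
of words, ignoring where line ends fall, and then reports which line the
match happens to start on in an original and in a re-flowed copy.

```python
import re
import textwrap

def line_of(text, phrase):
    """Return the 1-based line on which `phrase` starts, or None if absent.

    The phrase is matched as a word stream: any run of whitespace,
    including line ends, may separate its words.
    """
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    match = re.search(pattern, text)
    if match is None:
        return None
    return text.count("\n", 0, match.start()) + 1

# Hypothetical sample text, as the originator wrapped it.
original = (
    "Line numbers in plain text are unreliably associated with\n"
    "anything that the originator of the text felt was meaningful.\n"
)

# The same words after some mailer or digest re-flowed them to 30 columns.
reflowed = textwrap.fill(" ".join(original.split()), width=30)

if __name__ == "__main__":
    phrase = "originator of the text"
    print(line_of(original, phrase), line_of(reflowed, phrase))
```

The phrase is found in both copies, but the line number it lands on differs,
which is the reason not to depend on line numbers.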

> > There's a tradeoff between robustness and the ability to point into
> > frozen documents where the author has not politely scattered lots of
> > ids for you.  That's why XPointer supports both ids and Xanadu-style
> > tumblers as the simple cases, along with the super-duper-XPath version
> > that isn't yet ready for prime time.
>
>I don't see what the tradeoff is, actually. The idea for
>pointing into 'frozen' documents without IDs is to supply
>context. This has been common practice in finding locations
>in plain text files for decades (`man patch`).
>
>The 'patch' command has been pretty reliable, in my experience,
>in figuring out where to apply a patch even when the original
>source file has been edited. So why shouldn't text/plain fragment
>identifiers employ some of the same mechanisms?
>
>A line number plus some bits of context would be useful;
>I don't see the use of character counts or general regular
>expressions, at least for the applications I can imagine.

Character counts are only applicable in documents that have been
[standardized] fit for digital signature.  Yes, they are risky.
IMHO too risky.  And so are line numbers.

One of the functions that we in W3C/WAI/PF are searching for
is the right, widely understood, form of just-fuzzy-enough
matching of terms in text so that a glossary entry, for example,
can be matched to instances of the defined term without
marking the instances.

This is of interest both for pronunciation and interpretation
aids or hints.

One of the things that the third-party tool industry [which includes
'Assistive Technology' as used by the Access Board in the Section 508 regs]
needs from the Internet is that XQuery/XPath, VoiceBrowserLexicon,
LearningDisabledPictionaryLink, and WhereItSaysReferences[1]
into plain text not all go off in different directions on the form and
levels of this string-matching functionality.

I believe that whatever we do with "where it says" URI-references should
work a lot like what one would do when passing a friend a searchPattern.

One of the terms in the search pattern would be a sequence of words in
quotes so as to be matched as a consecutive word sequence.  This would be
unique in the document of interest.  To this one would add a few more terms
which are atypically common in the document, after the manner of the 'Robust
Hyperlinks' work[2].  These are added to the search pattern until the
document of interest reliably appears at the head of the "quality of match
ordered" list.  That's the search-pattern version.
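
A rough sketch of how such a pattern might be assembled, in Python.  All of
the names here (words, unique_phrase, distinctive_terms, search_pattern) are
hypothetical, and the scoring is a simple frequency ratio against a
background sample chosen for illustration;  it is not the method of the
'Robust Hyperlinks' work, only something in the same spirit.

```python
import re
from collections import Counter

def words(text):
    """Lower-case word stream; line ends and punctuation are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unique_phrase(doc_words, start, max_len=12):
    """Grow a consecutive word sequence from `start` until it occurs
    exactly once in the document; return None if that never happens."""
    for n in range(1, max_len + 1):
        candidate = doc_words[start:start + n]
        if len(candidate) < n:
            break  # ran off the end of the document
        hits = sum(
            1
            for i in range(len(doc_words) - n + 1)
            if doc_words[i:i + n] == candidate
        )
        if hits == 1:
            return candidate
    return None

def distinctive_terms(doc_words, background_words, k=3):
    """Return k terms whose rate in the document most exceeds their rate
    in the background sample (add-one smoothing on the background)."""
    doc_counts = Counter(doc_words)
    bg_counts = Counter(background_words)

    def score(term):
        doc_rate = doc_counts[term] / len(doc_words)
        bg_rate = (bg_counts[term] + 1) / (len(background_words) + 1)
        return doc_rate / bg_rate

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(doc_counts, key=lambda t: (-score(t), t))[:k]

def search_pattern(doc, start, background, k=2):
    """Quoted unique phrase plus up to k extra uniquifying terms."""
    doc_words = words(doc)
    phrase = unique_phrase(doc_words, start)
    if phrase is None:
        raise ValueError("no unique phrase found at this position")
    ranked = distinctive_terms(doc_words, words(background), k + len(phrase))
    extras = [t for t in ranked if t not in phrase][:k]
    return " ".join(['"' + " ".join(phrase) + '"'] + extras)
```

For example, search_pattern(doc, 1, background) quotes the shortest word
sequence starting at the second word of doc that is unique in doc, then
appends two terms that are unusually frequent there.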

In the presence of some more efficient key for a plain text scope such as

ftp://ftp.ietf.org/rfc/rfc2822.txt

...the latter group of uniquifying search keys would be suppressed in
deference to the scoping/locating URI.  What we need once again, as for
references to time-range-delimited slices of temporal-media objects, is
something that works like standard fieldnames for use in search patterns in
search URIs.
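
Purely as an illustration of the shape this could take, the Python sketch
below attaches a quoted phrase to a scoping URI under a field name.  The
field name "phrase", the use of the fragment part to carry it, and the
function name where_it_says are all assumptions invented for this example;
nothing here is an agreed or proposed syntax.

```python
from urllib.parse import quote, urlencode

def where_it_says(scope_uri, phrase, field="phrase"):
    """Append a field=value search key to a scoping URI as its fragment.

    `field` is a hypothetical field name; spaces are percent-encoded
    rather than turned into '+' signs.
    """
    key = urlencode({field: phrase}, quote_via=quote)
    return scope_uri + "#" + key

if __name__ == "__main__":
    print(where_it_says("ftp://ftp.ietf.org/rfc/rfc2822.txt",
                        "line length limits"))
```

Because the scoping URI already identifies the document, only the phrase is
carried;  the extra uniquifying terms are left out.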

If our string-matching-for-whereItSays-location doesn't work much as it does
in Internet search services, our users will be royally confused.

Al

[1] http://web.archive.org/web/19990909043116/http://www.access.digex.net/~asgilman/web-access/wis_rfc.html

[2] http://www.cs.berkeley.edu/~phelps/Robust/papers/RobustHyperlinks.html

>Larry
>--
>http://larry.masinter.net
Received on Monday, 21 April 2003 11:35:54 UTC