RE: Using XPointer with HTML

On Wed, 3 Apr 2002, Marja-Riitta Koivunen wrote:

> At 07:56 PM 4/3/2002 +0100, Nick Kew wrote:
>
> >On Wed, 3 Apr 2002, Laurent Denoue wrote:
> >
> > > Based on his research, David Bargeron (followed by Ka-Ping and myself)
> > pointed out that
> > > "using 'human-level' page content as the basis for anchoring was more
> > effective than using the internal structure of the page".
>
> I can imagine that human-level page content pointers are may be easy to
> come up with some pages e.g. those that follow the form of common book, a
> research document, play etc. format. With these formats it is easy to
> define unambiguous pointers that are both intuitive for users and exact for
> machines. For instance, we can talk about Chapter 2 in a book or References
> section or the Chapter named "Architecture". However, many Web pages do not
> follow such commonly known human-level formats. How should we then point to
> them exactly and intuitively?

Well, the Fuzzy Pointers Jim and I are using each reference an element
in the markup (as opposed to a full-blown XPointer, which deals with
offsets and ranges).  This harnesses the document structure, and seems
to me to be the simplest and most robust way to do so.  Robustness
is further improved by computing a measure of whether the document
structure has changed between producing and reading the metadata.

A reference to "Chapter 2" in a book is a pointer to a structural
element, and is exactly analagous to our pointers.  "The chapter named
'Architecture'" is then the case where we have an ID.

Stretching the analogy, XPointer offsets and ranges could be said to
refer to pages, lines and even words.  In another edition of the book,
these may very well be invalid, but references to a chapter will be
more robust.

On te Web, documents change more frequently than editions of a book,
which makes it harder to produce robust pointers.  On the other hand,
electronic processing makes it easier.

> Even these Web pages have a structure consisting of elements that come from
> the standard mark-up languages. This structure is often visible to users
> even if they are not familiar of the mark-up languages. So I don't think it
> is totally unintuitive to use this structure. For instance, Chapter 2 could
> in XPointer format refer to the second level 2 header, or Chapter with
> "Architecture" could refer either to the 3rd level 2 header that happens to
> have that name or directly refer to the the level 2 header with
> "Architecture" string. I do agree that there are cases, such as with the
> headers in HTML, where it would be easier to think that the paragraphs
> under each header are actually under the header also in the tree structure.

Indeed, we can infer structure.  An exercise to the reader: take an
HTML page, change the FPI to ISO/IEC HTML, and run it through (say)
sgmlnorm.  The output will be normalised to the DTD, which infers
exactly that kind of document structure, each <H2> and its descendents
being enclosed in an implicit <DIV2> from the DTD.

But in the general case where such structural elements are seldom
used, it would, I suspect, be problematic to infer such structure.
We should use instead the actual structure.

Take for instance:

<h2> something </h2>
<p>bla bla bla</p>
<table> .... complex stuff here ... </table>
<p>some more</p>
<hr>
<p>a third para</p>

We might reasonably attach the first <p> to the <h2>, but whether the
<table> and the second <p> belong there cannot robustly be inferred
(unless the author is using a structurally strict DTD such as ISO HTML).
The third <p> after the <hr> is unlikely to belong to the <h2>, but
this isn't really something we can robustly infer beyond the simplistic
ISO-HTML approach of normalising to:
<div2>
 (the above markup)
</div2>

Using Fuzzy Pointers, we reference the actual elements.  We could instead
reference an implied <div2>, but I think this analysis is more useful for
markup similarity measures than for anchoring actual pointers.

> Annotea uses currently the XPointers to be able to point to exact locations
> in the element tree structure. Using XPointer does not in anyway prevent
> adding more human-level content to the annotation schemas to help detect
> document changes. This could be done at least in two ways:

Indeed it does, with the proviso that Annotea's "XPointers" are also
not strictly XPointers at all, since you too are applying them to HTML.

> Also it is possible to add more info, such as a checksum, so that we know
> if the documents have been changed or not.  It would be great if the Web
> server could give us that information.

A checksum is not particularly useful, because it will be messed up
by trivial changes - e.g. a document that displays "todays date" but
otherwise doesn't change.  That's why Jim and I are using smarter
structural hashes to measure change.

-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>

Received on Wednesday, 3 April 2002 18:51:44 UTC