Metrics for Markup (Re: Using XPointer with HTML) from Nick Kew on 2002-04-11 (www-annotation@w3.org from January to June 2002)

From: Nick Kew <nick@webthing.com>
Date: Fri, 12 Apr 2002 00:34:24 +0100 (BST)
To: Steven Pemberton <steven.pemberton@cwi.nl>
cc: <www-annotation@w3.org>, <w3c-wai-er-ig@w3.org>, HTML WG <w3c-html-wg@w3.org>
Message-ID: <20020412003000.G63170-100000@fenris.webthing.com>
Metrics for Markup Change Detection
===================================

Ideas for working towards an RDF Formalism
==========================================

When dealing with metadata about Web resources, we need to be able
to identify unambiguously the subject of our metadata.

THE PROBLEM
===========

Neither URI nor XPointer is sufficient for this, for several reasons.
Consider an assertion:

  Nick asserts that the third word in the second paragraph of
        <URL:http://example.org/example1.html> is misspelt.

This kind of assertion is the starting point for metadata schemas
such as EARL and Annotea.  However, it has some obvious shortcomings:

1.  It needs to be dated.  In the absence of a date, the error may
    have been corrected, invalidating the assertion.  Or it may
    still be there, but at a different point on the page due to
    changes unrelated to the error.

2.  If it is a content-negotiated page, then the assertion applies
    only to one instance of it: a misspelt English version doesn't
    affect the French, German, Russian or Arabic versions.

We can deal with these using a more complex assertion:

  Nick asserts that on April 11th 2002 at 20:05 GMT, the third word
        in the second paragraph of the document at <URL:...>
        as represented in the English language version is incorrect.

For completeness, we need to be more explicit about content negotation
A second assertion is needed to complete the unambiguous identification:

  Valet asserts that on April 11th 2002 at 20:05 GMT, the server at
        http://example.org/ confirmed that <URL:...> offers no content
        negotiation other than language.

This is getting rather messy.  Furthermore, it is inflexible:
If a page has no last-modified date, or a date later than Nick's
assertion, we have absolutely no information on whether the assertion
is still valid.

On a more technical note, we identify a third problem.
3.  Our original assertion presupposes that "the third word in the
    second paragraph" is a well-defined concept.  If we are dealing with
    well-formed XML then XPointer gives us this, but most pages on the
    Web today are neither XML nor well-formed.


THE GOALS
=========

In order to overcome these problems, we set ourselves three goals,
and seek a holistic approach to meeting them.

(1) A generalised Pointer, applicable to HTML and tag-soup
(2) A sensitive measure of content change, and whether change is
"significant"
(3) A means for recording dependency on Content Negotiation.

The third is trivial housekeeping.  The first and second are more
interesting, and are the subject of this note

A GENERALISED MARKUP POINTER
============================

Background:

1. In the WAI/ER working group, we needed a robust pointer for referencing
   an element in a document.  Using an experimental approach, Jim Ley and
I
   devised such a pointer, which currently enables his Client S/W to use
   pointers generated by Page Valet, even on tag-soup markup.
2. The Annotea project uses what it describes as an XPointer, and which
   is in fact an XPointer into a normalisation of non-XML markup.
   In practice this is very similar to (1).
3. The HTML WG has declined to consider defining a "canonical normalisation":

SP = Steven Pemberton
NK = Nick Kew

SP> Therefore the answer to the question "what should an XPointer into
SP> HTML look like?" is a very loud "it depends".

NK> Indeed.  It depends on defining a canonical normalisation of HTML.
NK> If we can do that, we're fine.

SP> And what I said is: that is a minefield onto which we [the HTML
working
SP> group] do not want to step.

Now, we can actually step around that minefield.  Instead of discussing
a Canonical Normalisation, we postulate an Abstract Normalisation, of
which our current implementations are instantiations.  So the problem
is now split into two parts:
        * To define an Abstract Normalisation
        * To define the metadata that will describe an instantiation

Having done so, we can go ahead and use pointers into a normalised
representation of any markup.  The associated metadata will enable
another agent to reconstruct the normalisation unambiguously.

Abstract Normalisation
======================

I would suggest two candidates for this, both subject to the condition
that the normalisation preserve all the original elements and their order,
but not necessarily their structural relationships.

Either
  (1) Normalisation to well-formed XML,
or
  (2) Any normalisation having the property that every element can be
      referenced unambiguously by a path from a root element.

(1) is probably the more generally useful, as it enables us to apply all
the usual XML technologies.  Although the weaker (2) is sufficient for the
problem we have set ourselves here, I will assume normalisation to
XML for the following discussion.

Instantiation
=============

The fundamental problem is that for a given non-XML document - from
valid HTML to tag-soup - a normalisation may not be unique.
For example:

* Implied elements in SGML.  Does
   <table><tr><td>FOO</table>
  get normalised to
   (a)  <table><tr><td>FOO</td></tr></table>
   (b)  <table><tbody><tr><td>FOO</td></tr></tbody></table>

(b) is a simple normalisation, but a parser working from an HTML DTD
might also normalise to (c).

* Treatment of empty elements.  Normalising
    <p>this<br>that<hr>the other</p>
  do we get
   (a) <p>this<br/>that<hr/>the other</p>
   (b) <p>this<br>that<hr>the other</hr></br></p>
  or indeed other variants?

Page Valet with an HTML DTD will normalise to (a), but in the
absence of any DTD will give us (b).  This is significant, because
it materially affects the XPath to the <hr> element and the text
around it.

We can disambiguate this by describing our normalisation.  For example,
we can describe our normalisation as "the XML generated by tool X
using settings Y", or we can define a precise spec.

In practical terms, we should use a URL to describe a normalisation
scheme.
This gives us a third option: instead of *describing* a normalisation,
the URL can be a webservice that *implements* the normalisation.  Or in
RDF we can offer two URLs: one to the scheme described in Chapter 4.5
of the mod_xml book; the other to a webservice implementing it.

I'm not going to try and formulate this as RDF, or I'll not finish
this evening, and lose momentum.  But I hope the above at least conveys
the basic idea.

MEASURING CONTENT CHANGE
========================

In the absence of date information (including valid Last-Modified headers)
to tell us when a resource has changed, we need to look at document
contents to detect changes.  The simplest measure is a checksum.

However, we can do better than that.  A checksum tells us nothing about
the magnitude of a change, so that for example a document containing
"todays" date might be updated daily without affecting the validity
of metadata assertions.

Since markup implies structure, we can improve on a simple checksum
by computing hashes not on the document itself, but on a suitable
representation of it.  We can then refine our measure by considering
only certain structural elements of interest, so that a mere date
change is ignored, or (conversely) detected as distinct from a
structural change - if we are looking for a spelling mistake to be
fixed.

A first experiment in this is described at
<URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2001Dec/0029.html>
and implemented at
<URL:http://valet.webthing.com/misc/dochash.html>
with source code at
<URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jan/0019.html>
As you can see from the code, it works by filtering on an ESIS
representation
generated by onsgmls.

This was found to be successful at tracking change at different levels of
significance, and successfully detected structural similarity over
changes in rapidly-changing news sites such as CNN.  A variant on it
(with an additional hash for all Form elements) is in use in Site
Valet's Problem Reporting and Tracking database.

In terms of determining the validity of an XPointer following a change
to a document, we have proposed hashing on the document elements:
<URL:http://lists.w3.org/Archives/Public/www-annotation/2002JanJun/0098.html>
and concluded that this can be implemented even in the restricted
environment of a browser with scripting:
<URL:http://lists.w3.org/Archives/Public/www-annotation/2002JanJun/0118.html>

Note that for this to be unambiguous presupposes normalisation to XML
as described above.  We can use any of the techniques described above.

To formalise a measure of change detection, we must take the
normalisation,
and use additional metadata to describe our hashes.  As before, we can
use a URL to describe it and/or a webservice implementing it.
Alternatively we can describe it directly in metadata: something like

<normalisation rdf:parseType="Resource">
<description rdf:resource="isbn:mod_xml book/section4/chapter5/part1"/>
<implementation rdf:resource="http://not.implemented.yet/normaliser/"/>
</normalisation>
<hashtype>MD5</hashtype>
<representation>DOM</representation>
<hashsubject rdf:parseType="Resource">
<elements>
<element>H1</elements>
<element>H2</elements>
<element>H3</elements>
<element>H4</elements>
<element>H5</elements>
<element>H6</elements>
</elements>
</hashsubject>

(sorry, I'm tired and I really want to get this don tonight; I'm
going to stop now and post unfinished for discussion).

-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>
Received on Thursday, 11 April 2002 19:34:48 UTC