Robust metadata revisited from Nick Kew on 2005-11-08 (public-wai-ert@w3.org from November 2005)

From: Nick Kew <nick@webthing.com>
Date: Tue, 8 Nov 2005 09:19:31 +0000
To: public-wai-ert@w3.org
Message-Id: <200511080919.32812.nick@webthing.com>
In [1] and [2] I discussed problems with referencing content
within a webpage, and proposed some measures.  In summary:

1. XML techniques are potentially useful, but not well-specified
   on the Web at large.
2. We need ways of dealing with content change.
3. We need to deal with negotiated content.

Negotiated content is easy to deal with - we just need to
qualify our URLs with the HTTP request headers used in negotiation.
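
As a rough illustration of qualifying a URL, the sketch below pairs a URL with the request headers that took part in negotiation. The names QualifiedURL, qualify and NEGOTIATION_HEADERS are hypothetical, not EARL vocabulary, and the choice of headers is an assumption; a real tool would more likely consult the server's Vary response header.

```python
from dataclasses import dataclass

# Assumption: these request headers are the ones that drive negotiation.
# In practice the server's Vary response header names the relevant ones.
NEGOTIATION_HEADERS = frozenset(
    {"accept", "accept-charset", "accept-encoding", "accept-language"}
)


@dataclass(frozen=True)
class QualifiedURL:
    """A URL together with the negotiation headers sent when fetching it."""
    url: str
    headers: tuple  # sorted (lowercased name, value) pairs


def qualify(url, request_headers):
    """Keep only negotiation headers; header names are case-insensitive."""
    pairs = sorted(
        (name.lower(), value.strip())
        for name, value in request_headers.items()
        if name.lower() in NEGOTIATION_HEADERS
    )
    return QualifiedURL(url, tuple(pairs))
```

Two requests then refer to the same variant only when both the URL and the retained headers compare equal; unrelated headers such as Cookie are dropped here by assumption.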

Locators within a page are more problematic.  We have agreed that
EARL should offer a wide range of options, but it is harder to
define locations that are robust against content change.
XML techniques (XPath, XPointer) are the most useful for
referencing markup, but don't apply to HTML or tag-soup.
We can work around this, but at a significant cost in
complexity, as discussed in [1] and [2].  We should decide
now where to compromise between the conflicting requirements
to minimise both ambiguity and complexity.

Dropping the full generality of my previous proposal, the obvious
candidate for this is the HTML DOM.  If we have a DOM on a document,
as constructed by a browser, then we have normalised it implicitly
to XML.  As far as I can tell, the DOM does not deal with the
problems of ambiguity (I can't find any discussion of it).
That leaves us to decide:
  (a) What level of ambiguity is acceptable and/or unavoidable?
  (b) Can and should we canonicalise construction of a DOM?

The first problem is the harder, and boils down to error-correction
of badly broken tag-soup.  The only way we can deal with it fully
unambiguously is by defining normalisation in terms of a particular
implementation - software tool, library, or webservice.  If we
go down this route, I can offer to implement it as a webservice
and open-source code based on the HTMLparser module from libxml2.


If we consider some ambiguity acceptable, we can remove the
dependence on an implementation.  All we then need to do is
to specify rules regarding insertion of implied tags in an
HTML DTD.  There are several levels to consider, illustrated by
this example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title>foo</title>
Here is some text.
<table><tr><td>Here is a table</td></tr></table>
<p/some valid markup is unsupported on the web/
<script type="text/javascript">
document.write("<p>but superficially well-formed script is a problem.</p>");
</script>

 (1) <head> and <body> are implied, and can be unambiguously
     inserted.  I think there is little doubt we should do so.
     <tbody> can be treated similarly; should we?
 (2) The bare body text could be considered as implying <p> or <div>,
     which would make it valid HTML.  We can do that by defining
     "best correction" rules (libxml2 and tidy have them), but we
     probably don't want that.
 (3) We can probably ignore shorttags and NET-enabling tags (everyone
     else does).  But what about cases where markup "within" scripting
     events totally changes the meaning of a document?
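
Level (1) can be sketched without reference to a particular implementation. The code below is only an illustration, using Python's standard html.parser as a tokenizer: it lists start tags in document order and marks implied <html>, <head> and <body> with an asterisk. The names ImpliedTags and implied_tags are hypothetical, <tbody> is deliberately not inserted (that question is open above), and levels (2) and (3) are not attempted.

```python
from html.parser import HTMLParser

# Elements HTML 4.01 allows in <head>. <script> is also legal in <body>,
# so it counts as head content only before <body> has started.
HEAD_CONTENT = {"title", "base", "meta", "link", "style", "script"}
TEXT_HOLDERS = {"title", "style", "script"}  # their text is not body text


class ImpliedTags(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []      # start tags in order; implied ones end in "*"
        self.seen = set()   # which of html/head/body have started
        self.holder = None  # open head element whose text we must skip

    def _imply(self, tag):
        if tag not in self.seen:
            self.seen.add(tag)
            self.tags.append(tag + "*")

    def _start_body(self):
        for tag in ("html", "head", "body"):
            self._imply(tag)

    def handle_starttag(self, tag, attrs):
        if tag in ("html", "head", "body"):
            if tag != "html":
                self._imply("html")
            if tag == "body":
                self._imply("head")
            self.seen.add(tag)
        elif "body" not in self.seen and tag in HEAD_CONTENT:
            self._imply("html")
            self._imply("head")
            if tag in TEXT_HOLDERS:
                self.holder = tag
        else:
            self._start_body()
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag == self.holder:
            self.holder = None

    def handle_data(self, data):
        # Bare non-whitespace text outside head content implies <body>.
        if self.holder is None and data.strip():
            self._start_body()


def implied_tags(html):
    parser = ImpliedTags()
    parser.feed(html)
    parser.close()
    return parser.tags
```

This covers only the unambiguous insertions; anything resembling "best correction" of broken markup would still need either agreed rules or a reference implementation.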

Content change is the most interesting topic.
It has been demonstrated ([3], [4]) that we can define measures
that do not merely detect change (as a checksum or hash does), but
detect some kinds of change while ignoring others.

Some measures are easy to define on a DOM; for example, document
markup structure can be derived by discarding text, CDATA and
comment nodes, while document text is (to a first approximation)
derived by discarding everything but text nodes.
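
The two measures just described can be sketched as follows. This is a minimal illustration, not a proposed normative algorithm: it uses Python's standard html.parser, which tokenizes rather than building a normalised DOM, and the function name measures, the choice of SHA-1, the whitespace collapsing and the skipping of script/style content are all assumptions.

```python
import hashlib
from html.parser import HTMLParser


class _Measures(HTMLParser):
    def __init__(self):
        super().__init__()
        self.structure = []  # tags only: text and comments are discarded
        self.text = []       # text only: markup is discarded
        self._raw = None     # set while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.structure.append("<%s>" % tag)
        if tag in ("script", "style"):
            self._raw = tag

    def handle_endtag(self, tag):
        self.structure.append("</%s>" % tag)
        if tag == self._raw:
            self._raw = None

    def handle_data(self, data):
        if self._raw is None:  # script and style bodies are not document text
            self.text.append(data)


def measures(html):
    """Return (structure_hash, text_hash) for an HTML string."""
    parser = _Measures()
    parser.feed(html)
    parser.close()
    structure = "".join(parser.structure)
    text = " ".join("".join(parser.text).split())  # collapse whitespace
    return (
        hashlib.sha1(structure.encode("utf-8")).hexdigest(),
        hashlib.sha1(text.encode("utf-8")).hexdigest(),
    )
```

Rewording a paragraph changes only the second hash, and swapping one element for another changes only the first, which is the property a structure-only or text-only assertion would rely on.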

This can be used to determine programmatically whether a change
to a document affects EARL assertions.  For example:

 * An assertion about "avoid deprecated markup" need only concern
   itself with document structure.  So if it computes a hash on
   markup structure, any change that doesn't affect that hash
   is known not to affect the validity of the assertion.
 * An assertion about "use clear and simple language" can similarly
   ignore structure and look only at body text.
 * An assertion about table structure can ignore everything outside
   the table in question, and can also ignore the contents of table
   cells, and all attributes other than those relevant to the assertion.


As the third case above shows, we can reduce the problem space very
substantially in some instances.  This doesn't directly solve the
problem of identifying the table when other document contents
change, perhaps substantially, but that is not important.  The question
is whether the document still contains a matching table; if so, our
assertion (still) applies to it.
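
The table case can be sketched in the same spirit. The code below computes one hash per top-level table from its structural tags and a chosen set of attributes, ignoring cell contents and everything outside the table. The names table_hashes and keep_attrs, the default attribute set and the use of SHA-1 are assumptions for illustration, and the sketch assumes end tags are explicit (or that the input has already been normalised).

```python
import hashlib
from html.parser import HTMLParser

TABLE_TAGS = {"table", "caption", "colgroup", "col",
              "thead", "tbody", "tfoot", "tr", "th", "td"}


class _TableSkeleton(HTMLParser):
    def __init__(self, keep_attrs):
        super().__init__()
        self.keep_attrs = frozenset(keep_attrs)
        self.depth = 0     # nesting depth of <table>
        self.current = []  # skeleton of the table being read
        self.tables = []   # finished skeletons, one per top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        if self.depth and tag in TABLE_TAGS:
            # Keep only the attributes relevant to the assertion.
            kept = sorted((k, v or "") for k, v in attrs
                          if k in self.keep_attrs)
            self.current.append((tag, tuple(kept)))

    def handle_endtag(self, tag):
        if self.depth and tag in TABLE_TAGS:
            self.current.append(("/" + tag, ()))
        if tag == "table" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.tables.append(self.current)
                self.current = []


def table_hashes(html, keep_attrs=("scope", "headers")):
    """Return one hash per top-level table; cell contents are ignored."""
    parser = _TableSkeleton(keep_attrs)
    parser.feed(html)
    parser.close()
    return [hashlib.sha1(repr(t).encode("utf-8")).hexdigest()
            for t in parser.tables]
```

An assertion that stored such a hash could then be treated as still applicable whenever that hash appears in table_hashes() of the changed document, regardless of where the table has moved.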

More examples are discussed in [2].

The computation of a local invariance measure is a property of a
test spec, and would therefore appear to fall outside the scope
of EARL.  The measure itself is a property of an Assertion.


References:
[1] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Apr/0029.html
[2] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jul/att-0017/metrics
[3] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2001Dec/0029.html
[4] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jan/0019.html
 
-- 
Nick Kew
Received on Tuesday, 8 November 2005 09:19:47 UTC