- From: Nick Kew <nick@webthing.com>
- Date: Tue, 8 Nov 2005 09:19:31 +0000
- To: public-wai-ert@w3.org
In [1] and [2] I discussed problems with referencing content within a webpage, and proposed some measures. In summary:

1. XML techniques are potentially useful, but not well-specified on the Web at large.
2. We need ways of dealing with content change.
3. We need to deal with negotiated content.

Negotiated content is easy to deal with - we just need to qualify our URLs with the negotiated HTTP headers.

Locators within a page are more problematic. We have agreed that EARL should offer a wide range of options, but it is harder to define locations that are robust against content change. XML techniques (XPath, XPointer) are the most useful for referencing markup, but don't apply to HTML or tag-soup. We can work around this, but at a significant cost in complexity, as discussed in [1] and [2]. We should decide now where to compromise between the conflicting requirements to minimise both ambiguity and complexity.

Dropping the full generality of my previous proposal, the obvious candidate for this is the HTML DOM. If we have a DOM on a document, as constructed by a browser, then we have normalised it implicitly to XML. As far as I can tell, the DOM does not deal with the problems of ambiguity (I can't find any discussion of it). That leaves us to decide:

(a) What level of ambiguity is acceptable and/or unavoidable?
(b) Can and should we canonicalise construction of a DOM?

The first question is the harder, and boils down to error-correction of badly broken tag-soup. The only way we can deal with it fully unambiguously is by defining normalisation in terms of a particular implementation - a software tool, library, or webservice. If we go down this route, I can offer to implement it as a webservice and open-source code based on the HTMLparser module from libxml2 (a rough sketch of this approach appears below).

If we consider some ambiguity acceptable, we can remove the dependence on an implementation. All we then need to do is specify rules for inserting the tags implied by an HTML DTD. There are several levels to consider:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title>foo</title>
Here is some text.
<table><tr><td>Here is a table</td></tr></table>
<p/some valid markup is unsupported on the web/
<script type="text/javascript">
document.write("<p>but superficially well-formed script is a problem.</p>");
</script>

(1) <head> and <body> are implied, and can be unambiguously inserted. I think there is little doubt we should do so. <tbody> can be treated similarly; should we?

(2) The bare body text could be considered as implying <p> or <div>, which would make it valid HTML. We can do that by defining "best correction" rules (libxml2 and tidy have them), but we probably don't want that.

(3) We can probably ignore shorttags and NET-enabling tags (everyone else does). But what about cases where markup "within" scripting events totally changes the meaning of a document?

Moving on to content change, this is the most interesting topic. It has been demonstrated ([3], [4]) that we can define measures that not merely detect change (checksum/hash), but can detect some kinds of change while ignoring others. Some measures are easy to define on a DOM: for example, document markup structure can be derived by discarding text, CDATA and comment nodes, while document text is (to a first-order measure) derived by discarding everything but text nodes.
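To make the implementation-defined route concrete, here is a minimal sketch of normalising tag-soup into a DOM with libxml2's HTML parser. I use the lxml Python binding purely for brevity - the binding, the sample input and the printed output are illustrative assumptions, not a proposed interface:

# Minimal sketch: normalise tag-soup to a DOM with libxml2's HTML parser.
# The lxml binding is used only for brevity; the same parser is available
# directly from C via libxml2's HTMLparser module.
from lxml import etree, html

soup = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title>foo</title>
Here is some text.
<table><tr><td>Here is a table</td></tr></table>
"""

doc = html.document_fromstring(soup)

# The parser has inserted the implied <html>, <head> and <body> elements,
# so XPath and other XML techniques now apply as they would to XHTML.
print(doc.tag)                          # 'html'
print([child.tag for child in doc])     # ['head', 'body']
print(etree.tostring(doc, pretty_print=True).decode())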
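Returning to the measures just described, a rough sketch of the two first-order measures - markup structure (discarding text, CDATA and comment nodes) and body text (discarding everything but text nodes) - might look like the following. The function names, the choice of SHA-1, the treatment of attributes and the whitespace normalisation are my own illustrative assumptions, not part of any spec:

import hashlib
from lxml import html

def structure_digest(doc):
    # Keep element names and attribute names only; text, CDATA and
    # comment nodes are discarded.
    parts = []
    for el in doc.iter():
        if isinstance(el.tag, str):            # skip comments and PIs
            parts.append(el.tag)
            parts.extend(sorted(el.attrib))    # attribute names, not values
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()

def text_digest(doc):
    # Keep text nodes only; markup is discarded and whitespace collapsed
    # as a (debatable) first-order choice.
    parts = []
    for el in doc.iter():
        if isinstance(el.tag, str) and el.text:   # element text, not comments
            parts.append(el.text)
        if el.tail:                               # text following any node
            parts.append(el.tail)
    text = " ".join(" ".join(parts).split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# "page.html" stands in for whatever document is under test.
doc = html.document_fromstring(open("page.html", "rb").read())
print(structure_digest(doc), text_digest(doc))

A change that only alters wording leaves structure_digest untouched, while a change that only rearranges markup leaves text_digest untouched - which is the distinction the examples below rely on.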
This can be used to determine programmatically whether a change to a document affects EARL assertions. For example:

* An assertion about "avoid deprecated markup" need only concern itself with document structure. So if it computes a hash on markup structure, any change that doesn't affect that hash is known not to affect the validity of the assertion.

* An assertion about "use clear and simple language" can similarly ignore structure and look only at body text.

* An assertion about table structure can ignore everything outside the table in question, and can also ignore the contents of table cells and all attributes other than those relevant to the assertion.

As we see from the third case above, we can very substantially reduce the problem space in some instances. This doesn't directly deal with the problem of identifying the table if other document contents change - perhaps substantially - but that's not important: the question is whether there is any matching table; if so, our assertion (still) applies to it. More examples are discussed in [2].

The computation of a local invariance measure is a property of a test spec, and would therefore appear to fall outside the scope of EARL. The measure itself is a property of an Assertion.

References:

[1] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Apr/0029.html
[2] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jul/att-0017/metrics
[3] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2001Dec/0029.html
[4] http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jan/0019.html

--
Nick Kew