From: Nick Kew <nick@webthing.com>
Date: Fri, 12 Apr 2002 00:34:24 +0100 (BST)
To: Steven Pemberton <steven.pemberton@cwi.nl>
cc: <www-annotation@w3.org>, <w3c-wai-er-ig@w3.org>, HTML WG <w3c-html-wg@w3.org>
Metrics for Markup Change Detection
===================================

Ideas for working towards an RDF Formalism
==========================================

When dealing with metadata about Web resources, we need to be able to
identify unambiguously the subject of our metadata.

THE PROBLEM
===========

Neither URI nor XPointer is sufficient for this, for several reasons.
Consider an assertion:

  Nick asserts that the third word in the second paragraph of
  <URL:http://example.org/example1.html> is misspelt.

This kind of assertion is the starting point for metadata schemas such
as EARL and Annotea. However, it has some obvious shortcomings:

1. It needs to be dated. In the absence of a date, the error may have
   been corrected, invalidating the assertion. Or it may still be
   there, but at a different point on the page due to changes
   unrelated to the error.

2. If it is a content-negotiated page, then the assertion applies only
   to one instance of it: a misspelt English version doesn't affect
   the French, German, Russian or Arabic versions.

We can deal with these using a more complex assertion:

  Nick asserts that on April 11th 2002 at 20:05 GMT, the third word in
  the second paragraph of the document at <URL:...>, as represented in
  the English language version, is incorrect.

For completeness, we need to be more explicit about content
negotiation. A second assertion is needed to complete the unambiguous
identification:

  Valet asserts that on April 11th 2002 at 20:05 GMT, the server at
  http://example.org/ confirmed that <URL:...> offers no content
  negotiation other than language.

This is getting rather messy. Furthermore, it is inflexible: if a page
has no Last-Modified date, or a date later than Nick's assertion, we
have no information on whether the assertion is still valid.

On a more technical note, we identify a third problem:

3. Our original assertion presupposes that "the third word in the
   second paragraph" is a well-defined concept. If we are dealing with
   well-formed XML then XPointer gives us this, but most pages on the
   Web today are neither XML nor well-formed.

THE GOALS
=========

In order to overcome these problems, we set ourselves three goals, and
seek a holistic approach to meeting them:

(1) A generalised pointer, applicable to HTML and tag-soup.
(2) A sensitive measure of content change, and of whether a change is
    "significant".
(3) A means of recording dependency on content negotiation.

The third is trivial housekeeping. The first and second are more
interesting, and are the subject of this note.

A GENERALISED MARKUP POINTER
============================

Background:

1. In the WAI/ER working group, we needed a robust pointer for
   referencing an element in a document. Using an experimental
   approach, Jim Ley and I devised such a pointer, which currently
   enables his client software to use pointers generated by Page
   Valet, even on tag-soup markup.

2. The Annotea project uses what it describes as an XPointer, and
   which is in fact an XPointer into a normalisation of non-XML
   markup. In practice this is very similar to (1).

3. The HTML WG has declined to consider defining a "canonical
   normalisation":

   SP = Steven Pemberton
   NK = Nick Kew

   SP> Therefore the answer to the question "what should an XPointer
   SP> into HTML look like?" is a very loud "it depends".

   NK> Indeed. It depends on defining a canonical normalisation of
   NK> HTML. If we can do that, we're fine.

   SP> And what I said is: that is a minefield onto which we [the HTML
   SP> working group] do not want to step.

Now, we can actually step around that minefield. Instead of discussing
a Canonical Normalisation, we postulate an Abstract Normalisation, of
which our current implementations are instantiations. So the problem
is now split into two parts:

* To define an Abstract Normalisation.
* To define the metadata that will describe an instantiation.

Having done so, we can go ahead and use pointers into a normalised
representation of any markup. The associated metadata will enable
another agent to reconstruct the normalisation unambiguously.

Abstract Normalisation
======================

I would suggest two candidates for this, both subject to the condition
that the normalisation preserve all the original elements and their
order, but not necessarily their structural relationships. Either

(1) normalisation to well-formed XML, or
(2) any normalisation having the property that every element can be
    referenced unambiguously by a path from a root element.

(1) is probably the more generally useful, as it enables us to apply
all the usual XML technologies. Although the weaker (2) is sufficient
for the problem we have set ourselves here, I will assume
normalisation to XML for the following discussion.

Instantiation
=============

The fundamental problem is that for a given non-XML document - from
valid HTML to tag-soup - a normalisation may not be unique. For
example:

* Implied elements in SGML. Does

    <table><tr><td>FOO</table>

  get normalised to

    (a) <table><tr><td>FOO</td></tr></table>
    (b) <table><tbody><tr><td>FOO</td></tr></tbody></table>

  (a) is the simple normalisation, but a parser working from an HTML
  DTD, in which a <table> implies a <tbody>, might normalise to (b).

* Treatment of empty elements. Normalising

    <p>this<br>that<hr>the other</p>

  do we get

    (a) <p>this<br/>that<hr/>the other</p>
    (b) <p>this<br>that<hr>the other</hr></br></p>

  or indeed other variants? Page Valet with an HTML DTD will normalise
  to (a), but in the absence of any DTD will give us (b). This is
  significant, because it materially affects the XPath to the <hr>
  element and the text around it.
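To make that concrete, here is a minimal illustrative sketch (Python;
the path_to helper is a toy stand-in for a real XPath implementation,
and is my own invention, not part of Page Valet or any tool mentioned
above) showing that the two normalisations place the <hr> element at
different paths:

  import xml.etree.ElementTree as ET

  # The two candidate normalisations of <p>this<br>that<hr>the other</p>
  norm_a = "<p>this<br/>that<hr/>the other</p>"           # with HTML DTD
  norm_b = "<p>this<br>that<hr>the other</hr></br></p>"   # no DTD

  def path_to(root, tag):
      # Depth-first search; returns a /-separated path to the first <tag>.
      def walk(elem, trail):
          if elem.tag == tag:
              return trail + [tag]
          for child in elem:
              found = walk(child, trail + [elem.tag])
              if found:
                  return found
          return None
      return "/" + "/".join(walk(root, []))

  print(path_to(ET.fromstring(norm_a), "hr"))   # prints /p/hr
  print(path_to(ET.fromstring(norm_b), "hr"))   # prints /p/br/hr

A pointer such as /p/hr is thus meaningful only relative to a stated
normalisation.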
We can disambiguate this by describing our normalisation. For example,
we can describe our normalisation as "the XML generated by tool X
using settings Y", or we can define a precise spec. In practical
terms, we should use a URL to describe a normalisation scheme. This
gives us a third option: instead of *describing* a normalisation, the
URL can be a webservice that *implements* the normalisation. Or in RDF
we can offer two URLs: one to the scheme described in Chapter 4.5 of
the mod_xml book; the other to a webservice implementing it.

I'm not going to try and formulate this as RDF, or I'll not finish
this evening and lose momentum. But I hope the above at least conveys
the basic idea.

MEASURING CONTENT CHANGE
========================

In the absence of date information (including valid Last-Modified
headers) to tell us when a resource has changed, we need to look at
document contents to detect changes.

The simplest measure is a checksum. However, we can do better than
that. A checksum tells us nothing about the magnitude of a change, so
that, for example, a document containing today's date might be updated
daily without the validity of metadata assertions being affected.
Since markup implies structure, we can improve on a simple checksum by
computing hashes not on the document itself, but on a suitable
representation of it.
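As a minimal sketch of that idea (Python; the helper names and the
choice of MD5 over a tag sequence are illustrative assumptions, not
the Site Valet implementation), we can hash the element structure
rather than the raw bytes, and - anticipating the refinement discussed
next - optionally restrict the hash to selected elements:

  import hashlib
  import xml.etree.ElementTree as ET

  def structural_hash(xml_text):
      # Hash the sequence of element names only: pure text edits
      # (e.g. today's date) leave it unchanged.
      root = ET.fromstring(xml_text)
      shape = "/".join(elem.tag for elem in root.iter())
      return hashlib.md5(shape.encode("utf-8")).hexdigest()

  def element_hash(xml_text, tags=("h1", "h2", "h3", "h4", "h5", "h6")):
      # Hash the text of selected elements only.
      root = ET.fromstring(xml_text)
      texts = [(elem.text or "") for elem in root.iter() if elem.tag in tags]
      return hashlib.md5("|".join(texts).encode("utf-8")).hexdigest()

  yesterday = "<html><body><h1>News</h1><p>11 April 2002</p></body></html>"
  today     = "<html><body><h1>News</h1><p>12 April 2002</p></body></html>"

  # A raw checksum changes on a mere date update...
  print(hashlib.md5(yesterday.encode()).hexdigest() ==
        hashlib.md5(today.encode()).hexdigest())               # False
  # ...but the structural and element hashes do not:
  print(structural_hash(yesterday) == structural_hash(today))  # True
  print(element_hash(yesterday) == element_hash(today))        # True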
We can then refine our measure by considering only certain structural
elements of interest, so that a mere date change is ignored, or
(conversely) detected as distinct from a structural change - if, say,
we are looking for a spelling mistake to be fixed.

A first experiment in this is described at

  <URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2001Dec/0029.html>

and implemented at

  <URL:http://valet.webthing.com/misc/dochash.html>

with source code at

  <URL:http://lists.w3.org/Archives/Public/w3c-wai-er-ig/2002Jan/0019.html>

As you can see from the code, it works by filtering on an ESIS
representation generated by onsgmls. This was found to be successful
at tracking change at different levels of significance, and
successfully detected structural similarity over changes in
rapidly-changing news sites such as CNN. A variant of it (with an
additional hash for all Form elements) is in use in Site Valet's
Problem Reporting and Tracking database.

In terms of determining the validity of an XPointer following a change
to a document, we have proposed hashing on the document elements:

  <URL:http://lists.w3.org/Archives/Public/www-annotation/2002JanJun/0098.html>

and concluded that this can be implemented even in the restricted
environment of a browser with scripting:

  <URL:http://lists.w3.org/Archives/Public/www-annotation/2002JanJun/0118.html>

Note that for this to be unambiguous presupposes normalisation to XML,
as described above; we can use any of the normalisation techniques
described there.

To formalise a measure of change detection, we must take the
normalisation, and use additional metadata to describe our hashes. As
before, we can use a URL to describe the hash scheme and/or a
webservice implementing it. Alternatively we can describe it directly
in metadata: something like

  <normalisation rdf:parseType="Resource">
    <description rdf:resource="isbn:mod_xml book/section4/chapter5/part1"/>
    <implementation rdf:resource="http://not.implemented.yet/normaliser/"/>
  </normalisation>
  <hashtype>MD5</hashtype>
  <representation>DOM</representation>
  <hashsubject rdf:parseType="Resource">
    <elements>
      <element>H1</element>
      <element>H2</element>
      <element>H3</element>
      <element>H4</element>
      <element>H5</element>
      <element>H6</element>
    </elements>
  </hashsubject>

(Sorry, I'm tired and I really want to get this done tonight; I'm
going to stop now and post unfinished for discussion.)

-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>