Locating the subject, Fuzzy Pointers and Hashes.

Hi,

Sorry for the lateness of this.  This is my overview of the previous Fuzzy 
Pointer suggestions, and of the other parts of locating the subject.

One of the big problems with inaccessible content is that it's also likely 
to be invalid content.  Because of this XPointers cannot be used, as they 
need a well-formed document to point into.  It's no good refusing to test 
the rest of the content simply because there's already a failure due to 
invalid content; we want to review everything.

XPointers are also not defined for use with HTML content, and because of the 
way HTML parsers have been written, different implementations do not create 
the same DOM representation even for valid content.  Because of this we 
developed the idea of a Fuzzy Pointer, which is defined against the infoset 
created as a result of parsing an invalid document.  This pointer was 
interoperable across many parsers, HTML renderers and validators - openSP, 
IE, Mozilla and Opera all created the same pointer on the same invalid 
documents.  This allowed us to identify elements more reliably than by 
row/column in the source (that information is often not available, and is 
unreliable against minor changes in the source such as whitespace).

Fuzzy Pointers also often persist across changes that fix the HTML 
validation issues.  This is an advantage, since it means we do not have to 
invalidate all the expensive checks.  For example:

With this document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<h1>Chickens!</h1>
<title>Example Page</title>
<img src="chicken.jpg" alt="[32324 bytes]">
</body>
</html>

There are two obvious errors: the document is invalid (title in the body) 
and the image doesn't have an appropriate alt.  Fixing the validation error 
could give us:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<title>Example Page</title>
<h1>Chickens!</h1>
<img src="chicken.jpg" alt="[32324 bytes]">
</body>
</html>

but the ALT error hasn't been fixed.  Because the document has changed, we 
would otherwise have no idea at all whether the error is still there; Fuzzy 
Pointers can be used to overcome this.  Whilst this example is easy to test 
again, if the test is an expensive one then re-testing may not be a 
practical option.
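
To make the persistence point concrete, here is a rough sketch in Python.  
It is not the Fuzzy Pointer definition itself, which isn't spelled out in 
this message.  As a stand-in I'm assuming a pointer is just the element name 
plus its ordinal among same-named elements in parse order (e.g. "img[1]"); 
PointerCollector and pointers_for are hypothetical names, and the lenient 
parser from Python's standard library stands in for the real parsers.

```python
from collections import Counter
from html.parser import HTMLParser


class PointerCollector(HTMLParser):
    """Assign an assumed 'name[ordinal]' pointer to every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = Counter()   # how many times each element name has occurred
        self.pointers = {}      # pointer string -> attributes of that element

    def handle_starttag(self, tag, attrs):
        self.seen[tag] += 1
        self.pointers[f"{tag}[{self.seen[tag]}]"] = dict(attrs)


def pointers_for(source):
    """Parse (possibly invalid) HTML and return the pointer -> attributes map."""
    collector = PointerCollector()
    collector.feed(source)
    collector.close()
    return collector.pointers


# The two example documents from this message.
BEFORE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<h1>Chickens!</h1>
<title>Example Page</title>
<img src="chicken.jpg" alt="[32324 bytes]">
</body>
</html>
"""

AFTER = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<title>Example Page</title>
<h1>Chickens!</h1>
<img src="chicken.jpg" alt="[32324 bytes]">
</body>
</html>
"""

# The alt-text failure was recorded against this pointer in BEFORE.
recorded = "img[1]"
before = pointers_for(BEFORE)
after = pointers_for(AFTER)

# The result carries over if the pointer still resolves to the same element.
still_applies = recorded in after and after[recorded] == before[recorded]
print(still_applies)
```

Moving the title changes the row of the img in the source, but under this 
assumed scheme the pointer still resolves to the same element, so the 
recorded alt failure can be kept without running the test again.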

There's another element we need though, since the document may have changed 
enough that a result ought to be invalidated.  For this we developed a 
number of hashes.  These again were based on structure: they took a hash of 
the structure of the document, or of the structure and the Hn element titles 
etc.  By seeing whether these change you can tell more about whether the 
document has changed.  Whilst this does not guarantee that the other test 
results are still relevant, it allows the computer to decide which are most 
likely to still be valid.
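
As a rough illustration of the two hashes described above, here is a sketch 
in Python.  The choice of SHA-1, the exact string that gets hashed, and the 
names StructureCollector and structure_hashes are all my assumptions for the 
example; the only parts taken from the description are "hash the structure" 
and "hash the structure plus the Hn text".

```python
import hashlib
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureCollector(HTMLParser):
    """Collect the sequence of start tags and the text of each Hn element."""

    def __init__(self):
        super().__init__()
        self.tags = []              # element names in parse order
        self.headings = []          # text content of each heading
        self._open_heading = None   # name of the heading we are inside, if any

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in HEADINGS:
            self._open_heading = tag
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            self._open_heading = None

    def handle_data(self, data):
        if self._open_heading is not None:
            self.headings[-1] += data


def structure_hashes(source):
    """Return (structure hash, structure-plus-headings hash) as hex strings."""
    collector = StructureCollector()
    collector.feed(source)
    collector.close()
    structure = "/".join(collector.tags)
    # Collapse whitespace so re-wrapping a heading does not change the hash.
    headings = "|".join(" ".join(h.split()) for h in collector.headings)
    plain = hashlib.sha1(structure.encode("utf-8")).hexdigest()
    with_headings = hashlib.sha1(
        (structure + "\n" + headings).encode("utf-8")
    ).hexdigest()
    return plain, with_headings


ORIGINAL = (
    '<html lang="en"><title>Example Page</title><h1>Chickens!</h1>'
    '<img src="chicken.jpg" alt="[32324 bytes]"></html>'
)
REINDENTED = ORIGINAL.replace("><", ">\n  <")        # whitespace change only
RETITLED = ORIGINAL.replace("Chickens!", "Ducks!")   # heading text changed
RESTRUCTURED = ORIGINAL.replace("<img", "<p><img")   # element added

base = structure_hashes(ORIGINAL)
print(structure_hashes(REINDENTED) == base)          # both hashes unchanged?
print(structure_hashes(RETITLED)[0] == base[0])      # structure hash unchanged?
print(structure_hashes(RETITLED)[1] == base[1])      # headings hash unchanged?
print(structure_hashes(RESTRUCTURED)[0] == base[0])  # structure hash unchanged?
```

A tool could then keep earlier results where both hashes match, treat them 
as suspect where only the headings hash differs, and schedule a re-test 
where the structure hash itself has changed.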

Eek, time for the meeting...

Cheers,

Jim. 

Received on Tuesday, 22 March 2005 16:58:21 UTC