Document Hashing of the structure of a document. from Jim Ley on 2002-03-21 (www-annotation@w3.org from January to June 2002)

From: Jim Ley <jim@jibbering.com>
Date: Thu, 21 Mar 2002 16:38:57 -0000
To: <w3c-wai-er-ig@w3.org>, <www-annotation@w3.org>
Message-ID: <02fb01c1d0f6$e689b040$ca969dc3@emedia.co.uk>

Hi,

I've had a quick look at document hashing comparing Mozilla 0.9.8 and
IE's normalisations:

http://jibbering.com/2002/3/documenthash.html  contains the source of the
script I used.

The algorithm is basically construct a string of all names of all the
nodes in document order that do not have any characters other than a-z
(to remove text, comment and non HTML namespace nodes.).  From that
string I then calculate 2 MD5 hashes one of the whole string, and one of
just the string from the body node.

http://jibbering.com/2002/3/hashcompare.html has the results.

Of the 9 I tested (this was just a quick test, I'll do some more if we
can get Amaya to also do the tests.) the only one that failed to match on
the "body hash" was http://www.ibm.com/ quite what happened with that one
I'm not sure, neither found a body element, but viewing it in a browser
you do have one (I proxy it through jibbering.com _without_ changing any
of the src's so document.write javascript in an external file isn't
generated.)

So it would seem IE and Mozilla have a very similar normalisation of the
body.

Received on Thursday, 21 March 2002 11:41:26 UTC