Document Hashing of the structure of a document.

Hi,

I've had a quick look at document hashing, comparing the normalisations
done by Mozilla 0.9.8 and IE:

http://jibbering.com/2002/3/documenthash.html contains the source of the
script I used.

The algorithm basically constructs a string from the names of all the
nodes in document order, skipping any name that contains characters
other than a-z (this removes text, comment and non-HTML-namespace
nodes). From that string I then calculate two MD5 hashes: one of the
whole string, and one of just the string from the body node.
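
For anyone who wants to try the idea outside a browser, the steps above
can be sketched roughly as follows. This is not the script linked above,
just a minimal Python illustration under some assumptions of mine: the
input is well-formed markup that xml.dom.minidom can parse, names are
lower-cased before the a-z test, names are joined with a single space,
and "from the body node" means everything from the first "body" name
onwards. The function names are made up for the example.

```python
import hashlib
import re
from xml.dom import minidom

# A name is kept only if it consists purely of the letters a-z.
# "#text", "#comment" and prefixed names like "svg:rect" fail this test.
# Note that, as stated, the rule also drops names such as "h1".
ONLY_LETTERS = re.compile(r"[a-z]+")


def node_names(node):
    """Yield the kept node names in document order (hypothetical helper)."""
    name = node.nodeName.lower()  # assumption: compare case-insensitively
    if ONLY_LETTERS.fullmatch(name):
        yield name
    for child in node.childNodes:
        yield from node_names(child)


def md5_hex(text):
    return hashlib.md5(text.encode("ascii")).hexdigest()


def structure_hashes(markup, sep=" "):
    """Return (whole_hash, body_hash); body_hash is None without a body."""
    doc = minidom.parseString(markup)
    names = list(node_names(doc.documentElement))
    whole_hash = md5_hex(sep.join(names))
    if "body" in names:
        # Assumption: hash the names from the first body node onwards.
        body_hash = md5_hex(sep.join(names[names.index("body"):]))
    else:
        body_hash = None
    return whole_hash, body_hash
```

Two documents that differ only in text or comments should give the same
pair of hashes, while a change in the head alone should alter the whole
hash but leave the body hash untouched.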

http://jibbering.com/2002/3/hashcompare.html has the results.

Of the 9 pages I tested (this was just a quick test; I'll do some more
if we can get Amaya to do the tests as well), the only one that failed
to match on the "body hash" was http://www.ibm.com/. Quite what happened
with that one I'm not sure: neither browser found a body element, yet
viewing the page in a browser you do have one. (I proxy it through
jibbering.com _without_ changing any of the src's, so anything that
document.write javascript in an external file would produce isn't
generated.)

So it would seem IE and Mozilla have a very similar normalisation of the
body.

Received on Thursday, 21 March 2002 11:41:26 UTC