- From: Simon Pieters <zcorpan@hotmail.com>
- Date: Sun, 16 Apr 2006 20:58:27 +0000
Hi,

From: "Serban Ghita" <serban.ghita@verasys.com>
>I have a web crawler, that i am using for personal research. It crawls the
>entire site, finding all the links and creating a sitemap, and grabs some
>statistics. After a while i felt that i can do more then that, so i have
>decided to make it parse html code and extract some statistics about tags.

You may be interested in Google's Web Authoring Statistics[1].

>For the moment i have created an array with all HTML tags (deprecated ones
>to), grouped by their structure type (block, inline, single - thats how i
>call them). I am parsing the HTML code using regular expressions, but as
>i've searched the net, i saw lots of people saying: dont parse html using
>regex.

You can't reliably parse HTML with regexp because HTML has more complicated
parsing rules.

>I studied a bit more, then i've found the relation between the HTML
>document and the DTD (Document Type Definition) declaration. I've noticed
>that browsers rely on it (the ones that are public are cached, and the
>custom ones are grabbed before the HTML document is parsed).

Actually, browsers don't parse DTDs at all for HTML.

>Can you point me out to some documentation that explains the way a browser
>parses HTML documents, or the way it uses the DTD document for interpreting
>the tags and their attributes.

It is specified in the Parsing section[2] of Web Applications 1.0.

[1] http://code.google.com/webstats/index.html
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

Regards,
Simon Pieters
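
To illustrate the point about regular expressions, here is a minimal sketch in Python that counts start tags in two ways: with a naive regex and with the standard library's html.parser. The sample markup, the regex, and the names tags_by_regex, tags_by_parser and TagCollector are all made up for this illustration. Note also that html.parser is only a tolerant tokenizer and does not implement the parsing algorithm described in [2]; it is used here just to show where a regex goes wrong.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a quoted attribute containing ">" and a comment
# that contains something that merely looks like a tag.
SAMPLE = '<p title="a > b">text</p><!-- <b>not a tag</b> --><i>x</i>'


def tags_by_regex(html):
    """Naive approach: treat anything shaped like <name ...> as a start tag."""
    return [name.lower()
            for name in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>", html)]


class TagCollector(HTMLParser):
    """Records the name of every start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_by_parser(html):
    """Tokenizer-based approach: comments and quoted attributes are handled."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("regex: ", tags_by_regex(SAMPLE))
    print("parser:", tags_by_parser(SAMPLE))
```

On this sample the regex reports a b element that exists only inside a comment, while the tokenizer skips it. A crawler gathering tag statistics would likely want the second behaviour, and ideally a parser that follows the rules in [2].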
Received on Sunday, 16 April 2006 13:58:27 UTC