[whatwg] hello list from Simon Pieters on 2006-04-16 (public-whatwg-archive@w3.org from April 2006)

From: Simon Pieters <zcorpan@hotmail.com>
Date: Sun, 16 Apr 2006 20:58:27 +0000
Message-ID: <BAY109-F12DA4599356BFBBCF6ABB6B4C60@phx.gbl>

Hi,

From: "Serban Ghita" <serban.ghita@verasys.com>
>I have a web crawler that I am using for personal research. It crawls the 
>entire site, finding all the links and creating a sitemap, and grabs some 
>statistics. After a while I felt that I could do more than that, so I 
>decided to make it parse HTML code and extract some statistics about tags.

You may be interested in Google's Web Authoring Statistics[1].

>For the moment I have created an array with all HTML tags (deprecated ones 
>too), grouped by their structure type (block, inline, single; that's what 
>I call them). I am parsing the HTML code using regular expressions, but 
>when I searched the net, I saw lots of people saying: don't parse HTML 
>using regex.

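You can't reliably parse HTML with regular expressions because HTML's 
parsing rules are more complicated than a regexp can express: tags can be 
omitted, comments and script content can contain text that looks like tags, 
and browsers recover from markup errors in specific ways.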
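
As a rough illustration of the point above, here is a small sketch that 
counts start tags in the same markup twice: once with a naive regexp and 
once with a real tokenizer. The sample markup, the regexp, and the names 
count_with_regex, count_with_parser and TagCounter are all made up for this 
example. It uses Python's standard html.parser module, which is only a 
lenient tokenizer and not an implementation of the parsing section 
mentioned below, so treat it as a demonstration rather than a reference.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a ">" inside an attribute value, an omitted </p>,
# a comment containing markup, and a script containing "<" characters.
SAMPLE = (
    '<p title="a > b">one<p>two'
    '<!-- <b>not a tag</b> -->'
    '<script>if (a<b) x = "<i>";</script>'
)

# A naive pattern of the kind often used for quick tag statistics.
NAIVE_START_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def count_with_regex(html):
    """Count start tags by pattern matching alone."""
    counts = {}
    for name in NAIVE_START_TAG.findall(html):
        name = name.lower()
        counts[name] = counts.get(name, 0) + 1
    return counts


class TagCounter(HTMLParser):
    """Count start tags as reported by the tokenizer."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_with_parser(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


if __name__ == "__main__":
    print("regex :", count_with_regex(SAMPLE))
    print("parser:", count_with_parser(SAMPLE))
```

The regexp version reports two b elements that do not exist (one from 
inside the comment, one from the "a<b" comparison in the script), while the 
tokenizer sees only the two p elements and the script element.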

>I studied a bit more, then I found the relation between the HTML 
>document and the DTD (Document Type Definition) declaration. I've noticed 
>that browsers rely on it (the public ones are cached, and the custom ones 
>are fetched before the HTML document is parsed).

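Actually, browsers don't parse DTDs at all for HTML. They only look at the 
doctype declaration to choose a rendering mode (quirks or standards); the 
DTD itself is never fetched.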

>Can you point me to some documentation that explains the way a browser 
>parses HTML documents, or the way it uses the DTD for interpreting the 
>tags and their attributes?

It is specified in the Parsing section[2] of Web Applications 1.0.

[1] http://code.google.com/webstats/index.html
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

Regards,
Simon Pieters

Received on Sunday, 16 April 2006 13:58:27 UTC