
[whatwg] hello list

From: Serban Ghita <serban.ghita@verasys.com>
Date: Sun, 16 Apr 2006 22:04:13 +0300
Message-ID: <00ff01c66188$8d124870$fa92a8c0@verasysi>
Hello guys,

So happy to find a list interested in the future of the Web (HTML/CSS/W3C standards).

Until I get a feeling for what's happening here, I will only read and learn from your messages. But I have one problem that I am sure you know how to handle (I hope this is not off-topic here).

I have a web crawler that I am using for personal research. It crawls an entire site, finds all the links, creates a sitemap, and gathers some statistics. After a while I felt I could do more than that, so I decided to make it parse the HTML code and extract some statistics about tags. For the moment I have created an array of all HTML tags (deprecated ones too), grouped by their structure type (block, inline, single; that's what I call them). I am parsing the HTML code using regular expressions, but when I searched the net I saw lots of people saying: don't parse HTML using regex.
I studied a bit more and found the relation between the HTML document and the DTD (Document Type Definition) declaration. I've noticed that browsers rely on it (the public ones are cached, and the custom ones are fetched before the HTML document is parsed).
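
For illustration, the tag statistics described above could be gathered without regular expressions. The following is only a minimal sketch in Python, assuming the standard-library html.parser module is acceptable as the tokenizer. The names TagStats, classify and tag_statistics are hypothetical, and the tag groupings are deliberately incomplete placeholders, not a full list of HTML tags.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, incomplete groupings used only for this illustration.
SINGLE_TAGS = {"br", "hr", "img", "input", "meta", "link"}
BLOCK_TAGS = {"html", "head", "body", "div", "p", "ul", "ol", "li", "table", "h1", "h2"}


class TagStats(HTMLParser):
    """Counts every start tag seen; tag names arrive already lowercased."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def classify(tag):
    """Map a tag name to one of the three groups named in the message."""
    if tag in SINGLE_TAGS:
        return "single"
    if tag in BLOCK_TAGS:
        return "block"
    return "inline"  # assumption: anything not listed is treated as inline


def tag_statistics(html):
    """Return (per-tag counts, per-group counts) for an HTML string."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    by_group = Counter()
    for tag, n in parser.counts.items():
        by_group[classify(tag)] += n
    return parser.counts, by_group


if __name__ == "__main__":
    counts, groups = tag_statistics('<div><p>Hi <b>there</b><br><IMG src="a.png"></p></div>')
    print(dict(counts))
    print(dict(groups))
```

The parser tolerates uppercase tag names and unclosed tags, which are two of the cases that tend to break regex-based approaches.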

Can you point me to some documentation that explains how a browser parses HTML documents, or how it uses the DTD to interpret tags and their attributes?

Another thing: everyone recommended using an existing library, but I want to slowly learn the whole parsing process myself, so that I can understand all the principles.
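
As a starting point for learning the process by hand, the first stage of parsing (tokenizing) can be sketched as a small state machine that reads one character at a time. This is only a teaching sketch in Python under strong assumptions: the function name tokenize is hypothetical, and it ignores comments, doctypes, attributes, entities, and a ">" inside a quoted attribute value, all of which a real browser tokenizer has to handle.

```python
def tokenize(html):
    """Split an HTML string into ("text", s), ("start", name), ("end", name) tokens."""
    tokens = []
    state = "data"  # "data" = outside a tag, "tag" = between "<" and ">"
    buf = []
    for ch in html:
        if state == "data":
            if ch == "<":
                # Flush any text collected so far, then switch to tag state.
                if buf:
                    tokens.append(("text", "".join(buf)))
                    buf = []
                state = "tag"
            else:
                buf.append(ch)
        else:  # state == "tag"
            if ch == ">":
                raw = "".join(buf).strip()
                buf = []
                state = "data"
                if raw.startswith("/"):
                    tokens.append(("end", raw[1:].strip().lower()))
                elif raw:
                    # Keep only the tag name; attributes are discarded here.
                    tokens.append(("start", raw.split()[0].rstrip("/").lower()))
            else:
                buf.append(ch)
    # Trailing text after the last tag; an unterminated tag is dropped.
    if state == "data" and buf:
        tokens.append(("text", "".join(buf)))
    return tokens


if __name__ == "__main__":
    for token in tokenize("<p>Hi <B>you</B></p>"):
        print(token)
```

A tree builder would then consume these tokens and keep a stack of open elements; that second stage is where the rules about which tags may nest or close implicitly come in.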

Thanks a lot!

Best wishes, 

--------------------------------------
Serban Gh. Ghita
Project Manager

VERASYS Intl.
Web Dept.
Bucuresti, ROMANIA
Tel:    +40-21-201.67.62
Fax:    +40-251-306.017
GSM: +40-788-28.29.10
email: serban.ghita at verasys.com
email: zamolxe at php.net
www.verasys.com / www.itpromo.ro 
Received on Sunday, 16 April 2006 12:04:13 UTC
