- From: Thaddeus L. Olczyk <olczyk@interaccess.com>
- Date: Wed, 20 Mar 2002 07:44:58 -0500 (EST)
- To: html-tidy@w3.org
First: someone suggested that I send my query to this mailing list. I haven't been able to find any way to subscribe to it, so please either send me the answer directly or show me how to subscribe.

My problem: I've written software that crawls through web pages, i.e. given a web page, it finds all the links (and all the images) on that page. (The purpose is that I get a lot of manuals, books, etc. as tar-gzipped sets of HTML documents [e.g. the Python documentation], which I install on my local web server [accessible only from my LAN, of which I am the only user]. I download stuff faster than I can add links by hand, so the crawler finds all the files and adds links to them. I try to work top down, making a best guess at what the index pages are.) Then I find all the links on the linked pages, and so on.

The main problem I have is parsing a page to find the links. At first I tried regular expressions, and that mostly worked. But it was:
1) Fragile: there seemed to be multiple exceptions to the rules, and the list kept growing.
2) Slow.

So then I used expat to parse the files, which was fine for the XML files but didn't work for the HTML files (of course). The solution was: if expat choked on a file, convert it to XHTML using:

    tidy -asxhtml -m $filename

Unfortunately tidy chokes on some of the files. There are very few of them, so it looks worthwhile to handle them case by case. The biggest offenders seem to be web pages that contain embedded C++. For example: vector<T>. Tidy interprets this as a tag <T>.

Now I would like to do two things:
1) Instead of calling tidy via a system call, I would like to take the tidy source, remove main, and write a function something like:

    char *tidy(char *buffer, char *error);

where buffer is the file to be parsed, error is a buffer that receives the error messages, and tidy returns an XHTML version of the buffer.
2) If this tidy function encounters an error, I would like some way of being told at which character in the buffer the error first occurs. The idea is:

    int char_pos;

    memcpy(tidy_buffer, original_buffer, file_size);
    tidy(tidy_buffer);
    while ((char_pos = error_is_bad_tag()) != 0) {
        fix_tag_at_pos(&original_buffer, char_pos);
        memcpy(tidy_buffer, original_buffer, file_size);
        tidy(tidy_buffer);
    }

>I have some html pages that have C++ code embedded within them.
>I would like to take tidy and use it to convert them to xhtml, but the
>problem is that tidy chokes on the template stuff.
>Any idea of how to handle this?
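
For illustration only, here is a minimal sketch of what the repair step in the loop above might look like. It assumes the fix is simply to escape the offending '<' as "&lt;" so that something like vector<T> is no longer read as a tag. The name fix_tag_at_pos comes from the pseudocode; its behaviour, return value and 0-based offset are assumptions, not anything tidy provides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: replace the '<' at byte offset pos in *buf with
 * "&lt;". *buf must be a malloc'd, NUL-terminated string; it is grown
 * with realloc, so the caller's pointer is updated on success.
 * Returns 0 on success, -1 if pos does not point at a '<' or if the
 * reallocation fails (the buffer is left unchanged in both cases). */
int fix_tag_at_pos(char **buf, size_t pos)
{
    static const char esc[] = "&lt;";
    const size_t esc_len = sizeof esc - 1;   /* 4, without the NUL */
    size_t len;
    char *grown;

    assert(buf != NULL && *buf != NULL);
    len = strlen(*buf);

    if (pos >= len || (*buf)[pos] != '<')
        return -1;

    /* New size: old length, minus '<', plus "&lt;", plus the NUL. */
    grown = realloc(*buf, len + esc_len);
    if (grown == NULL)
        return -1;

    /* Shift everything after '<' (including the NUL) to make room. */
    memmove(grown + pos + esc_len, grown + pos + 1, len - pos);
    memcpy(grown + pos, esc, esc_len);

    *buf = grown;
    return 0;
}
```

Note that this sketch uses 0-based offsets, while the loop above treats a position of 0 as "no error". A real version would probably need a different sentinel (say -1) so that a bad tag at the very start of the buffer can still be reported.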
Received on Thursday, 21 March 2002 02:23:10 UTC