Tidy for html-xml parser and embedded C++.

First.
Some one I suggested that I send my query to this mailing list.
I haven't been able to find any way to subscribe to this mailing list,
so please either send me the answer directly or show me how
to subscribe.

My problem.
I've written software which crawls through web pages ie given
a web page, I find all the links ( and all the images ) on that web
page. ( The purpose of this is that I get a lot of manuals books etc.
as a tar gzipped set of html documents [eg the Python documentation ].
I then install these on my local web server [ accessible only from my
Lan of which I am the only user ]. I download stuff faster than I can
add a link, so the crawler finds all the files and adds links to files
( I try to be top down--make a best guess of what the index pages are
).  Then I find all the links on the links etc.

The main problem I have is parsing the page to find the links.
At first I tried using  regular expressions, and it mostly worked.
But it was:
1) Fragile and there seemed to be multiple expections to the rules
    that kept growing.
2) Slow.

So then I used expat to parse the files.
Which was fine for the xml files, but didn't work
for the html files ( of course).

The solution to this was: if expat choked on the file,  then change
the html files to xhtml using: 
tidy -asxhtml -m $filename.

Unfortunately tidy chokes on some of the files.
Very few and it looks worthwhile to go on a case by case basis.
The biggest offenders seem to be web pages that contain embedded 
C++. For example: vector<T>. Tidy interprets this as a tag <T>.
Now I would like to do two things:
1) Instead of calling tidy via a system call,
    I would like to take the tidy source, remove main and write a
    function somthing like: 
           char *tidy(char *buffer,char *error);
   Where buffer is the to be parsed file, error is a buffer containing

   error messages and tidy returns a xhtml version of the buffer. 

2) If this tidy function encounters an error, I would like some way of
    being told what character in the buffer the error firsts occurs  
    in. The idea is:
     memcpy(tidy_buffer,original_buffer, sizeof(file));
     tidy(tidy_buffer);
     while((int char_pos=error_is_bad_tag())!-0)
     {
         fix_tag_a_pos(&original_buffer, char_pos) 
         memcpy(tidy_buffer,original_buffer, sizeof(file));
         tidy(tidy_buffer);
    }


        





 
>I have some html pages that have C++ code embedded within them.
>I would like to take tidy and use it to convert them to xhtml, but the
>problem is that tidy chockes on the template stuff.
>Any idea of how to handle this?

Received on Thursday, 21 March 2002 02:23:10 UTC