tidying an mbox file?

Hi Folks,

I'm trying to migrate an email list from yahoogroups to sympa.  There's 
a fairly straightforward tool chain available, but with a major glitch:
- yahoo2mbox - extracts archives from yahoogroups to an mbox file
- a group of scripts that use mhonarc to turn an mbox file into a set of 
html files and indices for sympa's archives

But... yahoogroups generates some pretty malformed HTML inside its 
archives.  When combined with messages that originated in Outlook, the 
HTML can be REALLY bad.  And mhonarc takes the simple step of filtering 
out really bad HTML.

The result is that I end up with an archive where about half the 
messages have headers, but no bodies!

It occurs to me that running the source file through Tidy might be a way 
to clean things up sufficiently for mhonarc to process messages 
correctly, but, of course, an mbox file is not the same as a single web 
page.

Which leads me to the following question:  Does anybody have any 
experience and/or suggestions on how to process an mbox file to clean up 
HTML that's embedded in mail messages?
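
To make the question concrete, here is a minimal sketch of the kind of 
per-message processing in question, not a known working solution.  It 
assumes Python's standard mailbox module and a tidy binary on the PATH. 
The names tidy_html and clean_mbox are made up for illustration, and the 
tidy options shown are a guess at a reasonable set.  Each text/html part 
is decoded, passed through a cleaner, and written back as UTF-8 to a new 
mbox file, leaving the original untouched.

```python
import mailbox
import subprocess


def tidy_html(html):
    """Pipe one HTML body through the tidy command-line tool (assumed installed)."""
    result = subprocess.run(
        ["tidy", "-quiet", "-utf8", "--show-warnings", "no", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits non-zero on warnings/errors; fall back to the input if it printed nothing.
    return result.stdout.decode("utf-8", errors="replace") or html


def clean_mbox(src_path, dst_path, clean_html=tidy_html):
    """Copy src_path to dst_path, cleaning every text/html part. Returns the part count."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    cleaned = 0
    dst.lock()
    try:
        for msg in src:
            # walk() visits the message itself and every nested MIME part.
            for part in msg.walk():
                if part.get_content_type() != "text/html":
                    continue
                raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
                if raw is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    html = raw.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset label: latin-1 can decode any byte sequence.
                    html = raw.decode("latin-1")
                # Drop the old transfer encoding so set_payload picks a fresh one.
                del part["Content-Transfer-Encoding"]
                part.set_payload(clean_html(html), charset="utf-8")
                cleaned += 1
            dst.add(msg)
    finally:
        dst.close()  # flushes and releases the lock
        src.close()
    return cleaned
```

Something like clean_mbox("yahoo.mbox", "yahoo-clean.mbox") would then 
produce a file to feed to mhonarc (both file names are placeholders). 
Whether Tidy's output is clean enough for mhonarc's HTML filter to 
accept is a separate question.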

Thanks much,

Miles Fidelman

Received on Tuesday, 6 February 2007 18:46:35 UTC