
tidying an mbox file?

From: Miles Fidelman <mfidelman@meetinghouse.net>
Date: Tue, 06 Feb 2007 09:35:33 -0500
Message-ID: <45C89235.3050804@meetinghouse.net>
To: html-tidy@w3.org

Hi Folks,

I'm trying to migrate an email list from yahoogroups to sympa.  There's 
a fairly straightforward tool chain available, but with a major glitch:
- yahoo2mbox - extracts archives from yahoogroups to an mbox file
- a group of scripts that use mhonarc to turn an mbox file into a set of 
html files and indices for sympa's archives

But... yahoogroups generates some pretty malformed HTML inside its 
archives.  When combined with messages originating from Outlook, the HTML 
can be REALLY bad.  And mhonarc takes the simple step of filtering out 
really bad HTML. 

The result is that I end up with an archive where about half the 
messages have headers, but no bodies!

It occurs to me that running the source file through Tidy might be a way 
to clean things up sufficiently for mhonarc to process messages 
correctly, but, of course, an mbox file is not the same as a single web 
page.
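
One possible approach, as a rough sketch only: walk the mbox with Python's 
standard mailbox module, pipe each text/html part through the tidy 
command-line tool, and write the cleaned messages to a new mbox.  The 
function names tidy_html and clean_mbox are made up for illustration, tidy 
is assumed to be on the PATH, and whether the result is clean enough for 
mhonarc is not something this sketch can promise.

```python
import mailbox
import subprocess


def tidy_html(html):
    """Pipe one HTML string through the tidy CLI (assumed to be on the PATH)."""
    result = subprocess.run(
        ["tidy", "-q", "-utf8", "--force-output", "yes", "--show-warnings", "no"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits non-zero on warnings/errors; with force-output it should
    # still emit markup. Fall back to the original if nothing came back.
    cleaned = result.stdout.decode("utf-8", errors="replace")
    return cleaned if cleaned.strip() else html


def clean_mbox(src_path, dst_path, clean=tidy_html):
    """Copy src_path to dst_path, passing every text/html part through clean()."""
    src = mailbox.mbox(src_path)
    dst = mailbox.mbox(dst_path)
    dst.lock()
    try:
        for msg in src:
            for part in msg.walk():
                if part.get_content_type() != "text/html":
                    continue
                raw = part.get_payload(decode=True)  # undo base64 / quoted-printable
                charset = part.get_content_charset() or "utf-8"
                try:
                    html = raw.decode(charset, errors="replace")
                except LookupError:  # charset name Python does not know
                    html = raw.decode("latin-1")
                # Drop the old transfer encoding so set_payload picks a fresh one.
                del part["Content-Transfer-Encoding"]
                part.set_payload(clean(html), charset="utf-8")
            dst.add(msg)
    finally:
        dst.close()  # flushes and releases the lock
        src.close()
```

Usage would be along the lines of clean_mbox("yahoo.mbox", "tidied.mbox"), 
with the second file then fed to the mhonarc scripts.  The clean parameter 
is only there so a different cleaner can be swapped in.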

Which leads me to the following question:  Does anybody have any experience 
and/or suggestions on how to process an mbox file to clean up HTML 
that's embedded in mail messages?

Thanks much,

Miles Fidelman
Received on Tuesday, 6 February 2007 18:46:35 GMT
