- From: Miles Fidelman <mfidelman@meetinghouse.net>
- Date: Tue, 06 Feb 2007 09:35:33 -0500
- To: html-tidy@w3.org
Hi Folks,

I'm trying to migrate an email list from yahoogroups to sympa. There's a fairly straightforward tool chain available, but with a major glitch:

- yahoo2mbox, which extracts archives from yahoogroups to an mbox file
- a group of scripts that use mhonarc to turn an mbox file into a set of html files and indices for sympa's archives

But... yahoogroups generates some pretty malformed HTML inside its archives. When combined with messages that originated in Outlook, the HTML can be REALLY bad. And mhonarc takes the simple step of filtering out really bad HTML. The result is that I end up with an archive where about half the messages have headers, but no bodies!

It occurs to me that running the source file through Tidy might be a way to clean things up sufficiently for mhonarc to process messages correctly, but, of course, an mbox file is not the same as a single web page.

Which leads me to the following question: Does anybody have any experience and/or suggestions on how to process an mbox file to clean up HTML that's embedded in mail messages?

Thanks much,

Miles Fidelman
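To make the question concrete, here is a rough sketch of one possible approach, not a tested recipe: walk the mbox message by message, hand only the text/html parts to Tidy, and write the results to a new mbox that mhonarc can then process. It uses Python's standard `mailbox` module. The names `tidy_html` and `clean_mbox` are made up for illustration, the sketch assumes a `tidy` executable is on the PATH (if not, the HTML is passed through unchanged), and it assumes that re-encoding every cleaned part as UTF-8 is acceptable.

```python
import mailbox
import shutil
import subprocess


def tidy_html(html):
    """Pass one HTML string through the `tidy` command-line tool.

    Assumption: if `tidy` is not installed, or produces no output,
    the original HTML is returned unchanged.
    """
    if shutil.which("tidy") is None:
        return html
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml", "-utf8",
         "--force-output", "yes", "--show-warnings", "no"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits non-zero when it had to report warnings or errors, so the
    # exit status is not treated as a failure here.
    cleaned = result.stdout.decode("utf-8", errors="replace")
    return cleaned or html


def clean_mbox(src_path, dst_path, cleaner=tidy_html):
    """Copy src_path to dst_path, cleaning every text/html part on the way.

    Returns the number of HTML parts that were rewritten.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    cleaned_parts = 0
    try:
        for msg in src:
            for part in msg.walk():
                if part.get_content_type() != "text/html":
                    continue
                raw = part.get_payload(decode=True)  # undo base64 / QP
                if raw is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    html = raw.decode(charset, errors="replace")
                except LookupError:  # charset name Python does not know
                    html = raw.decode("latin-1")
                # Drop the old transfer encoding so set_payload() can pick
                # one that matches the new UTF-8 body.
                del part["Content-Transfer-Encoding"]
                part.set_payload(cleaner(html), charset="utf-8")
                cleaned_parts += 1
            dst.add(msg)
        dst.flush()
    finally:
        dst.close()
        src.close()
    return cleaned_parts
```

Something like `clean_mbox("list.mbox", "list-clean.mbox")` would then produce the file to feed to the mhonarc scripts (both file names are placeholders). Keeping the cleaner as a parameter means a different Tidy configuration, or a different cleanup tool altogether, can be swapped in without touching the mbox handling. Whether Tidy's output is enough to keep mhonarc from dropping the bodies is exactly the open question.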
Received on Tuesday, 6 February 2007 18:46:35 UTC