- From: Ed Summers <ehs@pobox.com>
- Date: Tue, 18 Dec 2012 11:07:51 -0500
- To: jose.kahan@w3.org
- Cc: w3t-sys@w3.org, public-webhistory@w3.org
Thanks for writing with all those details, Jose. Before I dive in any further I thought I would just point out that the files I received from Arjun are easily parseable with Python's mailbox.mbox [1].

Would it be possible for you to make the data that you have available in some form, for me to compare to what I have from Arjun? I do belong to a W3 institution (the Library of Congress) so theoretically I have a user account somewhere on w3 machines? If not, sending via email (they aren't that big) or some other mechanism (dropbox) could work.

//Ed

PS. Is it ok to have this discussion on public-webhistory?

[1] https://gist.github.com/432

On Tue, Dec 18, 2012 at 10:45 AM, Jose Kahan <jose.kahan@w3.org> wrote:
> Hi Ed,
>
> We saw your message go by. I downloaded your archives. I already
> had a copy of those messages.
>
> What has kept us from moving forward is that our tools were unable
> to correctly parse all those messages. Sometimes messages
> were combined as the heuristics to find the beginning of
> a message (e.g. From envelope, newline) weren't always constant.
>
> In order to be able to convert the archives into our hypermail
> system, we need to transform the mbox format into an mh-like
> one (one file per message) and synthesize a first Received:
> header in the format we expect it to be.
>
> More precisely, we need each message to begin with
> a format close to this one:
>
> [[
> From someone@hotmail.com Tue Dec 18 12:14:42 2012
> Received: from listserver.example.com ([xxx.xxx.xx])
>     by listserver.example.org with esmtp (Exim 4.72)
>     (envelope-from <adamsobieski@hotmail.com>)
>     id 1Tkw4I-0004wZ-El
>     for www-talk@example.org; Tue, 18 Dec 2012 12:14:42 +0000
> ]]
>
> The exact contents (nameserver, etc.) are not important. What
> is important is the received date in the envelope (From) and
> the first Received header. They have to be the same. It is this
> date that we use for sorting the messages.
>
> Our tools (including procmail, mimetools and friends) were unable
> to correctly process those mboxes. We started cleaning them up by hand,
> but it was so time-consuming that we dropped it after a while.
>
> It's important that once we put the archives online, we can't modify
> their URLs, so they become references. And as we were not able
> to correctly separate each message, this means we would not be able
> to fix mixed messages easily. That's why we only made available
> parts of the historical messages.
>
> That's the current status. Please contact me if you do have
> some free time to help us prepare them. I'm not sure where I put
> the scripts I had used to separate the messages, but I can look
> for them (not sure if I still have them, though).
>
> Thanks for your message,
>
> -jose
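
For illustration, the conversion discussed above (read an mbox with Python's mailbox.mbox, write one file per message, and prepend a "From " envelope line plus a first Received: header that carry the same date) could be sketched roughly as follows. This is only a sketch under assumptions, not the W3C tooling: the function names synth_headers and split_mbox are made up here, the host names and list address are the placeholders from the sample above, the Exim id field is left out, and taking the date from each message's Date: header is an assumption about where the "received" date should come from.

```python
import mailbox
import os
import time
from datetime import timezone
from email.utils import format_datetime, parseaddr, parsedate_to_datetime


def synth_headers(msg, list_addr="www-talk@example.org"):
    """Return (envelope_line, received_value) that share one UTC date.

    Hypothetical helper: the date is taken from the Date: header, which
    is an assumption, and the host names are placeholders.
    """
    date_header = msg["Date"]
    if date_header is None:
        raise ValueError("message has no Date: header")
    when = parsedate_to_datetime(str(date_header))
    if when.tzinfo is None:
        # A date without a usable zone is treated as UTC here.
        when = when.replace(tzinfo=timezone.utc)
    when = when.astimezone(timezone.utc)

    sender = parseaddr(str(msg.get("From", "")))[1] or "unknown"
    # Envelope uses the asctime layout, e.g. "Tue Dec 18 12:14:42 2012".
    envelope = "From %s %s" % (sender, time.asctime(when.timetuple()))
    # Received: uses the RFC 2822 layout, folded with tabs.
    received = (
        "from listserver.example.com\n"
        "\tby listserver.example.org with esmtp\n"
        "\t(envelope-from <%s>)\n"
        "\tfor %s; %s" % (sender, list_addr, format_datetime(when))
    )
    return envelope, received


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox to out_dir/1, out_dir/2, ...

    Returns the number of messages written.
    """
    os.makedirs(out_dir, exist_ok=True)
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for count, msg in enumerate(box, start=1):
            envelope, received = synth_headers(msg)
            with open(os.path.join(out_dir, str(count)), "wb") as fh:
                fh.write(envelope.encode("utf-8") + b"\n")
                fh.write(b"Received: " + received.encode("utf-8") + b"\n")
                # as_bytes() omits the original mbox "From " line.
                fh.write(msg.as_bytes())
    finally:
        box.close()
    return count
```

Calling split_mbox("archive.mbox", "out") would then produce numbered files in out/, where "archive.mbox" is a hypothetical path. Note that this relies on mailbox.mbox finding message boundaries correctly, so it does not by itself address the merged messages described above.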
Received on Tuesday, 18 December 2012 16:08:20 UTC