- From: Ed Summers <ehs@pobox.com>
- Date: Tue, 18 Dec 2012 11:09:19 -0500
- To: jose.kahan@w3.org
- Cc: w3t-sys@w3.org, public-webhistory@w3.org
... sorry, here's the correct gist URL https://gist.github.com/4329253

//Ed

On Tue, Dec 18, 2012 at 11:07 AM, Ed Summers <ehs@pobox.com> wrote:
> Thanks for writing up all those details, Jose. Before I dive in
> any further I thought I would just point out that the files I received
> from Arjun are easily parse-able with Python's mailbox.mbox [1]. Would
> it be possible for you to make the data that you have available in
> some form, for me to compare to what I have from Arjun? I do belong to
> a W3 institution (the Library of Congress) so theoretically I have a
> user account somewhere on w3 machines? If not, sending via email (they
> aren't that big) or some other mechanism (dropbox) could work.
>
> //Ed
>
> PS. Is it ok to have this discussion on public-webhistory?
>
> [1] https://gist.github.com/432
>
> On Tue, Dec 18, 2012 at 10:45 AM, Jose Kahan <jose.kahan@w3.org> wrote:
>> Hi Ed,
>>
>> We saw your message go by. I downloaded your archives. I already
>> had a copy of those messages.
>>
>> What has kept us from moving forward is that our tools were unable
>> to correctly parse all those messages. Sometimes messages
>> were combined, as the heuristics to find the beginning of
>> a message (e.g. From envelope, newline) weren't always constant.
>>
>> In order to be able to convert the archives into our hypermail
>> system, we need to transform the mbox format into an mh-like
>> one (one file per message) and synthesize a first Received:
>> header in the format we expect it to be.
>>
>> More precisely, we need each message to begin with
>> a format close to this one:
>>
>> [[
>> From someone@hotmail.com Tue Dec 18 12:14:42 2012
>> Received: from listserver.example.com ([xxx.xxx.xx])
>> by listserver.example.org with esmtp (Exim 4.72)
>> (envelope-from <adamsobieski@hotmail.com>)
>> id 1Tkw4I-0004wZ-El
>> for www-talk@example.org; Tue, 18 Dec 2012 12:14:42 +0000
>> ]]
>>
>> The exact contents (nameserver, etc.) are not important. What
>> is important is the received date in the envelope (From) and
>> the first Received header. It has to be the same. It is this
>> date that we use for sorting the messages.
>>
>> Our tools (including procmail, mimetools and friends) were unable
>> to correctly process those mboxes. We started cleaning them up by hand,
>> but it was so time-consuming that we dropped it after a while.
>>
>> Importantly, once we put the archives online we can't modify
>> their URLs, because they become references. And as we were not able
>> to correctly separate each message, this means we would not be able
>> to fix mixed messages easily. That's why we only made available
>> parts of the historical messages.
>>
>> That's the current status. Please contact me if you do have
>> some free time to help us prepare them. I'm not sure where I put
>> the scripts I had used to separate the messages, but I can look
>> for them (not sure if I still have them, though).
>>
>> Thanks for your message,
>>
>> -jose
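
For illustration only, the conversion Jose describes (one file per message, with a first Received: header whose date matches the envelope From line) could be sketched with Python's mailbox.mbox roughly as follows. This is a minimal sketch under stated assumptions, not the script used by either party: the function names, the numbered output files, the placeholder host names and list address (copied from the example above), and the treatment of the envelope date as UTC are all hypothetical, and the exact Received: layout that hypermail expects is not confirmed by this thread.

```python
import mailbox
import time
from pathlib import Path

# asctime-style date used on mbox "From " envelope lines,
# e.g. "Tue Dec 18 12:14:42 2012"
ENVELOPE_DATE = "%a %b %d %H:%M:%S %Y"


def parse_envelope(from_line):
    """Split an envelope line (without the leading 'From ') into
    (sender, struct_time). Raises ValueError if the date is malformed."""
    sender, _, date_part = from_line.partition(" ")
    return sender, time.strptime(date_part.strip(), ENVELOPE_DATE)


def synth_received(sender, when, list_addr="www-talk@example.org"):
    """Build a Received: header value carrying the envelope date.
    Host names and list address are placeholders (assumption); the
    envelope date is assumed to be UTC, hence the fixed +0000."""
    stamp = time.strftime("%a, %d %b %Y %H:%M:%S +0000", when)
    return (
        "from listserver.example.com\n"
        "\tby listserver.example.org\n"
        f"\t(envelope-from <{sender}>)\n"
        f"\tfor {list_addr}; {stamp}"
    )


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox to its own numbered file (mh-like),
    prefixed by the original envelope line and a synthesized Received:
    header. Returns the number of messages written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for number, msg in enumerate(box, start=1):
            envelope = msg.get_from()  # "From " prefix already stripped
            sender, when = parse_envelope(envelope)
            with open(out / str(number), "wb") as fh:
                fh.write(f"From {envelope}\n".encode("utf-8"))
                fh.write(f"Received: {synth_received(sender, when)}\n".encode("utf-8"))
                fh.write(msg.as_bytes())
            count += 1
    finally:
        box.close()
    return count
```

Calling `split_mbox("www-talk.mbox", "out")` (file name hypothetical) would write `out/1`, `out/2`, and so on. A malformed envelope date raises `ValueError` rather than being guessed, which seems the safer choice given that mis-split messages are the problem being described; messages that were already merged by a bad "From " heuristic would still need separate handling.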
Received on Tuesday, 18 December 2012 16:09:51 UTC