
Re: www-talk archives: 1992-1995?

From: Ed Summers <ehs@pobox.com>
Date: Tue, 18 Dec 2012 11:09:19 -0500
Message-ID: <CABzDd=49KwWQd4x_sPzrcd=s_5F7h6UHADOKDaZunXvv--rA7g@mail.gmail.com>
To: jose.kahan@w3.org
Cc: w3t-sys@w3.org, public-webhistory@w3.org
... sorry, here's the correct gist URL



On Tue, Dec 18, 2012 at 11:07 AM, Ed Summers <ehs@pobox.com> wrote:
> Thanks for writing with all those details, Jose. Before I dive in
> any further I thought I would just point out that the files I received
> from Arjun are easily parse-able with Python's mailbox.mbox [1]. Would
> it be possible for you to make the data that you have available in
> some form, for me to compare to what I have from Arjun? I do belong to
> a W3 institution (the Library of Congress) so theoretically I have a
> user account somewhere on w3 machines? If not, sending via email (they
> aren't that big) or some other mechanism (dropbox) could work.
> //Ed
> PS. Is it ok to have this discussion on public-webhistory?
> [1] https://gist.github.com/432
> On Tue, Dec 18, 2012 at 10:45 AM, Jose Kahan <jose.kahan@w3.org> wrote:
>> Hi Ed,
>> We saw your message go by. I downloaded your archives. I already
>> had a copy of those messages.
>> What has kept us from moving forward is that our tools were unable
>> to correctly parse all those messages. Sometimes messages
>> were combined, as the markers our heuristics use to find the
>> beginning of a message (e.g. From envelope, newline) weren't
>> always consistent.
>> In order to be able to convert the archives into our hypermail
>> system, we need to transform the mbox format into an mh-like
>> one (one file per message) and synthesize a first Received:
>> header in the format we expect it to be.
>> More precisely, we need each message to begin with
>> a format close to this one:
>> [[
>> From someone@hotmail.com Tue Dec 18 12:14:42 2012
>> Received: from listserver.example.com ([xxx.xxx.xx])
>>         by listserver.example.org with esmtp (Exim 4.72)
>>         (envelope-from <adamsobieski@hotmail.com>)
>>         id 1Tkw4I-0004wZ-El
>>         for www-talk@example.org; Tue, 18 Dec 2012 12:14:42 +0000
>> ]]
>> The exact contents (nameserver, etc.) are not important. What
>> is important is that the date in the envelope (From) line and
>> the date in the first Received: header are the same. It is this
>> date that we use for sorting the messages.
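
A rough sketch of how an envelope line and a first Received: header
carrying the same date could be generated in Python. The function name
synth_envelope, the host name and the addresses are placeholders of
mine, not part of any existing tool, and the Received: layout is
simplified from the example above.

```python
import time
from datetime import datetime, timezone
from email.utils import format_datetime


def synth_envelope(sender, when, list_addr, host="listserver.example.org"):
    """Return (from_line, received_header) sharing one timestamp.

    sender, list_addr and host are illustrative placeholders;
    `when` must be a timezone-aware datetime.
    """
    utc = when.astimezone(timezone.utc)
    # Envelope line uses the asctime layout, e.g. "Tue Dec 18 12:14:42 2012"
    from_line = "From %s %s" % (sender, time.asctime(utc.timetuple()))
    # Received: uses the RFC 2822 layout, e.g. "Tue, 18 Dec 2012 12:14:42 +0000"
    received = (
        "Received: from %s\n"
        "        by %s\n"
        "        (envelope-from <%s>)\n"
        "        for %s; %s"
    ) % (host, host, sender, list_addr, format_datetime(utc))
    return from_line, received


if __name__ == "__main__":
    stamp = datetime(2012, 12, 18, 12, 14, 42, tzinfo=timezone.utc)
    env, rec = synth_envelope("someone@example.com", stamp,
                              "www-talk@example.org")
    print(env)
    print(rec)
```

Both strings are derived from the same datetime, so the two dates
cannot drift apart.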
>> Our tools (including procmail, mimetools and friends) were unable
>> to correctly process those mboxes. We started cleaning them up by hand,
>> but it was so time-consuming that we dropped it after a while.
>> It's important to note that once we put the archives online, we
>> can't modify their URLs, because they become references. And as we
>> were not able to correctly separate each message, this means we
>> would not be able to fix mixed messages easily. That's why we only
>> made available parts of the historical messages.
>> That's the current status. Please contact me if you do have
>> some free time to help us prepare them. I'm not sure where I put
>> the scripts I had used to separate the messages, but I can look
>> for them (not sure if I still have them, though).
>> Thanks for your message,
>> -jose
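
For what it's worth, here is a minimal sketch of the mbox to
one-file-per-message step using only Python's standard mailbox module.
The function name mbox_to_mh and the paths are made up for
illustration. It relies on mailbox.mbox to find message boundaries, so
it will not by itself repair messages that were already merged in the
source files, and it does not synthesize any Received: header.

```python
import mailbox


def mbox_to_mh(mbox_path, mh_dir):
    """Copy every message in an mbox file into an MH-style directory.

    Each message ends up in its own numbered file under mh_dir.
    Returns the number of messages copied.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.MH(mh_dir, create=True)
    count = 0
    try:
        for msg in src:
            # add() stores the message in the next free numbered file
            dst.add(msg)
            count += 1
    finally:
        dst.close()
        src.close()
    return count


if __name__ == "__main__":
    # Hypothetical paths, adjust to wherever the archives live.
    print(mbox_to_mh("www-talk-1992.mbox", "www-talk-1992-mh"))
```

Comparing the returned count against the number of messages expected
for a given period would be one quick way to spot merged messages.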
Received on Tuesday, 18 December 2012 16:09:51 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 19:40:43 UTC