Re: www-talk archives: 1992-1995? from Ed Summers on 2012-12-18 (public-webhistory@w3.org from December 2012)

From: Ed Summers <ehs@pobox.com>
Date: Tue, 18 Dec 2012 11:07:51 -0500
To: jose.kahan@w3.org
Cc: w3t-sys@w3.org, public-webhistory@w3.org
Message-ID: <CABzDd=6qzJzZo4BnAF+uP=Ku2krzpj=1anE47ep4jAgovmbstA@mail.gmail.com>

Thanks for writing with all those details, Jose. Before I dive in
any further I thought I would just point out that the files I received
from Arjun are easily parse-able with Python's mailbox.mbox [1]. Would
it be possible for you to make the data that you have available in
some form, for me to compare to what I have from Arjun? I do belong to
a W3 institution (the Library of Congress) so theoretically I have a
user account somewhere on w3 machines? If not, sending via email (they
aren't that big) or some other mechanism (dropbox) could work.

//Ed

PS. Is it ok to have this discussion on public-webhistory?

[1] https://gist.github.com/432
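
For readers following along, reading an mbox file with Python's
mailbox.mbox might look roughly like the sketch below. This is not the
code from the gist above; the sample messages and the helper name
list_messages are made up for illustration.

```python
import mailbox
import os
import tempfile

# Two made-up messages in mbox form, for illustration only.
SAMPLE = (
    b"From alice@example.org Tue Dec 18 12:14:42 2012\n"
    b"From: alice@example.org\n"
    b"Subject: first\n"
    b"\n"
    b"Hello.\n"
    b"\n"
    b"From bob@example.org Wed Dec 19 09:30:00 2012\n"
    b"From: bob@example.org\n"
    b"Subject: second\n"
    b"\n"
    b"Hi.\n"
)


def list_messages(path):
    """Return (envelope From text, Subject header) for each message in an mbox."""
    # create=False: raise instead of silently creating a missing file.
    box = mailbox.mbox(path, create=False)
    try:
        return [(msg.get_from(), msg["Subject"]) for msg in box]
    finally:
        box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "wb") as fh:
            fh.write(SAMPLE)
        for envelope, subject in list_messages(path):
            print(envelope, "|", subject)
```

Whether this splits a real archive correctly depends on how regular its
"From " separator lines are, which is the problem Jose describes below.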

On Tue, Dec 18, 2012 at 10:45 AM, Jose Kahan <jose.kahan@w3.org> wrote:
> Hi Ed,
>
> We saw your message go by. I downloaded your archives. I already
> had a copy of those messages.
>
> What has kept us from moving forward is that our tools were unable
> to correctly parse all those messages. Sometimes messages
> were combined, because the markers our heuristics use to find the
> beginning of a message (e.g. From envelope, newline) weren't always
> consistent.
>
> In order to be able to convert the archives into our hypermail
> system, we need to transform the mbox format into an mh-like
> one (one file per message) and synthesize a first Received:
> header in the format we expect it to be.
>
> More precisely, we need each message to begin with
> a format close to this one:
>
> [[
> From someone@hotmail.com Tue Dec 18 12:14:42 2012
> Received: from listserver.example.com ([xxx.xxx.xx])
>         by listserver.example.org with esmtp (Exim 4.72)
>         (envelope-from <someone@hotmail.com>)
>         id 1Tkw4I-0004wZ-El
>         for www-talk@example.org; Tue, 18 Dec 2012 12:14:42 +0000
> ]]
>
> The exact contents (nameserver, etc.) are not important. What
> is important is the received date in the envelope (From) line and
> in the first Received header: the two have to be the same. It is
> this date that we use for sorting the messages.
>
> Our tools (including procmail, mimetools and friends) were unable
> to correctly process those mboxes. We started cleaning them up by hand,
> but it was so time-consuming that we dropped it after a while.
>
> It's important to note that once we put the archives online, we
> can't modify their URLs, since they become references. And as we were
> not able to correctly separate each message, this means we would not
> be able to fix mixed messages easily. That's why we only made
> available parts of the historical messages.
>
> That's the current status. Please contact me if you do have
> some free time to help us prepare them. I'm not sure where I put
> the scripts I had used to separate the messages, but I can look
> for them (not sure if I still have them, though).
>
> Thanks for your message,
>
> -jose
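
The conversion Jose describes above (one file per message, each starting
with the envelope From line followed by a synthesized first Received:
header carrying the same date) could be sketched as follows. This is not
the W3C tooling. The helper names, the output file naming, the host names
and the assumption that envelope dates are in UTC are all illustrative,
and a malformed From line will simply raise ValueError.

```python
import mailbox
import os
import tempfile
from datetime import datetime, timezone
from email.utils import format_datetime
from pathlib import Path

# One made-up mbox with two messages, for illustration only.
SAMPLE = (
    b"From someone@hotmail.com Tue Dec 18 12:14:42 2012\n"
    b"Subject: first\n\nHello.\n\n"
    b"From other@example.org Wed Dec 19 09:30:00 2012\n"
    b"Subject: second\n\nHi.\n"
)


def synthesize_received(sender, when, list_addr="www-talk@example.org"):
    """Build a Received: header whose date equals the envelope date."""
    return (
        "Received: from listserver.example.com\n"
        "        by listserver.example.org with esmtp\n"
        f"        (envelope-from <{sender}>)\n"
        f"        for {list_addr}; {format_datetime(when)}\n"
    )


def split_mbox(mbox_path, out_dir):
    """Write each message to its own file; return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for n, key in enumerate(box.keys(), start=1):
            # Raw bytes, including the leading "From " envelope line.
            raw = box.get_bytes(key, from_=True)
            from_line, _, rest = raw.partition(b"\n")
            _, sender, stamp = from_line.decode("ascii", "replace").split(" ", 2)
            # Assumption: the asctime-style envelope date is in UTC.
            when = datetime.strptime(
                stamp.strip(), "%a %b %d %H:%M:%S %Y"
            ).replace(tzinfo=timezone.utc)
            header = synthesize_received(sender, when).encode("ascii", "replace")
            target = out / f"{n:05d}"
            target.write_bytes(from_line + b"\n" + header + rest)
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.mbox")
        with open(src, "wb") as fh:
            fh.write(SAMPLE)
        for path in split_mbox(src, os.path.join(tmp, "out")):
            print(path.read_text())
```

Copying the raw bytes rather than re-serializing parsed messages keeps the
original content untouched, but the split is still only as reliable as the
"From " separators in the source archive.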

Received on Tuesday, 18 December 2012 16:08:20 UTC