Re: www-talk archives: 1992-1995?

From: Ed Summers <ehs@pobox.com>
Date: Tue, 18 Dec 2012 11:07:51 -0500
Message-ID: <CABzDd=6qzJzZo4BnAF+uP=Ku2krzpj=1anE47ep4jAgovmbstA@mail.gmail.com>
To: jose.kahan@w3.org
Cc: w3t-sys@w3.org, public-webhistory@w3.org
Thanks for writing with all those details, Jose. Before I dive in
any further, I thought I would just point out that the files I received
from Arjun are easily parseable with Python's mailbox.mbox [1]. Would
it be possible for you to make the data that you have available in
some form, so that I can compare it to what I have from Arjun? I do
belong to a W3 institution (the Library of Congress), so theoretically
I should have a user account somewhere on W3 machines. If not, sending
the files via email (they aren't that big) or some other mechanism
(Dropbox) could work.
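
For reference, a minimal sketch of what reading an mbox file with
mailbox.mbox might look like. The sample messages, addresses, and the
helper name summarize_mbox are hypothetical and only there to make the
example self-contained; this is not the script behind [1].

```python
import mailbox
import os
import tempfile

# Hypothetical two-message sample in mbox format (not data from the archives).
SAMPLE = (
    "From alice@example.org Tue Dec 18 12:14:42 2012\n"
    "From: alice@example.org\n"
    "Subject: first\n"
    "\n"
    "Hello.\n"
    "\n"
    "From bob@example.org Tue Dec 18 12:20:00 2012\n"
    "From: bob@example.org\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
)


def summarize_mbox(path):
    """Return (envelope From line, Subject) for every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        # get_from() gives the envelope line without the leading "From ".
        return [(msg.get_from(), msg["Subject"]) for msg in box]
    finally:
        box.close()


# Write the sample to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as fh:
        fh.write(SAMPLE)
    summaries = summarize_mbox(path)

for envelope, subject in summaries:
    print(envelope, "|", subject)
```

Note that mailbox.mbox finds message boundaries by looking for lines
that begin with "From ", so a body line starting that way that was
never escaped would still be treated as the start of a new message.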


PS. Is it ok to have this discussion on public-webhistory?

[1] https://gist.github.com/432

On Tue, Dec 18, 2012 at 10:45 AM, Jose Kahan <jose.kahan@w3.org> wrote:
> Hi Ed,
> We saw your message go by. I downloaded your archives. I already
> had a copy of those messages.
> What has kept us from moving forward is that our tools were unable
> to correctly parse all those messages. Sometimes messages
> were combined, because the markers our heuristics use to find the
> beginning of a message (e.g. From envelope, newline) weren't always
> consistent.
> In order to be able to convert the archives into our hypermail
> system, we need to transform the mbox format into an mh-like
> one (one file per message) and synthesize a first Received:
> header in the format we expect it to be.
> More precisely, we need each message to begin with
> a format close to this one:
> [[
> From someone@hotmail.com Tue Dec 18 12:14:42 2012
> Received: from listserver.example.com ([xxx.xxx.xx])
>         by listserver.example.org with esmtp (Exim 4.72)
>         (envelope-from <adamsobieski@hotmail.com>)
>         id 1Tkw4I-0004wZ-El
>         for www-talk@example.org; Tue, 18 Dec 2012 12:14:42 +0000
> ]]
> The exact contents (nameserver, etc.) are not important. What
> is important is the received date in the envelope (From) and
> in the first Received header. The two have to be the same. It is
> this date that we use for sorting the messages.
> Our tools (including procmail, mimetools and friends) were unable
> to correctly process those mboxes. We started cleaning them up by hand,
> but it was so time-consuming that we dropped it after a while.
> It's important to note that once we put the archives online, we
> can't modify their URLs, because they become references. And as we
> were not able to correctly separate each message, this means we
> would not be able to fix mixed messages easily. That's why we only
> made available parts of the historical messages.
> That's the current status. Please contact me if you do have
> some free time to help us prepare them. I'm not sure where I put
> the scripts I had used to separate the messages, but I can look
> for them (not sure if I still have them, though).
> Thanks for your message,
> -jose
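
The conversion Jose describes (one file per message, each starting with
the envelope From line followed by a synthesized first Received: header
carrying the same date) could be sketched roughly as below. This is
only an illustration under stated assumptions: the envelope date is
assumed to be in asctime form and is treated as UTC, the host names and
the "id" value are placeholders, and the function names, output file
naming, and sample message are hypothetical rather than what the W3C
tools actually do.

```python
import mailbox
import os
import tempfile
import time

# Placeholder header modelled on the example in the quoted message.
RECEIVED_TEMPLATE = (
    "Received: from listserver.example.com ([xxx.xxx.xx])\n"
    "        by listserver.example.org with esmtp\n"
    "        (envelope-from <{sender}>)\n"
    "        id placeholder\n"
    "        for {list_addr}; {stamp}"
)


def split_envelope(from_line):
    """Split 'sender Tue Dec 18 12:14:42 2012' into (sender, struct_time)."""
    sender, _, stamp = from_line.partition(" ")
    return sender, time.strptime(stamp.strip(), "%a %b %d %H:%M:%S %Y")


def to_mh_like(mbox_path, out_dir, list_addr="www-talk@example.org"):
    """Write each message to its own numbered file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)
    written = []
    try:
        for number, msg in enumerate(box, start=1):
            from_line = msg.get_from()
            sender, when = split_envelope(from_line)
            # Assumption: the envelope time is UTC, hence the fixed +0000.
            stamp = time.strftime("%a, %d %b %Y %H:%M:%S +0000", when)
            received = RECEIVED_TEMPLATE.format(
                sender=sender, list_addr=list_addr, stamp=stamp)
            out_path = os.path.join(out_dir, str(number))
            with open(out_path, "wb") as fh:
                header = "From " + from_line + "\n" + received + "\n"
                fh.write(header.encode("ascii"))
                fh.write(msg.as_bytes())  # original headers and body
            written.append(out_path)
    finally:
        box.close()
    return written


# Hypothetical two-message input, used only to exercise the sketch.
SAMPLE = (
    "From alice@example.org Tue Dec 18 12:14:42 2012\n"
    "From: alice@example.org\n"
    "Subject: first\n"
    "\n"
    "Hello.\n"
    "\n"
    "From bob@example.org Tue Dec 18 12:20:00 2012\n"
    "From: bob@example.org\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "sample.mbox")
    with open(mbox_path, "w", newline="\n") as fh:
        fh.write(SAMPLE)
    written = to_mh_like(mbox_path, os.path.join(tmp, "out"))
    with open(written[0], "rb") as fh:
        first = fh.read()

print(first.decode("ascii"))
```

Because the Received: date is derived from the envelope line rather
than from the Date: header, the two always agree, which is the property
the sorting relies on. The sketch does nothing about messages that were
already merged in the source mbox; those would still need separate
handling.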
Received on Tuesday, 18 December 2012 16:08:20 UTC
