extracting RDF from RFC822 formatted email

From: Dan Connolly <connolly@w3.org>
Date: Fri, 14 Jul 2000 16:21:53 -0500
Message-ID: <396F8471.236A8632@w3.org>
To: www-rdf-interest@w3.org
There are very few data formats I trust... when I use
the computer to capture my knowledge, I pretty
much stick to plain text, XML (esp XHTML, or at least HTML that
tidy can turn into XHTML for me), RCS/CVS, and RFC822/MIME.
I use JPG, PNG, and PDF if I must,
but not for capturing knowledge for exchange, revision, etc.

I'm having pretty good luck extracting RDF from
XML/XHTML stuff using XSLT, e.g.
	http://www.w3.org/People/Connolly/smart-home.xsl
	http://www.w3.org/People/Connolly/home-smart.rdf
	http://www.w3.org/People/Connolly/events/events-smart.rdf

But I still mostly use messy perl/grep stuff for dealing
with my email, because email is so messy to parse. All
the perl and python libraries I've seen for email sort
of work, except for a few hundred weirdly formatted
messages in my archive. Then I found this fantastic resource:

	Internet mail message header format
	by D. J. Bernstein
	http://cr.yp.to/immhf.html

that has a wealth of knowledge about how to parse email.
(See also: anything written by jwz, esp the comments
in the grendel source code, btw
http://www.mozilla.org/projects/grendel/).

That, and a particular query I wanted to run over
my whole email archive, inspired me to write a little
perl script to extract RDF from my email -- at least
a little mid/date/from/to/subject log I keep of
my incoming mail:

	http://www.w3.org/2000/04/maillog2rdf/log2rdf.pl
	$Id: log2rdf.pl,v 1.2 2000/07/14 20:28:21 connolly Exp $
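
For illustration only (this is not the perl script above), here
is a minimal Python sketch of pulling the same
mid/date/from/to/subject fields out of one message with the
standard email package. The function name log_fields and the
sample message are assumptions made up for the example.

```python
from email import message_from_string

# The five fields kept in the log described above.
FIELDS = ("Message-ID", "Date", "From", "To", "Subject")

def log_fields(raw):
    """Return a dict of the logged header fields (hypothetical helper).

    Header lookup in the email package is case-insensitive, so
    "Message-Id" and "Message-ID" are treated alike. Missing
    headers come back as None.
    """
    msg = message_from_string(raw)
    return {name.lower(): msg.get(name) for name in FIELDS}

# A made-up message, just to exercise the function.
sample = (
    "Message-Id: <23@example.org>\n"
    "Date: Fri, 14 Jul 2000 16:21:53 -0500\n"
    "From: Dan <dan@example.org>\n"
    "To: Fred <fred@example.org>, Bob <bob@example.com>\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
record = log_fields(sample)
```

This handles only well-formed headers; the weirdly formatted
messages mentioned above are exactly where such a sketch is
likely to fall short.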

Of course, to encode stuff in RDF, I had to make up
a schema:

	Email Fields, an RDF Schema
	http://www.w3.org/2000/04/maillog2rdf/email#
	$Revision: 1.2 $ of $Date: 2000/07/14 20:29:32 $

I'm still wrestling with a few things, especially the
case of

	Message-Id: 23@example.org
	To: Fred <fred@example.org>, Bob <bob@example.com>

Should that be

	mid:23@example.org
		-- to --> mailto:fred@example.org
				--called--> "Fred"
		-- to --> mailto:bob@example.com
				--called--> "Bob"

i.e. is the mailbox called Fred? I wouldn't think so,
and RFC822 agrees: "The name reference is optional and is
usually used to indicate  the  human name of a recipient."
That suggests:

	mid:23@example.org
		-- to --> [recip1]
			--phrase-->"Fred"
			--addr-spec-->mailto:fred@example.org

		-- to --> [recip2]
			--phrase-->"Bob"
			--addr-spec-->mailto:bob@example.com
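
The recipient-node model above can be sketched in Python as
follows. This is only an illustration under assumptions: the
triples are plain tuples, the node labels (_:recip1, ...) and
the function name recipient_triples are invented here, and the
parsing is left to email.utils.getaddresses from the standard
library.

```python
from email.utils import getaddresses

def recipient_triples(message_id, to_header):
    """Emit (subject, property, object) tuples for a To: field.

    Each recipient becomes its own node carrying a phrase and an
    addr-spec, so the phrase is not asserted about the mailbox.
    """
    subject = "mid:" + message_id
    triples = []
    pairs = getaddresses([to_header])
    for i, (phrase, addr) in enumerate(pairs, start=1):
        node = "_:recip%d" % i          # hypothetical node label
        triples.append((subject, "to", node))
        if phrase:                      # the phrase is optional
            triples.append((node, "phrase", phrase))
        triples.append((node, "addr-spec", "mailto:" + addr))
    return triples

triples = recipient_triples(
    "23@example.org",
    "Fred <fred@example.org>, Bob <bob@example.com>",
)
```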

And that doesn't capture that there were no other
(stated) recipients. For that, I should model it ala:

	mid:23@example.org
		-- to --> [bag1]
			--first-->[recip1]
				(with phrase/addr-spec as above)
			--rest-->[bag2]
				--first-->[recip2] (as above)
				--rest-->empty

(I use first/rest/empty rather than _1 _2 to model lists.
See http://www.w3.org/2000/07/12-lists# )
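
As a rough sketch of the first/rest/empty structure, the Python
below builds the list cells for a sequence of recipient nodes.
The cell labels (_:bag1, ...), the literal string "empty" and
the function name list_triples are assumptions for the example,
not part of the schema.

```python
def list_triples(items, prefix="_:bag"):
    """Return (head, triples) for a first/rest/empty list.

    The list is built from the tail so that every cell can point
    at the cell that follows it; an empty input yields "empty".
    """
    triples = []
    head = "empty"
    for i in range(len(items), 0, -1):
        cell = "%s%d" % (prefix, i)     # hypothetical cell label
        triples.append((cell, "first", items[i - 1]))
        triples.append((cell, "rest", head))
        head = cell
    return head, triples

head, cells = list_triples(["_:recip1", "_:recip2"])
```

Because the last cell's rest is "empty", the structure states
that there are no further recipients, which a pair of separate
"to" arcs cannot.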

That's sort of a mouthful... but I suppose I can
use convenience rules/properties ala

	toAddr(?msg, ?addr) :- to(?msg, ?recips),
			includes(?recips, ?recip),
			addr-spec(?recip, ?addr).

	includes(?lst, ?item) :- first(?lst, ?item).
	includes(?lst, ?item) :- rest(?lst, ?lst2),
					includes(?lst2, ?item).
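
To make the two rules concrete, here is a small Python sketch
that evaluates them over tuples. It is not a rule engine, just
a direct transcription; the helper names (objects, includes,
to_addr), the TRIPLES data and the assumption that lists are
acyclic are all made up for the example.

```python
def objects(triples, subject, prop):
    """All objects o such that (subject, prop, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == prop]

def includes(triples, lst):
    """Yield the items of a first/rest list (assumed acyclic)."""
    while lst != "empty":
        for item in objects(triples, lst, "first"):
            yield item
        rest = objects(triples, lst, "rest")
        if not rest:                    # malformed list: stop
            break
        lst = rest[0]

def to_addr(triples, msg):
    """toAddr(?msg, ?addr): to, then includes, then addr-spec."""
    found = []
    for recips in objects(triples, msg, "to"):
        for recip in includes(triples, recips):
            found.extend(objects(triples, recip, "addr-spec"))
    return found

# The example message, modeled as a first/rest list.
TRIPLES = [
    ("mid:23@example.org", "to", "_:bag1"),
    ("_:bag1", "first", "_:recip1"),
    ("_:bag1", "rest", "_:bag2"),
    ("_:bag2", "first", "_:recip2"),
    ("_:bag2", "rest", "empty"),
    ("_:recip1", "phrase", "Fred"),
    ("_:recip1", "addr-spec", "mailto:fred@example.org"),
    ("_:recip2", "phrase", "Bob"),
    ("_:recip2", "addr-spec", "mailto:bob@example.com"),
]
addrs = to_addr(TRIPLES, "mid:23@example.org")
```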

I wish I had an RDF model for rules that I was happy with.



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Friday, 14 July 2000 17:22:25 GMT
