- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Wed, 06 Mar 2013 00:25:36 +0100
- To: www-archive@w3.org
- Message-ID: <k0vbj8tleu639akmgqadu5frmkmopl3gih@hive.bjoern.hoehrmann.de>
So, with http://www.ietf.org/mail-archive/web/rtcweb/current/msg06576.html it seems I am not the only one who feels that e-mail is going way downhill; http://www.ietf.org/mail-archive/web/rtcweb/current/msg06572.html is an example. There is an increasingly large body of evidence that people do indeed type [initials] markers manually...

There are a couple of things that I still need to identify automatically, and this time around I have looked at quote attribution lines (like "You wrote:"). I do not particularly wish to hard-code words like "wrote" in the identification code if it can be avoided (or perhaps left to a higher-level machine learning component), so I used other characteristics.

The general idea is to identify a region that stands out as an attribution line because it has many of the characteristics of one. For instance, an attribution line typically ends in a colon ':', so I mark colons at the end of a line. Attribution lines often include the name of "the author" of the parent mail. That is a bit more tricky (which header do you use; do you use the phrase from the From header only, or also comments; do you retain the ordering of names or do you turn "First Last" into "Last, First"; ...). For now I have used an approximation. The e-mail address of the author of the parent mail is also a good indicator.

Interestingly, time and date information is still quite common there, so as a start I take the Date header from the parent mail, split it into numeric components, and search for those literally, and then again after adjusting them to the time zone of the child e-mail's Date header. This seems to cover many, but not all, such components, and I have yet to find out why that is. It is conceivable, for instance, that the date in attribution lines is given in GMT while neither the parent nor the child e-mail uses GMT, but there may also be a bug in my time zone code.
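The date-component search described above can be sketched roughly as follows. This is an illustration only, written in Python rather than the Perl used for the actual code, and the function names `date_components` and `find_components` are hypothetical. It assumes both Date headers carry an explicit numeric zone, and it keeps both padded and unpadded forms of each number, which is a guess at what "searching literally" needs to cover.

```python
import re
from email.utils import parsedate_to_datetime


def date_components(parent_date, child_date=None):
    """Return the numeric parts of a Date header as a set of strings.

    If child_date is given, the parent's time is first shifted into the
    child's UTC offset. Assumes both headers include an explicit zone.
    """
    dt = parsedate_to_datetime(parent_date)
    if child_date is not None:
        child = parsedate_to_datetime(child_date)
        dt = dt.astimezone(child.tzinfo)
    numbers = (dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
    # Keep "6" as well as "06"; attribution lines may use either form.
    return {str(n) for n in numbers} | {f"{n:02d}" for n in numbers}


def find_components(text, components):
    """Return sorted (start, end, component) spans for standalone numbers."""
    spans = []
    for comp in components:
        # The lookarounds stop "25" from matching inside "1256".
        pattern = rf"(?<!\d){re.escape(comp)}(?!\d)"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), comp))
    return sorted(spans)
```

Running both the unadjusted and the adjusted component sets against a candidate line, and merging the spans, gives the offsets that would then be mapped back to tokens. A line that uses a 12-hour clock or spells out the month would still be missed by this sketch.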
Sadly I could not find a suitable module that adjusts the time zone and gives me the components without my code looking fragile. (One should keep in mind that time and date modules tend to do crazy things like reading $ENV{TZ}, which I have not set, but then my code might break on the next system. To start with, Date::Parse apparently accepts an optional time zone, but it seems to be ignored when the date string includes a zone, and the documentation is not very elaborate on that point.)

So I basically look for these numbers in the raw text and then map each match back to my list of tokens (using a suitable index that maps any offset in the text to the corresponding token, and the match positions that Perl offers through the `@+` and `@-` arrays). In the attached HTML document the matches are in <b> with a green background colour. It seems to work fairly well. Additional logic would be needed (for example to ignore these date components in quoted text, and to remove some false positives, telephone numbers for instance), but this seems to be enough to specify some threshold for green-ness on a line and, e.g., print out the suspected attribution lines, and filter out mails that do not have an identified attribution line, to come up with more rules.

(The simpler solution I considered earlier was to find a module that can recognise formatted date strings in text, which would seem to be a fairly standard problem, but apart from many people rolling their own, limited to a few locales, with the better ones offering hooks to add additional names for days and months, nothing useful turned up. I tried DateTime::Format::Natural for instance, but it split combined dates and times too much, failed to recognise common ones, and seemed to hang on larger input texts, besides being seemingly limited to English.)

I have also been thinking about what to do with Outlook blurbs like

    -----Original Message-----
    From: ...
    Sent: ...
    ...

It seems to me that how these are generated varies with the Outlook version. For instance, older versions of Outlook Express seem to generate this as unquoted text and do not include the Subject, while newer versions quote it and do include the Subject. It is also important to note that these blurbs are localised. They also tend to include things like those found in attribution lines, but it would require some work to get all the logic there, like how it ends up with the -bounces address on typical mailman mailing lists as "From: ... On Behalf Of ..." or the rules for using phrases versus addresses in To and Cc lines.

I figured it would be easiest to simply look for the Subject, but I would then have to study how it deals with white space normalization in that header. Also, given the Subject header is not always included, I figure it might be best to regex that problem away, so I apply roughly this to the raw body (which is prefixed by a "\n" to avoid dealing with "start of string" versus "start of line" issues):

    (\n[> ]*)-----[^\n-]+-----\s*
    (\1[^\n]+?:[ ](.+?)){3,}
    (\1\s*\n)

With modifiers sgx this seems to do the trick, short of somebody including such text on purpose. It would also be nice if there was better support for intersections and negations in regular expression implementations; for instance, this would be more robust if the middle part was prevented from matching "empty" lines (at the same quote nesting level). Parts matched by this are in <b> with a red background colour. (And as an aside, quoted text is replaced by '.', if that is not obvious.)

(Some people seem to trim the blurb; I am not sure yet how to handle that, if at all. It can also incorrectly match on PGP signatures; an example is <news:4B0B4778-37BF-4B9C-99A8-865B7BAD59A0@cdt.org>, which quotes a PGP-signed message. That would be addressed by the "no empty lines" rule I mentioned above, but that is not easily available...
I'd think I can cheat by requiring upper- and lowercase letters in whatever "Original Message" is translated to, which would work for German and English and for genuine PGP signatures, but beyond that I lack empirical data...)

(Also, the Outlook blurb sometimes appears without the initial dashed line, possibly generated by webmail products from the same vendor. That seems to be a bit harder to deal with without generating more false positives, so long as the contents of such blurbs are largely ignored.)

Some other things I have noticed include that apparently a popular webmail product likes to add entirely redundant duplicates of addresses; think of <example@example.com> becoming

    <example@example.com<mailto:example@example.com>>

The same happens for 'http:' addresses, and sometimes `<javascript:;>` is the added text. This occurs primarily in supposedly quoted text. And that is just the latest in a long list of mail crimes. Something also seems to add `*` characters in quoted text, partly in similarly mangled addresses, and sometimes, it seems, for emphasis. I have no idea how people sleep at night when their code generates rubbish like these links; perhaps people just snap after a while of reading MIME specs. A sanctuary for webmail programmers may be in order.

I will probably regex the address additions, marking the tokens they add as quoted if the surrounding text is quoted, and probably add a general rule that non-word tokens surrounded by quoted text are also considered quoted. That is needed anyway to handle inline ">" ...

(Sample document attached.)

regards.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/html attachment: 2012-07.mail.html
Received on Tuesday, 5 March 2013 23:27:19 UTC