Rule of thumb for email spam metrics

Hi,

  I get a lot of email. That is largely because humanity seems to be bad
at developing decent messaging infrastructure and user interfaces. Group
discussions, for instance, do not fit nicely with the email protocols. A
protocol more like NNTP would suit them better, at least for my uses (it
is easy to see, of course, how low-volume users might prefer a setup
that is entirely unworkable for very high-volume users). Anyway.

When spam levels surpassed half a million messages per year, I moved
spam detection into the "cloud", as they would now call it, and stopped
keeping an eye on the statistics as I used to. Over the past couple of
weeks I took a close look though, and the numbers come out nicely along
the lines of the following, with some rounding.

  X messages per 30 minutes identified as ham
  X messages per 60 minutes identified as spam
  X messages per day false negatives
  X messages per week false positives

The false negatives figure is so high because I do not train the filter
on false negatives, as the systems are largely disconnected. Oddly, it
is mostly spam in non-Latin languages and daily newsletters I obviously
never subscribed to (newsletters, too, would be better suited to a pull
medium resembling NNTP, with more centralization than the Atom world
usually employs; per-user pull is the norm there unless a shared feed
system is used, which does resemble NNTP more).

If I did train the filter on false negatives, I am pretty sure their
rate would match the rate of false positives. I gather that globally the
ham/spam ratio is more the other way around, which is easily explained
by the volume of ham in my particular inbox. Even without the training
the accuracy is at 98%, and would be at 99.7% if I am correct about my
assumption. That matches the usually claimed figure of "99%". The false
positives I note come mostly from the same people or are due to
configuration errors on my part (I have a forwarding setup that is
incompatible with SPF), so it seems rather plausible to push the error
rate down to 1 in 1000 levels.
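
As a rough check of the arithmetic, here is a minimal sketch in Python.
It assumes that the ham and spam figures are counts as classified by the
filter (so the total is their sum, with false negatives sitting inside
the ham bucket) and that accuracy means one minus misclassified messages
over all messages. The function and key names are made up for this
illustration. Under that reading, X cancels out, the observed accuracy
comes to about 98.4%, and the "trained" case to about 99.6%, which is
close to the 99.7% quoted above; the small gap may come from rounding or
from a slightly different way of counting.

```python
from fractions import Fraction


def per_day_counts(x=1):
    """Convert the four rule-of-thumb rates into messages per day.

    x is the integer X from the list; the key names are hypothetical.
    """
    return {
        "ham": Fraction(x) * 48,             # X per 30 minutes
        "spam": Fraction(x) * 24,            # X per 60 minutes
        "false_negatives": Fraction(x),      # X per day
        "false_positives": Fraction(x, 7),   # X per week
    }


def accuracy(counts, false_negatives=None):
    """Return 1 - (misclassified / total), as an exact fraction.

    Assumption: total = ham + spam as classified by the filter.
    Pass false_negatives to model a different false negative volume.
    """
    fn = counts["false_negatives"] if false_negatives is None else false_negatives
    total = counts["ham"] + counts["spam"]
    errors = fn + counts["false_positives"]
    return 1 - errors / total


counts = per_day_counts()
observed = accuracy(counts)
# Hypothetical "trained" case: false negatives drop to the false positive volume.
trained = accuracy(counts, false_negatives=counts["false_positives"])

print(f"observed accuracy: {float(observed):.1%}")
print(f"trained accuracy:  {float(trained):.1%}")
```

Because every figure is a multiple of X, calling per_day_counts with any
other positive integer gives the same two percentages.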

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Monday, 15 August 2011 23:36:06 UTC