Hi Paul
On 4/24/05, Paul Shields paul.shields-at-blueyonder.co.uk |surbl list| <...> wrote:
Below are some stats from our incoming mail since midnight 23/04/05. I'm not going to go into too deep an analysis as it's the weekend and I'm too 'tired' to do that ;).
The total number of messages with at least one ??_SURBL hit over the last 22 hours was around 1.4 million. The counts below show how many triggered as 'spam' ("result: Y"), and how many didn't trigger ("result: ."). This is based on our SpamAssassin default threshold of 8, but we have many custom rules, so the spam threshold is really only meaningful for our config - YMMV ;). We don't currently block or tag via SURBL/RBL at the MTA layer - everything goes through SA.
Wow, that's very useful information. Thanks an awful lot.
I have placed the numbers in a small table (also based on Jeff's table):
"Not tagged" means a SURBL hit on a message that SA did not mark as spam. It doesn't mean that those e-mails aren't spam, just that they pass this conservative filter.
List         Spam       Not tagged   % not tagged   % of total
AB           521886     604          0,116%         37,28%
WS           996200     12578        1,247%         71,16%
JP           1234602    4376         0,353%         88,19%
OB           1139181    36760        3,126%         81,37%
SC           751549     1095         0,145%         53,68%
PH           383        1            0,260%         0,03%
XS           939134     6283         0,665%         67,08%
XS-unique    10456      5300         33,638%        0,75%
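For anyone who wants to redo the arithmetic: the "% not tagged" column appears to be the not-tagged hits divided by all hits for that list (spam plus not tagged). That reading reproduces the figures in the table. A small Python sketch, where the function name is just my own choice:

```python
def pct_not_tagged(spam: int, not_tagged: int) -> float:
    """Share of a list's hits that SA did not mark as spam, in percent."""
    total_hits = spam + not_tagged
    return 100.0 * not_tagged / total_hits if total_hits else 0.0


# Counts copied from the table above: (spam, not tagged)
counts = {
    "AB": (521886, 604),
    "WS": (996200, 12578),
    "JP": (1234602, 4376),
    "OB": (1139181, 36760),
    "SC": (751549, 1095),
    "PH": (383, 1),
    "XS": (939134, 6283),
    "XS-unique": (10456, 5300),
}

for name, (spam, not_tagged) in counts.items():
    print(f"{name:10s} {pct_not_tagged(spam, not_tagged):7.3f}%")
```

The last column is not recomputed here, because it depends on the exact message total, which is only given as "around 1.4 million".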
Looking at the percentages, I see that JP (and SC) are good predictors of an e-mail being spam, and thus very usable if only a few checks can be made (for example, if scoring isn't an option).
Comparing XS with WS and OB, XS is clearly a better predictor than those two lists: 0,665% not tagged, against 1,247% for WS and 3,126% for OB.
However, that doesn't mean WS and OB have more FPs. It's possible that WS, OB and XS catch many spams that the other lists miss and that therefore pass through this SA setup untagged. In that case those lists would be very useful.
Looking at XS-unique, I wonder how many "unique" hits the other lists have, and how many of those unique hits pass the filter. It's possible that there's a big overlap between the lists (partly visible at http://www.surbl.org/permuted-hits.out.txt).
It's possible that XS is so much faster than the other lists that it's catching very new spam that still gets past them.
Some remarks:
- While the total number of e-mails is quite high, it's based on just one day; maybe one or two big spam runs are skewing the results.
- It would also be very useful to have info about the unique results of the other lists and of combinations of them.
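To make the second remark concrete, here is a rough Python sketch of how the unique and combined hits could be counted. It assumes that for every message we can get the set of SURBL lists that hit and whether SA tagged it as spam. That input format, the function names and the sample data are all my own invention, not something taken from Paul's setup.

```python
from collections import Counter


def combination_counts(messages):
    """Count messages per (set of lists hit, tagged?) combination.

    `messages` is an iterable of (lists_hit, tagged) pairs, where
    lists_hit is an iterable of list names and tagged is True when
    SA marked the message as spam.
    """
    counts = Counter()
    for lists_hit, tagged in messages:
        counts[(frozenset(lists_hit), bool(tagged))] += 1
    return counts


def unique_hits(counts, name):
    """Return (tagged, not_tagged) for messages hit by `name` only."""
    key = frozenset([name])
    return counts[(key, True)], counts[(key, False)]


# Made-up sample data, purely for illustration
sample = [
    ({"XS"}, False),
    ({"XS"}, True),
    ({"XS", "WS"}, True),
    ({"JP", "OB", "WS"}, True),
    ({"OB"}, False),
]
counts = combination_counts(sample)
print(unique_hits(counts, "XS"))
```

The same counter also holds every combination of lists, so the overlap between, say, WS and OB can be read from it without a second pass over the mail.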
BTW, as written before, a large corpus of domains that have occurred several times over a period of time in not-tagged e-mails could be a very effective first filter to avoid automatic inclusion. The nice thing is that this info doesn't need to be very new; I would even ignore the last week... A domain that occurred on several separate days in the past in several not-tagged e-mails is probably hammy. Of course, manual inclusion in the blacklist should still be possible. This info is probably more useful than the creation date of the domain and probably easier to produce (just a lookup in an internal list).
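A minimal Python sketch of such a lookup list, to show how little is needed. It assumes a log of (domain, date) sightings taken from not-tagged e-mails. The function name, the threshold of 3 separate days and the 7-day cutoff are placeholders of mine, and the sketch only counts distinct days, not the number of e-mails per day.

```python
from collections import defaultdict
from datetime import date, timedelta


def probably_hammy(sightings, today, min_days=3, ignore_last_days=7):
    """Return domains seen in not-tagged mail on at least `min_days`
    separate days, ignoring sightings from the last `ignore_last_days` days.
    """
    cutoff = today - timedelta(days=ignore_last_days)
    days_seen = defaultdict(set)
    for domain, day in sightings:
        if day < cutoff:  # skip recent sightings, they may be a new spam run
            days_seen[domain.lower()].add(day)
    return {d for d, days in days_seen.items() if len(days) >= min_days}


# Made-up sightings, purely for illustration
today = date(2005, 4, 24)
sightings = [
    ("example.com", date(2005, 4, 1)),
    ("example.com", date(2005, 4, 5)),
    ("example.com", date(2005, 4, 10)),
    ("example.net", date(2005, 4, 20)),  # inside the last week, ignored
    ("example.net", date(2005, 4, 21)),
    ("example.net", date(2005, 4, 22)),
    ("example.org", date(2005, 4, 2)),
    ("example.org", date(2005, 4, 2)),   # same day twice counts once
]
print(probably_hammy(sightings, today))
```

A domain in the returned set would only be held back from automatic inclusion; manual listing would still override it.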
Alain