Eric Kolve and I were looking at how to best set the default SpamCopURI scores for the various SURBL lists and at first we tried looking at the SpamAssassin 3.0 perceptron-generated scores as a possible guide:
http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
# The following block of scores were generated using the mass-checking
# scripts, and a perceptron to determine the optimum scores which
# resulted in minimum false positives or negatives. The scores are
# weighted to produce roughly 1 false positive in 2500 non-spam messages
# using the default threshold of 5.0.

score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant, and Theo Van Dinter pointed to the documentation:
$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used depends on how SpamAssassin is being used. The first score is used when both Bayes and network tests are disabled (score set 0). The second score is used when Bayes is disabled, but network tests are enabled (score set 1). The third score is used when Bayes is enabled and network tests are disabled (score set 2). The fourth score is used when Bayes is enabled and network tests are enabled (score set 3).
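To make the four-column layout concrete, here is a minimal Python sketch of the selection rule described in that perldoc excerpt. The function name select_score and its arguments are my own for illustration and are not part of SpamAssassin; the numbers are the URIBL_SC_SURBL values from the 50_scores.cf excerpt above.

```python
def select_score(scores, bayes_enabled, network_enabled):
    """Pick the active score from a four-value score line (hypothetical helper)."""
    if len(scores) != 4:
        raise ValueError("expected four scores, one per score set")
    # Score set number: 0 = neither, 1 = network only, 2 = Bayes only, 3 = both.
    score_set = (2 if bayes_enabled else 0) + (1 if network_enabled else 0)
    return scores[score_set]


# Values copied from the URIBL_SC_SURBL line quoted from 50_scores.cf.
sc_scores = [0, 3.897, 0, 4.263]

print(select_score(sc_scores, bayes_enabled=False, network_enabled=True))  # 3.897
print(select_score(sc_scores, bayes_enabled=True, network_enabled=True))   # 4.263
```

The zeros in the first and third columns follow from the rule: the SURBL tests are network tests, so they have no score in the two sets where network tests are disabled.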
We wondered if we could somehow use those scores with SpamCopURI and were unable to come up with a good answer.
Theo suggested looking at spam versus ham hit rates as a good way to set scores, to which I replied:
We have these test results from Justin from 25 June:
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  121405    22516    98889  0.185   0.00   0.00  (all messages)
 100.000  18.5462  81.4538  0.185   0.00   0.00  (all messages as %)
  13.453  70.3766   0.4925  0.993   1.00   1.00  SURBL_WS
   3.807  20.3811   0.0334  0.998   0.50   1.00  SURBL_SC
   2.650  14.2565   0.0071  1.000   0.50   1.00  SURBL_AB
   0.019   0.0933   0.0020  0.979   0.50   1.00  SURBL_PH
  12.624  67.6275   0.1001  0.999   0.50   1.00  SURBL_OB
which shows a pretty high FP rate for WS, less for the others. Do you happen to have access to any more recent corpus check data like this? Could be useful to have another snapshot for a more complete picture.
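For anyone unfamiliar with the S/O column: as far as I can tell it is the spam hit rate divided by the sum of the spam and ham hit rates, which reproduces the per-list values in Justin's table. The sketch below assumes that formula; the function name spam_over_overall is mine, and the figures are the SPAM% and HAM% columns from the June data.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O computed from percentage hit rates (assumed formula)."""
    return spam_pct / (spam_pct + ham_pct)


# (SPAM%, HAM%) pairs copied from Justin's 25 June results.
june = {
    "SURBL_WS": (70.3766, 0.4925),
    "SURBL_SC": (20.3811, 0.0334),
    "SURBL_AB": (14.2565, 0.0071),
    "SURBL_PH": (0.0933, 0.0020),
    "SURBL_OB": (67.6275, 0.1001),
}

for name, (spam_pct, ham_pct) in june.items():
    print(f"{name}  S/O = {spam_over_overall(spam_pct, ham_pct):.3f}")
```

Because it works on percentages rather than raw message counts, S/O calculated this way does not depend on the spam-to-ham mix of the corpus.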
Which was followed up with more data and discussion:
On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
high spam + low ham is good from an FP standpoint, but having a "significant" (for your definition thereof) ham hitrate means the score shouldn't be too high. My handwaving scores would be something like:
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
WS 1.2 SC 2.5 AB 3.5 OB 1.8
Theo then gave some of his own stats on a couple different corpora:
      OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
        416072   365031    51041  0.877   0.00   0.00  (all messages)
       100.000  87.7327  12.2673  0.877   0.00   0.00  (all messages as %)

set1    30.923  35.2466   0.0000  1.000   0.99   0.00  URIBL_SC_SURBL
set1    72.231  82.3273   0.0274  1.000   0.98   1.00  URIBL_OB_SURBL
set1    19.375  22.0847   0.0000  1.000   0.98   1.00  URIBL_AB_SURBL
set1    74.883  85.2939   0.4310  0.995   0.74   0.00  URIBL_WS_SURBL
set1     0.001   0.0000   0.0059  0.000   0.48   0.00  URIBL_PH_SURBL
      OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
        119215    67094    52121  0.563   0.00   0.00  (all messages)
       100.000  56.2798  43.7202  0.563   0.00   0.00  (all messages as %)

set3    39.217  69.6605   0.0288  1.000   0.98   1.00  URIBL_OB_SURBL
set3    10.340  18.3727   0.0000  1.000   0.97   0.00  URIBL_SC_SURBL
set3     5.998  10.6582   0.0000  1.000   0.94   1.00  URIBL_AB_SURBL
set3    42.730  75.5522   0.4797  0.994   0.73   0.00  URIBL_WS_SURBL
set3     0.008   0.0089   0.0058  0.608   0.49   0.00  URIBL_PH_SURBL
so for these results, I'd probably do something like:
WS 1.3 SC 4.0 AB 3.0 OB 2.2
since the hit rates and S/O are a bit higher for me, related to the fact I ran more spam through than Justin did.
To which I added:
Those final scores look like an excellent fit to the data to me.
and:
Also while the PH spam hit rate [from Justin's stats] is low, the data is of hand checked phishing scams, which deserve to be blocked due to their potential danger and damage.
Therefore I would tend to give PH a medium-high score like 3 to 5.
So we'll probably adjust the default scores on SpamCopURI to something like:
WS 1.3 SC 4.0 AB 3.0 OB 2.2 PH 4.5
and we recommend SpamCopURI users do likewise. Please be sure to use the latest version of SpamCopURI with multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
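For anyone who wants to apply the suggested defaults by hand, here is a small Python sketch that prints them as single-value score lines. The rule names it produces (SURBL_WS and so on, borrowed from Justin's table) are placeholders only; please check the rule names actually defined in your own SpamCopURI configuration and substitute them before using the output.

```python
# Suggested default scores from the discussion above.
suggested = {"WS": 1.3, "SC": 4.0, "AB": 3.0, "OB": 2.2, "PH": 4.5}


def score_lines(scores, name_template="SURBL_{code}"):
    """Format scores as 'score NAME value' lines.

    name_template is a placeholder; real rule names depend on your config.
    """
    return [
        f"score {name_template.format(code=code)} {value:.1f}"
        for code, value in scores.items()
    ]


for line in score_lines(suggested):
    print(line)
```

A single value after the rule name means the same score is used in every score set.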
One thing that stood out for me is that the FP rate (ham%) for ws.surbl.org is way too high, at about 0.45 to 0.5% across multiple corpora. That FP rate needs to be reduced for WS to be more fully useful.
I think Chris or maybe Raymond suggested that they had a way to reduce FPs in WS further. If so, ***please*** try to apply it. We need to get the FPs to be much less than 0.5%. The other lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
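To compare FP rates between lists on a given corpus, one can simply divide the WS ham% by each other list's ham%. A rough Python sketch using the HAM% column from Justin's June data follows; the function name fp_ratio_vs is mine, and lists with a ham% of zero are skipped to avoid dividing by zero.

```python
# HAM% (false positive rate) per list, copied from Justin's 25 June results.
june_ham_pct = {"WS": 0.4925, "SC": 0.0334, "AB": 0.0071, "PH": 0.0020, "OB": 0.1001}


def fp_ratio_vs(baseline, ham_pct):
    """How many times higher the baseline list's ham% is than each other list's."""
    base = ham_pct[baseline]
    return {
        name: base / pct
        for name, pct in ham_pct.items()
        if name != baseline and pct > 0  # skip zero rates to avoid division by zero
    }


ratios = fp_ratio_vs("WS", june_ham_pct)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"WS ham% is {ratio:.1f}x that of {name}")
```

On the June figures OB comes out at a little under 5 times lower than WS and SC at roughly 15 times lower. The ratios for the lists with very few ham hits are larger and noisier, since they rest on only a handful of messages.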
Does anyone have other corpus stats to share, in particular FP rates?
Jeff C.