Eric Kolve and I were looking at how to best set the default SpamCopURI scores for the various SURBL lists and at first we tried looking at the SpamAssassin 3.0 perceptron-generated scores as a possible guide:
http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
# The following block of scores were generated using the mass-checking
# scripts, and a perceptron to determine the optimum scores which
# resulted in minimum false positives or negatives. The scores are
# weighted to produce roughly 1 false positive in 2500 non-spam messages
# using the default threshold of 5.0.
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant, and Theo Van Dinter cited the documentation:
$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used depends on
how SpamAssassin is being used. The first score is used when both Bayes
and network tests are disabled (score set 0). The second score is used
when Bayes is disabled, but network tests are enabled (score set 1).
The third score is used when Bayes is enabled and network tests are
disabled (score set 2). The fourth score is used when Bayes is enabled
and network tests are enabled (score set 3).
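That selection rule is easy to express compactly. Here is a small illustrative Python sketch of the score-set logic described above; the function name is ours, not part of SpamAssassin:

```python
# Illustrative sketch of the score-set rule quoted above; the
# function name is ours, not part of SpamAssassin.
def score_set(bayes_enabled, network_enabled):
    """Return the index (0-3) into a four-score rule line."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)

# e.g. URIBL_OB_SURBL's line "0 1.996 0 3.213":
ob_scores = [0, 1.996, 0, 3.213]
print(ob_scores[score_set(True, True)])   # 3.213 (Bayes + network tests)
print(ob_scores[score_set(False, True)])  # 1.996 (network tests only)
```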
We wondered if we could somehow use those scores with SpamCopURI and were unable to come up with a good answer.
Theo suggested looking at spam versus ham hit rates as a good way to set scores, to which I replied:
We have these test results from Justin from 25 June:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
  121405    22516    98889   0.185   0.00   0.00  (all messages)
 100.000  18.5462  81.4538   0.185   0.00   0.00  (all messages as %)
  13.453  70.3766   0.4925   0.993   1.00   1.00  SURBL_WS
   3.807  20.3811   0.0334   0.998   0.50   1.00  SURBL_SC
   2.650  14.2565   0.0071   1.000   0.50   1.00  SURBL_AB
   0.019   0.0933   0.0020   0.979   0.50   1.00  SURBL_PH
  12.624  67.6275   0.1001   0.999   0.50   1.00  SURBL_OB
which shows a pretty high FP rate for WS, less for the others. Do you happen to have access to any more recent corpus check data like this? Could be useful to have another snapshot for a more complete picture.
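For readers unfamiliar with these hit-frequency tables, the S/O column appears to be the rule's spam hit percentage as a fraction of its total hit percentage; a quick sketch using the SURBL_WS row above:

```python
# S/O in these tables works out to spam% / (spam% + ham%), i.e.
# what fraction of the rule's hit frequency comes from spam.
# Figures below are from the SURBL_WS row above.
spam_pct, ham_pct = 70.3766, 0.4925
s_o = spam_pct / (spam_pct + ham_pct)
print(round(s_o, 3))  # 0.993, matching the table
```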
Which was followed up with more data and discussion:
On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
High spam + low ham is good from an FP standpoint, but having a "significant" (for your definition thereof) ham hit rate means the score shouldn't be too high. My handwaving scores would be something like:
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
WS 1.2
SC 2.5
AB 3.5
OB 1.8
Theo then gave some of his own stats on a couple different corpora:
OVERALL%     SPAM%     HAM%     S/O   RANK  SCORE  NAME
  416072    365031    51041   0.877   0.00   0.00  (all messages)
 100.000   87.7327  12.2673   0.877   0.00   0.00  (all messages as %)
set1  30.923  35.2466  0.0000  1.000  0.99  0.00  URIBL_SC_SURBL
set1  72.231  82.3273  0.0274  1.000  0.98  1.00  URIBL_OB_SURBL
set1  19.375  22.0847  0.0000  1.000  0.98  1.00  URIBL_AB_SURBL
set1  74.883  85.2939  0.4310  0.995  0.74  0.00  URIBL_WS_SURBL
set1   0.001   0.0000  0.0059  0.000  0.48  0.00  URIBL_PH_SURBL
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
  119215    67094    52121   0.563   0.00   0.00  (all messages)
 100.000  56.2798  43.7202   0.563   0.00   0.00  (all messages as %)
set3  39.217  69.6605  0.0288  1.000  0.98  1.00  URIBL_OB_SURBL
set3  10.340  18.3727  0.0000  1.000  0.97  0.00  URIBL_SC_SURBL
set3   5.998  10.6582  0.0000  1.000  0.94  1.00  URIBL_AB_SURBL
set3  42.730  75.5522  0.4797  0.994  0.73  0.00  URIBL_WS_SURBL
set3   0.008   0.0089  0.0058  0.608  0.49  0.00  URIBL_PH_SURBL
so for these results, I'd probably do something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
since the hit rates and S/O are a bit higher for me, related to the fact that I ran more spam through than Justin did.
To which I added:
Those final scores look like an excellent fit to the data to me.
and:
Also, while the PH spam hit rate [from Justin's stats] is low, the data consists of hand-checked phishing scams, which deserve to be blocked due to their potential for danger and damage.
Therefore I would tend to give PH a medium-high score like 3 to 5.
So we'll probably adjust the default scores on SpamCopURI to something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5
and we recommend SpamCopURI users do likewise. Please be sure to use the latest version of SpamCopURI with multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
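In SpamAssassin local.cf terms, the adjustment would look something like the fragment below. The rule names here are illustrative, since the exact names SpamCopURI defines may differ; check the spamcop_uri.cf shipped with your version:

```
# Hypothetical rule names -- check spamcop_uri.cf for the real ones.
score SPAMCOP_URI_WS 1.3
score SPAMCOP_URI_SC 4.0
score SPAMCOP_URI_AB 3.0
score SPAMCOP_URI_OB 2.2
score SPAMCOP_URI_PH 4.5
```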
One thing that stood out for me is that the FP rate (ham%) for ws.surbl.org is much too high, at about 0.45 to 0.5% across multiple corpora. That FP rate needs to be reduced for WS to be fully useful.
I think Chris or maybe Raymond suggested that they had a way to reduce FPs in WS further. If so, ***please*** try to apply it. We need to get the FPs to be much less than 0.5%. The other lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Does anyone have other corpus stats to share, in particular FP rates?
Jeff C.
Hi!
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
One thing that stood out for me is that the FP rate (ham%) for ws.surbl.org is much too high, at about 0.45 to 0.5% across multiple corpora. That FP rate needs to be reduced for WS to be fully useful.
I think Chris or maybe Raymond suggested that they had a way to reduce FPs in WS further. If so, ***please*** try to apply it. We need to get the FPs to be much less than 0.5%. The other lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Seeing those data, it would be very interesting if we could test a separate list. Is that possible? I would like to test the Prolo and Joe's lists combined, without the rest of the WS list. I can generate the data for a test like that. I have seen almost zero FPs in the data I compose, so perhaps it's better to separate the lists. I think people would benefit from a list with fewer FPs. The current WS list is compiled from too many data sources, I think.
Suggestions?
Shall I send you a list for testing so we can see if this would bump down the FP rates?
Bye, Raymond.
On Sunday, September 5, 2004, 3:30:49 AM, Raymond Dijkxhoorn wrote:
Seeing those data, it would be very interesting if we could test a separate list. Is that possible? I would like to test the Prolo and Joe's lists combined, without the rest of the WS list. I can generate the data for a test like that. I have seen almost zero FPs in the data I compose, so perhaps it's better to separate the lists. I think people would benefit from a list with fewer FPs. The current WS list is compiled from too many data sources, I think.
If you can make the different lists available to me by rsync, I can easily set up some temporary local SURBLs for testing them. Thank you rbldnsd! :-)
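For reference, a temporary test SURBL under rbldnsd can be little more than a "dnset" data file; a hedged sketch (the zone name, return value, and domains below are all made up for illustration):

```
# Sketch of an rbldnsd "dnset" data file for a temporary test zone;
# zone name, addresses, and domains are illustrative only.
# Run with something like:
#   rbldnsd -b 127.0.0.1 test.surbl.local:dnset:this-file
:127.0.0.2:Listed in test zone, see http://www.surbl.org/
baddomain.example
anotherbad.example
```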
Unfortunately I don't have my own test corpora, so I need to rely on the generosity of others who do. So I'd probably need to ask Theo, Daniel, Justin or others with corpora to test against them.
Jeff C.
Hi!
If you can make the different lists available to me by rsync, I can easily set up some temporary local SURBLs for testing them. Thank you rbldnsd! :-)
OK, I will make a test set available; if you generate zone files for the rsync box, I can put them on two servers to test with...
Unfortunately I don't have my own test corpora, so I need to rely on the generosity of others who do. So I'd probably need to ask Theo, Daniel, Justin or others with corpora to test against them.
Yes, that would be great. I'll mail you details offlist.
Bye, Raymond.
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
Does anyone have other corpus stats to share, in particular FP rates?
Sure. All of these messages were received in the past 10 days. A lot has happened since June. :-)
WS: 44004/54185s, 61/19150h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54185    19150   0.739   0.00   0.00  (all messages)
 100.000  73.8870  26.1130   0.739   0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836   0.999   0.00   0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit. I found several, but, for the most part, they've *already* been cleaned up and are no longer listed in WS. (30 out of the 61 were in a massive mailing list thread for a single domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed in WS:
buckeye-express.com -- Used in a personal email address, looks legit; 7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter; also the personal email address of one of the editors; 3 messages
00fun.com -- Confirmed; more than one user on our system sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks* like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home page, ehh... but I have 4 different legit newsletters with links to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've been advertised by some reputable "word of the day" mailers (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users attended a conference of theirs and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997); found in actual room booking confirmations for Comfort Inn.
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a hand-check.
- Ryan
On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
Yes, but please remember that not everyone has the ability to "score" their SURBL hits. Not everyone using SURBLs is using SpamAssassin.
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data. I know this can be a somewhat painful subject for people, but it's very important to clean up the false positives and make the lists better and more useful.
Sure. All of these messages were received in the past 10 days. A lot has happened since June. :-)
WS: 44004/54185s, 61/19150h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54185    19150   0.739   0.00   0.00  (all messages)
 100.000  73.8870  26.1130   0.739   0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836   0.999   0.00   0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
I found several, but, for the most part, they've *already* been cleaned up and are no longer listed in WS. (30 out of the 61 were in a massive mailing list thread for a single domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed in WS:
buckeye-express.com -- Used in a personal email address, looks legit; 7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter; also the personal email address of one of the editors; 3 messages
00fun.com -- Confirmed; more than one user on our system sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks* like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home page, ehh... but I have 4 different legit newsletters with links to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've been advertised by some reputable "word of the day" mailers (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users attended a conference of theirs and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997); found in actual room booking confirmations for Comfort Inn.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs. Signing up for a newsletter and then forgetting about it does not make a message spam.
Instead of having these go into SURBLs, they should be checked **before** they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general. That said, any reduction in FPs is important and welcome.
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a hand-check.
- Ryan
Thanks for your stats and checking, and yes, anyone else with ham corpora, please check for FPs.
Jeff C.
Jeff Chan wrote to SURBL Discussion list and users@spamassassin.apache.org:
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data.
You're welcome.
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
Agreed... some of these are really easy to catch.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs.
Good. Thanks!
Signing up for a newsletter then forgetting about does not make a message spam.
;-) Worse yet, even *with* a carefully and correctly classified corpus of *messages*, we all know that doesn't come anywhere *near* to guaranteeing a correctly classified list of URIs. That's where spamtraps fall short, and that's why we *need* hand-checking on every domain.
Instead of having these go into SURBLs, they should be checked **before** they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
Speaking for myself, I hand check absolutely everything I submit. I've spent at least half an hour digging up dirt on some single domains to correctly classify them (though, in many cases, that time is now greatly reduced thanks to GetURI), and, despite my best efforts, it's still likely that I've misclassified a few that haven't been reported as FPs yet.
But, yes, we really need to continue to look hard at sources and their methods to make sure *every* submitter is doing the right thing. It doesn't take many domains to seriously skew the FP rate, when we're talking about hundredths of percentage points.
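To put "hundredths of percentage points" in perspective, rough arithmetic for a ham corpus of the size discussed here:

```python
# Rough arithmetic: in a ham corpus of ~19,148 messages (as above),
# each false-positive hit moves the ham% figure by ~0.005 points.
ham_total = 19148
per_hit = 100.0 / ham_total
print(f"{per_hit:.4f}% per ham hit")              # ~0.0052% per hit
print(f"{round(0.5 / per_hit)} hits reach 0.5%")  # ~96 hits hit the 0.5% level
```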
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general.
Agreed. If any other SA users would like to send me their mass-check spam.log and ham.log with SURBL tests, I'll gladly combine, analyze, and post the hit frequencies.
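A minimal sketch of such a combine-and-tally step, assuming only that each mass-check log line carries a `tests=RULE1,RULE2,...` field (adjust the parsing if your logs differ); the sample log lines are illustrative:

```python
import re
from collections import Counter

def tally(path):
    """Count per-rule hits and total messages in one mass-check log."""
    hits, total = Counter(), 0
    with open(path) as fh:
        for line in fh:
            if line.startswith('#') or not line.strip():
                continue  # skip comments and blank lines
            total += 1
            m = re.search(r'tests=([\w,]+)', line)
            if m:
                hits.update(m.group(1).split(','))
    return hits, total

# Tiny self-contained demo in place of real contributed logs:
open('spam.log', 'w').write(
    "Y 12 msg1 tests=URIBL_WS_SURBL,URIBL_OB_SURBL\n"
    "Y  7 msg2 tests=URIBL_WS_SURBL\n")
open('ham.log', 'w').write(". 0 msg3 tests=URIBL_WS_SURBL\n")

spam_hits, n_spam = tally('spam.log')
ham_hits, n_ham = tally('ham.log')
for rule in sorted(set(spam_hits) | set(ham_hits)):
    s = 100.0 * spam_hits[rule] / max(n_spam, 1)
    h = 100.0 * ham_hits[rule] / max(n_ham, 1)
    print(f"{s:9.4f} {h:9.4f}  {rule}")
```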
Here's my latest, without those whitelisted ones:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73333    54186    19147   0.739   0.00   0.00  (all messages)
 100.000  73.8903  26.1097   0.739   0.00   0.00  (all messages as %)
  62.906  85.1308   0.0104   1.000   1.00   1.00  URIBL_PJ_SURBL
  23.738  32.1245   0.0052   1.000   0.89   4.00  URIBL_SC_SURBL
  66.122  89.4327   0.1515   0.998   0.82   3.00  URIBL_WS_SURBL
  21.525  29.1293   0.0052   1.000   0.76   5.00  URIBL_AB_SURBL
  56.618  76.6194   0.0157   1.000   0.71   4.00  URIBL_OB_SURBL
   0.001   0.0018   0.0000   1.000   0.64   2.00  URIBL_PH_SURBL
BUT... If I exclude the messages with domains from today's whitelist:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73310    54186    19124   0.739   0.00   0.00  (all messages)
 100.000  73.9135  26.0865   0.739   0.00   0.00  (all messages as %)
  66.104  89.4327   0.0052   1.000   1.00   3.00  URIBL_WS_SURBL
  62.926  85.1308   0.0105   1.000   0.74   1.00  URIBL_PJ_SURBL
  23.746  32.1245   0.0052   1.000   0.67   4.00  URIBL_SC_SURBL
  21.532  29.1293   0.0052   1.000   0.57   5.00  URIBL_AB_SURBL
   0.001   0.0018   0.0000   1.000   0.50   2.00  URIBL_PH_SURBL
  56.636  76.6194   0.0157   1.000   0.48   4.00  URIBL_OB_SURBL
I also found more to whitelist, but I'm working on a larger ham corpus for those. Details to follow...
That said, any reduction in FPs is important and welcome.
So why don't we hold our first 12-hour SURBL FP-a-thon?
- Ryan
Hi!
;-) Worse yet, even *with* a carefully and correctly classified corpus of *messages*, we all know that doesn't come anywhere *near* to guaranteeing a correctly classified list of URIs. That's where spamtraps fall short, and that's why we *need* hand-checking on every domain.
So far, the PJ list is hand-checked; uhm, well, there is one exception to that: the pillgang guys, with their two 'famous' nameservers. When we see those coming in, they get auto-added ;)
NS2.AUDI56SEW.BIZ
NS3.AIRMARAMBA.BIZ
Really a gazillion spam domains on those two nameservers.
Bye, Raymond.