Re: [SURBL-Discuss] FP rate? - Discuss

12 Feb 2005


      Hi Jeff
...
...
...
That said, here are some results Daniel Quinlan posted from the
mass-checks on the SpamAssassin corpora around 26 January 2005:
...
Weekly mass-check results for SURBL:
...
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  217996   164295    53701  0.754   0.00   0.00  (all messages)
 100.000  75.3661  24.6339  0.754   0.00   0.00  (all messages as %)
  11.644  15.4490   0.0037  1.000   0.98   3.90  URIBL_SC_SURBL
  39.572  52.4976   0.0261  1.000   0.98   3.00  URIBL_JP_SURBL
  51.955  68.9236   0.0391  0.999   0.96   2.00  URIBL_OB_SURBL
   5.690   7.5492   0.0000  1.000   0.95   2.01  URIBL_AB_SURBL
  53.948  71.5238   0.1769  0.998   0.83   0.54  URIBL_WS_SURBL
   0.030   0.0396   0.0000  1.000   0.51   0.84  URIBL_PH_SURBL
...
Am I right with the following :
...
JP has 0.0261% FP on the 24.6339% of all msg that are ham --> about
0.0064% of all msg (that is less than 1 in 15,000)
That sounds right, but the particular proportions of spam versus
ham may not be meaningful, i.e. they may not be representative
of an actual mail stream.  So the percentages are probably more
usefully compared only to spam or ham and not to a combined total
of messages.
ok
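
For anyone who wants to redo the arithmetic, here is a minimal Python sketch. It uses only the message counts and the JP HAM% value from the mass-check table; the function names are made up for illustration. It also prints the ham-only rate, since, as noted, the spam/ham mix of the corpus may not match a real mail stream.

```python
# Figures copied from the mass-check table in this thread.
TOTAL_MESSAGES = 217996
HAM_MESSAGES = 53701
JP_HAM_HIT_PCT = 0.0261  # HAM% column for URIBL_JP_SURBL


def fp_share_of_all_mail(ham_hit_pct, ham_count, total_count):
    """Return FP hits as a percentage of all messages in the corpus."""
    ham_share = ham_count / total_count  # fraction of the corpus that is ham
    return ham_hit_pct * ham_share


def one_in_n(pct):
    """Convert a percentage into a '1 in N' figure."""
    return 100.0 / pct


jp_fp_all = fp_share_of_all_mail(JP_HAM_HIT_PCT, HAM_MESSAGES, TOTAL_MESSAGES)
print(f"JP FPs as % of all messages: {jp_fp_all:.4f}")
print(f"roughly 1 in {one_in_n(jp_fp_all):.0f} of all messages")
print(f"roughly 1 in {one_in_n(JP_HAM_HIT_PCT):.0f} ham messages")
```

The last figure depends only on the ham corpus, so it is the one that carries over to a mail stream with a different spam/ham ratio.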
...
Certainly the relative percentages within spam or ham are
meaningful and mostly useful with the caveat that the spam
detection rates are wrong for quickly moving data in SC and AB
since the test corpora cover too much time for them.  (This is
more true for spam than ham since spam domains vary quickly with
time, but ham domains are relatively steady.)
ok
...
...
...
SC and AB have much better real world results than shown above
because their time period is much shorter than the test
corpora's.
...
Yes, but maybe the FP's will grow faster ;-)
That tends not to be the case.  The SpamCop data is filtered
multiple times and is human-checked at the front end.  The SC FP
rates are consistently among the lowest, and the spam detection
rates are very high for a very small list.  In short it's an
effective strategy.
ok and I am overall impressed with the low FP rates on all lists.
...
...
...
Also note that the JP data is now removed from the WS data, and
some old data was removed from WS.  So the WS spam and ham hit
rates have probably both decreased since this check was done.
JP should be about the same.
...
That will show in the future.  It is also a good thing.
Yes, it's fairer to the data sources.
...
...
...
And if possible, does anybody have statistics on FPs that were on
several of the sublists -at the same time-?
...
[snip]
...
...
I don't think that is known yet.  I had proposed setting up some
test lists with combinations like this, but got no response.  ;-)
If it *is* known I think we'd all like to hear about it.  :-)
...
I think it could be known to the great people that check the FP
reports.  Normally they check against all sublists (I hope) and fix
them all.
When we whitelist a domain, it's excluded from all SURBLs.  The
original data source is usually notified.
...
I know that not all FP's are reported and there are
probably no exact numbers, but it should give a good idea.  Or am I
wrong?
The FP reports are probably too few overall to be meaningful in
terms of differentiating performance between lists.  There just
aren't that many, maybe a few a day on average.
Yes, but I wasn't thinking of differentiating between the lists; there
are other results for that.  What I was thinking of was the number of FPs
that exist on more than one list.  This is very useful information
when combining lists.  If almost no FPs occur on more than one
list (at the same time), requiring appearance on at least 2 lists
would be a very safe policy.
Alain
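
The overlap count asked about above could be computed along these lines. This is only a sketch: the per-list FP sets are invented placeholder domains, not real SURBL data, and the function name is hypothetical. It assumes someone has, for each sublist, the set of domains that were confirmed as false positives while listed.

```python
from collections import Counter


def domains_on_at_least(listings, min_lists=2):
    """Return the domains that appear on at least min_lists of the lists."""
    counts = Counter()
    for domains in listings.values():
        counts.update(set(domains))  # count each domain once per list
    return {d for d, n in counts.items() if n >= min_lists}


# Hypothetical FP domains per sublist (placeholders, not real reports).
fp_by_list = {
    "WS": {"example.com", "example.org", "example.net"},
    "OB": {"example.org"},
    "JP": {"example.net", "example.org"},
    "SC": set(),
}

multi = domains_on_at_least(fp_by_list, 2)
all_fps = set().union(*fp_by_list.values())
print(sorted(multi))
print(f"{len(multi)} of {len(all_fps)} FP domains are on 2 or more lists")
```

If the share of FP domains on 2 or more lists turned out to be small on real data, that would support the idea that requiring two lists is a safe combination.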