New subject: RFC: pj.surbl.org - list from Joe Wein and Prolocation data

16 Sep 2004


As you know, WS has data from several different sources:
1. Bill Stearns' sa-blacklist
2. Chris Santerre and the SARE Ninjas' former BigEvil, MidEvil,
and other new ones.
3. Joe Wein's jwSpamSpy traps
4. Raymond's Prolocation traps and manual list.
5. MailSecurity lists
and probably many others I'm not even aware of.  So WS has
become a collection of many different data sources.  In some
cases, such as for the jw data, my initial thought was to set it
up as a separate list, but it was somewhat easier to let them all
be added together.  Raymond is currently feeding spamtrap data
into Joe's system also.
Raymond and I were looking at some of the data sources in WS and
their spam detection and false positive rates, and we tried an
experiment of checking his Prolocation spam data with Joe Wein's
to see what the results would be like.  We called that list "PJ"
for "Prolocation and Joe" and found that the FP rate on one large
corpus was significantly lower than that of WS (0.0208% of ham
versus 0.1367%), while the spam detection rate was approximately
the same:
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
2424443  2357143    67300    0.972   0.00    0.00  (all messages)
100.000  97.2241   2.7759    0.972   0.00    0.00  (all messages as %)
  7.595   7.8122   0.0045    0.999   1.00    0.00  URIBL_SC_SURBL
 76.754  78.9448   0.0178    1.000   0.80    0.00  URIBL_OB_SURBL
 77.230  79.4340   0.0208    1.000   0.60    1.00  URIBL_PJ_SURBL
  0.985   1.0126   0.0045    0.996   0.50    0.00  URIBL_AB_SURBL
 82.119  84.4600   0.1367    0.998   0.40    0.00  URIBL_WS_SURBL
  0.021   0.0216   0.0045    0.829   0.00    0.00  URIBL_PH_SURBL
(We removed FPs from both PJ and WS as a result of some of this
testing, so both should score relatively better now in terms of
FPs.  The spam hit rates on SC and AB are low because this spam
corpus includes many old spams with URIs which would have rolled
off these lists.  A test only on more recent days would show much
higher spam hit rates for SC and AB.)
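As a rough cross-check of the table above, the percentages can be turned back into approximate message counts and the S/O column recomputed. This is only a sketch: it assumes S/O is SPAM% / (SPAM% + HAM%), which reproduces the printed values to within rounding, and the names `RATES`, `ham_hits` and `s_o` are made up here for illustration.

```python
# Figures copied from the hit-frequency table above.
CORPUS_SPAM = 2357143   # spam messages in the corpus
CORPUS_HAM = 67300      # ham messages in the corpus

# rule name -> (SPAM%, HAM%) as printed in the table
RATES = {
    "URIBL_SC_SURBL": (7.8122, 0.0045),
    "URIBL_OB_SURBL": (78.9448, 0.0178),
    "URIBL_PJ_SURBL": (79.4340, 0.0208),
    "URIBL_AB_SURBL": (1.0126, 0.0045),
    "URIBL_WS_SURBL": (84.4600, 0.1367),
    "URIBL_PH_SURBL": (0.0216, 0.0045),
}

def ham_hits(ham_pct):
    """Approximate number of ham messages hit (false positives)."""
    return round(ham_pct / 100.0 * CORPUS_HAM)

def s_o(spam_pct, ham_pct):
    """Assumed S/O definition: the spam share of a rule's combined hit rate."""
    return spam_pct / (spam_pct + ham_pct)

for name, (spam_pct, ham_pct) in RATES.items():
    print(f"{name:16s} ham hits ~{ham_hits(ham_pct):3d}  "
          f"S/O {s_o(spam_pct, ham_pct):.3f}")
```

On these figures PJ hits about 14 of the 67300 ham messages and WS about 92.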
In Raymond's own test of only spam hits, a later version of PJ
got higher detection rates than WS:
SpamAssassin tag hits: (edited to top 10)
#1      108958  BAYES_99
#2      87001   URIBL_SBL
#3      84709   URIBL_PJ_SURBL
#4      81455   HTML_MESSAGE
#5      78177   RCVD_IN_BL_SPAMCOP_NET
#6      75546   URIBL_OB_SURBL
#7      74892   URIBL_WS_SURBL
#8      64610   URIBL_SC_SURBL
#9      58190   URIBL_AB_SURBL
#10     54230   MIME_HTML_ONLY
Since WS has relatively low scores in SA 3, presumably due to
the relatively high FP rate:
...
On Thu, Sep 02, 2004 at 08:09:17PM -0700, Jeff Chan wrote:
...
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
So what do the columns above mean?
(Theo replied:)
...
$ perldoc Mail::SpamAssassin::Conf
[...]
   If four valid scores are listed, then the score that is used
   depends on how SpamAssassin is being used. The first score is used
   when both Bayes and network tests are disabled (score set 0). The
   second score is used when Bayes is disabled, but network tests are
   enabled (score set 1). The third score is used when Bayes is
   enabled and network tests are disabled (score set 2). The fourth
   score is used when Bayes is enabled and network tests are enabled
   (score set 3).
we thought it might be useful to make the PJ data available as
a separate list, at least within multi.surbl.org, the combined
SURBL.  We'd like to get your comments on this.
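To make the four-column score lines quoted above concrete, here is a minimal sketch of how the applicable score would be picked according to the perldoc excerpt. The helpers `score_set` and `effective_score` are hypothetical and are not part of SpamAssassin.

```python
def score_set(bayes_enabled, net_enabled):
    """Return the score set number (0-3) described in the perldoc excerpt."""
    # Set 0: neither; set 1: network only; set 2: Bayes only; set 3: both.
    return (2 if bayes_enabled else 0) + (1 if net_enabled else 0)

def effective_score(score_line, bayes_enabled, net_enabled):
    """Pick the applicable value from a 'score NAME a b c d' line."""
    fields = score_line.split()
    if len(fields) != 6 or fields[0] != "score":
        raise ValueError("expected 'score NAME' followed by four values")
    values = [float(v) for v in fields[2:]]
    return values[score_set(bayes_enabled, net_enabled)]

# Score lines copied from the quoted message above.
ws = "score URIBL_WS_SURBL 0 0.539 0 1.462"
sc = "score URIBL_SC_SURBL 0 3.897 0 4.263"

print(effective_score(ws, bayes_enabled=True, net_enabled=True))    # 1.462
print(effective_score(sc, bayes_enabled=True, net_enabled=True))    # 4.263
print(effective_score(ws, bayes_enabled=False, net_enabled=False))  # 0.0
```

With both Bayes and network tests enabled, a WS hit adds 1.462 points against 4.263 for SC. The zeros in the first and third columns are the score sets in which network tests are disabled.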
We're also wondering whether the PJ data should be taken out of
WS, or left in, if we do make PJ a distinct list.  There's not
much downside in leaving PJ in WS, aside from a somewhat larger
standalone WS list.  On the other hand all of our lists are
currently standalone and not deliberate subsets in terms of
data sources.  But I assume most people will use multi, for which
the difference is small either way.
By the way, please don't use PJ for production data yet, unless
you are rsyncing the zone files, in which case you can mirror PJ
locally now from the rsync servers for testing purposes if you
like.  PJ is only being served on a couple of public servers for
our testing; we don't want to overload those servers.  PJ is
not in multi currently.  Note that PJ is only a test list for now
and may go away.
Please comment,
Jeff C.
