[SURBL-Discuss] RFC: pj.surbl.org - list from Joe Wein and
Prolocation data
Jeff Chan
jeffc at surbl.org
Thu Sep 16 01:43:32 CEST 2004
As you know, WS has data from several different sources:
1. Bill Stearns' sa-blacklist
2. Chris Santerre and the SARE Ninjas' former BigEvil and MidEvil
lists, plus other newer ones.
3. Joe Wein's jwSpamSpy traps
4. Raymond's Prolocation traps and manual list.
5. MailSecurity lists
and probably many others I'm not even aware of. So WS has
become a collection of many different data sources. In some
cases, such as for the jw data, my initial thought was to set it
up as a separate list, but it was somewhat easier to let them all
be added together. Raymond is currently feeding spamtrap data
into Joe's system also.
Raymond and I were looking at some of the data sources in WS and
their spam detection and false positive rates, and we tried an
experiment of checking his Prolocation spam data with Joe Wein's
to see what the results would be like. We called that list "PJ"
for "Prolocation and Joe" and found that the FP rate on one large
corpus was significantly lower than WS, while the spam detection
rate was approximately the same:
OVERALL%     SPAM%     HAM%    S/O   RANK  SCORE  NAME
 2424443   2357143    67300  0.972   0.00   0.00  (all messages)
 100.000   97.2241   2.7759  0.972   0.00   0.00  (all messages as %)
   7.595    7.8122   0.0045  0.999   1.00   0.00  URIBL_SC_SURBL
  76.754   78.9448   0.0178  1.000   0.80   0.00  URIBL_OB_SURBL
  77.230   79.4340   0.0208  1.000   0.60   1.00  URIBL_PJ_SURBL
   0.985    1.0126   0.0045  0.996   0.50   0.00  URIBL_AB_SURBL
  82.119   84.4600   0.1367  0.998   0.40   0.00  URIBL_WS_SURBL
   0.021    0.0216   0.0045  0.829   0.00   0.00  URIBL_PH_SURBL
(We removed FPs from both PJ and WS as a result of some of this
testing, so both should score relatively better now in terms of
FPs. The spam hit rates on SC and AB are low because this spam
corpus includes many old spams with URIs which would have rolled
off these lists. A test only on more recent days would show much
higher spam hit rates for SC and AB.)
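For anyone who wants to check the numbers, here is a rough sketch in Python that recomputes the S/O column from the SPAM% and HAM% columns and compares the ham (FP) hit rates of WS and PJ. It assumes S/O means spam% / (spam% + ham%), which matches the printed column to within rounding but is my reading of the output, not something stated in the report. The names so_ratio and ham_ratio are made up for this example.

```python
# Rows copied from the hit-frequency table: (name, spam %, ham %, S/O as printed).
ROWS = [
    ("URIBL_SC_SURBL", 7.8122, 0.0045, 0.999),
    ("URIBL_OB_SURBL", 78.9448, 0.0178, 1.000),
    ("URIBL_PJ_SURBL", 79.4340, 0.0208, 1.000),
    ("URIBL_AB_SURBL", 1.0126, 0.0045, 0.996),
    ("URIBL_WS_SURBL", 84.4600, 0.1367, 0.998),
    ("URIBL_PH_SURBL", 0.0216, 0.0045, 0.829),
]


def so_ratio(spam_pct, ham_pct):
    """Assumed definition: fraction of a rule's hit percentage that is spam."""
    return spam_pct / (spam_pct + ham_pct)


def ham_ratio(rows, a, b):
    """How many times more often list `a` hits ham than list `b` does."""
    ham = {name: ham_pct for name, _spam, ham_pct, _so in rows}
    return ham[a] / ham[b]


if __name__ == "__main__":
    for name, spam_pct, ham_pct, printed in ROWS:
        computed = so_ratio(spam_pct, ham_pct)
        print(f"{name:16s} computed S/O {computed:.4f}  printed {printed:.3f}")
    ratio = ham_ratio(ROWS, "URIBL_WS_SURBL", "URIBL_PJ_SURBL")
    print(f"WS hits ham about {ratio:.1f}x as often as PJ on this corpus")
```

On these figures WS hits ham roughly six to seven times as often as PJ, while the spam hit rates differ by only a few percentage points.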
In Raymond's own test of only spam hits, a later version of PJ
got higher detection rates than WS:
SpamAssassin tag hits: (edited to top 10)
#1 108958 BAYES_99
#2 87001 URIBL_SBL
#3 84709 URIBL_PJ_SURBL
#4 81455 HTML_MESSAGE
#5 78177 RCVD_IN_BL_SPAMCOP_NET
#6 75546 URIBL_OB_SURBL
#7 74892 URIBL_WS_SURBL
#8 64610 URIBL_SC_SURBL
#9 58190 URIBL_AB_SURBL
#10 54230 MIME_HTML_ONLY
Since WS has relatively low scores in SA 3, presumably due to
the relatively high FP rate:
> On Thu, Sep 02, 2004 at 08:09:17PM -0700, Jeff Chan wrote:
>> score URIBL_AB_SURBL 0 2.007 0 0.417
>> score URIBL_OB_SURBL 0 1.996 0 3.213
>> score URIBL_PH_SURBL 0 0.839 0 2.000
>> score URIBL_SC_SURBL 0 3.897 0 4.263
>> score URIBL_WS_SURBL 0 0.539 0 1.462
>>
>> So what do the columns above mean?
(Theo replied:)
> $ perldoc Mail::SpamAssassin::Conf
> [...]
> If four valid scores are listed, then the score that is used
> depends on how SpamAssassin is being used. The first score is used
> when both Bayes and network tests are disabled (score set 0). The
> second score is used when Bayes is disabled, but network tests are
> enabled (score set 1). The third score is used when Bayes is
> enabled and network tests are disabled (score set 2). The fourth
> score is used when Bayes is enabled and network tests are enabled
> (score set 3).
we thought it might be useful to make the PJ data available as
a separate list, at least within multi.surbl.org, the combined
SURBL. We'd like to get your comments on this.
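To make the four score columns quoted above concrete, here is a small Python sketch of the selection rule Theo quoted from the perldoc: network tests add 1 to the score set index and Bayes adds 2. This only illustrates the documented rule and is not SpamAssassin's own code; score_set and effective_score are hypothetical names.

```python
# Four-column score lines as quoted above: one value per score set 0..3.
SCORES = {
    "URIBL_AB_SURBL": (0, 2.007, 0, 0.417),
    "URIBL_OB_SURBL": (0, 1.996, 0, 3.213),
    "URIBL_PH_SURBL": (0, 0.839, 0, 2.000),
    "URIBL_SC_SURBL": (0, 3.897, 0, 4.263),
    "URIBL_WS_SURBL": (0, 0.539, 0, 1.462),
}


def score_set(bayes_enabled, network_enabled):
    """Score set index: 0 = neither, 1 = network only, 2 = Bayes only, 3 = both."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)


def effective_score(rule, bayes_enabled, network_enabled):
    """Pick the column that applies for the given Bayes/network configuration."""
    return SCORES[rule][score_set(bayes_enabled, network_enabled)]


if __name__ == "__main__":
    for rule in SCORES:
        net_only = effective_score(rule, bayes_enabled=False, network_enabled=True)
        both = effective_score(rule, bayes_enabled=True, network_enabled=True)
        print(f"{rule:16s} network only: {net_only:.3f}  Bayes + network: {both:.3f}")
```

The zeros in columns one and three are expected: these are network tests, so they contribute nothing when network tests are disabled. With both Bayes and network tests on, WS scores 1.462 against 3.213 for OB and 4.263 for SC.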
We're also wondering whether the PJ data should be taken out of
WS, or left in, if we do make PJ a distinct list. There's not
much downside in leaving PJ in WS, aside from a somewhat larger
standalone WS list. On the other hand all of our lists are
currently standalone and not deliberate subsets in terms of
data sources. But I assume most people will use multi, for which
the difference is small either way.
By the way, please don't use PJ in production yet. If you are
rsyncing the zone files, you can mirror PJ locally now from the
rsync servers for testing purposes if you like. PJ is only being
served on a couple of public servers for our testing, and we don't
want to overload those servers. PJ is not in multi currently.
Note that PJ is only a test list for now, and it may go away.
Please comment,
Jeff C.