As you know, WS has data from several different sources:
1. Bill Stearns' sa-blacklist
2. Chris Santerre and the SARE Ninjas' former BigEvil, MidEvil and other new ones.
3. Joe Wein's jwSpamSpy traps
4. Raymond's Prolocation traps and manual list.
5. MailSecurity lists
and probably many others I'm not even aware of. So WS has become a collection of many different data sources. In some cases, such as for the jw data, my initial thought was to set it up as a separate list, but it was somewhat easier to let them all be added together. Raymond is currently feeding spamtrap data into Joe's system also.
Raymond and I were looking at some of the data sources in WS and their spam detection and false positive rates, and we tried an experiment of checking his Prolocation spam data with Joe Wein's to see what the results would be like. We called that list "PJ" for "Prolocation and Joe" and found that the FP rate on one large corpus was significantly lower than WS, while the spam detection rate was approximately the same:
OVERALL%     SPAM%     HAM%    S/O   RANK  SCORE  NAME
 2424443   2357143    67300  0.972   0.00   0.00  (all messages)
 100.000   97.2241   2.7759  0.972   0.00   0.00  (all messages as %)
   7.595    7.8122   0.0045  0.999   1.00   0.00  URIBL_SC_SURBL
  76.754   78.9448   0.0178  1.000   0.80   0.00  URIBL_OB_SURBL
  77.230   79.4340   0.0208  1.000   0.60   1.00  URIBL_PJ_SURBL
   0.985    1.0126   0.0045  0.996   0.50   0.00  URIBL_AB_SURBL
  82.119   84.4600   0.1367  0.998   0.40   0.00  URIBL_WS_SURBL
   0.021    0.0216   0.0045  0.829   0.00   0.00  URIBL_PH_SURBL
(We removed FPs from both PJ and WS as a result of some of this testing, so both should score relatively better now in terms of FPs. The spam hit rates on SC and AB are low because this spam corpus includes many old spams with URIs which would have rolled off these lists. A test only on more recent days would show much higher spam hit rates for SC and AB.)
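To make the FP comparison in the table more concrete, here is a minimal Python sketch that turns the HAM% column back into approximate message counts and recomputes S/O. It assumes S/O is defined as spam% / (spam% + ham%), which matches the printed column to within rounding but is not stated in this thread. The function names are made up for illustration.

```python
# Ham corpus size, copied from the "(all messages)" row of the table.
TOTAL_HAM = 67300

# name -> (SPAM%, HAM%) exactly as printed in the table
RATES = {
    "URIBL_SC_SURBL": (7.8122, 0.0045),
    "URIBL_OB_SURBL": (78.9448, 0.0178),
    "URIBL_PJ_SURBL": (79.4340, 0.0208),
    "URIBL_AB_SURBL": (1.0126, 0.0045),
    "URIBL_WS_SURBL": (84.4600, 0.1367),
    "URIBL_PH_SURBL": (0.0216, 0.0045),
}


def s_over_o(spam_pct, ham_pct):
    """Assumed definition of S/O: spam% / (spam% + ham%)."""
    return spam_pct / (spam_pct + ham_pct)


def ham_hits(ham_pct, total_ham=TOTAL_HAM):
    """Approximate number of ham messages a list hit (false positives)."""
    return round(total_ham * ham_pct / 100.0)


def summarize(rates=RATES):
    """Return name -> (S/O rounded to 3 places, approximate ham hits)."""
    return {
        name: (round(s_over_o(spam, ham), 3), ham_hits(ham))
        for name, (spam, ham) in rates.items()
    }


if __name__ == "__main__":
    for name, (so, fps) in summarize().items():
        print(f"{name:16s} S/O={so:.3f}  ham hits ~{fps}")
```

On these figures, WS hit roughly 92 of the 67,300 ham messages and PJ roughly 14. The counts are approximate because the percentages in the table are already rounded.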
In Raymond's own test of only spam hits, a later version of PJ got higher detection rates than WS:
SpamAssassin tag hits: (edited to top 10)
 #1  108958  BAYES_99
 #2   87001  URIBL_SBL
 #3   84709  URIBL_PJ_SURBL
 #4   81455  HTML_MESSAGE
 #5   78177  RCVD_IN_BL_SPAMCOP_NET
 #6   75546  URIBL_OB_SURBL
 #7   74892  URIBL_WS_SURBL
 #8   64610  URIBL_SC_SURBL
 #9   58190  URIBL_AB_SURBL
#10   54230  MIME_HTML_ONLY
Since WS has relatively low scores in SA 3, presumably due to the relatively high FP rate:
On Thu, Sep 02, 2004 at 08:09:17PM -0700, Jeff Chan wrote:
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
So what do the columns above mean?
(Theo replied:)
$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used depends on how SpamAssassin is being used. The first score is used when both Bayes and network tests are disabled (score set 0). The second score is used when Bayes is disabled, but network tests are enabled (score set 1). The third score is used when Bayes is enabled and network tests are disabled (score set 2). The fourth score is used when Bayes is enabled and network tests are enabled (score set 3).
we thought it might be useful to make the PJ data available as a separate list, at least within multi.surbl.org, the combined SURBL. We'd like to get your comments on this.
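The four-column score lines quoted above can be read mechanically using the rule from the perldoc excerpt. The following is a small Python sketch of that rule, not SpamAssassin's own code: the helper names are hypothetical, and it only handles lines with exactly four scores.

```python
def score_set_index(bayes_enabled, net_enabled):
    """Map the two switches to score set 0..3 as described in the perldoc text.

    0: no Bayes, no network tests   1: no Bayes, network tests
    2: Bayes, no network tests      3: Bayes and network tests
    """
    return (2 if bayes_enabled else 0) + (1 if net_enabled else 0)


def parse_score_line(line):
    """Split 'score NAME s0 s1 s2 s3' into (NAME, [s0, s1, s2, s3])."""
    parts = line.split()
    if len(parts) != 6 or parts[0] != "score":
        raise ValueError(f"expected 'score NAME' plus four values: {line!r}")
    return parts[1], [float(value) for value in parts[2:]]


def effective_score(line, bayes_enabled, net_enabled):
    """Return (rule name, score that applies for the given configuration)."""
    name, scores = parse_score_line(line)
    return name, scores[score_set_index(bayes_enabled, net_enabled)]


if __name__ == "__main__":
    ws = "score URIBL_WS_SURBL 0 0.539 0 1.462"
    sc = "score URIBL_SC_SURBL 0 3.897 0 4.263"
    for line in (ws, sc):
        print(effective_score(line, bayes_enabled=True, net_enabled=True))
```

Read this way, the zeros in the first and third columns reflect that these are network tests, so they only score when network tests are enabled. With Bayes and network tests both on, WS scores 1.462 against 4.263 for SC.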
We're also wondering whether the PJ data should be taken out of WS, or left in, if we do make PJ a distinct list. There's not much downside in leaving PJ in WS, aside from a somewhat larger standalone WS list. On the other hand, all of our lists are currently standalone, and none is a deliberate subset of another in terms of data sources. But I assume most people will use multi, for which the difference is small either way.
By the way, please don't use PJ for production data yet, unless you are rsyncing the zone files, in which case you can mirror PJ locally now from the rsync servers for testing purposes if you like. PJ is only being served on a couple of public servers for our testing, and we don't want to overload those servers. PJ is not in multi currently. Note that PJ is only a test list for now and may go away.
Please comment,
Jeff C.
On Wed, 15 Sep 2004 16:43:32 -0700, Jeff Chan <jeffc@surbl.org> wrote:
> we thought it might be useful to make the PJ data available as a separate list, at least within multi.surbl.org, the combined SURBL. We'd like to get your comments on this.
I think having a separate list makes sense if the data quality is different from that of the pooled data it was previously connected to.
> We're also wondering whether the PJ data should be taken out of WS, or left in, if we do make PJ a distinct list.
No point in lowering the hit rate of the superset; any additional score added to a spam is better than none at all.
> Please comment,
The greater choice and control we provide SURBL users the better. If we have the ability to sustainably break data out like this and provide ongoing data quality ratings to aid score adjustments, I think we should do it.
On Wednesday, September 15, 2004, 7:06:34 PM, David Hooton wrote:
> On Wed, 15 Sep 2004 16:43:32 -0700, Jeff Chan <jeffc@surbl.org> wrote:
>> we thought it might be useful to make the PJ data available as a separate list, at least within multi.surbl.org, the combined SURBL. We'd like to get your comments on this.
> I think having a separate list makes sense if the data quality is different from that of the pooled data it was previously connected to.
>> We're also wondering whether the PJ data should be taken out of WS, or left in, if we do make PJ a distinct list.
> No point in lowering the hit rate of the superset; any additional score added to a spam is better than none at all.
>> Please comment,
> The greater choice and control we provide SURBL users the better. If we have the ability to sustainably break data out like this and provide ongoing data quality ratings to aid score adjustments, I think we should do it.
Thanks for your feedback David. Does anyone else have comments about the possibility of PJ? Making separate lists from the WS data is a little different from the direction we've been going lately, so it would be nice to get comments on it. We're still somewhat undecided about whether to do it or not....
As you can see from the first message about this, the FP rates of PJ look significantly lower than WS as a whole.
Jeff C.
> Thanks for your feedback David. Does anyone else have comments about the possibility of PJ? Making separate lists from the WS data is a little different from the direction we've been going lately, so it would be nice to get comments on it. We're still somewhat undecided about whether to do it or not....
> As you can see from the first message about this, the FP rates of PJ look significantly lower than WS as a whole.
I'm for splitting the data out of WS into its own list and including it in multi. Certainly, having a lower FP rate will encourage users to increase its weight in scoring messages.
Bret
On Thursday, September 16, 2004, 4:45:46 PM, Bret Miller wrote:
>> As you can see from the first message about this, the FP rates of PJ look significantly lower than WS as a whole.
> I'm for splitting the data out of WS into its own list and including it in multi. Certainly, having a lower FP rate will encourage users to increase its weight in scoring messages.
Exactly. With a lower FP rate, PJ could be scored higher than WS, and the data could therefore be more effective at stopping spam.
Jeff C.