On Thu, 23 Sep 2004 00:58:42 -0700, Jeff Chan jeffc@surbl.org wrote:
[Please post follow ups to the SURBL discuss list or to me.]
One of the distinct data sources currently feeding into ws.surbl.org includes data from Joe Wein and Raymond Dijkxhoorn with his colleagues at Prolocation. Raymond and Prolocation are currently processing more than 300,000 potential spams per day using Joe's jwSpamSpy server software and combining those with Joe's own results. In addition to the data processing software, Joe has an elaborate, thorough, and well-thought-out set of inclusion criteria which includes age of domain registration, manual checks, and other factors. The resulting data are an extensive list of spam URI domains with a very low false positive rate (hits on legitimate messages). We are calling this resulting data JP for Joe Wein + Prolocation.
The bottom line is that JP (called PJ in the table below) has a significantly lower false positive rate than WS while having similar spam detection rates, for example as measured against a large set of corpora belonging to Theo Van Dinter of SpamAssassin:
OVERALL%   SPAM%     HAM%     S/O     RANK   SCORE  NAME
2424443    2357143   67300    0.972   0.00   0.00   (all messages)
100.000    97.2241   2.7759   0.972   0.00   0.00   (all messages as %)
  7.595     7.8122   0.0045   0.999   1.00   0.00   URIBL_SC_SURBL
 76.754    78.9448   0.0178   1.000   0.80   0.00   URIBL_OB_SURBL
 77.230    79.4340   0.0208   1.000   0.60   1.00   URIBL_PJ_SURBL
  0.985     1.0126   0.0045   0.996   0.50   0.00   URIBL_AB_SURBL
 82.119    84.4600   0.1367   0.998   0.40   0.00   URIBL_WS_SURBL
  0.021     0.0216   0.0045   0.829   0.00   0.00   URIBL_PH_SURBL
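For readers unfamiliar with the columns, here is a rough Python sketch of how the S/O figures relate to the SPAM% and HAM% columns. It assumes S/O is SPAM% / (SPAM% + HAM%), which is an inference from the numbers in the table rather than something stated in this announcement. The function name is made up for illustration. Because the table percentages are rounded, a recomputed value can differ in the last digit for rules with very few hits.

```python
# Assumption: S/O = SPAM% / (SPAM% + HAM%), inferred from the table above.
def spam_over_overall(spam_pct, ham_pct):
    """Fraction of a rule's hits that land on spam."""
    return spam_pct / (spam_pct + ham_pct)

# (SPAM%, HAM%) pairs copied from the table; JP appears there as PJ.
rows = {
    "URIBL_SC_SURBL": (7.8122, 0.0045),
    "URIBL_OB_SURBL": (78.9448, 0.0178),
    "URIBL_PJ_SURBL": (79.4340, 0.0208),
    "URIBL_WS_SURBL": (84.4600, 0.1367),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(name, round(spam_over_overall(spam_pct, ham_pct), 3))

# How many times more often WS hits ham than JP does, per the HAM% column.
ws_to_jp_ham_ratio = rows["URIBL_WS_SURBL"][1] / rows["URIBL_PJ_SURBL"][1]
print("WS/JP ham hit ratio:", round(ws_to_jp_ham_ratio, 2))
```

On these figures WS and JP catch a similar share of spam, while WS hits ham several times more often, which is the comparison the announcement is drawing.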
So we feel the data could usefully be broken out into a separate list which could safely be scored higher than WS. Of course, we also continue to work on improving the false positive rate of WS. We propose making JP a separate list within multi.surbl.org, but *not* a standalone list like jp.surbl.org, since it's a major effort to set up entirely new lists and most people should be using multi now.
The main reason for announcing this change ahead of time is to allow developers of the many programs (in addition to SpamAssassin) now using SURBL data to update their code or configurations to take into account that the result codes in multi will be changing as a result of adding JP. JP would get the 64 bitmask, as in:
 2 = comes from sc.surbl.org
 4 = comes from ws.surbl.org
 8 = comes from phishing list (labelled as [ph] in multi)
16 = comes from ob.surbl.org
32 = comes from ab.surbl.org
64 = comes from jp list
So a record in SC, WS, and JP would give a value 127.0.0.70. One with WS, OB, and JP would resolve to 127.0.0.84, etc. Programs using multi.surbl.org should be updated accordingly.
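As an illustration of the arithmetic above, here is a minimal Python sketch that decodes the last octet of a multi.surbl.org answer into list names. The bit values are the ones listed in this announcement; the function and variable names are invented for the example and do not come from any particular program.

```python
# Bit values as listed in this announcement for multi.surbl.org.
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}

def decode_multi(address):
    """Return the list names encoded in a 127.0.0.x answer from multi."""
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        raise ValueError("not a multi-style answer: %r" % address)
    value = int(octets[3])
    # Test each bit separately so that combined results decode correctly.
    return [name for bit, name in sorted(SURBL_BITS.items()) if value & bit]

print(decode_multi("127.0.0.70"))  # ['sc', 'ws', 'jp']  (2 + 4 + 64)
print(decode_multi("127.0.0.84"))  # ['ws', 'ob', 'jp']  (4 + 16 + 64)
```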
Since JP is currently included in WS, there will be 100% overlap of JP entries in WS so that any record in JP will also be in WS. In other words about half of the WS records in multi will increase by 64 due to overlap with JP. But WS will continue to use the 4 bit, as before. If your programs are decoding the multi results using the bit positions, they should need no adjustments to continue to handle the WS data.
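To show why bit-position decoding needs no adjustment, here is a small Python sketch contrasting a bit test with an exact comparison. The helper names are hypothetical. The point is only that a WS-only record which used to return 127.0.0.4 will return 127.0.0.68 once the JP bit is set on it.

```python
WS_BIT = 4   # ws.surbl.org data within multi
JP_BIT = 64  # the new JP data within multi

def last_octet(address):
    """Return the final octet of a dotted-quad answer as an integer."""
    return int(address.rsplit(".", 1)[1])

def in_ws(address):
    """Bit test: keeps working when other bits are added to the result."""
    return bool(last_octet(address) & WS_BIT)

def in_ws_fragile(address):
    """Exact match: stops matching once the JP bit is set on a WS record."""
    return address == "127.0.0.4"

before = "127.0.0.4"                  # WS only, before JP is added
after = "127.0.0.%d" % (4 + JP_BIT)   # the same record once JP is added

print(in_ws(before), in_ws(after))                  # True True
print(in_ws_fragile(before), in_ws_fragile(after))  # True False
```

Programs that compare the whole address against a fixed value are the ones that would need updating.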
We hope that 5 days is not too short notice for this kind of change.... I will try to contact the developers of the various (non-SA) programs separately to make sure they're aware of the coming change. Hopefully most of them are on this announcement list however.
We were not able to get JP as a separate list in yesterday's SpamAssassin 3.0.0 full release, but we have gotten it into SA 3.1 development.
For now the JP data will continue to be included in WS, but just before SpamAssassin 3.1 gets released (probably in 6 months to a year from now), we will remove JP data from WS to make them separate lists within multi. This means that SpamAssassin 3.0 and other current users of WS will continue to get the benefits of JP under their default shipping configurations, and that JP can also be used separately by those who modify their configurations to take advantage of it.
In summary, we will:
- Add JP to multi.surbl.org on Monday September 27th.
(Note that like PH, JP would not be available as a separate list, only as part of multi.)
- Keep the JP data in WS for now, so that regular 3.0 users
get the advantages of JP also (as part of WS).
- Ask the SpamAssassin developers to score JP separately in
SA 3.1.
- Remove JP from WS before the final SA 3.1 mass check and
re-scoring is done, to make the two lists more separate for 3.1. (Note that the separation is removal of the specific subset arrangement suggested in #2. If that is done, there will still be some minor overlap of the records in WS and JP.)
- Inform people about removing JP from WS before we do it,
so existing WS users can add JP, etc.
Please post follow up questions or comments to the SURBL discuss list or to me personally.
It looks to me like a sensible way to handle this...
I followed your advice about SURBL scoring in a thread a few weeks ago (I think Theo or another ninja also participated), but WS has a somewhat low score... I didn't raise it, 'cause I have a setup that is very FP-sensitive (large ISPs), but would love to see high-quality, high-scoring multi sublists...
Do you have a current reasonable scoring for jp? (That is, considering that, for now, this score will be added to the ws score, since, until SA3.1, jp will be a subset of ws.)
Thanx.
On Friday, September 24, 2004, 8:51:52 AM, Mariano Absatz wrote:
On Thu, 23 Sep 2004 00:58:42 -0700, Jeff Chan jeffc@surbl.org wrote:
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
[...]
76.754 78.9448 0.0178 1.000 0.80 0.00 URIBL_OB_SURBL
77.230 79.4340 0.0208 1.000 0.60 1.00 URIBL_PJ_SURBL
[...]
82.119 84.4600 0.1367 0.998 0.40 0.00 URIBL_WS_SURBL
Do you have a current reasonable scoring for jp? (That is, considering that, for now, this score will be added to the ws score, since, until SA3.1, jp will be a subset of ws.)
Since OB and JP have about the same spam detection and false positive rates, their scores should probably be somewhat similar.
Not sure what to recommend in terms of scoring JP as added to WS, but JP should probably be scored higher than WS.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."