[Please post follow-ups to the SURBL discuss list or to me.]
One of the distinct data sources currently feeding into ws.surbl.org comes from Joe Wein and from Raymond Dijkxhoorn and his colleagues at Prolocation. Raymond and Prolocation are currently processing more than 300,000 potential spams per day using Joe's jwSpamSpy server software and combining those with Joe's own results. In addition to the data processing software, Joe has an elaborate, thorough, and well-thought-out set of inclusion criteria, which includes age of domain registration, manual checks, and other factors. The result is an extensive list of spam URI domains with a very low false positive rate (hits on legitimate messages). We are calling this resulting data JP, for Joe Wein + Prolocation.
The bottom line is that JP (called PJ in the table below) has a significantly lower false positive rate than WS while having a similar spam detection rate, for example as measured against a large set of corpora belonging to Theo Van Dinter of SpamAssassin:
OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
2424443   2357143  67300   0.972  0.00  0.00   (all messages)
100.000   97.2241  2.7759  0.972  0.00  0.00   (all messages as %)
7.595     7.8122   0.0045  0.999  1.00  0.00   URIBL_SC_SURBL
76.754    78.9448  0.0178  1.000  0.80  0.00   URIBL_OB_SURBL
77.230    79.4340  0.0208  1.000  0.60  1.00   URIBL_PJ_SURBL
0.985     1.0126   0.0045  0.996  0.50  0.00   URIBL_AB_SURBL
82.119    84.4600  0.1367  0.998  0.40  0.00   URIBL_WS_SURBL
0.021     0.0216   0.0045  0.829  0.00  0.00   URIBL_PH_SURBL
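To put the HAM% column in absolute terms, the percentages can be converted back to approximate message counts. The Python sketch below is only an illustration: it assumes HAM% is a percentage of the 67300 ham messages shown in the "(all messages)" row, and the helper name estimated_ham_hits is made up for this example.

```python
# Illustrative sketch: convert HAM% values from the table into approximate
# counts of legitimate (ham) messages hit by each rule.
# Assumption: HAM% is a percentage of the 67300 ham messages in the corpus.
HAM_TOTAL = 67300

HAM_PERCENT = {
    "URIBL_SC_SURBL": 0.0045,
    "URIBL_OB_SURBL": 0.0178,
    "URIBL_PJ_SURBL": 0.0208,
    "URIBL_AB_SURBL": 0.0045,
    "URIBL_WS_SURBL": 0.1367,
    "URIBL_PH_SURBL": 0.0045,
}


def estimated_ham_hits(ham_percent, ham_total=HAM_TOTAL):
    """Return the approximate number of ham messages a rule hit."""
    return round(ham_total * ham_percent / 100.0)


if __name__ == "__main__":
    for name, percent in HAM_PERCENT.items():
        print(name, estimated_ham_hits(percent))
```

Under that assumption, PJ hits roughly 14 ham messages where WS hits roughly 92.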
So we feel the data could usefully be broken out into a separate list which could safely be scored higher than WS. We also continue to work on improving the false positive rate of WS, of course. We propose making JP a separate list within multi.surbl.org, but *not* a standalone list like jp.surbl.org, since it's a major effort to set up entirely new lists and most people should be using multi now.
The main reason for announcing this change ahead of time is to allow developers of the many programs (in addition to SpamAssassin) now using SURBL data to update their code or configurations to take into account that the result codes in multi will be changing as a result of adding JP. JP would get the 64 bitmask, as in:
2 = comes from sc.surbl.org
4 = comes from ws.surbl.org
8 = comes from phishing list (labelled as [ph] in multi)
16 = comes from ob.surbl.org
32 = comes from ab.surbl.org
64 = comes from jp list
So a record in SC, WS, and JP would give a value of 127.0.0.70 (2 + 4 + 64). One in WS, OB, and JP would resolve to 127.0.0.84 (4 + 16 + 64), etc. Programs using multi.surbl.org should be updated accordingly.
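For developers updating their programs, the decoding can be sketched in Python roughly as follows. This is a minimal illustration rather than code from any SURBL client: the bit values are the ones listed above, while the names SURBL_BITS and decode_multi are hypothetical.

```python
# Illustrative sketch: decode a multi.surbl.org answer such as 127.0.0.70
# into the lists it represents. Bit values are those given in the
# announcement; the table and function names are made up for this example.
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}


def decode_multi(address):
    """Return the list names encoded in the last octet, lowest bit first."""
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        raise ValueError("not a multi.surbl.org result: %r" % address)
    value = int(octets[3])
    # Test each known bit independently; unknown bits are ignored.
    return [name for bit, name in sorted(SURBL_BITS.items()) if value & bit]


if __name__ == "__main__":
    print(decode_multi("127.0.0.70"))  # record in SC, WS, and JP
    print(decode_multi("127.0.0.84"))  # record in WS, OB, and JP
```

Ignoring unknown bits is a deliberate choice in this sketch, so that a later addition to multi would not break the decoder.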
Since JP is currently included in WS, there will be 100% overlap of JP entries in WS so that any record in JP will also be in WS. In other words about half of the WS records in multi will increase by 64 due to overlap with JP. But WS will continue to use the 4 bit, as before. If your programs are decoding the multi results using the bit positions, they should need no adjustments to continue to handle the WS data.
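The difference between comparing whole result values and testing bit positions can be illustrated as follows. This is a hypothetical Python sketch, not taken from any particular program; both function names are made up, and the example record is assumed to be in WS only before the change and in WS and JP afterwards.

```python
# Illustrative sketch: why bit tests survive the addition of JP while
# whole-value comparisons do not.
WS_BIT = 4
JP_BIT = 64


def last_octet(address):
    """Return the final octet of a dotted-quad address as an integer."""
    return int(address.rsplit(".", 1)[1])


def in_ws_by_equality(address):
    # Fragile: only matches a record that is in WS and in no other list.
    return address == "127.0.0.4"


def in_ws_by_bit(address):
    # Robust: ignores whatever other list bits happen to be set.
    return bool(last_octet(address) & WS_BIT)


before = "127.0.0.4"   # record in WS only, before JP is added
after = "127.0.0.68"   # same record once the JP bit (64) is also set

if __name__ == "__main__":
    print(in_ws_by_equality(before), in_ws_by_equality(after))
    print(in_ws_by_bit(before), in_ws_by_bit(after))
```

A program written like in_ws_by_equality would stop recognizing such a record as WS after the change, which is the case this advance notice is meant to catch.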
We hope that 5 days is not too short notice for this kind of change.... I will try to contact the developers of the various (non-SA) programs separately to make sure they're aware of the coming change. Hopefully most of them are on this announcement list however.
We were not able to get JP as a separate list in yesterday's SpamAssassin 3.0.0 full release, but we have gotten it into SA 3.1 development.
For now the JP data will continue to be included in WS, but just before SpamAssassin 3.1 gets released (probably in 6 months to a year from now), we will remove JP data from WS to make them separate lists within multi. This means that SpamAssassin 3.0 and other current users of WS will continue to get the benefits of JP under their default shipping configurations, and that JP can also be used separately by those who modify their configurations to take advantage of it.
In summary, we will:
1. Add JP to multi.surbl.org on Monday September 27th. (Note that like PH, JP would not be available as a separate list, only as part of multi.)
2. Keep the JP data in WS for now, so that regular 3.0 users get the advantages of JP also (as part of WS).
3. Ask the SpamAssassin developers to score JP separately in SA 3.1.
4. Remove JP from WS before the final SA 3.1 mass check and re-scoring is done, to make the two lists more separate for 3.1. (Note that this separation means ending the subset arrangement described in #2. Once that is done, there will still be some minor overlap between the records in WS and JP.)
5. Inform people about removing JP from WS before we do it, so existing WS users can add JP, etc.
Please post follow-up questions or comments to the SURBL discuss list or to me personally.
Thanks,
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
This is a reminder that we will be adding JP as a new list within multi.surbl.org, as described in the previous announcement:
http://lists.surbl.org/pipermail/announce/2004-September/000077.html
on Monday September 27th. JP will have the bitmask value of 64, which means about half of the WS records will have results that increase by 64. We'll probably make the change around close of business U.S. East Coast time or around 22:00 UTC/GMT.
For now, JP records will continue to be included in WS, but when SpamAssassin 3.1 gets released, the JP data will come out of WS and these two will become separate lists within multi. Please update your programs accordingly.
SpamAssassin users won't need to make any changes to keep using WS, but should probably add JP to their configurations now so that they will be ready for the future change, and also to gain the significant benefits of the separate JP list now:
http://www.surbl.org/quickstart.html
__
jp - jwSpamSpy + Prolocation data source
Joe Wein's jwSpamSpy program is used both by Joe's own systems and also Raymond Dijkxhoorn and his colleagues at Prolocation to process more than 300,000 likely spams per day. The resulting list has a very good spam detection rate around 80% and a very low false positive rate below 0.02%. This data is only available in the combined list multi.surbl.org.
An SA 2.63 and 2.64 rule and score using SpamCopURI 0.22 or later looks like this:
uri JP_URI_RBL eval:check_spamcop_uri_rbl('multi.surbl.org','127.0.0.0+64')
describe JP_URI_RBL URI's domain appears in JP at http://www.surbl.org/lists.html
tflags JP_URI_RBL net
score JP_URI_RBL 4.0
An SA 3.0 rule and score using URIBL's urirhssub looks like this:
urirhssub URIBL_JP_SURBL multi.surbl.org. A 64
header URIBL_JP_SURBL eval:check_uridnsbl('URIBL_JP_SURBL')
describe URIBL_JP_SURBL Contains a URL listed in JP at http://www.surbl.org/lists.html
tflags URIBL_JP_SURBL net
score URIBL_JP_SURBL 4.0
__
JP has approximately the same spam detection and false positive rates as OB and should probably be scored accordingly. The data are not the same, however, since JP uses different data sources and Joe Wein's processing algorithms and inclusion policies.
Jeff C.
JP is now active as part of multi. Please give it a try. We think you'll like the results. :-)
__
http://www.surbl.org/quickstart.html
jp - jwSpamSpy + Prolocation data source
Joe Wein's jwSpamSpy program is used both by Joe's own systems and also Raymond Dijkxhoorn and his colleagues at Prolocation to process more than 300,000 likely spams per day. The resulting list has a very good spam detection rate around 80% and a very low false positive rate below 0.02%. This data is only available in the combined list multi.surbl.org.
An SA 2.63 and 2.64 rule and score using SpamCopURI 0.22 or later looks like this:
uri JP_URI_RBL eval:check_spamcop_uri_rbl('multi.surbl.org','127.0.0.0+64')
describe JP_URI_RBL URI's domain appears in JP at http://www.surbl.org/lists.html
tflags JP_URI_RBL net
score JP_URI_RBL 4.0
An SA 3.0 rule and score using URIBL's urirhssub looks like this:
urirhssub URIBL_JP_SURBL multi.surbl.org. A 64
header URIBL_JP_SURBL eval:check_uridnsbl('URIBL_JP_SURBL')
describe URIBL_JP_SURBL Contains a URL listed in JP at http://www.surbl.org/lists.html
tflags URIBL_JP_SURBL net
score URIBL_JP_SURBL 4.0
__
Note that JP is not available as a separate list, only as part of multi.surbl.org. Use it with urirhssub or SpamCopURI 0.22 as described above. Please see the lists document mentioned in the description for more information about JP.
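For programs other than SpamAssassin, a JP check against multi might look roughly like the Python sketch below. It is an illustration under stated assumptions, not an official client: it assumes the usual SURBL convention of prepending the URI's domain to the zone name and reading the list bits from the last octet of the returned A record. The names query_name and is_listed_in_jp, and the injectable resolver parameter, are made up for this example.

```python
# Illustrative sketch: check whether a URI domain is listed in JP by
# querying multi.surbl.org and testing the 64 bit of the answer.
import socket

MULTI_ZONE = "multi.surbl.org"
JP_BIT = 64


def query_name(domain, zone=MULTI_ZONE):
    """Build the DNS name to look up for a URI domain."""
    return "%s.%s" % (domain.strip(".").lower(), zone)


def is_listed_in_jp(domain, resolver=socket.gethostbyname):
    """Return True if the multi answer for domain has the JP bit set.

    resolver is injectable so the logic can be exercised without network
    access; by default the system resolver is used.
    """
    try:
        address = resolver(query_name(domain))
    except socket.gaierror:
        # No A record: the domain is not on any list in multi.
        return False
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        # Unexpected answer, for example from a wildcarding resolver.
        return False
    return bool(int(octets[3]) & JP_BIT)
```

Testing the bit rather than comparing the whole address means the same check keeps working whether or not the record is also in WS or any other list.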
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/