[SURBL-Discuss] Re: [SURBL-Announce] ANNOUNCE: Adding new JP list
to multi.surbl.org
Mariano Absatz
el.baby at gmail.com
Fri Sep 24 17:51:52 CEST 2004
On Thu, 23 Sep 2004 00:58:42 -0700, Jeff Chan <jeffc at surbl.org> wrote:
> [Please post follow ups to the SURBL discuss list or to me.]
>
> One of the distinct data sources currently feeding into
> ws.surbl.org includes data from Joe Wein and Raymond Dijkxhoorn
> with his colleagues at Prolocation. Raymond and Prolocation
> are currently processing more than 300,000 potential spams per
> day using Joe's jwSpamSpy server software and combining those
> with Joe's own results. In addition to the data processing
> software, Joe has an elaborate, thorough, and well-thought-out
> set of inclusion criteria which includes age of domain
> registration, manual checks, and other factors. The resulting
> data are an extensive list of spam URI domains with a very
> low false positive rate (hits on legitimate messages). We
> are calling this resulting data JP for Joe Wein + Prolocation.
>
> The bottom line is that JP (called PJ in the table below) has a
> significantly lower false positive rate than WS while having
> similar spam detection rates, for example as measured against a
> large corpora set belonging to Theo Van Dinter of SpamAssassin:
>
> OVERALL%     SPAM%     HAM%    S/O   RANK  SCORE  NAME
>  2424443   2357143    67300  0.972   0.00   0.00  (all messages)
>  100.000   97.2241   2.7759  0.972   0.00   0.00  (all messages as %)
>    7.595    7.8122   0.0045  0.999   1.00   0.00  URIBL_SC_SURBL
>   76.754   78.9448   0.0178  1.000   0.80   0.00  URIBL_OB_SURBL
>   77.230   79.4340   0.0208  1.000   0.60   1.00  URIBL_PJ_SURBL
>    0.985    1.0126   0.0045  0.996   0.50   0.00  URIBL_AB_SURBL
>   82.119   84.4600   0.1367  0.998   0.40   0.00  URIBL_WS_SURBL
>    0.021    0.0216   0.0045  0.829   0.00   0.00  URIBL_PH_SURBL
>
> So we feel the data could usefully be broken out into a
> separate list which could safely be scored higher than
> WS. We also continue to work on improving the False Positive
> rate of WS of course. We propose making JP a separate list
> within multi.surbl.org, but *not* a standalone list like
> jp.surbl.org, since it's a major effort to set up entirely
> new lists and most people should be using multi now.
>
> The main reason for announcing this change ahead of time
> is to allow developers of the many programs (in addition to
> SpamAssassin) now using SURBL data to update their code or
> configurations to take into account that the result codes in
> multi will be changing as a result of adding JP. JP would get
> the 64 bitmask, as in:
>
> 2 = comes from sc.surbl.org
> 4 = comes from ws.surbl.org
> 8 = comes from phishing list (labelled as [ph] in multi)
> 16 = comes from ob.surbl.org
> 32 = comes from ab.surbl.org
> 64 = comes from jp list
>
> So a record in SC, WS, and JP would give a value 127.0.0.70.
> One with WS, OB, and JP would resolve to 127.0.0.84, etc.
> Programs using multi.surbl.org should be updated accordingly.
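
As an illustration of the bitmask scheme quoted above, here is a minimal Python sketch that decodes the last octet of a multi.surbl.org answer into list names. The bit values are taken from the announcement; the names `SURBL_BITS` and `decode_multi` are made up for this example and do not come from any SURBL or SpamAssassin code.

```python
import ipaddress
from typing import List

# Bit assignments as listed in the announcement (64 is the new JP bit).
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}


def decode_multi(address: str) -> List[str]:
    """Return the list names encoded in a 127.0.0.x multi answer."""
    packed = ipaddress.IPv4Address(address).packed
    if packed[:3] != bytes([127, 0, 0]):
        raise ValueError("not a 127.0.0.x result: " + address)
    mask = packed[3]
    # Test each bit independently, so that combined answers decode correctly.
    return [name for bit, name in SURBL_BITS.items() if mask & bit]


print(decode_multi("127.0.0.70"))  # SC + WS + JP
print(decode_multi("127.0.0.84"))  # WS + OB + JP
```

The two calls use the example addresses from the announcement: 70 = 2 + 4 + 64 and 84 = 4 + 16 + 64.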
>
> Since JP is currently included in WS, there will be 100%
> overlap of JP entries in WS so that any record in JP will
> also be in WS. In other words about half of the WS records
> in multi will increase by 64 due to overlap with JP. But
> WS will continue to use the 4 bit, as before. If your
> programs are decoding the multi results using the bit
> positions, they should need no adjustments to continue to
> handle the WS data.
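
To make the point about bit positions concrete, here is a small hedged sketch comparing two ways a client might check for a WS hit. The function names are hypothetical; the only facts taken from the announcement are that WS uses the 4 bit and JP will use the 64 bit.

```python
WS_BIT = 4    # from the announcement
JP_BIT = 64   # from the announcement


def last_octet(address: str) -> int:
    """Extract the final octet of a dotted-quad answer such as 127.0.0.68."""
    return int(address.rsplit(".", 1)[1])


def is_ws_by_equality(address: str) -> bool:
    # Fragile: only matches when WS is the sole list that hit.
    return last_octet(address) == WS_BIT


def is_ws_by_bit(address: str) -> bool:
    # Robust: tests the WS bit and ignores any other bits that are set.
    return bool(last_octet(address) & WS_BIT)


before = "127.0.0.4"    # WS-only record before JP is added
after = "127.0.0.68"    # same record once JP also sets its bit (4 + 64)

for addr in (before, after):
    print(addr, is_ws_by_equality(addr), is_ws_by_bit(addr))
```

A client that compares whole values would stop seeing the record as WS once it also carries the JP bit, while a client that tests the 4 bit keeps working unchanged.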
>
> We hope that 5 days is not too short notice for this kind of
> change.... I will try to contact the developers of the various
> (non-SA) programs separately to make sure they're aware of the
> coming change. Hopefully most of them are on this announcement
> list however.
>
> We were not able to get JP as a separate list in yesterday's
> SpamAssassin 3.0.0 full release, but we have gotten it into
> SA 3.1 development.
>
> For now the JP data will continue to be included in WS,
> but just before SpamAssassin 3.1 gets released (probably in
> 6 months to a year from now), we will remove JP data from WS
> to make them separate lists within multi. This means that
> SpamAssassin 3.0 and other current users of WS will continue
> to get the benefits of JP under their default shipping
> configurations, and that JP can also be used separately by
> those who modify their configurations to take advantage of it.
>
> In summary, we will:
>
> 1. Add JP to multi.surbl.org on Monday September 27th.
> (Note that like PH, JP would not be available as a separate
> list, only as part of multi.)
>
> 2. Keep the JP data in WS for now, so that regular 3.0 users
> get the advantages of JP also (as part of WS).
>
> 3. Ask the SpamAssassin developers to score JP separately in
> SA 3.1.
>
> 4. Remove JP from WS before the final SA 3.1 mass check and
> re-scoring is done, to make the two lists more separate
> for 3.1 . (Note that the separation is removal of the
> specific subset arrangement suggested in #2. If that is
> done, there will still be some minor overlap of the records
> in WS and JP.)
>
> 5. Inform people about removing JP from WS before we do it,
> so existing WS users can add JP, etc.
>
> Please post follow up questions or comments to the SURBL discuss
> list or to me personally.
This looks to me like a sensible way to handle it.
I followed your advice about SURBL scoring in a thread a few weeks ago
(I think Theo or another ninja also participated), but WS has a
somewhat low score. I didn't raise it, because my setup is
very FP-sensitive (large ISPs), but I would love to see high-quality,
high-scoring multi sublists.
Do you have a reasonable current score for JP? (That is, considering
that, for now, this score will be added to the WS score, since JP
will remain a subset of WS until SA 3.1.)
Thanks.
--
Mariano Absatz - El Baby
el (dot) baby (AT) gmail (dot) com
el (punto) baby (ARROBA:@) gmail (punto) com