[SURBL-Discuss] Re: [SURBL-Announce] ANNOUNCE: Adding new JP list to multi.surbl.org

Mariano Absatz el.baby at gmail.com
Fri Sep 24 17:51:52 CEST 2004


On Thu, 23 Sep 2004 00:58:42 -0700, Jeff Chan <jeffc at surbl.org> wrote:
> [Please post follow ups to the SURBL discuss list or to me.]
> 
> One of the distinct data sources currently feeding into
> ws.surbl.org includes data from Joe Wein and Raymond Dijkxhoorn
> with his colleagues at Prolocation.  Raymond and Prolocation
> are currently processing more than 300,000 potential spams per
> day using Joe's jwSpamSpy server software and combining those
> with Joe's own results.  In addition to the data processing
> software, Joe has an elaborate, thorough, and well-thought-out
> set of inclusion criteria which includes age of domain
> registration, manual checks, and other factors.  The resulting
> data are an extensive list of spam URI domains with a very
> low false positive rate (hits on legitimate messages).  We
> are calling this resulting data JP for Joe Wein + Prolocation.
> 
> The bottom line is that JP (called PJ in the table below) has a
> significantly lower false positive rate than WS while having
> similar spam detection rates, for example as measured against a
> large set of corpora belonging to Theo Van Dinter of SpamAssassin:
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
> 2424443  2357143    67300    0.972   0.00    0.00  (all messages)
> 100.000  97.2241   2.7759    0.972   0.00    0.00  (all messages as %)
>   7.595   7.8122   0.0045    0.999   1.00    0.00  URIBL_SC_SURBL
>  76.754  78.9448   0.0178    1.000   0.80    0.00  URIBL_OB_SURBL
>  77.230  79.4340   0.0208    1.000   0.60    1.00  URIBL_PJ_SURBL
>   0.985   1.0126   0.0045    0.996   0.50    0.00  URIBL_AB_SURBL
>  82.119  84.4600   0.1367    0.998   0.40    0.00  URIBL_WS_SURBL
>   0.021   0.0216   0.0045    0.829   0.00    0.00  URIBL_PH_SURBL
> 
> So we feel the data could usefully be broken out into a
> separate list which could safely be scored higher than
> WS.  We also continue to work on improving the False Positive
> rate of WS of course.  We propose making JP a separate list
> within multi.surbl.org, but *not* a standalone list like
> jp.surbl.org, since it's a major effort to set up entirely
> new lists and most people should be using multi now.
> 
> The main reason for announcing this change ahead of time
> is to allow developers of the many programs (in addition to
> SpamAssassin) now using SURBL data to update their code or
> configurations to take into account that the result codes in
> multi will be changing as a result of adding JP.  JP would get
> the 64 bitmask, as in:
> 
>  2 = comes from sc.surbl.org
>  4 = comes from ws.surbl.org
>  8 = comes from phishing list (labelled as [ph] in multi)
> 16 = comes from ob.surbl.org
> 32 = comes from ab.surbl.org
> 64 = comes from jp list
> 
> So a record in SC, WS, and JP would give a value 127.0.0.70.
> One with WS, OB, and JP would resolve to 127.0.0.84, etc.
> Programs using multi.surbl.org should be updated accordingly.
> 
> Since JP is currently included in WS, there will be 100%
> overlap of JP entries in WS so that any record in JP will
> also be in WS.  In other words about half of the WS records
> in multi will increase by 64 due to overlap with JP.  But
> WS will continue to use the 4 bit, as before.  If your
> programs are decoding the multi results using the bit
> positions, they should need no adjustments to continue to
> handle the WS data.
> 
> We hope that 5 days is not too short notice for this kind of
> change....  I will try to contact the developers of the various
> (non-SA) programs separately to make sure they're aware of the
> coming change.  Hopefully most of them are on this announcement
> list however.
> 
> We were not able to get JP as a separate list in yesterday's
> SpamAssassin 3.0.0 full release, but we have gotten it into
> SA 3.1 development.
> 
> For now the JP data will continue to be included in WS,
> but just before SpamAssassin 3.1 gets released (probably in
> 6 months to a year from now), we will remove JP data from WS
> to make them separate lists within multi.  This means that
> SpamAssassin 3.0 and other current users of WS will continue
> to get the benefits of JP under their default shipping
> configurations, and that JP can also be used separately by
> those who modify their configurations to take advantage of it.
> 
> In summary, we will:
> 
> 1.  Add JP to multi.surbl.org on Monday September 27th.
> (Note that like PH, JP would not be available as a separate
> list, only as part of multi.)
> 
> 2.  Keep the JP data in WS for now, so that regular 3.0 users
> get the advantages of JP also (as part of WS).
> 
> 3.  Ask the SpamAssassin developers to score JP separately in
> SA 3.1.
> 
> 4.  Remove JP from WS before the final SA 3.1 mass check and
> re-scoring is done, to make the two lists more separate
> for 3.1.  (Note that the separation is removal of the
> specific subset arrangement suggested in #2.  If that is
> done, there will still be some minor overlap of the records
> in WS and JP.)
> 
> 5.  Inform people about removing JP from WS before we do it,
> so existing WS users can add JP, etc.
> 
> Please post follow up questions or comments to the SURBL discuss
> list or to me personally.
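
For anyone updating code that decodes multi results, here is a minimal
sketch of the bit-position approach described in the announcement above.
The bit values and the two example addresses come from the announcement;
the function names (decode_multi, on_list) and the short list labels are
just names I picked for illustration, not part of any SURBL or
SpamAssassin API. It only decodes an address string and does no DNS
lookup.

```python
# Bit values as listed in the announcement; the short labels are
# illustrative names chosen for this sketch.
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}


def decode_multi(address):
    """Return the lists flagged by a multi result such as '127.0.0.70'."""
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        raise ValueError(f"not a multi result: {address!r}")
    mask = int(octets[3])
    # Test each bit separately instead of comparing the whole value,
    # so that a newly assigned bit (such as 64 for JP) does not break
    # the handling of the existing lists.
    return [name for bit, name in sorted(SURBL_BITS.items()) if mask & bit]


def on_list(address, name):
    """True if the given list label is flagged in the multi result."""
    return name in decode_multi(address)


if __name__ == "__main__":
    print(decode_multi("127.0.0.70"))   # SC + WS + JP
    print(decode_multi("127.0.0.84"))   # WS + OB + JP
    # A WS-only record (4) that is also in JP becomes 68; the WS bit
    # is still set, so a bit test keeps working.
    print(on_list("127.0.0.68", "ws"))
```

Code that compares the last octet against a fixed value (for example,
"equals 4 means WS") is the kind that would need changing, since
overlapping records will have 64 added to them.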

It looks to me like a sensible way to handle this...

I followed your advice about SURBL scoring in a thread a few weeks ago
(I think Theo or another ninja also participated), but WS has a
somewhat low score... I didn't raise it, 'cause I have a setup that is
very FP-sensitive (large ISPs), but would love to see high-quality,
high-scoring multi sublists...

Do you have a current reasonable score for jp? (That is, considering
that, for now, this score will be added to the ws score, since, until
SA 3.1, jp will be a subset of ws.)

Thanx.

-- 
Mariano Absatz - El Baby
el (dot) baby (AT) gmail (dot) com
el (punto) baby (ARROBA:@) gmail (punto) com

