[SURBL-Discuss] RFC: How to use new data source: URIs advertised through CBL-listed senders

Alain coc454402 at sneakemail.com
Tue Apr 19 22:34:25 CEST 2005


Just a few things to add.

<snip>

> 3 ideas:
> 
> 1) Use the base data used for sc.  Before inclusion you want a number of
> reports to SpamCop (I don't recall it, but let's say 20) before
> adding it to sc.  A domain that appears on both the CBL datafeed and
> the sc datafeed at the "same" time is far more likely spam.  You
> could either use the new datafeed to selectively lower the threshold for
> sc (not really my first choice) or use the occurrences inside the sc
> datafeed to lower the threshold for the new list.  Only a few
> occurrences (more than one) on the sc datafeed would be enough in that
> case.

After thinking about it a while longer, it may not be such a bad idea to
use the new data to improve the SC list.  By requiring fewer separate
reports, the time gap until inclusion would be much shorter.  Instead of
10 (just checked), 3 or 4 may be enough, which gives a gain of at
least 6 minutes, but probably much more.  Moreover, it should be
possible to determine the "right" threshold and the average time gain:
check the percentage of domains that appear in the CBL datafeed and
get fewer reports than the threshold.  For example (not real data):

1 report only and CBL'ed : 10%
2 reports only and CBL'ed : 5%
3 reports only and CBL'ed : 3%
...
9 reports only and CBL'ed : 0.01%
10 or more reports and CBL'ed : 75%

(And compare against those that are not CBL'ed)
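
A rough sketch of how such a breakdown could be computed, in Python.
Everything here is made up for illustration: the function name, the
input shapes (a dict of report counts per domain and a set of domains
seen in the CBL datafeed) and the sample domains are assumptions, not
the real feed formats.  I also assume each percentage is taken within
its own group (CBL'ed or not CBL'ed).

```python
from collections import Counter


def report_count_distribution(report_counts, cbl_domains, cap=10):
    """Percentage of domains per report-count bucket, split by CBL status.

    report_counts: dict mapping domain -> number of reports (hypothetical input)
    cbl_domains:   set of domains seen in the CBL datafeed (hypothetical input)
    cap:           counts of `cap` or more share one "cap or more" bucket
    """
    buckets = {True: Counter(), False: Counter()}
    for domain, n in report_counts.items():
        buckets[domain in cbl_domains][min(n, cap)] += 1

    result = {}
    for listed, counter in buckets.items():
        total = sum(counter.values())
        # Percentages are relative to the group's own size.
        result[listed] = {
            bucket: 100.0 * count / total
            for bucket, count in sorted(counter.items())
        }
    return result


# Invented sample data, only to show the shape of the output.
sample_counts = {
    "a.example": 1, "b.example": 12, "c.example": 3,
    "d.example": 10, "e.example": 1, "f.example": 2,
}
sample_cbl = {"a.example", "b.example", "c.example", "d.example"}
dist = report_count_distribution(sample_counts, sample_cbl)
```

dist[True] holds the CBL'ed breakdown and dist[False] the comparison
group.  If the buckets below some candidate threshold hold a large
share of the CBL'ed domains but few of the others, lowering the
threshold to that value looks safer.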

Another thing I thought of, not linked to CBL: the speed at which reports
come in is also important.  5 reports in 15 minutes is probably much
more spammy than 5 reports in 1 day.
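
One possible way to measure that speed, again only a sketch in Python:
take the largest number of reports that fall inside any sliding time
window.  The function name, the 15-minute window and the timestamps
below are illustrative assumptions, not anything SpamCop or SURBL
actually does.

```python
from datetime import datetime, timedelta


def max_reports_in_window(timestamps, window):
    """Largest number of reports inside any sliding window of length `window`."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop reports from the left until the span fits in the window.
        while t - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Invented timestamps: 5 reports 3 minutes apart vs. 5 reports 5 hours apart.
base = datetime(2005, 4, 19, 12, 0)
fast = [base + timedelta(minutes=3 * i) for i in range(5)]
slow = [base + timedelta(hours=5 * i) for i in range(5)]

fast_peak = max_reports_in_window(fast, timedelta(minutes=15))
slow_peak = max_reports_in_window(slow, timedelta(minutes=15))
```

With these two sets the fast burst scores higher than the slow trickle,
even though both have 5 reports in total, so a peak like this could be
weighed next to the plain report count.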


Alain


