[SURBL-Discuss] RFC: How to use new data source: URIs advertised through CBL-listed senders

Jeff Chan jeffc at surbl.org
Wed Apr 20 09:59:13 CEST 2005


On Tuesday, April 19, 2005, 1:34:25 PM, Alain Alain wrote:

>> 1) Use the base data used for sc.  Before inclusion you want a number of
>> reports to spamcop (I don't recall it but let's say 20), before
>> adding it to sc.  A domain that appears on both the CBL datafeed and
>> the sc datafeed at the "same" time is far more likely spam.  You
>> could either use the new datafeed to selectively lower the threshold for
>> sc (not really my first choice) or use the occurrences inside the sc
>> datafeed to lower the threshold for the new list.  Only a few
>> occurrences (more than one) on the sc datafeed would be enough in that
>> case.

> After thinking a while longer, it's maybe not such a bad idea to use
> the new data to improve the SC list.  By needing fewer separate reports
> the time gap until inclusion will be much less.  Instead of 10 (just
> checked) it's maybe enough to use 3 or 4, which gives a gain of at
> least 6 minutes, but probably much more.  Moreover it's probably
> possible to check the "right" threshold and the average time gain.
> Check the percentage of domains that get inside the CBL datafeed and
> get fewer reports than the threshold, for example (no real data):

> 1 report only and CBL'ed : 10%
> 2 reports only and CBL'ed : 5%
> 3 reports only and CBL'ed : 3%
> ...
> 9 reports only and CBL'ed : 0.01%
> 10 or more reports and CBL'ed : 75%

> (And compare against those that are not CBL'ed)

> Another thing I thought of, not linked with CBL: the speed at which
> reports come in is also important.  5 reports in 15 minutes is probably
> much more spammy than 5 reports in 1 day.

> Alain
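
The breakdown Alain describes could be tallied roughly as follows.  This
is only a sketch: the input format (one pair of report count and CBL flag
per domain) and the function name are assumptions for illustration, not
part of any existing SURBL tooling.

```python
from collections import Counter

def cbl_share_by_report_count(domains, cap=10):
    """Percentage of CBL-listed domains per SpamCop report count.

    domains: iterable of (report_count, is_cbl_listed) pairs, one per
    domain (hypothetical input format).  Counts of `cap` or more share
    one bucket, like the "10 or more reports" row in the example.
    """
    buckets = Counter()
    for reports, cbl_listed in domains:
        if cbl_listed:
            buckets[min(reports, cap)] += 1
    total = sum(buckets.values())
    if total == 0:
        return {}
    return {b: 100.0 * n / total for b, n in sorted(buckets.items())}
```

Running the same tally over the domains that are not CBL'ed gives the
comparison column.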

CBL hits would be a good indication of spamminess but only if we
could eliminate the FPs.  If amazon.com appears a lot on CBL and
someone reports amazon.com on SpamCop, even accidentally, that
could get it listed (were it not for our whitelists).  This would
be more of a problem for whitehats that are less well known than
amazon, etc.
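
As a rough sketch of how a whitelist check and a CBL-lowered threshold
could fit together: the thresholds 10 and 3 are taken from Alain's
message above as illustrative values, and the function and parameter
names are hypothetical.

```python
def should_list(domain, sc_reports, cbl_listed, whitelist,
                base_threshold=10, cbl_threshold=3):
    """Decide whether a domain has enough SpamCop reports to be listed.

    Whitelisted domains are never listed, whatever the other signals
    say.  A domain also seen in the CBL data needs fewer reports.
    All threshold values here are assumptions for illustration.
    """
    if domain in whitelist:
        return False
    threshold = cbl_threshold if cbl_listed else base_threshold
    return sc_reports >= threshold
```

For example, should_list("amazon.com", 50, True, {"amazon.com"}) is
False, which is the whitelist doing its job.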

Rate of reports or hits in CBL or SC or any other source can
be a good indicator of spam, except that legitimate mailers
sometimes send to large mailing lists suddenly and this causes a
spike that can look like spamsign.  This trips up the OB data
sometimes.  However, the CBL traps are so large that it takes a
very large spike to register.  Therefore it's probably a better
indication of a spam attack than what Outblaze may be seeing.  Also
the fact that our version of the CBL trap data is correlated with
zombie and open proxy activity probably helps *a lot*.  Legitimate
mailers, even those sending to a large list of their own
customers, probably don't use zombies.  Large, sudden volumes
of zombie hits may be indicative of a major spammer using a lot
of their bots suddenly.  Not all spammers send large blasts like
that, but enough may do so that this could indeed be useful to note.
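
One simple way to measure report rate is the largest number of reports
that falls inside any sliding time window.  The sketch below assumes
report times are available as plain numbers of seconds; the function
name is made up for illustration.

```python
def max_reports_in_window(timestamps, window_seconds):
    """Largest number of reports seen within any span of window_seconds.

    timestamps: report times in seconds (any order).  Uses a sliding
    window over the sorted times.
    """
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop reports from the left that are too old for this window.
        while t - times[start] > window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best
```

With a 15-minute window (900 seconds), five reports arriving within ten
minutes score 5, while five reports spread over a day score 1.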

Regarding applying special measurements to get the lower-hit CBL
records onto the XS list sooner, yes, that's precisely the goal.
We can automatically find the most common hits through
percentiles or thresholds.  It's the less common hits that we want
to try to list sooner and "dig out of the noise."
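
A percentile cutoff for separating the most common hits from the rest
might look like the sketch below.  The 90th percentile and the
nearest-rank method are arbitrary choices for illustration, and the
input format (a dict of hit counts per domain) is an assumption.

```python
import math

def split_by_percentile(hit_counts, percentile=90):
    """Split domains into (common, rare) sets by hit count.

    hit_counts: {domain: hits}.  The cutoff is the nearest-rank
    percentile of the hit counts; domains at or above it are "common",
    and the rest are the ones to dig out of the noise.
    """
    if not hit_counts:
        return set(), set()
    ordered = sorted(hit_counts.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    cutoff = ordered[rank - 1]
    common = {d for d, n in hit_counts.items() if n >= cutoff}
    rare = set(hit_counts) - common
    return common, rare
```

The "rare" set is where extra signals, such as SC reports or zombie
correlation, would have to do the work.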


Regarding the SC data, I'm also planning to do a self-correlation
on the SC data into IP addresses, probably /24s, to bias inclusion
of SC data more aggressively.  I.e. if a new site resolves into a
/24 that previously had a lot of spam reports, then that new
domain would get added to SC much sooner.
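
A minimal sketch of the /24 idea, using Python's standard ipaddress
module.  The threshold values and the "busy" cutoff are made-up numbers,
and the helper names are hypothetical; resolving domains to IP addresses
is left out.

```python
import ipaddress
from collections import Counter

def slash24(ip):
    """Return the /24 network that contains an IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)

def build_history(reported_ips):
    """Count earlier spam reports per /24 from resolved IP addresses."""
    return Counter(slash24(ip) for ip in reported_ips)

def threshold_for(ip, history, base_threshold=10, low_threshold=3, busy=20):
    """Use a lower report threshold for a new domain that resolves into
    a /24 with many earlier reports (all numbers are assumptions)."""
    if history[slash24(ip)] >= busy:
        return low_threshold
    return base_threshold
```

A new domain resolving into a /24 with 20 or more earlier reports would
then need only 3 reports instead of 10.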

Jeff C.
--
"If it appears in hams, then don't list it."


