Re: [SURBL-Discuss] RFC: How to use new data source: URIs advertised through CBL-listed senders

20 Apr 2005

      On Tuesday, April 19, 2005, 1:34:25 PM, Alain Alain wrote:
...
...

Use the base data used for sc.  Before inclusion you want a number of
reports to SpamCop (I don't recall the figure, but let's say 20) before
adding a domain to sc.  A domain that appears on both the CBL datafeed and
the sc datafeed at the "same" time is far more likely to be spam.  You
could either use the new datafeed to selectively lower the threshold for
sc (not really my first choice) or use the occurrences inside the sc
datafeed to lower the threshold for the new list.  Only a few
occurrences (more than one) on the sc datafeed would be enough in that
case.
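
A rough sketch of the cross-feed idea above, in Python.  The function name,
its parameters, and every number in the example calls are hypothetical
placeholders, not SURBL's actual values or code.  It also ignores the
"same time" requirement and only checks whether the other feed has seen
the domain at least a couple of times.

```python
def should_list(hits, base_threshold, lowered_threshold,
                corroborating_hits, min_corroboration=2):
    """Decide whether a domain has enough hits to be listed.

    hits               -- hits for the domain in the feed being thresholded
    corroborating_hits -- hits for the same domain in the other feed
    The lowered threshold applies only when the other feed has seen the
    domain at least `min_corroboration` times ("more than one").
    """
    if corroborating_hits >= min_corroboration:
        threshold = lowered_threshold
    else:
        threshold = base_threshold
    return hits >= threshold


# Placeholder numbers: base threshold 20, lowered threshold 3.
print(should_list(5, 20, 3, corroborating_hits=2))   # corroborated
print(should_list(5, 20, 3, corroborating_hits=1))   # not corroborated
```

The same function works in either direction: CBL hits corroborated by sc
reports, or sc reports corroborated by CBL hits.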
...
After thinking a while longer, it's maybe not such a bad idea to use
the new data to improve the SC list.  By needing fewer separate reports,
the time gap until inclusion will be much shorter.  Instead of 10 (just
checked) it's maybe enough to use 3 or 4, which gives a gain of at
least 6 minutes, but probably much more.  Moreover, it's probably
possible to determine the "right" threshold and the average time gain:
check the percentage of domains that get into the CBL datafeed and
get fewer reports than the threshold.  For example (no real data):
...
1 report only and CBL'ed : 10%
2 reports only and CBL'ed : 5%
3 reports only and CBL'ed : 3%
...
9 reports only and CBL'ed : 0.01%
10 or more reports and CBL'ed : 75%
...
(And compare against those that are not CBL'ed)
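
A table like the one above could be produced along these lines.  This is
only a sketch: the input shapes (a dict of report counts per domain and a
set of CBL'ed domains) and the function name are assumptions, and the
sample data is made up.

```python
from collections import Counter


def report_count_distribution(sc_reports, cbl_domains, cap=10):
    """Percentage of domains per report count, CBL'ed vs. not CBL'ed.

    sc_reports  -- dict mapping domain -> number of SpamCop reports
    cbl_domains -- set of domains seen in the CBL datafeed
    Counts of `cap` or more are pooled into a single bucket keyed `cap`.
    Returns (cbl_percentages, other_percentages).
    """
    buckets = {True: Counter(), False: Counter()}
    for domain, count in sc_reports.items():
        buckets[domain in cbl_domains][min(count, cap)] += 1

    def to_percent(counter):
        total = sum(counter.values())
        if total == 0:
            return {}
        return {k: 100.0 * v / total for k, v in sorted(counter.items())}

    return to_percent(buckets[True]), to_percent(buckets[False])


# Made-up sample data, for illustration only.
sample_reports = {"a.example": 1, "b.example": 1,
                  "c.example": 12, "d.example": 3}
sample_cbl = {"a.example", "c.example"}
cbl_pct, other_pct = report_count_distribution(sample_reports, sample_cbl)
for n, pct in cbl_pct.items():
    print(f"{n} reports and CBL'ed : {pct:.2f}%")
```

Comparing the two returned distributions shows at which report count the
CBL'ed domains start to separate from the rest.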
...
Another thing I can think of, not linked with CBL: the speed at which
reports come in is also important.  5 reports in 15 minutes is probably
much more spammy than 5 reports in 1 day.
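
One way to measure that speed is a sliding window over the report
timestamps.  A minimal sketch follows, assuming the timestamps for a
domain are available as datetime objects; the function name and the
sample times are invented for illustration.

```python
from datetime import datetime, timedelta


def max_reports_in_window(timestamps, window):
    """Largest number of reports that fall within any span of `window`."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(times)):
        # Advance the window start until the span fits inside `window`.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


base = datetime(2005, 4, 19, 12, 0)
burst = [base + timedelta(minutes=3 * i) for i in range(5)]  # 5 in 12 minutes
slow = [base + timedelta(hours=5 * i) for i in range(5)]     # 5 in 20 hours
print(max_reports_in_window(burst, timedelta(minutes=15)))
print(max_reports_in_window(slow, timedelta(minutes=15)))
```

A domain whose peak count inside a short window is high could then be
weighted more heavily than one with the same total spread over a day.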
...
Alain

CBL hits would be a good indication of spamminess, but only if we
could eliminate the FPs.  If amazon.com appears a lot on CBL and
someone reports amazon.com on SpamCop, even accidentally, that
could get it listed (were it not for our whitelists).  This would
be more of a problem for whitehats that are less well known than
amazon, etc.

Rate of reports or hits in CBL or SC or any other source can
be a good indicator of spam, except that legitimate mailers
sometimes send to large mailing lists suddenly, and this causes a
spike that can look like spamsign.  This sometimes trips up the OB
data.  However, the CBL traps are so large that it takes a
very large spike to register.  Therefore it's probably a better
indication of a spam attack than what Outblaze may be seeing.  Also,
the fact that our version of the CBL trap data is correlated with
zombie and open proxy activity probably helps *a lot*.  Legitimate
mailers, even those sending to a large list of their own
customers, probably don't use zombies.  Large, sudden volumes
of zombie hits may be indicative of a major spammer using a lot
of their bots suddenly.  Not all spammers send large blasts like
that, but enough may do so that this could indeed be useful to note.

Regarding applying special measurements to get the lower-hit CBL
records onto the XS list sooner, yes, that's precisely the goal.
We can automatically find the most common hits through
percentiles or thresholds.  It's the less common hits that we want
to try to list sooner and "dig out of the noise."
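
As an illustration of the percentile approach, here is a small sketch
that splits domains into the "most common" hits and the remainder that
would need extra evidence.  The function name, the nearest-rank percentile
method, and the sample counts are all assumptions made for the example.

```python
import math


def split_by_percentile(hit_counts, percentile=90):
    """Split domains at the given percentile of hit counts (nearest rank).

    hit_counts -- dict mapping domain -> number of CBL-feed hits
    Returns (cutoff, common, rest): domains with at least `cutoff` hits
    are "common"; the rest are the lower-hit records.
    """
    counts = sorted(hit_counts.values())
    if not counts:
        return 0, set(), set()
    rank = max(1, math.ceil(percentile * len(counts) / 100))
    cutoff = counts[rank - 1]
    common = {d for d, c in hit_counts.items() if c >= cutoff}
    return cutoff, common, set(hit_counts) - common


# Made-up hit counts, for illustration only.
hits = {"a.example": 1, "b.example": 2, "c.example": 3,
        "d.example": 4, "e.example": 100}
cutoff, common, rest = split_by_percentile(hits, percentile=80)
print(cutoff, sorted(common), sorted(rest))
```

The `rest` set is where corroboration from another source would do the
work of digging records out of the noise.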

Regarding the SC data, I'm also planning to do a self-correlation
of the SC data by IP address, probably by /24, to bias inclusion
of SC data more aggressively.  I.e. if a new site resolves into a
/24 that previously had a lot of spam reports, then that new
domain would get added to SC much sooner.
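
The /24 correlation could look roughly like this.  It is a sketch under
several assumptions: IPv4 only, the domain has already been resolved to an
IP address elsewhere, and the function names and threshold numbers are
hypothetical placeholders rather than real SURBL settings.

```python
import ipaddress
from collections import Counter


def slash24(ip):
    """Return the /24 network that contains an IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def build_history(reported_ips):
    """Count past spam reports per /24 from a list of resolved IPs."""
    return Counter(slash24(ip) for ip in reported_ips)


def threshold_for(ip, history, base=10, lowered=3, bad_network_min=50):
    """Use the lowered threshold when the /24 has a heavy report history."""
    if history[slash24(ip)] >= bad_network_min:
        return lowered
    return base


# Documentation-range addresses, made up for illustration.
history = build_history(["192.0.2.1"] * 60 + ["198.51.100.5"])
print(threshold_for("192.0.2.200", history))   # same /24 as the 60 reports
print(threshold_for("198.51.100.9", history))  # /24 with a single report
```

A new domain resolving into a heavily reported /24 would then need fewer
reports before being added.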
Jeff C.
--
"If it appears in hams, then don't list it."
