New subject: RFC: How to use new data source: URIs advertised through CBL-listed senders

19 Apr 2005


      We've been working for a few weeks with the folks at CBL to
extract URIs appearing on their extensive spam traps that also
trigger inclusion in CBL, i.e. zombies, open proxies, etc.  What
this means is that we can get URIs of spams that are sent using
zombies and open proxies, where that mode of sending is a very
good indication of spamminess since legitimate senders probably
don't use hijacked hosts or open proxies to send their mail.
For anyone not familiar with CBL, here are a few words about it.
IP addresses of compromised senders like zombies and open proxies
end up in cbl.abuseat.org and xbl.spamhaus.org, which are widely
used to block spam senders at the MTA level.  Experience with
this RBL shows it to be very accurate and useful indicators of
compromised senders with a low false positive rate.  Many systems
and networks find these RBLs useful to block on, with good results.
http://cbl.abuseat.org/
One of the goals of looking at URIs appearing on the CBL traps in
messages also triggering CBL inclusion is to get listings of new
URIs into SURBLs sooner.  One of the valid criticisms of SURBLs
is that there is too much delay between the time a URI is first
used and it gets listed in SURBLs.  This is a problem with RBLs
in general, and it means that the targeted senders (or URIs) have
a window of time before detection and list inclusion where they
can send unhindered.
One advantage we have with SURBLs is that the hosts mentioned in
spam URIs tend to be longer-lasting than the compromised spam
senders.  In other words URIs are often somewhat more "durable"
indicators of spams than zombie IP addresses.  Zombie usage is
often rather fleeting and in the minutes to hours range, where
URI usage can be in the days to weeks range.  Therefore if we can
find URIs sent by zombies, we can potentially "bridge the gap"
and get new URI hosts blacklisted sooner.  In that sense they
work together with and improve the effectiveness of RBLs like CBL
by creating a longer-lasting and more persistent view of some of
the same types of messages that get caught by RBLs, by taking a
closer look at the content of those messages, specifically the
sites they advertise.
An aspect of the CBL URI data that makes them potentially very
attractive as a new data source for SURBLs is that the CBL traps
are very extensive and specifically focussed on and correlated
with zombie and open proxy detection.  As such, it's somewhat
orthogonal to other existing SURBL data sources which are manual
lists, user reports, or smaller, but still rather substantial
*spam-focussed* traps.  As a new data source, CBL URIs could
therefore complement our existing sources quite well due to its
size and differing composition, thus hopefully increasing the
overall detection performance of SURBLs in general.
Like most URI data sources, the main problem with the CBL URI
data is false positives or appearance of otherwise legitimate
domains.  For example amazon.com is one that appears frequently.
This does not mean that amazon.com is using zombies to send mail,
or that the CBL traps have been deliberately poisoned, but that
spammers occasionally mention legitimate domains like amazon.com
in their spams.  FPs aside, the CBL URI data does indeed appear
to include other domains operating for the benefit of spammers or
their customers.  These are the new domains we would like to
catch.  Our challenge therefore is to find ways to use those
while excluding the FPs.  Some solutions that have been proposed
so far are:
1.  Counting trap appearance volume and taking the top most often
appearing URIs.
2.  Including domains and the infrequent IP that are already in
other SURBLs.  This is useful as a confirmation of a zombieness
dimension of existing SURBL records.
3.  Including domains that resolve into sbl.spamhaus.org as
NS, MX or web host records.
4.  Excluding records already in our somewhat limited whitelists.
In fact we have an existing program which takes a combination of
the first four to produce a list, but the output of that program
is not yet published in SURBL form.  We may put these in the
128th-bit position of multi.surbl.org to begin testing, but
looking at the data there are probably still too many FPs to put
it into official production use.  Consider some of the additional
possibilities below which are not currently being done, and let
us know if you think it may be useful to start publishing the
above data.
5.  Including domains that resolve into the IP space of manually
reported URIs, for example from the SpamCop spamvertised site
data used in sc.surbl.org and ab.surbl.org.
6.  Doing regular (probably nightly) manual review of SURBL
additions and whitelisting FPs that appear.  (This should
probably be done regardless of any new data sources.)
Obviously we can't check every new domain that appears on SURBLs,
but we could set up criteria to flag checking, such as domain
registration older than 90 days, non-inclusion in SBL, few NANAS
reports, etc.  Some kind of rating engine using those or other
criteria could be applied to new listings to flag manual review
of some of the more likely FPs.  We would not automatically
whitelist these, but flag them for further checking.
CBL URI data may well represent a useful new data source, but the
best way to determine that may be to start using them.  However
I'd like your comments on some of the above FP mitigation ideas
and any new ideas anyone may have for that purpose before we put
these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff C.
--
"If it appears in hams, then don't list it."