We've been working for a few weeks with the folks at CBL to extract URIs from messages that arrive on their extensive spam traps and that also trigger inclusion in CBL, i.e. messages sent from zombies, open proxies, etc. What this means is that we can get the URIs of spams that are sent using zombies and open proxies. That mode of sending is a very good indication of spamminess, since legitimate senders probably don't use hijacked hosts or open proxies to send their mail.
For anyone not familiar with CBL, here are a few words about it. IP addresses of compromised senders like zombies and open proxies end up in cbl.abuseat.org and xbl.spamhaus.org, which are widely used to block spam senders at the MTA level. Experience with these RBLs shows them to be very accurate and useful indicators of compromised senders, with a low false positive rate. Many systems and networks block on them with good results.
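For readers who have not queried an IP-based RBL directly, here is a minimal sketch of the usual DNSBL convention: reverse the IPv4 octets, append the list's zone, and look the name up. The function names are made up for illustration, and this is not how any particular MTA implements the check.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "cbl.abuseat.org") -> str:
    """Build the DNS name used to ask a DNSBL zone about an IPv4 address."""
    # Validate the input; raises ValueError if it is not an IPv4 address.
    addr = ipaddress.IPv4Address(ip)
    # DNSBL convention: octets in reverse order, followed by the zone.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "cbl.abuseat.org") -> bool:
    """Return True if the zone returns an A record for the address.

    Hypothetical helper: a failed lookup (including a plain DNS outage)
    is treated as "not listed", which a real deployment may not want.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, dnsbl_query_name("192.0.2.1") gives "1.2.0.192.cbl.abuseat.org". An MTA would typically run such a lookup against the connecting IP before accepting the message.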
One of the goals of looking at URIs that appear on the CBL traps in messages also triggering CBL inclusion is to get new URIs listed in SURBLs sooner. One of the valid criticisms of SURBLs is that there is too much delay between the time a URI is first used and the time it gets listed in SURBLs. This is a problem with RBLs in general, and it means that the targeted senders (or URIs) have a window of time before detection and list inclusion during which they can send unhindered.
One advantage we have with SURBLs is that the hosts mentioned in spam URIs tend to be longer-lasting than the compromised spam senders. In other words URIs are often somewhat more "durable" indicators of spams than zombie IP addresses. Zombie usage is often rather fleeting, in the minutes to hours range, whereas URI usage can be in the days to weeks range. Therefore if we can find URIs sent by zombies, we can potentially "bridge the gap" and get new URI hosts blacklisted sooner. In that sense SURBLs work together with and improve the effectiveness of RBLs like CBL: by taking a closer look at the content of the messages, specifically the sites they advertise, they create a longer-lasting and more persistent view of some of the same types of messages that get caught by RBLs.
An aspect of the CBL URI data that makes them potentially very attractive as a new data source for SURBLs is that the CBL traps are very extensive and specifically focussed on and correlated with zombie and open proxy detection. As such, they are somewhat orthogonal to the other existing SURBL data sources, which are manual lists, user reports, or smaller but still rather substantial *spam-focussed* traps. As a new data source, CBL URIs could therefore complement our existing sources quite well due to their size and differing composition, thus hopefully increasing the overall detection performance of SURBLs in general.
Like most URI data sources, the main problem with the CBL URI data is false positives (FPs), i.e. the appearance of otherwise legitimate domains. For example amazon.com is one that appears frequently. This does not mean that amazon.com is using zombies to send mail, or that the CBL traps have been deliberately poisoned, but that spammers occasionally mention legitimate domains like amazon.com in their spams. FPs aside, the CBL URI data do indeed appear to include other domains operating for the benefit of spammers or their customers. These are the new domains we would like to catch. Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
1. Counting trap appearance volume and taking the top most often appearing URIs.
2. Including domains, and the occasional IP address, that are already in other SURBLs. This is useful as confirmation of a "zombieness" dimension of existing SURBL records.
3. Including domains that resolve into sbl.spamhaus.org as NS, MX or web host records.
4. Excluding records already in our somewhat limited whitelists.
In fact we have an existing program which takes a combination of the first four to produce a list, but the output of that program is not yet published in SURBL form. We may put these in the bit position with value 128 in multi.surbl.org to begin testing, but looking at the data there are probably still too many FPs to put it into official production use. Let us know if you think it may be useful to start publishing the above data, and consider some of the additional possibilities below, which are not currently being done.
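To make ideas 1 through 4 concrete, here is a rough sketch of one way they could be combined. It is not the existing program mentioned above. The function name, the in_sbl callback (standing in for the NS/MX/web host lookups against sbl.spamhaus.org), the top_n cutoff, and the choice to take the union of ideas 1 to 3 before subtracting the whitelist are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def select_candidates(
    trap_domains: Iterable[str],
    existing_surbl: Set[str],
    whitelist: Set[str],
    in_sbl: Callable[[str], bool],
    top_n: int = 100,
) -> Set[str]:
    """Pick candidate domains from trap data (hypothetical combination rule)."""
    # One entry per trap appearance, normalized to lower case.
    counts = Counter(d.lower() for d in trap_domains)

    # Idea 1: the most frequently appearing domains on the traps.
    top = {d for d, _ in counts.most_common(top_n)}
    # Idea 2: trap domains that are already listed in other SURBLs.
    confirmed = {d for d in counts if d in existing_surbl}
    # Idea 3: trap domains whose NS, MX or web hosts fall in SBL;
    # the actual DNS work is delegated to the caller-supplied in_sbl.
    sbl_hosted = {d for d in counts if in_sbl(d)}

    # Idea 4: drop anything on the whitelist.
    return (top | confirmed | sbl_hosted) - whitelist
```

With a union like this, a high-volume legitimate domain such as amazon.com is kept out only if it is on the whitelist, which is one reason the FP rate depends so much on whitelist coverage. Requiring volume plus one of the other signals would be a stricter alternative.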
5. Including domains that resolve into the IP space of manually reported URIs, for example from the SpamCop spamvertised site data used in sc.surbl.org and ab.surbl.org.
6. Doing regular (probably nightly) manual review of SURBL additions and whitelisting FPs that appear. (This should probably be done regardless of any new data sources.)
Obviously we can't check every new domain that appears on SURBLs, but we could set up criteria to flag checking, such as domain registration older than 90 days, non-inclusion in SBL, few NANAS reports, etc. Some kind of rating engine using those or other criteria could be applied to new listings to flag manual review of some of the more likely FPs. We would not automatically whitelist these, but flag them for further checking.
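A rating engine along those lines might look something like the sketch below. The Listing record, the equal weighting of the three criteria, the cutoff of fewer than 5 NANAS reports as "few", and the flagging threshold of 2 are assumptions chosen for illustration; only the 90-day registration age comes from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class Listing:
    """A newly listed domain plus the facts used to rate it (hypothetical)."""
    domain: str
    registered: date      # domain registration date
    in_sbl: bool          # whether the domain resolves into SBL space
    nanas_reports: int    # number of NANAS sightings


def review_score(listing: Listing, today: date,
                 min_age_days: int = 90, few_reports: int = 5) -> int:
    """Count how many 'looks legitimate' signals a listing shows."""
    score = 0
    if (today - listing.registered).days > min_age_days:
        score += 1  # long-registered domains are less likely throwaways
    if not listing.in_sbl:
        score += 1  # not associated with SBL-listed space
    if listing.nanas_reports < few_reports:
        score += 1  # rarely reported as spam
    return score


def flag_for_review(listings: Iterable[Listing], today: date,
                    threshold: int = 2) -> List[str]:
    """Return domains to check by hand; nothing is whitelisted automatically."""
    return [l.domain for l in listings if review_score(l, today) >= threshold]
```

The output is only a queue for manual checking, in keeping with the point above that flagged domains would not be whitelisted automatically.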
CBL URI data may well represent a useful new data source, but the best way to determine that may be to start using them. However I'd like your comments on some of the above FP mitigation ideas and any new ideas anyone may have for that purpose before we put these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff C. -- "If it appears in hams, then don't list it."