Chris Santerre wrote to SURBL Discussion list (E-mail):
OK, this isn't the first time we've had this discussion, but Raymond and I felt this should be made public again. He ran thru some tests of 1500+ domains and found the following data. Looks like they may be sending from zombies, and never from their own hosts. IPs are similar across the board.
So is there a way to use the IP info in a good way? Could SA or SURBL do a quick ping of the URL and match the resulting IP against a list? This would allow us to simply list 1 IP instead of all these domains.
(I'm well aware of virtual hosts! So only the filthiest of spammers would be put on this IP list. Then their IP better boot them or anyone hosted on that box would feel the wrath of SURBL.)
I talked to Raymond about this, too... and, basically, here are my big thoughts:
We need to find the correlation of IP addresses to hostnames. See http://whois.sc/ ; I can, with some help, duplicate what they're doing in a way that will help us fight spam.
Then, for 219.254.32.111, we could see that there are, say, 200 sites hosted at that IP, and, after some hand checking, identify that all of them belong to spammers.
However, for all we know *so far*, 219.254.32.111 could be an HA cluster of a few dozen machines, and, while there may be 200 pill spammers on that cluster, there may also be 20,000 other legit sites.
With our current data, we can't make either determination. But, using forward zone data, we can do forward lookups, and track them in a database. Then, do forward lookups on SURBL data to get the IPs of spammers, and (algorithmically!) find correlations.
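To make the correlation step concrete, here is a rough sketch in Python. It assumes the forward lookups have already been done and stored as (domain, IP) pairs, and that the SURBL data is available as a plain set of listed domains. The function names, the report fields, and the sample data (which uses reserved documentation addresses and .example names) are all made up for illustration; this is not how any existing tool does it.

```python
from collections import defaultdict

def group_by_ip(forward_records):
    """Map each IP to the set of domains that resolve to it."""
    by_ip = defaultdict(set)
    for domain, ip in forward_records:
        by_ip[ip].add(domain.lower())
    return by_ip

def ip_spam_report(forward_records, listed_domains):
    """For each IP, count hosted domains and how many of them are listed.

    'unlisted' is the potential collateral damage of listing that IP.
    """
    listed = {d.lower() for d in listed_domains}
    report = {}
    for ip, domains in group_by_ip(forward_records).items():
        spam = domains & listed
        report[ip] = {
            "total": len(domains),
            "listed": len(spam),
            "unlisted": len(domains) - len(spam),
            "ratio": len(spam) / len(domains),
        }
    return report

# Hypothetical sample data, not real lookups.
records = [
    ("pills-one.example", "192.0.2.10"),
    ("pills-two.example", "192.0.2.10"),
    ("pills-three.example", "192.0.2.10"),
    ("pills-one.example", "198.51.100.7"),
    ("bakery.example", "198.51.100.7"),
    ("school.example", "198.51.100.7"),
    ("library.example", "198.51.100.7"),
]
listed = {"pills-one.example", "pills-two.example", "pills-three.example"}

report = ip_spam_report(records, listed)
for ip, stats in sorted(report.items()):
    print(ip, stats)
```

An IP where every hosted domain is listed is a candidate for an IP-level listing; an IP with a low ratio and a large "unlisted" count is the shared-hosting case where listing the IP would hurt legit sites.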
The programming effort to implement this would not be trivial, to say nothing of the processing power and bandwidth needed for the initial run. The datasets (.com!) are huge. After that, we just have to periodically sample for new, removed, and changed domains, at which point the processing load will drop.
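The periodic re-sampling could be as simple as diffing two snapshots of the lookup table. A minimal sketch, assuming each snapshot is a dict mapping a domain to the single IP it resolved to (real data would need to handle multiple A records per domain); the function name and sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two {domain: ip} snapshots.

    Returns (added, removed, changed), where 'changed' maps a domain
    to its (old_ip, new_ip) pair.
    """
    added = {d: new[d] for d in new.keys() - old.keys()}
    removed = {d: old[d] for d in old.keys() - new.keys()}
    changed = {
        d: (old[d], new[d])
        for d in old.keys() & new.keys()
        if old[d] != new[d]
    }
    return added, removed, changed

# Hypothetical snapshots from two sampling runs.
last_week = {
    "pills-one.example": "192.0.2.10",
    "bakery.example": "198.51.100.7",
    "school.example": "198.51.100.7",
}
this_week = {
    "pills-one.example": "203.0.113.5",   # moved to a new IP
    "bakery.example": "198.51.100.7",     # unchanged
    "pills-four.example": "192.0.2.10",   # newly seen
}

added, removed, changed = diff_snapshots(last_week, this_week)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Only the domains in those three buckets would need their per-IP counts recomputed, which is why the ongoing cost should be much smaller than the initial run.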
Still, there's no way I have the time or money to do this alone, given my current commitments. I *wish* I could spend my whole day fighting spam. I'd need a fair amount of real help. It'd be good to make happen, though, considering we could then *proactively* list domains (or IPs) with a high degree of confidence and little or no collateral damage. (Because we can *measure* collateral damage if we know which other domains are hosted on a particular IP.) And there would be many, many other statistical benefits we could gain.
- Ryan