On Friday, September 10, 2004, 10:43:39 AM, Jeff wrote:
<snip/>
Holy confusion! I can't tell where you are on this subject now Jeff :)
<snip/>
JC> If you're talking about adding resolved IP addresses to SURBLs,
JC> no we're not going to do that. :-(

JC> What I'm talking about is an internal process where we keep track
JC> of resolved IP addresses and use that to add new domains to
JC> SURBLs sooner if they resolve to a similar IP range (probably
JC> /24s). We would use the resolved IP addresses to add domains
JC> to sc.surbl.org and possibly other lists sooner. Most would
JC> probably get added on the first report. :-)
I recommend a bit of caution on this point. My preliminary data on using /24s to drive recursive domain additions is that it is prone to false positives - the network surrounding a given web host is frequently populated with non-spam servers, it seems... at least frequently enough that it's a challenge to generalize in this way.
(I have also observed random changes to these IPs, I believe in an effort to thwart automated attacks. On occasion these domains may point to random legitimate services - the only safe, simple way to know is to look... That's such an aggressive, forward-thinking countermeasure that I almost didn't believe it when I saw it, and I probably wouldn't have caught it if I weren't looking for it.)
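To make the concern concrete, here is a rough sketch (Python; the names and structure are illustrative guesses, not anything SURBL actually runs) of the /24 idea: record the /24 around each listed domain's resolved address, then escalate later candidates that land in an already-known /24. A legitimate neighbor in the same /24 matches just as easily, which is exactly where the false positives come from.

    # Hypothetical sketch of /24-based escalation; names are illustrative.
    import socket
    import ipaddress

    listed_networks = set()

    def network_24(domain):
        """Resolve a domain to one A record and return the covering /24."""
        try:
            ip = socket.gethostbyname(domain)
        except socket.gaierror:
            return None
        return ipaddress.ip_network(ip + "/24", strict=False)

    def note_listed(domain):
        """Record the /24 around a domain that has already been listed."""
        net = network_24(domain)
        if net is not None:
            listed_networks.add(net)

    def shares_listed_network(candidate):
        """True if a new candidate resolves into an already-listed /24."""
        net = network_24(candidate)
        return net is not None and net in listed_networks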
One of the reasons we are able to entertain this kind of analysis is that we (humans) are heavily involved in a continuous refinement and posting process - this allows us to provide tuning inputs that are difficult to quantify for any autonomous AI. We're working in concert with our tools so our techniques don't easily translate out of that environment --- but they do often point in directions where more automation is possible.
It's worth noting that the spammers are accelerating in their efforts to blend their presence with legitimate services, equipment, etc... I suppose this is a natural response to the kinds of automated countermeasures that have been put in place. The upshot of this is that automated schemes must become increasingly sophisticated (intelligent) in order to maintain accuracy.
That said, there are some ways to leverage this data when it is qualified properly. For example, a clean spamtrap - particularly one spawned from dictionary attacks - can provide a ready stream of messages from which you can derive domains through recursive ip references.
It is common for Snake-Oil spammers to leverage a family of domains at one time for a new campaign and to have these point to a single IP or a small group of IPs. As a result it is possible to carefully select a domain (from a URI) in one of these messages and then leverage the resolved IP of that domain to automatically derive the other members of the family from the spamtrap data. You cannot reliably use open message data to derive this, since there is a significant risk of tagging a legitimate virtual host; the spamtrap data, however, does not _USUALLY_ have this problem.
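Under those assumptions - clean spamtrap input only, and a human picking the seed domain - a rough sketch of the family derivation might look like this (Python; the list of trapped domains would come from whatever URI extraction you already run over the trap):

    # Sketch of deriving a domain family by shared resolved IP.
    import socket
    from collections import defaultdict

    def resolve_all(domain):
        """Return the set of A records for a domain (empty set on failure)."""
        try:
            return set(socket.gethostbyname_ex(domain)[2])
        except socket.gaierror:
            return set()

    def family_for_seed(seed_domain, trapped_domains):
        """Other trapped domains that share a resolved IP with the seed."""
        seed_ips = resolve_all(seed_domain)
        by_ip = defaultdict(set)
        for domain in trapped_domains:
            for ip in resolve_all(domain):
                by_ip[ip].add(domain)
        family = set()
        for ip in seed_ips:
            family |= by_ip[ip]
        family.discard(seed_domain)
        return family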
This technique is also limited, however, and requires some significant review/monitoring. Also, spammers are already complicating these mechanisms (they're thinking ahead more)... There are a growing number of cases where randomstuff.example.com resolves to something different than /example.com, and in fact /example.com is often targeted at some random legitimate host - so you need to target the larger URI to extract the filtering candidate... There is no simple way of knowing the conditions - only complex ways.
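A small sketch of that caveat (Python again; the two-label "base domain" split below is a naive simplification of whatever registered-domain logic you would really need): resolve both the full hostname from the URI and the bare domain, and only fall back to the bare domain when the two actually share an address.

    # Sketch: prefer the larger URI when the bare domain resolves elsewhere.
    import socket

    def resolved_ips(host):
        """Return the set of A records for a host (empty set on failure)."""
        try:
            return set(socket.gethostbyname_ex(host)[2])
        except socket.gaierror:
            return set()

    def filtering_candidate(full_host):
        """Pick the safer listing candidate: full hostname or bare domain."""
        base = ".".join(full_host.split(".")[-2:])  # naive base-domain guess
        if full_host == base:
            return base
        if resolved_ips(full_host).isdisjoint(resolved_ips(base)):
            # The bare domain points somewhere else, possibly at a legitimate
            # host, so only the longer name is safe to list.
            return full_host
        return base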
I probably shouldn't go into much more about this because it will become confusing - Like I said, many of our techniques only work within the infrastructure we have created. Where I can offer any useful insight I will though.
I recommend taking a look at this data yourself with the resources you have by building prototypes of your mechanisms and then testing the bits out of them until the statistics reveal themselves. As a matter of practice this is the only way to know what will work and how it can be applied.
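If it helps, the testing loop can be as minimal as this (Python; it assumes you have ham and spam corpora available as message lists and that the prototype rule is a callable returning True on a match - the ham hit rate is your false positive rate):

    # Minimal harness for measuring a prototype rule against two corpora.
    def evaluate(rule, spam_messages, ham_messages):
        """Return (spam hit rate, ham hit rate) for a candidate rule."""
        spam_hits = sum(1 for m in spam_messages if rule(m))
        ham_hits = sum(1 for m in ham_messages if rule(m))
        spam_rate = spam_hits / len(spam_messages) if spam_messages else 0.0
        ham_rate = ham_hits / len(ham_messages) if ham_messages else 0.0
        return spam_rate, ham_rate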
Hope this helps,
_M