-----Original Message----- From: Jeff Chan [mailto:jeffc@surbl.org] Sent: Thursday, September 09, 2004 8:44 PM To: Jeff Chan Cc: Pete McNeil; SURBL Discussion list; Spamassassin-Talk Subject: Re: Start an IP list to block?
On Thursday, September 9, 2004, 5:34:05 PM, Jeff Chan wrote:
My first pass at cleaning the resolved IP data would be to take the top 70th percentile of IP addresses and only use those to check domains' resolved IPs against. It's not perfect, but it should cut down on the uncertainty.
I should add that this mostly applies to data where we have a constant feed of actual spam reports, such as from SpamCop. It does not apply as strongly to data sources where we only have a unitary list of domains, for example where each domain appears once over the whole list. Though even there it applies weakly: for example, a dozen domains that all resolve to the same network could probably be used to bias future domains appearing in the same network towards list inclusion.
But when you have a stream of reports about the *same domain*, then you can get better statistics about that domain or its resolved IP. There is simply more data to work with in more meaningful ways.
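For what it's worth, here is a minimal sketch of that first-pass cleaning step, assuming the per-report resolved IPs are available as a flat list (the names and the exact percentile mechanics here are illustrative, not anything SURBL actually runs):

    # Keep only the most frequently reported resolved IPs, roughly the
    # "top 70th percentile" cleaning idea described above.
    from collections import Counter

    def trusted_ips(resolved_ips, percentile=70):
        """resolved_ips: one IP string per spam report (duplicates expected).
        Returns the IPs whose per-IP report count sits at or above the
        given percentile of all per-IP counts."""
        counts = Counter(resolved_ips)
        if not counts:
            return set()
        ordered = sorted(counts.values())
        idx = min(int(len(ordered) * percentile / 100), len(ordered) - 1)
        cutoff = ordered[idx]
        return {ip for ip, n in counts.items() if n >= cutoff}

New domains' resolved IPs would then only be compared against that trusted set, ignoring the long, noisy tail of rarely seen addresses.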
Holy confusion! I can't tell where you are on this subject now, Jeff :)

Are you saying that if we get really good data like what was in my original post, and we keep the data in the 90th percentile area, then we might possibly be able to list the IP hosts and have SURBL check against it? If so... I'm up for that.

Granted, it would take a little more research than just a domain listing, but I think the benefits are very good, especially if we keep it to only high-ranking IP offenders. I mean, we may add fewer than 50 IPs a year? Just the really nasty spammers.
Anyway, it's been a great discussion.

If we've learned anything in the last 24 hours, it's that the Patriots' defense needs some work against the run game ;)
--Chris (Go Tom Brady!)
On Friday, September 10, 2004, 7:33:10 AM, Chris Santerre wrote:
Holy confusion! I can't tell where you are on this subject now, Jeff :)

Are you saying that if we get really good data like what was in my original post, and we keep the data in the 90th percentile area, then we might possibly be able to list the IP hosts and have SURBL check against it? If so... I'm up for that.

Granted, it would take a little more research than just a domain listing, but I think the benefits are very good, especially if we keep it to only high-ranking IP offenders. I mean, we may add fewer than 50 IPs a year? Just the really nasty spammers.
If you're talking about adding resolved IP addresses to SURBLs, no we're not going to do that. :-(
What I'm talking about is an internal process where we keep track of resolved IP addresses and use that to add new domains to SURBLs sooner if they resolve to a similar IP range (probably /24s). We would use the resolved IP addresses to add domains to sc.surbl.org and possibly other lists sooner. Most would probably get added on the first report. :-)
http://www.surbl.org/faq.html#numbered
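To make the idea concrete, here is a rough sketch of how such a bias could work; the threshold values, the known_spam_nets set, and the single A-record lookup are all illustrative assumptions rather than a description of the actual SURBL pipeline:

    import ipaddress
    import socket

    NORMAL_THRESHOLD = 10   # reports needed before listing (assumed number)
    BIASED_THRESHOLD = 1    # list on the first report if it hits a known bad /24

    def network_24(ip):
        """Return the /24 network containing this IPv4 address."""
        return ipaddress.ip_network(f"{ip}/24", strict=False)

    def should_list(domain, report_count, known_spam_nets):
        """known_spam_nets: set of ipaddress.IPv4Network objects (/24s)
        built from the resolved IPs of domains already on the list."""
        try:
            ip = socket.gethostbyname(domain)
        except OSError:
            return report_count >= NORMAL_THRESHOLD
        biased = network_24(ip) in known_spam_nets
        return report_count >= (BIASED_THRESHOLD if biased else NORMAL_THRESHOLD)

The point is that the resolved IPs never get published; they only lower the bar for domains that are already being reported to us.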
Jeff C.
On Friday, September 10, 2004, 10:43:39 AM, Jeff wrote:
<snip/>
Holy confusion! I can't tell where you are on this subject now Jeff :)
<snip/>
JC> If you're talking about adding resolved IP addresses to SURBLs,
JC> no we're not going to do that. :-(

JC> What I'm talking about is an internal process where we keep track
JC> of resolved IP addresses and use that to add new domains to
JC> SURBLs sooner if they resolve to a similar IP range (probably
JC> /24s). We would use the resolved IP addresses to add domains
JC> to sc.surbl.org and possibly other lists sooner. Most would
JC> probably get added on the first report. :-)
I recommend a bit of caution on this point. My preliminary data on using /24s to drive recursive domain additions suggests that it is prone to false positives - the network surrounding a given web host is frequently populated with non-spam servers, it seems... at least frequently enough that it's a challenge to generalize in this way.
(I have also observed random changes to these IPs, I believe in an effort to thwart automated attacks. On occasion these domains may point to random legitimate services - the only safe, simple way to know is to look... that's such an aggressive, forward-thinking countermeasure that I almost didn't believe it when I saw it, and I probably wouldn't have caught it if I weren't looking for it.)
One of the reasons we are able to entertain this kind of analysis is that we (humans) are heavily involved in a continuous refinement and posting process - this allows us to provide tuning inputs that are difficult to quantify for any autonomous AI. We're working in concert with our tools so our techniques don't easily translate out of that environment --- but they do often point in directions where more automation is possible.
It's worth noting that the spammers are accelerating their efforts to blend their presence with legitimate services, equipment, etc... I suppose this is a natural response to the kinds of automated countermeasures that have been put in place. The upshot of this is that automated schemes must become increasingly sophisticated (intelligent) in order to maintain accuracy.

That said, there are some ways to leverage this data when it is qualified properly. For example, a clean spamtrap - particularly one spawned from dictionary attacks - can provide a ready stream of messages from which you can derive domains through recursive IP references.
It is common for Snake-Oil spammers to leverage a family of domains at one time for a new campaign and to have these point to a single IP or a small group of IPs. As a result, it is possible to carefully select a domain (from a URI) in one of these messages and then leverage the resolved IP of that domain to automatically derive the other members of the family from the spamtrap data. You cannot reliably use open message data to derive this, however, since there is a significant risk of tagging a legitimate virtual host --- but the spamtrap data does not have this problem, _USUALLY_.
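Purely as an illustration of that family expansion over clean spamtrap data (the spamtrap_domains input and the seed selection are assumed, and as noted the results would still need review before listing anything):

    import socket
    from collections import defaultdict

    def expand_family(seed_domain, spamtrap_domains):
        """Given one carefully selected domain from a spamtrap message,
        return the other spamtrap-seen domains that resolve to the same
        IP (the rest of the campaign's 'family')."""
        by_ip = defaultdict(set)
        for domain in spamtrap_domains:
            try:
                by_ip[socket.gethostbyname(domain)].add(domain)
            except OSError:
                continue  # unresolvable domains can't join a family
        try:
            seed_ip = socket.gethostbyname(seed_domain)
        except OSError:
            return set()
        return by_ip[seed_ip] - {seed_domain}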
This technique is also limited, however, and requires some significant review/monitoring. Also, spammers are already complicating these mechanisms (they're thinking ahead more)... There are a growing number of cases where randomstuff.example.com resolves to something different than example.com, and in fact example.com is often targeted to some random legitimate host - so you need to target the larger URI to extract the filtering candidate... There is no simple way of knowing the conditions - only complex ways.
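A small check along those lines, i.e. noticing when the full hostname from the URI and its bare base domain resolve differently (the two-label split below is naive; real code would want something like the public suffix list):

    import socket

    def resolution_diverges(full_host):
        """True if e.g. randomstuff.example.com and example.com resolve
        to different addresses, which suggests only the full hostname
        from the URI is a safe filtering candidate."""
        base = ".".join(full_host.split(".")[-2:])  # naive base-domain guess
        try:
            return socket.gethostbyname(full_host) != socket.gethostbyname(base)
        except OSError:
            return True  # if either fails to resolve, don't trust the base domain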
I probably shouldn't go into much more detail about this because it will become confusing - like I said, many of our techniques only work within the infrastructure we have created. Where I can offer any useful insight, I will, though.
I recommend taking a look at this data yourself with the resources you have by building prototypes of your mechanisms, then test the bitz out of them until the statistics reveal themselves. As a matter of practice this is the only way to know what will work and how it can be applied.
Hope this helps,
_M
On Friday, September 10, 2004, 9:00:16 AM, Pete McNeil wrote:
On Friday, September 10, 2004, 10:43:39 AM, Jeff wrote:
JC>> What I'm talking about is an internal process where we keep track
JC>> of resolved IP addresses and use that to add new domains to
JC>> SURBLs sooner if they resolve to a similar IP range (probably
JC>> /24s). We would use the resolved IP addresses to add domains
JC>> to sc.surbl.org and possibly other lists sooner. Most would
JC>> probably get added on the first report. :-)

I recommend a bit of caution on this point. My preliminary data on using /24s to drive recursive domain additions suggests that it is prone to false positives - the network surrounding a given web host is frequently populated with non-spam servers, it seems... at least frequently enough that it's a challenge to generalize in this way.
Hi Pete,

Thanks for your comments. By "recursive domain additions" do you mean to initiate a proactive search of domains within a given network? What I'm proposing is not to actively try to search, but simply to bias the inclusion of domains that are *actually reported to us as being in spams*.
Hopefully my description of the difference makes some sense and it can be seen why the potential for false inclusions might be lower when the space is *actual spam reports*, and not the space of all domains hosted in nearby networks.
Jeff C.
On Friday, September 10, 2004, 1:13:38 PM, Jeff wrote:
JC> Thanks for your comments. By "recursive domain additions" do you
JC> mean to initiate a proactive search of domains within a given
JC> network? What I'm proposing is not to actively try to search,
JC> but simply to bias the inclusion of domains that are *actually
JC> reported to us as being in spams*.
What I mean by "recursive domain additions" (an internal name I use for this process) is something like this:
1. Spamtrap sources the addition of a domain (URI) to the blacklist.
2. A subset of domains in the blacklist are resolved to IPs and those IPs are added to an internal reference list.
3. Subsequent clean spamtrap sources are scanned for domain URIs that resolve to IPs on the reference list, and if found these new domains are added to the blacklist (or at least recommended as candidates).
So, this is not a proactive search really - rather the capture of one domain predisposes the candidate generator to capture additional domains that resolve to the same IP(s).
(Candidate generator = AI monitoring spamtrap data to extract URIs and recommend them as candidates for the blacklist.)
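Expressed as code, those three steps might look roughly like the following sketch; the blacklist, reference list, and spamtrap feed are just illustrative names, and the real candidate generator obviously does far more than this:

    import socket

    def resolve(domain):
        """Best-effort A-record lookup; None if the domain doesn't resolve."""
        try:
            return socket.gethostbyname(domain)
        except OSError:
            return None

    def update_reference_ips(blacklist, reference_ips):
        # Step 2: resolve (a subset of) blacklisted domains, remember the IPs.
        for domain in blacklist:
            ip = resolve(domain)
            if ip:
                reference_ips.add(ip)

    def recommend_candidates(spamtrap_domains, blacklist, reference_ips):
        # Step 3: new spamtrap-seen domains landing on known IPs become candidates.
        return {d for d in spamtrap_domains
                if d not in blacklist and resolve(d) in reference_ips}

In this framing the output is still only a recommendation; a human (or further checks) sits between the candidate list and the actual blacklist.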
--- Sorry for the complexity here, I'm used to thinking in terms of our system and it is sometimes difficult to describe the concepts outside of that context.
JC> Hopefully my description of the difference makes some sense
JC> and it can be seen why the potential for false inclusions
JC> might be lower when the space is *actual spam reports*, and
JC> not the space of all domains hosted in nearby networks.
Clearly. *actual spam reports* is analogous to clean spamtrap data - though I presume it may also include some non-spamtrap data submitted by users. You are definitely on the right track - that is, I think we're on the same page generally.
The caution is - even with very strong spamtraps there are errors in this process often enough to require some extra research before gating the new "candidates" into the blacklist, IME.
_M
On Friday, September 10, 2004, 10:40:39 AM, Pete McNeil wrote:
On Friday, September 10, 2004, 1:13:38 PM, Jeff wrote:
JC>> Thanks for your comments. By "recursive domain additions" do you
JC>> mean to initiate a proactive search of domains within a given
JC>> network? What I'm proposing is not to actively try to search,
JC>> but simply to bias the inclusion of domains that are *actually
JC>> reported to us as being in spams*.
What I mean by "recursive domain additions" (an internal name I use for this process) is something like this:
- Spamtrap sources the addition of a domain (URI) to the blacklist.
- A subset of domains in the blacklist are resolved to IPs, and those IPs are added to an internal reference list.
- Subsequent clean spamtrap sources are scanned for domain URIs that resolve to IPs on the reference list, and if found these new domains are added to the blacklist (or at least recommended as candidates).
Aha, the space I was referring to was SpamCop reports, which AFAIK are human-generated. SpamCop does get trap data, but I'm not exactly sure what they do with it.

That said, some of the same techniques might apply to our use of spamtrap data, provided hand-checking is also done.
Otherwise your description matches ours.
So, this is not a proactive search really - rather the capture of one domain predisposes the candidate generator to capture additional domains that resolve to the same IP(s).
Got it. That is similar to the principle I was proposing. ;-)
(Candidate generator = AI monitoring spamtrap data to extract URIs and recommend them as candidates for the blacklist.)
--- Sorry for the complexity here, I'm used to thinking in terms of our system and it is sometimes difficult to describe the concepts outside of that context.
We all get accustomed to thinking in terms of our own systems, which sometimes is why explanations like this are needed to clear things up. I find it sometimes helps to try to step back and describe an outsider's view of things. I don't always succeed or remember to do that. ;-)
JC>> Hopefully my description of the difference makes some sense
JC>> and it can be seen why the potential for false inclusions
JC>> might be lower when the space is *actual spam reports*, and
JC>> not the space of all domains hosted in nearby networks.
Clearly. *actual spam reports* is analogous to clean spamtrap data - though I presume it may also include some non-spamtrap data submitted by users. You are definitely on the right track - that is, I think we're on the same page generally.
The SpamCop data I assume to be *human-sourced* reports. That's what I meant by "actual spam reports". "Human spam reports" would have been more descriptive.
The caution is - even with very strong spamtraps there are errors in this process often enough to require some extra research before gating the new "candidates" into the blacklist, IME.
Our use of spamtraps (which mostly feed the WS and OB lists) is carefully tested. The WS entries are all supposed to be hand-checked, since we all agree that purely automatic methods let in too many FPs. Human checkers make mistakes too, though we're trying to cut down on those errors, for example by suggesting some requirements such as:
1. Domain age. Older domains should only be added with a lot of evidence. Most spammer domains are no more than a week or two old, often less than a few days old.
2. Only add domains that only appear in spams. Don't add domains that appear in hams.
The second seems the hardest to get across, even though it should be pretty obvious. The problem seems to be that people say "yep, I've seen a spam with this domain, so I'm adding it" - but that alone is not the right criterion.
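As a loose illustration of those two requirements, here is a pre-check a reviewer's tooling might run before a candidate even reaches a human; the registration date is assumed to come from WHOIS (not shown), and the spam/ham domain sets are whatever corpora the checker has on hand:

    from datetime import datetime, timedelta

    MAX_AGE = timedelta(days=14)  # "no more than a week or two old"

    def passes_prechecks(domain, created, spam_domains, ham_domains, now=None):
        """created: the domain's registration date, e.g. from WHOIS (not shown).
        1. The domain is young; older domains need much stronger evidence.
        2. The domain appears in spam and never in ham."""
        now = now or datetime.utcnow()
        young = created is not None and (now - created) <= MAX_AGE
        return young and domain in spam_domains and domain not in ham_domains

Anything that fails either check would need the extra evidence described above rather than being listed on sight.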
Thanks for comparing notes! :-)
Jeff C.