We've been working for a few weeks with the folks at CBL to extract URIs appearing on their extensive spam traps that also trigger inclusion in CBL, i.e. zombies, open proxies, etc. What this means is that we can get URIs of spams that are sent using zombies and open proxies, where that mode of sending is a very good indication of spamminess since legitimate senders probably don't use hijacked hosts or open proxies to send their mail.
For anyone not familiar with CBL, here are a few words about it. IP addresses of compromised senders like zombies and open proxies end up in cbl.abuseat.org and xbl.spamhaus.org, which are widely used to block spam senders at the MTA level. Experience with these RBLs shows them to be very accurate and useful indicators of compromised senders, with a low false positive rate. Many systems and networks find them useful to block on, with good results.
One of the goals of looking at URIs appearing on the CBL traps in messages also triggering CBL inclusion is to get listings of new URIs into SURBLs sooner. One of the valid criticisms of SURBLs is that there is too much delay between the time a URI is first used and it gets listed in SURBLs. This is a problem with RBLs in general, and it means that the targeted senders (or URIs) have a window of time before detection and list inclusion where they can send unhindered.
One advantage we have with SURBLs is that the hosts mentioned in spam URIs tend to be longer-lasting than the compromised spam senders. In other words URIs are often somewhat more "durable" indicators of spams than zombie IP addresses. Zombie usage is often rather fleeting and in the minutes to hours range, whereas URI usage can be in the days to weeks range. Therefore if we can find URIs sent by zombies, we can potentially "bridge the gap" and get new URI hosts blacklisted sooner. In that sense they work together with and improve the effectiveness of RBLs like CBL by creating a longer-lasting and more persistent view of some of the same types of messages that get caught by RBLs, by taking a closer look at the content of those messages, specifically the sites they advertise.
An aspect of the CBL URI data that makes it potentially very attractive as a new data source for SURBLs is that the CBL traps are very extensive and specifically focussed on and correlated with zombie and open proxy detection. As such, it's somewhat orthogonal to our other existing SURBL data sources, which are manual lists, user reports, or smaller, but still rather substantial *spam-focussed* traps. As a new data source, the CBL URIs could therefore complement our existing sources quite well due to their size and differing composition, thus hopefully increasing the overall detection performance of SURBLs in general.
Like most URI data sources, the main problem with the CBL URI data is false positives or appearance of otherwise legitimate domains. For example amazon.com is one that appears frequently. This does not mean that amazon.com is using zombies to send mail, or that the CBL traps have been deliberately poisoned, but that spammers occasionally mention legitimate domains like amazon.com in their spams. FPs aside, the CBL URI data does indeed appear to include other domains operating for the benefit of spammers or their customers. These are the new domains we would like to catch. Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
1. Counting trap appearance volume and taking the top most often appearing URIs.
2. Including domains (and the occasional IP address) that are already in other SURBLs. This is useful as a confirmation of the zombie dimension of existing SURBL records.
3. Including domains whose NS, MX or web host (A) records resolve into sbl.spamhaus.org-listed address space.
4. Excluding records already in our somewhat limited whitelists.
In fact we have an existing program which takes a combination of the first four to produce a list, but the output of that program is not yet published in SURBL form. We may put these at the 128 bit position of multi.surbl.org to begin testing, but looking at the data there are probably still too many FPs to put it into official production use. Consider some of the additional possibilities below, which are not currently being done, and let us know if you think it may be useful to start publishing the above data.
5. Including domains that resolve into the IP space of manually reported URIs, for example from the SpamCop spamvertised site data used in sc.surbl.org and ab.surbl.org.
6. Doing regular (probably nightly) manual review of SURBL additions and whitelisting FPs that appear. (This should probably be done regardless of any new data sources.)
Obviously we can't check every new domain that appears on SURBLs, but we could set up criteria to flag records for checking, such as domain registration older than 90 days, non-inclusion in SBL, few NANAS reports, etc. Some kind of rating engine using those or other criteria could be applied to new listings to flag some of the more likely FPs for manual review. We would not automatically whitelist these, but flag them for further checking.
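For illustration, here is roughly how a few of those criteria might be combined. This is only a sketch in Python: the domain names, hit counts and the 50-hit cutoff are invented, and it is not the actual program mentioned above.

def select_candidates(cbl_counts, existing_surbl, whitelist, min_hits=50):
    """cbl_counts maps a domain to its CBL trap hit count."""
    auto_add, needs_review = [], []
    for domain, hits in sorted(cbl_counts.items(), key=lambda kv: -kv[1]):
        if domain in whitelist:           # criterion 4: never list whitelisted domains
            continue
        if domain in existing_surbl:      # criterion 2: confirms an existing SURBL record
            auto_add.append(domain)
        elif hits >= min_hits:            # criterion 1: high trap volume
            needs_review.append(domain)   # criterion 6: hold for manual review before listing
    return auto_add, needs_review

auto_add, review = select_candidates(
    {"bulker-example.test": 240, "amazon.com": 12, "rare-example.test": 2},
    existing_surbl={"bulker-example.test"},
    whitelist={"amazon.com"},
)
print(auto_add)   # ['bulker-example.test']
print(review)     # [] -- rare-example.test falls below the volume cutoff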
CBL URI data may well represent a useful new data source, but the best way to determine that may be to start using them. However I'd like your comments on some of the above FP mitigation ideas and any new ideas anyone may have for that purpose before we put these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
CBL URI data may well represent a useful new data source, but the best way to determine that may be to start using them. However I'd like your comments on some of the above FP mitigation ideas and any new ideas anyone may have for that purpose before we put these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff,
I can safely run this new zone on a couple of boxes and report FPs. What are the coordinates?
If we can Rsync, pls let us know as well.
Thanks
Alex
On Tuesday, April 19, 2005, 1:30:48 AM, Alex Broens wrote:
Jeff Chan wrote:
CBL URI data may well represent a useful new data source, but the best way to determine that may be to start using them. However I'd like your comments on some of the above FP mitigation ideas and any new ideas anyone may have for that purpose before we put these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff,
I can safely run this new zone on a couple of boxes and report FPs. What are the coordinates?
If we can Rsync, pls let us know as well.
Thanks
Alex
Hi Alex, Thanks much for your kind offer; a separate list may be a good way to test it for now, as we've done with new lists in the past. You can find the files on our private rsync server as xs.surbl.org.bind and xs.surbl.org.rbldnsd, where xs I suppose can stand for Exploited Sender. :-) (The name is not fixed; suggestions for names are welcomed.) If you'd like to serve it publicly for testing, let me know and I'll put your name servers in a public delegation. (Same goes for anyone else. :-) (Please use the rbldnsd versions, as they're easier for me to munge the NS records correctly in.)
OTOH, we could also put it in multi on the 128 bit and not publish it as an official list yet. OTOOH that could make it de facto live if particular implementations did not look at the actual bit position values and simply looked at list inclusion. So maybe a separate list is better for now.
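For what it's worth, here is a minimal sketch (Python, standard library only) of the bit-aware check a client needs so that data published on the 128 bit gets scored rather than treated as a production hit. The queried domain is just a placeholder, and a real client would of course handle timeouts and use its own resolver setup.

import socket

def multi_bits(domain):
    """Return the bitmask encoded in the multi.surbl.org answer, or 0 if unlisted."""
    try:
        addr = socket.gethostbyname(domain + ".multi.surbl.org")
    except socket.gaierror:
        return 0
    return int(addr.split(".")[-1])    # the bitmask lives in the last octet

bits = multi_bits("example.com")
if bits & 127:                         # any of the established list bits
    print("listed on a production SURBL list")
elif bits & 128:                       # only the experimental bit is set
    print("listed only on the test bit -- score it, don't block on it")
else:
    print("not listed")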
A couple of notes about this version of the data. It's based on about a million CBL URI hits per day, which is only a small portion of their total hits. It's also only hits that come from senders that qualify for CBL inclusion, i.e. from zombies and open proxies. From that we're currently taking the records at or above the 97th percentile of report volume and adding the existing SURBL hits (without respect to percentile). SBL hits are not included until I can re-engineer some things.
The 97th percentile is quite conservative and results in only 70 new records not already in SURBLs, whereas the full list has about 6000 new records, but it also avoids many obvious FPs in the "noise" of infrequently appearing domains, for example afghan.com at 2 hits and aarhus.com at 3 hits. In a sense taking the most often appearing records is a good thing, since they're also the most likely to appear in spams and the most likely to come from zombies. In other words, there may only be 70 new records added to SURBLs at this level, but they should be 70 really big spammers. :-) It would be very interesting to know how many spams are being hit by only these 70.
Also this is only a starting point. We can tune further from here, bump up the inclusion as we improve FP procedures, etc. We can also try the 98th percentile and see how it works out. We can also threshold the counts instead of taking a percentile, so that we only get records that have more than N hits, etc.
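Roughly, the percentile and fixed-threshold variants would look something like this; the numbers are toy data, not the real feed.

def percentile_cutoff(hit_counts, pct):
    """Smallest hit count needed to land at or above the given percentile."""
    counts = sorted(hit_counts.values())
    index = min(int(len(counts) * pct / 100.0), len(counts) - 1)
    return counts[index]

hits = {"a.test": 2, "b.test": 3, "c.test": 60, "d.test": 250, "e.test": 900}
cutoff = percentile_cutoff(hits, 97)                           # 900 for this toy data
by_percentile = [d for d, n in hits.items() if n >= cutoff]    # percentile approach
by_threshold = [d for d, n in hits.items() if n >= 50]         # fixed N-hit threshold approach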
Note also that the proportion of new records will vary as the race between existing SURBLs and new trap data goes back and forth. In other words there will be some varying lead and lag between the lists, though I expect the CBL data will generally tend to see the new records first, i.e. xs will usually lead the other SURBLs.
Here are some stats of total records, blacklist hits, whitelist hits and new records at some selected percentile levels:
percentile   records   blacklist hits   whitelist hits   novel
       100      6929              764              248    5917
        99      2897              672              137    2088
        98       722              523               57     142
        97       446              349               28      69
        96       357              296               16      45
        95       302              259               12      31
        94       268              229               11      28
        93       246              209               11      26
        92       228              197               11      20
        91       212              181               11      20
        90       198              168               11      19
        89       186              159               11      16
        88       177              151               10      16
        87       168              142               10      16
        86       160              135               10      15
        85       152              133                8      11
At the 95th percentile we're getting about 200 hits per record. At the 96th percentile we're getting about 120 hits per record. At the 97th percentile we're getting about 60 hits per record. At the 98th percentile, that goes down to about 10 hits per record. The 99th percentile gets down to about 2 hits per record, which is the minimum threshold CBL applies on their end, so it's not distinct from the 100th percentile in terms of hit counts.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
One of the goals of looking at URIs appearing on the CBL traps in messages also triggering CBL inclusion is to get listings of new URIs into SURBLs sooner. One of the valid criticisms of SURBLs is that there is too much delay between the time a URI is first used and it gets listed in SURBLs. This is a problem with RBLs in general, and it means that the targeted senders (or URIs) have a window of time before detection and list inclusion where they can send unhindered.
...
Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
...
What strikes me most is the fundamental incompatibility between aiming to reduce the window of opportunity before a URI gets onto any lists, and using inclusion on other lists as a way of confirming the validity of the data.
How about a multi-level system, where any (non-whitelisted) URI in the CBL data is immediately included on the first level, then gradually gets promoted to the higher levels once it is corroborated by further reports, inclusion in other lists, manual confirmation or whatever. The last byte of the A record could be used to indicate the level. The number of levels and the details of promotion/demotion strategies would obviously need to be worked out and refined over time.
Logically the lower levels would have higher FP rates, but can be given lower SA scores (or equivalent weightings in other client apps).
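To make the idea concrete, a client could map the level to a score weight along these lines; the levels, addresses and weights below are entirely hypothetical, since no such list exists yet.

LEVEL_SCORES = {1: 0.5, 2: 1.5, 3: 3.0}    # hypothetical level -> score weight

def score_for(answer_ip):
    """Map a hypothetical 127.0.0.N answer to a score weight for level N."""
    level = int(answer_ip.split(".")[-1])
    return LEVEL_SCORES.get(level, 0.0)

print(score_for("127.0.0.1"), score_for("127.0.0.3"))   # 0.5 3.0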
John.
John Wilcock wrote:
Logically the lower levels would have higher FP rates, but can be given lower SA scores (or equivalent weightings in other client apps)
For that matter, it occurs to me that it could actually be a *good* thing if an obscure but legitimate domain gets listed at the lower levels of a multi-level system due to being mentioned in a big spam run, as its presence would, albeit temporarily, be a sign of spamminess. This logic wouldn't apply for more commonly-mentioned legitimate domains, but those will be on the SURBL whitelist anyway.
Obviously this only holds in the context of a weighted scoring system such as SpamAssassin, not one which excludes messages outright.
John.
On Tuesday, April 19, 2005, 2:35:37 AM, John Wilcock wrote:
John Wilcock wrote:
Logically the lower levels would have higher FP rates, but can be given lower SA scores (or equivalent weightings in other client apps)
For that matter, it occurs to me that it could actually be a *good* thing if an obscure but legitimate domain gets listed at the lower levels of a multi-level system due to being mentioned in a big spam run, as its presence would, albeit temporarily, be a sign of spamminess. This logic wouldn't apply for more commonly-mentioned legitimate domains, but those will be on the SURBL whitelist anyway.
I'm not in favor of even intermittent listing of otherwise legitimate domains. Remember that many of the FPs are innocent bystanders, like a stock spammer mentioning a legitimate investment site, a bank phish mentioning a legitimate bank, or a 419er mentioning some news story about their purported country, etc.
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Obviously this only holds in the context of a weighted scoring system such as SpamAssassin, not one which excludes messages outright.
John.
Indeed.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Your concept of "black lists" is too black, or in other words wrong. Nobody uses say *.whois.rfc-ignorant.org to block all *.co.uk domains. That's no reason to close this list, it's still useful for scoring.
What you can't (or rather shouldn't) do is to _mix_ different concepts in one combined list like MULTI, which is actually meant to block. But in separate lists you can do anything you like.
For XS I don't see your problem, it could be a part of MULTI.
Bye, Frank
On Tuesday, April 19, 2005, 10:34:17 AM, Frank Ellermann wrote:
Jeff Chan wrote:
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Your concept of "black lists" is too black, or in other words wrong.
Hmm, perhaps "wrong" is a little (too) strong a statement. SURBLs as they are currently defined are proving quite useful for many folks.
Nobody uses say *.whois.rfc-ignorant.org to block all *.co.uk domains. That's no reason to close this list, it's still useful for scoring.
Sure, but that's a different list, with a different purpose.
What you can't (or rather shouldn't) do is to _mix_ different concepts in one combined list like MULTI, which is actually meant to block. But in separate lists you can do anything you like.
True, but for overhead reasons and general project focus, we're going to try to stick to blacklists and multi.
For XS I don't see your problem, it could be a part of MULTI.
Bye, Frank
Yes, we're just testing xs separately for now to see how it's performing, tune it further, try some different processing options, etc. If we can get it to work well, we will add it to multi, as you suggest. :-)
This is how we also brought most of the other new lists like OB, JP, PH, etc. into SURBLs: test first and add to multi later.
So if anyone has some results to share we'd like to see them. :-)
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
Your concept of "black lists" is too black, or in other words wrong.
Hmm, perhaps "wrong" is a little (too) strong statement.
Okay, let's agree on black != block ;-)
SURBLs as they are currently defined are proving quite useful for many folks.
Sure. But a red.surbl.org "this is an open redirector" could also be useful. Yesterday I actually missed a white.surbl.org when I didn't see 18.to in MULTI
If you have whitelisted 18.to please don't, I got more than three nina.18.to in the last weeks.
Black, white, red, what else? It's all okay if you don't mix it in one list, where stupid users would get it wrong (e.g. SORBS 127.0.0.6 is a NoNo).
Bye, Frank
On Wednesday, April 20, 2005, 10:10:48 AM, Frank Ellermann wrote:
Jeff Chan wrote:
SURBLs as they are currently defined are proving quite useful for many folks.
Sure. But a red.surbl.org "this is an open redirector" could also be useful.
It doesn't really fit our model, which is to list blackhats, especially zombie users.
Yesterday I actually missed a white.surbl.org when I didn't see 18.to in MULTI
If you have whitelisted 18.to please don't, I got more than three nina.18.to in the last weeks.
It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
red.surbl.org "this is an open redirector" could also be useful.
It doesn't really fit our model, which is to list blackhats, especially zombie users.
Spammers are slow, but sooner or later they'll use redirectors everywhere to bypass SURBL. I don't see the relation between zombies and SURBL; zombies are used to send spam, not to host spamvertized sites. Or are you talking about zombies used as redirectors? Are we already at this point?
It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not.
I found neither nina.18.to nor 18.to in multi when I looked for it. Today http://nina.18.to is http://Opt.To/notfound.htm
Opt.To offers "free subdomain name redirection service and ads for your site". Whatever that means. AFAIK the automatic procedures won't list subdomains of an SLD as long as the SLD is not a "two level ccTLD". Maybe you could add redirectors like 18.to to http://spamcheck.freeapp.net/two-level-tlds
Otherwise I don't see how you could catch the next nina.18.to if it's reported indirectly via SC as a spamvertized site. Bye.
Jeff said: "It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not."
Why not then add certain redirectors to the SURBL lists where the redirector is deemed to NOT be found in hams? Specifically, I'm referring to situations where we could list redirect.somedomain.com but NOT list somedomain.com
Rob McEwen
Rob McEwen wrote:
Jeff said: "It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not."
Why not then add certain redirectors to the SURBL lists where the redirector is deemed to NOT be found in hams? Specifically, I'm referring to situations where we could list redirect.somedomain.com but NOT list somedomain.com
That would require the calling applications to know to do a lookup on redirect.somedomain.com and not somedomain.com. SpamAssassin for one won't do that.
I'm trying to get user configurable redirector pattern matching into the SA code (bug 4176). I've got one ISP using it to identify domains being redirected to via the zdnet redirector with good results. Hopefully I can get it in 3.1.
Daryl
On Tuesday, April 26, 2005, 8:22:26 AM, Daryl O'Shea wrote:
Rob McEwen wrote:
Jeff said: "It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not."
Why not then add certain redirectors to the SURBL lists where the redirector is deemed to NOT be found in hams? Specifically, I'm referring to situations where we could list redirect.somedomain.com but NOT list somedomain.com
That would require the calling applications to know to do a lookup on redirect.somedomain.com and not somedomain.com. SpamAssassin for one won't do that.
Actually I thought SpamAssassin did check two level domains like foo.com on two and three levels. Not sure if it still does that but I recall it doing that at one point, i.e. both redirect.somedomain.com *and* somedomain.com were checked.
Pretty sure we saw that in the DNS traffic SA was generating, or showing up in debug mode. But maybe the domain handling's been updated to be more specific since then.
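The lookup behaviour I'm describing would amount to something like the following sketch, where the two-level-TLD set is only a tiny stand-in for the real list, so it's illustrative rather than the actual SpamAssassin code.

TWO_LEVEL_TLDS = {"co.uk", "com.au"}    # illustrative subset of the real two-level-TLD list

def lookup_candidates(hostname):
    """Return the names to look up: the base domain, plus one extra label if present."""
    labels = hostname.lower().split(".")
    base_len = 3 if ".".join(labels[-2:]) in TWO_LEVEL_TLDS else 2
    names = [".".join(labels[-base_len:])]
    if len(labels) > base_len:
        names.append(".".join(labels[-(base_len + 1):]))
    return names

print(lookup_candidates("redirect.somedomain.com"))   # ['somedomain.com', 'redirect.somedomain.com']
print(lookup_candidates("www.example.co.uk"))         # ['example.co.uk', 'www.example.co.uk']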
SA also checks all visible hosts (including redirected-to ones) in a URI, including all of a redirector, so:
http://redirector.clubie.isp/blah/feh/http://spammer.com/
and similar style URIs are checked by SpamAssassin for at least clubie.isp and spammer.com. That's what I recall from the original SA development of redirector handling.
I'm trying to get user configurable redirector pattern matching into the SA code (bug 4176). I've got one ISP using it to identify domains being redirected to via the zdnet redirector with good results. Hopefully I can get it in 3.1.
Daryl
Cool. Very glad to hear there's code to handle this other style of redirector in the works! :-)
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
On Tuesday, April 26, 2005, 8:22:26 AM, Daryl O'Shea wrote:
That would require the calling applications to know to do a lookup on redirect.somedomain.com and not somedomain.com. SpamAssassin for one won't do that.
Actually I thought SpamAssassin did check two level domains like foo.com on two and three levels. Not sure if it still does that but I recall it doing that at one point, i.e. both redirect.somedomain.com *and* somedomain.com were checked.
Pretty sure we saw that in the DNS traffic SA was generating, or showing up in debug mode. But maybe the domain handling's been updated to be more specific since then.
Hmm, I could be mistaken. I guess I could check the code or a debug, but are there any three level domains listed to do a quick check against?
SA also checks all visible hosts (including redirected-to ones) in a URI, including all of a redirector, so:
http://redirector.clubie.isp/blah/feh/http://spammer.com/
and similar style URIs are checked by SpamAssassin for at least clubie.isp and spammer.com. That's what I recall from the original SA development of redirector handling.
Yeah, it'll look up both those domains. Any time it finds http(s) in the URI it assumes that it and the rest is a domain being redirected to.
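Something along these lines, as a rough illustration rather than the actual SpamAssassin parser:

import re
from urllib.parse import urlparse

def hosts_in_uri(uri):
    """Extract every host in the URI, treating any embedded http(s):// as a redirect target."""
    hosts = []
    for match in re.finditer(r"https?://", uri, re.IGNORECASE):
        parsed = urlparse(uri[match.start():])
        if parsed.hostname:
            hosts.append(parsed.hostname)
    return hosts

print(hosts_in_uri("http://redirector.clubie.isp/blah/feh/http://spammer.com/"))
# ['redirector.clubie.isp', 'spammer.com']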
Daryl
On Tuesday, April 26, 2005, 7:56:17 AM, Rob McEwen wrote:
Jeff said: "It is possible to blacklist nina.18.to but not 18.to if nina is owned by spammers but 18 is not."
Why not then add certain redirectors to the SURBL lists where the redirector is deemed to NOT be found in hams? Specifically, I'm referring to situations where we could list redirect.somedomain.com but NOT list somedomain.com
Rob McEwen
Sure, if we find a redirector owned and operated purely by spammers (as opposed to clueless ISPs, etc.) then we can certainly blacklist it.
So far I don't recall seeing any that fit that category, but if spammers do start running their own redirectors we can absolutely blacklist them.
Jeff C. -- "If it appears in hams, then don't list it."
At 03:57 2005-04-19 -0700, Jeff Chan wrote:
On Tuesday, April 19, 2005, 2:35:37 AM, John Wilcock wrote:
For that matter, it occurs to me that it could actually be a *good* thing if an obscure but legitimate domain gets listed at the lower levels of a multi-level system due to being mentioned in a big spam run, as its presence would, albeit temporarily, be a sign of spamminess. This logic wouldn't apply for more commonly-mentioned legitimate domains, but those will be on the SURBL whitelist anyway.
I'm not in favor of even intermittent listing of otherwise legitimate domains. Remember that many of the FPs are innocent bystanders, like a stock spammer mentioning a legitimate investment site, a bank phish mentioning a legitimate bank, or a 419er mentioning some news story about their purported country, etc.
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Maybe this data source would be best used as a (real dark) non-multi grey list? Instead of trying to make it play well in a black-and-white set-up?
Patrik
On Tuesday, April 19, 2005, 3:35:02 PM, Patrik Nilsson wrote:
At 03:57 2005-04-19 -0700, Jeff Chan wrote:
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Maybe this data source would be best used as a (real dark) non-multi grey list? Instead of trying to make it play well in a black-and-white set-up?
Patrik
I can see your point that uncertain data may argue for a lower weighting of it in a greylist, and that idea has merit, but I think it may be more useful to try to grab the true blackhat domains out of the data and simply block on them, assuming that's possible.
The fact that these are being sent through zombies, etc. certainly says much (bad) about the senders. Unfortunately it doesn't automatically mean that the URIs they mention are necessarily black. But I still believe it may be possible to gather that information if we're sufficiently clever.
For example, looking at only the most commonly appearing domains simplifies the task of checking them by reducing their volume, i.e., with fewer domains, there are fewer to check. It's somewhat crude, but it does simplify the task. At the same time it does imply that we're looking at the domains most likely to appear in spam, since they appeared so often on the CBL traps.
In other words, taking the top percentile lets us operate in the taller part of the Zipf curve and ignore the very high volume of low-hit-rate noise in the long, low tail.
Jeff C. -- "If it appears in hams, then don't list it."
On Tuesday, April 19, 2005, 2:02:10 AM, John Wilcock wrote:
Jeff Chan wrote:
One of the goals of looking at URIs appearing on the CBL traps in messages also triggering CBL inclusion is to get listings of new URIs into SURBLs sooner. One of the valid criticisms of SURBLs is that there is too much delay between the time a URI is first used and it gets listed in SURBLs. This is a problem with RBLs in general, and it means that the targeted senders (or URIs) have a window of time before detection and list inclusion where they can send unhindered.
...
Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
...
What strikes me most is the fundamental incompatibility between aiming to reduce the window of opportunity before a URI gets onto any lists, and using inclusion on other lists as a way of confirming the validity of the data.
I agree that depending on inclusion in other lists can sometimes mean that we're dependent on those lists and will therefore lag them. On the other hand, something like SBL inclusion does not necessarily have that result. SBL lists IP ranges belonging to spammers. If a spammer registers a brand new domain but points web, NS or MX service into SBL-listed space, then the domain could in principle be listed immediately, by virtue of IP matching and not the domain itself matching any other list. IOW matches like that permit immediate listing of completely new domains that don't appear as domains in other lists.
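For anyone curious what that kind of check looks like mechanically, here's a rough sketch. It assumes the dnspython package, and real processing would need caching, timeouts and better error handling.

import dns.exception
import dns.resolver

def sbl_listed(ip):
    """True if the IP is in sbl.spamhaus.org (listed IPs return a 127.0.0.x answer)."""
    query = ".".join(reversed(ip.split("."))) + ".sbl.spamhaus.org"
    try:
        dns.resolver.resolve(query, "A")
        return True
    except dns.exception.DNSException:
        return False

def ips_for(domain):
    """Best-effort collection of the web host (A), NS and MX addresses for a domain."""
    ips = set()
    try:
        ips.update(r.address for r in dns.resolver.resolve(domain, "A"))
    except dns.exception.DNSException:
        pass
    for rtype, attr in (("NS", "target"), ("MX", "exchange")):
        try:
            names = [str(getattr(r, attr)) for r in dns.resolver.resolve(domain, rtype)]
        except dns.exception.DNSException:
            continue
        for name in names:
            try:
                ips.update(r.address for r in dns.resolver.resolve(name, "A"))
            except dns.exception.DNSException:
                pass
    return ips

def domain_touches_sbl(domain):
    """True if any of the domain's A, NS or MX hosts sit in SBL-listed space."""
    return any(sbl_listed(ip) for ip in ips_for(domain))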
The inclusions based on other lists represent a separate approach, an attempt to reach into the "noise" of low-hit-count records to see if any useful data can be grabbed from it. It's generally not our primary use of the data. We will use other techniques, such as looking at the volume of hits per record, to get new records, do some tuning, etc.
Suggestions of other methods of correlating the data to dig deeper into the noise are welcomed.
How about a multi-level system, where any (non-whitelisted) URI in the CBL data is immediately included on the first level, then gradually gets promoted to the higher levels once it is corroborated by further reports, inclusion in other lists, manual confirmation or whatever. The last byte of the A record could be used to indicate the level. The number of levels and the details of promotion/demotion strategies would obviously need to be worked out and refined over time.
Logically the lower levels would have higher FP rates, but can be given lower SA scores (or equivalent weightings in other client apps).
John.
Right, but it probably should be kept in mind that some SURBL-using applications may not be doing weight-type scoring. Some may be doing outright yes/no blocking. I also prefer the more difficult approach of trying to say a record belongs to hard core spammers or it doesn't. I'm not a big fan of uncertain or grey results. Especially given applications that do outright blocking, listings may be most useful when they're either black or white.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
SBL lists IP ranges belonging to spammers. If a spammer registers a brand new domain but points web, NS or MX service into SBL-listed space, then the domain could in principle be listed immediately, by virtue of IP matching and not the domain itself matching any other list. IOW matches like that permit immediate listing of completely new domains that don't appear as domains in other lists.
OK, I'm with you now.
Right, but it probably should be kept in mind that some SURBL-using applications may not be doing weight-type scoring. Some may be doing outright yes/no blocking. I also prefer the more difficult approach of trying to say a record belongs to hard core spammers or it doesn't. I'm not a big fan of uncertain or grey results. Especially given applications that do outright blocking, listings may be most useful when they're either black or white.
For applications that do outright blocking, naturally the only acceptable results are black or white. But for those that do make use of weighted scoring, shades of grey are also an extremely valuable contribution.
The multi-level shades-of-grey-list I was advocating could conceivably coexist with your existing black-or-white approach, providing useful information to those applications that can cope with greys, and indeed feeding data into the blacklist once a domain reached a dark enough shade of grey!
John.
On 4/19/05, Jeff Chan jeffc-at-surbl.org |surbl list| <...> wrote:
We've been working for a few weeks with the folks at CBL to extract URIs appearing on their extensive spam traps that also trigger inclusion in CBL, i.e. zombies, open proxies, etc. What this means is that we can get URIs of spams that are sent using zombies and open proxies, where that mode of sending is a very good indication of spamminess since legitimate senders probably don't use hijacked hosts or open proxies to send their mail.
great
<snip>
Like most URI data sources, the main problem with the CBL URI data is false positives or appearance of otherwise legitimate domains. For example amazon.com is one that appears frequently. This does not mean that amazon.com is using zombies to send mail, or that the CBL traps have been deliberately poisoned, but that spammers occasionally mention legitimate domains like amazon.com in their spams. FPs aside, the CBL URI data does indeed appear to include other domains operating for the benefit of spammers or their customers. These are the new domains we would like to catch. Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
<snip>
Therefore please speak up if you have any ideas or comments,
3 ideas:
1) Use the base data used for sc. Before inclusion you want a number of reports to SpamCop (I don't recall the exact number, but let's say 20) before adding it to sc. A domain that appears on both the CBL datafeed and the sc datafeed at the "same" time is far more likely spam. You could either use the new datafeed to selectively lower the threshold for sc (not really my first choice) or use the occurrences in the sc datafeed to lower the threshold for the new list. Only a few occurrences (more than one) on the sc datafeed would be enough in that case.
2) Try to get a big list of domains that are probably OK (not a whitelist as such, but a greylist to avoid automatically adding domains). They are probably not as fast moving as spam domains (i.e. this list wouldn't need very frequent updating):
a) use data from large proxy servers
b) use data from inside e-mails that passed a spam filter as ham.
While there are privacy issues with both techniques, they are probably small from a practical viewpoint when using large quantities and a rather high threshold before inclusion.
Alain
Just a few things to add
<snip>
3 ideas:
- Use the base data used for sc. Before inclusion you want a number of reports to SpamCop (I don't recall the exact number, but let's say 20) before adding it to sc. A domain that appears on both the CBL datafeed and the sc datafeed at the "same" time is far more likely spam. You could either use the new datafeed to selectively lower the threshold for sc (not really my first choice) or use the occurrences in the sc datafeed to lower the threshold for the new list. Only a few occurrences (more than one) on the sc datafeed would be enough in that case.
After thinking a while longer, it's maybe not such a bad idea to use the new data to improve the SC list. By needing fewer separate reports, the time gap until inclusion will be much smaller. Instead of 10 (just checked) it may be enough to use 3 or 4, which gives a gain of at least 6 minutes, but probably much more. Moreover it's probably possible to check the "right" threshold and the average time gain. Check the percentage of domains that get inside the CBL datafeed and get fewer reports than the threshold, for example (no real data):
1 report only and CBL'ed:    10%
2 reports only and CBL'ed:    5%
3 reports only and CBL'ed:    3%
...
9 reports only and CBL'ed:    0.01%
10 or more reports and CBL'ed: 75%
(And compare against those that are not CBL'ed)
Another thing I thought of, not linked to CBL: the speed at which reports come in is also important; 5 reports in 15 minutes is probably much more spammy than 5 reports in 1 day.
Alain
On Tuesday, April 19, 2005, 1:34:25 PM, Alain Alain wrote:
- Use the base data used for sc. Before inclusion you want a number of reports to SpamCop (I don't recall the exact number, but let's say 20) before adding it to sc. A domain that appears on both the CBL datafeed and the sc datafeed at the "same" time is far more likely spam. You could either use the new datafeed to selectively lower the threshold for sc (not really my first choice) or use the occurrences in the sc datafeed to lower the threshold for the new list. Only a few occurrences (more than one) on the sc datafeed would be enough in that case.
After thinking a while longer, it's maybe not such a bad idea to use the new data to improve the SC list. By needing fewer separate reports, the time gap until inclusion will be much smaller. Instead of 10 (just checked) it may be enough to use 3 or 4, which gives a gain of at least 6 minutes, but probably much more. Moreover it's probably possible to check the "right" threshold and the average time gain. Check the percentage of domains that get inside the CBL datafeed and get fewer reports than the threshold, for example (no real data):
1 report only and CBL'ed:    10%
2 reports only and CBL'ed:    5%
3 reports only and CBL'ed:    3%
...
9 reports only and CBL'ed:    0.01%
10 or more reports and CBL'ed: 75%
(And compare against those that are not CBL'ed)
Another thing I thought of, not linked to CBL: the speed at which reports come in is also important; 5 reports in 15 minutes is probably much more spammy than 5 reports in 1 day.
Alain
CBL hits would be a good indication of spamminess, but only if we could eliminate the FPs. If amazon.com appears a lot on CBL and someone reports amazon.com to SpamCop, even accidentally, it could get listed (were it not for our whitelists). This would be more of a problem for whitehats that are less well known than amazon, etc.
Rate of reports or hits in CBL or SC or any other source can be a good indicator of spam, except that legitimate mailers sometimes send to large mailing lists suddenly, and this causes a spike that can look like spamsign. This trips up the OB data sometimes. However the CBL traps are so large that it takes a very large spike to register. Therefore it's probably a better indication of a spam attack than Outblaze may be seeing. Also, the fact that our version of the CBL trap data is correlated with zombie and open proxy activity probably helps *a lot*. Legitimate mailers, even those sending to a large list of their own customers, probably don't use zombies. Large, sudden volumes of zombie hits may be indicative of a major spammer using a lot of their bots suddenly. Not all spammers send large blasts like that, but enough may that this could indeed be useful to note.
Regarding applying special measurements to get the lower-hit CBL records onto the XS list sooner, yes, that's precisely the goal. We can automatically find the most common hits through percentiles or thresholds. It's the less common hits that we want to try to list sooner and "dig out of the noise."
Regarding the SC data, I'm also planning to do a self-correlation on the SC data into IP addresses, probably /24s to bias inclusion of SC data more aggressively. I.e. if a new site resolves into a /24 that previously had a lot of spam reports, then that new domain would get added to SC much sooner.
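As a sketch of that self-correlation idea, with invented addresses and thresholds:

from collections import Counter

def slash24(ip):
    return ".".join(ip.split(".")[:3])

# IPs that previously reported spamvertised sites resolved to (invented data)
past_report_ips = ["203.0.113.10", "203.0.113.45", "203.0.113.99", "198.51.100.7"]
heat = Counter(slash24(ip) for ip in past_report_ips)

def reports_needed(new_site_ip, normal=10, reduced=3, hot_cutoff=3):
    """Require fewer SpamCop reports when the new site's /24 already has spam history."""
    return reduced if heat[slash24(new_site_ip)] >= hot_cutoff else normal

print(reports_needed("203.0.113.200"))   # 3: lands in a /24 with prior reports
print(reports_needed("192.0.2.1"))       # 10: no history for that /24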
Jeff C. -- "If it appears in hams, then don't list it."
On Tuesday, April 19, 2005, 12:31:30 PM, Alain Alain wrote:
3 ideas:
- Use the base data used for sc. Before inclusion you want a number of reports to SpamCop (I don't recall the exact number, but let's say 20) before adding it to sc. A domain that appears on both the CBL datafeed and the sc datafeed at the "same" time is far more likely spam. You could either use the new datafeed to selectively lower the threshold for sc (not really my first choice) or use the occurrences in the sc datafeed to lower the threshold for the new list. Only a few occurrences (more than one) on the sc datafeed would be enough in that case.
Yes, or use some kind of sliding threshold for XS inclusion based on the number of SC hits. The SC data is pretty good, and in aggregation it is a pretty powerful indicator of spamminess. URIs that hit SC and CBL around the same time are probably spammy, and manual SC reports probably don't hit too many major ham sites too often (e.g., most people would not report amazon.com or yahoo.com to SpamCop, even if they appeared in spams).
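A sliding threshold could be as simple as the following sketch; the numbers are examples only, not actual policy.

def should_list(sc_reports, seen_in_cbl_feed, normal_threshold=10, reduced_threshold=3):
    """List a URI sooner when SpamCop reports are corroborated by the CBL URI feed."""
    threshold = reduced_threshold if seen_in_cbl_feed else normal_threshold
    return sc_reports >= threshold

print(should_list(4, seen_in_cbl_feed=True))    # True: CBL corroboration lowers the bar
print(should_list(4, seen_in_cbl_feed=False))   # False: still below the normal threshold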
- Try to get a big list of domains that are probably OK (not a whitelist as such, but a greylist to avoid automatically adding domains). They are probably not as fast moving as spam domains (i.e. this list wouldn't need very frequent updating):
a) use data from large proxy servers
b) use data from inside e-mails that passed a spam filter as ham.
While there are privacy issues with both techniques, they are probably small from a practical viewpoint when using large quantities and a rather high threshold before inclusion.
Alain
b) We actually have a source of anonymous ham URIs from a medium-sized ISP that I have not had time to use yet. Your suggestion would indeed be a good application of that ham data.
a) Large web proxies could also be used as a whitening factor for domains, assuming most people don't visit spam sites, at least not as often as they visit ham sites, which is probably a pretty safe assumption, in aggregate.
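As a very rough sketch of that whitening idea, with an invented log format and cutoff:

from collections import Counter

def whitening_candidates(proxy_log_domains, min_requests=1000):
    """Domains fetched this often through a large proxy are unlikely to be spam-only sites."""
    counts = Counter(proxy_log_domains)
    return {domain for domain, n in counts.items() if n >= min_requests}

# fed with one domain per (anonymized) proxy request:
# whitening_candidates(["example.org", "example.org", ...])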
Does anyone have access to large web proxy server data that they could anonymize and share or publish? Does anyone know if data like that is perhaps already published somewhere on the Internet?
Jeff C. -- "If it appears in hams, then don't list it."