We've been in contact with the operators of a large trap which feeds lists of exploited hosts into RBLs, inquiring if they'd be able to provide URI domains from some of the spams they receive. The idea is to try to find URIs that are specifically sent through zombies and other exploited hosts, on the premise that only the worst spammers use zombies and brute force to try to get around RBLs to deliver their spam. The trap operators are able to extract some URI hosts for us, but for now can't afford much more CPU than to use a Perl script calling the Email::MIME module to grab URI domains from about 60k messages. (There's not enough spare CPU to use a program like SpamAssassin, which would likely have more success extracting URIs, but is much more resource intensive.) They may be able to process up to a hundred times as many of their messages for us (i.e. 6M a day) if this moves forward, though even that would be only a small fraction of their trap hits.
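For concreteness, here's a rough sketch of that kind of extraction -- not their actual script; the URI regex, the one-message-per-file input, and the skipped reduction to registrar-boundary domains are all just placeholders:

  #!/usr/bin/perl
  # Rough sketch only.  Parse each raw message with Email::MIME, pull host
  # names out of http/https URIs with a crude regex, and tally how often
  # each host appears.
  use strict;
  use warnings;
  use Email::MIME;

  my %count;
  for my $file (@ARGV) {                       # assumes one raw message per file
      open my $fh, '<', $file or next;
      my $raw = do { local $/; <$fh> };
      close $fh;
      my $msg = Email::MIME->new($raw);
      $msg->walk_parts(sub {
          my ($part) = @_;
          return if $part->subparts;           # only leaf parts carry bodies
          my $ct = $part->content_type || '';
          return unless $ct =~ m{^text/}i;
          my $body = $part->body;
          while ($body =~ m{https?://([a-z0-9.-]+)}gi) {   # crude URI host match
              $count{ lc $1 }++;
          }
      });
  }

  # one "count<TAB>host" line per host, most frequent first
  printf "%d\t%s\n", $count{$_}, $_
      for sort { $count{$b} <=> $count{$a} } keys %count;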
At my request they are including a count of the number of appearances of each URI domain name or IP, so that we can rank them by frequency of appearance on the theory that the bigger spammers may appear more often. Based on that test run and some tweaking of the scripts on their side and ours, we got the following table of percentiles of hits, resulting output record counts, hits against existing SURBLs, hits against the SURBL whitelist, and new records (i.e., in neither our black nor our white lists):
100th percentile, 1293 records, 732 blacklist hits, 112 whitelist hits, 449 novel
99th percentile, 844 records, 549 blacklist hits, 81 whitelist hits, 214 novel
98th percentile, 653 records, 461 blacklist hits, 67 whitelist hits, 125 novel
97th percentile, 548 records, 397 blacklist hits, 54 whitelist hits, 97 novel
96th percentile, 481 records, 352 blacklist hits, 48 whitelist hits, 81 novel
95th percentile, 433 records, 320 blacklist hits, 42 whitelist hits, 71 novel
94th percentile, 396 records, 298 blacklist hits, 40 whitelist hits, 58 novel
93rd percentile, 362 records, 287 blacklist hits, 39 whitelist hits, 36 novel
92nd percentile, 332 records, 263 blacklist hits, 38 whitelist hits, 31 novel
91st percentile, 307 records, 251 blacklist hits, 29 whitelist hits, 27 novel
90th percentile, 286 records, 231 blacklist hits, 29 whitelist hits, 26 novel
89th percentile, 267 records, 218 blacklist hits, 25 whitelist hits, 24 novel
88th percentile, 250 records, 202 blacklist hits, 25 whitelist hits, 23 novel
87th percentile, 235 records, 188 blacklist hits, 25 whitelist hits, 22 novel
86th percentile, 221 records, 177 blacklist hits, 23 whitelist hits, 21 novel
85th percentile, 209 records, 170 blacklist hits, 22 whitelist hits, 17 novel
84th percentile, 197 records, 161 blacklist hits, 20 whitelist hits, 16 novel
83rd percentile, 186 records, 155 blacklist hits, 18 whitelist hits, 13 novel
82nd percentile, 176 records, 148 blacklist hits, 16 whitelist hits, 12 novel
81st percentile, 167 records, 140 blacklist hits, 16 whitelist hits, 11 novel
80th percentile, 159 records, 135 blacklist hits, 14 whitelist hits, 10 novel
79th percentile, 152 records, 130 blacklist hits, 13 whitelist hits, 9 novel
78th percentile, 145 records, 124 blacklist hits, 13 whitelist hits, 8 novel
77th percentile, 139 records, 118 blacklist hits, 13 whitelist hits, 8 novel
76th percentile, 133 records, 112 blacklist hits, 13 whitelist hits, 8 novel
75th percentile, 127 records, 107 blacklist hits, 12 whitelist hits, 8 novel
74th percentile, 122 records, 102 blacklist hits, 12 whitelist hits, 8 novel
73rd percentile, 116 records, 98 blacklist hits, 11 whitelist hits, 7 novel
72nd percentile, 112 records, 95 blacklist hits, 11 whitelist hits, 6 novel
71st percentile, 107 records, 91 blacklist hits, 11 whitelist hits, 5 novel
70th percentile, 103 records, 88 blacklist hits, 10 whitelist hits, 5 novel
For this sample, the 96th or 97th percentile appears to be an inflection point in expectedly Zipfian-looking data (i.e. just a few URI hosts appear many times, and many URI hosts appear just a few times).
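For anyone who wants to play with the cutoffs, here's a sketch of one way to compute them, assuming the Nth percentile means the smallest set of most-frequent records accounting for N% of all hits (the actual scripts may well compute it differently). It expects the count/host records a script like the one above would produce:

  #!/usr/bin/perl
  # Sketch: given "count<TAB>host" records on stdin, find the records making
  # up the top N% of total hits (one possible reading of the percentiles).
  use strict;
  use warnings;

  my $pct = 97;                                # cutoff under discussion
  my %hits;
  while (<STDIN>) {
      chomp;
      my ($count, $host) = split /\t/;
      $hits{$host} = $count if defined $host;
  }

  my $total = 0;
  $total += $_ for values %hits;
  die "no input\n" unless $total;

  my ($cum, @keep) = (0);
  for my $host (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
      push @keep, $host;
      $cum += $hits{$host};
      last if 100 * $cum / $total >= $pct;     # reached the Nth percentile of hits
  }
  print scalar(@keep), " records at or above the ${pct}th percentile of hits\n";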
Even after whitelisting there are still a few legitimate-looking domains coming through, so one idea would be to list the records up to the 96th or 97th percentile, but for the remaining ones with fewer hits, only list those that also appeared in existing SURBLs, or resolved into sbl.spamhaus.org, or where the sending software was clearly spamware. Hopefully that would reduce FPs in these records with fewer hits, but still let us "pull some useable data out of the noise" and list some of the less frequently appearing records.
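Roughly, the post-processing policy I have in mind looks like this -- the flag names are hypothetical, and whether a record "resolved into sbl.spamhaus.org" or came from tagged spamware would be determined by separate checks:

  # Sketch of the proposed listing policy, with hypothetical flags:
  #   above_cutoff - record is at or above the 96th/97th-percentile cutoff
  #   whitelisted  - domain is on the SURBL whitelist
  #   in_surbl     - domain already appears in an existing SURBL
  #   sbl_hit      - domain resolved against sbl.spamhaus.org
  #   spamware     - sending software tagged as known spamware
  sub should_list {
      my (%r) = @_;
      return 0 if $r{whitelisted};             # whitelist always wins
      return 1 if $r{above_cutoff};            # frequent enough on its own
      return 1 if $r{in_surbl} || $r{sbl_hit} || $r{spamware};
      return 0;                                # low-frequency and uncorroborated
  }

So, for example, a low-frequency record would still be listed if it was tagged as coming from known spamware, but never if it's on the whitelist.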
Does anyone have any comments on this? IMO what makes this data somewhat unique is that it's an early look at the content that exploited hosts are sending into very large traps. The benefit is that it could help us catch up to a few hundred otherwise unlisted domains sooner, and reduce the usefulness of those domains in future zombie usage, etc. In other words it potentially improves the detection rates of SURBLs and increases the usefulness of traps feeding traditional RBLs.
Comments?
Jeff C. -- "If it appears in hams, then don't list it."
At 03:21 2005-03-24 -0800, Jeff Chan wrote:
intensive.) They may be able to process up to a hundred times as many of their messages for us (i.e. 6M a day) if this moves forward, though even that would be only a small fraction of their trap hits.
Is there anything we can do to increase this fraction? Donate CPU cycles, etc?
Even after whitelisting there are still a few legitimate-looking domains coming through, so one idea would be to list the records up to the 96th or 97th percentile, but for the remaining ones with fewer hits, only list those that also appeared in existing SURBLs,
The ones in existing SURBLs are not really that interesting, unless we are looking for a confirmation that what is listed should stay listed. The main point of working on this particular setup would be catching additional domains, not confirming already listed ones, right?
or resolved into sbl.spamhaus.org,
It might seem like a redundant check for people who are used to running SA 3 with uridnsbl, but for people using other SURBL implementations that don't implement anything like uridnsbl's "check DNS servers for the domain against SBL", this might be very useful for catching additional spam domains.
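Roughly, that check amounts to the following (a sketch with Net::DNS, not SpamAssassin's actual uridnsbl code): look up the domain's name servers, resolve each to an address, and see whether the reversed address is listed in sbl.spamhaus.org.

  use strict;
  use warnings;
  use Net::DNS;

  # Returns true if any of the domain's name server IPs is SBL-listed.
  sub ns_in_sbl {
      my ($domain) = @_;
      my $res = Net::DNS::Resolver->new;

      my $ns_reply = $res->query($domain, 'NS') or return 0;
      for my $ns (grep { $_->type eq 'NS' } $ns_reply->answer) {
          my $a_reply = $res->query($ns->nsdname, 'A') or next;
          for my $a (grep { $_->type eq 'A' } $a_reply->answer) {
              my $rev = join '.', reverse split /\./, $a->address;
              return 1 if $res->query("$rev.sbl.spamhaus.org", 'A');
          }
      }
      return 0;
  }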
or where the sending software was clearly spamware. Hopefully that would reduce FPs in these records with fewer hits, but still let us "pull some useable data out of the noise" and list some of the less frequently appearing records.
I think the important thing about putting effort into something like this would be to catch more of the zero-hour domains that currently slip past SURBL for a couple of hours, rather than just confirming current listings. Agreed?
Patrik
On Thursday, March 24, 2005, 11:30:23 AM, Patrik Nilsson wrote:
At 03:21 2005-03-24 -0800, Jeff Chan wrote:
intensive.) They may be able to process up to a hundred times as many of their messages for us (i.e. 6M a day) if this moves forward, though even that would be only a small fraction of their trap hits.
Is there anything we can do to increase this fraction? Donate CPU cycles, etc?
Thanks for your kind offer, but in this case I expect the answer is no. Chances are good that they already have access to many hundreds of servers for processing their existing trap data.
Even after whitelisting there are still a few legitimate-looking domains coming through, so one idea would be to list the records up to the 96th or 97th percentile, but for the remaining ones with fewer hits, only list those that also appeared in existing SURBLs,
The ones in existing SURBLs are not really that interesting, unless we are looking for a confirmation that what is listed should stay listed.
I think confirmation that a given domain, etc. is being spamvertised through zombies is quite useful.
The main point of working on this particular setup would be catching additional domains, not confirming already listed ones, right?
Yes, both.
or resolved into sbl.spamhaus.org,
It might seem like a redundant check for people who are used to running SA 3 with uridnsbl, but for people using other SURBL implementations that don't implement anything like uridnsbl's "check DNS servers for the domain against SBL", this might be very useful for catching additional spam domains.
1. Not everyone runs SA 3.
2. Not every SA 3 user can spare the time delays of using SBL in uridnsbl.
3. SBL URI name server checks result in significantly more false positives than SURBL URI checks.
4. SBL hits correlated with zombie usage probably have far fewer false positives.
or where the sending software was clearly spamware. Hopefully that would reduce FPs in these records with fewer hits, but still let us "pull some useable data out of the noise" and list some of the less frequently appearing records.
I think the important thing about putting effort into something like this would be to catch more of the zero-hour domains that currently slip past SURBL for a couple of hours, rather than just confirming current listings. Agreed?
Patrik
It would:
1. Confirm some existing SURBL listings.
2. Find new SBL gang domains.
3. Generally find fresh domains being sent through zombies that aren't in SBL or SURBLs.
Zombies are the biggest reason for SURBLs IMO, and this new data source "cuts out the middleman" and gets directly at zombie usage. :D
The main question to me is how to cut down on FPs, which is why I wanted some comments on the post-processing of the data. It turns out the source can readily identify and tag records sent using specific spamware, and those would get "special treatment" making them much more likely to be listed.
All would be subject to whitelisting as a final safety valve, but I'd like to hear more ideas about how to filter the URIs we hear from zombies.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."