On Tuesday, April 19, 2005, 1:30:48 AM, Alex Broens wrote:
Jeff Chan wrote:
CBL URI data may well represent a useful new data source, but the best way to determine that may be to start using them. However I'd like your comments on some of the above FP mitigation ideas and any new ideas anyone may have for that purpose before we put these new data into production.
Therefore please speak up if you have any ideas or comments,
Jeff,
I can safely run this new zone on a couple of boxes and report FPs. What are the coordinates?
If we can Rsync, pls let us know as well.
Thanks
Alex
Hi Alex,
Thanks much for your kind offer; a separate list may be a good way to test it for now, as we've done with new lists in the past. You can find the files on our private rsync server as xs.surbl.org.bind and xs.surbl.org.rbldnsd, where xs I suppose can stand for Exploited Sender. :-) (The name is not fixed; suggestions for names are welcome.)
If you'd like to serve it publicly for testing, let me know and I'll put your name servers in a public delegation. (Same goes for anyone else. :-) (Please use the rbldnsd versions, as it's easier for me to munge the NS records correctly in those.)
OTOH, we could also put in multi on the 128th bit and not publish it as an official list yet. OTOOH that could make it de facto live if particular implementations did not look at the actual bit position values and simply looked at list inclusion. So maybe a separate list is better for now.
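A minimal sketch of that concern, assuming the combined list answers with an address of the form 127.0.0.N whose last octet is a bitmask of member lists, and assuming a value of 128 for the new list. Neither assumption is an official assignment, and the function names are made up for illustration:

```python
# Hypothetical bit value for the new list inside the combined answer.
# 128 is taken from the discussion above, not from any published assignment.
XS_BIT = 128


def last_octet(answer: str) -> int:
    """Return the final octet of a dotted-quad DNS answer as an integer."""
    return int(answer.rsplit(".", 1)[1])


def listed_on(answer: str, wanted_bits: int) -> bool:
    """Bit-aware check: true only if at least one wanted list bit is set."""
    return bool(last_octet(answer) & wanted_bits)


def listed_anywhere(answer: str) -> bool:
    """Naive check: treats any non-zero answer as a listing."""
    return last_octet(answer) != 0


# A domain that appears only on the (hypothetical) new list:
answer = "127.0.0.128"
print(listed_on(answer, 64))    # a client asking only about bit 64
print(listed_anywhere(answer))  # a client that ignores bit positions
```

A client written like listed_on() would not see the unpublished list unless it asked for that bit, while one written like listed_anywhere() would start acting on it immediately.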
A couple of notes about this version of the data. It's based on about a million CBL URI hits per day, which is only a small portion of their total hits. It's also only hits that come from senders that qualify for CBL inclusion, i.e. from zombies and open proxies. From that we're currently taking the 97th percentile of the highest-volume reports and adding the existing SURBL hits (without respect to percentile). SBL hits are not included until I can re-engineer some things.
97th percentile is quite conservative and results in only 70 new records not already in SURBLs, where the full list has about 6000 new records, but it also avoids many obvious FPs in the "noise" of infrequently appearing domains, for example afghan.com at 2 hits and aarhus.com at 3 hits. In a sense taking the most often appearing records is a good thing since they're also most likely to appear in spams and also most likely to come from zombies. In other words, there may only be 70 new records added to SURBLs at this level, but they should be 70 really big spammers. :-) It would be very interesting to know how many spams are being hit by only these 70.
Also this is only a starting point. We can tune further from here, bump up the inclusion as we improve FP procedures, etc. We can also try the 98th percentile and see how it works out. We can also threshold the counts instead of taking a percentile, so that we only get records that have more than N hits, etc.
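The two selection approaches mentioned above can be sketched roughly as follows. This assumes that "Nth percentile" means keeping the highest-volume domains until they account for N percent of all hits, which is only one possible reading and is not confirmed here. The function names and the two .example domains with their counts are invented for illustration; afghan.com and aarhus.com use the hit counts quoted earlier.

```python
def top_by_volume_share(counts, percent):
    """Keep the highest-count domains until they cover `percent` of all hits.

    This is an assumed interpretation of the percentile cut, not the
    actual processing used to build the list.
    """
    total = sum(counts.values())
    target = total * percent / 100.0
    kept = []
    running = 0
    # Highest counts first; ties broken by name so the order is stable.
    for domain, hits in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= target:
            break
        kept.append(domain)
        running += hits
    return kept


def above_threshold(counts, min_hits):
    """Keep only domains with more than `min_hits` hits."""
    return sorted(d for d, h in counts.items() if h > min_hits)


counts = {
    "bigspammer.example": 900,  # invented
    "midspammer.example": 80,   # invented
    "aarhus.com": 3,
    "afghan.com": 2,
}

print(top_by_volume_share(counts, 97))
print(above_threshold(counts, 3))
```

With these sample counts both approaches drop the two low-volume domains; on real data they can differ, since the percentile cut moves with the daily totals while a fixed threshold does not.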
Note also that the proportion of new records will vary as the race between existing SURBLs and new trap data goes back and forth. In other words there will be some varying lead and lag between the lists, though I expect the CBL data will generally tend to see the new records first, i.e. xs will usually lead the other SURBLs.
Here are some stats of total records, blacklist hits, whitelist hits and new records at some selected percentile levels:
cbl at percentile, has records, blacklist hits, whitelist hits, novel
100 percentile, 6929 records, 764 blacklist hits, 248 whitelist hits, 5917 novel
99 percentile, 2897 records, 672 blacklist hits, 137 whitelist hits, 2088 novel
98 percentile, 722 records, 523 blacklist hits, 57 whitelist hits, 142 novel
97 percentile, 446 records, 349 blacklist hits, 28 whitelist hits, 69 novel
96 percentile, 357 records, 296 blacklist hits, 16 whitelist hits, 45 novel
95 percentile, 302 records, 259 blacklist hits, 12 whitelist hits, 31 novel
94 percentile, 268 records, 229 blacklist hits, 11 whitelist hits, 28 novel
93 percentile, 246 records, 209 blacklist hits, 11 whitelist hits, 26 novel
92 percentile, 228 records, 197 blacklist hits, 11 whitelist hits, 20 novel
91 percentile, 212 records, 181 blacklist hits, 11 whitelist hits, 20 novel
90 percentile, 198 records, 168 blacklist hits, 11 whitelist hits, 19 novel
89 percentile, 186 records, 159 blacklist hits, 11 whitelist hits, 16 novel
88 percentile, 177 records, 151 blacklist hits, 10 whitelist hits, 16 novel
87 percentile, 168 records, 142 blacklist hits, 10 whitelist hits, 16 novel
86 percentile, 160 records, 135 blacklist hits, 10 whitelist hits, 15 novel
85 percentile, 152 records, 133 blacklist hits, 8 whitelist hits, 11 novel
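In the stats above, the novel column works out to total records minus blacklist hits minus whitelist hits. A small sketch that checks this on a few of the rows (the names ROWS, novel_count and mismatches are just for illustration):

```python
# (percentile, records, blacklist hits, whitelist hits, novel)
# Rows copied from the stats above.
ROWS = [
    (100, 6929, 764, 248, 5917),
    (99, 2897, 672, 137, 2088),
    (98, 722, 523, 57, 142),
    (97, 446, 349, 28, 69),
    (85, 152, 133, 8, 11),
]


def novel_count(records, blacklist_hits, whitelist_hits):
    """Records that are on neither an existing blacklist nor a whitelist."""
    return records - blacklist_hits - whitelist_hits


def mismatches(rows):
    """Return the percentiles whose novel column disagrees with the formula."""
    return [
        pct
        for pct, records, bl, wl, novel in rows
        if novel_count(records, bl, wl) != novel
    ]


print(mismatches(ROWS))
```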
At the 95th percentile we're getting about 200 hits per record. At the 96th percentile we're getting about 120 hits per record. At the 97th percentile we're getting about 60 hits per record. At the 98th percentile, that goes to about 10 hits per record. The 99th percentile gets into the 2 hit per record level, which is the overall threshold CBL is doing on their end, so it's not distinct from the 100th percentile in terms of hit counts.
Jeff C.
--
"If it appears in hams, then don't list it."