[SURBL-Discuss] RFC: How to use new data source: URIs advertised through CBL-listed senders

Jeff Chan jeffc at surbl.org
Tue Apr 19 12:37:48 CEST 2005


On Tuesday, April 19, 2005, 1:30:48 AM, Alex Broens wrote:
> Jeff Chan wrote:
>> CBL URI data may well represent a useful new data source, but the
>> best way to determine that may be to start using them.  However
>> I'd like your comments on some of the above FP mitigation ideas
>> and any new ideas anyone may have for that purpose before we put
>> these new data into production.
>> 
>> Therefore please speak up if you have any ideas or comments,

> Jeff,

> I can safely run this new zone on a couple of boxes and report FPs.
> What are the coordinates?

> If we can Rsync, pls let us know as well.

> Thanks

> Alex

Hi Alex,
Thanks much for your kind offer; a separate list may be a good
way to test it for now, as we've done with new lists in the past.
You can find the files on our private rsync server as
xs.surbl.org.bind and xs.surbl.org.rbldnsd, where xs I suppose
can stand for Exploited Sender.  :-)  (The name is not fixed;
suggestions for names are welcome.)  If you'd like to serve it
publicly for testing, let me know and I'll put your name
servers in a public delegation.  (Same goes for anyone else. :-)
(Please use the rbldnsd versions, as it's easier for me to
munge the NS records correctly in those.)

OTOH, we could also put it in multi on the 128 bit (bitmask
value 128) and not publish it as an official list yet.  OTOOH that could make
it de facto live if particular implementations did not look at
the actual bit position values and simply looked at list
inclusion.  So maybe a separate list is better for now.
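
To make the concern concrete, here is a minimal Python sketch of the
two kinds of client behaviour.  The bit values for the existing lists
and the 128 value for xs are assumptions for illustration only, and
the function names are hypothetical, not taken from any real client.

```python
# Assumed bit assignments for the last octet of a multi answer
# (127.0.0.N).  These values, and XS_BIT = 128 in particular, are
# illustrative assumptions, not an official assignment.
LIST_BITS = {"sc": 2, "ws": 4, "ph": 8, "ob": 16, "ab": 32, "jp": 64}
XS_BIT = 128  # proposed, unpublished test bit


def lists_for(answer):
    """Return the published lists whose bits are set in the answer."""
    octet = int(answer.split(".")[-1])
    return sorted(name for name, bit in LIST_BITS.items() if octet & bit)


def naive_is_listed(answer):
    """Treat any answer at all as a hit (list inclusion only)."""
    return answer is not None


def careful_is_listed(answer):
    """Count a hit only when a published list's bit is set."""
    return answer is not None and bool(lists_for(answer))


xs_only = "127.0.0.128"  # a record carried only on the test bit
print(naive_is_listed(xs_only), careful_is_listed(xs_only))
```

A client of the first kind would start acting on xs records the
moment they appeared in multi, which is the de facto live problem.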

A couple notes about this version of the data.  It's based on
about a million CBL URI hits per day, which is only a small
portion of their total hits.  It's also only hits that come from
senders that qualify for CBL inclusion, i.e. from zombies and
open proxies.  From that we're currently taking the 97th
percentile of the highest-volume reports and adding the
existing SURBL hits (without respect to percentile).  SBL hits
are not included until I can re-engineer some things.

97th percentile is quite conservative and results in only about 70
new records not already in SURBLs, where the full list has about
6000 new records, but it also avoids many obvious FPs in the
"noise" of infrequently appearing domains, for example afghan.com
at 2 hits and aarhus.com at 3 hits.  In a sense taking the most
often appearing records is a good thing since they're also most
likely to appear in spams and also most likely to come from
zombies.  In other words, there may only be 70 new records added
to SURBLs at this level, but they should be 70 really big
spammers.  :-)   It would be very interesting to know how many
spams are being hit by only these 70.

Also this is only a starting point.  We can tune further from
here, bump up the inclusion as we improve FP procedures, etc.
We can also try the 98th percentile and see how it works out.
We can also threshold the counts instead of taking a percentile,
so that we only get records that have more than N hits, etc.
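
As a rough sketch of the two selection styles, here is some Python.
It assumes one plausible reading of "Nth percentile", namely the
highest-volume domains that together account for N percent of all
hits; the actual computation may differ.  The function names, counts
and example.* domains are made up for illustration.

```python
def top_by_percentile(hits, pct):
    """Keep the highest-volume domains until they cover pct% of all hits.

    This is an assumed interpretation of the percentile cut, not
    necessarily the one used to build the real list.
    """
    total = sum(hits.values())
    kept, running = [], 0
    # Highest counts first; ties broken by name for a stable order.
    for domain, count in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0])):
        if running * 100 >= pct * total:
            break
        kept.append(domain)
        running += count
    return kept


def above_threshold(hits, n):
    """Keep only domains with more than n hits."""
    return sorted(d for d, c in hits.items() if c > n)


# Made-up counts; the domain names are placeholders.
hits = {
    "spam1.example": 600,
    "spam2.example": 300,
    "spam3.example": 90,
    "rare1.example": 7,
    "rare2.example": 3,
}
print(top_by_percentile(hits, 97))
print(above_threshold(hits, 5))
```

The percentile cut adapts to the day's total volume, while a fixed
count threshold is easier to explain but has to be retuned if the
feed grows or shrinks.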

Note also that the proportion of new records will vary as the race
between existing SURBLs and new trap data goes back and forth.
In other words there will be some varying lead and lag between
the lists, though I expect the CBL data will generally tend to
see the new records first, i.e. xs will usually lead the other
SURBLs.


Here are some stats of total records, blacklist hits, whitelist
hits and new records at some selected percentile levels:

CBL percentile  records  blacklist hits  whitelist hits  novel
           100     6929             764             248   5917
            99     2897             672             137   2088
            98      722             523              57    142
            97      446             349              28     69
            96      357             296              16     45
            95      302             259              12     31
            94      268             229              11     28
            93      246             209              11     26
            92      228             197              11     20
            91      212             181              11     20
            90      198             168              11     19
            89      186             159              11     16
            88      177             151              10     16
            87      168             142              10     16
            86      160             135              10     15
            85      152             133               8     11
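
For anyone who wants to play with these numbers, here is a small
Python sketch that checks that blacklist hits, whitelist hits and
novel records add up to the record total in each row, and computes
the whitelist and novel shares.  The figures are copied from the
stats above; the summarize helper is just an illustrative name.

```python
# (percentile, records, blacklist hits, whitelist hits, novel)
ROWS = [
    (100, 6929, 764, 248, 5917), (99, 2897, 672, 137, 2088),
    (98, 722, 523, 57, 142), (97, 446, 349, 28, 69),
    (96, 357, 296, 16, 45), (95, 302, 259, 12, 31),
    (94, 268, 229, 11, 28), (93, 246, 209, 11, 26),
    (92, 228, 197, 11, 20), (91, 212, 181, 11, 20),
    (90, 198, 168, 11, 19), (89, 186, 159, 11, 16),
    (88, 177, 151, 10, 16), (87, 168, 142, 10, 16),
    (86, 160, 135, 10, 15), (85, 152, 133, 8, 11),
]


def summarize(row):
    """Return (percentile, sums_match, whitelist %, novel %) for a row."""
    pct, records, black, white, novel = row
    sums_match = records == black + white + novel
    white_pct = round(100 * white / records, 1)
    novel_pct = round(100 * novel / records, 1)
    return pct, sums_match, white_pct, novel_pct


for row in ROWS:
    pct, ok, white_pct, novel_pct = summarize(row)
    print(f"{pct:3d}  sums_match={ok}  whitelist={white_pct}%  novel={novel_pct}%")
```

The whitelist share is only a crude proxy for FP risk, since a novel
record can be an FP without being on any whitelist yet.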


At the 95th percentile we're getting about 200 hits per record.
At the 96th percentile we're getting about 120 hits per record.
At the 97th percentile we're getting about 60 hits per record.
At the 98th percentile, that goes to about 10 hits per record.
The 99th percentile gets down to about 2 hits per record, which
is the overall threshold CBL applies on their end, so it's not
distinct from the 100th percentile in terms of hit counts.

Jeff C.
--
"If it appears in hams, then don't list it."


