On Tuesday, April 19, 2005, 1:34:25 PM, Alain Alain wrote:
- Use the base data used for sc. Before inclusion you want a number of reports to spamcop (I don't recall it, but let's say 20) before adding it to sc. A domain that appears on both the CBL datafeed and the sc datafeed at the "same" time is far more likely spam. You could either use the new datafeed to selectively lower the threshold for sc (not really my first choice) or use the occurrences inside the sc datafeed to lower the threshold for the new list. Only a few occurrences (more than one) on the sc datafeed would be enough in that case.
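The second option above could look roughly like the sketch below. The function names and every number in it (base threshold, lowered threshold, minimum sc reports) are hypothetical placeholders, not values taken from either datafeed.

```python
def new_list_threshold(sc_reports, base_threshold=20,
                       corroborated_threshold=5, min_sc_reports=2):
    """Return how many CBL hits a domain needs before it is listed.

    All numbers are illustrative assumptions. "More than one" sc
    occurrence is modelled as min_sc_reports=2.
    """
    if sc_reports >= min_sc_reports:
        # The sc datafeed corroborates the CBL data: accept fewer hits.
        return corroborated_threshold
    return base_threshold


def should_list(cbl_hits, sc_reports):
    """True if the domain has enough CBL hits given its sc reports."""
    return cbl_hits >= new_list_threshold(sc_reports)
```

With these placeholder values, a domain with 6 CBL hits and 2 sc reports would be listed, while the same domain with only 1 sc report would still have to reach the base threshold.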
After thinking a while longer, it's maybe not such a bad idea to use the new data to improve the SC list. By needing fewer separate reports, the time gap until inclusion will be much shorter. Instead of 10 (just checked) it's maybe enough to use 3 or 4, which gives a gain of at least 6 minutes, but probably much more. Moreover, it's probably possible to check the "right" threshold and the average time gain. Check the percentage of domains that get inside the CBL datafeed and get fewer reports than the threshold. For example (no real data):
1 report only and CBL'ed        : 10%
2 reports only and CBL'ed       : 5%
3 reports only and CBL'ed       : 3%
...
9 reports only and CBL'ed       : 0.01%
10 or more reports and CBL'ed   : 75%
(And compare against those that are not CBL'ed)
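A minimal sketch of how such a breakdown could be computed, assuming the input is a list of (report count, CBL'ed or not) pairs per domain. The function name, the input shape and the cap of 10 are assumptions made for illustration only.

```python
from collections import Counter


def report_distribution(domains, cap=10):
    """Percentage of domains per report count, split by CBL status.

    domains: iterable of (report_count, cbled) pairs (assumed format).
    Counts of `cap` or more are merged into one bucket keyed by `cap`.
    Returns {True: {bucket: percent}, False: {bucket: percent}}.
    """
    counts = {True: Counter(), False: Counter()}
    for reports, cbled in domains:
        counts[bool(cbled)][min(reports, cap)] += 1

    result = {}
    for flag, counter in counts.items():
        total = sum(counter.values())
        result[flag] = (
            {bucket: 100.0 * n / total for bucket, n in sorted(counter.items())}
            if total else {}
        )
    return result
```

Comparing `result[True]` against `result[False]` bucket by bucket is one way to see how far the report threshold could be lowered for CBL'ed domains before non-CBL'ed ones start slipping in.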
Another thing I can think of, not linked with CBL: the speed at which reports come in is also important. 5 reports in 15 minutes is probably much more spammy than 5 reports in 1 day.
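One way to measure that speed is to take the largest number of reports that fall inside any sliding time window. This is only a sketch: the function name and the 15-minute default are assumptions, and the report timestamps are assumed to be available as `datetime` objects.

```python
from datetime import datetime, timedelta


def max_reports_in_window(timestamps, window=timedelta(minutes=15)):
    """Largest number of reports seen within any span of length `window`."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop reports from the left until the span fits in the window.
        while t - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

Five reports spaced three minutes apart give a value of 5, while five reports spread over a day give 1, so the same report count can be weighted differently depending on this value.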
Alain
CBL hits would be a good indication of spamminess, but only if we could eliminate the FPs. If amazon.com appears a lot on CBL and someone reports amazon.com on SpamCop, even accidentally, that could get it listed (were it not for our whitelists). This would be more of a problem for whitehats that are less well known than amazon, etc.
Rate of reports or hits in CBL or SC or any other source can be a good indicator of spam, except that legitimate mailers sometimes send to large mailing lists suddenly, and this causes a spike that can look like spamsign. This trips up the OB data sometimes. However, the CBL traps are so large that it takes a very large spike to register. Therefore it's probably a better indication of a spam attack than what Outblaze may be seeing. Also, the fact that our version of the CBL trap data is correlated with zombie and open proxy activity probably helps *a lot*. Legitimate mailers, even those sending to a large list of their own customers, probably don't use zombies. Large, sudden volumes of zombie hits may be indicative of a major spammer using a lot of their bots suddenly. Not all spammers send large blasts like that, but enough may that this could indeed be useful to note.
Regarding applying special measurements to get the lower-hit CBL records onto the XS list sooner, yes, that's precisely the goal. We can automatically find the most common hits through percentiles or thresholds. It's the less common hits that we want to try to list sooner and "dig out of the noise."
Regarding the SC data, I'm also planning to do a self-correlation on the SC data into IP addresses, probably /24s, to bias inclusion of SC data more aggressively. I.e. if a new site resolves into a /24 that previously had a lot of spam reports, then that new domain would get added to SC much sooner.
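A rough sketch of that /24 correlation, using Python's standard `ipaddress` module. The function names and all thresholds are hypothetical, IPv4 is assumed, and resolving a domain to its IP address is left out.

```python
import ipaddress
from collections import Counter


def slash24(ip):
    """Return the /24 network containing an IPv4 address string."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def build_history(reported_ips):
    """Count past spam reports per /24 from a list of reported IPs."""
    return Counter(slash24(ip) for ip in reported_ips)


def threshold_for(ip, history, base=10, lowered=3, bad_block_reports=50):
    """Report threshold for a new domain resolving to `ip`.

    All three numbers are placeholders: a /24 with at least
    `bad_block_reports` past reports gets the lowered threshold.
    """
    if history[slash24(ip)] >= bad_block_reports:
        return lowered
    return base
```

A new domain resolving into a /24 with a long report history would then need only `lowered` reports instead of `base` before being added.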
Jeff C.
--
"If it appears in hams, then don't list it."