On Tuesday, April 19, 2005, 3:35:02 PM, Patrik Nilsson wrote:
At 03:57 2005-04-19 -0700, Jeff Chan wrote:
It's hard for me to think of a time when it would be a good idea to blacklist legitimate banks, etc. Most people don't want to miss ham from their banks, etc.
Maybe this data source would be best used as a (real dark) non-multi grey list? Instead of trying to make it play well in a black-and-white set-up?
Patrik
I can see your point that uncertain data argues for giving these domains a lower weight in a greylist, and that idea has merit. But I think it may be more useful to try to pull the true blackhat domains out of the data and simply block on them, assuming that's possible.
The fact that these are being sent through zombies, etc. certainly says much (bad) about the senders. Unfortunately it doesn't automatically mean that the URIs they mention are black. But I still believe it may be possible to gather that information if we're sufficiently clever.
For example, looking at only the most commonly appearing domains cuts the number we need to check. It's somewhat crude, but it does simplify the task. At the same time it means we're looking at the domains most likely to appear in spam, since they appeared so often in the CBL traps.
In other words, taking the top percentile lets us operate in the tall head of the Zipf curve and ignore some of the noise in the long, low tail: a very large number of domains that each get only a few hits.
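To make the idea concrete, here is a rough sketch in Python of that selection step. Everything in it is an assumption for illustration: the function name, the 10% default cutoff, the optional ham whitelist, and the sample counts are made up and are not how the CBL data is actually processed.

```python
import math
from collections import Counter


def top_domains(hit_counts, fraction=0.1, ham_domains=frozenset()):
    """Return the most frequently seen domains, highest count first.

    hit_counts  -- mapping of domain -> number of trap hits (hypothetical input)
    fraction    -- share of the distinct domains to keep (illustrative cutoff)
    ham_domains -- domains known to appear in ham; these are never returned
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Drop anything seen in ham before ranking, so it can never be listed.
    candidates = {d: n for d, n in hit_counts.items() if d not in ham_domains}
    if not candidates:
        return []

    # Rank by hit count; this is the head of the Zipf-like curve.
    ranked = Counter(candidates).most_common()

    # Keep the top slice and ignore the long tail of rarely seen domains.
    keep = max(1, math.ceil(len(ranked) * fraction))
    return [domain for domain, _ in ranked[:keep]]


# Made-up counts, purely for illustration.
sample_hits = {
    "pills.example": 9200,
    "bank.example": 4100,
    "mortgage.example": 3800,
    "rare1.example": 3,
    "rare2.example": 2,
    "rare3.example": 1,
}

print(top_domains(sample_hits, fraction=0.5, ham_domains={"bank.example"}))
```

The output is only a shortlist of candidates to check by hand or by other tests. Appearing near the top says a domain is common in the trap data, not that it is black.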
Jeff C. -- "If it appears in hams, then don't list it."