[SURBL-Discuss] RFC: How to use new data source: URIs advertised
through CBL-listed senders
jeffc at surbl.org
Wed Apr 20 09:36:23 CEST 2005
On Tuesday, April 19, 2005, 3:35:02 PM, Patrik Nilsson wrote:
> At 03:57 2005-04-19 -0700, Jeff Chan wrote:
>>It's hard for me to think of a time when it would be a good idea
>>to blacklist legitimate banks, etc. Most people don't want to
>>miss ham from their banks, etc.
> Maybe this data source would be best used as a (real dark) non-multi grey list?
> Instead of trying to make it play well in a black-and-white set-up?
I can see your point that uncertain data may argue for giving
it a lower weight in a greylist, and that idea has merit,
but I think it may be more useful to try to grab the true
blackhat domains out of the data and simply block on them,
assuming that's possible.
The fact that these are being sent through zombies, etc.
certainly says much (bad) about the senders. Unfortunately
it doesn't automatically mean that the URIs they mention are
black. But I still believe it may be possible to gather that
information if we're sufficiently clever.
For example, looking at only the most commonly appearing domains
reduces the volume: with fewer domains, there are fewer to check.
It's somewhat crude, but it makes the task manageable. At the same
time it means we're looking at the domains most likely to appear
in spam, since they appeared so often in the CBL traps.
In other words, taking the top percentile lets us operate in the
taller part of the Zipf curve and ignore some of the noise in the
tail, where a very large number of domains each get only a few hits.
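
A minimal sketch of the top-percentile idea, assuming the trap data
is available as a flat list of domain sightings (one entry per URI
seen). The function name top_domains and the fraction parameter are
hypothetical, made up for illustration, and the cutoff value is
arbitrary:

```python
from collections import Counter
import math


def top_domains(sightings, fraction=0.01):
    """Return (domain, count) pairs for the most frequently seen domains.

    `sightings` is an iterable of domain names, one per appearance in
    the trap data (assumed input format). `fraction` is the share of
    distinct domains to keep, e.g. 0.01 for the top percentile.
    """
    counts = Counter(sightings)
    if not counts:
        return []
    # Always keep at least one domain, and round up so that small
    # data sets are not truncated to nothing.
    keep = max(1, math.ceil(len(counts) * fraction))
    # most_common() sorts by descending count, i.e. the tall end of
    # the Zipf curve comes first.
    return counts.most_common(keep)
```

Everything returned would still need checking before being listed;
this only narrows down which domains to look at first.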
"If it appears in hams, then don't list it."