[SURBL-Discuss] RFC: How to use new data source: URIs advertised through CBL-listed senders

Wed Apr 20 09:45:42 CEST 2005

On Tuesday, April 19, 2005, 12:31:30 PM, Alain Alain wrote:

> 3 ideas:

> 1) Use the base data used for sc.  Before inclusion you want a number
> of reports to SpamCop (I don't recall the figure, but let's say 20)
> before adding it to sc.  A domain that appears on both the CBL
> datafeed and the sc datafeed at the "same" time is far more likely
> to be spam.  You could either use the new datafeed to selectively
> lower the threshold for sc (not really my first choice) or use the
> occurrences inside the sc datafeed to lower the threshold for the new
> list.  Only a few occurrences (more than one) on the sc datafeed
> would be enough in that case.

Yes, or use some kind of sliding threshold for XS inclusion based
on the number of SC hits.  The SC data is pretty good and, in
aggregate, is a pretty powerful indicator of spamminess.  URIs
that hit SC and CBL around the same time are probably spammy, and
manual SC reports probably don't catch major ham sites very
often (e.g., most people would not report amazon.com or yahoo.com
to SpamCop, even if they appeared in spams).
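
A rough sketch of what such a sliding threshold might look like, in
Python.  Everything here is an assumption for illustration: the
function names, the one-day window used for "around the same time",
and the numbers (base of 20, floor of 2, step of 6) are made up and
are not how any SURBL list is actually built.

```python
def sc_hits_near(cbl_time, sc_times, window=86400):
    """Count SC reports within `window` seconds of the CBL sighting.

    The one-day window is a hypothetical reading of "around the same time".
    """
    return sum(1 for t in sc_times if abs(t - cbl_time) <= window)


def xs_threshold(sc_hits, base=20, floor=2, step=6):
    """CBL-sourced sightings required before XS inclusion.

    Each nearby SC hit lowers the requirement by `step`, but never
    below `floor`.  All three numbers are illustrative guesses.
    """
    return max(floor, base - step * sc_hits)


def should_list(cbl_hits, sc_hits):
    """True if the domain has enough CBL sightings given its SC hits."""
    return cbl_hits >= xs_threshold(sc_hits)
```

With these made-up numbers, a domain seen 5 times via CBL-listed
senders would not be listed on its own, but would be listed once it
also had 3 SC hits in the window.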

> 2) Try to get a big list of domains that are probably ok (not a
> whitelist as such, but a greylist to avoid automatically adding
> domains).  They are probably not as fast moving as spam domains (i.e.
> this list wouldn't need very frequent updating).

> a) use data from large proxy servers

> b) use data from inside e-mails that passed a spamfilter as ham.

> While there are privacy issues with both techniques, they are probably
> small from a practical viewpoint when using large quantities and a
> rather high threshold before inclusion.

> Alain

b) We actually have a source of anonymous ham URIs from a medium-
sized ISP that I have not had time to use yet.  Your suggestion
would indeed be a good application of that ham data.

a) Large web proxies could also be used as a whitening factor for
domains, assuming most people don't visit spam sites, at least
not as often as they visit ham sites, which is probably a pretty
safe assumption, in aggregate.
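
A minimal sketch of how aggregate ham or proxy sightings could feed
such a greylist, again in Python.  It assumes the input is already
anonymized down to a plain stream of domain names, and the names
build_greylist and may_auto_list and the min_count of 1000 are
hypothetical.  A greylisted domain is only kept out of automatic
listing here; it is not whitelisted outright.

```python
from collections import Counter


def build_greylist(domain_sightings, min_count=1000):
    """Return the domains seen at least `min_count` times.

    `domain_sightings` is any iterable of domain strings, e.g. drawn
    from ham mail or web proxy logs.  A high `min_count` keeps rarely
    seen (and so more privacy-sensitive) domains out of the list.
    """
    counts = Counter(domain_sightings)
    return {domain for domain, n in counts.items() if n >= min_count}


def may_auto_list(domain, greylist):
    """Greylisted domains are held back from automatic listing."""
    return domain not in greylist
```

For example, with min_count=3 a feed containing amazon.com three
times and some-other.example once would greylist only amazon.com.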

Does anyone have access to large web proxy server data that they
could anonymize and share or publish?  Does anyone know if data
like that is perhaps already published somewhere on the Internet?

Jeff C.
--
"If it appears in hams, then don't list it."