On 4/19/05, Jeff Chan jeffc-at-surbl.org |surbl list| <...> wrote:
We've been working for a few weeks with the folks at CBL to extract URIs appearing on their extensive spam traps that also trigger inclusion in CBL, i.e. zombies, open proxies, etc. What this means is that we can get URIs of spams that are sent using zombies and open proxies, where that mode of sending is a very good indication of spamminess since legitimate senders probably don't use hijacked hosts or open proxies to send their mail.
Great.
<snip>
Like most URI data sources, the main problem with the CBL URI data is false positives or appearance of otherwise legitimate domains. For example amazon.com is one that appears frequently. This does not mean that amazon.com is using zombies to send mail, or that the CBL traps have been deliberately poisoned, but that spammers occasionally mention legitimate domains like amazon.com in their spams. FPs aside, the CBL URI data does indeed appear to include other domains operating for the benefit of spammers or their customers. These are the new domains we would like to catch. Our challenge therefore is to find ways to use those while excluding the FPs. Some solutions that have been proposed so far are:
<snip>
Therefore please speak up if you have any ideas or comments,
Three ideas:
1) Use the base data behind SC. Before a domain is added to SC, a certain number of reports to SpamCop is required (I don't recall the exact figure, but let's say 20). A domain that appears in both the CBL datafeed and the SC datafeed at roughly the same time is far more likely to be spam. You could either use the new datafeed to selectively lower the threshold for SC (not really my first choice), or use the occurrences in the SC datafeed to lower the threshold for the new list. Only a few occurrences (more than one) in the SC datafeed would be enough in that case.
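The cross-feed threshold lowering in idea 1 could be sketched roughly like this. The thresholds, feed representations, and function names here are all assumptions for illustration, not the actual SURBL or SpamCop formats:

```python
# Sketch of idea 1: a domain seen in the CBL URI feed needs only a
# couple of SpamCop (SC) reports to be listed, instead of the full
# report threshold it would need on its own.
# All numbers and data structures below are hypothetical.

NORMAL_THRESHOLD = 20   # assumed SC report count needed without CBL agreement
LOWERED_THRESHOLD = 2   # "more than one" SC occurrence when CBL agrees

def should_list(domain, sc_reports, cbl_domains):
    """Decide whether a domain qualifies for the new list.

    sc_reports: dict mapping domain -> number of SC reports
    cbl_domains: set of domains seen in the CBL URI datafeed
    """
    reports = sc_reports.get(domain, 0)
    if domain in cbl_domains:
        # Agreement with CBL (mail sent via zombies/open proxies)
        # lets a much lower SC count suffice.
        return reports >= LOWERED_THRESHOLD
    return reports >= NORMAL_THRESHOLD

# Toy data: a spamvertised domain with a few SC reports, and a
# legitimate domain that spammers occasionally mention.
sc_reports = {"spamvertised.example": 3, "amazon.com": 1}
cbl_domains = {"spamvertised.example"}

print(should_list("spamvertised.example", sc_reports, cbl_domains))  # True
print(should_list("amazon.com", sc_reports, cbl_domains))            # False
```

The point is that neither feed alone is trusted: CBL presence only relaxes the SC threshold, it never lists a domain by itself.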
2) Try to get a big list of domains that are probably OK (not a whitelist as such, but a greylist used to avoid automatically adding domains). These domains probably don't change as fast as spam domains do (i.e. the list wouldn't need very frequent updating).
a) use data from large proxy servers
b) use URIs taken from e-mails that passed a spam filter as ham.
While there are privacy issues with both techniques, they are probably small from a practical viewpoint when using large quantities of data and a rather high threshold before inclusion.
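Idea 2 amounts to counting how often each domain shows up in legitimate traffic and greylisting the frequent ones. A minimal sketch, with a made-up threshold and toy counts (the real sources would be proxy logs or filtered ham, as above):

```python
# Sketch of idea 2: build a greylist of probably-legitimate domains
# from frequency counts in ham traffic. Domains on the greylist are
# never auto-added to the new list; they would need manual review.
# The threshold and sample counts are hypothetical.
from collections import Counter

GREYLIST_THRESHOLD = 1000  # assumed minimum ham appearances for greylisting

def build_greylist(ham_domain_counts, threshold=GREYLIST_THRESHOLD):
    """Return the set of domains seen at least `threshold` times in ham."""
    return {d for d, n in ham_domain_counts.items() if n >= threshold}

# Toy counts, e.g. aggregated from proxy logs or ham-classified mail.
counts = Counter({"amazon.com": 50000, "spamvertised.example": 3})
greylist = build_greylist(counts)

print("amazon.com" in greylist)             # True
print("spamvertised.example" in greylist)   # False
```

Because only aggregate counts above a high threshold are kept, individual users' browsing or mail contents never need to be stored, which is what keeps the privacy exposure small in practice.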
Alain