Re: [SURBL-Discuss] RE: FP Reduction Progress?

20 Oct 2004


On Wednesday, October 20, 2004, 9:22:27 AM, Rob McEwen wrote:
> ...
> As we all know, there has been much work done in the past several weeks to
> remove those URIs in the SURBL lists which could cause FPs.
> ...
> Is there any "measuring stick" to evaluate our progress? I recall that about
> a month or two ago someone posted a test to the list which showed an FP
> percentage which really seemed to disturb you (Jeff). Do you recall that
> message? (I can't seem to find it). It would be great if someone could find
> that message and then see what results that same testing would show today.
> ...
> Certainly, we are probably far from done... but it seems like this effort
> these past several weeks would have helped much by now?

We've definitely reduced the FPs, but we still need to keep
checking the data to improve it further.  For example, some of
the DMOZ hits we found still need to be checked.  Some of those
are false positives and others would be false negatives if removed.
Manual checking is tedious, but it is the best and perhaps the only
way to make that determination.

The list with the highest FP rate was WS at about 0.4%.  Others
are at least an order of magnitude (ten times) lower, which is
a very significant difference.  Everyone is welcome and encouraged
to help check the WS hits against DMOZ and thereby improve the
usefulness and performance of SURBLs.  Note that a few have
already been whitelisted (which tools like GetURI will show):

http://spamcheck.freeapp.net/whitelists/check-ws-dmoz.txt

A good method is to check them against some of the proposed
inclusion policies at:

http://www.surbl.org/policy.html

Ryan's GetURI and Dallas' SURBL + Checker are both useful
tools that automate some of those checks:

http://ry.ca/cgi-bin/geturi.cgi
http://www.rulesemporium.com/cgi-bin/uribl.cgi

Basically we're looking for domains that have legitimate
(non-spam) uses or could reasonably be mentioned in hams.
Those that have legitimate uses should not be listed.
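
For anyone who wants to help with the WS-against-DMOZ checking, the cross-check itself can be sketched as a set intersection. This is only an illustration, not how the SURBL data is actually processed: the function names and the sample domains below are hypothetical, and the real lists would be read from files. The output is a list of candidates for manual review, not a list of confirmed false positives.

```python
def normalize(domain: str) -> str:
    """Lowercase a domain and strip surrounding whitespace and a trailing dot."""
    return domain.strip().lower().rstrip(".")


def review_candidates(blocklist, directory, whitelist=()):
    """Return listed domains that also appear in a directory of
    (mostly) legitimate sites and have not been whitelisted yet.

    Each hit still needs a manual check: some are false positives,
    others are spammer domains that happen to be in the directory.
    """
    listed = {normalize(d) for d in blocklist if d.strip()}
    legit = {normalize(d) for d in directory if d.strip()}
    cleared = {normalize(d) for d in whitelist if d.strip()}
    return sorted((listed & legit) - cleared)


# Hypothetical sample data standing in for a blocklist, a DMOZ
# domain dump, and an existing whitelist.
ws = ["spam-pills.example", "Community-Site.example", "already-cleared.example"]
dmoz = ["community-site.example", "already-cleared.example", "news.example"]
whitelisted = ["already-cleared.example"]

candidates = review_candidates(ws, dmoz, whitelisted)
print(candidates)
```

Each domain printed would then be looked up by hand, or with a tool like GetURI, before deciding whether to whitelist it.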

As far as measurements go, probably the most meaningful results
come from a really comprehensive ham corpus, such as some of
the ones used to test SpamAssassin.  Another way is by looking
at the rate of false positive reports, but those are probably
too infrequent to be a useful measure.  Another measurement
is the number of hits against what should be mostly legitimate
domains in things like DMOZ, Wikipedia, etc.  Those came out
in apparently similar proportions to the SA ham corpus results.
For example, the various list hit counts against DMOZ as of
October 6 were:
       4 dmoz-blocklist.ab
      61 dmoz-blocklist.jp
     165 dmoz-blocklist.ob
       2 dmoz-blocklist.ph
       8 dmoz-blocklist.sc
    1141 dmoz-blocklist.ws
    1381 total

As of today it looks like this:
       4 dmoz-blocklist.ab
      44 dmoz-blocklist.jp
      26 dmoz-blocklist.ob
       4 dmoz-blocklist.ph
       3 dmoz-blocklist.sc
     943 dmoz-blocklist.ws
    1024 total

That's only one measure against the 2.3 million unique domains
and IPs in DMOZ, but it's still a possible hint at the FP
rates in the different lists.
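
To make the comparison between the two snapshots easier to read, here is a small sketch that computes the change per list and each list's share of the 2.3 million DMOZ entries. The counts are copied from the two tables above; the variable and function names are just for illustration. Note that the DMOZ share is a rough hint only and is not the same thing as a false positive rate measured on a ham corpus.

```python
DMOZ_ENTRIES = 2_300_000  # "2.3 million unique domains and IPs" from the message

# DMOZ hit counts per list, copied from the two tables above.
oct6 = {"ab": 4, "jp": 61, "ob": 165, "ph": 2, "sc": 8, "ws": 1141}
oct20 = {"ab": 4, "jp": 44, "ob": 26, "ph": 4, "sc": 3, "ws": 943}


def summarize(before, after, universe):
    """Return (list, before, after, percent change, percent of universe) rows."""
    rows = []
    for name in sorted(before):
        b, a = before[name], after[name]
        change = (a - b) / b * 100   # relative change between snapshots
        share = a / universe * 100   # current hits as a share of DMOZ
        rows.append((name, b, a, change, share))
    return rows


for name, b, a, change, share in summarize(oct6, oct20, DMOZ_ENTRIES):
    print(f"{name}: {b:>5} -> {a:>5}  ({change:+6.1f}%)  {share:.4f}% of DMOZ")

print("total:", sum(oct6.values()), "->", sum(oct20.values()))
```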

To me the biggest surprise is the drop in OB hits.  Outblaze
folks, thanks much for working on that!

Jeff C.
--
"If it appears in hams, then don't list it."
