[SURBL-Discuss] RE: FP Reduction Progress?

Jeff Chan jeffc at surbl.org
Wed Oct 20 19:45:19 CEST 2004


On Wednesday, October 20, 2004, 9:22:27 AM, Rob McEwen wrote:
> As we all know, there has been much work done in the past several weeks to
> remove those URIs in the SURBL lists which could cause FPs.

> Is there any "measuring stick" to evaluate our progress? I recall about a
> month or two ago someone posting to the list a test which showed an FP
> percentage which really seemed to disturb you (Jeff). Do you recall that
> message? (I can't seem to find it). It would be great if someone could find
> that message and then see what results that same testing would show today.

> Certainly, we are probably far from done... but it seems like this effort
> these past several weeks would have helped much by now?

We've definitely reduced the FPs, but we still need to keep
checking the data to improve it further.  For example, some of
the DMOZ hits we found still need to be checked.  Some of those
are false positives and others would be false negatives if removed.
Manual checking is tedious, but it is the best and perhaps only
way to make that determination.

The list with the highest FP rate was WS at about 0.4%.  Others
are at least an order of magnitude (ten times) lower, which is
a very significant difference.  Everyone is welcome and encouraged
to help check the WS hits against DMOZ and thereby improve the
usefulness and performance of SURBLs.  Note that a few have
already been whitelisted (which tools like GetURI will show):

  http://spamcheck.freeapp.net/whitelists/check-ws-dmoz.txt

A good method is to check them against some of the proposed
inclusion policies at:

  http://www.surbl.org/policy.html

Ryan's GetURI and Dallas' SURBL + Checker are both useful
tools that automate some of those checks:

  http://ry.ca/cgi-bin/geturi.cgi
  http://www.rulesemporium.com/cgi-bin/uribl.cgi

Basically we're looking for domains that have legitimate
(non-spam) uses or could reasonably be mentioned in hams.
Those that have legitimate uses should not be listed.

As far as measurements go, probably the most meaningful results
come from a really comprehensive ham corpus, such as some of
the ones used to test SpamAssassin.  Another way is by looking
at the rate of false positive reports, but those are probably
too infrequent to be a useful measure.  Another measurement
is the number of hits against what should be mostly legitimate
domains in things like DMOZ, Wikipedia, etc.  Those came out
in apparently similar proportions to the SA ham corpus results.
For example, the various list hit counts against DMOZ as of
October 6 were:

       4 dmoz-blocklist.ab
      61 dmoz-blocklist.jp
     165 dmoz-blocklist.ob
       2 dmoz-blocklist.ph
       8 dmoz-blocklist.sc
    1141 dmoz-blocklist.ws
    1381 total

As of today it looks like this:

       4 dmoz-blocklist.ab
      44 dmoz-blocklist.jp
      26 dmoz-blocklist.ob
       4 dmoz-blocklist.ph
       3 dmoz-blocklist.sc
     943 dmoz-blocklist.ws
    1024 total

That's only one measure against the 2.3 million unique domains
and IPs in DMOZ, but it's still a possible hint at the FP
rates in the different lists.
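
For anyone who wants to redo the arithmetic on the two tallies
above, here is a minimal Python sketch.  The counts are copied
from this message; the names (OCT_6, OCT_20, count_hits, compare)
are made up for illustration, and count_hits only assumes that a
"hit" means a listed domain also appearing in the reference set.
The last figure printed is hits per DMOZ entry, which is a hint
only and not an FP rate.

```python
# DMOZ hit counts per list, copied from the two tallies in this message.
OCT_6 = {"ab": 4, "jp": 61, "ob": 165, "ph": 2, "sc": 8, "ws": 1141}
OCT_20 = {"ab": 4, "jp": 44, "ob": 26, "ph": 4, "sc": 3, "ws": 943}

# Approximate number of unique domains and IPs in DMOZ, per the message.
DMOZ_ENTRIES = 2_300_000


def count_hits(blocklist, reference):
    """Count blocklist entries that also appear in a reference set of
    mostly legitimate domains (hypothetical helper, plain set overlap)."""
    return len(set(blocklist) & set(reference))


def compare(before, after):
    """Return {list_name: (before, after, percent_change)}."""
    result = {}
    for name in sorted(before):
        b, a = before[name], after[name]
        result[name] = (b, a, 100.0 * (a - b) / b)
    return result


if __name__ == "__main__":
    for name, (b, a, change) in compare(OCT_6, OCT_20).items():
        print(f"{name}: {b:5d} -> {a:5d} ({change:+.1f}%)")
    print(f"total: {sum(OCT_6.values())} -> {sum(OCT_20.values())}")
    # Share of DMOZ entries hit by WS today; a rough hint, not an FP rate.
    print(f"ws hits / DMOZ entries: {100.0 * OCT_20['ws'] / DMOZ_ENTRIES:.3f}%")
```

The same compare() call works on any later tally, so the trend
can be tracked as more of the WS hits get checked.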

To me the biggest surprise is the drop in OB hits.  Outblaze
folks, thanks much for working on that!

Jeff C.
--
"If it appears in hams, then don't list it."


