On Wednesday, October 20, 2004, 9:22:27 AM, Rob McEwen wrote:
As we all know, there has been much work done in the past several weeks to remove those URIs in the SURBL lists which could cause FPs.
Is there any "measuring stick" to evaluate our progress? I recall someone posting a test to the list about a month or two ago which showed an FP percentage that really seemed to disturb you (Jeff). Do you recall that message? (I can't seem to find it.) It would be great if someone could find that message and then see what results that same testing would show today.
Certainly, we are probably far from done... but it seems like this effort these past several weeks would have helped much by now?
We've definitely reduced the FPs, but we still need to keep checking the data to improve it further. For example, some of the DMOZ hits we found still need to be checked. Some of those are false positives and others would be false negatives if removed. Manual checking is tedious, but it is the best and perhaps only way to make that determination.
The list with the highest FP rate was WS at about 0.4%. Others are at least an order of magnitude (ten times) lower, which is a very significant difference. Everyone is welcomed and encouraged to help check the WS hits against DMOZ and thereby improve the usefulness and performance of SURBLs. Note that a few have already been whitelisted (which tools like GetURI will show):
http://spamcheck.freeapp.net/whitelists/check-ws-dmoz.txt
A good method is to check them against some of the proposed inclusion policies at:
http://www.surbl.org/policy.html
Ryan's GetURI and Dallas' SURBL + Checker are both useful tools that automate some of those checks:
http://ry.ca/cgi-bin/geturi.cgi
http://www.rulesemporium.com/cgi-bin/uribl.cgi
Basically we're looking for domains that have legitimate (non-spam) uses or could reasonably be mentioned in hams. Those that have legitimate uses should not be listed.
As far as measurements go, probably the most meaningful results come from a really comprehensive ham corpus, such as some of the ones used to test SpamAssassin. Another way is by looking at the rate of false positive reports, but those are probably too infrequent to be a useful measure. Another measurement is the number of hits against what should be mostly legitimate domains in things like DMOZ, Wikipedia, etc. Those came out in apparently similar proportions to the SA ham corpus results. For example, the hit counts of the various lists against DMOZ as of October 6 were:
     4 dmoz-blocklist.ab
    61 dmoz-blocklist.jp
   165 dmoz-blocklist.ob
     2 dmoz-blocklist.ph
     8 dmoz-blocklist.sc
  1141 dmoz-blocklist.ws
  1381 total
As of today it looks like this:
     4 dmoz-blocklist.ab
    44 dmoz-blocklist.jp
    26 dmoz-blocklist.ob
     4 dmoz-blocklist.ph
     3 dmoz-blocklist.sc
   943 dmoz-blocklist.ws
  1024 total
That's only one measure against the 2.3 million unique domains and IPs in DMOZ, but it's still a possible hint at the FP rates in the different lists.
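For anyone who wants to track these numbers over time, here is a rough sketch of the arithmetic in Python. It only uses the two sets of counts quoted above and the approximate 2.3 million DMOZ figure; the names (DMOZ_SIZE, hit_rate, percent_change, summarize) are made up for illustration and are not part of any SURBL tool. Keep in mind that a DMOZ hit rate is only a hint, not an FP rate, since some of those hits may be correctly listed.

```python
# Approximate number of unique domains and IPs in DMOZ (figure from this message).
DMOZ_SIZE = 2_300_000

# Hit counts per list against DMOZ, copied from the two snapshots above.
oct6 = {"ab": 4, "jp": 61, "ob": 165, "ph": 2, "sc": 8, "ws": 1141}
today = {"ab": 4, "jp": 44, "ob": 26, "ph": 4, "sc": 3, "ws": 943}


def hit_rate(hits, population=DMOZ_SIZE):
    """Hits as a percentage of the DMOZ population."""
    return 100.0 * hits / population


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = fewer hits)."""
    return 100.0 * (after - before) / before


def summarize(before, after):
    """Return (list, before, after, change %, current hit rate %) for each list."""
    return [
        (name, before[name], after[name],
         percent_change(before[name], after[name]),
         hit_rate(after[name]))
        for name in sorted(before)
    ]


if __name__ == "__main__":
    for name, b, a, change, rate in summarize(oct6, today):
        print(f"{name}: {b:5d} -> {a:5d}  ({change:+7.1f}%)  {rate:.5f}% of DMOZ")
    print(f"total: {sum(oct6.values())} -> {sum(today.values())}")
```

Run as a script, it prints one line per list plus the totals. The OB line is the one that stands out: 165 hits down to 26 is a drop of about 84%.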
To me the biggest surprise is the drop in OB hits. Outblaze folks, thanks much for working on that!
Jeff C.
--
"If it appears in hams, then don't list it."