Hi all.
May I suggest that you try checking web spam against SURBLs and see what the hit rate is like? If the hit rate is significantly lower than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
Will do so. I'm currently preparing the logged data and will see what rate we get for that. Will report back when I have the results.
Done, but the results are disappointing (and somewhat surprising).
I threw together a list of all recognized/blocked posts sent to madwifi.org during the last 4 months, and added a list of all blocked spam posts sent to trac-hacks.org during the last week. After refining the list as described in the implementation guidelines, removing well-known domains and the "(roughly) top 200 domains not blacklisted by SURBL", 854 domains remained [1]. These 854 domains have been tested against a selection of 14 RHSBLs [2], some of them (such as porn.rhs.mailpolice.com) being very specialized.
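For readers who have not queried an RHSBL before, here is a rough sketch of what a single test boils down to. It is not the script from [4]; the function names and the injectable `resolve` parameter are made up for illustration. It assumes the usual convention that a listed domain resolves to an address in 127.0.0.0/8, and that a failed lookup means "not listed" (which lumps resolver errors together with real misses).

```python
import socket

def rhsbl_query_name(domain, zone):
    """Build the DNS name to look up: <domain>.<zone> (hypothetical helper)."""
    return f"{domain.strip().rstrip('.').lower()}.{zone.strip('.').lower()}"

def is_listed(domain, zone, resolve=socket.gethostbyname):
    """Return True if the RHSBL zone answers with a 127.x.x.x address.

    `resolve` is injectable so the logic can be exercised without network access.
    """
    try:
        address = resolve(rhsbl_query_name(domain, zone))
    except socket.gaierror:
        # No A record (or the lookup failed): treated as "not listed" here.
        return False
    return address.startswith("127.")
```

Calling `is_listed("example.com", "multi.surbl.org")` would issue one DNS query for `example.com.multi.surbl.org`, so checking 854 domains against 14 lists means roughly 12,000 lookups.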
multi.surbl.org ranks first, with 139 positives. This is quite surprising, since surbl.org focuses on e-mail spamvertisements. bsb.empty.us, which afaik focuses on website and comment spam, ranks 7th with just 7(!) positives. The full ranking is at [3], and the scripts used for testing, as well as the "raw" results, can be found at [4].
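To put the two counts above in proportion, here is a small sketch of the hit-rate arithmetic. Only the two lists whose counts are quoted in this mail are included; the numbers for the other twelve are in [3]. The helper name `hit_rate` is just for illustration.

```python
def hit_rate(positives, total):
    """Percentage of tested domains that a list flagged."""
    return 100.0 * positives / total

TOTAL_DOMAINS = 854  # domains remaining after refining the list

# Positives per list, as quoted in this mail.
positives = {"multi.surbl.org": 139, "bsb.empty.us": 7}

# Print the lists from most to fewest positives.
for zone, count in sorted(positives.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{zone}: {count}/{TOTAL_DOMAINS} = {hit_rate(count, TOTAL_DOMAINS):.1f}%")
```

That works out to about 16.3% for multi.surbl.org and about 0.8% for bsb.empty.us.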
Conclusions:
============

1. While I expected quite some difference between the spamvertisements distributed by e-mail and those distributed on websites, the recognition-rate advantage of multi.surbl.org over bsb.empty.us is surprising. However, a 16% recognition rate (139 of 854 domains) is still not good enough to justify putting additional load on surbl.org for website spam recognition.
2. It seems it could be worthwhile to start yet another (more specialized) RHSBL for the described purpose. A few Trac hackers have already started working on that.
I'd like to discuss an idea I have in mind that could improve the recognition rate for RHSBLs (including surbl.org), but I have to rush back home now. I'll put that in a new mail on Monday.
Bye, Mike
[1] http://otaku42.de/static/spam-audit/rbltest/domains.lst.txt
[2] http://otaku42.de/static/spam-audit/rbltest/rhsbl.lst.txt
[3] http://otaku42.de/static/spam-audit/rbltest/ranklist.txt
[4] http://otaku42.de/static/spam-audit/rbltest/