Michael Renzmann wrote:
Hi all.
May I suggest that you try checking web spams with SURBLs and see what the hit rate is like. If the hit rate is significantly less than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
Will do so. I'm currently preparing the logged data and will see what rate we get for that. Will report back when I have the results.
Done, but the results are disappointing (and somewhat surprising).
I threw together a list of all recognized/blocked posts sent to madwifi.org during the last 4 months, and added a list of all blocked spam posts sent to trac-hacks.org during the last week. After refining the list as described in the implementation guidelines, removing well-known domains and the "(roughly) top 200 domains not blacklisted by SURBL", 854 domains remained [1]. These 854 domains have been tested against a selection of 14 RHSBLs [2], some of them (such as porn.rhs.mailpolice.com) being very specialized.
Rank 1, with 139 positives, is multi.surbl.org. This is quite surprising, since surbl.org focuses on e-mail spamvertisements. bsb.empty.us, which AFAIK focuses on website and comment spam, ranks 7th with just 7(!) positives. The full rank list is at [3], and the scripts used for testing as well as the "raw" results can be found at [4].
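For readers unfamiliar with how such a test works, here is a minimal sketch of an RHSBL check. It is not the script from [4]. It assumes the usual convention that a domain is looked up as `<domain>.<zone>` and that any A record in the answer means "listed". The function names and the injectable `resolve` parameter are made up for illustration.

```python
import socket
from collections import Counter


def rhsbl_query_name(domain, zone):
    """Build the DNS name to query: the tested domain prepended to the list's zone."""
    return f"{domain.strip('.').lower()}.{zone.strip('.').lower()}"


def is_listed(domain, zone, resolve=socket.gethostbyname):
    """Return True if the lookup yields an address, False if resolution fails.

    Assumption: every resolution failure is treated as "not listed", so a
    timeout is counted the same way as a genuine NXDOMAIN answer.
    """
    try:
        resolve(rhsbl_query_name(domain, zone))
    except socket.gaierror:
        return False
    return True


def rank_zones(domains, zones, resolve=socket.gethostbyname):
    """Count positives per zone and return (zone, hits) pairs, best first."""
    hits = Counter({zone: 0 for zone in zones})
    for zone in zones:
        for domain in domains:
            if is_listed(domain, zone, resolve):
                hits[zone] += 1
    return hits.most_common()
```

Calling `rank_zones(domains, zones)` with the domain list from [1] and the zone list from [2] would issue one DNS query per domain and zone, i.e. roughly 854 x 14 lookups, which is exactly the kind of load discussed in this thread.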
Conclusions:
While I already expected a considerable difference between the spamvertisements distributed by e-mail and those distributed on websites, the recognition rate advantage of multi.surbl.org over bsb.empty.us is surprising. However, a recognition rate of about 16% (139 of 854 domains) is still not good enough to justify putting additional load on surbl.org for website spam recognition.
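The rates can be reproduced from the counts given above. This is only a small arithmetic sketch; the helper name `hit_rate` is made up for illustration.

```python
def hit_rate(positives, total):
    """Recognition rate in percent: listed domains divided by tested domains."""
    return 100.0 * positives / total


TOTAL_DOMAINS = 854  # domains remaining after refining the list
positives = {"multi.surbl.org": 139, "bsb.empty.us": 7}

for zone, count in positives.items():
    print(f"{zone}: {count}/{TOTAL_DOMAINS} = {hit_rate(count, TOTAL_DOMAINS):.1f}%")
```

This prints 16.3% for multi.surbl.org and 0.8% for bsb.empty.us.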
It seems that it could be worthwhile to start yet another (more specialized) RHSBL for the described purpose. A few Trac hackers have already started working on that.
I'd like to discuss an idea I have in mind that could improve the recognition rate for RHSBLs (including surbl.org), but I have to rush back home now. I'll put that in a new mail on Monday.
Bye, Mike
[1] http://otaku42.de/static/spam-audit/rbltest/domains.lst.txt
[2] http://otaku42.de/static/spam-audit/rbltest/rhsbl.lst.txt
[3] http://otaku42.de/static/spam-audit/rbltest/ranklist.txt
[4] http://otaku42.de/static/spam-audit/rbltest/
Discuss mailing list Discuss@lists.surbl.org http://lists.surbl.org/mailman/listinfo/discuss
Dear Michael,
I don't disagree completely, only partially.
I took your list and queried our own RBL server. The results in short:

clean-mx: 789/854 domains from your list are recognized by our service
surbl: 139/854
uribl: 66/854

See the results at http://support.clean-mx.de/clean-mx/rbltest_results.txt; this list is based on your input (and also preserves its order). By the way, you should not block virgilio.it .....
Web-site spamming (blogs, guestbooks, etc.) follows a different approach from the point of view of its originators.
1) It's tricky to tweak pages on the web for abuse.
2) This is time-consuming, so only a few will do that in the "wild".
3) It's much easier to mail all this stuff over a botnet.
4) The message of all these spammers is always the same: buy ..., look at ..., obey this financial tip ..., help me ..., and so on.
5) They have to attract readers to their message, so they always have to use the same sort of linguistically acrobatic tokens.
At least the same set of keywords that stops mail spam is sufficient to detect and stop web spam.

I totally agree that the spamvertised domains in web spam are a bit different from those in mail spam, but not by much.
Yours, Gerhard (feel free to contact me off-list....)