Hi,
Jose Marcio Martins da Cruz wrote:
> I took a look some time ago, and I noticed some 440 rules, if I remember correctly.
The ruleset varies between 300 and 500 rules.
> The problem with this is that it may be efficient only for small servers, and you should clean up old unused rules.
URLs that are shut down (404) or redirected to 'error' or 'policy violation' pages are automatically removed after 4 days.
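(For the curious, the expiry check boils down to something like the Python sketch below. The file name and the error-page markers are illustrative placeholders, not the exact script I run.)

# Rough sketch of the liveness check: expire URLs that 404 or that
# redirect to an error / policy-violation page. The file name and the
# page markers below are placeholders, not the production script.
import urllib.request
import urllib.error

ERROR_MARKERS = ("not found", "policy violation", "no longer available")

def is_dead(url):
    """Return True if the URL should be expired from the ruleset."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # urlopen follows redirects; check where we actually landed.
            if "error" in resp.geturl().lower():
                return True
            body = resp.read(4096).decode("latin-1", "replace").lower()
            return any(marker in body for marker in ERROR_MARKERS)
    except urllib.error.HTTPError as e:
        return e.code == 404          # hard 404: gone for good
    except (urllib.error.URLError, OSError):
        return False                  # transient failure: keep it for now

with open("subevil_urls.txt") as f:   # placeholder file name
    alive = [u for u in (line.strip() for line in f) if u and not is_dead(u)]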
When I started, I thought that would be enough, since after a while the hosts (Geocities / Tripod) would take down the spammy pages.
It works wonders with Tripod, which has usually already shut down the offending sites before I even list them, perhaps thanks to the work of Raymond Dijkxhoorn. The situation with Geocities, however, is much worse than I expected, to say the least.
If you have a look at the current list, http://nospam.mailpeers.net/subevil.cf, you'll see that among 372 rules only 2 are for Tripod, while the other 370 are for Geocities. (There might be a sampling problem here, or spammers got fed up with their pages being destroyed before the spam run was even finished and stopped abusing Tripod!)
Among them, *** only 6 *** have been closed in the past 4 days! That's less than 2%.
No wonder spammers have been using Yahoo / Geocities for months and will keep doing so!
I also noticed that most of the pages look very similar, so probably only a few spammers are behind them. This similarity makes automated detection of their encoded relocation scripts trivial in about 85% of cases, but it raises new questions about the exact relationship between those spammers and Geocities / Yahoo.
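(To give an idea, here is a hypothetical Python detector for those escaped-JavaScript relocation pages; the patterns are illustrative of what the common ones look like, and the remaining ~15% vary too much for this to catch.)

# Hypothetical detector for the escaped-JavaScript relocation pages.
# The patterns are illustrative; the remaining ~15% of pages vary too much.
import re
from urllib.parse import unquote

RELOCATION = re.compile(
    r"(?:window\.|document\.)?location(?:\.href|\.replace)?\s*[=(]",
    re.IGNORECASE,
)

def find_redirect_target(html):
    """Return the decoded redirect target URL, or None if none is found."""
    decoded = unquote(html)            # undo the %XX escapes hiding the URL
    if not RELOCATION.search(decoded):
        return None
    m = re.search(r"https?://[^\s'\"<>)]+", decoded)
    return m.group(0) if m else None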
The complete list of spammy redirection pages Yahoo / Geocities is still hosting can be found here: http://nospam.mailpeers.net/alive_spammy.txt
Does anyone know where it should be sent at Yahoo?
------
If the situation does not improve, the list will obviously grow, and I'll have to figure out whether old addresses are 'recycled' in new spam runs or not.
I just added a length-limited version (only the 200 most recent addresses) to address your server performance concerns while keeping most of its effectiveness: http://nospam.mailpeers.net/subevil200.cf
If there is enough interest, I can also add the code for rule merging. That would greatly improve both memory usage and speed.
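(The idea behind the merging is simply to collapse the per-page rules into a few alternation regexes, so SpamAssassin compiles a handful of patterns instead of ~400. A sketch; the rule names and score are made up for illustration:)

# Sketch of rule merging: collapse one-rule-per-page into chunked
# alternations per host. Rule names and the score are illustrative only.
import re

def merged_rules(name, host, pages, chunk=100):
    """Emit .cf lines matching any of `pages` under `host`."""
    lines = []
    # Chunking keeps any single regex from growing unmanageably large.
    for i in range(0, len(pages), chunk):
        alt = "|".join(re.escape(p) for p in pages[i:i + chunk])
        rule = "%s_%d" % (name, i // chunk)
        lines.append("uri      %s  /%s\\/(?:%s)/i" % (rule, re.escape(host), alt))
        lines.append("describe %s  Spammy %s redirect page" % (rule, host))
        lines.append("score    %s  3.0" % rule)
    return "\n".join(lines)

pages = ["badpage1", "badpage2", "badpage3"]   # would come from the live list
print(merged_rules("SUBEVIL_GEO", "geocities.com", pages))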
> What's nice with URLBL is that you read the message once to extract all the URLs, and after that you query the database for the domain names you've found.
> The problem with your rules is that the message will probably be scanned 441 times looking for complex Perl expressions.
> But this is a general problem with SpamAssassin and its hundreds of rules.
Yes, the explosion of bigevil.cf (over 1 MB!) and the need for a more efficient approach were among the main reasons SURBL was created.
The reasons why those Geocities sites won't be integrated into SURBL were discussed previously. I still think it would be possible, but, at least for now, a .cf ruleset is the only way.
Also, contrary to bigevil, it should not expand indefinitely, but a large part of the problem (and of the solution) is in the hands of Yahoo / Geocities. If they start cleaning up the mess, we'll have fewer rules to carry, and their service will become less attractive to spammers. That's where I see the next battle.
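(For comparison, the single-pass URLBL approach Jose describes boils down to the sketch below: extract the domains once, then one DNS query per domain. Simplified; a real client would first reduce each hostname to its registered domain before querying.)

# Sketch of the single-pass URLBL approach: extract domains once, then
# one DNS query per domain instead of hundreds of regex scans per message.
# Simplified: real clients reduce hostnames to registered domains first.
import re
import socket

DOMAIN = re.compile(r"https?://(?:[^@/\s]+@)?([a-z0-9.-]+)", re.IGNORECASE)

def listed_domains(body, zone="multi.surbl.org"):
    """Return the domains found in `body` that the DNS blocklist knows."""
    domains = set(m.group(1).lower().rstrip(".") for m in DOMAIN.finditer(body))
    listed = set()
    for d in domains:
        try:
            socket.gethostbyname("%s.%s" % (d, zone))  # any A record = listed
            listed.add(d)
        except socket.gaierror:
            pass                                       # NXDOMAIN = not listed
    return listed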
> In other words, with your rules, you'll greatly increase the scanning time without adding much effectiveness to your filter. Good for small servers!
The problem with those Geocities spams is that they tend to generate a very low score and go undetected. Sure, the rules won't hit much of the server's total mail load, but the messages they do hit will contribute significantly to lowering false negatives (15 to 20% on my server, which is what got me interested in a solution).
On servers such as mine, where I have a 4-level response (3 levels of tagging + quarantine if the score is > 20), it also helps push the mildly spammy ones above 20, so my users won't even be bothered by them.
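(For those who don't run tiered setups, the idea is just to map the final score to increasingly strong actions. A sketch; only the quarantine cutoff (> 20) is my actual setting, the three tag thresholds below are invented for illustration:)

# Sketch of a 4-level response. Only the quarantine cutoff (> 20) is my
# actual setting; the three tagging thresholds are invented placeholders.
def action_for(score):
    if score > 20:
        return "quarantine"
    if score > 12:
        return "tag-high"      # placeholder threshold
    if score > 8:
        return "tag-medium"    # placeholder threshold
    if score > 5:
        return "tag-low"       # placeholder threshold
    return "deliver"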
One feature that might boost performance for this kind of ruleset would be conditional rule skipping.
With a single test on (Geocities|tripod), it would be fast and easy to skip all the other tests for the ~95% of mails that reference neither of these sites.
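(In sketch form, the gate would look like this: one cheap scan decides whether the hundreds of expensive per-page regexes run at all.)

# Sketch of conditional rule skipping: one cheap gate test decides whether
# the ~400 expensive per-page regexes are worth running at all.
import re
from collections import namedtuple

Rule = namedtuple("Rule", "name regex")
GATE = re.compile(r"geocities|tripod", re.IGNORECASE)

def scan(body, rules):
    """Run the expensive per-page rules only when the cheap gate matches."""
    if not GATE.search(body):   # ~95% of mail exits here after a single pass
        return []
    return [r.name for r in rules if r.regex.search(body)]

rules = [Rule("SUBEVIL_GEO_EX", re.compile(r"geocities\.com/badpage1"))]
print(scan("see http://geocities.com/badpage1 now", rules))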
Addressing SpamAssassin's performance issues is very important, but it's beyond the scope of my simple ruleset.
Regards,
Eric.