[SURBL-Discuss] Re: One way to handle the Geocities spam

Eric Montréal erv at mailpeers.net
Thu Dec 15 02:14:57 CET 2005


Jose Marcio Martins da Cruz wrote:

>I took a look some time before, and I noticed some 440 rules, if I remember.
The ruleset is between 300 and 500 rules.

>The problem with this is that it may be efficient only for small servers, and
>you should clean up old unused rules.
URLs that are shut down (404) or redirected to 'error' or 'policy 
violation' pages are automatically removed
after 4 days.
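
As an illustration of that expiry policy, here is a minimal Python sketch. It is not the script that actually maintains subevil.cf: the function name, the dictionary layout and the example URLs are assumptions made for this example. Only the 4-day delay comes from the description above.

```python
from datetime import datetime, timedelta

GRACE = timedelta(days=4)  # removal delay described above

def prune_dead(entries, now):
    """Drop entries that have been dead for at least GRACE.

    `entries` (hypothetical layout) maps a URL to the time it was first
    seen dead (404, or redirected to an error / policy page), or to None
    if the page is still alive.
    """
    return {
        url: dead_since
        for url, dead_since in entries.items()
        if dead_since is None or now - dead_since < GRACE
    }

# Example with made-up URLs:
now = datetime(2005, 12, 15)
entries = {
    "geocities.com/example_alive": None,
    "geocities.com/example_dead_2_days": datetime(2005, 12, 13),
    "geocities.com/example_dead_5_days": datetime(2005, 12, 10),
}
kept = prune_dead(entries, now)
```

Only the page that has been dead for 5 days is dropped here; the other two stay listed.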

When I started, I thought that would be enough, since after a while the 
hosts (Geocities / Tripod) would
cancel the spammy domains.

It works wonders with Tripod, who have usually already shut down the
offending sites before I list them,
maybe due to the work of Raymond Dijkxhoorn, but the situation with
Geocities is much worse than
I expected, to say the least.

If you have a look at the current list
http://nospam.mailpeers.net/subevil.cf you'll see that among 372 rules
only 2 are for Tripod (there might be a sampling problem here, or
spammers are fed up with their pages
being destroyed before the spam is even completely sent and have stopped
abusing Tripod!) and the other 370
are for Geocities.

Among them, *** only 6 *** have been closed in the past 4 days! That's
less than 2%.

No wonder spammers have been using Yahoo / Geocities for months and
will keep doing so!

I also noticed that most of the pages look very similar, which suggests
only a few spammers are using it. This similarity makes
the automated detection of their encoded relocation scripts trivial in
about 85% of cases, but it leads to new
questions about the exact relationship between them and Geocities / Yahoo.

The complete list of spammy redirection pages Yahoo / Geocities is still 
hosting can be found here:
Does anyone know where it should be sent at Yahoo?


If the situation does not improve, the list will obviously grow, and 
I'll have to figure out if old addresses are
'recycled' in new spam runs or not.

I just added a length-limited version (only the most recent 200
addresses) to address your server performance
concerns while keeping most of its effectiveness.

If there is enough interest, I can also add the code for rule merging.
That would greatly improve
both memory usage and speed.
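
To make the idea of rule merging concrete, here is a minimal Python sketch that folds many per-page patterns into a single alternation, so the message is scanned once instead of once per rule. This is not the merging code mentioned above, which has not been posted. The function name `merge_rules`, the default host and the page names are hypothetical.

```python
import re

def merge_rules(paths, host="geocities.com"):
    """Build one case-insensitive regex matching any listed page on `host`.

    `paths` is a list of page names (made-up examples below). Each is
    escaped so it is matched literally, then all are joined into a single
    alternation.
    """
    alternation = "|".join(re.escape(p) for p in sorted(set(paths)))
    pattern = r"\b" + re.escape(host) + r"/(?:" + alternation + r")\b"
    return re.compile(pattern, re.IGNORECASE)

# Two hypothetical listed pages become a single compiled pattern.
merged = merge_rules(["spammer_one", "spammer_two"])
listed = merged.search("see http://www.GeoCities.com/spammer_two/index.html")
unlisted = merged.search("http://geocities.com/innocent_page")
```

The trailing `\b` keeps a listed name from matching as a prefix of a longer, unlisted page name.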

>What's nice with URLBL is that you read the message once to extract all URLs and
>after that, you query the database looking for the domain names you've found.
>The problem with your rules is that the message will probably be scanned 441
>times looking for complex perl expressions.
>But this is a general problem with SpamAssassin with its hundreds rules.
Yes, the explosion of bigevil.cf (over 1M!) and the need for a more
efficient way was one of the main reasons
SURBL was created.

The reasons why those Geocities sites won't be integrated in SURBL were
previously discussed. I still think
it would be possible, but, at least for now, a .cf ruleset is the only way.

Also, unlike bigevil, it should not expand indefinitely, but a
large part of the problem (and the solution) is
in the hands of Yahoo / Geocities. If they start cleaning up the mess,
we'll have fewer rules, and their service will become
less attractive for spammers. That's where I see the next battle.

>In other words, with your rules, you'll highly increase the scanning time
>without adding too much efficiency to your filter. Good for small servers !
The problem with those Geocities spams is that they tend to generate a very
low score and go undetected.
Sure, the rules won't hit much of the total server mail load, but the
messages they do hit will contribute significantly
to lowering false negatives (15 to 20% on my server, that's what
got me interested in a solution).

On servers such as mine, where I have a 4-level response (3 levels of
tagging + quarantine if > 20), it also
helps the mildly spammy ones get above 20, so my users won't even be
bothered by them.

One feature that might boost the performance of this kind of rule
would be conditional rule skipping.

With a single test on (Geocities|tripod), it would be fast & easy to
skip all the other tests for the 95% of all
mail that references neither of these sites.
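
The skipping idea can be sketched in a few lines of Python. This only illustrates the logic outside SpamAssassin and is not an existing SpamAssassin feature. The names `PREFILTER` and `count_hits` and the sample rule are hypothetical.

```python
import re

# One cheap test, run once per message.
PREFILTER = re.compile(r"geocities|tripod", re.IGNORECASE)

def count_hits(message, rules):
    """Run the per-page rules only if the message mentions either host."""
    if not PREFILTER.search(message):
        return 0  # neither host is mentioned: skip every per-page rule
    return sum(1 for rule in rules if rule.search(message))

# A single made-up per-page rule stands in for the full ruleset.
rules = [re.compile(r"geocities\.com/spammer_one\b", re.IGNORECASE)]
clean = count_hits("Lunch at noon?", rules)
spam = count_hits("Visit http://www.geocities.com/spammer_one/ now", rules)
```

Messages that mention neither host cost one regex search instead of several hundred.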

Fixing SpamAssassin's performance issues is very important, but
beyond the scope of my simple ruleset.


