I would like to use surbl to scan the fields of our guestbook service and then deny entries based on the results. I am currently using perl and MySQL for the service and was wondering if there are any perl modules that can be used for my purposes.
Chris
On Monday, November 6, 2006, 5:43:11 PM, Veterans Service wrote:
I would like to use surbl to scan the fields of our guestbook service and then
deny entries based on the results. I am currently using perl and MySQL for the
service and was wondering if there are any perl modules that can be used
for my purposes.
Hi Chris,
You could do this, but the set of guestbook spammers and mail spammers may not overlap too much. Therefore we somewhat recommend against it. (Remember that SURBLs list URIs advertised in email spams.)
Jeff C. -- Don't harm innocent bystanders.
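For illustration only, the kind of lookup such a guestbook check involves might look roughly like the Python sketch below (the service itself is Perl, where a DNS module such as Net::DNS can issue the same queries). The helper names and the naive two-label registered-domain reduction are simplifications, not SURBL's official algorithm.

    # Rough sketch: pull hostnames out of a guestbook field and look each
    # (naively reduced) registered domain up in multi.surbl.org.
    import re
    import socket

    URL_RE = re.compile(r'https?://([^/\s"\'>]+)', re.IGNORECASE)

    def registered_domain(host):
        # Naive reduction to the last two labels; real implementations need
        # the ccTLD handling described in the SURBL implementation guidelines.
        parts = host.lower().strip('.').split('.')
        return '.'.join(parts[-2:]) if len(parts) >= 2 else host

    def is_listed(domain, zone='multi.surbl.org'):
        try:
            socket.gethostbyname('%s.%s' % (domain, zone))
            return True       # any A record means the domain is listed
        except socket.gaierror:
            return False      # NXDOMAIN (or lookup failure): not listed

    def field_is_spammy(text):
        return any(is_listed(registered_domain(h)) for h in URL_RE.findall(text))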
Hi.
Jeff Chan wrote:
You could do this, but the set of guestbook spammers and mail spammers may not overlap too much. Therefore we somewhat recommend against it. (Remember that SURBLs list URIs advertised in email spams.)
I'm about to write a spam filter plugin for Trac [1] which makes use of SURBLs. While Trac is not a guestbook, Trac-driven sites are hit by spam posts which, according to your comment, have more in common with guestbook spam than with email spam.
That makes me wonder: a.) Is there a SURBL (outside of surbl.org) already available specifically for website spam? b.) If not, would it be worthwhile to extend the focus of surbl.org to website spam?
The mentioned plugin will be one of several that users can enable. Each plugin will modify the karma of a post (Wiki edit, new trouble ticket, comment on an existing trouble ticket, ...), and only if the karma is above a user-defined threshold will the post be accepted.
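A minimal sketch of that karma aggregation (the function names are hypothetical and not the actual Trac spam-filter API):

    # Each enabled check maps the post content to a karma delta; the post is
    # accepted only if the summed karma stays above the configured threshold.
    def evaluate_post(content, checks, threshold=0):
        karma = 0
        for check in checks:       # check: callable(content) -> int
            karma += check(content)
        return karma > threshold   # True -> accept, False -> reject

    # e.g. a SURBL-based check might return -5 for a blacklisted URI,
    # while an "author is a registered user" check might return +2.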
As the old version of Trac we currently use on madwifi.org does not provide native support for spam filtering, I installed another solution [2] based on mod_security to block spam. In order to adjust the filters if need be, I logged (and still log) all spam postings that hit our site during the last 4 months (~38,000 posts total). Is that interesting for feeding the database of a specialized website spam SURBL?
Bye, Mike
[1] http://trac.edgewall.org [2] http://madwifi.org/wiki/FightingTracSpam
On Monday, November 6, 2006, 10:39:06 PM, Michael Renzmann wrote:
I'm about to write a spam filter plugin for Trac [1] which makes use of SURBLs. While Trac is not a guestbook, Trac-driven sites are hit by spam posts which, according to your comment, have more in common with guestbook spam than with email spam.
That makes me wonder: a.) Is there a SURBL (outside of surbl.org) already available specifically for website spam?
b.) If not, would it be worthwhile to extend the focus of surbl.org to website spam?
Sure, once there's no more mail spam. ;-)
(BTW spam is about 91% of all mail now: http://www.eweek.com/article2/0,1895,2051949,00.asp .)
The mentioned plugin will be one of several that users can enable. Each plugin will modify the karma of a post (Wiki edit, new trouble ticket, comment on an existing trouble ticket, ...), and only if the karma is above a user-defined threshold will the post be accepted.
As the old version of Trac we currently use on madwifi.org does not provide native support for spam filtering, I installed another solution [2] based on mod_security to block spam. In order to adjust the filters if need be, I logged (and still log) all spam postings that hit our site during the last 4 months (~38,000 posts total). Is that interesting for feeding the database of a specialized website spam SURBL?
Bye, Mike
[1] http://trac.edgewall.org [2] http://madwifi.org/wiki/FightingTracSpam
In principle it may be useful, but we want SURBLs to stay focussed on mail spam.
May I suggest that you try checking web spams with SURBLs and see what the hit rate is like. If the hit rate is significantly less than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
For any application using our data, please be sure to follow all of the Implementation Guidelines, especially local whitelisting (exclusion lists) and caching.
http://www.surbl.org/implementation.html
Please be mindful of our DNS, etc., resources, which are all donated.
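A rough sketch of what that client-side care could look like, continuing the Python illustration above (whitelist contents, cache TTL, and helper names are placeholders, not prescribed values):

    # Skip locally whitelisted domains entirely and cache lookup results so
    # repeated checks do not generate repeated DNS queries against SURBL.
    import socket
    import time

    LOCAL_WHITELIST = {'example.com', 'w3.org'}   # local exclusion list
    CACHE_TTL = 15 * 60                           # seconds
    _cache = {}                                   # domain -> (listed?, expiry)

    def surbl_listed(domain, zone='multi.surbl.org'):
        if domain in LOCAL_WHITELIST:
            return False
        now = time.time()
        cached = _cache.get(domain)
        if cached and cached[1] > now:
            return cached[0]
        try:
            socket.gethostbyname('%s.%s' % (domain, zone))
            listed = True
        except socket.gaierror:
            listed = False
        _cache[domain] = (listed, now + CACHE_TTL)
        return listed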
Cheers,
Jeff C. -- Don't harm innocent bystanders.
Hi Jeff.
That makes me wonder: a.) Is there a SURBL (outside of surbl.org) already available specifically for website spam?
Now an answer to that question is even more interesting to me, by the way :)
In principle it may be useful, but we want SURBLs to stay focussed on mail spam.
May I suggest that you try checking web spams with SURBLs and see what the hit rate is like. If the hit rate is significantly less than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
Will do so. I'm currently preparing the logged data and will see what rate we get for that. Will report back when I have the results.
Bye, Mike
Hi all.
May I suggest that you try checking web spams with SURBLs and see what the hit rate is like. If the hit rate is significantly less than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
Will do so. I'm currently preparing the logged data and will see what rate we get for that. Will report back when I have the results.
Done, but the results are disappointing (and somewhat surprising).
I threw together a list of all recognized/blocked posts sent to madwifi.org during the last 4 months, and added a list of all blocked spam posts sent to trac-hacks.org during the last week. After refining the list as described in the implementation guidelines, removing well-known domains and the "(roughly) top 200 domains not blacklisted by SURBL", 854 domains remained [1]. These 854 domains have been tested against a selection of 14 RHSBLs [2], some of them (such as porn.rhs.mailpolice.com) being very specialized.
Rank 1, with 139 positives, is multi.surbl.org. This is quite surprising, since surbl.org focuses on e-mail spamvertisements. bsb.empty.us, which AFAIK focuses on website and comment spam, comes in at rank 7 with just 7(!) positives. The full ranking is at [3], and the scripts used for testing as well as the "raw" results can be found at [4].
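The scripts at [4] do the actual testing; the tally they perform might look roughly like this (a reconstruction for illustration, not the actual script):

    # Query every domain against every RHSBL zone and rank the zones by the
    # number of positive (resolving) lookups.
    import socket

    def listed(domain, zone):
        try:
            socket.gethostbyname('%s.%s' % (domain, zone))
            return True
        except socket.gaierror:
            return False

    def rank_zones(domains, zones):
        hits = {zone: sum(listed(d, zone) for d in domains) for zone in zones}
        return sorted(hits.items(), key=lambda item: item[1], reverse=True)

    # e.g. rank_zones(domains_from_file, ['multi.surbl.org', 'bsb.empty.us'])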
Conclusions:
============
1. While I already expected quite some difference between the spamvertisements distributed by e-mail and those distributed on websites, the recognition-rate advantage of multi.surbl.org over bsb.empty.us is surprising. However, a 16% recognition rate is still not good enough to justify adding additional load on surbl.org for website spam recognition.
2. It seems that it could be worthwhile to start yet another (more specialized) RHSBL for the described purpose. A few Trac hackers have already started working on that.
I'd like to discuss an idea I have in mind that could improve the recognition rate for RHSBLs (including surbl.org), but I have to rush back home now. I'll put that in a new mail on Monday.
Bye, Mike
[1] http://otaku42.de/static/spam-audit/rbltest/domains.lst.txt [2] http://otaku42.de/static/spam-audit/rbltest/rhsbl.lst.txt [3] http://otaku42.de/static/spam-audit/rbltest/ranklist.txt [4] http://otaku42.de/static/spam-audit/rbltest/
Michael Renzmann wrote:
Hi all.
May I suggest that you try checking web spams with SURBLs and see what the hit rate is like. If the hit rate is significantly less than for mail spam, then it may not be worth using our data (and generating the DNS queries) for the website checking application.
Will do so. I'm currently preparing the logged data and will see what rate we get for that. Will report back when I have the results.
Done, but the results are disappointing (and somewhat surprising).
I threw together a list of all recognized/blocked posts sent to madwifi.org during the last 4 months, and added a list of all blocked spam posts sent to trac-hacks.org during the last week. After refining the list as described in the implementation guidelines, removing well-known domains and the "(roughly) top 200 domains not blacklisted by SURBL", 854 domains remained [1]. These 854 domains have been tested against a selection of 14 RHSBLs [2], some of them (such as porn.rhs.mailpolice.com) being very specialized.
Rank 1, with 139 positives, is multi.surbl.org. This is quite surprising, since surbl.org focuses on e-mail spamvertisements. bsb.empty.us, which AFAIK focuses on website and comment spam, comes in at rank 7 with just 7(!) positives. The full ranking is at [3], and the scripts used for testing as well as the "raw" results can be found at [4].
Conclusions:
============
1. While I already expected quite some difference between the spamvertisements distributed by e-mail and those distributed on websites, the recognition-rate advantage of multi.surbl.org over bsb.empty.us is surprising. However, a 16% recognition rate is still not good enough to justify adding additional load on surbl.org for website spam recognition.
2. It seems that it could be worthwhile to start yet another (more specialized) RHSBL for the described purpose. A few Trac hackers have already started working on that.
I'd like to discuss an idea I have in mind that could improve the recognition rate for RHSBLs (including surbl.org), but I have to rush back home now. I'll put that in a new mail on Monday.
Bye, Mike
[1] http://otaku42.de/static/spam-audit/rbltest/domains.lst.txt [2] http://otaku42.de/static/spam-audit/rbltest/rhsbl.lst.txt [3] http://otaku42.de/static/spam-audit/rbltest/ranklist.txt [4] http://otaku42.de/static/spam-audit/rbltest/
Dear Michael,
I don't disagree entirely, only partially.
I took your list and queried our own RBL server. Results in short: our clean-mx service recognizes 789 out of 854 domains from your list; SURBL recognizes 139/854, URIBL 66/854.
See the results at http://support.clean-mx.de/clean-mx/rbltest_results.txt - this list is based on your input (and also preserves its order). By the way, you should not block virgilio.it...
Website spamming (blogs, guestbooks, etc.) takes a different approach from the point of view of its originators:
1) It's tricky to tweak pages on the web for abuse.
2) This is time-consuming, so only a few will do it "in the wild".
3) It's much easier to mail all this stuff over a botnet.
4) The message of all these spammers is always the same: buy..., look at..., follow this financial tip..., help me..., and so on.
5) They have to attract readers to their message, so they always have to use the same sort of linguistic acrobatics.
At the least, the same set of keywords that stops mail spam is sufficient to detect and stop web spam.
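A trivial illustration of that keyword reuse (the word list and threshold are invented for the example):

    # Toy sketch: score a web post against a keyword list originally built
    # for mail spam; the values here are placeholders.
    SPAM_KEYWORDS = {'viagra', 'casino', 'mortgage', 'replica', 'pharmacy'}

    def keyword_hits(text, keywords=SPAM_KEYWORDS):
        return len(set(text.lower().split()) & keywords)

    def looks_like_spam(text, threshold=2):
        return keyword_hits(text) >= threshold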
I totally agree that the spamvertised domains in web spam are a bit different from those in mail spam, but not by much.
Yours, Gerhard (feel free to contact me off-list).
Hi!
See the results at http://support.clean-mx.de/clean-mx/rbltest_results.txt - this list is based on your input (and also preserves its order). By the way, you should not block virgilio.it...
Website spamming (blogs, guestbooks, etc.) takes a different approach from the point of view of its originators.
You might want to have a look at what you are blocking. It's easy to block; it's less easy to get a low FP rate.
If you block domains like demon.co.uk, I would not want you filtering my mail from web forms. :) Same for deejay.it, one of the largest radio stations in Italy. Or what about Indiana University? Oh well.
I agree we could investigate some more to get web-based patterns going as well; that's not hard to do. But the clean-mx service, I think, isn't a good example of how to do it properly, sorry.
Bye, Raymond.
On Friday, November 10, 2006, 2:35:21 PM, Raymond Dijkxhoorn wrote:
It's easy to block; it's less easy to get a low FP rate.
If you block domains like demon.co.uk, I would not want you filtering my mail from web forms. :) Same for deejay.it, one of the largest radio stations in Italy. Or what about Indiana University? Oh well.
Raymond makes a good point. We spend a lot of effort preventing FPs in the SURBL data. It starts with making sure the basic assumptions about what to categorize as spam are correct; then it includes much processing for automatic checking, manual checking, and some whitelisting. It's quite a difficult task to automate, and it's quite difficult to get right.
We also set policies about what would be listed or not. Those policies help prevent some of the types of false positives Raymond noticed.
Jeff C. -- Don't harm innocent bystanders.
Raymond Dijkxhoorn raymond@surbl.org wrote:
By the way, you should not block virgilio.it...
That one has gone on and off our local blocklist. Why do they host so many spammers?
Joseph Brennan Lead Email Systems Engineer Columbia University Information Technology
Joseph Brennan wrote:
Raymond Dijkxhoorn raymond@surbl.org wrote:
By the way, you should not block virgilio.it...
That one has gone on and off our local blocklist. Why do they host so many spammers?
Joseph Brennan Lead Email Systems Engineer Columbia University Information Technology
Dear Joseph,
This is a big Italian provider. They have an xDSL brand named "Alice": cheap, and totally abused. But we also see legitimate emails from that range.
It's like the big xDSL providers in the States, such as swbell: the machines behind these xDSL lines are massively controlled by bots.
yours
Gerhard
On Monday, November 13, 2006, 7:11:01 AM, Gerhard (rbl) wrote:
Joseph Brennan wrote:
Raymond Dijkxhoorn raymond@surbl.org wrote:
By the way, you should not block virgilio.it...
That one has gone on and off our local blocklist. Why do they host so many spammers?
Joseph Brennan Lead Email Systems Engineer Columbia University Information Technology
Dear Joseph,
This is a big Italian provider. They have an xDSL brand named "Alice": cheap, and totally abused. But we also see legitimate emails from that range.
It's like the big xDSL providers in the States, such as swbell: the machines behind these xDSL lines are massively controlled by bots.
yours
Gerhard
Yes, definitely. Large broadband providers are going to have a lot of infected hosts on them. See for example:
Telefonica, tpnet, rr.com, Verizon, Proxad, Telecom Italia, Comcast, Wanadoo, etc. are all big broadband providers and they have high sender magnitudes probably due to botnet senders.
Jeff C. -- Don't harm innocent bystanders.
Hi.
I agree we could investigate some more to get web-based patterns going as well; that's not hard to do.
This is where I jump in with the suggestion I already mentioned on Friday :)
From what I've seen in the list of spamvertised sites (the one I used for my tests), it seems that many of them belong to masshosters such as aol.com or alice.it, or to blog providers. These services seem to be attractive to spammers, since many of them offer free webspace suited for hosting link farms and whatnot.
Currently there are two approaches to handling this in an RHSBL: put the whole domain on the blocklist, or exclude (whitelist) it. Neither approach is ideal.
Blocking those domains means that the number of false positives will rise, as this will also block legitimate websites hosted with that provider. Whitelisting them avoids that problem, but results in a higher number of false negatives (i.e. it won't catch spam sites). Something in between the two extremes would be nice.
As far as I can tell from my little investigation, it seems that these "big hosters" provide one of two schemes for their customers:
1. http://customer.hoster.tld/...
2. http://host.hoster.tld/customer/...
The "customer" part of the URI is what needs to be looked at in order to distinct spammers from non-spammers.
Example: the examined URI is http://spammer.masshoster.tld/cheap-viagra.html. As described in the surbl.org implementation guidelines, the first lookup would be for masshoster.tld. The lookup resolves, and the last octet of the result is treated as a bitmask (similar to how it is done for multi.surbl.org). Since the domain belongs to a known masshoster, and that masshoster uses hosting scheme 1, this is signalled by having the corresponding bit set in the response.
The application now does a second lookup, this time for spammer.masshoster.tld (if the hoster used scheme 2, the lookup would be for masshoster.tld.customer). If that lookup resolves, the URI is spam; otherwise it's ham. The first lookup result is not taken into consideration in either case.
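The client side of that two-stage lookup might look roughly like this (the flag bit, the zone name, and the restriction to scheme 1 are assumptions made for illustration; no such zone exists yet):

    # Sketch of the proposed two-stage lookup for masshoster domains.
    import socket

    MASSHOST_SCHEME1_BIT = 0x80   # hypothetical flag bit in the last octet

    def lookup(name, zone):
        try:
            return socket.gethostbyname('%s.%s' % (name, zone))  # e.g. '127.0.0.130'
        except socket.gaierror:
            return None

    def uri_is_spam(customer, domain, zone='emulti.example.org'):
        first = lookup(domain, zone)
        if first is None:
            return False                    # domain not listed at all
        if int(first.split('.')[-1]) & MASSHOST_SCHEME1_BIT:
            # Known masshoster, scheme 1: only the customer-specific second
            # lookup decides; the first result is otherwise ignored.
            return lookup('%s.%s' % (customer, domain), zone) is not None
        return True                         # ordinary listing: treat as spam

    # Example from the text: uri_is_spam('spammer', 'masshoster.tld')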
Advantages:
1. The modifications needed for an existing RHSBL (zone file) to implement this enhancement, as well as those needed on the client side for applications that make use of it, are not hard to implement IMO. The enhancement makes use of mechanisms that are already in place. No changes are needed to the DNS servers, as far as I can tell.
2. The second lookup becomes necessary only for known masshoster domains. The lookup application does not have to guess blindly whether a domain is a known masshoster and which "hosting scheme" it probably uses.
3. The enhancement allows raising the number of "true" positives without the negative side effect of false positives - at least as long as the RHSBL provider applies the same care as for the rest of his blocklist.
In order to be backward compatible, this enhancement should not be applied to a lookup zone that is queried by "non-enhanced" applications, at least if that zone had masshosters whitelisted before. The fact that a masshoster domain now resolves in the first lookup would be misinterpreted by applications that are not aware of the enhancement, resulting in a higher number of false positives. It would be better to "mirror" such zones (for example multi.surbl.org) into a new one (for example emulti.surbl.org, with "e" for "enhanced" ;)) and apply the changes there.
I have to admit that I'm quite new to the concept of RHSBLs, and chances are that I'm missing important points here. I'd be glad for any (fair) comments and suggestions.
Bye, Mike
On Monday, November 13, 2006, 10:11:15 PM, Michael Renzmann wrote:
As far as I can tell from my little investigation, it seems that these "big hosters" provide one of two schemes for their customers:
The "customer" part of the URI is what needs to be looked at in order to distinct spammers from non-spammers.
[...]
Advantages:
1. The modifications needed for an existing RHSBL (zone file) to implement this enhancement, as well as those needed on the client side for applications that make use of it, are not hard to implement IMO. The enhancement makes use of mechanisms that are already in place. No changes are needed to the DNS servers, as far as I can tell.
[...]
Hi Michael,
These subjects have come up before, but we've decided to list registered domains for a number of reasons:
1. Subdomains and paths are often too many to list. There are already many domains, and if we added all the possible subdomains and paths, especially for spammers who use many of them, the lists would get too large.
2. Subdomain/path abuse is the responsibility of the domain owner.
3. If there is enough subdomain or path abuse, then we may blacklist the domain. But we generally don't list mostly legitimate domains.
4. Paths or subdomains can be keyed to a specific spam or recipient to allow the spammer to confirm delivery.
5. Paths or subdomains can reveal private information about the recipient, like account numbers, etc.
6. If a domain belongs to spam gangs, etc., then we list it.
7. If a domain belongs to legitimate hosts like Yahoo, then they are responsible for the sites and should stop the abuse. It is their responsibility.
Etc.
More discussion can be found in the list archives:
http://lists.surbl.org/pipermail/discuss/ http://lists.surbl.org/pipermail/discuss/2005-October/005067.html http://lists.surbl.org/pipermail/discuss/2005-November/005133.html http://lists.surbl.org/pipermail/discuss/2006-May/005325.html [...]
and FAQ:
http://www.surbl.org/faq.html#random http://www.surbl.org/faq.html#numbered
Cheers,
Jeff C. -- Don't harm innocent bystanders.
On Friday, November 10, 2006, 1:33:12 PM, Gerhard (rbl) wrote:
I took your list and queried our own RBL server. Results in short: our clean-mx service recognizes 789 out of 854 domains from your list; SURBL recognizes 139/854, URIBL 66/854.
[...]
Website spamming (blogs, guestbooks, etc.) takes a different approach from the point of view of its originators.
1) It's tricky to tweak pages on the web for abuse.
2) This is time-consuming, so only a few will do it "in the wild".
3) It's much easier to mail all this stuff over a botnet.
4) The message of all these spammers is always the same: buy..., look at..., follow this financial tip..., help me..., and so on.
5) They have to attract readers to their message, so they always have to use the same sort of linguistic acrobatics.
At the least, the same set of keywords that stops mail spam is sufficient to detect and stop web spam.
In a general sense, yes.
I totally agree that the spamvertised domains in web spam are a bit different from those in mail spam, but not by much.
I think it has yet to be proven or disproven. Perhaps a test you could try is to see how well your web spam list detects mail spam (checking message body URIs, natürlich). If the intersection is small, that may argue that web and mail spam are different.
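One way that cross-check could be run (the corpus format, file name, and naive domain reduction are assumptions made for the sketch):

    # Pull URIs out of the bodies of a mail-spam corpus (an mbox file here)
    # and count how many of the registered domains a given zone already lists.
    import mailbox
    import re
    import socket

    URL_RE = re.compile(r'https?://([^/\s"\'>]+)', re.IGNORECASE)

    def listed(domain, zone):
        try:
            socket.gethostbyname('%s.%s' % (domain, zone))
            return True
        except socket.gaierror:
            return False

    def hit_rate(mbox_path, zone):
        domains = set()
        for msg in mailbox.mbox(mbox_path):
            for part in msg.walk():                  # handles multipart mail
                payload = part.get_payload(decode=True)
                if not payload:
                    continue
                text = payload.decode('utf-8', 'replace')
                for host in URL_RE.findall(text):
                    domains.add('.'.join(host.lower().split('.')[-2:]))
        hits = sum(listed(d, zone) for d in domains)
        return hits, len(domains)

    # e.g. hit_rate('mail-spam.mbox', 'your.webspam.zone.example')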
Jeff C. -- Don't harm innocent bystanders.
On Friday, November 10, 2006, 9:40:09 AM, Michael Renzmann wrote:
Conclusions:
1. While I already expected quite some difference between the spamvertisements distributed by e-mail and those distributed on websites, the recognition-rate advantage of multi.surbl.org over bsb.empty.us is surprising. However, a 16% recognition rate is still not good enough to justify adding additional load on surbl.org for website spam recognition.
Agreed.
2. It seems that it could be worthwhile to start yet another (more specialized) RHSBL for the described purpose. A few Trac hackers have already started working on that.
Yes, if the web versus email spam spaces turn out to be significantly different.
Jeff C. -- Don't harm innocent bystanders.
In article 45502A0A.4010609@otaku42.de, Michael Renzmann mrenzmann@otaku42.de writes
That makes me wonder: a.) Is there a SURBL (outside of surbl.org) already available specifically for website spam?
bsb.spamlookup.net, although the closest thing to documentation for it is http://bradchoate.com/projects/spamlookup
Otherwise there's always http://akismet.com/, which I believe uses a private blacklist amongst other things.
Kevin