Hi.
I agree we could investigate some more to get web-based patterns going as well; that's not that hard to do.
This is where I jump in with the suggestion I already mentioned on Friday :)
From what I've seen in the list of spamvertised sites (the one I used for my tests), many of them belong to masshosters such as aol.com, alice.it, or blog providers. These services seem to be attractive to spammers, since many of them offer free webspace well suited to hosting link farms and the like.
Currently there are two approaches for a rhsbl to handle this: put the whole domain on the block list, or exclude (whitelist) it. Both approaches are far from ideal.
Blocking those domains raises the number of false positives, since it also blocks legitimate websites hosted with that provider. Whitelisting them avoids that problem, but results in more false negatives (i.e. spam sites hosted there slip through). Something between the two extremes would be nice.
As far as I can tell from my little investigation, it seems that these "big hosters" provide one of two schemes for their customers:
1. http://customer.hoster.tld/...
2. http://host.hoster.tld/customer/...
The "customer" part of the URI is what needs to be looked at in order to distinct spammers from non-spammers.
Example: the examined URI is http://spammer.masshoster.tld/cheap-viagra.html. As described in the surbl.org implementation guidelines, the first lookup is for masshoster.tld. The lookup resolves, and the last octet of the result is treated as a bitmask (similar to how it is done for multi.surbl.org). Since the domain belongs to a known masshoster, and that masshoster uses hosting scheme 1, this is signalled by setting the corresponding bit in the response.
The application now does a second lookup, this time for spammer.masshoster.tld (if the hoster used scheme 2, the lookup would be for masshoster.tld.customer). If that lookup resolves, the URI is spam; otherwise it's ham. The first lookup result is not taken into consideration in either case.
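The two-stage lookup could look roughly like this in Python. The bit values and the zone name are invented for the example, and resolve(name) stands in for a DNS A-record query that returns an address string or None on NXDOMAIN:

```python
# Hypothetical bits in the last octet of the first lookup's answer:
MASSHOSTER_SCHEME1 = 64   # masshoster, customer is a subdomain
MASSHOSTER_SCHEME2 = 128  # masshoster, customer is the first path segment
ZONE = "emulti.surbl.org"  # invented "enhanced" zone name

def is_spam_uri(domain, customer, resolve):
    """Two-stage check of domain/customer against the list zone."""
    answer = resolve("%s.%s" % (domain, ZONE))
    if answer is None:
        return False                      # not listed at all
    bits = int(answer.rsplit(".", 1)[1])  # last octet as bitmask
    if bits & MASSHOSTER_SCHEME1:
        second = "%s.%s.%s" % (customer, domain, ZONE)
    elif bits & MASSHOSTER_SCHEME2:
        second = "%s.%s.%s" % (domain, customer, ZONE)
    else:
        return True                       # ordinary listing: plain hit
    # Only the second lookup decides; the first result is ignored.
    return resolve(second) is not None

# Usage with a fake resolver standing in for DNS:
fake = {
    "masshoster.tld.emulti.surbl.org": "127.0.0.64",
    "spammer.masshoster.tld.emulti.surbl.org": "127.0.0.2",
}
is_spam_uri("masshoster.tld", "spammer", fake.get)   # spam: second lookup hits
is_spam_uri("masshoster.tld", "innocent", fake.get)  # ham: second lookup misses
```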
Advantages:
1. The modifications needed for an existing rhsbl (zone file) to implement this enhancement, as well as for the applications that use it on the client side, should not be hard to make IMO. The enhancement relies on mechanisms that are already in use, and no changes are needed to the DNS servers, as far as I can tell.
2. The second lookup becomes necessary only for known masshoster domains. The lookup application doesn't have to guess blindly whether a domain is a known masshoster or which "hosting scheme" it probably uses.
3. The enhancement makes it possible to raise the number of true positives without the side effect of additional false positives - at least as long as the rhsbl provider applies the same care as for the rest of their blocklist.
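To illustrate point 1, a zone-file fragment for the "enhanced" list might look like this (names and the bit value are invented, following the scheme-1 example from above):

```
; hypothetical fragment of an "enhanced" rhsbl zone
; bit 64 in the last octet marks masshoster.tld as a scheme-1 masshoster
masshoster.tld          IN A 127.0.0.64
; known spamming customers are listed individually under that domain
spammer.masshoster.tld  IN A 127.0.0.2
```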
For backward compatibility, this enhancement should not be applied to a lookup zone that is queried by "non-enhanced" applications, at least if that zone previously had masshosters whitelisted. The fact that a masshoster domain now resolves on the first lookup would be misinterpreted by applications that are not aware of the enhancement, resulting in more false positives. It would be better to "mirror" such zones (for example multi.surbl.org) to a new one (for example emulti.surbl.org, with "e" for "enhanced" ;)) and apply the changes there.
I have to admit that I'm quite new to the concept of rhsbls, and chances are that I'm missing important points here. I'd be glad for any (fair) comments and suggestions.
Bye, Mike