On Tuesday, July 20, 2004, 6:58:15 AM, David Hooton wrote:
On Tue, 20 Jul 2004 15:27:52 +0200, Marc Kool m.kool@vioro.nl wrote:
I did a quick check on a few domains and I do not share your conclusion.
I think we have a slight case of culture clash here. This adult data is meant to be used in a proxy server where the data is apparently matched literally against URI data from web requests, etc.
SURBLs are designed to be used with specific email message body scanning programs. These programs attempt to reduce the domains found in message body URIs to their registrar (base) domain, so that subdomains like "models.home.att.net" are reduced to the base domain "att.net" before being included in a SURBL or checked against one.
The main reason we did this was to defeat the "random subdomain" spammers, who generate random subdomains to try to defeat simple URI pattern matching or to key their spams to confirm the recipient addresses. Examples might be "abc1.xyz.spammerdomain.com" and "abc2.xyz.spammerdomain.com". Those we want to reduce to just "spammerdomain.com", since the randomized/keyed versions may occur only once, and the sc.surbl.org data engine makes a domain more likely to be included in the list as the number of reports for it increases.
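To make the reduction concrete, here is a minimal Python sketch of the idea. It is not the actual SURBL or SpamCopURI code: the function names and the tiny TWO_LEVEL_SUFFIXES set are assumptions made for illustration, and a real implementation would need a properly maintained list of such suffixes.

```python
from collections import Counter

# Hypothetical, deliberately tiny set of suffixes under which domains are
# registered at the third level. Illustration only, not SURBL's real data.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au"}


def base_domain(host):
    """Reduce a hostname to its registrar (base) domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def count_reports(hosts):
    """Count reports per base domain, so keyed subdomains add up."""
    return Counter(base_domain(h) for h in hosts)


reports = count_reports([
    "abc1.xyz.spammerdomain.com",
    "abc2.xyz.spammerdomain.com",
])
```

With this sketch, both randomized hosts count toward "spammerdomain.com", while a literal match would see two different names with one report each. The same reduction also turns "models.home.att.net" into "att.net".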
It may be useful to read about the sc.surbl.org data:
http://www.surbl.org/data.html
and the related Implementation Guidelines:
http://www.surbl.org/implementation.html
to gain a clearer understanding of some of our design decisions.
So both Marc's and David's comments make sense in those differing contexts. The two contexts differ mainly in their handling of subdomains:
# grep aol.com domains
adultaol.com
register.oscar.aol.com
sex-aol.com
sexonaol.com
usaol.com
register.oscar.aol.com is the server used by AOL messenger and ICQ to log in - how on earth does this count as an adult website, much less a sex site?!!
And more importantly, in my first try at processing the data for use as a SURBL, "register.oscar.aol.com" got reduced to "aol.com". :-(
# grep att.net domains
adultonly.home.att.net
borderjumper.home.att.net
[...]
Ahh the plot thickens... Subdomains..
# grep -w au.com domains
aotoys.au.com
condoms.au.com
[...]
For au.com and att.net there are only adult subdomains in the blacklist. This is ok.
However, SURBLs in general don't use subdomains. I've just run a test on my personal SURBL, and SpamCopURI doesn't currently look at subdomains. I suspect this is because checking subdomains would require a lookup per domain level, which would both make things inefficient and leave room for a denial of service.
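A rough Python sketch of why per-level checking gets expensive: it only builds the list of names a hypothetical subdomain-aware client would have to query, one per domain level, and sends nothing. The function name lookup_names is an assumption for illustration and does not describe how SpamCopURI behaves.

```python
def lookup_names(host, zone="sc.surbl.org"):
    """List the DNS names to query if every domain level were checked."""
    labels = host.lower().rstrip(".").split(".")
    # Start at two labels (e.g. "att.net") and add one label per step.
    return [
        ".".join(labels[-n:]) + "." + zone
        for n in range(2, len(labels) + 1)
    ]


queries = lookup_names("models.home.att.net")
# A sender controls how many labels a URI has, so the query count
# per URI is under the sender's control too.
padded = lookup_names("a.b.c.d.e.f.spammerdomain.com")
```

Here "models.home.att.net" needs three queries instead of one, and the padded spammer hostname needs seven, which is where the denial-of-service concern comes from.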
[...]
I assume that something went wrong when you verified the quality of the database.
I think the levels of understanding of what was in the DB and what SURBL was able to do were what went wrong.
Given my very quick testing, I think it would probably be worth giving this data a try. We would most likely need to work out how to remove the subdomained entries - the list is huge, and any efficiency we can gain by removing excess data would obviously be useful.
Good suggestion, but perhaps slightly tricky to implement, depending on the data.
I can easily use a regex to delete entries with subdomains like "xxxmovies.home.att.net" so that "att.net" does not get on the list. But that would only be effective if the deliberately randomized domains like "abc.xyz.spammerdomain.com" were reduced to "spammerdomain.com" in the source data; otherwise we would lose both.
In other words, if the data is a literal transcription of everything found in spams, including randomized URIs like "abc.xyz.spammerdomain.com," then we will lose the latter if I discard all subdomains.
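The trade-off can be sketched in a few lines of Python. The regex and the "three or more labels means subdomain" rule are assumptions for illustration, not the actual filter; the rule is crude and would also misjudge names registered under two-level suffixes.

```python
import re

# Assumption: an entry "has a subdomain" if it has three or more labels.
HAS_SUBDOMAIN = re.compile(r"^[^.]+\.[^.]+\.[^.]+")


def drop_subdomain_entries(entries):
    """Keep only entries that are already plain two-label domains."""
    return [e for e in entries if not HAS_SUBDOMAIN.match(e)]


entries = [
    "xxxmovies.home.att.net",     # legitimate host, should not list att.net
    "abc.xyz.spammerdomain.com",  # randomized spammer subdomain
    "sexonaol.com",               # already a base domain
]
kept = drop_subdomain_entries(entries)
```

The filter keeps "att.net" off the list, but it also throws away the only record of "spammerdomain.com" unless the source data already contains the reduced form.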
So Marc, can you tell us if the randomized domains that spammers frequently use are reduced to the base domains in the adult data, i.e. "spammerdomain.com" and not "abc.xyz.spammerdomain.com"?
Jeff C.