Daniel Quinlan, one of the principal SpamAssassin architects, had some good suggestions for reducing false positives in the SURBL data. One was using public databases of URIs, particularly hand-built ones like dmoz.org and wikipedia.org or even yahoo.com as sources of mostly legitimate domains. (The wikipedia is not a web directory in a conventional sense; it's more like an open encyclopedia, but it has a relatively large collection of URIs.)
Presumably most of the URIs in these are legitimate and don't belong to spammers, especially in DMOZ since it's hand-built. So the question is: can these be useful as whitelist sources or perhaps as one of the checks on new SURBL additions?
The DMOZ open directory publishes its data in RDF form at:
So we downloaded the URL data, extracted the domains and compared them against the SURBL block and whitelists:
% join dmoz.srt ../multi.domains.sort | wc
1338 1338 20533
% join dmoz.srt ../whitelist-domains.sort | wc
7375 7375 96720
% join dmoz.srt ../multi.domains.sort > dmoz-blocklist.txt
% join dmoz.srt ../whitelist-domains.sort > dmoz-whitelist.txt
There were 1338 DMOZ hits against our blocklisted domains and 7375 against our whitelists. You can view those matches at:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-whitelist.txt
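For anyone who wants to repeat the comparison without join(1), here is a minimal Python sketch of the same idea: intersect two lists of domains, one domain per line. The function name and the sample domains are made up for illustration; the real inputs would be files like dmoz.srt and multi.domains.sort, which appear to hold one domain per line (the wc line and word counts above are equal).

```python
def common_domains(lines_a, lines_b):
    """Return the sorted domains present in both inputs (one domain per line)."""
    def normalize(lines):
        # Strip newlines and whitespace, lowercase, and skip blank lines.
        return {line.strip().lower() for line in lines if line.strip()}
    return sorted(normalize(lines_a) & normalize(lines_b))

# Hypothetical sample data standing in for the DMOZ and blocklist files.
dmoz = ["example.org\n", "example.com\n", "example.net\n"]
blocklist = ["example.net\n", "spam.example\n"]

hits = common_domains(dmoz, blocklist)
print(len(hits), hits)
```

Open file objects can be passed directly, since they iterate line by line. Unlike join, this does not require the inputs to be sorted.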
Of the 1338 DMOZ hits against our blocklists, which arguably could be false positives, most are in WS. Here is a list with the data from multi.surbl.org showing list membership included:
% join dmoz.srt ../multi.domains.summed > dmoz-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
And some list counts from those hits:
[ws] hits: 1173
[ob] hits: 165
[jp] hits: 61
[sc] hits: 8
[ab] hits: 4
[ph] hits: 2
These add up to more than 1338 since some records hit multiple lists. The actual hits are in:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ob
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.jp
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.sc
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ab
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ph
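As a quick check on why the per-list counts exceed 1338, here is a small Python sketch. The first part only adds up the figures quoted above. The helper and its sample records are hypothetical and merely show how one domain on two lists is counted twice; they do not reflect the actual format of the summed file.

```python
from collections import Counter

# Per-list hit counts as reported above.
list_hits = {"ws": 1173, "ob": 165, "jp": 61, "sc": 8, "ab": 4, "ph": 2}
distinct_domains = 1338

total_memberships = sum(list_hits.values())
# Memberships beyond the first one for each domain.
extra_memberships = total_memberships - distinct_domains
print(total_memberships, extra_memberships)  # 1413 75

def per_list_counts(memberships):
    """Count hits per list from a mapping of domain -> set of list names."""
    counts = Counter()
    for lists in memberships.values():
        counts.update(lists)
    return counts

# Hypothetical records: b.example is on two lists, so it is counted twice.
sample = {"a.example": {"ws"}, "b.example": {"ws", "jp"}}
print(per_list_counts(sample))
```

The 75 extra memberships mean that at most 75 of the 1338 domains sit on more than one list, and fewer if some are on three or more.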
Data source folks, please review these and try to determine which ones are FPs and which would result in false negatives if they came off the lists. For ones that are FPs you may want to eliminate them on your end. For the ones that could cause FNs, we'd like to know about those as a measure of using the DMOZ data for whitelisting. Right now I'm leaning towards whitelisting all of these, so please speak up!
The 7.4k DMOZ whitelist hits represent a majority of the 12.25k whitelist entries that are not reserved .us geographic domains, so there is significant overlap between DMOZ and our existing whitelists, which probably speaks well for both lists.
% wc dotus_reservedlist_v3.lower.sort
52049 52049 1012735 dotus_reservedlist_v3.lower.sort
% wc ../whitelist-domains.sort
64299 64299 1169155 ../whitelist-domains.sort
% join dmoz.srt dotus_reservedlist_v3.lower.sort | wc
7 7 112
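The 12.25k figure and the "majority" claim can be reproduced from the wc counts above. This is only a back-of-the-envelope sketch; it assumes the reserved .us list is wholly contained in the whitelist file, which the wording above implies but the output does not prove.

```python
# Line counts taken from the wc output above.
total_whitelist = 64299
reserved_us = 52049
dmoz_whitelist_hits = 7375

# Assumption: every reserved .us entry is also a whitelist entry.
other_entries = total_whitelist - reserved_us
share = dmoz_whitelist_hits / other_entries
print(other_entries, f"{share:.1%}")  # 12250 60.2%
```

Since only 7 DMOZ domains match the reserved .us list, nearly all of the 7375 whitelist hits fall among the 12250 other entries.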
The DMOZ data has about 2.3 million domains. How does anyone feel about adding them to our whitelists? A 1.2 MB gzip of the extracted domains is at:
http://spamcheck.freeapp.net/whitelists/dmoz.srt.gz
I think we can safely say that whitelisting DMOZ domains will reduce FPs. Probably a more important question is: how many FNs would that cause? In other words, how many purely spam domains are in DMOZ, where whitelisting them would wrongly exclude spam domains from SURBLs?
One way to answer that is to note that the lists ab, sc, jp, ph, which have much lower FP rates than ws (measured by the SpamAssassin corpora checks, for example, and also anecdotally by human FP reports) appear relatively infrequently in the DMOZ hits. In other words, SURBL lists that we know are quite spammy like sc, jp, etc. don't match DMOZ often, so the DMOZ data may not have too many spam domains.
Similar tests could be done against other proposed whitelists. (We'll probably try the wikipedia data next.)
Another concern is that since these directories are relatively open, spammers could simply add themselves and effectively get whitelisted. However, I intend to take a snapshot of these and probably not try to refresh the data very often in future (if at all), instead using them as relatively static snapshots of established domains. Doing that would miss some new additions, but could also prevent some future abuse by spammers. On the other hand, 2 million domains is a pretty good start.... :-)
Extraction scripts are not perfect, particularly in the simplistic chopping to three levels of cctlds, but they're probably adequate:
http://spamcheck.freeapp.net/whitelists/extract-dmoz-domains
http://spamcheck.freeapp.net/whitelists/chop-two-level-domains.sed
http://spamcheck.freeapp.net/whitelists/reduce-to-third-level.sed
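To illustrate what "chopping" means here, the following Python sketch reduces a hostname to two levels, or to three levels when it ends in a known two-level ccTLD suffix. It is not the sed scripts linked above. The function name and the tiny suffix set are assumptions for illustration, and a real list of two-level suffixes would be much longer.

```python
# Hypothetical sample of two-level ccTLD suffixes; not the real list.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "com.ar"}

def registered_domain(hostname, two_level_suffixes=TWO_LEVEL_SUFFIXES):
    """Chop a hostname to its last two labels, or three under a two-level suffix."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in two_level_suffixes:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

print(registered_domain("www.tripod.com.ar"))  # tripod.com.ar
print(registered_domain("www.dmoz.org"))       # dmoz.org
```

Any ccTLD suffix missing from the set gets chopped to two levels, which is the kind of imperfection mentioned above. IP addresses are not handled.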
Comments please,
Jeff C.
[jp] hits: 61
Hi Jeff,
I grabbed the JP hits and started looking at them. Some of them are clearly spammers, some may be UC candidates and at least one is a FP (tripod.com.ar). I'll go through all of them as soon as I can:
The following DMOZ [JP] entries are not on my local list, so they may have been supplied by Raymond / Prolocation:
bigprizes.com
ebigchina.com
global2000hosting.net
imperialmortgage.com
lsbodyjewelry.com
placement-uk.com
psychicrealm.com
quuxuum.org
The remainder are from my list and I'll verify them one by one. Here's some input:
1800patches.com
- 210 NANAS sightings
- Spamhaus SBL15666
adultlounge.com
adultloveline.com
allofem.com
ancientacu.com
bet-at-home.com
christineyoung.com
coid.biz
coins-and-banknotes.com
- spam sent to a Norwegian collaborator of mine
diademtravel.com
- see smyrnagroup.net
digienjoy.com
ebonyexclusive.com
evidence-eliminator.com
fantasy-mail.com
fattyfarm.com
flashcash.com
greenguyandjim.com
incomebuddy.com
jackpot.com:
kaplancollege.edu
- 31 NANAS sightings
- SBL17199
- persistent spams over extended period
- no response to attempts to contact
knorad.com
lasseters.com.au
lovercash.com
manevent.de
medchoicelabs.com
moneytrend.at
movieerotica.com
mymailgenie.com
online-dictionary.biz
pcbugdoctor.com
pibcash.com
platinumbucks.com:
- 123 NANAS sightings
- listed on SORBS
- SBL7867
pornindustryjobs.com
realage.com
realtimevideos.com
robotreply.com
silvercash.com
smyrnagroup.net:
- notorious spammer from Turkey (travel agency)
- persistent usenet and email spams in .de/.ch
- can be blocked by email address, as the addresses are relatively static.
thebingoaffiliates.com
tiptopjob.com
tomsnewbiebooster.com
tripod.com.ar
- Oops, FP!
tvujdum.cz
umtscom.org
vicp.net
virtuagirl2.com
visaforyou.com
vistaprint.com:
- 130 NANAS sightings
- Spamhaus SBL14856
webspace4free.biz
webway.at
wujidomartialarts.com
xboxchips.com
yesmoke.ch:
- Mail order tobacco store, advertised in spam sent to a dormant personal mailbox on 2004-05-26.
For the ones that could cause FNs, we'd like to know about those as a measure of using the DMOZ data for whitelisting. Right now I'm leaning towards whitelisting all of these, so please speak up!
I may ask you to remove some of the whitelist entries again when I've had time to check all of my list :-)
Joe
On Wednesday, October 6, 2004, 5:31:15 AM, Joe Wein wrote:
[jp] hits: 61
Hi Jeff,
I grabbed the JP hits and started looking at them. Some of them are clearly spammers, some may be UC candidates and at least one is a FP (tripod.com.ar). I'll go through all of them as soon as I can:
Thanks much Joe!
A couple points:
1. We haven't whitelisted any of these yet.
2. We need to bias against listing. I don't dispute that some of these do send some spams. The question remains, as ever, whether any have legitimate (non-spam) uses. Those that do probably should not be listed.
3. It is possible that some true spammers got into DMOZ, but most probably aren't.
Cheers,
Jeff C. -- "If it appears in hams, then don't list it."
On 10/6/04 7:44 AM, "Jeff Chan" wrote:
- It is possible that some true spammers got into DMOZ, but
most probably aren't.
I would be concerned with the people that buy up expired domains that have that sought after page rank, I know a lot of porn operators do that and they then start spamming, DMOZ is known for not updating their records.
Just my 2$
On Wednesday, October 6, 2004, 7:14:35 AM, David Thurman wrote:
I would be concerned with the people that buy up expired domains that have that sought after page rank, I know a lot of porn operators do that and they then start spamming, DMOZ is known for not updating their records.
Good point.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff wrote:
A couple points:
We haven't whitelisted any of these yet.
We need to bias against listing. I don't dispute that some
of these do send some spams. The question remains, as ever, whether any have legitimate (non-spam) uses. Those that do probably should not be listed.
- It is possible that some true spammers got into DMOZ, but
most probably aren't.
Here's an update on the state of my checking:
[SP] = spammer
[UC] = spammy, but not for SURBL
[FP] = false positive
[TBD] = to be determined
1800patches.com [SP]
- created in 1999
- 210 NANAS sightings
- Spamhaus SBL15666
- listed in [WS]
- received in spamfeed on 2004-09-21
adultlounge.com [FP?]
- created in 1997
- no NANAS listings
- NS blacklisted, SBL10966
- advertised in mail received in spamfeed on 2004-10-07 (no postal address, sent from adtmarket.com domain)
adultloveline.com: [FP?]
- created in 2002
- 11 NANAS listings, most from 2002 and 2003
- listed on [WS]
- spam sent to a spamtrap, advertising someone's entry on the site
- sent via http://list.freemailpass.com
allofem.com [FP]
- created 2000
- NS blacklisted (conpuppy.com)
- listed on [WS]
- found in spamfeed on 2004-09-28 but may have been valid subscription by recipient
ancientacu.com [FP?]
- created in 2002
- no NANAS listings
- NS listed in open relay database
- spam received on 2004-05-14 at German mailbox, from China, fake Hotmail sender
- also spammed some mailing lists
- may have legitimate uses
bet-at-home.com [FP?]
- sportsbetting site, created 1999
- 58 NANAS listings, most recent 2003-12
- mail received 2004-07-05, probably affiliate spam
christineyoung.com [FP?]
- domain created 2001
- no NANAS reports
- NS blacklisted SBL17961
- mail sent by sex4nothing.net to friend's mailbox on 2004-08-07 who forwarded it
- mail claimed subscription but used many anti-filter techniques
- domain mentioned only as URL within URL
coid.biz [FP]
- Indonesian portal and webmail site, created 2003
- no NANAS
- NS not blacklisted
- listed by [WS]
- abused as fake sender in pill spam on 2004-04-04
coins-and-banknotes.com [FP?]
- Norwegian coin site, spam sent to a Norwegian mailbox, recipient has no interest in coins whatsoever
diademtravel.com [SP]
- see smyrnagroup.net
digienjoy.com [FP, block locally?]
- Taiwanese video conferencing product, created in 2002
- 2 NANAS postings
- mail received on 2004-09-16, very similar to the NANAS spamtrap posting
- looks like a legitimate company that sometimes spams
ebonyexclusive.com [TBD]
- adult site, created 2001
- no NANAS sightings
- NS blacklisted SBL18947
- listed on [WS]
- advertised in mail from spamfeed on 2004-09-23 sent by adtmarket.com
- no postal address in mail, but image with feedback code
evidence-eliminator.com [SP]
- created 1999, spamming since at least 2000
- 340 NANAS sightings
- NS and MX blacklisted SBL10095
- spam received 2003-05-28
fantasy-mail.com [TBD]
- adult site, created in 1999
- 228 NANAS sightings
- NS and site blacklisted
- banner ad in mail in spam feed on 2004-08-30
- the fantasy-mail.com list itself seems confirmed opt-in.
fattyfarm.com [TBD]
flashcash.com [TBD]
greenguyandjim.com: [FP, removed]
- appeared as sender domain for a refinance spam for finalsavings.com
- went unnoticed because it's hosted by national.net (spammy porn-hoster), therefore NS are listed on Spamhaus, and only 4 months old
incomebuddy.com [TBD]
jackpot.com: [TBD]
kaplancollege.edu: [UC]
- 31 NANAS sightings
- SBL17199
- persistent spams over extended period
- no response to attempts to contact
knorad.com [TBD]
lasseters.com.au [TBD]
lovercash.com [TBD]
manevent.de: [FP]
- sex contact mail to spamtrap included link to manevent.de (sex party site)
- no NANAS, no SBL, site seems to have legitimate uses
medchoicelabs.com [TBD]
moneytrend.at [TBD]
movieerotica.com [TBD]
mymailgenie.com [TBD]
online-dictionary.biz [TBD]
pcbugdoctor.com [TBD]
pibcash.com [TBD]
platinumbucks.com: [SP]
- Spamhaus SBL7867 [marketingx.com/platinumbucks.com]
- 123 NANAS sightings
- listed on SORBS
- spam on 2004-03-10 advertising whitepussyblackcocks.com used image hosted at pb
- domain created 1999 but hosted by national.net
- claim "zero-tolerance for spamming" by affiliates
pornindustryjobs.com: [SP]
- 23 NANAS
- the domain appears to have been suspended for spamming on or before 2004-09-12 and is not currently active.
realage.com [TBD]
realtimevideos.com [TBD]
robotreply.com [TBD]
silvercash.com [TBD]
smyrnagroup.net: [SP]
- notorious spammer from Turkey (travel agency)
- persistent usenet and email spams in .de/.ch
- can be blocked by email address, as they only use a few sender email addresses.
thebingoaffiliates.com [TBD]
tiptopjob.com [SP]
- job search site created in 2000
- received bulkmail from marketing@tiptopjob.com, 2004-05-12
- many samba.org, debian.org, kde.org mailinglists got same spam in May/June
- blacklisted on WS
- google finds tons of directory-type hits, but little else (search engine spamming?)
- NS has SBL for another domain
- no NANAS listings
- outgoing mailserver not blacklisted anywhere
tomsnewbiebooster.com [TBD]
tripod.com.ar [FP, removed]
- Oops!
tvujdum.cz [UC]:
- sent spam on 2004-02-16 advertizing "deinwohnen.de"
- same spam received by many German users
- no response when contacted
- probably no hardcore spammer
umtscom.org [SP]
- WAP Advertising Ltd.
- registered in 2000
- spam sent 2004-02-18 to addr probably harvested off web
vicp.net [TBD]
virtuagirl2.com [TBD]
visaforyou.com [TBD]
vistaprint.com: [SP]
- 130 NANAS sightings
- Spamhaus SBL14856
webspace4free.biz [TBD]
webway.at [FP, add to local blacklist]
- coin collector magazine
- unsolicited subscription of an unused mail account on 2004-09-24
- appears to have legitimate use
wujidomartialarts.com [SP]
- created 2003
- no NANAS
- NS not blacklisted
- listed on [WS]
- spam sent directly to Raymond's personal address (from=info@wujido.com) from a SWBell DSL account, advertising this domain
xboxchips.com [SP]
- created in 2003
- 2 NANAS sightings (direct to MX from a DSL account in Cyprus)
- spam received on 2004-02-21, same spam run as NANAS, same source
- domain no longer live
yesmoke.ch: [UC, but blacklisting locally]
- Mail order tobacco store, advertised in spam sent to a dormant personal mailbox on 2004-05-26.
- they have an MLM affiliate program, it probably was affiliate spam
Joe
On Thursday, October 7, 2004, 1:14:48 AM, Joe Wein wrote:
[SP] = spammer
[UC] = spammy, but not for SURBL
[FP] = false positive
[TBD] = to be determined
1800patches.com [SP]
- created in 1999
- 210 NANAS sightings
- Spamhaus SBL15666
- listed in [WS]
- received in spamfeed on 2004-09-21
adultlounge.com [FP?]
- created in 1997
- no NANAS listings
- NS blacklisted, SBL10966
- advertised in mail received in spamfeed on 2004-10-07 (no postal address,
sent from adtmarket.com domain)
adultloveline.com: [FP?]
- created in 2002
- 11 NANAS listings, most from 2002 and 2003
- listed on [WS]
- spam sent to a spamtrap, advertising someone's entry on the site
- sent via http://list.freemailpass.com
allofem.com [FP]
- created 2000
- NS blacklisted (conpuppy.com)
- listed on [WS]
- found in spamfeed on 2004-09-28 but may have been valid subscription by
recipient
[...]
Hi Joe, Can you provide a whitelist of these, now or when you finish categorizing them? For example, though you may have removed it from your outbound feed, webway.at seems to still be on WS and JP, probably via Raymond:
/home/prolocation/black-prolocation-master:webway.at
but this site seems to be a legitimate political and city/state portal. My point is that we should probably globally whitelist the FPs to catch them in all the lists.
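A minimal Python sketch of what a global whitelist means in practice: one set of whitelisted domains is subtracted from every list, so an FP that arrives through several feeds is removed from all of them. The function name and the sample list contents are illustrative assumptions, not the actual SURBL build process.

```python
def apply_global_whitelist(blocklists, whitelist):
    """Remove whitelisted domains from every blocklist (name -> domains)."""
    allowed = {domain.lower() for domain in whitelist}
    return {
        name: sorted(d for d in domains if d.lower() not in allowed)
        for name, domains in blocklists.items()
    }

# Hypothetical list contents; webway.at is on two lists, as in the example above.
blocklists = {"ws": ["webway.at", "spam.example"], "jp": ["webway.at"]}
cleaned = apply_global_whitelist(blocklists, {"webway.at"})
print(cleaned)  # {'ws': ['spam.example'], 'jp': []}
```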
Jeff C. -- "If it appears in hams, then don't list it."
Jeff, We also have tripod.com.ar in the WS list, can you see where this is coming from?
On Wednesday, October 6, 2004, 5:47:01 AM, Fred Fred wrote:
Jeff, We also have tripod.com.ar in the WS list, can you see where this is coming from?
Looks like it's coming from Raymond:
black-prolocation-master
See off-list message. ;-)
Jeff C. -- "If it appears in hams, then don't list it."
On Wednesday, October 6, 2004, 6:12:42 AM, Raymond Dijkxhoorn wrote:
Looks like it's coming from Raymond:
black-prolocation-master
See off-list message. ;-)
It's listed in Joe's list. And since I propagate that inside WS, it's listed there also. 1:1 ;)
If you remove it, it will be delisted automatically.
I've whitelisted tripod.com.ar. It's a web host like geocities, etc. Belongs to Lycos. I don't see them as professional spammers. Maybe a Spanish reader can tell us if their terms of service are any good at prohibiting spam hosting and other abuse:
http://www.tripod.com.ar/adm/redirect/www/membership/signup/tos.html
Jeff C. -- "If it appears in hams, then don't list it."
Jeff,
Good job/good idea regarding checking against these domains for FPs!! This may be the final step towards getting the FP rate to where we've wanted it to be.
However, please don't be in too big a hurry to whitelist all of these!
In particular, keep in mind that DMOZ is rather loosely organized and there may not be all that much careful attention as to what gets into DMOZ.
I'd suggest starting out by whitelisting only those that people point out as needing to be whitelisted and delay your decision to whitelist the others for possibly a week or more.
In the meantime, I'm going to take the list of blacklisted domains found in these directories and custom block these with my filter and, where any of these do "catch" messages, I'll have these automatically copied to a folder for inspection. As these build up, I'll report back (Friday or monday) stats regarding FPs and blocked spams.
Maybe others could do this same kind of test?
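For anyone wanting to run the same kind of test, here is a rough Python sketch of the matching step: pull hostnames out of the URIs in a message and report which ones fall under a watched domain. The regular expression, the function name and the sample domains are all illustrative assumptions. This is not how my filter or SpamAssassin actually does it.

```python
import re

# Deliberately simple pattern: captures the host part of http/https URIs only.
URI_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

def listed_domains_in(message_text, watched):
    """Return the watched domains that appear as URI hosts in the message."""
    hits = set()
    for host in URI_HOST.findall(message_text):
        labels = host.lower().strip(".").split(".")
        # Check the host and each parent domain,
        # e.g. www.blocked.example -> blocked.example
        for start in range(len(labels) - 1):
            candidate = ".".join(labels[start:])
            if candidate in watched:
                hits.add(candidate)
    return sorted(hits)

# Hypothetical message and watch list.
message = "Visit http://www.blocked.example/offer or http://fine.example/"
print(listed_domains_in(message, {"blocked.example"}))  # ['blocked.example']
```

Messages with a non-empty result would be the ones copied to the inspection folder and later sorted into ham and spam by hand.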
Rob McEwen
On Wed, Oct 06, 2004 at 06:32:46AM -0700, Jeff Chan wrote:
On Wednesday, October 6, 2004, 6:12:42 AM, Raymond Dijkxhoorn wrote:
Looks like it's coming from Raymond:
black-prolocation-master
See off-list message. ;-)
It's listed in Joe's list. And since I propagate that inside WS, it's listed there also. 1:1 ;)
If you remove it, it will be delisted automatically.
I've whitelisted tripod.com.ar. It's a web host like geocities, etc. Belongs to Lycos. I don't see them as professional spammers. Maybe a Spanish reader can tell us if their terms of service are any good at prohibiting spam hosting and other abuse:
http://www.tripod.com.ar/adm/redirect/www/membership/signup/tos.html
I'm not a lawyer, so take all of this with a pair of tweezers, but the whole thing says something like: you can use the service for anything personal (not commercial) as long as the contents are legal. There is a list of things they don't allow: sites that promote violence or discrimination, or don't comply with international treaties, permit access to or inclusion of porn, violence..., are false or inexact about objects or intentions, include things protected by intellectual property laws or trade secrets, or don't comply with the rules on secrecy of communications ...
So if it's not a commercial site and it's legal, they allow a lot of things.
Jeff C.
"If it appears in hams, then don't list it."
Discuss mailing list Discuss@lists.surbl.org http://lists.surbl.org/mailman/listinfo/discuss
On 10/7/04 10:15 AM, "Leonardo Helman" wrote:
I've whitelisted tripod.com.ar. It's a web host like geocities, etc. Belongs to Lycos. I don't see them as professional spammers. Maybe a Spanish reader can tell us if their terms of service are any good at prohibiting spam hosting and other abuse:
http://www.tripod.com.ar/adm/redirect/www/membership/signup/tos.html
I'm not a lawyer, so take all of this with a pair of tweezers, but the whole thing says something like: you can use the service for anything personal (not commercial) as long as the contents are legal. There is a list of things they don't allow: sites that promote violence or discrimination, or don't comply with international treaties, permit access to or inclusion of porn, violence..., are false or inexact about objects or intentions, include things protected by intellectual property laws or trade secrets, or don't comply with the rules on secrecy of communications
We received one of those V1codin ads with a tripod link that redirected to one of the listed med sites, I would have to dig up that email but, I was a little shocked. Maybe a new way for them to bypass the filters yet again?
On Thursday, October 7, 2004, 11:03:58 AM, David Thurman wrote:
I've whitelisted tripod.com.ar. It's a web host like geocities,
We received one of those V1codin ads with a tripod link that redirected to one of the listed med sites, I would have to dig up that email but, I was a little shocked. Maybe a new way for them to bypass the filters yet again?
Can you tell us the URL of the redirection site on tripod?
Both SpamCopURI and urirhsbl and urirhssub have ways to deal with redirection sites, but I think one of them may need a list of those redirection URIs.
Jeff C. -- "If it appears in hams, then don't list it."
On Thursday, October 7, 2004, 8:15:07 AM, Leonardo Helman wrote:
On Wed, Oct 06, 2004 at 06:32:46AM -0700, Jeff Chan wrote:
http://www.tripod.com.ar/adm/redirect/www/membership/signup/tos.html
I'm not a lawyer, so take all of this with a pair of tweezers, but the whole thing says something like: you can use the service for anything personal (not commercial) as long as the contents are legal. There is a list of things they don't allow: sites that promote violence or discrimination, or don't comply with international treaties, permit access to or inclusion of porn, violence..., are false or inexact about objects or intentions, include things protected by intellectual property laws or trade secrets, or don't comply with the rules on secrecy of communications ...
So if it's not a commercial site and it's legal, they allow a lot of things.
Would you please ask them for their policy on spamvertised sites?
Jeff C. -- "If it appears in hams, then don't list it."
I've sent them the question.
I'm taking next week off.
So if they don't answer today, you'll have to be a little patient.
Saludos -- Leonardo Helman Pert Consultores Argentina
Hi Jeff,
You might want to reconsider your use of the entire DMOZ directory. There may be some subtrees that you can ignore. Of the 1338 DMOZ false positives, how many of them are from the same sections on DMOZ?
Henry
On Wednesday, October 6, 2004, 6:37:55 AM, Henry Stern wrote:
Hi Jeff,
You might want to reconsider your use of the entire DMOZ directory. There may be some subtrees that you can ignore. Of the 1338 DMOZ false positives, how many of them are from the same sections on DMOZ?
Henry
To be honest, I've not kept track of the categories, but unless they have a "pure spammers" or spamvertised mortgages or pills category, I'm not sure we can disregard entire sections.
Jeff C. -- "If it appears in hams, then don't list it."
Daniel Quinlan, one of the principal SpamAssassin architects had some good suggestions for reducing false positives in the SURBL data. One was using public databases of URIs, particularly hand-built ones like dmoz.org and wikipedia.org or even yahoo.com as sources of mostly legitimate domains. (The wikipedia is not a web directory in a conventional sense; it's more like an open encyclopedia, but it has a relatively large collection of URIs.)
Presumably most of the URIs in these are legitimate and don't belong to spammers, especially in DMOZ since it's hand-built. So the question is: can these be useful as whitelist sources or perhaps as one of the checks on new SURBL additions.
Although directories such as DMOZ are manually edited there is a danger of spammers "grabbing" expired domains and abusing them. I've seen a lot of scripts for sale that track dmoz listed domains.....
M
Mr Michele Neylon Blacknight Internet Solutions Ltd Hosting, co-location & domains http://www.blacknight.ie/ Tel. +353 59 9137101
On Thursday, October 7, 2004, 2:47:46 AM, Michele Solutions wrote:
Although directories such as DMOZ are manually edited there is a danger of spammers "grabbing" expired domains and abusing them. I've seen a lot of scripts for sale that track dmoz listed domains.....
Thanks. That's definitely good to know about.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff,
I did the test as promised. So far, I have collected two hams and 37 spams when manually filtering on this list. (The spams include many which are duplicates where the spammer sent a batch of spams to various clients of mine). I sorted these via a quick manual judgment-call, but I'll research and double-check these more as I get time.
Here are the two FPs:
jjkeller.com (see http://www.pvsys.com/jjkeller.txt ) This was a domain name contained within a newsletter which appears to be legitimate. I showed this to my client who was the intended recipient. He said that he didn't recall subscribing to it, but he felt like it was a legitimate and informative newsletter for his industry and he said he desired to receive it. (I know, a bit weak... but still should be considered for whitelisting??). Does anyone else know anything about jjkeller.com ??
associateprograms.com (see http://www.pvsys.com/associateprograms.txt ) This is a very reputable site for teaching people how to make a living from affiliate advertising. It is very white-hat and does NOT encourage people to spam or to harvest addresses. It encourages use of legitimate opt-in advertising, building web sites, and using pay-per-clicks to advertise affiliate links. It probably got listed due to an open loop signup form (but I'm just speculating). There is a link on this site called "Want to fool spam filters?" ...don't let this link fool you. This page is really about dealing with filters which are out of control, block legitimate mail, and where the mail provider is unwilling to whitelist trusted sender/receiver combos and unwilling to explain why any particular message got blocked.
I'll keep sending this FP stuff from this list as I receive it in my custom filter... and, when I get more time, I'll (also) send a list and a link to the stuff that I deemed as spam.
Rob McEwen
On Thursday, October 7, 2004, 5:55:07 AM, Rob McEwen wrote:
Jeff,
I did the test as promised. So far, I have collected two hams and 37 spams when manually filtering on this list. (The spams include many which are duplicates where the spammer sent a batch of spams to various clients of mine). I sorted these via a quick manual judgment-call, but I'll research and double-check these more as I get time.
Here are the two FPs:
jjkeller.com (see http://www.pvsys.com/jjkeller.txt ) This was a domain name contained within a newsletter which appears to be legitimate. I showed this to my client who was the intended recipient. He said that he didn't recall subscribing to it, but he felt like it was a legitimate and informative newsletter for his industry and he said he desired to receive it. (I know, a bit weak... but should it still be considered for whitelisting?) Does anyone else know anything about jjkeller.com?
All the domains in that newsletter look like legitimate sellers of industrial safety equipment and hardly spam candidates:
msha.gov stevenspublishing.com ohsonline.com lss.com apbuck.com huserinc.com jjkeller.com
The only "controversial" one might be:
processrequest.com
which appears in newsletters a lot and belongs to a marketing company that we already whitelisted.
Therefore I've whitelisted all of the above, including jjkeller.com, the source of which is:
/home/wstearns/black-wstearns-2004-07:jjkeller.com
/home/wstearns/black-wstearns-2004-07:jjkellermail.com
/home/wstearns/black-wstearns-hand-checked:jjkellermail.com
/home/wstearns/black-wstearns-hand-checked-2004-07:jjkellermail.com
/home/wstearns/black-wstearns-sa-blacklist.200406281446.domains:jjkeller.com
/home/wstearns/black-wstearns-sa-blacklist.200406281446.domains:jjkellermail.com
associateprograms.com (see http://www.pvsys.com/associateprograms.txt ) This is a very reputable site for teaching people how to make a living from affiliate advertising. It is very white-hat and does NOT encourage people to spam or to harvest addresses. It encourages use of legitimate opt-in advertising, building web sites, and using pay-per-clicks to advertise affiliate links. It probably got listed due to an open loop signup form (but I'm just speculating). There is a link on this site called "Want to fool spam filters?" ...don't let this link fool you. This page is really about dealing with filters which are out of control, block legitimate mail, and where the mail provider is unwilling to whitelist trusted sender/receiver combos and unwilling to explain why any particular message got blocked.
This bunch is more uncertain since they're all possible spam candidates:
mp3dollars.com associateprograms.com affiliatesuccess.net liutilities.com webmastersreference.com payperclicksearchengines.com lifetimecustomers.com lifetimecommissions.com
However, only associateprograms.com is currently listed. It looks like it probably came into WS from BigEvil:
/web/antispam/bigevil.domains:associateprograms.com
/home/wstearns/black-wstearns-sa-blacklist.200406281446.domains:associateprograms.com
It's a 1998 domain with no SBL listing but 10 NANAS hits. Given how few NANAS reports there are, I'd agree they could be a legitimate affiliate program that gets a little abuse.
I'm going to go ahead and whitelist them, but would like to get feedback from anyone with additional comments.
I'll keep sending this FP stuff from this list as I receive it in my custom filter... and, when I get more time, I'll (also) send a list and a link to the stuff that I deemed as spam.
Yes, please keep sending any that you find.
Jeff C. -- "If it appears in hams, then don't list it."
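The path:domain lines quoted in this thread look like the output of an exact-line search across several source lists. The following is only a sketch of how such a lookup might be done, not necessarily how it was done here. The function name find_sources, the file names, and the example.* domains are made up for illustration, and it assumes a grep that supports -H (GNU and BSD grep both do).

```shell
#!/bin/sh
# Sketch: report which source lists contain a domain as an exact line.
# File names and domains below are hypothetical placeholders.

# find_sources DOMAIN FILE...
# Prints "file:domain" for every file that holds DOMAIN as a whole line.
find_sources() {
    domain=$1
    shift
    # -x: match the whole line only
    # -F: treat the domain as a fixed string, so "." is not a wildcard
    # -H: always print the file name, even when only one file is given
    grep -xFH -- "$domain" "$@"
}

# Demo with throwaway data.
tmp=$(mktemp -d)
printf '%s\n' example.com example.net > "$tmp/list-a.domains"
printf '%s\n' example.org notexample.com > "$tmp/list-b.domains"

# Only list-a.domains is reported; "notexample.com" is not an exact match.
find_sources example.com "$tmp"/*.domains

rm -rf "$tmp"
```

Matching whole lines with a fixed string matters for this kind of check: a plain substring or regex search for jjkeller.com would also hit jjkellermail.com-style neighbours or unrelated domains, which would blur where a listing actually came from.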