Following up on the earlier check of DMOZ domains against SURBL data, I applied some of Quinlan's suggestions: I grabbed different revisions of the DMOZ data and joined (intersected) them in order to eliminate changes such as editor-removed spammer/abuser domains. This also means that new additions are ignored, but the corpus is so large that the benefit of capturing editor removals is probably more important.
The snapshots are the three most recent available, dated 9/9/04, 9/25/04 and 10/7/04 (file dates). Intersecting them results in fewer records, keeping only those that are constant across all three snapshots:
http://rdf.dmoz.org/rdf/archive/
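The intersection itself is plain Unix text processing; a minimal sketch, assuming the domains from each snapshot have already been extracted into one-per-line files (the snapshot file names here are made up):

    # Sketch only; the per-snapshot domain files are assumptions.
    sort -u domains-20040909.txt > s1.srt
    sort -u domains-20040925.txt > s2.srt
    sort -u domains-20041007.txt > s3.srt
    # comm -12 prints only the lines common to both sorted inputs
    comm -12 s1.srt s2.srt | comm -12 - s3.srt > dmoz.srt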
Here are the line, word and character counts:
2300851 2300851 38065969 dmoz.srt
1169 11690 123909 dmoz-blocklist.summed.txt
1141 11410 120860 dmoz-blocklist.ws
1169 1169 17977 dmoz-blocklist.txt
7394 7394 97011 dmoz-whitelist.txt
The above are the revised versions of the joins: against the blocklists with list info (summed), with ws hits, with just the domains, and against the whitelist.
These are in the whitelists directory, though we are still *not* applying the dmoz domains as whitelists:
http://spamcheck.freeapp.net/whitelists/
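For reference, the hit files come out of more of the same sort/join plumbing; a rough sketch, where bl-domains.txt is a stand-in name for the local copy of the SURBL blocklist data:

    # Sketch only; bl-domains.txt is an assumed input name.
    sort -u bl-domains.txt > bl.srt
    # join on sorted single-column files prints the common domains
    join dmoz.srt bl.srt > dmoz-blocklist.txt
    wc dmoz.srt dmoz-blocklist.summed.txt dmoz-blocklist.ws \
       dmoz-blocklist.txt dmoz-whitelist.txt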
The previous dmoz data (9/25 version only, IIRC) and its hits are archived as:
2326173 2326173 38494184 dmoz.srt1
1338 13380 141946 dmoz-blocklist1.summed.txt
1173 11730 124298 dmoz-blocklist1.ws
1338 1338 20533 dmoz-blocklist1.txt
7375 7375 96720 dmoz-whitelist1.txt
I was also able to grab four snapshots of Wikipedia, all sections (all languages). Only the two most recent snapshots had comparable numbers of sections, so I used only those two. Using only two sets may be OK, since these are much smaller corpora and there's probably more hand-editing of them.
http://download.wikimedia.org/
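Getting from the dumps to a domain list is again just pipe work; a rough sketch, assuming the external-link URLs have already been pulled out of the dumps into urls.txt:

    # Sketch only; urls.txt (one URL per line) is an assumption.
    # This keeps full hostnames; reducing www.example.com to
    # example.com would take another pass.
    sed -n 's|^[a-zA-Z]*://\([^/:?]*\).*|\1|p' urls.txt |
      tr 'A-Z' 'a-z' | sort -u > wikipedia.srt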
Where DMOZ has about 2.3 million domains, Wikipedia has about 174k domains:
173828 173828 2633441 wikipedia.srt
188 1880 19631 wikipedia-blocklist.summed.txt
188 188 2713 wikipedia-blocklist.txt
2437 2437 29581 wikipedia-whitelist.txt
I also took the intersection of the three DMOZ snapshots and the two Wikipedia snapshots to get a smaller list containing only the ~102k domains found in both Wikipedia and DMOZ:
101619 101619 1498653 wikipedia-dmoz.srt
116 1160 11928 wikipedia-dmoz-blocklist.summed.txt
116 116 1591 wikipedia-dmoz-blocklist.txt
2223 2223 26854 wikipedia-dmoz-whitelist.txt
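That list is one more comm over the two sorted files; a minimal sketch:

    # Sketch only: keep the domains present in both sorted lists.
    comm -12 dmoz.srt wikipedia.srt > wikipedia-dmoz.srt
    wc wikipedia-dmoz.srt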
This intersection of DMOZ and Wikipedia domains probably represents the best hope for large whitelist additions so far. The data is surely imperfect, but at least it has had some checking by human editors, and some techniques have been applied to reduce spammer domains, namely comparing the snapshots over time and intersecting two relatively unrelated sources.
Can anyone think of any other hand-edited databases, directories, encyclopedias, etc. of URIs of hopefully legitimate (non-spammer) domains that are publicly available? Please think about it a little, and speak up!
While 102k domains isn't nearly as many as the 2.3M in DMOZ, it's certainly more than the 12k or so whitelist records we currently have. How does the intersected list look as a potential whitelist?
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Probably not a new idea, but why not run a "wl.surbl.org" with all the whitelisted domains, so people can choose whether or not to use it?
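Clients could query it the same way they query the existing lists; a hypothetical lookup (wl.surbl.org is only a proposal, not a deployed zone):

    dig +short example.com.wl.surbl.org A
    # an answer like 127.0.0.2 would mean "whitelisted"; NXDOMAIN, not listed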
Alex
----- Original Message ----- From: "Alex Broens" surbl@alexb.ch
Jeff Chan wrote:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Probably not a new idea, but why not run a "wl.surbl.org" with all the whitelisted domains, so people can choose whether or not to use it?
I like this idea! Whitelist the most commonly used 1,000 or so domains, and then create a wl.surbl.org for the rest of the wikipedia-dmoz domains.
Bill
Bill Landry wrote:
----- Original Message ----- From: "Alex Broens" surbl@alexb.ch
Jeff Chan wrote:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Probably not a new idea, but why not run a "wl.surbl.org" with all the whitelisted domains, so people can choose whether or not to use it?
I like this idea! Whitelist the most commonly used 1,000 or so domains, and then create a wl.surbl.org for the rest of the wikipedia-dmoz domains.
WOW... Bill didn't bark at me this time.
my point is the following:
Take for example "angelfire. com". This domain may have legitimate users, but my user base would NEVER have contact with anybody hosting a site or anything there. If they support spam, list them, put pressure on them to stop supporting spammers, bla, bla, bla. I wouldn't appreciate it being whitelisted, because then even if there's abuse and it does get blacklisted, the whitelisting means there's no pressure on the domain holder to clean up.
As I imagine we're fighting spam here, not just filtering, I have a certain difficulty understanding why the world is crying for whitelisting instead of putting pressure on so-called whitehats who support abuse for a lifetime.
As Chris said, you could make whitelisting a lifetime task. I believe the better approach would be to decrease potential FPs by increasing the reporting QUALITY!!!!!!!!
Alex
//Are we fighting Spam or working for Messagelabs & Co. for free? //
----- Original Message ----- From: "Alex Broens" surbl@alexb.ch
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Probably not a new idea, but why not run a "wl.surbl.org" with all the whitelisted domains, so people can choose whether or not to use it?
I like this idea! Whitelist the most commonly used 1,000 or so domains, and then create a wl.surbl.org for the rest of the wikipedia-dmoz domains.
WOW... Bill didn't bark at me this time.
I have been trying to bark less and purr more. ;-)
Apologies for any previous transgressions - must have been too much coffee and too little sleep (that's my excuse and I'm sticking by it!).
my point is the following:
Take for example "angelfire. com". This domain may have legitimate users, but my user base would NEVER have contact with anybody hosting a site or anything there. If they support spam, list them, put pressure on them to stop supporting spammers, bla, bla, bla. I wouldn't appreciate it being whitelisted, because then even if there's abuse and it does get blacklisted, the whitelisting means there's no pressure on the domain holder to clean up.
That's why creating a WL SURBL (or SURWL) might be a good idea. Then those who have more tolerance for spam can use it to reduce the weight of potential FPs from those sometimes-abused domains that periodically get listed on one of the other SURBL lists.
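A rough sketch of how SpamAssassin 3.0's URIDNSBL plugin might consume such a list as a negative-scoring rule (the zone is still only a proposal, and the rule name and score are made up):

    urirhsbl   URIBL_WL_SURBL  wl.surbl.org.   TXT
    body       URIBL_WL_SURBL  eval:check_uridnsbl('URIBL_WL_SURBL')
    describe   URIBL_WL_SURBL  URI domain appears on the proposed WL SURBL
    tflags     URIBL_WL_SURBL  net nice
    score      URIBL_WL_SURBL  -3.0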
As I imagine we're fighting spam here, not just filtering, I have a certain difficulty understanding why the world is crying for whitelisting instead of putting pressure on so-called whitehats who support abuse for a lifetime.
Agreed, but that is a process that is taking its own sweet time to evolve.
As Chris said, you could make whitelisting a lifetime task. I believe the better approach would be to decrease potential FPs by increasing the reporting QUALITY!!!!!!!!
Indeed. However, I still like your idea of a WL list over outright whitelisting, except for the most common legit domains.
Bill
Alex Broens writes:
Take for example "angelfire. com". This domain may have legitimate users, but my user base would NEVER have contact with anybody hosting a site or anything there. If they support spam, list them, put pressure on them to stop supporting spammers, bla, bla, bla. I wouldn't appreciate it being whitelisted, because then even if there's abuse and it does get blacklisted, the whitelisting means there's no pressure on the domain holder to clean up.
Your user base, maybe. But I can't see how you can justify assuming that everyone who uses SURBL has the same user base. I'd never assume that SpamAssassin should not be usable by someone who may expect to receive mail from their kids regarding an angelfire-hosted school project like http://www.angelfire.com/on2/thrillsandchills/ , for example.
As I imagine we're fighting spam here, not just filtering, I have a certain difficulty understanding why the world is crying for whitelisting instead of putting pressure on so-called whitehats who support abuse for a lifetime.
are "we"?
I'm certainly filtering. ;)
Is "fighting spam" and "putting pressure on so called whitehats" a goal of surbl.org?
This attitude is what makes SPEWS useless.
--j.
On Friday, October 8, 2004, 9:51:23 AM, Alex Broens wrote:
Take for example "angelfire. com". This domain may have legitimate users, but my user base would NEVER have contact with anybody hosting a site or anything there. If they support spam, list them, put pressure on them to stop supporting spammers, bla, bla, bla. I wouldn't appreciate it being whitelisted, because then even if there's abuse and it does get blacklisted, the whitelisting means there's no pressure on the domain holder to clean up.
We can't blacklist an entire major hosting provider just because they have some minor abuse issues. That's not what we're doing.
I believe the better approach would be to decrease potential FPs by increasing the reporting QUALITY!!!!!!!!
Agreed.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
On Friday, October 8, 2004, 9:51:23 AM, Alex Broens wrote:
Take for example "angelfire. com". This domain may have legitimate users, but my user base would NEVER have contact with anybody hosting a site or anything there. If they support spam, list them, put pressure on them to stop supporting spammers, bla, bla, bla. I wouldn't appreciate it being whitelisted, because then even if there's abuse and it does get blacklisted, the whitelisting means there's no pressure on the domain holder to clean up.
We can't blacklist an entire major hosting provider just because they have some minor abuse issues. That's not what we're doing.
Tripod and GeoCities have given spammers a home for many years... and there are many other freehosters who have become "victims" as well.
If we're not permitted to blacklist them, at least treat them as second-level TLDs so we can get rid of the spam coming from them.
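In other words, when the base domain is a known freehost, list and look up three labels instead of two. A minimal sketch of the idea (the freehosts.txt file and the host name are made up):

    # Sketch only; freehosts.txt would list shared hosts like tripod.com.
    host=spammer123.tripod.com
    base=`echo $host | awk -F. '{print $(NF-1)"."$NF}'`
    if grep -qx "$base" freehosts.txt; then
      key=`echo $host | awk -F. '{print $(NF-2)"."$(NF-1)"."$NF}'`
    else
      key=$base
    fi
    echo $key   # spammer123.tripod.com, not tripod.com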
I believe the better approach would be to decrease potential FPs by increasing the reporting QUALITY!!!!!!!!
Agreed.
:-)
Thanks Alex
On Friday, October 8, 2004, 11:37:03 PM, Alex Broens wrote:
Tripod and GeoCities have given spammers a home for many years... and there are many other freehosters who have become "victims" as well.
If we're not permitted to blacklist them, at least treat them as second-level TLDs so we can get rid of the spam coming from them.
That's possible, but not something we've done so far. I don't see shared domain hosting as a major spam destination. Maybe it's a minor annoyance, but nothing like the pill spammers hosted in China, Korea, Brazil, etc.
The difference is that geocities, tripod, angelfire, etc. ought to have some incentive to get rid of these minor abusers, since they make so little money from them. The pill/mortgage/warez/etc. spam hosters probably make a lot more money from their spamming customers, so they have an incentive to keep them.
Where there is a legitimate company like Yahoo or Lycos (the parents of geocities, tripod, etc.) to police customers and enforce AUPs, spam victims should report the abuse and let those companies do the enforcing. Since they are mostly legitimate and probably do spend some resources dealing with abuse, they have an incentive to reduce it: abuse probably costs them more money than it gains them.
One of the main reasons for doing SURBLs was to be able to do something about the hosting companies who don't have AUPs against spam or don't enforce them.
Bottom line is that tripod, etc. are mostly irrelevant compared to the professional spam gangs who send a lot more spam. SURBLs are for listing the domains of the bigger fish who have found spam-friendly hosts, regardless of where those hosts are.
Jeff C. -- "If it appears in hams, then don't list it."
On Friday, October 8, 2004, 9:26:37 AM, Bill Landry wrote:
----- Original Message ----- From: "Alex Broens" surbl@alexb.ch
Jeff Chan wrote:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
Would there be many FNs (missed spams) if we whitelisted all of these? In other words, are these all truly false positives? If not, which ones do you feel are true spammers, and why?
Probably not a new idea, but why not run a "wl.surbl.org" with all the whitelisted domains, so people can choose whether or not to use it?
I like this idea! Whitelist the most commonly used 1,000 or so domains, and then create a wl.surbl.org for the rest of the wikipedia-dmoz domains.
As Chris mentions, applications using SURBLs are being updated to not even check the top N whitehat domains like yahoo, w3.org, etc.:
http://bugzilla.spamassassin.org/show_bug.cgi?id=3805
http://bugzilla.spamassassin.org/show_bug.cgi?id=3886
That way they don't even incur DNS lookups, which saves considerable network time and DNS traffic.
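The change in those bugs boils down to a config directive for the URIDNSBL plugin along these lines (the domains shown are just examples; the real shipped list is longer):

    # never look these domains up against any SURBL
    uridnsbl_skip_domain yahoo.com w3.org msn.com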
Jeff C. -- "If it appears in hams, then don't list it."