First, great work Jeff.
While 102k domains isn't nearly as large as the 2.3M in dmoz, it's certainly more than the 12k or so whitelist records we currently have. How does the intersected list look as a potential whitelist?
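For reference, an intersection like that can be generated with standard Unix tools; the filenames below are just placeholders for the actual data files:
% sort wikipedia.domains > wp.sorted
% sort dmoz.domains > dmoz.sorted
% comm -12 wp.sorted dmoz.sorted > intersected.domains
(comm -12 prints only the lines common to both sorted files.)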
I think this is just plain nuts to whitelist all of these! Why? If we don't try to whitelist the most popular sites, then what the heck is the point? We could whitelist millions of legit domains forever. The popular ones are the most important.
Here is one from the above list. Why would listing this help us? http://oigawa-railway.co.jp/ (looks like a real popular site huh!)
Please also take a look at these blocklist hits (potential FPs) and share what you think:
http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt
I picked a few of these that may give us problems, and none of them met our current criteria to list. (sissy-world.com, good grief that had to be a man at one time!) With the ability to now see whitelisted domains in the crossref page, I don't see a problem with whitelisting all of these. If they do start spamming again, we can see they are whitelisted and remove them.
so: -1 for adding all those intersected to WL
+1 for whitelisting the blacklist hits.
--Chris
On Friday, October 8, 2004, 8:44:39 AM, Chris Santerre wrote:
I think this is just plain nuts to whitelist all of these! Why? If we don't try to whitelist the most popular sites, then what the heck is the point? We could whitelist millions of legit domains forever. The popular ones are the most important.
Good point. Since these sources mention some domains multiple times, we do have frequency data per domain. I will look at taking some top percentiles of those and see what the results look like. For example:
% head -25 dmoz/domains.count.20041007
231779 cnn.com
104081 geocities.com
38410 tripod.com
31382 angelfire.com
21415 aol.com
20394 topix.net
15293 yahoo.com
14092 newadvent.org
8847 imdb.com
8319 homestead.com
7224 perso.wanadoo.fr
6779 weather.com
6064 membres.lycos.fr
4999 fortunecity.com
4594 digilander.libero.it
4480 freeserve.co.uk
4318 bbc.co.uk
4265 home.t-online.de
4043 freewebs.com
3948 demon.co.uk
3902 web.tiscali.it
3839 rootsweb.com
3670 nifty.com
3648 8m.com
3496 faqs.org
% head -25 wikipedia/wikipedia.domains.count.20041002
8958 imdb.com
8577 google.com
5921 bbc.co.uk
4548 utexas.edu
3929 geocities.com
3114 wikipedia.com
2789 multimap.com
2766 sourceforge.net
2687 livedepartureboards.co.uk
2077 yahoo.com
1510 allmusic.com
1501 cnn.com
1491 guardian.co.uk
1481 wiktionary.org
1313 dmoz.org
1306 civicheraldry.co.uk
1250 nlm.nih.gov
1222 harvard.edu
1160 itis.usda.gov
1150 bundesrecht.juris.de
1135 amazon.com
1111 wikibooks.org
1107 gutenberg.net
1098 jpl.nasa.gov
1057 gnu.org
(The wikipedia-owned ones would come out.)
Also I'd like to subtract any with SBL hits.
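Roughly the kind of thing I have in mind, sketched with standard tools; the cutoff of 1000 and the sbl-hits.txt file (one SBL-listed domain per line) are placeholders:
% sort -rn dmoz/domains.count.20041007 | awk '{print $2}' | head -1000 > top.domains
% sort top.domains > top.sorted
% sort sbl-hits.txt > sbl.sorted
% comm -23 top.sorted sbl.sorted > whitelist-candidates
(comm -23 keeps the top domains that do not appear in the SBL hit list.)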
So this is just a starting point; there may be other sets of data worth using too.
But we definitely need to find ways to reduce the FPs, most of which are still in WS.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
But we definitely need to find ways to reduce the FPs, most of which are still in WS.
Jeff,
You've gone through all the ws data files...
Would it be possible to get those with the FPs you've found deleted for good? Not just the entries but the files.
are there lots of bigevil leftovers?
or are we contributors messing up?
pls point it out clearly so Bill can take action.
Alex
On Friday, October 8, 2004, 9:31:09 AM, Alex Broens wrote:
You've gone through all the ws data files...
Would it be possible to get those with the FPs you've found deleted for good? Not just the entries but the files.
are there lots of bigevil leftovers?
or are we contributors messing up?
We're trying to sort it out. It does seem like some of the stale old data may be full of FPs. We're continuing to track them down; as you know, we have some research going on about that.
It might help for Bill and the other WS data source folks to look into re-constructing WS from some fresher data that has stricter standards for inclusion, such as the draft at:
http://www.surbl.org/policy.html
Jeff C. -- "If it appears in hams, then don't list it."
On Friday, October 8, 2004, 9:31:09 AM, Alex Broens wrote:
Jeff Chan wrote:
But we definitely need to find ways to reduce the FPs, most of which are still in WS.
Jeff,
You've gone through all the ws data files...
I should add that I will check the wikipedia and wikipedia-dmoz domains against the individual WS data sources also, like we did privately for the first dmoz data set.
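Something along these lines should do for the per-source check; the ws-sources/ path is just a placeholder for wherever the individual data files live:
% sort wikipedia/wikipedia.domains > wp.sorted
% for f in ws-sources/*.domains; do echo "== $f"; sort "$f" | comm -12 wp.sorted -; done
That prints, for each source file, the domains it shares with the wikipedia set.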
Jeff C. -- "If it appears in hams, then don't list it."