On Friday, October 8, 2004, 8:44:39 AM, Chris Santerre wrote:
I think this is just plain nuts to whitelist all of these! Why? If we don't try to whitelist the most popular sites, then what the heck is the point? We could whitelist millions of legit domains forever. The popular ones are the most important.
Good point. Since these sources mention some domains multiple times, we do have frequency data per domain. I will look at taking some top percentiles of those and see what the results look like. For example:
% head -25 dmoz/domains.count.20041007
 231779 cnn.com
 104081 geocities.com
  38410 tripod.com
  31382 angelfire.com
  21415 aol.com
  20394 topix.net
  15293 yahoo.com
  14092 newadvent.org
   8847 imdb.com
   8319 homestead.com
   7224 perso.wanadoo.fr
   6779 weather.com
   6064 membres.lycos.fr
   4999 fortunecity.com
   4594 digilander.libero.it
   4480 freeserve.co.uk
   4318 bbc.co.uk
   4265 home.t-online.de
   4043 freewebs.com
   3948 demon.co.uk
   3902 web.tiscali.it
   3839 rootsweb.com
   3670 nifty.com
   3648 8m.com
   3496 faqs.org
% head -25 wikipedia/wikipedia.domains.count.20041002
   8958 imdb.com
   8577 google.com
   5921 bbc.co.uk
   4548 utexas.edu
   3929 geocities.com
   3114 wikipedia.com
   2789 multimap.com
   2766 sourceforge.net
   2687 livedepartureboards.co.uk
   2077 yahoo.com
   1510 allmusic.com
   1501 cnn.com
   1491 guardian.co.uk
   1481 wiktionary.org
   1313 dmoz.org
   1306 civicheraldry.co.uk
   1250 nlm.nih.gov
   1222 harvard.edu
   1160 itis.usda.gov
   1150 bundesrecht.juris.de
   1135 amazon.com
   1111 wikibooks.org
   1107 gutenberg.net
   1098 jpl.nasa.gov
   1057 gnu.org
(The wikipedia-owned ones would come out.)
Also I'd like to subtract any with SBL hits.
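To make the percentile idea concrete, here is a rough sketch in Python. It is only an illustration, not the actual processing: it assumes the count files are plain "COUNT DOMAIN" lines as in the listings above, that "top percentile" means the top N percent of distinct domains ranked by count, and that the SBL hits are already available as a set of domain names. The function names and the empty sbl_hits set are made up for the example.

```python
import math

def parse_counts(lines):
    """Parse 'COUNT DOMAIN' lines into (count, domain) pairs."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pairs.append((int(parts[0]), parts[1].lower()))
    return pairs

def top_percentile(pairs, percent, exclude=frozenset()):
    """Return domains in the top `percent` of entries by count, minus `exclude`."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100.0)
    return [domain for _, domain in ranked[:keep] if domain not in exclude]

# First few lines of the wikipedia listing, used as sample input.
sample = """\
8958 imdb.com
8577 google.com
5921 bbc.co.uk
4548 utexas.edu
3929 geocities.com
3114 wikipedia.com
""".splitlines()

wikipedia_owned = {"wikipedia.com"}  # the wikipedia-owned ones come out
sbl_hits = set()                     # hypothetical: domains with SBL hits

candidates = top_percentile(parse_counts(sample), 50, wikipedia_owned | sbl_hits)
print(candidates)
```

The cutoff is taken before the exclusions are applied, so removing wikipedia-owned or SBL-listed domains shrinks the result rather than pulling in lower-ranked domains to replace them. Whether that is the right order is a judgment call.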
So this is just a starting point, or another set of data to use.
But we definitely need to find ways to reduce the FPs, most of which are still in WS.
Jeff C.
--
"If it appears in hams, then don't list it."