[SURBL-Discuss] Re: Revised DMOZ data, got Wikipedia domains too

Jeff Chan jeffc at surbl.org
Fri Oct 8 18:12:38 CEST 2004


On Friday, October 8, 2004, 8:44:39 AM, Chris Santerre wrote:

> I think this is just plain nuts to whitelist all of these! Why? If we don't
> try to whitelist the most popular sites, then what the heck is the point? We
> could whitelist millions of legit domains forever. The popular ones are the
> most important. 

Good point.  Since these sources mention some domains many
times, we do have frequency data per domain.  I will look at
taking the top percentiles by mention count and see what the
results look like.  For example:

% head -25 dmoz/domains.count.20041007
231779  cnn.com
104081  geocities.com
38410   tripod.com
31382   angelfire.com
21415   aol.com
20394   topix.net
15293   yahoo.com
14092   newadvent.org
8847    imdb.com
8319    homestead.com
7224    perso.wanadoo.fr
6779    weather.com
6064    membres.lycos.fr
4999    fortunecity.com
4594    digilander.libero.it
4480    freeserve.co.uk
4318    bbc.co.uk
4265    home.t-online.de
4043    freewebs.com
3948    demon.co.uk
3902    web.tiscali.it
3839    rootsweb.com
3670    nifty.com
3648    8m.com
3496    faqs.org

% head -25 wikipedia/wikipedia.domains.count.20041002
8958    imdb.com
8577    google.com
5921    bbc.co.uk
4548    utexas.edu
3929    geocities.com
3114    wikipedia.com
2789    multimap.com
2766    sourceforge.net
2687    livedepartureboards.co.uk
2077    yahoo.com
1510    allmusic.com
1501    cnn.com
1491    guardian.co.uk
1481    wiktionary.org
1313    dmoz.org
1306    civicheraldry.co.uk
1250    nlm.nih.gov
1222    harvard.edu
1160    itis.usda.gov
1150    bundesrecht.juris.de
1135    amazon.com
1111    wikibooks.org
1107    gutenberg.net
1098    jpl.nasa.gov
1057    gnu.org

(The Wikipedia-owned ones, e.g. wikipedia.com, wiktionary.org
and wikibooks.org, would be removed.)

Also I'd like to remove any domains with SBL hits.
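
A minimal sketch of the percentile idea, assuming the count files
keep the "COUNT  DOMAIN" layout shown above.  The function names,
the percentile cutoff, and the `exclude` set are hypothetical: the
message does not say how the cut would be made or how SBL hits
would be looked up, so the SBL and Wikipedia-owned domains are
passed in here as a plain set.  The sketch also assumes the cut is
taken first and the exclusions are subtracted afterwards.

```python
import math

# A few lines copied from the Wikipedia count listing above.
SAMPLE = [
    "8958    imdb.com",
    "8577    google.com",
    "5921    bbc.co.uk",
    "4548    utexas.edu",
    "3929    geocities.com",
    "3114    wikipedia.com",
]


def parse_counts(lines):
    """Parse 'COUNT  DOMAIN' lines into (domain, count) pairs.

    Lines that do not have exactly two fields, or whose first
    field is not a non-negative integer, are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pairs.append((parts[1].lower(), int(parts[0])))
    return pairs


def top_percentile(pairs, percent, exclude=frozenset()):
    """Return the domains in the top `percent` of entries by count.

    The cut is made on the ranked list first; domains in `exclude`
    (hypothetically: project-owned or SBL-listed domains) are then
    dropped from the result.
    """
    if not pairs:
        return []
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return [domain for domain, _ in ranked[:keep] if domain not in exclude]
```

For instance, top_percentile(parse_counts(SAMPLE), 50) keeps the
three most-mentioned of the six sample domains, and passing
exclude={"wikipedia.com"} with percent=100 returns the other five.
Where to set the cutoff is exactly the open question here.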

So this is just a starting point, or another set of data to use.

But we definitely need to find ways to reduce the FPs, most of
which are still in WS.

Jeff C.
--
"If it appears in hams, then don't list it."