On Thursday, September 23, 2004, 9:15:59 PM, Joe Wein wrote:
Alexa by Amazon.com has a top 500 list on its site, which it derives from stats collected via its Alexa toolbar plugin. This may be a good source of whitelist data.
Any site scoring that high has the potential to cause a lot of collateral damage if blacklisted, since these appear to be sites that lots of real-life users *do* visit regularly, as opposed to sites that advertisers suggest they visit. That makes them likely to be mentioned in legitimate personal or business e-mail. Sites popular enough to be there probably have far more to lose than to gain from spamming anyway.
I took the HTML from Alexa's five pages which listed 100 sites each, did a bit of text editing and hey presto: here's the list as an attached ASCII file.
A quick check against my local blacklist yielded exactly 0 intersections :-)
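A check like this can be sketched as a set intersection. This is only an illustration, not the script actually used here: the function names and the normalization rules (lowercasing, dropping a leading "www." and a trailing dot, skipping blank lines and "#" comments) are assumptions.

```python
def normalize(domain: str) -> str:
    """Lowercase a domain and drop a trailing dot and a leading 'www.'."""
    d = domain.strip().lower().rstrip(".")
    return d[4:] if d.startswith("www.") else d


def load_domains(lines):
    """Build a set of normalized domains, skipping blanks and '#' comments."""
    return {
        normalize(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def overlap(candidates, blacklist):
    """Return the sorted domains that appear in both lists."""
    return sorted(load_domains(candidates) & load_domains(blacklist))
```

Both arguments can be any iterable of lines, for example two open text files with one domain per line. An empty result corresponds to the "0 intersections" outcome.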
[...]
About a third of the top 500 sites (160) were already in my local whitelist. I'll probably add the rest to my whitelist too.
Anybody here who can bulk-check these against SURBL, in case there are listed sites?
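For reference, a bulk check of this kind might look like the sketch below. It assumes the usual DNS blocklist convention (a listed domain, prepended to the list's zone, resolves to an address in 127.0.0.0/8, and an unlisted one does not resolve) and assumes the combined SURBL zone is multi.surbl.org. The helper names are hypothetical, and the resolver is passed in as a parameter so the logic can be tried without touching the network.

```python
import socket

# Assumption: SURBL's combined zone; adjust if a different list is wanted.
SURBL_ZONE = "multi.surbl.org"


def surbl_query_name(domain, zone=SURBL_ZONE):
    """Build the DNS name to look up for a given domain."""
    return f"{domain.strip().lower().rstrip('.')}.{zone}"


def is_listed(domain, resolve=socket.gethostbyname):
    """Return True if the lookup answers with a 127.x.x.x address."""
    try:
        answer = resolve(surbl_query_name(domain))
    except OSError:
        # No such name (or a lookup failure): treated as "not listed" here.
        return False
    return answer.startswith("127.")


def bulk_check(domains, resolve=socket.gethostbyname):
    """Return the subset of domains that appear to be listed."""
    return [d for d in domains if is_listed(d, resolve)]
```

Note that this sketch also treats lookup failures such as timeouts as "not listed", so a real run would want to distinguish those cases and keep the query rate modest.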
Joe
Way ahead of you, Joe. I whitelisted the Alexa 500 when we started, so you won't find them on SURBLs. :-) I don't mention it because I don't want to know what Alexa's licensing policies are. Thanks for thinking of it though. :-)
I agree with your reasoning. Popular sites are more likely to be legitimate and get mentioned in hams, and blocklisting them could cause a lot of FPs. So they should stay off.
And yes, it does include some hosting sites and ISPs in Asia that are occasionally mentioned in casual spam. Most of these ISPs have AUPs *on their own domains*, so *their own domains* are probably not a major source of spam hosting. This does not prevent us from listing any of their customers who spam.
Does anyone else have other potential whitelist sources like this?
Jeff C. -- "If it appears in hams, then don't list it."