On Saturday, September 18, 2004, 2:38:29 AM, David Hooton wrote:
On Sat, 18 Sep 2004 00:33:59 -0700, Jeff Chan jeffc@surbl.org wrote:
Most of the data looks pretty regular, but one difference is that the mailpolice data has some records like these:
<snip>
which we would typically try to reduce to their base (registrar) domains. Reducing would cause some obvious false positives, for example comcast.net, if we did not happen to whitelist it.
Hmm, this is not great.
One solution would be to not reduce. Another would be to discard these longer domains, but it's not too easy to detect which ones to discard. Neither solution is really great, but they're both better than reducing, because of the FPs that would create.
This is probably the best approach.
Thanks for the feedback! :-)
BTW, for anyone who wants to check them out, the slightly processed list, which would go into PH, is at:
http://spamcheck.freeapp.net/mailpolice-fraud.srt
The changes are my standard ones:
1. force to lower case
2. discard records that have other than [a-z0-9.-] (original style domain name restrictions)
Unusually, in this case I don't try to reduce gTLDs to two levels.
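For illustration only, the two standard changes could be sketched in Python roughly as follows. This is not the actual processing script: the function name clean_records, the one-record-per-line input, and the trimming of surrounding whitespace are all assumptions made for the example.

```python
import re

# Original-style domain name restriction: only a-z, 0-9, dot and hyphen.
VALID_RECORD = re.compile(r"[a-z0-9.-]+")


def clean_records(lines):
    """Lower-case each record and drop any with characters outside [a-z0-9.-].

    Hypothetical helper; assumes one domain record per input line.
    """
    kept = []
    for line in lines:
        # Step 1: force to lower case (surrounding whitespace is trimmed,
        # which is an assumption, not something stated in the post).
        record = line.strip().lower()
        # Step 2: discard records containing any other character.
        # fullmatch also rejects empty lines.
        if VALID_RECORD.fullmatch(record):
            kept.append(record)
    # No reduction to two levels: longer names are kept exactly as listed.
    return kept
```

Called on a few made-up records such as "Example.COM", "bad_host.example.net" and "a.b.example.org", this keeps the first (lower-cased) and the third unchanged, and drops the second because of the underscore.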
Jeff C.