On Sat, 18 Sep 2004 00:33:59 -0700, Jeff Chan <jeffc@surbl.org> wrote:
OK, taking a look at the fraud.rhs.mailpolice.com data, there's not much overlap with the MailSecurity phishing data which we're currently using in PH in multi.surbl.org.
The former has about 260 records, and the latter has about 400 records, and the overlap is around 25 records. So adding in the mailpolice fraud data would grow PH by about 240 new records.
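For anyone wanting to reproduce the counts above, here is a minimal sketch of the overlap check, assuming each source is available as a plain list of domain strings. The function name overlap_report and the placeholder records are hypothetical, not part of any SURBL tooling. With the rough figures quoted (260 candidate records, 25 shared), the exact difference is 235, which matches the "about 240" estimate.

```python
def overlap_report(current, candidate):
    """Compare two domain lists and count shared and new records.

    `current` stands in for the existing PH data and `candidate` for
    the list being considered for addition (both hypothetical inputs).
    """
    # Normalize case so the same domain is not counted twice.
    current_set = {d.strip().lower() for d in current}
    candidate_set = {d.strip().lower() for d in candidate}

    shared = current_set & candidate_set   # records present in both lists
    new = candidate_set - current_set      # records only the candidate has

    return {
        "shared": len(shared),
        "new": len(new),
        "merged": len(current_set | candidate_set),
    }


# Synthetic placeholder data sized like the figures in the message.
current = [f"cur{i}.example" for i in range(400)]
candidate = current[:25] + [f"new{i}.example" for i in range(235)]
report = overlap_report(current, candidate)
```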
Most of the data looks pretty regular, but one difference is that the mailpolice data has some records like these:
<snip>
which we would typically try to reduce to their base (registrar) domains. Reducing would cause some obvious false positives, for example comcast.net, if we did not happen to whitelist it.
Hmm, this is not great.
One solution would be to not reduce. Another would be to discard these longer domains, but it's not too easy to detect which ones to discard. Neither solution is really great, but they're both better than reducing, because of the FPs that would create.
This is probably the best approach.
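To make the three options concrete, here is a rough sketch of how reduce, keep-as-is, and discard would each treat a longer record. Everything in it is an assumption for illustration: the function names, the naive "last two labels" reduction (a real reducer would need a list of two-level ccTLDs), and the sample hostnames, which are made up and are not the snipped records.

```python
def reduce_to_base(host, labels=2):
    """Naively reduce a hostname to its last `labels` labels.

    Assumption: two labels is enough; this is wrong for names under
    two-level ccTLDs, which a real reducer would have to handle.
    """
    parts = host.strip().lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def build_list(records, whitelist=frozenset(), policy="keep"):
    """Build the set of list entries under one of three policies.

    reduce:  list the base domain of every record
    keep:    list every record exactly as received
    discard: drop any record that is longer than its base domain
    """
    entries = set()
    for record in records:
        host = record.strip().lower().rstrip(".")
        base = reduce_to_base(host)
        if policy == "reduce":
            entry = base
        elif policy == "keep":
            entry = host
        elif policy == "discard":
            if host != base:
                continue
            entry = host
        else:
            raise ValueError(f"unknown policy: {policy}")
        if entry in whitelist:
            continue  # whitelisted domains never get listed
        entries.add(entry)
    return entries


# Hypothetical records: one plain domain, one host under a big ISP.
records = ["badbank-login.example", "some-host.comcast.net"]
reduced = build_list(records, policy="reduce")
kept = build_list(records, policy="keep")
discarded = build_list(records, policy="discard")
```

With no whitelist, the reduce policy lists comcast.net itself, which is the false positive described above. The keep and discard policies never do, at the cost of either carrying longer names or losing those records.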
Also Jay: example.tld is on the list. That doesn't resolve and probably isn't useful for fraud or phishing so you may want to consider removing it. ;-)
It would be nice to figure out these issues before adding the mailpolice fraud data into PH.
Agreed on all counts.