On Friday, September 17, 2004, 3:10:48 PM, Jay Swackhamer wrote:
> Raymond Dijkxhoorn wrote:
>> Are those zones available for rsync (rbldnsd format) ?
>> Can test it on my own cluster also then right now.
>> Nice to see more lists starting off.

> The fraud list has been operating publicly since around March.

OK taking a look at the fraud.rhs.mailpolice.com data,
there's not too much overlap with the MailSecurity phishing
data which we're currently using in PH in multi.surbl.org.

The former has about 260 records, and the latter has
about 400 records, and the overlap is around 25 records.
So adding in the mailpolice fraud data would grow PH
by about 240 new records.
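
For anyone who wants to reproduce that kind of count, here is a
minimal sketch of the comparison, assuming each list can be read
as one domain per line.  The function name and the sample domains
are made up for illustration; they are not real list contents.

```python
def compare_lists(existing, candidate):
    """Return (overlap, additions) for two collections of domains."""
    # Normalize case and whitespace so the same domain compares equal.
    existing = {d.strip().lower() for d in existing if d.strip()}
    candidate = {d.strip().lower() for d in candidate if d.strip()}
    overlap = existing & candidate      # already in the existing list
    additions = candidate - existing    # would be new to the existing list
    return overlap, additions

# Hypothetical sample data standing in for PH and the fraud list.
ph = ["phish-one.example", "phish-two.example", "shared.example"]
fraud = ["Shared.example", "fraud-one.example", "fraud-two.example "]

overlap, additions = compare_lists(ph, fraud)
print(len(overlap), len(additions))
```

With the rough figures above, 260 candidate records minus about 25
overlapping ones leaves about 235 additions, in line with the
estimate.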

Most of the data looks pretty regular, but one difference
is that the mailpolice data has some records like these:

[not a complete list of these longer domains]

which we would typically try to reduce to their base (registrar)
domains.  Reducing would cause some obvious false positives, for
example comcast.net, if we did not happen to whitelist it.
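
To illustrate the reduction step and how a whitelist guards against
that kind of false positive, here is a rough sketch.  It is not the
actual SURBL processing code: the two-level suffix set, the
whitelist contents and the "host1234.comcast.net" name are all
assumptions chosen for the example.

```python
# Assumption: a tiny hand-written set of two-level suffixes.  A real
# implementation would need a much longer list.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au"}

# Assumption: whitelist contents are illustrative only.
WHITELIST = {"comcast.net"}

def reduce_to_base(host):
    """Reduce a hostname to its base domain (naive label counting)."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def record_to_list(host):
    """Return the base domain to list, or None if it is whitelisted."""
    base = reduce_to_base(host)
    return None if base in WHITELIST else base

print(record_to_list("1380781-usd10.e-gold.com"))  # reduces to e-gold.com
print(record_to_list("host1234.comcast.net"))      # whitelisted, so None
```

Without the whitelist entry, the second record would reduce to
comcast.net and list the whole domain.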

Some of these also don't make sense.  e-gold.com is legitimate,
and www.e-gold.com and 1380781-usd10.e-gold.com resolve to
the same IP address.  Why would e-gold phish themselves or allow
a phisher to be hosted on their main web server?

One solution would be to not reduce.  Another would be to discard
these longer domains, but it's not too easy to detect which ones
to discard.  Neither solution is really great, but they're both
better than reducing, because of the FPs that would create.

The un-reduced longer domains basically won't be matched by most
code using SURBLs, because the client-side code usually tries to
reduce to base domains.  So if we leave the longer domains in the
data, aside from making the data a little larger, it doesn't have
too much downside.  On the other hand, since multi.surbl.org is a
logical "or" of all the lists, any extra records in any list going
into multi.surbl.org make multi unnecessarily larger.  But the
number of these longer domains is probably minor in the larger
picture: a dozen or two.
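
A small sketch of why the un-reduced records mostly go unmatched,
assuming a client that always reduces a URL's hostname before
looking it up.  The set union is only a simplified stand-in for how
multi combines the lists, and all names and domains below are
hypothetical.

```python
def base_domain(host):
    # Simplified reduction: keep the last two labels only.
    return ".".join(host.lower().rstrip(".").split(".")[-2:])

def client_matches(url_host, zone):
    """Mimic a client that only ever queries the reduced base domain."""
    return base_domain(url_host) in zone

ph = {"bad-phish.example"}
fraud = {"login.portal.example"}   # an un-reduced longer record
multi = ph | fraud                 # "or" of the member lists

# The base-domain record matches any hostname under it.
print(client_matches("www.bad-phish.example", multi))
# The longer record is never queried, because the client asks for
# portal.example, which is not listed.
print(client_matches("login.portal.example", multi))
```

So the longer record costs space in multi without producing hits
for clients that behave this way.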

Also Jay:  example.tld is on the list.  That doesn't resolve and
probably isn't useful for fraud or phishing so you may want to
consider removing it.  ;-)

It would be nice to figure out these issues before adding the
mailpolice fraud data into PH. 


