On Thursday, September 9, 2004, 5:34:05 PM, Jeff Chan wrote:
My first pass at cleaning the resolved IP data would be to take the top 70th percentile of IP addresses and check domains' resolved IPs only against those. It's not perfect, but it should cut down on the uncertainty.
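To make the percentile cut concrete, here is a rough Python sketch. It assumes "top" means the IPs with the most reports, ranked by how often each IP shows up in the resolved data, and it uses a simple nearest-rank cutoff. The function names and the sample addresses are made up for illustration; this is not code from any existing tool.

```python
from collections import Counter

def top_percentile_ips(reported_ips, percentile=70):
    """Return the IPs whose report count is at or above the given
    percentile of all per-IP report counts (nearest-rank method).
    `percentile` is an integer from 1 to 100 (an assumption of this sketch)."""
    counts = Counter(reported_ips)
    if not counts:
        return set()
    ordered = sorted(counts.values())
    # Integer ceiling of percentile * n / 100, kept within 1..n.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    threshold = ordered[rank - 1]
    return {ip for ip, n in counts.items() if n >= threshold}

def matching_ips(domain_resolved_ips, trusted_ips):
    """Return the resolved IPs of a domain that fall in the retained set."""
    return set(domain_resolved_ips) & trusted_ips

# Hypothetical report stream: one entry per report, keyed by resolved IP.
reports = (
    ["192.0.2.1"] * 5
    + ["192.0.2.2"] * 4
    + ["192.0.2.3"] * 3
    + ["198.51.100.1"] * 2
    + ["198.51.100.2"] * 1
)
kept = top_percentile_ips(reports, percentile=70)
hits = matching_ips(["192.0.2.1", "198.51.100.2"], kept)
```

Where exactly to put the cutoff is a tuning question; 70 is just the starting point mentioned above.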
I should add that this mostly applies to data where we have a constant feed of actual spam reports, such as from SpamCop. It does not apply as strongly to data sources where we only have a unitary list of domains, for example one where each domain appears only once in the whole list. Even there it applies weakly: for example, a dozen domains that all resolve to the same network could probably be used to bias future domains appearing in that network towards list inclusion.
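The weaker, network-level case could be sketched like this. The /24 grouping, the threshold of a dozen domains, and the function names are all assumptions picked for illustration, and the sketch only considers IPv4 addresses.

```python
import ipaddress
from collections import defaultdict

def networks_with_many_domains(domain_ips, prefix=24, min_domains=12):
    """Group (domain, ip) pairs by network and return the networks
    that host at least `min_domains` distinct listed domains."""
    by_net = defaultdict(set)
    for domain, ip in domain_ips:
        # strict=False masks off the host bits, e.g. 192.0.2.5/24 -> 192.0.2.0/24
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        by_net[net].add(domain)
    return {net for net, domains in by_net.items() if len(domains) >= min_domains}

def in_suspect_network(ip, suspect_networks):
    """True if the IP falls inside any of the flagged networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in suspect_networks)

# Hypothetical list data: twelve listed domains in one /24, one elsewhere.
listed = [(f"spam{i}.example", f"192.0.2.{i + 1}") for i in range(12)]
listed.append(("other.example", "198.51.100.1"))
suspect = networks_with_many_domains(listed)
```

A hit here would only be a bias towards inclusion, not grounds for listing a domain by itself.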
But when you have a stream of reports about the *same domain*, then you can get better statistics about that domain or its resolved IP. There's simply more data to work with in more meaningful ways.
Jeff C.