On Monday, September 27, 2004, 5:50:39 PM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discuss:
So there is a point of diminishing returns in going with the older domains. There is also perhaps an increasing chance of false positives (FPs) with older domains.
(I didn't graph the above, but the numbers look like a nice exponential decay....)
I have graphed similar numbers, but I don't have the results handy. It's more like a normal distribution ("bell curve"), with the mean at 0 days (actually slightly greater than zero, but that's a relatively constant skew due to lag between registration time and spam delivery/processing). GetURI uses a modified version of the normal distribution as part of its heuristic. The other parts of GetURI's heuristic are pretty much all additive, but I found that, statistically, domain age is good enough to be multiplicative, and it'll *reduce* rankings for domains that have been registered for a long time. It's so nice when math actually works. :-)
- Ryan
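For readers following along, here is a minimal sketch of what a multiplicative, bell-shaped age factor might look like. It is not GetURI's actual code: the function names, the mean of 2 days, the 30-day spread, and the 0.1 floor are all made-up values for illustration, and the real heuristic is described only as a "modified" normal distribution.

```python
import math


def age_weight(age_days, mean_days=2.0, sigma_days=30.0, floor=0.1):
    """Hypothetical bell-shaped multiplier for domain age.

    Peaks at 1.0 when age_days == mean_days and decays toward `floor`
    as the domain gets older. All parameter values are assumptions.
    """
    z = (age_days - mean_days) / sigma_days
    return floor + (1.0 - floor) * math.exp(-0.5 * z * z)


def rank(additive_score, age_days):
    """Combine the summed (additive) signals with the age multiplier."""
    return additive_score * age_weight(age_days)


if __name__ == "__main__":
    # Same additive score, different registration ages.
    for age in (2, 30, 90, 365):
        print(age, round(rank(10.0, age), 3))
```

With these assumed parameters, a domain registered a year ago keeps only about a tenth of its additive score, which is the "reduce rankings for long-registered domains" effect described above.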
Heh, when I said "normal", statisticians jumped all over that.
Turns out the distributions may be closer to Zipfian. Zipf curves have most of the data concentrated in a small part of the curve (e.g., young domains) and a small amount of the data spread over a larger part of the curve (e.g., old domains). I hope I'm explaining that correctly.
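To make that concentration concrete, here is a small sketch that computes how much of a Zipf-distributed total falls in the first few ranks. Treating rank 1 as the youngest age bucket, the choice of 100 buckets, and the exponent of 1 are assumptions for illustration only, not values fitted to the SURBL data.

```python
def zipf_share(top_k, n, s=1.0):
    """Fraction of total Zipf mass held by the first top_k of n ranks.

    Rank r gets weight 1 / r**s. Here rank 1 stands for the youngest
    age bucket (an assumption made for this illustration).
    """
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    # Share of the mass in the youngest 10 of 100 hypothetical buckets.
    print(f"{zipf_share(10, 100):.2f}")  # prints 0.56
```

So under these assumptions the youngest tenth of the buckets holds more than half of the hits, and the remaining ninety share the rest as a long tail.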
That said, if you found some numerical heuristics that fit the data well, that's great!
Jeff C. -- "If it appears in hams, then don't list it."