On Friday, February 11, 2005, 5:29:29 PM, Alain Alain wrote:
>> That said, here are some results Daniel Quinlan posted from the
>> mass-checks on the SpamAssassin corpora around 26 January 2005:
>>
>> > Weekly mass-check results for SURBL:
>>
>> > OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
>> >   217996   164295    53701  0.754  0.00   0.00  (all messages)
>> >  100.000  75.3661  24.6339  0.754  0.00   0.00  (all messages as %)
>> >   11.644  15.4490   0.0037  1.000  0.98   3.90  URIBL_SC_SURBL
>> >   39.572  52.4976   0.0261  1.000  0.98   3.00  URIBL_JP_SURBL
>> >   51.955  68.9236   0.0391  0.999  0.96   2.00  URIBL_OB_SURBL
>> >    5.690   7.5492   0.0000  1.000  0.95   2.01  URIBL_AB_SURBL
>> >   53.948  71.5238   0.1769  0.998  0.83   0.54  URIBL_WS_SURBL
>> >    0.030   0.0396   0.0000  1.000  0.51   0.84  URIBL_PH_SURBL
>>
> Am I right about the following:
> JP has 0.0261% FP on 24.6339% of all msg --> 0.0065% of all msg
> (that is, fewer than 1 in 15,000)
That sounds right, but the particular proportions of spam versus
ham in the corpora may not be meaningful, i.e. they may not be
representative of an actual mail stream. So it's probably more
useful to compare the percentages within spam or within ham, and
not against a combined total of messages.
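As a rough sketch of that arithmetic (the function name and the
50% ham mix below are made up for illustration, they are not
measured data), the share of all mail that is falsely hit depends
directly on how much of the stream is ham:

```python
def fp_share_of_all_mail(ham_fp_pct, ham_share_pct):
    """Percent of ALL messages that are ham wrongly hit by a list.

    ham_fp_pct    -- percent of ham hit by the list (HAM% column)
    ham_share_pct -- percent of the mail stream that is ham
    """
    return ham_fp_pct * ham_share_pct / 100.0


jp_ham_fp = 0.0261     # URIBL_JP_SURBL HAM% from the table above
corpus_ham = 24.6339   # ham share of the mass-check corpora

# Same calculation as in the question, using the corpus mix.
corpus_result = fp_share_of_all_mail(jp_ham_fp, corpus_ham)
one_in = 100.0 / corpus_result

# Hypothetical mail stream that is half ham (an assumption).
half_result = fp_share_of_all_mail(jp_ham_fp, 50.0)

print(f"corpus mix: {corpus_result:.4f}% of all mail, "
      f"about 1 in {one_in:.0f}")
print(f"50% ham mix: {half_result:.4f}% of all mail")
```

With the corpus mix this comes out to roughly 0.0064% of all
messages, or about 1 in 15,500, in line with the estimate above.
With a different ham share the combined figure changes even though
the per-ham rate of 0.0261% stays the same, which is why the
per-ham and per-spam columns are the safer ones to compare.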
Certainly the relative percentages within spam or ham are
meaningful and mostly useful, with the caveat that the spam
detection rates are understated for quickly moving data like SC
and AB, since the test corpora cover too long a period for them.
(This matters more for spam than for ham, since spam domains
change quickly over time while ham domains are relatively steady.)
>> SC and AB have much better real world results than shown above
>> because their time period is much shorter than the test
>> corpora's.
> Yes, but maybe the FP's will grow faster ;-)
That tends not to be the case. The SpamCop data is filtered
multiple times and is human-checked at the front end. The SC FP
rates are consistently among the lowest, and the spam detection
rates are very high for such a small list. In short, it's an
effective strategy.
>> Also note that the JP data is now removed from the WS data, and
>> some old data was removed from WS. So the WS spam and ham hit
>> rates have probably both decreased since this check was done.
>> JP should be about the same.
> That will show in the future. It's also a good thing.
Yes, it's fairer to the data sources.
>> > And if possible, does anybody have statistics on FPs that were
>> > on several of the sublists -at the same time-?
> [snip]
>> I don't think that is known yet. I had proposed setting up some
>> test lists with combinations like this, but got no response. ;-)
>>
>> If it *is* known I think we'd all like to hear about it. :-)
> I think it could be known to the great people that check the FP
> reports. Normally they check against all sublists (I hope) and fix
> them all.
When we whitelist a domain, it's excluded from all SURBLs. The
original data source is usually notified.
> I know that not all FP's are reported and there are
> probably no exact numbers, but it should give a good idea. Or am I
> wrong?
The FP reports are probably too few overall to be meaningful in
terms of differentiating performance between lists. There just
aren't that many, maybe a few a day on average.
Jeff C.
--
"If it appears in hams, then don't list it."