Hi Jeff,
> >> That said, here are some results Daniel Quinlan posted from the
> >> mass-checks on the SpamAssassin corpora around 26 January 2005:
> >>
> >> > Weekly mass-check results for SURBL:
> >>
> >> > OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
> >> >   217996    164295     53701  0.754  0.00   0.00  (all messages)
> >> >  100.000   75.3661   24.6339  0.754  0.00   0.00  (all messages as %)
> >> >   11.644   15.4490    0.0037  1.000  0.98   3.90  URIBL_SC_SURBL
> >> >   39.572   52.4976    0.0261  1.000  0.98   3.00  URIBL_JP_SURBL
> >> >   51.955   68.9236    0.0391  0.999  0.96   2.00  URIBL_OB_SURBL
> >> >    5.690    7.5492    0.0000  1.000  0.95   2.01  URIBL_AB_SURBL
> >> >   53.948   71.5238    0.1769  0.998  0.83   0.54  URIBL_WS_SURBL
> >> >    0.030    0.0396    0.0000  1.000  0.51   0.84  URIBL_PH_SURBL
> >>
>
> > Am I right with the following :
>
> > JP has 0.0261% FP on the ham, which is 24.6339% of all msg
> > --> about 0.0064% of all msg (less than 1 in 15,000)
>
> That sounds right, but the particular proportions of spam versus
> ham may not be meaningful, i.e. they may not be representative
> of an actual mail stream. So the percentages are probably more
> usefully compared only to spam or ham and not to a combined total
> of messages.
ok
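
For the record, the arithmetic behind my estimate can be sketched in
Python as below. The two input figures come from the table above; the
function name and the alternative ham shares in the loop are only my
own illustration, not measured values. The loop shows your point: the
combined figure moves a lot with the spam/ham mix, while the FP rate
on ham alone stays the same.

```python
# Figures taken from the mass-check table quoted above.
ham_share = 24.6339 / 100    # ham as a fraction of all messages in the corpus
jp_ham_hit = 0.0261 / 100    # fraction of ham that hit URIBL_JP_SURBL


def fp_share_of_all(ham_hit_rate, ham_fraction):
    """False positives as a fraction of all messages (ham plus spam)."""
    return ham_hit_rate * ham_fraction


corpus_fp = fp_share_of_all(jp_ham_hit, ham_share)
print(f"{corpus_fp * 100:.4f}% of all messages, about 1 in {1 / corpus_fp:,.0f}")

# Hypothetical mail streams with other ham shares (assumed values), to show
# how strongly the combined figure depends on the spam/ham mix.
for ham_fraction in (0.10, 0.50, 0.90):
    rate = fp_share_of_all(jp_ham_hit, ham_fraction)
    print(f"ham share {ham_fraction:.0%}: about 1 in {1 / rate:,.0f}")
```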
>
> Certainly the relative percentages within spam or ham are
> meaningful and mostly useful with the caveat that the spam
> detection rates are wrong for quickly moving data in SC and AB
> since the test corpora cover too much time for them. (This is
> more true for spam than ham since spam domains vary quickly with
> time, but ham domains are relatively steady.)
>
ok
> >> SC and AB have much better real world results than shown above
> >> because their time period is much shorter than the test
> >> corpora's.
>
> > Yes, but maybe the FP's will grow faster ;-)
>
> That tends not to be the case. The SpamCop data is filtered
> multiple times and is human-checked at the front end. The SC FP
> rates are consistently among the lowest, and the spam detection
> rates are very high for a very small list. In short it's an
> effective strategy.
>
OK, and overall I am impressed with the low FP rates on all lists.
> >> Also note that the JP data is now removed from the WS data, and
> >> some old data was removed from WS. So the WS spam and ham hit
> >> rates have probably both decreased since this check was done.
> >> JP should be about the same.
>
> > That will show in the future. It is also a good thing.
>
> Yes, it's fairer to the data sources.
>
> >> > And if possible, has anybody statistics on FP's that were on
> >> > several of the sublists -at the same time-?
>
> > [snip]
>
> >> I don't think that is known yet. I had proposed setting up some
> >> test lists with combinations like this, but got no response. ;-)
> >>
> >> If it *is* known I think we'd all like to hear about it. :-)
>
> > I think it could be known to the great people that check the FP
> > reports. Normally they check against all sublists (I hope) and fix
> > them all.
>
> When we whitelist a domain, it's excluded from all SURBLs. The
> original data source is usually notified.
>
> > I know that not all FP's are reported and there are
> > probably no exact numbers, but it should give a good idea. Or am I
> > wrong?
>
> The FP reports are probably too few overall to be meaningful in
> terms of differentiating performance between lists. There just
> aren't that many, maybe a few a day on average.
>
Yes, but I wasn't thinking of differentiating between the lists; there
are other results for that. What I was thinking of was the number of
FP's that exist on more than one list. This is very useful information
when combining lists. If almost no FP's occur on more than one list
(at the same time), requiring appearance on at least 2 lists would be
a very safe rule.
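
To make clear what kind of count I am asking for, here is a small
Python sketch. The domains and their list memberships are invented for
illustration, and the function names are my own; real numbers would
have to come from the FP reports.

```python
from collections import Counter

# Hypothetical example data: for each reported false-positive domain, the set
# of SURBL sublists it was on at the time of the report. All entries are
# made up for illustration.
fp_reports = {
    "example.com": {"WS"},
    "example.net": {"WS", "OB"},
    "example.org": {"JP"},
    "example.edu": {"WS"},
}


def fps_by_list_count(reports):
    """Count FP domains by how many sublists they were on at the same time."""
    return Counter(len(lists) for lists in reports.values())


def fps_surviving(reports, min_lists=2):
    """FP domains that would still hit under an 'at least min_lists' rule."""
    return sorted(d for d, lists in reports.items() if len(lists) >= min_lists)


counts = fps_by_list_count(fp_reports)
multi = fps_surviving(fp_reports, min_lists=2)
print(dict(counts))
print(multi)
```

If the count for two or more lists stays near zero on real reports, a
rule that needs at least 2 lists would remove nearly all FP's. What it
costs in spam detection would still have to be measured separately.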
Alain