>-----Original Message-----
>From: Fred [mailto:tech2@i-is.com]
>Sent: Monday, February 14, 2005 1:26 PM
>To: SURBL Discussion list
>Subject: Re: [SURBL-Discuss] FP rate?
>
>
>Chris Santerre wrote:
>> Can we trust the FP rate with the current bug in SA?
>
>Not taking sides but it might be a bug in Net::DNS, the SA
>devs have not
>exactly tied down what was causing this issue. There was talk
>of re-write
>in the way they use Net::DNS to possibly fix this issue but
>I'm pretty sure
>this was not SA specific.
>
>http://bugzilla.spamassassin.org/show_bug.cgi?id=3997
>
Oh I agree. I don't know what is causing it, but I know it must be throwing
off the reported FP rate. Although probably for all the URIRBLs. I'd love to
get a monthly report from DQ on his rates. But I know he is busy.
--Chris
On Saturday, February 12, 2005, 3:36:11 AM, Alain Alain wrote:
> Hi Jeff
>> On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
>> >> Generally speaking it may be better to apply this kind of
>> >> filtering at the server level since there are economies of scale,
>> >> especially in terms of things like DNS lookups and caching. If
>> >> we suddenly get 100k more DNS clients, that could tax the name
>> >> servers somewhat. If those same 100k users were using 100
>> >> servers instead, the DNS loading would be quite a bit less. In
>> >> that sense centralization is desirable.
>>
>> > Mmmm isn't the dns server from the ISP caching the dns requests? I
>> > would think it doesn't make a big difference (except when a server is
>> > rsync'ing). The difference could be that end users check their e-mail
>> > not when arriving on the MTA, but later.
>>
>> One difference is that the ISP's mail server may see many of the
>> same spams within a short period of time, and the lookups would
>> probably tend to be cached over that time span. Individual users
>> may POP or IMAP their messages at any random time, so the DNS
>> cache hit rate may be lower for them.
> This will only be the case for spam e-mail, not for domains inside ham e-mail.
But most well-written applications, e.g. SpamAssassin, are
already ignoring most ham URIs due to local whitelisting, so it's
spam URI domain caching that's the main issue.
>> I think we're agreeing, but I've never tried to quantify the
>> difference between these. We can propose that there's some
>> difference but how much is unknown. I would suggest a pretty
>> strong cache effect for mail servers however.
> But the good news is : The more users, the more caching. So the
> burden on the nameservers will grow slower.
The SURBL zone files have a minimal 15 minute TTL, so in order
for ISP resolver hits to be cached, the queries will need to
occur within some 15 minutes, which seems less likely at MUA
download time than at MTA processing time. MTAs probably see
similar spam over a short period of time whereas MUA clients
can download at any later time.
In this case, I don't think your argument applies. For something
like caching yahoo domains, or any with "normal" longer TTLs, it
probably applies more strongly.
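
To make the TTL point concrete, here is a rough sketch in Python of a toy resolver cache with a 15 minute TTL. The arrival patterns are made-up assumptions (ten lookups one minute apart for an MTA, ten lookups one hour apart for MUA clients), and the domain name is a placeholder, so treat the numbers as an illustration rather than a measurement.

```python
from dataclasses import dataclass, field

TTL_SECONDS = 15 * 60  # the 15 minute TTL mentioned above


@dataclass
class TtlCache:
    """Toy resolver cache that only remembers when each name expires."""
    ttl: int = TTL_SECONDS
    expires_at: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def lookup(self, name: str, now: int) -> bool:
        """Return True on a cache hit, False when an upstream query is needed."""
        if self.expires_at.get(name, -1) > now:
            self.hits += 1
            return True
        self.misses += 1
        self.expires_at[name] = now + self.ttl
        return False


def hit_rate(times, name="spam-domain.example"):
    """Fraction of lookups at the given times (in seconds) served from cache."""
    cache = TtlCache()
    for t in times:
        cache.lookup(name, t)
    return cache.hits / len(times)


# Assumed MTA-like pattern: ten copies of the same spam, one minute apart.
mta_times = [i * 60 for i in range(10)]
# Assumed MUA-like pattern: ten users fetching the same spam, one hour apart.
mua_times = [i * 3600 for i in range(10)]

mta_rate = hit_rate(mta_times)
mua_rate = hit_rate(mua_times)
print(f"MTA-like hit rate: {mta_rate:.0%}")
print(f"MUA-like hit rate: {mua_rate:.0%}")
```

With these assumed timings, only the first MTA lookup goes upstream, while every MUA lookup arrives after the record has expired. Real traffic would fall somewhere in between.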
Jeff C.
--
"If it appears in hams, then don't list it."
Hi Jeff
> >> > I know that not all FP's are reported and there are
> >> > probably no exact numbers, but it should give a good idea. Or am I
> >> > wrong?
> >>
> >> The FP reports are probably too few overall to be meaningful in
> >> terms of differentiating performance between lists. There just
> >> aren't that many, maybe a few a day on average.
> >>
>
> > Yes, but I wasn't thinking of differentiating between the lists; there
> > are other results for that. What I was thinking of was the number of FPs
> > that exist on more than one list. This is very useful information
> > when combining lists. If almost no FPs occur on more than one
> > list (at the same time), requiring appearance on at least 2 lists
> > would be a very safe rule.
>
> Good point. Anecdotally, FPs don't tend to appear on multiple
> lists very often, at least the FPs we've seen reported. This is
> unmeasured, just a subjective opinion. If we had some of the
> list data in combined form as I had proposed then we could test
> it better. I suppose I could just do it. ;-)
>
If the reported ones are very rare, this would probably be even more
the case for the unreported ones. If there's a FP, the chance of it
being reported will grow if it is on more than one list.
Mmm, the combined lists just have to be available to someone with a big
ham corpus to test it.
Personally, knowing the results for "at least 2" or "at least 3" would
be nice. It also would be nice to know how those combinations would
show up in:
http://www.surbl.org/permuted-hits.out.txt
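
A rough sketch of the "at least N lists" test in Python, assuming each list is available as a plain set of domains and the ham corpus has been reduced to a set of domains seen in ham. The list contents and domain names below are invented placeholders, not real SURBL data.

```python
from collections import Counter


def listed_on_at_least(lists: dict[str, set[str]], n: int) -> set[str]:
    """Return the domains that appear on at least n of the given lists."""
    counts = Counter(domain for members in lists.values() for domain in members)
    return {domain for domain, count in counts.items() if count >= n}


def false_positives(combined: set[str], ham_domains: set[str]) -> set[str]:
    """Domains in the combined list that also occur in the ham corpus."""
    return combined & ham_domains


# Hypothetical list contents, for illustration only.
lists = {
    "sc": {"pills.example", "casino.example"},
    "ws": {"pills.example", "casino.example", "newsletter.example"},
    "ob": {"pills.example", "forum.example"},
    "jp": {"casino.example"},
}
# Hypothetical domains extracted from a ham corpus.
ham_domains = {"newsletter.example", "forum.example"}

for n in (1, 2, 3):
    combined = listed_on_at_least(lists, n)
    fps = false_positives(combined, ham_domains)
    print(f"at least {n}: {len(combined)} domains, {len(fps)} FPs")
```

In this invented data each ham domain sits on only one list, so requiring two lists removes both FPs while keeping the two spam domains. Whether real data behaves like that is exactly what would need testing.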
Alain
On Saturday, February 12, 2005, 2:41:36 AM, Alain Alain wrote:
>> >> - I've added a local skiplist with about top half of the public
>> >> "whitelist", no need to query those.
>>
>> When you say half, that may be more than optimal (should be about
>> 5000 records). SpamAssassin is using the top 125, which worked
>> out to about the 50th percentile of all whitelist hits when we
>> first set this up. (Now that result is skewed *because*
>> SpamAssassin isn't checking those 125 any more, but their
>> snapshot of the 125 is still probably useful.)
>>
>> I'd say anything between 100 and 1000 would probably be a good
>> compromise between list size and coverage.
> The only disadvantage I see from a bigger local skiplist is some local
> CPU usage for every uri in a email. Most pc's have plenty of CPU
> power ;-) If this could become a problem, I can lower or optimise the
> local checking. Are there any other disadvantages?
One reason SpamAssassin didn't want to hard code too many domains
into their local whitelist was in case we needed to withdraw any,
i.e. because they started spamming. The time between code
releases can be many months, and some people may never update, so
they wanted to be sure to get very hammy domains into that list.
(While Yahoo and Microsoft probably aren't going to start
spamming any time soon, that may be less certain about some of
the less commonly seen domains.)
But I'm glad that you're trying to minimize the DNS queries.
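
For what it's worth, a local skiplist check can be sketched like this in Python. It is only an illustration: the two skiplist entries are placeholders taken from the examples above, the function names are made up, and it does not reflect how SpamAssassin implements its own whitelist.

```python
from urllib.parse import urlsplit

# Hypothetical skiplist entries; a real one would be loaded from a file
# that can be updated or withdrawn without a code release.
LOCAL_SKIPLIST = frozenset({"yahoo.com", "microsoft.com"})


def host_of(uri: str) -> str:
    """Extract the lowercased host name from a URI, or '' if there is none."""
    return (urlsplit(uri).hostname or "").lower()


def is_skipped(host: str, skiplist=LOCAL_SKIPLIST) -> bool:
    """True if the host or any of its parent domains is on the skiplist."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in skiplist for i in range(len(labels)))


def hosts_to_query(uris):
    """Return the distinct hosts that still need a DNS lookup, in order seen."""
    pending = []
    for uri in uris:
        host = host_of(uri)
        if host and not is_skipped(host) and host not in pending:
            pending.append(host)
    return pending


uris = [
    "http://www.yahoo.com/mail",
    "http://pills.example/buy",
    "http://PILLS.example/again",
    "https://update.microsoft.com/",
]
print(hosts_to_query(uris))
```

Because the skiplist is a hash-based set, the per-URI cost stays roughly the same whether it holds 125 entries or 5000, so the CPU concern is probably minor. The staleness concern is the bigger one, which is why the entries are better kept in data than in code.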
Jeff C.
--
"If it appears in hams, then don't list it."
On Saturday, February 12, 2005, 3:09:46 AM, Alain Alain wrote:
>> > I know that not all FP's are reported and there are
>> > probably no exact numbers, but it should give a good idea. Or am I
>> > wrong?
>>
>> The FP reports are probably too few overall to be meaningful in
>> terms of differentiating performance between lists. There just
>> aren't that many, maybe a few a day on average.
>>
> Yes, but I wasn't thinking of differentiating between the lists; there
> are other results for that. What I was thinking of was the number of FPs
> that exist on more than one list. This is very useful information
> when combining lists. If almost no FPs occur on more than one
> list (at the same time), requiring appearance on at least 2 lists
> would be a very safe rule.
Good point. Anecdotally, FPs don't tend to appear on multiple
lists very often, at least the FPs we've seen reported. This is
unmeasured, just a subjective opinion. If we had some of the
list data in combined form as I had proposed then we could test
it better. I suppose I could just do it. ;-)
Jeff C.
--
"If it appears in hams, then don't list it."
On Friday, February 11, 2005, 5:29:29 PM, Alain Alain wrote:
>> That said, here are some results Daniel Quinlan posted from the
>> mass-checks on the SpamAssassin corpora around 26 January 2005:
>>
>> > Weekly mass-check results for SURBL:
>>
>> > OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>> >   217996    164295     53701  0.754  0.00   0.00  (all messages)
>> >  100.000   75.3661   24.6339  0.754  0.00   0.00  (all messages as %)
>> >   11.644   15.4490    0.0037  1.000  0.98   3.90  URIBL_SC_SURBL
>> >   39.572   52.4976    0.0261  1.000  0.98   3.00  URIBL_JP_SURBL
>> >   51.955   68.9236    0.0391  0.999  0.96   2.00  URIBL_OB_SURBL
>> >    5.690    7.5492    0.0000  1.000  0.95   2.01  URIBL_AB_SURBL
>> >   53.948   71.5238    0.1769  0.998  0.83   0.54  URIBL_WS_SURBL
>> >    0.030    0.0396    0.0000  1.000  0.51   0.84  URIBL_PH_SURBL
>>
> Am I right with the following :
> JP has 0.0261% FP on 24.6339% of all msg --> 0.0065% of all msg
> (is less than 1 in 15.000)
That sounds right, but the particular proportions of spam versus
ham may not be meaningful, i.e. they may not be representative
of an actual mail stream. So the percentages are probably more
usefully compared only to spam or ham and not to a combined total
of messages.
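
The arithmetic can be checked with a few lines of Python. The figures come from the table above; the 50% ham share in the second call is an assumed alternative mail mix, included only to show how much the combined figure depends on the spam/ham proportions.

```python
def fp_share_of_all(ham_fp_pct: float, ham_share_pct: float) -> float:
    """FP rate as a percentage of all messages, given the ham FP rate (in %)
    and ham's share of the mail stream (in %)."""
    return ham_fp_pct * ham_share_pct / 100


# Figures from the mass-check table above (URIBL_JP_SURBL row).
jp_ham_fp_pct = 0.0261
corpus_ham_share_pct = 24.6339
ham_messages = 53701

fp_of_all_pct = fp_share_of_all(jp_ham_fp_pct, corpus_ham_share_pct)
one_in = 100 / fp_of_all_pct
fp_messages = jp_ham_fp_pct / 100 * ham_messages

print(f"JP FPs as share of all corpus messages: {fp_of_all_pct:.4f}%")
print(f"That is about 1 in {one_in:,.0f} messages")
print(f"Roughly {fp_messages:.0f} ham messages hit in the corpus")

# Assumed alternative mix: ham is half of the mail stream.
print(f"With 50% ham: {fp_share_of_all(jp_ham_fp_pct, 50.0):.4f}% of all messages")
```

The ham-only rate (0.0261%) stays the same in both cases, which is why it is the safer number to compare between lists.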
Certainly the relative percentages within spam or ham are
meaningful and mostly useful with the caveat that the spam
detection rates are wrong for quickly moving data in SC and AB
since the test corpora cover too much time for them. (This is
more true for spam than ham since spam domains vary quickly with
time, but ham domains are relatively steady.)
>> SC and AB have much better real world results than shown above
>> because their time period is much shorter than the test
>> corpora's.
> Yes, but maybe the FP's will grow faster ;-)
That tends not to be the case. The SpamCop data is filtered
multiple times and is human-checked at the front end. The SC FP
rates are consistently among the lowest, and the spam detection
rates are very high for a very small list. In short it's an
effective strategy.
>> Also note that the JP data is now removed from the WS data, and
>> some old data was removed from WS. So the WS spam and ham hit
>> rates have probably both decreased since this check was done.
>> JP should be about the same.
> That will show in the future. Is also a good thing.
Yes, it's fairer to the data sources.
>> > And if possible, does anybody have statistics on FPs that were on
>> > several of the sublists -at the same time-?
> [snip]
>> I don't think that is known yet. I had proposed setting up some
>> test lists with combinations like this, but got no response. ;-)
>>
>> If it *is* known I think we'd all like to hear about it. :-)
> I think it could be known to the great people that check the FP
> reports. Normally they check against all sublists (I hope) and fix
> them all.
When we whitelist a domain, it's excluded from all SURBLs. The
original data source is usually notified.
> I know that not all FP's are reported and there are
> probably no exact numbers, but it should give a good idea. Or am I
> wrong?
The FP reports are probably too few overall to be meaningful in
terms of differentiating performance between lists. There just
aren't that many, maybe a few a day on average.
Jeff C.
--
"If it appears in hams, then don't list it."
Hi Jeff
> On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
> >> Generally speaking it may be better to apply this kind of
> >> filtering at the server level since there are economies of scale,
> >> especially in terms of things like DNS lookups and caching. If
> >> we suddenly get 100k more DNS clients, that could tax the name
> >> servers somewhat. If those same 100k users were using 100
> >> servers instead, the DNS loading would be quite a bit less. In
> >> that sense centralization is desirable.
>
> > Mmmm isn't the dns server from the ISP caching the dns requests? I
> > would think it doesn't make a big difference (except when a server is
> > rsync'ing). The difference could be that end users check their e-mail
> > not when arriving on the MTA, but later.
>
> One difference is that the ISP's mail server may see many of the
> same spams within a short period of time, and the lookups would
> probably tend to be cached over that time span. Individual users
> may POP or IMAP their messages at any random time, so the DNS
> cache hit rate may be lower for them.
This will only be the case for spam e-mail, not for domains inside ham e-mail.
>
> I think we're agreeing, but I've never tried to quantify the
> difference between these. We can propose that there's some
> difference but how much is unknown. I would suggest a pretty
> strong cache effect for mail servers however.
But the good news is : The more users, the more caching. So the
burden on the nameservers will grow slower.
Alain
Hi Jeff
> >> That said, here are some results Daniel Quinlan posted from the
> >> mass-checks on the SpamAssassin corpora around 26 January 2005:
> >>
> >> > Weekly mass-check results for SURBL:
> >>
> >> > OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
> >> >   217996    164295     53701  0.754  0.00   0.00  (all messages)
> >> >  100.000   75.3661   24.6339  0.754  0.00   0.00  (all messages as %)
> >> >   11.644   15.4490    0.0037  1.000  0.98   3.90  URIBL_SC_SURBL
> >> >   39.572   52.4976    0.0261  1.000  0.98   3.00  URIBL_JP_SURBL
> >> >   51.955   68.9236    0.0391  0.999  0.96   2.00  URIBL_OB_SURBL
> >> >    5.690    7.5492    0.0000  1.000  0.95   2.01  URIBL_AB_SURBL
> >> >   53.948   71.5238    0.1769  0.998  0.83   0.54  URIBL_WS_SURBL
> >> >    0.030    0.0396    0.0000  1.000  0.51   0.84  URIBL_PH_SURBL
> >>
>
> > Am I right with the following :
>
> > JP has 0.0261% FP on 24.6339% of all msg --> 0.0065% of all msg
> > (is less than 1 in 15.000)
>
> That sounds right, but the particular proportions of spam versus
> ham may not be meaningful, i.e. they may not be representative
> of an actual mail stream. So the percentages are probably more
> usefully compared only to spam or ham and not to a combined total
> of messages.
ok
>
> Certainly the relative percentages within spam or ham are
> meaningful and mostly useful with the caveat that the spam
> detection rates are wrong for quickly moving data in SC and AB
> since the test corpora cover too much time for them. (This is
> more true for spam than ham since spam domains vary quickly with
> time, but ham domains are relatively steady.)
>
ok
> >> SC and AB have much better real world results than shown above
> >> because their time period is much shorter than the test
> >> corpora's.
>
> > Yes, but maybe the FP's will grow faster ;-)
>
> That tends not to be the case. The SpamCop data is filtered
> multiple times and is human-checked at the front end. The SC FP
> rates are consistently among the lowest, and the spam detection
> rates are very high for a very small list. In short it's an
> effective strategy.
>
ok and I am overall impressed with the low FP rates on all lists.
> >> Also note that the JP data is now removed from the WS data, and
> >> some old data was removed from WS. So the WS spam and ham hit
> >> rates have probably both decreased since this check was done.
> >> JP should be about the same.
>
> > That will show in the future. Is also a good thing.
>
> Yes, it's fairer to the data sources.
>
> >> > And if possible, does anybody have statistics on FPs that were on
> >> > several of the sublists -at the same time-?
>
> > [snip]
>
> >> I don't think that is known yet. I had proposed setting up some
> >> test lists with combinations like this, but got no response. ;-)
> >>
> >> If it *is* known I think we'd all like to hear about it. :-)
>
> > I think it could be known to the great people that check the FP
> > reports. Normally they check against all sublists (I hope) and fix
> > them all.
>
> When we whitelist a domain, it's excluded from all SURBLs. The
> original data source is usually notified.
>
> > I know that not all FP's are reported and there are
> > probably no exact numbers, but it should give a good idea. Or am I
> > wrong?
>
> The FP reports are probably too few overall to be meaningful in
> terms of differentiating performance between lists. There just
> aren't that many, maybe a few a day on average.
>
Yes, but I wasn't thinking of differentiating between the lists; there
are other results for that. What I was thinking of was the number of FPs
that exist on more than one list. This is very useful information
when combining lists. If almost no FPs occur on more than one
list (at the same time), requiring appearance on at least 2 lists
would be a very safe rule.
Alain
On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
>> Generally speaking it may be better to apply this kind of
>> filtering at the server level since there are economies of scale,
>> especially in terms of things like DNS lookups and caching. If
>> we suddenly get 100k more DNS clients, that could tax the name
>> servers somewhat. If those same 100k users were using 100
>> servers instead, the DNS loading would be quite a bit less. In
>> that sense centralization is desirable.
> Mmmm isn't the dns server from the ISP caching the dns requests? I
> would think it doesn't make a big difference (except when a server is
> rsync'ing). The difference could be that end users check their e-mail
> not when arriving on the MTA, but later.
One difference is that the ISP's mail server may see many of the
same spams within a short period of time, and the lookups would
probably tend to be cached over that time span. Individual users
may POP or IMAP their messages at any random time, so the DNS
cache hit rate may be lower for them.
I think we're agreeing, but I've never tried to quantify the
difference between these. We can propose that there's some
difference but how much is unknown. I would propose a pretty
strong cache effect for mail servers however.
Jeff C.
--
"If it appears in hams, then don't list it."