Discuss February 2005

discuss@lists.surbl.org

36 participants
47 discussions

RE: [SURBL-Discuss] FP rate?
by Chris Santerre 14 Feb '05

>-----Original Message-----
>From: Fred [mailto:tech2@i-is.com]
>Sent: Monday, February 14, 2005 1:26 PM
>To: SURBL Discussion list
>Subject: Re: [SURBL-Discuss] FP rate?
>
>
>Chris Santerre wrote:
>> Can we trust the FP rate with the current bug in SA?
>
>Not taking sides but it might be a bug in Net::DNS, the SA devs have not
>exactly tied down what was causing this issue. There was talk of re-write
>in the way they use Net::DNS to possibly fix this issue but I'm pretty sure
>this was not SA specific.
>
>http://bugzilla.spamassassin.org/show_bug.cgi?id=3997
>

Oh, I agree. I don't know what is causing it, but I know it must be throwing off the reported FP rate, although probably for all the URIRBLs. I'd love to get a monthly report from DQ on his rates, but I know he is busy.

--Chris

RE: [SURBL-Discuss] FP rate?
by Chris Santerre 14 Feb '05

Can we trust the FP rate with the current bug in SA?

--Chris

Re: [SURBL-Discuss] ANN: new surbl client (still beta)
by Jeff Chan 14 Feb '05

On Saturday, February 12, 2005, 3:36:11 AM, Alain Alain wrote:
> Hi Jeff
>> On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
>> >> Generally speaking it may be better to apply this kind of
>> >> filtering at the server level since there are economies of scale,
>> >> especially in terms of things like DNS lookups and caching. If
>> >> we suddenly get 100k more DNS clients, that could tax the name
>> >> servers somewhat. If those same 100k users were using 100
>> >> servers instead, the DNS loading would be quite a bit less. In
>> >> that sense centralization is desirable.
>>
>> > Mmmm isn't the dns server from the ISP caching the dns requests? I
>> > would think it doesn't make a big difference (except when a server is
>> > rsync'ing). The difference could be that end users check their e-mail
>> > not when arriving on the MTA, but later.
>>
>> One difference is that the ISP's mail server may see many of the
>> same spams within a short period of time, and the lookups would
>> probably tend to be cached over that time span. Individual users
>> may POP or IMAP their messages at any random time, so the DNS
>> cache hit rate may be lower for them.

> This will only the case for spam e-mail, not for domains inside ham e-mail.

But most well-written applications, e.g. SpamAssassin, are already ignoring most ham URIs due to local whitelisting, so it's spam URI domain caching that's the main issue.

>> I think we're agreeing, but I've never tried to quantify the
>> difference between these. We can propose that there's some
>> difference but how much is unknown. I would suggest a pretty
>> strong cache effect for mail servers however.

> But the good news is : The more users, the more caching. So the
> burden on the nameservers will grow slower.

The SURBL zone files have a minimal 15 minute TTL, so in order for ISP resolver hits to be cached, the queries will need to occur within some 15 minutes, which seems less likely at MUA download time than at MTA processing time. MTAs probably see similar spam over a short period of time whereas MUA clients can download at any later time. In this case, I don't think your argument applies. For something like caching yahoo domains, or any with "normal" longer TTLs, it probably applies more strongly.

Jeff C.
--
"If it appears in hams, then don't list it."
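
The TTL argument above can be illustrated with a toy resolver cache. This is only a sketch under stated assumptions: the `TtlCache` class, the `upstream_queries` helper, the domain name and the two arrival patterns (ten copies one minute apart for an MTA, ten fetches one hour apart for MUA clients) are all made up for illustration and are not measurements from any real mail stream.

```python
class TtlCache:
    """Toy resolver cache: an answer is reused until its TTL expires."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.expires = {}   # name -> time at which the cached answer expires
        self.hits = 0
        self.misses = 0     # each miss is a query sent to the SURBL name servers

    def lookup(self, name, now):
        if self.expires.get(name, -1) > now:
            self.hits += 1
        else:
            self.misses += 1
            self.expires[name] = now + self.ttl


def upstream_queries(lookup_times, ttl_seconds=15 * 60, name="spamdomain.example"):
    """Count how many lookups of one domain are not served from the cache."""
    cache = TtlCache(ttl_seconds)
    for t in sorted(lookup_times):
        cache.lookup(name, t)
    return cache.misses


# Hypothetical MTA pattern: ten copies of the same spam, one minute apart.
mta_times = [i * 60 for i in range(10)]
# Hypothetical MUA pattern: ten users fetch the same spam, one hour apart.
mua_times = [i * 3600 for i in range(10)]

print("MTA upstream queries:", upstream_queries(mta_times))
print("MUA upstream queries:", upstream_queries(mua_times))
```

With a 15 minute TTL the clustered MTA lookups share one cached answer, while the spread-out MUA lookups each find the entry already expired. A longer TTL passed as `ttl_seconds` narrows the gap, which matches the remark about domains with "normal" longer TTLs.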

Re: [SURBL-Discuss] FP rate?
by Alain 13 Feb '05

Hi Jeff

> >> > I know that not all FP's are reported and there are
> >> > probably no exact numbers, but it should give a good idea. Or am I
> >> > wrong?
> >>
> >> The FP reports are probably too few overall to be meaningful in
> >> terms of differentiating performance between lists. There just
> >> aren't that many, maybe a few a day on average.
> >>
> >
> > Yes, but I wasn't thinking on differentiating between the lists, there
> > are other results for. What I was thinking on was the number of FP's
> > that exists on more than one list. This is very usefull information
> > when combining lists. If almost no FP's do occur on more than one
> > list (at the same time) requiring appearance on at least 2 lists
> > would be a very safe one.
>
> Good point. Anecdotally, FPs don't tend to appear on multiple
> lists very often, at least the FPs we've seen reported. This is
> unmeasured, just a subjective opinion. If we had some of the
> list data in combined form as I had proposed then we could test
> it better. I suppose I could just do it. ;-)
>

If the reported ones are very rare, this would probably be even more the case for the unreported ones. If there's an FP, the chance of it being reported will grow if it is on more than one list.

Mmm, the combined lists just have to be available to someone with a big ham corpus to test it. Personally, knowing the results for "at least 2" or "at least 3" would be nice. It would also be nice to know how those combinations would show up in:
http://www.surbl.org/permuted-hits.out.txt

Alain
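
The "at least 2" or "at least 3" idea can be sketched against a combined list that returns one bitmasked answer per domain. The bit values in `LIST_BITS` are an assumption based on how the combined SURBL zone encoded its sublists around this time and should be checked against the current SURBL documentation before use; the function names are hypothetical.

```python
import ipaddress

# Assumed bit assignments in the last octet of a combined-list answer.
LIST_BITS = {2: "sc", 4: "ws", 8: "ph", 16: "ob", 32: "ab", 64: "jp"}


def lists_from_answer(answer):
    """Return the sublists encoded in an answer such as '127.0.0.68'."""
    last_octet = int(ipaddress.IPv4Address(answer)) & 0xFF
    return sorted(name for bit, name in LIST_BITS.items() if last_octet & bit)


def listed_on_at_least(answer, minimum=2):
    """True if the domain appears on at least `minimum` sublists."""
    return len(lists_from_answer(answer)) >= minimum


print(lists_from_answer("127.0.0.68"))       # a domain on two sublists
print(listed_on_at_least("127.0.0.4"))       # a domain on one sublist only
print(listed_on_at_least("127.0.0.86", 3))   # a stricter "at least 3" rule
```

Running such a rule over a large ham corpus would give the FP rate for each threshold; the sketch only shows the decision logic, not any measured result.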

Re: [SURBL-Discuss] ANN: new surbl client (still beta)
by Jeff Chan 13 Feb '05

On Saturday, February 12, 2005, 2:41:36 AM, Alain Alain wrote:
>> >> - I've added a local skiplist with about top half of the public
>> >> "whitelist", no need to query those.
>>
>> When you say half, that may be more than optimal (should be about
>> 5000 records). SpamAssassin is using the top 125, which worked
>> out to about the 50%th percentile of all whitelist hits when we
>> first set this up. (Now that result is skewed *because*
>> SpamAssassin isn't checking those 125 any more, but their
>> snapshot of the 125 is still probably useful.
>>
>> I'd say anything between 100 and 1000 would probably be a good
>> compromise between list size and coverage.

> The only disadvantage I see from a bigger local skiplist is some local
> CPU usage for every uri in a email. Most pc's have plenty of CPU
> power ;-) If this could become a problem, I can lower or optimise the
> local checking. Are there any other disadvantages?

One reason SpamAssassin didn't want to hard code too many domains into their local whitelist was in case we needed to withdraw any, i.e. because they started spamming. The time between code releases can be many months, and some people may never update, so they wanted to be sure to get very hammy domains into that list. (While Yahoo and Microsoft probably aren't going to start spamming any time soon, that may be less certain about some of the less commonly seen domains.)

But I'm glad that you're trying to minimize the DNS queries.

Jeff C.
--
"If it appears in hams, then don't list it."
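
A local skiplist check of the kind discussed here might look like the following. It is a minimal sketch, not the beta client's actual code: the `SKIPLIST` contents and the `surbl_query_name` helper are hypothetical, and `multi.surbl.org` is assumed as the combined zone to query. Because the skiplist is a set, the lookup cost stays roughly constant as the list grows, so the CPU concern is small compared with the withdrawal problem described above.

```python
# Hypothetical skiplist; a real client would load the top entries of the
# public whitelist from a file that can be refreshed between releases.
SKIPLIST = frozenset({"yahoo.com", "microsoft.com"})


def surbl_query_name(domain, zone="multi.surbl.org", skiplist=SKIPLIST):
    """Return the DNS name to query, or None if the domain is skipped locally."""
    domain = domain.lower().rstrip(".")
    if domain in skiplist:
        return None          # very hammy domain: no DNS query needed
    return f"{domain}.{zone}"


for d in ("Yahoo.com", "example.net"):
    print(d, "->", surbl_query_name(d))
```

Keeping the skiplist in a separately updatable file rather than in the code is one way to make it possible to withdraw a domain without waiting for a new release.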

Re: [SURBL-Discuss] FP rate?
by Jeff Chan 12 Feb '05

On Saturday, February 12, 2005, 3:09:46 AM, Alain Alain wrote:
>> > I know that not all FP's are reported and there are
>> > probably no exact numbers, but it should give a good idea. Or am I
>> > wrong?
>>
>> The FP reports are probably too few overall to be meaningful in
>> terms of differentiating performance between lists. There just
>> aren't that many, maybe a few a day on average.
>>

> Yes, but I wasn't thinking on differentiating between the lists, there
> are other results for. What I was thinking on was the number of FP's
> that exists on more than one list. This is very usefull information
> when combining lists. If almost no FP's do occur on more than one
> list (at the same time) requiring appearance on at least 2 lists
> would be a very safe one.

Good point. Anecdotally, FPs don't tend to appear on multiple lists very often, at least the FPs we've seen reported. This is unmeasured, just a subjective opinion. If we had some of the list data in combined form as I had proposed then we could test it better. I suppose I could just do it. ;-)

Jeff C.
--
"If it appears in hams, then don't list it."

Re: [SURBL-Discuss] FP rate?
by Jeff Chan 12 Feb '05

On Friday, February 11, 2005, 5:29:29 PM, Alain Alain wrote:
>> That said, here are some results Daniel Quinlan posted from the
>> mass-checks on the SpamAssassin corpora around 26 January 2005:
>>
>> > Weekly mass-check results for SURBL:
>>
>> >OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>> > 217996   164295    53701    0.754   0.00    0.00  (all messages)
>> >100.000  75.3661  24.6339    0.754   0.00    0.00  (all messages as %)
>> > 11.644  15.4490   0.0037    1.000   0.98    3.90  URIBL_SC_SURBL
>> > 39.572  52.4976   0.0261    1.000   0.98    3.00  URIBL_JP_SURBL
>> > 51.955  68.9236   0.0391    0.999   0.96    2.00  URIBL_OB_SURBL
>> >  5.690   7.5492   0.0000    1.000   0.95    2.01  URIBL_AB_SURBL
>> > 53.948  71.5238   0.1769    0.998   0.83    0.54  URIBL_WS_SURBL
>> >  0.030   0.0396   0.0000    1.000   0.51    0.84  URIBL_PH_SURBL
>>

> Am I right with the following :
> JP has 0.0261% FP on 24.6339% of all msg --> 0.0065% of all msg
> (is less than 1 in 15.000)

That sounds right, but the particular proportions of spam versus ham may not be meaningful, i.e. they may not be representative of an actual mail stream. So the percentages are probably more usefully compared only to spam or ham and not to a combined total of messages.

Certainly the relative percentages within spam or ham are meaningful and mostly useful with the caveat that the spam detection rates are wrong for quickly moving data in SC and AB since the test corpora cover too much time for them. (This is more true for spam than ham since spam domains vary quickly with time, but ham domains are relatively steady.)

>> SC and AB have much better real world results than show above
>> because their time period is much shorter than the test
>> corpora's.

> Yes, but maybe the FP's will grow faster ;-)

That tends not to be the case. The SpamCop data is filtered multiple times and is human-checked at the front end. The SC FP rates are consistently among the lowest, and the spam detection rates are very high for a very small list. In short it's an effective strategy.

>> Also note that the JP data is now removed from the WS data, and
>> some old data was removed from WS. So the WS spam and ham hit
>> rates have probably both decreased since this check was done.
>> JP should be about the same.

> That will show in the future. Is also a good thing.

Yes, it's fairer to the data sources.

>> > And if possible, has anybody statistics from FP's that where on
>> > several of the sublists -at the same time-?

> [snip]

>> I don't think that is known yet. I had proposed setting up some
>> test lists with combinations like this, but got no response. ;-)
>>
>> If it *is* known I think we'd all like to hear about it. :-)

> I think it could be known to the great people that check the FP
> reports. Normally they check against all sublists (I hope) and fix
> them all.

When we whitelist a domain, it's excluded from all SURBLs. The original data source is usually notified.

> I know that not all FP's are reported and there are
> probably no exact numbers, but it should give a good idea. Or am I
> wrong?

The FP reports are probably too few overall to be meaningful in terms of differentiating performance between lists. There just aren't that many, maybe a few a day on average.

Jeff C.
--
"If it appears in hams, then don't list it."
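
The arithmetic behind the "0.0065% of all msg" estimate can be worked through with the figures from the mass-check table. The helper name `fp_share_of_all_mail` is made up for illustration; the inputs are the HAM% values for "(all messages as %)" and URIBL_JP_SURBL, plus the ham message count. As noted above, the result depends on the corpus's spam/ham mix, so the second call uses a purely hypothetical 50% ham share to show how much the figure moves.

```python
def fp_share_of_all_mail(ham_pct_of_corpus, rule_ham_hit_pct):
    """Percent of ALL messages that are ham hit by the rule."""
    return ham_pct_of_corpus * rule_ham_hit_pct / 100.0


HAM_MESSAGES = 53701   # ham count from the "(all messages)" row
HAM_PCT = 24.6339      # ham as a percentage of the whole corpus
JP_HAM_HIT_PCT = 0.0261  # percentage of ham hit by URIBL_JP_SURBL

corpus_fp_pct = fp_share_of_all_mail(HAM_PCT, JP_HAM_HIT_PCT)
print(f"JP FP share of corpus: {corpus_fp_pct:.4f}%")
print(f"roughly 1 message in {100 / corpus_fp_pct:,.0f}")
print(f"ham messages hit: about {HAM_MESSAGES * JP_HAM_HIT_PCT / 100:.0f}")

# Hypothetical mail stream with 50% ham instead of 24.6%:
print(f"at 50% ham: {fp_share_of_all_mail(50.0, JP_HAM_HIT_PCT):.4f}%")
```

The per-ham rate (0.0261%) is the stable number to compare between lists; the share of all mail is just that rate scaled by whatever ham fraction a given site actually sees.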

Re: [SURBL-Discuss] ANN: new surbl client (still beta)
by Alain 12 Feb '05

Hi Jeff

> On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
> >> Generally speaking it may be better to apply this kind of
> >> filtering at the server level since there are economies of scale,
> >> especially in terms of things like DNS lookups and caching. If
> >> we suddenly get 100k more DNS clients, that could tax the name
> >> servers somewhat. If those same 100k users were using 100
> >> servers instead, the DNS loading would be quite a bit less. In
> >> that sense centralization is desirable.
>
> > Mmmm isn't the dns server from the ISP caching the dns requests? I
> > would think it doesn't make a big difference (except when a server is
> > rsync'ing). The difference could be that end users check their e-mail
> > not when arriving on the MTA, but later.
>
> One difference is that the ISP's mail server may see many of the
> same spams within a short period of time, and the lookups would
> probably tend to be cached over that time span. Individual users
> may POP or IMAP their messages at any random time, so the DNS
> cache hit rate may be lower for them.

This will only be the case for spam e-mail, not for domains inside ham e-mail.

>
> I think we're agreeing, but I've never tried to quantify the
> difference between these. We can propose that there's some
> difference but how much is unknown. I would suggest a pretty
> strong cache effect for mail servers however.

But the good news is: the more users, the more caching. So the burden on the nameservers will grow more slowly.

Alain

Re: [SURBL-Discuss] FP rate?
by Alain 12 Feb '05

Hi Jeff

> >> That said, here are some results Daniel Quinlan posted from the
> >> mass-checks on the SpamAssassin corpora around 26 January 2005:
> >>
> >> > Weekly mass-check results for SURBL:
> >>
> >> >OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
> >> > 217996   164295    53701    0.754   0.00    0.00  (all messages)
> >> >100.000  75.3661  24.6339    0.754   0.00    0.00  (all messages as %)
> >> > 11.644  15.4490   0.0037    1.000   0.98    3.90  URIBL_SC_SURBL
> >> > 39.572  52.4976   0.0261    1.000   0.98    3.00  URIBL_JP_SURBL
> >> > 51.955  68.9236   0.0391    0.999   0.96    2.00  URIBL_OB_SURBL
> >> >  5.690   7.5492   0.0000    1.000   0.95    2.01  URIBL_AB_SURBL
> >> > 53.948  71.5238   0.1769    0.998   0.83    0.54  URIBL_WS_SURBL
> >> >  0.030   0.0396   0.0000    1.000   0.51    0.84  URIBL_PH_SURBL
> >>
>
> > Am I right with the following :
>
> > JP has 0.0261% FP on 24.6339% of all msg --> 0.0065% of all msg
> > (is less than 1 in 15.000)
>
> That sounds right, but the particular proportions of spam versus
> ham may not be meaningful, i.e. they may not be representative
> of an actual mail stream. So the percentages are probably more
> usefully compared only to spam or ham and not to a combined total
> of messages.

ok

>
> Certainly the relative percentages within spam or ham are
> meaningful and mostly useful with the caveat that the spam
> detection rates are wrong for quickly moving data in SC and AB
> since the test corpora cover too much time for them. (This is
> more true for spam than ham since spam domains vary quickly with
> time, but ham domains are relatively steady.)
>

ok

> >> SC and AB have much better real world results than show above
> >> because their time period is much shorter than the test
> >> corpora's.
>
> > Yes, but maybe the FP's will grow faster ;-)
>
> That tends not to be the case. The SpamCop data is filtered
> multiple times and is human-checked at the front end. The SC FP
> rates are consistently among the lowest, and the spam detection
> rates are very high for a very small list. In short it's an
> effective strategy.
>

OK, and overall I am impressed with the low FP rates on all lists.

> >> Also note that the JP data is now removed from the WS data, and
> >> some old data was removed from WS. So the WS spam and ham hit
> >> rates have probably both decreased since this check was done.
> >> JP should be about the same.
>
> > That will show in the future. Is also a good thing.
>
> Yes, it's fairer to the data sources.
>
> >> > And if possible, has anybody statistics from FP's that where on
> >> > several of the sublists -at the same time-?
>
> > [snip]
>
> >> I don't think that is known yet. I had proposed setting up some
> >> test lists with combinations like this, but got no response. ;-)
> >>
> >> If it *is* known I think we'd all like to hear about it. :-)
>
> > I think it could be known to the great people that check the FP
> > reports. Normally they check against all sublists (I hope) and fix
> > them all.
>
> When we whitelist a domain, it's excluded from all SURBLs. The
> original data source is usually notified.
>
> > I know that not all FP's are reported and there are
> > probably no exact numbers, but it should give a good idea. Or am I
> > wrong?
>
> The FP reports are probably too few overall to be meaningful in
> terms of differentiating performance between lists. There just
> aren't that many, maybe a few a day on average.
>

Yes, but I wasn't thinking of differentiating between the lists; there are other results for that. What I was thinking of was the number of FPs that exist on more than one list. This is very useful information when combining lists. If almost no FPs occur on more than one list (at the same time), requiring appearance on at least 2 lists would be a very safe approach.

Alain

Re: [SURBL-Discuss] ANN: new surbl client (still beta)
by Jeff Chan 12 Feb '05

On Saturday, February 12, 2005, 2:34:20 AM, Alain Alain wrote:
>> Generally speaking it may be better to apply this kind of
>> filtering at the server level since there are economies of scale,
>> especially in terms of things like DNS lookups and caching. If
>> we suddenly get 100k more DNS clients, that could tax the name
>> servers somewhat. If those same 100k users were using 100
>> servers instead, the DNS loading would be quite a bit less. In
>> that sense centralization is desirable.

> Mmmm isn't the dns server from the ISP caching the dns requests? I
> would think it doesn't make a big difference (except when a server is
> rsync'ing). The difference could be that end users check their e-mail
> not when arriving on the MTA, but later.

One difference is that the ISP's mail server may see many of the same spams within a short period of time, and the lookups would probably tend to be cached over that time span. Individual users may POP or IMAP their messages at any random time, so the DNS cache hit rate may be lower for them.

I think we're agreeing, but I've never tried to quantify the difference between these. We can propose that there's some difference but how much is unknown. I would propose a pretty strong cache effect for mail servers however.

Jeff C.
--
"If it appears in hams, then don't list it."
