-----Original Message-----
From: Jeff Chan [mailto:jeffc@surbl.org]
Sent: Thursday, September 16, 2004 7:01 PM
To: SURBL Discuss
Subject: Re: [SURBL-Discuss] RFC: pj.surbl.org - list from Joe Wein and Prolocation data
On Wednesday, September 15, 2004, 7:06:34 PM, David Hooton wrote:
On Wed, 15 Sep 2004 16:43:32 -0700, Jeff Chan <jeffc@surbl.org> wrote:
we thought it might be useful to make the PJ data available as a separate list, at least within multi.surbl.org, the combined SURBL. We'd like to get your comments on this.
I think having a separate list makes sense if the data quality is different to that of the pooled data it was previously connected to.
We're also wondering whether the PJ data should be taken out of WS, or left in, if we do make PJ a distinct list.
No point in lowering the hitrate of the superset, any additional score added to a spam is better than none at all.
Please comment,
The greater choice and control we provide SURBL users the better. If we have the ability to sustainably break data out like this and provide ongoing data quality ratings to aid score adjustments I think we should do it.
Thanks for your feedback David. Does anyone else have comments about the possibility of PJ? Making separate lists from the WS data is a little different from the direction we've been going lately, so it would be nice to get comments on it. We're still somewhat undecided about whether to do it or not....
As you can see from the first message about this, the FP rates of PJ look significantly lower than WS as a whole.
AS usual, I'm thinking different from everyone else :)
I do NOT like the idea of more lists.
1) The lists are dynamic, so FP rates will change.
2) Too many lists make it more difficult for the devs to GA and perceptron run all of them, causing a slowdown in scoring for SA and others.
3) Run a diff and find out where we have our FPs.
4) More lookups for multi.
5) Too many list options will drive some potential users away.
6) K.I.S.S.
The only reason I see having more lists is if the data is specifically different throughout the whole list.
i.e.: phishing, UC, regular spam, blog, etc.
His list data is the same kind as WS. So really... why separate?
We just keep getting our FP rate lower and it will all be good.
--Chris (The devils advocate.)
Chris Santerre wrote:
AS usual, I'm thinking different from everyone else :)
Usually I like your ideas, so this doesn't add up.
The only reason I see having more lists is if the data is specifically different throughout the whole list.
AOL. Maybe you can adjust the policies for WS and PJ to get one list. IIRC Joe's main trick was "age of domain".
Bye, Frank
On Friday, September 17, 2004, 8:09:06 AM, Frank Ellermann wrote:
Chris Santerre wrote:
The only reason I see having more lists is if the data is specifically different throughout the whole list.
AOL. Maybe you can adjust the policies for WS and PJ to get one list. IIRC Joe's main trick was "age of domain".
Bye, Frank
Age of domain is probably the most powerful tool for both Joe and Outblaze. I think Outblaze is a little more strict about applying it than Joe.
I agree it should be used to potentially improve the quality of all data and spot FPs better.
Jeff C.
Chris Santerre wrote:
AS usual, I'm thinking different from everyone else :) I do NOT like the idea of more lists.
Neither do I. At least not more surbl-type lists that are served by surbl.org.
I'd prefer just one surbl.org-list, serving entries from a few sources all conforming to the strict "we do not want any FPs" philosophy of surbl. One manually checked list that can relatively safely be used to block/drop email, rather than just score email.
Then I'd like to see a lot of surbl-type lists *not* served by surbl.org, that are provided based on different philosophies - more aggressive, accepting a higher degree of collateral damage, etc.
Just like we have with RBLs.
I want an SBL type surbl list, and I think surbl.org is the prime candidate for that.
But I also want a SPEWS type surbl list, and I don't think that it should or could be done/served by surbl.org.
Etc.
Having more and more different surbl.org lists that we try to fit into the same basic philosophy of "no FPs" is just complicating things and confusing existing and potential users.
Patrik
On Friday, September 17, 2004, 11:53:07 AM, Patrik Nilsson wrote:
Chris Santerre wrote:
AS usual, I'm thinking different from everyone else :) I do NOT like the idea of more lists.
Neither do I. At least not more surbl-type lists that are served by surbl.org.
I'd prefer just one surbl.org-list, serving entries from a few sources all conforming to the strict "we do not want any FPs" philosophy of surbl. One manually checked list that can relatively safely be used to block/drop email, rather than just score email.
In principle all the lists except OB are hand-checked. In the case of SC, the checking is done by SpamCop submitters who can be a little inconsistent, which is why we add mechanisms to limit mistakes such as an inclusion threshold dependent on the number of reports.
In a practical sense there is one list which most people will use: multi.
I agree about working towards a list which is useful for dropping spams. Such a list needs to have very low FPs. Zero would be ideal, though that's arguably impossible.
Then I'd like to see a lot of surbl-type lists *not* served by surbl.org, that are provided based on different philosophies - more aggressive, accepting a higher degree of collateral damage, etc.
Just like we have with RBLs.
I agree that having a diversity of data sources is probably useful. Which is why I was glad to hear that the mailpolice lists could be used with SURBL code with some good results.
I want an SBL type surbl list, and I think surbl.org is the prime candidate for that.
But I also want a SPEWS type surbl list, and I don't think that it should or could be done/served by surbl.org.
Etc.
A very aggressive list could be useful for home users, but SURBLs will have the most impact if we get the data clean enough for large providers to use. It would be nice to stop spam before it ever reaches users, i.e. at the ISP level, but FPs get in the way of that. Therefore a list with lower FPs such as PJ is potentially quite useful.
Having more and more different surbl.org lists that we try to fit into the same basic philosophy of "no FPs" is just complicating things and confusing existing and potential users.
Patrik
Most SA users probably just use the default rules, so if we get PJ into the standard config file, there should not be much confusion. And we already have individual SURBL lists like ws, sc, ob, ab.
Jeff C.
On Friday, September 17, 2004, 6:46:56 AM, Chris Santerre wrote:
I do NOT like the idea of more lists.
- The lists are dynamic, so FP rates will change.
It's true that FP rates vary over time for all lists, but the FPs of PJ look consistently lower than WS.
- Too many lists make it more difficult for the devs to GA and perceptron run all of them, causing a slowdown in scoring for SA and others.
While it's true that a PJ list would be one more rule for the SpamAssassin mass checks to score, I doubt that one more list would slow it down significantly in the larger picture. Mass checks are already scoring a gazillion other rules....
- Run a diff and find out where we have our FPs.
The diffs between WS and PJ are about 26k records out of 46k records, perhaps too many to check by hand. Or did you mean just the FPs?
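One way to read "run a diff" is as a set comparison of the two record lists. The sketch below is only an illustration under that assumption; the function name and the sample domains are hypothetical and are not real WS or PJ data.

```python
def compare_lists(ws_records, pj_records):
    """Split two record lists into WS-only, PJ-only and shared entries."""
    ws, pj = set(ws_records), set(pj_records)
    return {
        "only_ws": ws - pj,   # records that a separate PJ list would not carry
        "only_pj": pj - ws,
        "both": ws & pj,
    }

# Hypothetical sample records, not taken from any real list.
ws_sample = ["spam-a.example", "spam-b.example", "shop.example"]
pj_sample = ["spam-a.example", "spam-b.example"]

result = compare_lists(ws_sample, pj_sample)
for key in ("only_ws", "only_pj", "both"):
    print(key, sorted(result[key]))
```

With roughly 26k differing records, the output of such a comparison would still need some further filter before anyone checks it by hand.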
- More lookups for multi
multi doesn't work that way. We can have an infinite number of lists in multi (for the same overall universe of domains and IPs) and it's still just one lookup per wild URI. That's a major advantage of a combined list: one lookup gets you all the lists.
Remember that the PJ records are already in multi, as part of WS, so there would be no new records added by having PJ separate, just some changed return codes and some slightly longer TXT records with "[PJ]" added.
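A rough sketch of why an extra list in multi costs no extra lookup: the query name depends only on the domain, and list membership is read from bits of the answer. The values 4 for WS and 64 for PJ/JP are the ones mentioned later in this thread; the other bit values and the helper names are assumptions for illustration. No DNS query is actually sent here.

```python
# Assumed bit assignments in the last octet of a multi answer.
# Only ws = 4 (and pj = 64 below) appear in this thread; the rest are illustrative.
LIST_BITS = {"sc": 2, "ws": 4, "ph": 8, "ob": 16, "ab": 32}

def query_name(domain, zone="multi.surbl.org"):
    """Build the single DNS name that is looked up for one URI domain."""
    return f"{domain.rstrip('.')}.{zone}"

def lists_in_answer(answer, bits=LIST_BITS):
    """Return the names of the lists whose bit is set in a 127.0.0.x answer."""
    last_octet = int(answer.split(".")[3])
    return sorted(name for name, bit in bits.items() if last_octet & bit)

# Adding a list only adds one bit to decode; the query itself is unchanged.
bits_with_pj = dict(LIST_BITS, pj=64)

print(query_name("example.com"))
print(lists_in_answer("127.0.0.68", LIST_BITS))
print(lists_in_answer("127.0.0.68", bits_with_pj))
```

The same answer is decoded twice: a client that does not know about the new bit simply ignores it, while an updated client sees both lists.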
- Too many list options will drive some potential users away.
Most users probably just use the defaults. We would want to add PJ to the default configs for SA3, if we do it.
- K.I.S.S.
The only reason I see having more lists is if the data is specifically different throughout the whole list.
i.e.: phishing, UC, regular spam, blog, etc.
His list data is the same kind as WS. So really... why separate?
sc, ws, ob and ab all have email spam URI data, but they're all separate lists because they represent different types of data sources (human reports, manual lists, filtered traps, etc.).
I actually wanted the JW data to be separate in the beginning because it was a distinctly different and new data source with a different inclusion process, different spamtrap feeds, etc.
We just keep getting our FP rate lower and it will all be good.
We definitely need to get the FPs in WS lower, independent of anything else. FPs only hurt WS and make it less useful to people.
Jeff C.
Jeff Chan wrote:
We can have an infinite number of lists in multi
There are many ways to say "7 bits ought to be enough for everybody", but "infinite" is a bit exaggerated. ;-)
Or it's a new octal system, 0, 1, 2, 3, 4, 5, 6, 7, INF
Remember that the PJ records are already in multi, as part of WS
That's cheating. If the WS bit is set I'd expect a WS entry, with the WS policy and whitelisting instructions.
Sure, at the moment there are no different whitelisting instructions for the MULTI sets, but that's not obvious. And sooner or later it will change.
I actually wanted the JW data to be separate in the beginning because it was a distinctly different and new data source with a different inclusion process, different spamtrap feeds, etc.
If it's really very different, then it's also good enough for its own MULTI bit. But a different set of spamtraps is no real difference. A different policy for inclusions, exclusions, or whitelisting is interesting.
FPs only hurt WS and make it less useful to people.
People expecting no FPs at all should try the empty list, works like a charm. Of course it won't identify any spam.
Bye, Frank
On Saturday, September 18, 2004, 11:27:05 PM, Frank Ellermann wrote:
Jeff Chan wrote:
We can have an infinite number of lists in multi
There are many ways to say "7 bits ought to be enough for everybody", but "infinite" is a bit exaggerated. ;-)
Or it's a new octal system, 0, 1, 2, 3, 4, 5, 6, 7, INF
Yes there is some cost: 1 bit and a few bytes of text. It's pretty minor though. And the number of lookups doesn't increase.
Remember that the PJ records are already in multi, as part of WS
That's cheating. If the WS bit is set I'd expect a WS entry, with the WS policy and whitelisting instructions.
Well one of the problems with WS is that it has multiple data sources in it, so it's hard to tell exactly where any given record came from.
Sure, at the moment there are no different whitelisting instructions for the MULTI sets, but that's not obvious. And sooner or later it will change.
Actually ws, ob, sc, ab, and ph all have different whitelisting instructions already, if the updates go back to the original source, which is preferable. The whitelisting procedure is described on the lists page:
http://www.surbl.org/lists.html
I actually wanted the JW data to be separate in the beginning because it was a distinctly different and new data source with a different inclusion process, different spamtrap feeds, etc.
If it's really very different, then it's also good enough for its own MULTI bit. But a different set of spamtraps is no real difference. A different policy for inclusions, exclusions, or whitelisting is interesting.
Yes, it's different spam traps, and different policies for inclusion, etc.
FPs only hurt WS and make it less useful to people.
People expecting no FPs at all should try the empty list, works like a charm. Of course it won't identify any spam.
Bye, Frank
Well that's not really a reasonable alternative.
We should try to maximize spam detection and minimize FPs. Both functions need to be optimized simultaneously.
Jeff C.
Hi!
Remember that the PJ records are already in multi, as part of WS
That's cheating. If the WS bit is set I'd expect a WS entry, with the WS policy and whitelisting instructions.
Sure, at the moment there are no different whitelisting instructions for the MULTI sets, but that's not obvious. And sooner or later it will change.
There is generic whitelisting on *ALL* SURBL lists, and that's done at a central level. That will be the most important mask, since all lists pass through it.
I actually wanted the JW data to be separate in the beginning because it was a distinctly different and new data source with a different inclusion process, different spamtrap feeds, etc.
If it's really very different, then it's also good enough for its own MULTI bit. But a different set of spamtraps is no real difference. A different policy for inclusions, exclusions, or whitelisting is interesting.
The dataset is much smaller and still seems to have a lower FP rate. Theo (SA) and some others, including myself, did large checks and found the same.
Today's stats, though that's only from 11 hours of real-life data:
SpamAssassin tag hits: (top 100)
#1 53053 URIBL_WS_SURBL
#2 51711 URIBL_PJ_SURBL
#3 51702 URIBL_SBL
#4 49008 BAYES_99
#5 48227 URIBL_OB_SURBL
#6 45620 RCVD_IN_BL_SPAMCOP_NET
#7 45489 HTML_MESSAGE
#8 35014 URIBL_SC_SURBL
#9 29758 URIBL_AB_SURBL
#10 27992 MIME_HTML_ONLY
The WS stats are still the combined lists. I also did tests with a special zonefile, compiled for this test, where PJ data was taken out of WS. There PJ performed better than the whole WS. That was my main reason to propose a separate list. It's smaller, catches more than the combined list, and has a lower FP rating than the combined list.
Bye, Raymond.
Hi!
#3 51702 URIBL_SBL
#4 49008 BAYES_99
#5 48227 URIBL_OB_SURBL
#6 45620 RCVD_IN_BL_SPAMCOP_NET
#7 45489 HTML_MESSAGE
#8 35014 URIBL_SC_SURBL
#9 29758 URIBL_AB_SURBL
#10 27992 MIME_HTML_ONLY
The WS stats are still the combined lists. I also did tests with a special zonefile, compiled for this test, where PJ data was taken out of WS. There PJ performed better than the whole WS. That was my main reason to propose a separate list. It's smaller, catches more than the combined list, and has a lower FP rating than the combined list.
I forgot to add that new entries also seem to appear faster, but that's based on our local spam only.
Bye, Raymond.
On Sunday, September 19, 2004, 2:11:20 AM, Raymond Dijkxhoorn wrote:
Remember that the PJ records are already in multi, as part of WS
That's cheating. If the WS bit is set I'd expect a WS entry, with the WS policy and whitelisting instructions.
Sure, at the moment there are no different whitelisting instructions for the MULTI sets, but that's not obvious. And sooner or later it will change.
There is generic whitelisting on *ALL* SURBL lists, and that's done at a central level. That will be the most important mask, since all lists pass through it.
To clarify a little, we can whitelist over all SURBLs, and we do that a lot since any FPs found in one list should be excluded from other lists also.
But there is also whitelisting per data source, which means contacting the individual data sources and asking them to exclude right at their source data. Those are the contacts mentioned on the lists page.
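The central whitelisting step can be pictured as one filter applied to every list before publication. This is a minimal sketch under that assumption; the function name, the data layout, and the sample domains are hypothetical and do not describe the actual SURBL tooling.

```python
def apply_whitelist(lists, whitelist):
    """Remove whitelisted domains from every list, case-insensitively."""
    blocked = {domain.lower() for domain in whitelist}
    return {
        name: [d for d in domains if d.lower() not in blocked]
        for name, domains in lists.items()
    }

# Hypothetical sample data: one FP that shows up in two different lists.
lists = {
    "ws": ["spam-a.example", "good-shop.example"],
    "sc": ["good-shop.example", "spam-b.example"],
}
cleaned = apply_whitelist(lists, ["Good-Shop.example"])
print(cleaned)
```

Per-source whitelisting is a separate step: the entry would still need to be removed by the data source itself, or it reappears in the raw feed.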
[...]
SpamAssassin tag hits: (top 100)
#1 53053 URIBL_WS_SURBL
#2 51711 URIBL_PJ_SURBL
The WS stats are still the combined lists. I also did tests with a special zonefile, compiled for this test, where PJ data was taken out of WS. There PJ performed better than the whole WS. That was my main reason to propose a separate list. It's smaller, catches more than the combined list, and has a lower FP rating than the combined list.
To be clear, the combined list that Raymond is referring to in this case is WS, which has several different data sources in it. (Not to be confused with multi.surbl.org which combines several separate lists together.)
Jeff C.
Raymond Dijkxhoorn wrote:
It's smaller, catches more than the combined list, and has a lower FP rating than the combined list.
Sounds very good. Technically only the lower FP rate is a convincing argument for an independent MULTI bit / set, and there are only 7 MULTI bits / sets.
If some users would want to use JP but not WS, then they'd need a separate bit. Somebody said that the lists overlap, therefore enumerations (0 null, 1 WS, 2 PJ, 3 third list) won't work to identify the source, and it has to be 0 null, 1 WS, 2 PJ, 3 WS+PJ (shifted to 2 corresponding MULTI bits).
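The difference between an enumeration and one bit per list can be shown with a two-list toy example. The bit positions here (1 and 2) are just the unshifted values from the paragraph above, not the real MULTI positions.

```python
from itertools import combinations

LISTS = ["WS", "PJ"]

# One bit per list: bit 0 for WS, bit 1 for PJ (unshifted toy values).
BIT_CODES = {name: 1 << index for index, name in enumerate(LISTS)}

def encode(members):
    """OR together the bits of every list a domain appears on."""
    value = 0
    for name in members:
        value |= BIT_CODES[name]
    return value

# Every combination of memberships, including overlap, gets its own value.
for size in range(len(LISTS) + 1):
    for combo in combinations(LISTS, size):
        print(encode(combo), "+".join(combo) or "null")
```

A plain enumeration would need the value 3 for a third list, so a domain on both WS and PJ could not be expressed; with bits, 3 is exactly WS+PJ.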
Bye, Frank
On Sunday, September 19, 2004, 10:29:49 AM, Frank Ellermann wrote:
Raymond Dijkxhoorn wrote:
It's smaller, catches more than the combined list, and has a lower FP rating than the combined list.
Sounds very good. Technically only the lower FP rate is a convincing argument for an independent MULTI bit / set, and there are only 7 MULTI bits / sets.
If some users would want to use JP but not WS, then they'd need a separate bit. Somebody said that the lists overlap, therefore enumerations (0 null, 1 WS, 2 PJ, 3 third list) won't work to identify the source, and it has to be 0 null, 1 WS, 2 PJ, 3 WS+PJ (shifted to 2 corresponding MULTI bits).
Bye, Frank
Well we would not shift the bits around. If we had separate bits for WS, JP, and WS+JP, the original WS+JP would be in the same place and the other two (separate lists) would get new bits.
But we might also lean towards taking JP out of WS if we do this (i.e., no WS+JP).
Jeff C.
Jeff Chan wrote:
But we might also lean towards taking JP out of WS if we do this (i.e., no WS+JP).
That's what I wanted to say, treat both as independent sets:
WS = only WS (multi 127.0.0.4), JP = only JP (multi 127.0.0.64)
The last available multi bit / set / list is then 128 (bit 7).
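A small sketch of the bookkeeping behind "the last available bit is 128". The values for ws (4) and jp (64) are from this message; the other assignments are assumptions made only so the arithmetic has something to work with, and bit 0 is skipped because 127.0.0.1 is left unused in this scheme.

```python
# ws = 4 and jp = 64 come from this thread; sc, ph, ob and ab are assumed values.
USED_BITS = {"sc": 2, "ws": 4, "ph": 8, "ob": 16, "ab": 32, "jp": 64}

def free_values(used, low_bit=1, high_bit=7):
    """List last-octet values in bits low_bit..high_bit that are not assigned."""
    used_mask = 0
    for value in used.values():
        used_mask |= value
    return [1 << b for b in range(low_bit, high_bit + 1) if not used_mask & (1 << b)]

def answer_for(names, used=USED_BITS):
    """Build the 127.0.0.x answer for a domain that is on the named lists."""
    octet = 0
    for name in names:
        octet |= used[name]
    return f"127.0.0.{octet}"

print(free_values(USED_BITS))
print(answer_for(["ws"]))
print(answer_for(["ws", "jp"]))
```

With WS and JP treated as independent sets, a domain on both simply gets both bits, so no separate WS+JP value is needed.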
Bye, Frank
On Sunday, September 19, 2004, 4:03:20 PM, Frank Ellermann wrote:
Jeff Chan wrote:
But we might also lean towards taking JP out of WS if we do this (i.e., no WS+JP).
That's what I wanted to say, treat both as independent sets:
WS = only WS (multi 127.0.0.4), JP = only JP (multi 127.0.0.64)
The last available multi bit / set / list is then 128 (bit 7).
Bye, Frank
Yes, that would probably be the plan if we did it. Taking the JP data out of WS would probably depend on whether we can get the change into SA 3.0 before it is released.
If we can get both WS and JP into the default SA 3 config then it should be ok to have them be separate, and then ask everyone using WS already to add JP also.
Most people using SURBLs will probably be using them in SA 3 when it comes out.
Jeff C.
In message 414E1038.3A78@xyzzy.claranet.de, Frank Ellermann writes:
Jeff Chan wrote:
But we might also lean towards taking JP out of WS if we do this (i.e., no WS+JP).
That's what I wanted to say, treat both as independent sets:
WS = only WS (multi 127.0.0.4), JP = only JP (multi 127.0.0.64)
The last available multi bit / set / list is then 128 (bit 7).
Shouldn't it be possible to use the second and third field as well? For example, multi 127.0.1.0 for an eighth list? Of course, some code would need changes...
But then, my knowledge of IP addressing is less than perfect, so I might be wrong. :-)
//Christer
On Monday, September 20, 2004, 3:49:24 AM, Christer Borang wrote:
In message 414E1038.3A78@xyzzy.claranet.de, Frank Ellermann writes:
Jeff Chan wrote:
But we might also lean towards taking JP out of WS if we do this (i.e., no WS+JP).
That's what I wanted to say, treat both as independent sets:
WS = only WS (multi 127.0.0.4), JP = only JP (multi 127.0.0.64)
The last available multi bit / set / list is then 128 (bit 7).
Shouldn't it be possible to use the second and third field as well? For example, multi 127.0.1.0 for an eighth list? Of course, some code would need changes...
Yes, the additional 16 bits in the other two octets should be available for other lists, though I doubt we'll get that many. Presumably the programs would not need to change, only some additional list configs would be needed.
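To illustrate what using the other two octets would mean for a client, here is a hedged sketch that treats the last three octets of a 127.x.y.z answer as one 24-bit field. The function name and the bit numbering (bit 0 is the lowest bit of the last octet) are assumptions; nothing here reflects an existing SURBL or SpamAssassin implementation.

```python
def set_bit_indexes(answer):
    """Return the indexes (0..23) of bits set in the last three octets."""
    first, second, third, fourth = (int(part) for part in answer.split("."))
    if first != 127:
        raise ValueError("expected an answer in 127.0.0.0/8")
    value = (second << 16) | (third << 8) | fourth
    return [index for index in range(24) if value & (1 << index)]

print(set_bit_indexes("127.0.0.68"))  # bits for values 4 and 64
print(set_bit_indexes("127.0.1.0"))   # first bit of the third octet
```

A client that only inspects the last octet would see 0 for 127.0.1.0, which is why code that masks a single byte would need some change even if the list configs stay simple.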
Jeff C.
Jeff Chan wrote:
Christer Borang wrote:
Shouldn't it be possible to use the second and third field as well? For example, multi 127.0.1.0 for an eighth list? Of course, some code would need changes...
AFAIK nobody does this for sets. The iadb.isipp.org codes use bits in three octets, but not for sets. OPM uses the fixed format 127.0.1.* for seven sets. Maybe the 1 is meant as the version (just an idea).
Yes, the additional 16 bits in the other two octets should be available for other lists, though I doubt we'll get that many.
Christer's idea of _23_ instead of _7_ sets is nice, but as he said, my script couldn't handle it without some modification. And for this modification OPM's strange 127.0.1.* would be (again) a special case.
Bye, Frank
P.S.: Another possible explanation for 127.0.1.*, it allows _8_ instead of _7_ sets without conflict with 127.0.0.1.
On Monday, September 20, 2004, 12:51:43 PM, Frank Ellermann wrote:
Jeff Chan wrote:
Christer Borang wrote:
Shouldn't it be possible to use the second and third field as well? For example, multi 127.0.1.0 for an eighth list? Of course, some code would need changes...
AFAIK nobody does this for sets. The iadb.isipp.org codes use bits in three octets, but not for sets. OPM uses the fixed format 127.0.1.* for seven sets. Maybe the 1 is meant as the version (just an idea).
Yes, the additional 16 bits in the other two octets should be available for other lists, though I doubt we'll get that many.
Christer's idea of _23_ instead of _7_ sets is nice, but as he said, my script couldn't handle it without some modification. And for this modification OPM's strange 127.0.1.* would be (again) a special case.
Bye, Frank
P.S.: Another possible explanation for 127.0.1.*, it allows _8_ instead of _7_ sets without conflict with 127.0.0.1.
Not sure what you mean by "sets" in this case. Perhaps you mean a combination of lists. I think the bits in other octets (bytes) would be used simply to identify different lists, like those currently used in the last octet.
By the way, does anyone have any more comments about breaking out the JP data as a separate list?
Jeff C.
Jeff Chan wrote:
Not sure what you mean by "sets" in this case.
Bits used as sets, 1 = IN, 0 = OUT, each bit represents one set. In the case of MULTI a set is an independent list, SC, WS, PH, JP, etc. Implementations of PASCAL used this idea for "sets", bitwise operations on bytes are efficient. It starts to get more trouble if you need more than 8 sets ;-)
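The "bits as sets" idea maps directly onto bitwise operators: OR is union, AND is intersection, and AND with a complement is difference. A minimal sketch with plain integers follows; WS = 4 and JP = 64 are the values from this thread, and SC = 2 is an assumed value used only for the example.

```python
WS, JP = 4, 64   # values mentioned in this thread
SC = 2           # assumed value, for illustration only

entry_a = WS | JP   # a domain listed in WS and JP
entry_b = JP | SC   # a domain listed in JP and SC

union = entry_a | entry_b          # lists that contain either domain
intersection = entry_a & entry_b   # lists that contain both
difference = entry_a & ~entry_b    # lists that contain only the first

print(union, intersection, difference)
print(bool(entry_a & WS))  # membership test: is the WS bit set?
print(bool(entry_a & SC))
```

Each test is a single machine operation on one byte, which is the efficiency being referred to; beyond 8 sets the value no longer fits in one octet.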
identify different lists, like those currently used in the last octet.
Yes, up to 23 = 8 + 8 + 7 independent lists in a MULTI result. But then SURBL would be the first BL using more than 7 bits for this purpose.
Bye, Frank