Proposal for moving forward with JP list


Jeff Chan

21 Sep 2004

12:18 p.m.

OK we heard back from Theo that we probably won't be able to get JP into SpamAssassin 3.0, but we should be able to get it into 3.1. I believe the JP data and policies are different enough that it should be a separate list within multi.surbl.org, so I propose that we:

1. Add JP to multi.surbl.org now.

2. Keep the JP data in WS for now, so that regular 3.0 users get the advantages of JP also (as part of WS).

3. Ask SA to put JP into 3.1 for future use, and most significantly, separate scoring.

4. Remove JP from WS before the final 3.1 mass check and re-scoring is done, to make the two lists more separate for 3.1. (Note that the separation is removal of the specific subset arrangement suggested in #2. If that is done, there will still be some overlap of the records in WS and JP.)

5. Inform people about removing JP from WS before we do it, so existing WS users can add JP, etc.

How does this sound?

Jeff C.

-- Jeff Chan mailto:jeffc@surbl.org http://www.surbl.org/


Christiaan den Besten

21 Sep

2:06 p.m.

...

How does this sound?

Sounds like a decent plan :)

bye, Chris

Jeff Chan

22 Sep

7:32 a.m.

On Tuesday, September 21, 2004, 3:18:43 AM, Jeff Chan wrote:

...

OK we heard back from Theo that we probably won't be able to get JP into SpamAssassin 3.0, but we should be able to get it into 3.1. I believe the JP data and policies are different enough that it should be a separate list within multi.surbl.org, so I propose that we:

...

Add JP to multi.surbl.org now.

...

Keep the JP data in WS for now, so that regular 3.0 users

get the advantages of JP also (as part of WS).

...

Ask SA to put JP into 3.1 for future use, and most

significantly, separate scoring.

OK We would like to proceed with the first part of this. We propose adding JP to multi on Monday September 27, but keeping the JP data in WS for now, and asking SA to add JP to SA 3.1 after that change next Monday. We would want to announce the new list so that other programs using multi.surbl.org would know that the return values had changed. That would give them some time if they need to make adjustments to their code. JP would get the 64 bitmask, as in:

 2 = comes from sc.surbl.org
 4 = comes from ws.surbl.org
 8 = comes from phishing list (labelled as [ph] in multi)
16 = comes from ob.surbl.org
32 = comes from ab.surbl.org
64 = comes from jp list
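For programs that consume multi.surbl.org, the bit values listed above could be decoded along these lines. This is only a sketch: it assumes the lookup returns an A record of the form 127.0.0.X with the bitmask in the last octet, which is not spelled out in this message, and the function name and short labels are made up for illustration.

```python
# Bit values as listed in the message above.
# The short labels and the function name are illustrative, not part of any SURBL API.
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}


def decode_multi(address: str) -> list[str]:
    """Return the labels of the lists encoded in the last octet of an A record.

    Assumption: the answer looks like 127.0.0.X and X is the bitmask.
    """
    last_octet = int(address.rsplit(".", 1)[1])
    return [name for bit, name in SURBL_BITS.items() if last_octet & bit]


print(decode_multi("127.0.0.68"))  # 68 = 4 + 64, so ws and jp
print(decode_multi("127.0.0.4"))   # ws only
```

Code that tests individual bits with a mask like this keeps working when a new bit such as 64 appears; code that compares the whole octet against a fixed set of known values would need the adjustment mentioned above.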

Does anyone have comments?

Jeff C.

Frank Ellermann

9:55 p.m.

Jeff Chan wrote:

...

...
How does this sound?

Sound.

...

Does anyone have comments?

Thanks. Bye, Frank

Jeff Chan

23 Sep

1:45 a.m.

On Tuesday, September 21, 2004, 10:32:16 PM, Jeff Chan wrote:

...

On Tuesday, September 21, 2004, 3:18:43 AM, Jeff Chan wrote:

...
OK we heard back from Theo that we probably won't be able to get JP into SpamAssassin 3.0, but we should be able to get it into 3.1. I believe the JP data and policies are different enough that it should be a separate list within multi.surbl.org, so I propose that we:

...

...

Add JP to multi.surbl.org now.

...

...

Keep the JP data in WS for now, so that regular 3.0 users

get the advantages of JP also (as part of WS).

...

...

Ask SA to put JP into 3.1 for future use, and most

significantly, separate scoring.

...

OK We would like to proceed with the first part of this. We propose adding JP to multi on Monday September 27, but keeping the JP data in WS for now, and asking SA to add JP to SA 3.1 after that change next Monday. We would want to announce the new list so that other programs using multi.surbl.org would know that the return values had changed. That would give them some time if they need to make adjustments to their code. JP would get the 64 bitmask, as in:

...

 2 = comes from sc.surbl.org
 4 = comes from ws.surbl.org
 8 = comes from phishing list (labelled as [ph] in multi)
16 = comes from ob.surbl.org
32 = comes from ab.surbl.org
64 = comes from jp list

...

Does anyone have comments?

...

Jeff C.

I'm going to assume a lack of comments means everyone agrees....

Jeff C.

John Lundin

3:21 p.m.

On Wed, Sep 22, 2004 at 04:45:34PM -0700, Jeff Chan wrote:

...

...
OK We would like to proceed with the first part of this. We propose adding JP to multi on Monday September 27, but keeping the JP data in WS for now, and asking SA to add JP to SA 3.1 after that change next Monday. We would want to announce the new list so that other programs using multi.surbl.org would know that the return values had changed. That would give them some time if they need to make adjustments to their code. JP would get the 64 bitmask [...]

I'm going to assume a lack of comments means everyone agrees....

I don't disagree, but do have a couple of comments.

First, when JP drops out of WS there will be a content change. One of the reasons for adding JP is to get it a higher SpamAssassin score. But since it was part of WS before that, there will be a "decrease" there. And the folk doing scoring won't have a way to anticipate the effect. Would it be worthwhile to phase JP out of WS slowly and/or put up a temporary WSONLY list that could be used for scoring trials?

The other is more about how people use scores. As we do a better job of spotting and reduce FPs the SpamAssassin scores will go up. This is good, right? Well, maybe. There are six URIRL's in SpamAssassin 3.0 already. And as scored, a -single- feature in the text of the message can trigger a spam score of 9.9 (without bayes) or 12.4 (with). Now. This scares me, since some systems discard spam above a certain score.

If we assume that JP gets the same confidence that SC has, that inflates the score to 13.8 or 16.6. That's a lot of certainty to invest in one lone URI. Especially given that evil URIs do wind up in legitimate mail, however rarely.

Which isn't directly SURBL's problem, of course.

-- lundin@fini.net "Would you tell me, please, which way I ought to go from here?" "That depends a good deal on where you want to get to," said the Cat.

Jeff Chan

3:56 p.m.

On Thursday, September 23, 2004, 6:21:13 AM, John Lundin wrote:

...

On Wed, Sep 22, 2004 at 04:45:34PM -0700, Jeff Chan wrote:

...
...
OK We would like to proceed with the first part of this. We propose adding JP to multi on Monday September 27, but keeping the JP data in WS for now, and asking SA to add JP to SA 3.1 after that change next Monday. We would want to announce the new list so that other programs using multi.surbl.org would know that the return values had changed. That would give them some time if they need to make adjustments to their code. JP would get the 64 bitmask [...]

(Thanks for your feedback... :-)

...

First, when JP drops out of WS there will be a content change. One of the reasons for adding JP is to get it a higher SpamAssassin score. But since it was part of WS before that, there will be a "decrease" there. And the folk doing scoring won't have a way to anticipate the effect. Would it be worthwhile to phase JP out of WS slowly and/or put up a temporary WSONLY list that could be used for scoring trials?

Good point. Raymond has already been testing a version of WS with only WS and no JP. Perhaps we should make one generally available for testing and scoring before the JP out of WS date in some months. I'm already dreading the support questions. LOL!

...

The other is more about how people use scores. As we do a better job of spotting and reduce FPs the SpamAssassin scores will go up. This is good, right? Well, maybe. There are six URIRL's in SpamAssassin 3.0 already. And as scored, a -single- feature in the text of the message can trigger a spam score of 9.9 (without bayes) or 12.4 (with). Now. This scares me, since some systems discard spam above a certain score.

Are the scores cumulative like that? I thought I heard they are either/or, perhaps in the context of multi and urirhssub.

...

If we assume that JP gets the same confidence that SC has, that inflates the score to 13.8 or 16.6. That's a lot of certainty to invest in one lone URI. Especially given that evil URIs do wind up in legitimate mail, however rarely.

JP should score about the same as OB since they have similar spam detection and FP rates. SC has a lower FP rate (good) and somewhat lower hit rates (less good) than JP or OB. The lower FP rate rightly counts more, so SC scores higher.

Jeff C.

John Lundin

7:52 p.m.

On Thu, Sep 23, 2004 at 06:56:04AM -0700, Jeff Chan wrote:

...

On Thursday, September 23, 2004, 6:21:13 AM, John Lundin wrote:

...
The other is more about how people use scores. As we do a better job of spotting and reduce FPs the SpamAssassin scores will go up. This is good, right? Well, maybe. There are six URIRL's in SpamAssassin 3.0 already. And as scored, a -single- feature in the text of the message can trigger a spam score of 9.9 (without bayes) or 12.4 (with). Now. This scares me, since some systems discard spam above a certain score.

Are the scores cumulative like that? I thought I heard they are either/or, perhaps in the context of multi and urirhssub.

Oooh, yeah. And they usually do go off in multiples.

Some percentages from a small ISP, last two months inbound mail:

Detected 4.616% as not spam (including FFP's):
  99.144% (no URI_RBL found)
   0.488% WS_URI_RBL
   0.226% OB_URI_RBL
   0.051% OB_URI_RBL WS_URI_RBL
   0.037% SPAMCOP_URI_RBL
   0.017% OB_URI_RBL SPAMCOP_URI_RBL
   0.012% SPAMCOP_URI_RBL WS_URI_RBL
   0.012% OB_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
   0.005% AB_URI_RBL OB_URI_RBL SPAMCOP_URI_RBL
   0.005% AB_URI_RBL OB_URI_RBL
   0.002% AB_URI_RBL

Detected 95.384% as spam:
  34.538% AB_URI_RBL OB_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
  14.623% (no URI_RBL found)
  14.359% OB_URI_RBL WS_URI_RBL
  10.442% OB_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
   7.551% WS_URI_RBL
   3.153% AB_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
   3.031% AB_URI_RBL OB_URI_RBL SPAMCOP_URI_RBL
   3.006% OB_URI_RBL
   2.681% AB_URI_RBL OB_URI_RBL WS_URI_RBL
   1.936% SPAMCOP_URI_RBL WS_URI_RBL
   1.648% OB_URI_RBL SPAMCOP_URI_RBL
   1.105% AB_URI_RBL WS_URI_RBL
   1.055% AB_URI_RBL OB_URI_RBL
   0.340% SPAMCOP_URI_RBL
   0.340% AB_URI_RBL SPAMCOP_URI_RBL
   0.172% AB_URI_RBL
   0.010% AB_URI_RBL PH_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
   0.005% PH_URI_RBL WS_URI_RBL
   0.004% OB_URI_RBL PH_URI_RBL WS_URI_RBL
   0.001% AB_URI_RBL PH_URI_RBL WS_URI_RBL
   0.001% PH_URI_RBL SPAMCOP_URI_RBL WS_URI_RBL
   0.000% PH_URI_RBL

Over a third of all spam inbound hit all four URIRLs. Less than half of that number hit no URIRLs. But even less, only 11.069%, hit just one URIRL.
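The 11.069% figure can be re-derived from the spam-side rows above by grouping them by how many lists each message hit. A rough sketch follows; the rule names are shortened for space (WS for WS_URI_RBL, SC for SPAMCOP_URI_RBL, and so on) and the function name is illustrative.

```python
from collections import defaultdict

# Spam-side rows from the figures above: percent of spam, then the lists hit.
# Rule names are shortened (e.g. WS = WS_URI_RBL, SC = SPAMCOP_URI_RBL).
# The row with no names is the "(no URI_RBL found)" row.
SPAM_ROWS = """
34.538 AB OB SC WS
14.623
14.359 OB WS
10.442 OB SC WS
7.551 WS
3.153 AB SC WS
3.031 AB OB SC
3.006 OB
2.681 AB OB WS
1.936 SC WS
1.648 OB SC
1.105 AB WS
1.055 AB OB
0.340 SC
0.340 AB SC
0.172 AB
0.010 AB PH SC WS
0.005 PH WS
0.004 OB PH WS
0.001 AB PH WS
0.001 PH SC WS
0.000 PH
"""


def share_by_list_count(rows: str) -> dict[int, float]:
    """Sum the percentages per number of lists hit (0, 1, 2, ...)."""
    totals = defaultdict(float)
    for line in rows.strip().splitlines():
        percent, *lists = line.split()
        totals[len(lists)] += float(percent)
    return {count: round(total, 3) for count, total in sorted(totals.items())}


shares = share_by_list_count(SPAM_ROWS)
for count, share in shares.items():
    print(f"{count} list(s): {share:.3f}%")
```

The single-list bucket adds up the WS, OB, SPAMCOP, AB and PH rows that stand alone, which is where the 11.069% comes from.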

Under SA2.6, I compensated by adding in second-order meta rules with negative scores, but as the number of urirls goes up that becomes unwieldy fast.
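To put a number on "unwieldy": it is not stated how those meta rules were laid out, so the sketch below simply assumes one compensating rule per pair of lists, or alternatively one per combination of two or more lists, and counts how many would be needed as lists are added. The function names are illustrative.

```python
from math import comb


def pair_rules(n_lists: int) -> int:
    """Compensating rules needed if there is one per pair of lists (assumption)."""
    return comb(n_lists, 2)


def multi_hit_rules(n_lists: int) -> int:
    """Rules needed if there is one per combination of two or more lists (assumption).

    All subsets (2**n) minus the empty set (1) minus the single lists (n).
    """
    return 2 ** n_lists - n_lists - 1


for n in (4, 5, 6, 7):
    print(f"{n} lists: {pair_rules(n)} pair rules, "
          f"{multi_hit_rules(n)} multi-hit combinations")
```

Under either assumption the count grows faster than the number of lists, which matches the concern above.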

...

...
If we assume that JP gets the same confidence that SC has, that inflates the score to 13.8 or 16.6. That's a lot of certainty to invest in one lone URI. Especially given that evil URIs do [...]

JP should score about the same as OB since they have similar spam detection and FP rates. SC has a lower FP rate (good) and somewhat lower hit rates (less good) than JP or OB. The lower FP rate rightly counts more, so SC scores higher.

That would drop it to 11.9 or 15.6. :-)
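For reference, the per-list scores implied by these totals can be backed out by subtraction. The sketch below uses only the totals quoted in this thread; the variable and function names are illustrative, and the individual SpamAssassin rule scores themselves are not given here.

```python
# Totals quoted in the thread, as (without Bayes, with Bayes).
current_total = (9.9, 12.4)    # existing URI list rules on one message
with_jp_as_sc = (13.8, 16.6)   # if JP were scored like SC
with_jp_as_ob = (11.9, 15.6)   # if JP were scored like OB


def implied_score(before, after):
    """Score the added list would have to carry to move the total from before to after."""
    return tuple(round(a - b, 1) for b, a in zip(before, after))


print("JP scored like SC adds:", implied_score(current_total, with_jp_as_sc))
print("JP scored like OB adds:", implied_score(current_total, with_jp_as_ob))
```

This only holds if the rule scores are cumulative, which is the point being discussed above.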

I worry most about quoting and notification scenarios.

-- lundin@cavtel.net "ASCII stupid question, get a stupid ANSI."

Jeff Chan

24 Sep

2:41 a.m.

On Thursday, September 23, 2004, 10:52:12 AM, John Lundin wrote:

...

I worry most about quoting and notification scenarios.

Spam discussion messages should not be filtered.

Nor should abuse desk mailboxes.

Jeff C. -- "If it appears in hams, don't list it."


discuss@lists.surbl.org
