Eric Kolve and I were looking at how to best set the default SpamCopURI scores for the various SURBL lists and at first we tried looking at the SpamAssassin 3.0 perceptron-generated scores as a possible guide:
http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
# The following block of scores were generated using the mass-checking
# scripts, and a perceptron to determine the optimum scores which
# resulted in minimum false positives or negatives. The scores are
# weighted to produce roughly 1 false positive in 2500 non-spam messages
# using the default threshold of 5.0.
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant, and Theo Van Dinter cited the documentation:
$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used depends on
how SpamAssassin is being used. The first score is used when both Bayes
and network tests are disabled (score set 0). The second score is used
when Bayes is disabled, but network tests are enabled (score set 1).
The third score is used when Bayes is enabled and network tests are
disabled (score set 2). The fourth score is used when Bayes is enabled
and network tests are enabled (score set 3).
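That selection rule is easy to express compactly. Here is a small illustrative Python sketch of the score-set logic described above; the function name is ours, not part of SpamAssassin:

```python
# Illustrative sketch of the score-set rule quoted above; the
# function name is ours, not part of SpamAssassin.
def score_set(bayes_enabled, network_enabled):
    """Return the index (0-3) into a four-score rule line."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)

# e.g. URIBL_OB_SURBL's line "0 1.996 0 3.213":
ob_scores = [0, 1.996, 0, 3.213]
print(ob_scores[score_set(True, True)])   # 3.213 (Bayes + network tests)
print(ob_scores[score_set(False, True)])  # 1.996 (network tests only)
```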
We wondered if we could somehow use those scores with SpamCopURI and were unable to come up with a good answer.
Theo suggested looking at spam versus ham hit rates as a good way to set scores, to which I replied:
We have these test results from Justin from 25 June:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
  121405    22516    98889   0.185   0.00   0.00  (all messages)
 100.000  18.5462  81.4538   0.185   0.00   0.00  (all messages as %)
  13.453  70.3766   0.4925   0.993   1.00   1.00  SURBL_WS
   3.807  20.3811   0.0334   0.998   0.50   1.00  SURBL_SC
   2.650  14.2565   0.0071   1.000   0.50   1.00  SURBL_AB
   0.019   0.0933   0.0020   0.979   0.50   1.00  SURBL_PH
  12.624  67.6275   0.1001   0.999   0.50   1.00  SURBL_OB
which shows a pretty high FP rate for WS, less for the others. Do you happen to have access to any more recent corpus check data like this? Could be useful to have another snapshot for a more complete picture.
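For readers unfamiliar with these hit-frequency tables, the S/O column appears to be the rule's spam hit percentage as a fraction of its total hit percentage; a quick sketch using the SURBL_WS row above:

```python
# S/O in these tables works out to spam% / (spam% + ham%), i.e.
# what fraction of the rule's hit frequency comes from spam.
# Figures below are from the SURBL_WS row above.
spam_pct, ham_pct = 70.3766, 0.4925
s_o = spam_pct / (spam_pct + ham_pct)
print(round(s_o, 3))  # 0.993, matching the table
```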
Which was followed up with more data and discussion:
On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
High spam + low ham is good from an FP standpoint, but having a "significant" (for your definition thereof) ham hit rate means the score shouldn't be too high. My handwaving scores would be something like:
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
WS 1.2
SC 2.5
AB 3.5
OB 1.8
Theo then gave some of his own stats on a couple different corpora:
OVERALL%     SPAM%     HAM%     S/O   RANK  SCORE  NAME
  416072    365031    51041   0.877   0.00   0.00  (all messages)
 100.000   87.7327  12.2673   0.877   0.00   0.00  (all messages as %)
set1  30.923  35.2466  0.0000  1.000  0.99  0.00  URIBL_SC_SURBL
set1  72.231  82.3273  0.0274  1.000  0.98  1.00  URIBL_OB_SURBL
set1  19.375  22.0847  0.0000  1.000  0.98  1.00  URIBL_AB_SURBL
set1  74.883  85.2939  0.4310  0.995  0.74  0.00  URIBL_WS_SURBL
set1   0.001   0.0000  0.0059  0.000  0.48  0.00  URIBL_PH_SURBL
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
  119215    67094    52121   0.563   0.00   0.00  (all messages)
 100.000  56.2798  43.7202   0.563   0.00   0.00  (all messages as %)
set3  39.217  69.6605  0.0288  1.000  0.98  1.00  URIBL_OB_SURBL
set3  10.340  18.3727  0.0000  1.000  0.97  0.00  URIBL_SC_SURBL
set3   5.998  10.6582  0.0000  1.000  0.94  1.00  URIBL_AB_SURBL
set3  42.730  75.5522  0.4797  0.994  0.73  0.00  URIBL_WS_SURBL
set3   0.008   0.0089  0.0058  0.608  0.49  0.00  URIBL_PH_SURBL
so for these results, I'd probably do something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
since the hit rates and S/O are a bit higher for me, related to the fact that I ran more spam through than Justin did.
To which I added:
Those final scores look like an excellent fit to the data to me.
and:
Also, while the PH spam hit rate [from Justin's stats] is low, the data consists of hand-checked phishing scams, which deserve to be blocked due to their potential for danger and damage.
Therefore I would tend to give PH a medium-high score like 3 to 5.
So we'll probably adjust the default scores on SpamCopURI to something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5
and we recommend SpamCopURI users do likewise. Please be sure to use the latest version of SpamCopURI with multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
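In SpamAssassin local.cf terms, the adjustment would look something like the fragment below. The rule names here are illustrative, since the exact names SpamCopURI defines may differ; check the spamcop_uri.cf shipped with your version:

```
# Hypothetical rule names -- check spamcop_uri.cf for the real ones.
score SPAMCOP_URI_WS 1.3
score SPAMCOP_URI_SC 4.0
score SPAMCOP_URI_AB 3.0
score SPAMCOP_URI_OB 2.2
score SPAMCOP_URI_PH 4.5
```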
One thing that stood out for me is that the FP rate (ham%) for ws.surbl.org is much too high, at about 0.45 to 0.5% across multiple corpora. That FP rate needs to be reduced for WS to be fully useful.
I think Chris or maybe Raymond suggested that they had a way to reduce FPs in WS further. If so, ***please*** try to apply it. We need to get the FPs to be much less than 0.5%. The other lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Does anyone have other corpus stats to share, in particular FP rates?
Jeff C.
Hi!
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
One thing that stood out for me is that the FP rate (ham%) for ws.surbl.org is much too high, at about 0.45 to 0.5% across multiple corpora. That FP rate needs to be reduced for WS to be fully useful.
I think Chris or maybe Raymond suggested that they had a way to reduce FPs in WS further. If so, ***please*** try to apply it. We need to get the FPs to be much less than 0.5%. The other lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Seeing those data, it would be very interesting if we could test a separate list. Is that possible? I would like to test the Prolo and Joe's lists combined, without the rest of the WS list. I can generate the data for a test like that. I have seen almost zero FPs in the data I compose, so perhaps it's better to separate the lists. I think people would benefit from a list with fewer FPs. The current WS list is compiled from too many data sources, I think.
Suggestions?
Shall I send you a list for testing so we can see if this would bump down the FP rates?
Bye, Raymond.
On Sunday, September 5, 2004, 3:30:49 AM, Raymond Dijkxhoorn wrote:
Seeing those data, it would be very interesting if we could test a separate list. Is that possible? I would like to test the Prolo and Joe's lists combined, without the rest of the WS list. I can generate the data for a test like that. I have seen almost zero FPs in the data I compose, so perhaps it's better to separate the lists. I think people would benefit from a list with fewer FPs. The current WS list is compiled from too many data sources, I think.
If you can make the different lists available to me by rsync, I can easily set up some temporary local SURBLs for testing them. Thank you rbldnsd! :-)
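For reference, a temporary test SURBL under rbldnsd can be little more than a "dnset" data file; a hedged sketch (the zone name, return value, and domains below are all made up for illustration):

```
# Sketch of an rbldnsd "dnset" data file for a temporary test zone;
# zone name, addresses, and domains are illustrative only.
# Run with something like:
#   rbldnsd -b 127.0.0.1 test.surbl.local:dnset:this-file
:127.0.0.2:Listed in test zone, see http://www.surbl.org/
baddomain.example
anotherbad.example
```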
Unfortunately I don't have my own test corpora, so I need to rely on the generosity of others who do. So I'd probably need to ask Theo, Daniel, Justin or others with corpora to test against them.
Jeff C.
Hi!
If you can make the different lists available to me by rsync, I can easily set up some temporary local SURBLs for testing them. Thank you rbldnsd! :-)
OK, I will make a test set available; if you generate zone files for the rsync box, I can put them on two servers to test with...
Unfortunately I don't have my own test corpora, so I need to rely on the generosity of others who do. So I'd probably need to ask Theo, Daniel, Justin or others with corpora to test against them.
Yes, that would be great. I'll mail you details offlist.
Bye, Raymond.
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
Does anyone have other corpus stats to share, in particular FP rates?
Sure. All of these messages were received in the past 10 days. A lot has happened since June. :-)
WS: 44004/54185s, 61/19150h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54185    19150   0.739   0.00   0.00  (all messages)
 100.000  73.8870  26.1130   0.739   0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836   0.999   0.00   0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit. I found several, but, for the most part, they've *already* been cleaned up and are no longer listed in WS. (30 out of the 61 were in a massive mailing list thread for a single domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed in WS:
buckeye-express.com -- Used in a personal email address, looks legit; 7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter; also the personal email address of one of the editors; 3 messages
00fun.com -- Confirmed; more than one user on our system sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks* like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home page, ehh... but I have 4 different legit newsletters with links to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've been advertised by some reputable "word of the day" mailers (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users attended a conference of theirs and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997); found in actual room booking confirmations for Comfort Inn.
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a hand-check.
- Ryan
On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
Yes, but please remember that not everyone has the ability to "score" their SURBL hits. Not everyone using SURBLs is using SpamAssassin.
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data. I know this can be a somewhat painful subject for people, but it's very important to clean up the false positives and make the lists better and more useful.
Sure. All of these messages were received in the past 10 days. A lot has happened since June. :-)
WS: 44004/54185s, 61/19150h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54185    19150   0.739   0.00   0.00  (all messages)
 100.000  73.8870  26.1130   0.739   0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836   0.999   0.00   0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
I found several, but, for the most part, they've *already* been cleaned up and are no longer listed in WS. (30 out of the 61 were in a massive mailing list thread for a single domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed in WS:
buckeye-express.com -- Used in a personal email address, looks legit; 7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter; also the personal email address of one of the editors; 3 messages
00fun.com -- Confirmed; more than one user on our system sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks* like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home page, ehh... but I have 4 different legit newsletters with links to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've been advertised by some reputable "word of the day" mailers (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users attended a conference of theirs and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997); found in actual room booking confirmations for Comfort Inn.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs. Signing up for a newsletter and then forgetting about it does not make a message spam.
Instead of having these go into SURBLs, they should be checked **before** they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148h
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general. That said, any reduction in FPs is important and welcome.
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a hand-check.
- Ryan
Thanks for your stats and checking, and yes, anyone else with ham corpora, please check for FPs.
Jeff C.
Jeff Chan wrote to SURBL Discussion list and users@spamassassin.apache.org:
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data.
You're welcome.
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
Agreed... some of these are really easy to catch.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs.
Good. Thanks!
Signing up for a newsletter then forgetting about does not make a message spam.
;-) Worse yet, even *with* a carefully and correctly classified corpus of *messages*, we all know that doesn't come anywhere *near* to guaranteeing a correctly classified list of URIs. That's where spamtraps fall short, and that's why we *need* hand-checking on every domain.
Instead of having these go into SURBLs, they should be checked **before** they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
Speaking for myself, I hand check absolutely everything I submit. I've spent at least half an hour digging up dirt on some single domains to correctly classify them (though, in many cases, that time is now greatly reduced thanks to GetURI), and, despite my best efforts, it's still likely that I've misclassified a few that haven't been reported as FPs yet.
But, yes, we really need to continue to look hard at sources and their methods to make sure *every* submitter is doing the right thing. It doesn't take many domains to seriously skew the FP rate, when we're talking about hundredths of percentage points.
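To put "hundredths of percentage points" in perspective, rough arithmetic for a ham corpus of the size discussed here:

```python
# Rough arithmetic: in a ham corpus of ~19,148 messages (as above),
# each false-positive hit moves the ham% figure by ~0.005 points.
ham_total = 19148
per_hit = 100.0 / ham_total
print(f"{per_hit:.4f}% per ham hit")              # ~0.0052% per hit
print(f"{round(0.5 / per_hit)} hits reach 0.5%")  # ~96 hits hit the 0.5% level
```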
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73335    54187    19148   0.739   0.00   0.00  (all messages)
 100.007  73.8897  26.1103   0.739   0.00   0.00  (all messages as %)
  60.087  81.2111   0.0000   1.000   0.00   0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general.
Agreed. If any other SA users would like to send me their mass-check spam.log and ham.log with SURBL tests, I'll gladly combine, analyze, and post the hit frequencies.
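A minimal sketch of such a combine-and-tally step, assuming only that each mass-check log line carries a `tests=RULE1,RULE2,...` field (adjust the parsing if your logs differ); the sample log lines are illustrative:

```python
import re
from collections import Counter

def tally(path):
    """Count per-rule hits and total messages in one mass-check log."""
    hits, total = Counter(), 0
    with open(path) as fh:
        for line in fh:
            if line.startswith('#') or not line.strip():
                continue  # skip comments and blank lines
            total += 1
            m = re.search(r'tests=([\w,]+)', line)
            if m:
                hits.update(m.group(1).split(','))
    return hits, total

# Tiny self-contained demo in place of real contributed logs:
open('spam.log', 'w').write(
    "Y 12 msg1 tests=URIBL_WS_SURBL,URIBL_OB_SURBL\n"
    "Y  7 msg2 tests=URIBL_WS_SURBL\n")
open('ham.log', 'w').write(". 0 msg3 tests=URIBL_WS_SURBL\n")

spam_hits, n_spam = tally('spam.log')
ham_hits, n_ham = tally('ham.log')
for rule in sorted(set(spam_hits) | set(ham_hits)):
    s = 100.0 * spam_hits[rule] / max(n_spam, 1)
    h = 100.0 * ham_hits[rule] / max(n_ham, 1)
    print(f"{s:9.4f} {h:9.4f}  {rule}")
```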
Here's my latest, without those whitelisted ones:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73333    54186    19147   0.739   0.00   0.00  (all messages)
 100.000  73.8903  26.1097   0.739   0.00   0.00  (all messages as %)
  62.906  85.1308   0.0104   1.000   1.00   1.00  URIBL_PJ_SURBL
  23.738  32.1245   0.0052   1.000   0.89   4.00  URIBL_SC_SURBL
  66.122  89.4327   0.1515   0.998   0.82   3.00  URIBL_WS_SURBL
  21.525  29.1293   0.0052   1.000   0.76   5.00  URIBL_AB_SURBL
  56.618  76.6194   0.0157   1.000   0.71   4.00  URIBL_OB_SURBL
   0.001   0.0018   0.0000   1.000   0.64   2.00  URIBL_PH_SURBL
BUT... If I exclude the messages with domains from today's whitelist:
OVERALL%    SPAM%     HAM%     S/O   RANK  SCORE  NAME
   73310    54186    19124   0.739   0.00   0.00  (all messages)
 100.000  73.9135  26.0865   0.739   0.00   0.00  (all messages as %)
  66.104  89.4327   0.0052   1.000   1.00   3.00  URIBL_WS_SURBL
  62.926  85.1308   0.0105   1.000   0.74   1.00  URIBL_PJ_SURBL
  23.746  32.1245   0.0052   1.000   0.67   4.00  URIBL_SC_SURBL
  21.532  29.1293   0.0052   1.000   0.57   5.00  URIBL_AB_SURBL
   0.001   0.0018   0.0000   1.000   0.50   2.00  URIBL_PH_SURBL
  56.636  76.6194   0.0157   1.000   0.48   4.00  URIBL_OB_SURBL
I also found more to whitelist, but I'm working on a larger ham corpus for those. Details to follow...
That said, any reduction in FPs is important and welcome.
So why don't we hold our first 12-hour SURBL FP-a-thon?
- Ryan
Hi!
;-) Worse yet, even *with* a carefully and correctly classified corpus of *messages*, we all know that doesn't come anywhere *near* to guaranteeing a correctly classified list of URIs. That's where spamtraps fall short, and that's why we *need* hand-checking on every domain.
So far, the PJ list is hand-checked; uhm, well, there is one exception to that: the pillgang guys, with their two 'famous' nameservers. When we see those coming in, they get auto-added ;)
NS2.AUDI56SEW.BIZ
NS3.AIRMARAMBA.BIZ
Really a gazillion spam domains on those two nameservers.
Bye, Raymond.