Jeff Chan wrote to the SURBL Discussion list and users@spamassassin.apache.org:
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data.
You're welcome.
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become a policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
Agreed... some of these are really easy to catch.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs.
Good. Thanks!
Signing up for a newsletter and then forgetting about it does not make a message spam.
;-) Worse yet, even *with* a carefully and correctly classified corpus of *messages*, we all know that doesn't come anywhere *near* to guaranteeing a correctly classified list of URIs. That's where spamtraps fall short, and that's why we *need* hand-checking on every domain.
Instead of having these go into SURBLs, they should be checked *before* they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
Speaking for myself, I hand check absolutely everything I submit. I've spent at least half an hour digging up dirt on some single domains to correctly classify them (though, in many cases, that time is now greatly reduced thanks to GetURI), and, despite my best efforts, it's still likely that I've misclassified a few that haven't been reported as FPs yet.
But, yes, we really need to continue to look hard at sources and their methods to make sure *every* submitter is doing the right thing. It doesn't take many domains to seriously skew the FP rate, when we're talking about hundredths of percentage points.
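To put rough numbers on that, here is a minimal Python sketch of how little it takes to move HAM% on a corpus of this size. The ham totals are the ones from the "(all messages)" rows quoted in this thread; the function name and constants are only illustrative and are not part of mass-check or hit-frequencies.

```python
# Ham corpus sizes taken from the "(all messages)" rows quoted in this thread.
HAM_BEFORE_WHITELIST = 19147
HAM_AFTER_EXCLUSION = 19124


def ham_pct(ham_hits: int, ham_total: int) -> float:
    """Percentage of ham messages hit by a rule (the HAM% column)."""
    if ham_total <= 0:
        raise ValueError("ham_total must be positive")
    return 100.0 * ham_hits / ham_total


if __name__ == "__main__":
    # Show how a handful of ham hits translates into HAM%.
    for hits in (1, 2, 3, 29):
        pct = ham_pct(hits, HAM_BEFORE_WHITELIST)
        print(f"{hits:>3} ham hits -> HAM% = {pct:.4f}")
```

On about 19,000 ham messages, a single false positive message is already worth about 0.005 percentage points, so one or two bad domains are enough to change the second decimal place.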
OVERALL%     SPAM%      HAM%     S/O   RANK   SCORE  NAME
   73335     54187     19148   0.739   0.00    0.00  (all messages)
 100.007   73.8897   26.1103   0.739   0.00    0.00
  60.087   81.2111    0.0000   1.000   0.00    0.00  WS_SURBL
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general.
Agreed. If any other SA users would like to send me their mass-check spam.log and ham.log with SURBL tests, I'll gladly combine, analyze, and post the hit frequencies.
Here's my latest, without those whitelisted ones:
OVERALL%     SPAM%      HAM%     S/O   RANK   SCORE  NAME
   73333     54186     19147   0.739   0.00    0.00  (all messages)
 100.000   73.8903   26.1097   0.739   0.00    0.00  (all messages as %)
  62.906   85.1308    0.0104   1.000   1.00    1.00  URIBL_PJ_SURBL
  23.738   32.1245    0.0052   1.000   0.89    4.00  URIBL_SC_SURBL
  66.122   89.4327    0.1515   0.998   0.82    3.00  URIBL_WS_SURBL
  21.525   29.1293    0.0052   1.000   0.76    5.00  URIBL_AB_SURBL
  56.618   76.6194    0.0157   1.000   0.71    4.00  URIBL_OB_SURBL
   0.001    0.0018    0.0000   1.000   0.64    2.00  URIBL_PH_SURBL
BUT... If I exclude the messages with domains from today's whitelist:
OVERALL%     SPAM%      HAM%     S/O   RANK   SCORE  NAME
   73310     54186     19124   0.739   0.00    0.00  (all messages)
 100.000   73.9135   26.0865   0.739   0.00    0.00  (all messages as %)
  66.104   89.4327    0.0052   1.000   1.00    3.00  URIBL_WS_SURBL
  62.926   85.1308    0.0105   1.000   0.74    1.00  URIBL_PJ_SURBL
  23.746   32.1245    0.0052   1.000   0.67    4.00  URIBL_SC_SURBL
  21.532   29.1293    0.0052   1.000   0.57    5.00  URIBL_AB_SURBL
   0.001    0.0018    0.0000   1.000   0.50    2.00  URIBL_PH_SURBL
  56.636   76.6194    0.0157   1.000   0.48    4.00  URIBL_OB_SURBL
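For anyone wanting to sanity-check the S/O column, the sketch below recomputes it from the SPAM% and HAM% columns. It assumes S/O is spam% / (spam% + ham%); that is my reading of the hit-frequencies output rather than something stated in this thread, but it does reproduce the figures quoted here. The function and constant names are made up for illustration.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Assumed S/O definition: spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no messages, S/O is undefined")
    return spam_pct / total


# (SPAM%, HAM%) for URIBL_WS_SURBL, copied from the figures in this thread.
WS_BEFORE = (89.4327, 0.1515)  # before excluding the whitelisted domains
WS_AFTER = (89.4327, 0.0052)   # after excluding them

if __name__ == "__main__":
    print(f"WS S/O before: {s_over_o(*WS_BEFORE):.3f}")
    print(f"WS S/O after:  {s_over_o(*WS_AFTER):.3f}")
```

Under that assumption, the drop in HAM% alone accounts for URIBL_WS_SURBL going from 0.998 to 1.000; the SPAM% value is the same in both runs.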
I also found more to whitelist, but I'm working on a larger ham corpus for those. Details to follow...
That said, any reduction in FPs is important and welcome.
So why don't we hold our first 12-hour SURBL FP-a-thon?
- Ryan