On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
Basically the higher the FP rate, the less useful a list is.
... or, rather, the lower it ought to be scored.
Yes, but please remember that not everyone has the ability to "score" their SURBL hits. Not everyone using SURBLs is using SpamAssassin.
Does anyone have other corpus stats to share, in particular FP rates?
Thanks for sharing your data. I know this can be a somewhat painful subject for people, but it's very important to clean up the false positives and make the lists better and more useful.
Sure. All of these messages were received in the past 10 days. A lot has happened since June. :-)
WS: 44004/54185s, 61/19150s
OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
   73335    54185    19150   0.739    0.00    0.00  (all messages)
 100.000  73.8870  26.1130   0.739    0.00    0.00  (all messages as %)
  60.087  81.2107   0.0836   0.999    0.00    0.00  WS_SURBL
HOWEVER... I decided to go through the ham hits (61 of them), and look for false positive domains to submit.
That kind of checking should become a policy. People who can do that kind of checking should do it every time. Every tool we have for reducing FPs should be used.
Letting FPs in just hurts the usefulness of the lists.
I found several, but, for the most part, they've *already* been cleaned up and are no longer listed in WS. (30 out of the 61 were in a massive mailing list thread for a single domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed in WS:
buckeye-express.com -- Used in a personal email address, looks legit; 7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter; also personal email address of one of the editors; 3 messages
00fun.com -- Confirmed, more than one user on our system sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks* like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home page, ehh... but I have 4 different legit newsletters with links to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've been advertised by some reputable "word of the day" mailers (dictionary.com) Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users attended a conference of theirs, and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997), found in actual room booking confirmations for Comfort Inn.
Thanks. I agree those look like false positives and have whitelisted all of them across SURBLs. Signing up for a newsletter and then forgetting about it does not make a message spam.
Instead of having these go into SURBLs, they should be checked *before* they get added. Hopefully they would be detected then and not get added to begin with. Wouldn't that be better?
Should hand-checking catch these as mostly legitimate?
Are we hand-checking? If not we should!
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.
So, if I take those out, the numbers look more like:
WS: 44006/54187s, 0/19148s
OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
   73335    54187    19148   0.739    0.00    0.00  (all messages)
 100.007  73.8897  26.1103   0.739    0.00    0.00
  60.087  81.2111   0.0000   1.000    0.00    0.00  WS_SURBL
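For anyone who wants to derive the percentage columns from their own raw hit counts, here is a minimal Python sketch. It assumes S/O is SPAM% / (SPAM% + HAM%), which is a reading of the column and not something stated in this thread, and the function name `rule_stats` is made up for illustration. The figures quoted above may have been adjusted by hand, so do not expect a digit-for-digit match.

```python
def rule_stats(spam_hits, spam_total, ham_hits, ham_total):
    """Return (overall%, spam%, ham%, S/O) for a single rule.

    Assumption: S/O = spam% / (spam% + ham%). This is an
    interpretation of the column, not confirmed in the thread.
    """
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    overall_pct = 100.0 * (spam_hits + ham_hits) / (spam_total + ham_total)
    denom = spam_pct + ham_pct
    # A rule that never fires has no meaningful S/O; report 0.0.
    s_o = spam_pct / denom if denom else 0.0
    return overall_pct, spam_pct, ham_pct, s_o


# Counts from the "WS:" line above: 44006/54187 spam, 0/19148 ham.
overall, spam, ham, s_o = rule_stats(44006, 54187, 0, 19148)
print(f"{overall:8.3f} {spam:8.4f} {ham:8.4f} {s_o:6.3f}  WS_SURBL")
```

Keeping the ham rate as its own number matters for people who cannot score hits: for them the FP rate is the whole story, not one input to a score.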
Is that more like what you had in mind..? No, I'm not making that up. :-)
Looks good, but this corpus is perhaps too small to make representative measurements for emails in general. That said, any reduction in FPs is important and welcome.
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a hand-check.
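One way to do that search, sketched in Python. It assumes the ham corpus is a directory with one message per file, and that each message was already tagged by SpamAssassin so it carries an X-Spam-Status header listing the rules that fired. The function name `find_rule_hits` is hypothetical.

```python
import email
import re
from pathlib import Path


def find_rule_hits(corpus_dir, rule="WS_SURBL"):
    """Yield paths of messages whose X-Spam-Status header lists `rule`.

    Assumes one message per file, already tagged by SpamAssassin.
    Only the header is checked, never the message body.
    """
    # Word boundaries keep e.g. WS_SURBL_FOO from matching WS_SURBL.
    pattern = re.compile(r"\b" + re.escape(rule) + r"\b")
    for path in sorted(Path(corpus_dir).rglob("*")):
        if not path.is_file():
            continue
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        status = str(msg.get("X-Spam-Status", ""))
        if pattern.search(status):
            yield path
```

Calling `list(find_rule_hits("path/to/ham"))` with your own corpus path gives the messages to hand-check. Deciding whether a listed domain is a real FP still takes a human.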
- Ryan
Thanks for your stats and checking, and yes, anyone else with ham corpora, please check for FPs.
Jeff C.