On Monday, June 7, 2004, 7:22:22 PM, Daniel Quinlan wrote:
Here's a test on my last 4 days of ham and spam. The T_DNS_FROM* rules use the envelope sender (that is, the MAIL FROM that my MTA adds to the message headers). They have a reasonably good hit rate (6.43% of spam hit at least one of the tested SURBL zones) and a 0% FP rate in this test. Only 7 of those 103 messages did not also hit one of the URIBL rules, but the T_DNS_FROM* rules got their hits with zero FPs. (I get FPs on the URIBL rules mostly because people discuss spam domains in some of the ham tested here. That's another issue, though.)
Maybe it would be worthwhile to factor this into future development: that is, also list the envelope sender domains of known spammers, and maybe get SpamCop to provide lists for that too. I suspect there's some overlap in the other direction as well.
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
    2598     1602      996  0.617   0.00   0.00  (all messages)
 100.000  61.6628  38.3372  0.617   0.00   0.00  (all messages as %)
   3.580   5.8052   0.0000  1.000   1.00   0.01  T_DNS_FROM_SURBL_WS
   0.885   1.4357   0.0000  1.000   0.67   0.01  T_DNS_FROM_SURBL_SC
   9.161  14.5443   0.5020  0.967   0.56   1.00  URIBL_BE_SURBL
  36.066  57.5531   1.5060  0.974   0.44   1.00  URIBL_WS_SURBL
   0.423   0.6866   0.0000  1.000   0.33   0.01  T_DNS_FROM_SURBL_BE
  29.908  47.5655   1.5060  0.969   0.11   1.00  URIBL_SC_SURBL
Thanks for the results. While SURBLs are not intended to be used with domains in envelopes, I can see how they could apply to a small percentage of spam, as you found. I'm glad there are zero false positives, though I'm somewhat concerned about the 1.5% FPs when using SURBLs with urirhsbl. Hopefully those come from spam under discussion in the corpora, as you propose.
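For anyone reading the table, here is a rough sketch of how the S/O column appears to relate to the SPAM% and HAM% columns. The formula is an assumption inferred from the figures in this thread, not taken from the tool that produced them, and the function name s_o_ratio is made up for illustration.

```python
def s_o_ratio(spam_pct, ham_pct):
    """Share of a rule's hits that fall on spam, computed from the
    SPAM% and HAM% columns (assumed formula, inferred from the table)."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # the rule never fired, so the ratio is undefined
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the table in this thread.
rows = {
    "T_DNS_FROM_SURBL_WS": (5.8052, 0.0000),
    "URIBL_BE_SURBL": (14.5443, 0.5020),
    "URIBL_SC_SURBL": (47.5655, 1.5060),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name:22s} {s_o_ratio(spam_pct, ham_pct):.3f}")
```

A rule with no ham hits comes out at exactly 1.000, which is why the three T_DNS_FROM* rules all show that value despite their very different hit rates.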
Regarding adding sender domains to SURBLs, I'm 99% sure we're not going to specifically go out and do that. Our focus will remain on message bodies (spamvertised sites).
On the other hand, I can see the value in being able to handle sender domains without needing to resolve them through DNS into IP addresses. That advantage would be similar to the one SURBLs have over the numeric URIBLs.
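To make the difference concrete, here is a minimal sketch of the query names each kind of list would need. It only builds the names and does no DNS traffic. The helper names are hypothetical, ws.surbl.org is used as an example name-based zone, and ipbl.example is a made-up placeholder for an IP-based zone. Real SURBL checks also reduce host names to the registered domain first, which this sketch skips.

```python
def sender_domain(envelope_sender):
    """Return the lower-cased domain of a MAIL FROM address, or None."""
    addr = envelope_sender.strip().strip("<>")
    if "@" not in addr:
        return None  # null sender "<>" or a malformed address
    domain = addr.rsplit("@", 1)[1].lower().rstrip(".")
    return domain or None


def name_query(domain, zone):
    """Name-based list: the domain is prepended to the zone as-is,
    so no prior resolution of the domain is needed."""
    return f"{domain}.{zone}"


def ip_query(ipv4, zone):
    """IP-based list: the domain must already have been resolved to an
    address, whose octets are reversed and prepended to the zone."""
    return ".".join(reversed(ipv4.split("."))) + "." + zone


domain = sender_domain("<bob@Spammer.Example>")
print(name_query(domain, "ws.surbl.org"))
print(ip_query("192.0.2.1", "ipbl.example"))
```

The name-based path is a single lookup on a name taken straight from the message, whereas the IP-based path first costs one resolution per domain before the list can be consulted at all.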
However, another reason not to do it is that sender domains are routinely forged, so eventually they would probably become an unreliable predictor of spam. On the other hand, forging a message body URI would do spammers little good, since it would take traffic away from their spam web sites. Accordingly, I think we'll stick with URI domains for SURBLs. :D
How's the multi roll-out going? It would definitely be handy for this test (the code to support it already exists).
Thanks for the prompting; we got around to doing it at last, as seen in the other messages.
Jeff C.