(was: Want your ham lists for creating whitelists (Was: Re: [SURBL-Discuss] This ROCKS!))
Jeff Chan jeffc@surbl.org writes:
Is there any way we can get the message body URI domains (or even raw messages) from some of the ham lists?
This is a really really really really really really bad idea.
The intent is not to "cheat" the S/O scores, but to compile a good, hand-checked whitelist of legitimate message body domains. *I don't really care where they come from*, but we have this nice, untapped source of them just sitting there, and a glaring need for some legitimate domains to keep off the blocklists....
I'm very concerned that people who plan on submitting their corpus results to the SpamAssassin mass-check process for 3.0 will end up with no ham hits, whereas your average user will have some. Cheating is not the word I would use, but it would completely throw off our GA process by making it look like SURBL can never produce a false positive, screwing over non-developers.
Even if you only get results from people who are never going to submit corpus results, you're going to end up with a whitelist biased very heavily towards technical users. That penalizes non-technical users, who will have the rule scored higher than it should be.
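To make the concern concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the domains, the two ham sets, the blocklist, and the helper `ham_hit_rate` are hypothetical, and this is not how mass-check or the GA actually computes anything. It only shows that whitelisting domains taken from the benchmark ham drives the measured ham hit rate on that benchmark to zero, while ham from users outside the benchmark is hit as often as before.

```python
def ham_hit_rate(ham_messages, blocklist, whitelist=frozenset()):
    """Fraction of ham messages with at least one listed, non-whitelisted domain."""
    effective = set(blocklist) - set(whitelist)
    hits = sum(1 for domains in ham_messages if effective & set(domains))
    return hits / len(ham_messages)


# Hypothetical ham from people who submit mass-check results (technical users).
corpus_ham = [
    {"example-news.test", "lists.example.org"},
    {"shop.example.com"},
    {"example-news.test"},
    {"dev-tools.example.net"},
]

# Hypothetical ham from an average user who is not part of the benchmark.
user_ham = [
    {"family-photos.example"},
    {"shop.example.com"},
    {"greeting-cards.example"},
    {"local-club.example"},
]

# Hypothetical blocklist that wrongly lists two legitimate domains.
blocklist = {"example-news.test", "greeting-cards.example", "spammer.example"}

# Whitelist built from every domain seen in the benchmark ham.
whitelist = set().union(*corpus_ham)

before_corpus = ham_hit_rate(corpus_ham, blocklist)
after_corpus = ham_hit_rate(corpus_ham, blocklist, whitelist)
before_user = ham_hit_rate(user_ham, blocklist)
after_user = ham_hit_rate(user_ham, blocklist, whitelist)

print("benchmark ham hit rate:", before_corpus, "->", after_corpus)
print("average user ham hit rate:", before_user, "->", after_user)
```

In this toy data the benchmark sees its ham hit rate fall to zero after whitelisting, while the average user's rate does not move, so a score tuned on the benchmark would come out higher than that user's mail justifies.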
I'm results oriented. If there are some good hand-checked ham lists, I see no practical reason why we should not use them to generate message body domain whitelists.
See above.
It would be much better if you came up with other methods for generating a whitelist besides using our benchmark data, which is tiny compared to the world of ham.
One other note: just because a URL comes out of ham does *not* mean it should be whitelisted in SURBL. I expect spam URLs to be present in some ham, mostly in discussion of spam, so there should always be a small false positive rate (and the rules should be scored accordingly).
Daniel