On Tuesday, November 23, 2004, 12:14:59 PM, Rob McEwen wrote:
A. Definite hand-typed HAM
B. Closed Loop Opt-In NEWSLETTER (topically applicable to the recipient)
C. NEWSLETTER (topically applicable to the recipient) from reputable organization (no harvesting, few/no NANAS, no SpamHaus) where the person didn't actually subscribe, but likes to read it... maybe it came because they previously bought something or left a "receive other offers/info" checkbox checked
D. More "spammy" NEWSLETTER (but topically applicable to the recipient) where the mailer is fairly "clean" (some NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else.
E. More "spammy" ADVERTISEMENT (but topically applicable to the recipient) where the mailer is very "clean" (no harvesting, few NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else
All of the above should probably be considered ham for SURBL purposes. What matters more than the *sending style* is what other *uses the domain name* or IP in the URI might have.
Remember that we're not blocking sending methods. We're blocking URI mentions like domains. Therefore what matters is not how the message is sent (newsletter, hand-sent, etc.) but ***what the domain might be used for***. We don't want to block on legitimate domains. All of your examples above are for legitimate or mostly legitimate domains.
F. Definite spam (to varying degrees).
Of course, it is not always possible to know if an e-mail is "topically applicable to the recipient". But assuming that you do know, it is hard for Mail Administrators to distinguish between B, C, and D. It is also sometimes hard to distinguish between E & F.
A better question might be whether the mail is "topically applicable to ANY recipient." Since we are a global blocklist, we need to think globally and act on behalf of ALL users, not just one particular recipient.
Therefore we want to list domains that are pretty much universally regarded as spammy like cheappillz4u.biz, 0emsoftwarez.info, etc., and almost certainly not some plumbing fixture manufacturer's open subscription newsletter.
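To illustrate the point that the check is on domains mentioned in URIs rather than on how a message was sent, here is a minimal sketch. The BLOCKLIST set, the regular expression, and the function names are all hypothetical and only for illustration; real SURBL lookups are done through DNS queries, which this sketch does not attempt.

```python
import re
from urllib.parse import urlsplit

# Hypothetical local list, using the example domains from this thread.
BLOCKLIST = {"cheappillz4u.biz", "0emsoftwarez.info"}

# Deliberately simple URI matcher; real message parsing is more involved.
URI_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def uri_hosts(body):
    """Return the set of lowercased host names found in URIs in the body."""
    hosts = set()
    for uri in URI_RE.findall(body):
        host = urlsplit(uri).hostname
        if host:
            hosts.add(host.lower())
    return hosts

def listed_domains(body, blocklist=BLOCKLIST):
    """Return blocklisted domains mentioned in the body's URIs.

    Each host and its parent domains are checked, so www.example.biz
    matches an entry for example.biz. The sending style of the message
    is never looked at.
    """
    hits = set()
    for host in uri_hosts(body):
        labels = host.split(".")
        for i in range(len(labels) - 1):
            candidate = ".".join(labels[i:])
            if candidate in blocklist:
                hits.add(candidate)
    return hits
```

A hand-typed message and a bulk newsletter that both link to the same listed domain would get the same result here, which is the reason only domains with no legitimate use belong on the list.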
The overwhelming percentage of Spam IS very distinguishable from A-E because of things like obfuscation techniques, SpamTrap recipients, location of sender's server, past history of sender, etc.
I agree. We want to list only the extremely obvious spam. Usually it's for pills, mortgage, warez, gambling, porn, etc.
Still, this whole issue makes me question, "how good are Ham Corpuses?"
Moreover, when a particular SURBL gets an FP rating of 0.002%, I think, "that's great"... but then I wonder, "is this 0.002% actual human written correspondence, or is it a newsletter, etc?"
Rob McEwen
As has been noted, when getting down to 1 part in 50,000 (0.002%), it's very easy for a minor misclassification to have a huge impact on the FP numbers.
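A quick worked example of that sensitivity, as a sketch only. The corpus size of 50,000 hams and the count of two mislabeled messages are made-up numbers for illustration; 1 part in 50,000 works out to 0.002%.

```python
def fp_rate_percent(false_positives, ham_total):
    """False-positive rate as a percentage of the ham corpus."""
    return 100.0 * false_positives / ham_total

# Hypothetical corpus: 50,000 messages labeled as ham.
HAM_TOTAL = 50_000

# One genuine false positive: 1 part in 50,000.
baseline = fp_rate_percent(1, HAM_TOTAL)

# Suppose two spams were mislabeled as ham and both hit the list.
# They get counted as false positives even though they are not.
with_mislabels = fp_rate_percent(1 + 2, HAM_TOTAL)

print(f"baseline:       {baseline:.3f}%")
print(f"with mislabels: {with_mislabels:.3f}%")
print(f"ratio:          {with_mislabels / baseline:.1f}x")
```

Two labeling mistakes out of 50,000 messages (an error rate of 0.004% in the hand classification) are enough to triple the reported FP rate in this made-up case.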
Ham corpora do have errors, both FP and FN. Usually FPs can only be detected by hand-checking them again. Even highly-experienced spam-fighters make errors when classifying their ham and spam initially. To err is human.
There are also problems with the representativeness of messages in corpora. It's not always easy to put together large and broad enough collections of ham to meaningfully reflect the larger corpus of all messages in general.
Measurements like these are quite hard to do well. Corpus checks are probably best for relative differences between algorithms, etc. That is, does performance increase or decrease with a given change in coding, inclusion policies, etc.?
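A rough sketch of that kind of relative check: score two versions of a list against the same corpus and look at the differences rather than the absolute numbers. The corpus layout (one set of URI domains per message), the list contents, and the function name are all assumptions made for this example.

```python
def corpus_rates(listed, ham_msgs, spam_msgs):
    """Return (fp_rate, hit_rate) for a set of listed domains.

    ham_msgs and spam_msgs are lists with one set of URI domains per
    message. A message counts as matched if any of its domains is listed.
    """
    fp = sum(1 for domains in ham_msgs if domains & listed)
    hits = sum(1 for domains in spam_msgs if domains & listed)
    return fp / len(ham_msgs), hits / len(spam_msgs)

# Tiny hypothetical corpus, just to show the comparison.
ham = [{"example.org"}, {"plumbing-news.example"}, {"example.com"}, set()]
spam = [{"cheappillz4u.biz"}, {"0emsoftwarez.info"}]

# Two hypothetical inclusion policies: B is more aggressive than A.
list_a = {"cheappillz4u.biz"}
list_b = {"cheappillz4u.biz", "0emsoftwarez.info", "plumbing-news.example"}

fp_a, hit_a = corpus_rates(list_a, ham, spam)
fp_b, hit_b = corpus_rates(list_b, ham, spam)

print(f"FP rate change:  {fp_b - fp_a:+.2f}")
print(f"hit rate change: {hit_b - hit_a:+.2f}")
```

Because both lists are scored against the same messages, labeling errors in the corpus affect both sides roughly equally, so the direction of the change is more trustworthy than either absolute rate.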
Jeff C. -- "If it appears in hams, then don't list it."