[SURBL-Discuss] general questions.....
jeffc at surbl.org
Tue Nov 23 22:55:58 CET 2004
On Tuesday, November 23, 2004, 12:14:59 PM, Rob McEwen wrote:
> A. Definite hand-typed HAM
> B. Closed Loop Opt-In NEWSLETTER (topically applicable to the recipient)
> C. NEWSLETTER (topically applicable to the recipient) from reputable
> organization (no harvesting, few/none NANAS, no SpamHaus) where the person
> didn't actually subscribe, but likes to read it... maybe it came because
> they previously bought something or left checked a "receive other
> offers/info" checkbox
> D. More "spammy" NEWSLETTER (but topically applicable to the recipient)
> where the mailer is fairly "clean" (some NANAS, no SpamHaus), but the user
> didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox
> checked in the past when filling out something else or ordering something
> E. More "spammy" ADVERTISEMENT (but topically applicable to the recipient)
> where the mailer is very "clean" (no harvesting, few NANAS, no SpamHaus),
> but the user didn't explicitly Opt-in. Maybe they left a "receive other
> offers" checkbox checked in the past when filling out something else or
> ordering something else
All of the above should probably be considered ham for SURBL
purposes. What matters more than the *sending style* is what
other *uses the domain name* or IP in the URI might have.
Remember that we're not blocking sending methods. We're
blocking URI mentions like domains. Therefore what matters
is not how the message is sent (newsletter, hand-sent, etc.)
but ***what the domain might be used for***. We don't want to
block on legitimate domains. All of your examples above
are for legitimate or mostly legitimate domains.
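
To make the distinction concrete, here is a minimal sketch of pulling
the hosts out of the URIs mentioned in a message body, which is the
kind of data a URI blocklist looks at, as opposed to anything about how
the message was sent. The function name uri_hosts and the regular
expression are illustrative assumptions, not the code of any actual
SURBL client; real clients also reduce hosts to registered domains,
which is not attempted here.

```python
import re
from urllib.parse import urlsplit

# Assumption: only http/https URIs are considered, matched with a
# deliberately simple pattern that stops at whitespace, quotes, or
# angle brackets.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def uri_hosts(body: str) -> set[str]:
    """Return the set of hosts (domain names or IPs) found in URIs in body."""
    hosts = set()
    for match in URL_RE.finditer(body):
        host = urlsplit(match.group(0)).hostname  # lowercased by urlsplit
        if host:
            hosts.add(host)
    return hosts


if __name__ == "__main__":
    sample = "Offer: http://www.example.com/offer?id=1 or http://192.0.2.10/x"
    print(sorted(uri_hosts(sample)))
```

Note that nothing in this sketch inspects headers, sender, or mailing
method; only the hosts named in the body are considered.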
> F. Definite spam (to varying degrees).
> Of course, it is not always possible to know if an e-mail is "topically
> applicable to the recipient". But assuming that you do, it is hard for Mail
> Administrators to distinguish between B, C, and D. It is also sometimes hard
> to distinguish between E & F.
A better question might be whether the mail is "topically
applicable to ANY recipient." Since we are a global blocklist,
we need to think globally and act on behalf of ALL users,
not just one particular recipient.
Therefore we want to list domains that are pretty much
universally regarded as spammy like cheappillz4u. biz,
0emsoftwarez. info, etc., and almost certainly not some
plumbing fixture manufacturer's open subscription newsletter.
> The overwhelming percentage of Spam IS very distinguishable from A-E because
> of things like obfuscation techniques, SpamTrap recipients, location of
> sender's server, past history of sender, etc.
I agree. We want to list only the extremely obvious spam.
Usually it's for pills, mortgage, warez, gambling, porn, etc.
> Still, this whole issue makes me question, "how good are Ham Corpuses".
> Moreover, when a particular SURBL gets an FP rating of .002%, I think,
> "that's great"... but then I wonder, "is this .002% actual human written
> correspondence, or is it a newsletter, etc?"
> Rob McEwen
As has been noted, once you get down to 1 part in 50,000
(0.002%), it's very easy for a minor misclassification
to have a huge impact on the FP numbers.
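
As a rough illustration of that sensitivity, the arithmetic can be
sketched as follows. The corpus size of 50,000 hams is a hypothetical
figure chosen to match the "1 part in 50,000" above, and the helper
name fp_rate_percent is made up for this example.

```python
def fp_rate_percent(false_positives: int, ham_total: int) -> float:
    """Return the false-positive rate as a percentage of the ham corpus."""
    if ham_total <= 0:
        raise ValueError("ham_total must be positive")
    return 100.0 * false_positives / ham_total


HAM_TOTAL = 50_000  # hypothetical ham corpus size

if __name__ == "__main__":
    # Each additional misclassified message adds another 0.002 points.
    for fps in (1, 2, 3):
        rate = fp_rate_percent(fps, HAM_TOTAL)
        print(f"{fps} FP(s) in {HAM_TOTAL} hams = {rate:.3f}%")
```

With a corpus this size, a single mislabeled newsletter doubles the
reported rate from one FP to two, so small labeling errors dominate
the figure.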
Ham corpora do have errors, both FP and FN. Usually
FPs can only be detected by hand-checking them again.
Even highly-experienced spam-fighters make errors when
classifying their ham and spam initially. To err is human.
There are also problems with the representativeness
of messages in corpora. It's not always easy to put
together large and broad enough collections of ham
to meaningfully reflect the larger corpus of all messages.
Measurements like these are quite hard to do well.
Corpus checks are probably best for relative differences
between algorithms, etc. I.e. is performance increasing
or decreasing with a given change in coding, inclusion
"If it appears in hams, then don't list it."