-----Original Message-----
From: Jeff Chan [mailto:jeffc@surbl.org]
Sent: Monday, November 22, 2004 8:41 PM
To: SURBL Discussion list
Subject: Re: [SURBL-Discuss] general questions.....
On Monday, November 22, 2004, 5:25:14 PM, Justin Mason wrote:
Important to note that SURBL *can* increase its efficiency by changing its methods -- i.e. adding more data sources, modifying the moderation model, etc.
I like to think so too, but one of Terry's hypotheses is that detecting spam in the remaining variance (the ~15% currently undetected) may require some "third dimension of spam" and that about half of that variance may be truly "noise" and therefore inherently undetectable (paraphrasing him from off-list discussions). But he doesn't have data to support that claim yet, just empirical observations across different classification systems.
It's good to hear that Henry Stern is getting a PhD for his work in this area, since it can be worthy of that honor. It's not a particularly easy problem.
Jeff C.
Wow, that was a good email. It makes me think about things from a higher level than the trenches. The whole thing has to be thought of in sections. If we are thinking of JUST SURBL, then I agree that getting this remaining 15% requires more manpower thrown at the overall project. I say overall, because there are other antispam projects that support SURBL that would also be MUCH better with more help.
Looking at it from another view, the 15% IS caught! The bigger picture is antispam. You throw DNSBLs, SURBL, Bayes, SARE, and SA at the problem, and classification jumps the order of magnitude you wanted. Which for most end users can be 99.99%. Differences being tastes in the definition of the classification -- which is a human trait that can't be removed.
But I believe there is still a huge leap SURBL can make in classification. With an increase in data mining, research, and a little more help from major ISPs and registrars.
Thanks for that informative email Jeff!! You saved me a google ;)
--Chris
Differences being tastes in the definition of the classification
...which reminds me... I keep meaning to ask about what constitutes a FP when discussed on this list. Basically, this isn't always so black & white:
Consider the following classifications:
A. Definite hand-typed HAM
B. Closed Loop Opt-In NEWSLETTER (topically applicable to the recipient)
C. NEWSLETTER (topically applicable to the recipient) from reputable organization (no harvesting, few/none NANAS, no SpamHaus) where the person didn't actually subscribe, but likes to read it... maybe it came because they previously bought something or left a "receive other offers/info" checkbox checked
D. More "spammy" NEWSLETTER (but topically applicable to the recipient) where the mailer is fairly "clean" (some NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else.
E. More "spammy" ADVERTISEMENT (but topically applicable to the recipient) where the mailer is very "clean" (no harvesting, few NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else
F. Definite spam (to varying degrees).
(I'm sure someone else could have done a better job of listing hard-to-differentiate categories)
Of course, it is not always possible to know if an e-mail is "topically applicable to the recipient". But assuming that you do, it is hard for Mail Administrators to distinguish between B, C, and D. It is also sometimes hard to distinguish between E & F.
The overwhelming percentage of Spam IS very distinguishable from A-E because of things like obfuscation techniques, SpamTrap recipients, location of sender's server, past history of sender, etc.
Still, this whole issue makes me question, "how good are Ham Corpuses".
Moreover, when a particular SURBL gets an FP rating of .002%, I think, "that's great"... but then I wonder, "is this .002% actual human written correspondence, or is it a newsletter, etc?"
Rob McEwen
on Tue, Nov 23, 2004 at 03:14:59PM -0500, Rob McEwen wrote:
Differences being tastes in the definition of the classification
...which reminds me... I keep meaning to ask about what constitutes a FP when discussed on this list. Basically, this isn't always so black & white:
Consider the following classifications:
<snip categories>
(I'm sure someone else could have done a better job of listing hard-to-differentiate categories)
For me, I'm coming to the point of simply distinguishing mail delivery attempts that occur in the context of abusive behavior (e.g., as part of the same session that tries to deliver to a spamtrap), or that have so many things wrong with either the remote host (no rDNS, mismatched rDNS and HELO, known forged HELO, HELO as a blacklisted domain, etc.) or with the message itself (missing Message-ID, tracking-device header, misleading MIME content-type -- i.e., multipart/mixed with only one part, which though legal (!) is a very strong indicator of spam, etc.) that rejection is warranted.
I see a future in which legit mail servers are simply expected to be configured within a reasonable bound, and act in reasonably nonabusive ways, or else their mail will be rejected. Here, anyway. Unfortunately, the spammers will likely simply beat us to it, so even these checks become less useful.
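Steven's host and message checks can be sketched as a small routine that collects red flags. This is only an illustration of the heuristics he lists; the parameter names and inputs are hypothetical, not any particular MTA's API.

```python
def suspicious_signs(rdns, helo, headers, mime_type, mime_parts):
    """Collect the red flags described above.

    All inputs are illustrative: rdns is the sender's reverse-DNS name
    (or None), helo the HELO/EHLO argument, headers a dict of message
    headers, mime_type the top-level Content-Type, mime_parts the count
    of MIME parts.
    """
    signs = []
    if not rdns:
        signs.append("no rDNS")
    elif helo and rdns.lower() != helo.lower():
        signs.append("rDNS/HELO mismatch")
    if "Message-ID" not in headers:
        signs.append("missing Message-ID")
    if mime_type == "multipart/mixed" and mime_parts == 1:
        # Legal per MIME, but a very strong spam indicator in practice.
        signs.append("single-part multipart/mixed")
    return signs
```

A mail server built along these lines would reject (or heavily score) any delivery attempt that accumulates several such signs.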
On Tuesday, November 23, 2004, 12:25:16 PM, Steven Champeon wrote:
For me, I'm coming to the point of simply distinguishing mail delivery attempts that occur in the context of abusive behavior (e.g., as part of the same session that tries to deliver to a spamtrap), or that have so many things wrong with either the remote host (no rDNS, mismatched rDNS and HELO, known forged HELO, HELO as a blacklisted domain, etc.) or with the message itself (missing Message-ID, tracking-device header, misleading MIME content-type -- i.e., multipart/mixed with only one part, which though legal (!) is a very strong indicator of spam, etc.) that rejection is warranted.
Which is OK for the breadth-first approach you guys take.
But for SURBLs we need that narrowed down to 100% pure spammers only. That's probably an impossible task, but that should be our goal.
I see a future in which legit mail servers are simply expected to be configured within a reasonable bound, and act in reasonably nonabusive ways, or else their mail will be rejected. Here, anyway. Unfortunately, the spammers will likely simply beat us to it, so even these checks become less useful.
Yeah, it just means the spammers will need to fake or steal services better. That's why sender checks are probably less useful than content checks.
Jeff C. -- "If it appears in hams, then don't list it."
on Tue, Nov 23, 2004 at 01:58:55PM -0800, Jeff Chan wrote:
On Tuesday, November 23, 2004, 12:25:16 PM, Steven Champeon wrote:
For me, I'm coming to the point of simply distinguishing mail delivery attempts that occur in the context of abusive behavior (e.g., as part of the same session that tries to deliver to a spamtrap), or that have so many things wrong with either the remote host (no rDNS, mismatched rDNS and HELO, known forged HELO, HELO as a blacklisted domain, etc.) or with the message itself (missing Message-ID, tracking-device header, misleading MIME content-type -- i.e., multipart/mixed with only one part, which though legal (!) is a very strong indicator of spam, etc.) that rejection is warranted.
Which is OK for the breadth-first approach you guys take.
But for SURBLs we need that narrowed down to 100% pure spammers only. That's probably an impossible task, but that should be our goal.
Yeah, I understand that completely.
I see a future in which legit mail servers are simply expected to be configured within a reasonable bound, and act in reasonably nonabusive ways, or else their mail will be rejected. Here, anyway. Unfortunately, the spammers will likely simply beat us to it, so even these checks become less useful.
Yeah, it just means the spammers will need to fake or steal services better. That's why sender checks are probably less useful than content checks.
I dunno - with 50 million (or more) zombies out there? Sender checks are going to be useful for a good long time. As long as we can keep the fixed-netblock spammers in check with DNSBLs like SBL we'll do well.
On Wednesday, November 24, 2004, 6:47:56 AM, Steven Champeon wrote:
on Tue, Nov 23, 2004 at 01:58:55PM -0800, Jeff Chan wrote:
On Tuesday, November 23, 2004, 12:25:16 PM, Steven Champeon wrote:
I see a future in which legit mail servers are simply expected to be configured within a reasonable bound, and act in reasonably nonabusive ways, or else their mail will be rejected. Here, anyway. Unfortunately, the spammers will likely simply beat us to it, so even these checks become less useful.
Yeah, it just means the spammers will need to fake or steal services better. That's why sender checks are probably less useful than content checks.
I dunno - with 50 million (or more) zombies out there? Sender checks are going to be useful for a good long time. As long as we can keep the fixed-netblock spammers in check with DNSBLs like SBL we'll do well.
I use regular SBL too, but spammers have found ways around RBLs, such as zombies that only send a few messages, etc.
Jeff C. -- "If it appears in hams, then don't list it."
Rob McEwen wrote:
Moreover, when a particular SURBL gets an FP rating of .002%, I think, "that's great"... but then I wonder, "is this .002% actual human written correspondence, or is it a newsletter, etc?"
Here's a case which I've seen happen, more than once:
I cater for company A, users get no spam.
"User"'s buddy in Company B with no antispam forwards a great offer for a US mortgage (for a Swiss citizen? DUMBO!) and guess what happens, SURBL catches this message ....
what do you do? whitelist the SURBL entry, whitelist DUMBO, ignore the FP....
yes, I've chosen to ignore the FP, and I'm sure "User"'s boss would approve.....
but can an ISP do this easily? nope.....
what do others do?
Alex
PS: Nice to have WS back to normal - hands off that kernel Bill!
On Tuesday, November 23, 2004, 12:14:59 PM, Rob McEwen wrote:
A. Definite hand-typed HAM
B. Closed Loop Opt-In NEWSLETTER (topically applicable to the recipient)
C. NEWSLETTER (topically applicable to the recipient) from reputable organization (no harvesting, few/none NANAS, no SpamHaus) where the person didn't actually subscribe, but likes to read it... maybe it came because they previously bought something or left a "receive other offers/info" checkbox checked
D. More "spammy" NEWSLETTER (but topically applicable to the recipient) where the mailer is fairly "clean" (some NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else.
E. More "spammy" ADVERTISEMENT (but topically applicable to the recipient) where the mailer is very "clean" (no harvesting, few NANAS, no SpamHaus), but the user didn't explicitly Opt-in. Maybe they left a "receive other offers" checkbox checked in the past when filling out something else or ordering something else
All of the above should probably be considered ham for SURBL purposes. What matters more than the *sending style* is what other *uses* the domain name or IP in the URI might have.
Remember that we're not blocking sending methods. We're blocking URI mentions like domains. Therefore what matters is not how the message is sent (newsletter, hand-send, etc.) but ***what the domain might be used for***. We don't want to block on legitimate domains. All of your examples above are for legitimate or mostly legitimate domains.
F. Definite spam (to varying degrees).
Of course, it is not always possible to know if an e-mail is "topically applicable to the recipient". But assuming that you do, it is hard for Mail Administrators to distinguish between B, C, and D. It is also sometimes hard to distinguish between E & F.
A better question might be whether the mail is "topically applicable to ANY recipient." Since we are a global blocklist, we need to think globally and act on behalf of ALL users, not just one particular recipient.
Therefore we want to list domains that are pretty much universally regarded as spammy like cheappillz4u. biz, 0emsoftwarez. info, etc., and almost certainly not some plumbing fixture manufacturer's open subscription newsletter.
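As a concrete illustration of how a SURBL is consulted: a client looks up the URI's domain under the list's zone and decodes the last octet of the DNS answer as a bitmask of member lists. The multi.surbl.org zone name is real; the bit-to-list assignments in the example below are placeholders, not SURBL's published values, which should be checked against the SURBL documentation.

```python
def surbl_query_name(domain, zone="multi.surbl.org"):
    """Build the DNS name to look up for a URI's domain."""
    return f"{domain}.{zone}"

def decode_surbl_answer(a_record, lists):
    """Decode a 127.0.0.X answer into list names via its bitmask.

    `lists` maps bit values to list names; the real assignments are
    published by SURBL and are only sketched in the test below.
    """
    last_octet = int(a_record.rsplit(".", 1)[1])
    return [name for bit, name in sorted(lists.items()) if last_octet & bit]
```

An actual client would resolve the query name with its DNS library; an NXDOMAIN answer means the domain is not listed.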
The overwhelming percentage of Spam IS very distinguishable from A-E because of things like obfuscation techniques, SpamTrap recipients, location of sender's server, past history of sender, etc.
I agree. We want to list only that extremely obvious spam. Usually it's for pills, mortgage, warez, gambling, porn, etc.
Still, this whole issue makes me question, "how good are Ham Corpuses".
Moreover, when a particular SURBL gets an FP rating of .002%, I think, "that's great"... but then I wonder, "is this .002% actual human written correspondence, or is it a newsletter, etc?"
Rob McEwen
As has been noted, when you get down to 1 part in 50,000 (0.002%), it's very easy for a minor misclassification to have a huge impact on the FP numbers.
Ham corpora do have errors, both FP and FN. Usually FPs can only be detected by hand-checking them again. Even highly-experienced spam-fighters make errors when classifying their ham and spam initially. To err is human.
There are also problems with the representativeness of messages in corpora. It's not always easy to put together large and broad enough collections of ham to meaningfully reflect the larger corpus of all messages in general.
Measurements like these are quite hard to do well. Corpus checks are probably best for relative differences between algorithms, etc. I.e. is performance increasing or decreasing with a given change in coding, inclusion policies, etc.
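The corpus measurement being discussed can be sketched as follows: given hand-labeled (actual, predicted) pairs, compute the FP rate over the ham and the FN rate over the spam. At 1 FP in 50,000 hams, a single mislabeled message moves the FP figure by 0.002%, which is why the relative comparisons Jeff suggests are safer than absolute claims.

```python
def error_rates(results):
    """Compute (FP rate, FN rate) from (is_spam, flagged_as_spam) pairs.

    The FP rate is the fraction of ham that was flagged; the FN rate
    is the fraction of spam that was missed.
    """
    ham = [flagged for is_spam, flagged in results if not is_spam]
    spam = [flagged for is_spam, flagged in results if is_spam]
    fp_rate = sum(ham) / len(ham) if ham else 0.0
    fn_rate = sum(1 for f in spam if not f) / len(spam) if spam else 0.0
    return fp_rate, fn_rate
```

Running two versions of a list over the same fixed corpus and comparing these rates shows whether a policy change helped, even if the corpus's own labels contain a few errors.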
Jeff C. -- "If it appears in hams, then don't list it."
Jeff, I didn't mean to make you have to rehash the standards for SURBL. I totally understood these already and I didn't mean to imply differently in my original post. (But I suppose you have to always be on your guard to prevent misunderstandings. You can never be too careful...)
But your answers regarding the corpuses were exactly what I was questioning. Basically, 1 FP in 50,000 is not bad. But if most of these FPs are "white-hat marketer" advertisements (an oxymoron?) or newsletters ...and few of them are actual human-typed correspondence, then this percentage is even better. If the opposite is true, then this might not be quite as good as it sounds.
Interestingly, I've read some phenomenal and very specific stats from Mail Filtering companies who don't get specific about these kinds of issues mentioned here and I wonder "who are they kidding".
Rob McEwen
On Tuesday, November 23, 2004, 2:18:09 PM, Rob McEwen wrote:
Jeff, I didn't mean to make you have to rehash the standards for SURBL. I totally understood these already and I didn't mean to imply differently in my original post. (But I suppose you have to always be on your guard to prevent misunderstandings. You can never be too careful...)
I appreciate seeing your examples and getting to discuss some of them. It's probably good to discuss some of the things we're all trying to do.
But your answers regarding the corpuses were exactly what I was questioning. Basically, 1 FP in 50,000 is not bad. But if most of these FPs are "white-hat marketer" advertisements (an oxymoron?) or newsletters ...and few of them are actual human-typed correspondence, then this percentage is even better. If the opposite is true, then this might not be quite as good as it sounds.
Yes, getting down in the small fractions of percents is a little like looking for subatomic particles. You never know exactly what you might find when you look there....
Interestingly, I've read some phenomenal and very specific stats from Mail Filtering companies who don't get specific about these kinds of issues mentioned here and I wonder "who are they kidding".
Anyone who will believe them? ;-)
Jeff C. -- "If it appears in hams, then don't list it."