-----Original Message-----
From: Jeff Chan [mailto:jeffc@surbl.org]
Sent: Friday, November 19, 2004 5:38 AM
To: SURBL Discuss
Subject: Re: [SURBL-Discuss] general questions.....
On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
About 15% of the spams I get are not in SURBL, but are by the time I try to add :)
Ask Terry Sullivan sometime what the theoretical maximum detection rate of a collective spam classification system might be. He had some research showing it maxes out at around 85%. So we're probably already pretty close to the theoretical limits of this type of system.
Methinks I need to google for more data on this :)
I have not done any study of domains that continue to try to spam despite being in SURBL. Any numbers on these? Possibly the most/longest hit domain in SURBL lookups??
Should we post the top 25 lookups to SURBL?
You mean like:
Perfect! This is what I mean: block port 80 (or all ports, for that matter) for the following:
Hits   Domain
1875   imgehost.com
Hosted by Electric Lightwave, eli.net.
Domain List matching dns_a of 67.50.118.130 (48 total matches):
* 1: 123onlinecash.com
* 2: 500fastcash.com
* 3: absoluteroi.com
* 4: americash-online.com
* 5: azooimages.com
* 6: camasterd.com
* 7: cashadvancenow.com
* 8: cashbackvalues.com
* 9: cashbuzz.com
* 10: cbvmasterd.com
* 11: costamasterd.com
* 12: cvcmasterd.com
* 13: d1masterd.com
* 14: dabogus.com
* 15: directdepositcash.com
* 16: efastcashloans.com
* 17: egcmasterd.com
* 18: epointmasterd.com
* 19: equity1auto.com
* 20: equityoneauto.com
* 21: ezcash-online.com
* 22: fast-funds-online.com
* 23: fastcashandgas.com
* 24: fastcashusa.com
* 25: financialhosting.com
* 26: hostimages.net
* 27: imagedataserver.com
* 28: imagesbyaz.com
* 29: imgehost.com
* 30: imgserver.net
* 31: inamasterd.com
* 32: lighteningcash.com
* 33: mbcashmasterd.com
* 34: mycash-online.com
* 35: myonlinepayday.com
* 36: oledirect.com
* 37: oledirect2.com
* 38: oneclickcash.com
* 39: paydaycity.com
* 40: pclmasterd.com
* 41: ptymasterd.com
* 42: sellingsource.com
* 43: smartshopperonline.com
* 44: steaksofstlouis.com
* 45: tpmasterd.com
* 46: webfastcash.com
* 47: xenlog.com
* 48: yourfastcash.com
By blocking port 80 (or all) at the firewall for this IP address, you don't have to worry about them getting new domain names. Only the worst cases should be blocked. If you have 48 spam domains on one host, you suck as an ISP :) I seriously would like to hear the ISP's argument for being unblocked on this one.
--Chris
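A minimal sketch of the kind of block described above, assuming a Linux gateway running iptables with root access; the helper name is made up for this sketch and is not a SURBL tool:

    # Illustrative only: drop forwarded traffic to a spam-hosting IP at a
    # Linux gateway. Assumes iptables is installed and we run as root.
    import subprocess

    def block_host(ip, port=80):
        """Insert a DROP rule for traffic forwarded to `ip`; port=None blocks all ports."""
        cmd = ["iptables", "-I", "FORWARD", "-d", ip]
        if port is not None:
            cmd += ["-p", "tcp", "--dport", str(port)]
        cmd += ["-j", "DROP"]
        subprocess.run(cmd, check=True)

    # block_host("67.50.118.130")        # web traffic only
    # block_host("67.50.118.130", None)  # all ports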
On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
From: Jeff Chan [mailto:jeffc@surbl.org]
I have not done any study of domains that continue to try to spam despite being in SURBL. Any numbers on these? Possibly the most/longest hit domain in SURBL lookups??
Should we post the top 25 lookups to SURBL?
You mean like:
Perfect! This is what I mean: block port 80 (or all ports, for that matter) for the following:
Hits   Domain
1875   imgehost.com
Hosted by Electric Lightwave, eli.net.
Domain List matching dns_a of 67.50.118.130 (48 total matches):
* 1: 123onlinecash.com
* 2: 500fastcash.com
[...]
Uh, but that won't block the spam....
Jeff C. -- "If it appears in hams, then don't list it."
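The point behind that one-liner: a firewall rule only keeps people from reaching the spamvertised site; the spam itself still arrives in inboxes. SURBL-based filtering instead checks the domains found in the message body against the list. A minimal sketch of that kind of lookup, assuming the public multi.surbl.org zone, where listed domains resolve to a 127.0.0.x address and unlisted ones return NXDOMAIN:

    # Minimal sketch, not SURBL's own tooling: check a body domain against
    # multi.surbl.org, where a 127.0.0.x answer means the domain is listed.
    import socket

    def surbl_listed(domain, zone="multi.surbl.org"):
        try:
            answer = socket.gethostbyname("%s.%s" % (domain, zone))
        except socket.gaierror:
            return False      # NXDOMAIN: not listed
        return answer.startswith("127.0.0.")

    # surbl_listed("imgehost.com") would have returned True while it was listed.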
On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
From: Jeff Chan [mailto:jeffc@surbl.org]
On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
About 15% of the spams I get are not in SURBL, but are by the time I try to add :)
Ask Terry Sullivan sometime what the theoretical maximum detection rate of a collective spam classification system might be. He had some research showing it maxes out at around 85%. So we're probably already pretty close to the theoretical limits of this type of system.
Methinks I need to google for more data on this :)
Here is Terry's reference and some commentary. I think it fits in line with what we've seen. Interestingly, it also sounds like he supports using a greylist to capture spam more broadly, then filtering some of those entries down to truly black for regular SURBL listing.
On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
I mentioned on the SURBL discussion list that we may be approaching theoretical limits and there was some interest expressed in a reference. Could I trouble you to dig one up for us? :-)
Sure. Here's the cite:
Buckland, M. and Gey, F. (1994). The trade-off between recall and precision. Journal of the American Society for Information Science, 45, 12-19.
For those who are able to track down JASIS at a local university library, it's important to keep several things in mind while harvesting insight from this article:
The article is steeped in the vocabulary of IR (topical search), not spam classification. However, spam classification and IR are both just special-cases of binary document classification. (That is, ham/spam, or relevant/nonrelevant, are both simply special cases of good/bad; it's all the same.)
It's crucial to remember that spam classification targets the "bad" documents, while IR targets the "good" documents. In each case, though, we have a category of things-we-want-to-find, and another category of things-we-want-to-ignore. The process is the same, but the metrics are "backwards."
The terms used in the article (Precision and Recall) to describe classification performance correspond to accuracy and coverage (respectively). "Precision" can be thought of as 1-FPR, while "Recall" is 1-FNR.
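To make that mapping concrete, here is a small illustrative sketch (not from the article) that computes the measures from raw counts, where "positive" means "flagged as spam":

    # Illustrative only: the article's measures computed from a confusion
    # matrix (tn completes the matrix but isn't needed for these three).
    def measures(tp, fp, fn, tn):
        precision = tp / (tp + fp)   # accuracy/purity: 1 minus the share of flagged mail that is ham
        recall    = tp / (tp + fn)   # coverage: share of actual spam caught
        fnr       = fn / (tp + fn)   # missed spam; recall == 1 - fnr
        return precision, recall, fnr

    # A filter that catches 85 of 100 spams with one false positive among 900 hams:
    # measures(tp=85, fp=1, fn=15, tn=899)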
Because I suspect that many of your list readers won't be able to lay hands on the full text of the article, I've taken the article abstract and done some global search-and-replace operations, to replace the IR-specific vocabulary with a more general classification vocabulary. Selected portions of the _munged_ abstract follow:
Summary: The traditional measures of classification performance are coverage (completeness) and accuracy (purity). Empirical studies of classifier performance have shown a tendency for accuracy to decline as coverage increases. ...A trade-off between accuracy and coverage is entailed unless, as the total number of documents classified increases, classifier performance is equal to or better than overall classification performance thus far. **
...If coverage is modeled by a polynomial function of proportion of documents found, then accuracy is modeled by a lower order polynomial function of the same variable.
...Two-stage, or, more generally, multistage classification procedures, whereby a (larger) document set is used for subsequent, more detailed analysis, are likely to achieve the goal of improving both accuracy and coverage simultaneously, even though the trade-off between them cannot be avoided.
** My note: this is the key "take-away" point. The math doesn't lie: accuracy and coverage are *necessarily* inversely related unless classifier performance somehow manages to magically improve as a function of the number of items examined. And yet, performance for any single classifier/classification method is best conceived as a "constant." By implication, the only possible way to simultaneously achieve both accuracy and coverage is to adopt a "breadth-first" approach, where a larger (and inevitably "grayer") pool of candidate documents is subjected to a multistage classification regimen.
and:
I was looking at the email I sent you earlier, and it occurs to me that something in it is "obvious" to me, but may not necessarily be obvious to others not as heavily steeped in automatic classification research...
The reason that the whole accuracy/coverage tradeoff is relevant to SURBL goes back to the notion that you're right up against the upper limit of coverage (1-FNR) for a given/predefined level of accuracy (1-FPR). The Buckland & Gey article is pertinent because it demonstrates that the only way to increase coverage for any given (single) classifier is _at the expense of accuracy_. Since accuracy (or its inverse, FPR) is something you want to hold constant, coverage necessarily suffers.
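A tiny sketch of that multistage idea as it might apply here; the two predicate names are hypothetical stand-ins, not actual SURBL checks. A cheap, high-coverage first pass builds a grey candidate pool, and a stricter second pass promotes only the clear cases to black:

    # Hypothetical two-stage filter: a wide first net, then a precise second pass.
    def two_stage(candidate_domains, looks_spammy, confirmed_spammy):
        grey  = [d for d in candidate_domains if looks_spammy(d)]   # high coverage, lower accuracy
        black = [d for d in grey if confirmed_spammy(d)]            # high accuracy within the grey pool
        return grey, black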
Jeff C. -- "If it appears in hams, then don't list it."
Good mail, but I don't have the time to think much about it! If only Henry Stern had finished his PhD, we could get his thoughts ;)
It's important to note that SURBL *can* increase its efficiency by changing its methods -- i.e., adding more data sources, modifying the moderation model, etc.
--j.
Jeff Chan writes:
[...]
On Monday, November 22, 2004, 5:25:14 PM, Justin Mason wrote:
It's important to note that SURBL *can* increase its efficiency by changing its methods -- i.e., adding more data sources, modifying the moderation model, etc.
I like to think so too, but one of Terry's hypotheses is that detecting spam in the remaining variance (the ~15% currently undetected) may require some "third dimension of spam" and that about half of that variance may be truly "noise" and therefore inherently undetectable (paraphrasing him from off-list discussions). But he doesn't have data to support that claim yet, just empirical observations across different classification systems.
It's good to hear that Henry Stern is getting a PhD for his work in this area, since the work can be worthy of that honor. It's not a particularly easy problem.
Jeff C. -- "If it appears in hams, then don't list it."