On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
From: Jeff Chan [mailto:jeffc@surbl.org]
On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
About 15% of the spams I get are not in SURBL, but are by the time I try to add :)
Ask Terry Sullivan sometime what the theoretical maximum detection rate of a collective spam classification system might be. He had some research showing it maxes out at around 85%. So we're probably already pretty close to the theoretical limits of this type of system.
Me thinks I need to google for more data on this :)
Here is Terry's reference and some commentary. I think it fits in line with what we've seen. Interestingly, it also sounds like he supports a greylist to capture spam more broadly, then filtering some of those entries down to truly black for regular SURBL listing.
On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
I mentioned on the SURBL discussion list that we may be approaching theoretical limits and there was some interest expressed in a reference. Could I trouble you to dig one up for us? :-)
Sure. Here's the cite:
Buckland, M. and Gey, F. (1994). The trade-off between recall and precision. Journal of the American Society for Information Science, 45, 12-19.
For those who are able to track down JASIS at a local university library, it's important to keep several things in mind while harvesting insight from this article:
The article is steeped in the vocabulary of IR (topical search), not spam classification. However, spam classification and IR are both just special cases of binary document classification. (That is, ham/spam and relevant/nonrelevant are both simply special cases of good/bad; it's all the same.)
It's crucial to remember that spam classification targets the "bad" documents, while IR targets the "good" documents. In each case, though, we have a category of things-we-want-to-find, and another category of things-we-want-to-ignore. The process is the same, but the metrics are "backwards."
The terms the article uses to describe classification performance, Precision and Recall, correspond to accuracy and coverage, respectively. "Precision" can be thought of as 1-FPR, while "Recall" is 1-FNR.
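For readers without an IR background, here is a minimal Python sketch (made-up counts, not SURBL data) showing how the two vocabularies line up:

# Minimal sketch with invented counts: how Precision/Recall map onto the
# accuracy/coverage vocabulary used in this thread.

def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = how pure the listings are; Recall = how much spam we cover."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical tallies: spam URIs correctly listed, ham URIs wrongly listed,
# and spam URIs we failed to list.
p, r = precision_recall(true_positives=850, false_positives=5, false_negatives=150)

# In the terms above, FPR (the fraction of listings that hit ham) is
# 1 - precision, and FNR (the fraction of spam missed) is 1 - recall.
print("accuracy (Precision, 1-FPR): %.3f" % p)   # 0.994
print("coverage (Recall,    1-FNR): %.3f" % r)   # 0.850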
Because I suspect that many of your list readers won't be able to lay hands on the full text of the article, I've taken the article abstract and done some global search-and-replace operations, to replace the IR-specific vocabulary with a more general classification vocabulary. Selected portions of the _munged_ abstract follow:
Summary: The traditional measures of classification performance are coverage (completeness) and accuracy (purity). Empirical studies of classifier performance have shown a tendency for accuracy to decline as coverage increases. ...A trade-off between accuracy and coverage is entailed unless, as the total number of documents classified increases, classifier performance is equal to or better than overall classification performance thus far. **
...If coverage is modeled by a polynomial function of proportion of documents found, then accuracy is modeled by a lower order polynomial function of the same variable.
...Two-stage, or, more generally, multistage classification procedures, whereby a (larger document set) is used for subsequent, more detailed analysis, are likely to achieve the goal of improving both accuracy and coverage simultaneously, even though the trade-off between them cannot be avoided.
** My note: this is the key "take-away" point. The math doesn't lie: accuracy and coverage are *necessarily* inversely related unless classifier performance somehow manages to magically improve as a function of the number of items examined. And yet, performance for any single classifier/classification method is best conceived as a "constant." By implication, the only possible way to simultaneously achieve both accuracy and coverage is to adopt a "breadth-first" approach, where a larger (and inevitably "grayer") pool of candidate documents is subjected to a multistage classification regimen.
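To make the "breadth-first" idea concrete, here is a toy Python sketch of a two-stage grey/black workflow. The thresholds, field names, and example domains are invented purely for illustration; this is not how SURBL actually decides listings.

# Toy two-stage ("breadth-first") classifier: a permissive first stage builds a
# large, grey candidate pool (coverage), and a stricter second stage promotes
# only the clearly bad entries to black (accuracy). All numbers are invented.

GREY_MIN_REPORTS = 2     # loose threshold: admits many grey/uncertain domains
BLACK_MIN_REPORTS = 25   # strict threshold: high confidence before going black
BLACK_MAX_HAM_HITS = 0   # never promote anything ever seen in ham

def classify(stats):
    """Return 'black', 'grey', or 'unlisted' for one domain's counters."""
    if stats["spam_reports"] >= BLACK_MIN_REPORTS and stats["ham_hits"] <= BLACK_MAX_HAM_HITS:
        return "black"       # second stage: small, very accurate list
    if stats["spam_reports"] >= GREY_MIN_REPORTS:
        return "grey"        # first stage: broad, lower-accuracy pool
    return "unlisted"

candidates = {
    "pill-pusher.example":  {"spam_reports": 300, "ham_hits": 0},
    "maybe-shady.example":  {"spam_reports": 4,   "ham_hits": 0},
    "popular-host.example": {"spam_reports": 40,  "ham_hits": 12},
}

for domain, stats in candidates.items():
    print(domain, "->", classify(stats))
# pill-pusher.example -> black, maybe-shady.example -> grey,
# popular-host.example -> grey (ham hits keep it off the black list)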
and:
I was looking at the email I sent you earlier, and it occurs to me that something in it is "obvious" to me, but may not necessarily be obvious to others not as heavily steeped in automatic classification research...
The reason that the whole accuracy/coverage tradeoff is relevant to SURBL goes back to the notion that you're right up against the upper limit of coverage (1-FNR) for a given/predefined level of accuracy (1-FPR). The Buckland & Gey article is pertinent because it demonstrates that the only way to increase coverage for any given (single) classifier is _at the expense of accuracy_. Since accuracy (or its complement, FPR) is something you want to hold constant, coverage necessarily suffers.
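A small simulation (invented score distributions, nothing to do with real SURBL feeds) shows the effect: sweeping a single classifier's threshold downward buys coverage only by giving up accuracy.

import random

random.seed(1)

# Invented score distributions for a single classifier: spam tends to score
# high, ham lower, but the two overlap, so no threshold is perfect.
spam_scores = [random.gauss(0.75, 0.15) for _ in range(1000)]
ham_scores = [random.gauss(0.35, 0.15) for _ in range(1000)]

print("threshold  coverage (1-FNR)  accuracy (1-FPR)")
for threshold in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
    tp = sum(s >= threshold for s in spam_scores)    # spam correctly caught
    fp = sum(s >= threshold for s in ham_scores)     # ham wrongly caught
    coverage = tp / len(spam_scores)
    accuracy = tp / (tp + fp) if (tp + fp) else 1.0
    print("%9.1f  %16.3f  %16.3f" % (threshold, coverage, accuracy))
# As the threshold drops, coverage climbs toward 1.0 while accuracy falls:
# the trade-off Buckland & Gey describe for any single-stage classifier.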
Jeff C. -- "If it appears in hams, then don't list it."