[SURBL-Discuss] general questions.....
jm at jmason.org
Tue Nov 23 02:25:14 CET 2004
Good mail, but I don't have the time to think much about it! If only
Henry Stern had finished his PhD, we could get his thoughts ;)
It's important to note that SURBL *can* increase its efficiency by
changing its methods -- i.e. adding more data sources, modifying the
moderation model, and so on.
Jeff Chan writes:
> On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
> >>From: Jeff Chan [mailto:jeffc at surbl.org]
> >>On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
> >>> About 15% of the spams I get are not in SURBL, but are by
> >>> the time I try to add :)
> >>Ask Terry Sullivan sometime what the theoretical maximum
> >>detection rate of a collective spam classification system might
> >>be. He had some research showing it maxes out at around 85%.
> >>So we're probably already pretty close to the theoretical
> >>limits of this type of system.
> > Me thinks I need to google for more data on this :)
> Here is Terry's reference and some commentary. I think
> it's in line with what we've seen. Interestingly,
> it also sounds like he supports a greylist to capture
> spam more broadly, then filter some of those down to
> truly black for regular SURBL listing.
> > On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
> >>I mentioned on the SURBL discussion list that we may be
> >>approaching theoretical limits and there was some interest
> >>expressed in a reference. Could I trouble you to dig one
> >>up for us? :-)
> > Sure. Here's the cite:
> > Buckland, M. and Gey, F. (1994). The trade-off
> > between recall and precision. Journal of the
> > American Society for Information Science, 45, 12-19.
> > For those who are able to track down JASIS at a local university
> > library, it's important to keep several things in mind while harvesting
> > insight from this article:
> > 1) The article is steeped in the vocabulary of IR (topical search),
> > not spam classification. However, spam classification and IR
> > are both just special cases of binary document classification.
> > (That is, ham/spam, or relevant/nonrelevant, are both simply
> > special cases of good/bad; it's all the same.)
> > 2) It's crucial to remember that spam classification targets the
> > "bad" documents, while IR targets the "good" documents. In each
> > case, though, we have a category of things-we-want-to-find, and
> > another category of things-we-want-to-ignore. The process is
> > the same, but the metrics are "backwards."
> > 3) The terms used in the article (Precision and Recall) to describe
> > classification performance correspond to accuracy and coverage
> > (respectively). "Precision" can be read as 1-FPR, taking FPR as the
> > false-positive fraction among the documents flagged, while "Recall"
> > is 1-FNR.
> > Because I suspect that many of your list readers won't be able to lay
> > hands on the full text of the article, I've taken the article abstract
> > and done some global search-and-replace operations, to replace the
> > IR-specific vocabulary with a more general classification vocabulary.
> > Selected portions of the _munged_ abstract follow:
> > Summary: The traditional measures of classification
> > performance are coverage (completeness) and accuracy
> > (purity). Empirical studies of classifier performance
> > have shown a tendency for accuracy to decline as
> > coverage increases. ...A trade-off between accuracy
> > and coverage is entailed unless, as the total number
> > of documents classified increases, classifier
> > performance is equal to or better than overall
> > classification performance thus far. **
> > ...If coverage is modeled by a polynomial function of
> > proportion of documents found, then accuracy is modeled
> > by a lower order polynomial function of the same variable.
> > ...Two-stage, or, more generally, multistage
> > classification procedures, whereby a (larger document set)
> > is used for subsequent, more detailed analysis, are likely
> > to achieve the goal of improving both accuracy and coverage
> > simultaneously, even though the trade-off between them
> > cannot be avoided.
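That multistage idea is essentially the greylist Jeff mentions above: a
deliberately broad first stage for coverage, then a stricter second stage
for accuracy. A toy sketch of the shape of such a pipeline (the list names
and domains are hypothetical, not actual SURBL code or data):

    # Hypothetical two-stage lookup. Stage 1 casts a wide net (grey);
    # stage 2 promotes only vetted entries to black.
    GREY = {"cheap-meds.example", "nice-shop.example"}  # broad candidate pool
    BLACK = {"cheap-meds.example"}                      # vetted subset of GREY

    def classify(domain):
        if domain not in GREY:
            return "ham"    # never even a candidate
        if domain in BLACK:
            return "spam"   # survived the stricter second stage
        return "grey"       # candidate: score it, don't block outright

    for d in ("cheap-meds.example", "nice-shop.example", "mom.example"):
        print(d, "->", classify(d))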
> > ** My note: this is the key "take-away" point. The math doesn't lie:
> > accuracy and coverage are *necessarily* inversely related unless
> > classifier performance somehow manages to magically improve as a
> > function of the number of items examined. And yet, performance for any
> > single classifier/classification method is best conceived as a
> > "constant." By implication, the only possible way to simultaneously
> > achieve both accuracy and coverage is to adopt a "breadth-first"
> > approach, where a larger (and inevitably "grayer") pool of candidate
> > documents is subject to a multistage classification regimen.
> > I was looking at the email I sent you earlier, and it occurs to me that
> > something in it is "obvious" to me, but may not necessarily be obvious
> > to others not as heavily steeped in automatic classification research...
> > The reason that the whole accuracy/coverage tradeoff is relevant to
> > SURBL goes back to the notion that you're right up against the upper
> > limit of coverage (1-FNR) for a given/predefined level of accuracy
> > (1-FPR). The Buckland & Gey article is pertinent because it
> > demonstrates that the only way to increase coverage for any given
> > (single) classifier is _at the expense of accuracy_. Since accuracy (or
> > rather its complement, FPR) is something you want to hold constant, coverage
> > necessarily suffers.
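A toy calculation (numbers invented purely for illustration) shows the
shape of that constraint: if each additional data source is noisier than
the last, false listings grow faster than true ones, so accuracy falls as
coverage rises:

    # Invented numbers: source quality degrades with breadth, so fp
    # grows faster than tp as more data sources are added.
    spam_total = 1000
    for sources in (1, 3, 5, 8):
        tp = 100 * sources       # true listings grow linearly
        fp = sources * sources   # false listings grow quadratically
        coverage = tp / spam_total
        accuracy = tp / (tp + fp)
        print(f"{sources} sources: coverage={coverage:.0%}, accuracy={accuracy:.3f}")

    # 1 source:  coverage=10%, accuracy=0.990
    # 8 sources: coverage=80%, accuracy=0.926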
> Jeff C.
> "If it appears in hams, then don't list it."