On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
From: Jeff Chan [mailto:jeffc@surbl.org]
On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
About 15% of the spams I get are not in SURBL, but are by the time I try to add :)
Ask Terry Sullivan sometime what the theoretical maximum detection rate of a collective spam classification system might be. He had some research showing it maxes out at around 85%. So we're probably already pretty close to the theoretical limits of this type of system.
Me thinks I need to google for more data on this :)
Here is Terry's reference and some commentary. I think it fits in line with what we've seen. Interestingly, it also sounds like he supports a greylist to capture spam more broadly, then filtering some of those entries down to truly black for regular SURBL listing.
On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
I mentioned on the SURBL discussion list that we may be approaching theoretical limits and there was some interest expressed in a reference. Could I trouble you to dig one up for us? :-)
Sure. Here's the cite:
Buckland, M. and Gey, F. (1994). The trade-off between recall and precision. Journal of the American Society for Information Science, 45, 12-19.
For those who are able to track down JASIS at a local university library, it's important to keep several things in mind while harvesting insight from this article:
The article is steeped in the vocabulary of IR (topical search), not spam classification. However, spam classification and IR are both just special cases of binary document classification. (That is, ham/spam and relevant/nonrelevant are both simply special cases of good/bad; it's all the same.)
It's crucial to remember that spam classification targets the "bad" documents, while IR targets the "good" documents. In each case, though, we have a category of things-we-want-to-find, and another category of things-we-want-to-ignore. The process is the same, but the metrics are "backwards."
The terms the article uses to describe classification performance, Precision and Recall, correspond to accuracy and coverage, respectively. "Precision" can be thought of as 1-FPR, while "Recall" is 1-FNR.
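For readers without an IR background, here is a minimal Python sketch (made-up counts, not SURBL data) showing how the two vocabularies line up:

# Minimal sketch with invented counts: how Precision/Recall map onto the
# accuracy/coverage vocabulary used in this thread.

def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = how pure the listings are; Recall = how much spam we cover."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical tallies: spam URIs correctly listed, ham URIs wrongly listed,
# and spam URIs we failed to list.
p, r = precision_recall(true_positives=850, false_positives=5, false_negatives=150)

# In the terms above, FPR (the fraction of listings that hit ham) is
# 1 - precision, and FNR (the fraction of spam missed) is 1 - recall.
print("accuracy (Precision, 1-FPR): %.3f" % p)   # 0.994
print("coverage (Recall,    1-FNR): %.3f" % r)   # 0.850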
Because I suspect that many of your list readers won't be able to lay hands on the full text of the article, I've taken the article abstract and done some global search-and-replace operations, to replace the IR-specific vocabulary with a more general classification vocabulary. Selected portions of the _munged_ abstract follow:
Summary: The traditional measures of classification performance are coverage (completeness) and accuracy (purity). Empirical studies of classifier performance have shown a tendency for accuracy to decline as coverage increases. ...A trade-off between accuracy and coverage is entailed unless, as the total number of documents classified increases, classifier performance is equal to or better than overall classification performance thus far. **
...If coverage is modeled by a polynomial function of proportion of documents found, then accuracy is modeled by a lower order polynomial function of the same variable.
...Two-stage, or, more generally, multistage classification procedures, whereby a (larger document set) is used for subsequent, more detailed analysis, are likely to achieve the goal of improving both accuracy and coverage simultaneously, even though the trade-off between them cannot be avoided.
** My note: this is the key "take-away" point. The math doesn't lie: accuracy and coverage are *necessarily* inversely related unless classifier performance somehow manages to magically improve as a function of the number of items examined. And yet, performance for any single classifier/classification method is best conceived as a "constant." By implication, the only possible way to simultaneously achieve both accuracy and coverage is to adopt a "breadth-first" approach, where a larger (and inevitably "grayer") pool of candidate documents is subjected to a multistage classification regimen.
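To make the "breadth-first" idea concrete, here is a toy Python sketch of a two-stage grey/black workflow. The thresholds, field names, and example domains are invented purely for illustration; this is not how SURBL actually decides listings.

# Toy two-stage ("breadth-first") classifier: a permissive first stage builds a
# large, grey candidate pool (coverage), and a stricter second stage promotes
# only the clearly bad entries to black (accuracy). All numbers are invented.

GREY_MIN_REPORTS = 2     # loose threshold: admits many grey/uncertain domains
BLACK_MIN_REPORTS = 25   # strict threshold: high confidence before going black
BLACK_MAX_HAM_HITS = 0   # never promote anything ever seen in ham

def classify(stats):
    """Return 'black', 'grey', or 'unlisted' for one domain's counters."""
    if stats["spam_reports"] >= BLACK_MIN_REPORTS and stats["ham_hits"] <= BLACK_MAX_HAM_HITS:
        return "black"       # second stage: small, very accurate list
    if stats["spam_reports"] >= GREY_MIN_REPORTS:
        return "grey"        # first stage: broad, lower-accuracy pool
    return "unlisted"

candidates = {
    "pill-pusher.example":  {"spam_reports": 300, "ham_hits": 0},
    "maybe-shady.example":  {"spam_reports": 4,   "ham_hits": 0},
    "popular-host.example": {"spam_reports": 40,  "ham_hits": 12},
}

for domain, stats in candidates.items():
    print(domain, "->", classify(stats))
# pill-pusher.example -> black, maybe-shady.example -> grey,
# popular-host.example -> grey (ham hits keep it off the black list)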
and:
I was looking at the email I sent you earlier, and it occurs to me that something in it is "obvious" to me, but may not necessarily be obvious to others not as heavily steeped in automatic classification research...
The reason that the whole accuracy/coverage tradeoff is relevant to SURBL goes back to the notion that you're right up against the upper limit of coverage (1-FNR) for a given/predefined level of accuracy (1-FPR). The Buckland & Gey article is pertinent because it demonstrates that the only way to increase coverage for any given (single) classifier is _at the expense of accuracy_. Since accuracy (or its complement, FPR) is something you want to hold constant, coverage necessarily suffers.
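A small simulation (invented score distributions, nothing to do with real SURBL feeds) shows the effect: sweeping a single classifier's threshold downward buys coverage only by giving up accuracy.

import random

random.seed(1)

# Invented score distributions for a single classifier: spam tends to score
# high, ham lower, but the two overlap, so no threshold is perfect.
spam_scores = [random.gauss(0.75, 0.15) for _ in range(1000)]
ham_scores = [random.gauss(0.35, 0.15) for _ in range(1000)]

print("threshold  coverage (1-FNR)  accuracy (1-FPR)")
for threshold in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
    tp = sum(s >= threshold for s in spam_scores)    # spam correctly caught
    fp = sum(s >= threshold for s in ham_scores)     # ham wrongly caught
    coverage = tp / len(spam_scores)
    accuracy = tp / (tp + fp) if (tp + fp) else 1.0
    print("%9.1f  %16.3f  %16.3f" % (threshold, coverage, accuracy))
# As the threshold drops, coverage climbs toward 1.0 while accuracy falls:
# the trade-off Buckland & Gey describe for any single-stage classifier.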
Jeff C. -- "If it appears in hams, then don't list it."