[SURBL-Discuss] general questions.....

Jeff Chan jeffc at surbl.org
Tue Nov 23 02:06:54 CET 2004

On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
>>From: Jeff Chan [mailto:jeffc at surbl.org]

>>On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
>>> About 15% of the spams I get are not in SURBL, but are by 
>>the time I try to
>>> add :) 
>>Ask Terry Sullivan sometime what the theoretical maximum
>>detection rate of a collective spam classification system might
>>be.  He had some research showing it maxes out at around 85%.
>>So we're probably already pretty close to the theoretical
>>limits of this type of system.

> Me thinks I need to google for more data on this :)

Here is Terry's reference and some commentary.  I think
it fits in line with what we've seen.  Interestingly
it also sounds like he supports a greylist to capture
spam more broadly, then filter some of those down to
truly black for regular SURBL listing.

> On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
>>I mentioned on the SURBL discussion list that we may be
>>approaching theoretical limits and there was some interest
>>expressed in a reference.  Could I trouble you to dig one
>>up for us?  :-)
> Sure.  Here's the cite:
>      Buckland, M. and Gey, F. (1994). The trade-off 
>      between recall and precision. Journal of the 
>      American Society for Information Science, 45, 12-19.
> For those who are able to track down JASIS at a local university 
> library, it's important to keep several things in mind while harvesting 
> insight from this article:
> 1) The article is steeped in the vocabulary of IR (topical search),
>    not spam classification.  However, spam classification and IR 
>    are both just special-cases of binary document classification. 
>    (That is, ham/spam, or relevant/nonrelevant, are both simply 
>    special cases of good/bad; it's all the same.)
> 2) It's crucial to remember that spam classification targets the 
>    "bad" documents, while IR targets the "good" documents.  In each
>    case, though, we have a category of things-we-want-to-find, and
>    another category of things-we-want-to-ignore.  The process is
>    the same, but the metrics are "backwards."
> 3) The terms used in the article (Precision and Recall) to describe 
>    classification performance correspond to accuracy and coverage 
>    (respectively).  "Precision" can be thought of as 1-FPR, while 
>    "Recall" is 1-FNR.
> Because I suspect that many of your list readers won't be able to lay 
> hands on the full text of the article, I've taken the article abstract 
> and done some global search-and-replace operations, to replace the 
> IR-specific vocabulary with a more general classification vocabulary.  
> Selected portions of the _munged_ abstract follow:
>    Summary: The traditional measures of classification 
>    performance are coverage (completeness) and accuracy
>    (purity). Empirical studies of classifier performance 
>    have shown a tendency for accuracy to decline as 
>    coverage increases.  ...A trade-off between accuracy 
>    and coverage is entailed unless, as the total number 
>    of documents classified increases, classifier 
>    performance is equal to or better than overall 
>    classification performance thus far. **
>    ...If coverage is modeled by a polynomial function of 
>    proportion of documents found, then accuracy is modeled 
>    by a lower order polynomial function of the same variable.
>    ...Two-stage, or, more generally, multistage 
>    classification procedures, whereby a (larger document set) 
>    is used for subsequent, more detailed analysis, is likely 
>    to achieve the goal of improving both accuracy and coverage 
>    simultaneously, even though the trade-off between them 
>    cannot be avoided.
> ** My note: this is the key "take-away" point.  The math doesn't lie: 
> accuracy and coverage are *necessarily* inversely related unless 
> classifier performance somehow manages to magically improve as a 
> function of the number of items examined.  And yet, peformance for any 
> single classifer/classification method is best conceived as a 
> "constant."  By implication, the only possible way to simultaneously 
> achieve both accuracy and coverage is to adopt a "breadth-first" 
> approach, where a larger (and inevitably "grayer") pool of candidate 
> documents are subject to a multistage classification regimen.  


> I was looking at the email I sent you earlier, and it occurs to me that
> something in it is "obvious" to me, but may not necessarily be obvious 
> to others not as heavily steeped in automatic classification research...
> The reason that the whole accuracy/coverage tradeoff is relevant to 
> SURBL goes back to the notion that you're right up against the upper 
> limit of coverage (1-FNR) for a given/predefined level of accuracy 
> (1-FPR).  The Buckland & Gey article is pertinent because it 
> demonstrates that the only way to increase coverage for any given 
> (single) classifier is _at the expense of accuracy_.  Since accuracy (or 
> its inverse, FPR) is something you want to hold constant, coverage 
> necessarily suffers.

Jeff C.
"If it appears in hams, then don't list it."

More information about the Discuss mailing list