[SURBL-Discuss] general questions.....

Justin Mason jm at jmason.org
Tue Nov 23 02:25:14 CET 2004


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


good mail.   but I don't have the time to think much about it! If only
Henry Stern had finished his PhD we could get his thoughts ;)

Important to note that SURBL *can* increase its efficiency, by changing
its methods -- ie. adding more data sources, modifying the moderation
model, etc. can increase efficiency.

- --j.

Jeff Chan writes:
> On Friday, November 19, 2004, 6:39:31 AM, Chris Santerre wrote:
> >>From: Jeff Chan [mailto:jeffc at surbl.org]
> 
> >>On Thursday, November 18, 2004, 12:13:26 PM, Chris Santerre wrote:
> >>> About 15% of the spams I get are not in SURBL, but are by 
> >>the time I try to
> >>> add :) 
> >>
> >>Ask Terry Sullivan sometime what the theoretical maximum
> >>detection rate of a collective spam classification system might
> >>be.  He had some research showing it maxes out at around 85%.
> >>So we're probably already pretty close to the theoretical
> >>limits of this type of system.
> 
> > Me thinks I need to google for more data on this :)
> 
> Here is Terry's reference and some commentary.  I think
> it fits in line with what we've seen.  Interestingly
> it also sounds like he supports a greylist to capture
> spam more broadly, then filter some of those down to
> truly black for regular SURBL listing.
> 
> > On Sat, 20 Nov 2004 04:30:11 -0800, Jeff Chan wrote:
> > 
> >>I mentioned on the SURBL discussion list that we may be
> >>approaching theoretical limits and there was some interest
> >>expressed in a reference.  Could I trouble you to dig one
> >>up for us?  :-)
> > 
> > Sure.  Here's the cite:
> > 
> >      Buckland, M. and Gey, F. (1994). The trade-off 
> >      between recall and precision. Journal of the 
> >      American Society for Information Science, 45, 12-19.
> > 
> > For those who are able to track down JASIS at a local university 
> > library, it's important to keep several things in mind while harvesting 
> > insight from this article:
> > 
> > 1) The article is steeped in the vocabulary of IR (topical search),
> >    not spam classification.  However, spam classification and IR 
> >    are both just special-cases of binary document classification. 
> >    (That is, ham/spam, or relevant/nonrelevant, are both simply 
> >    special cases of good/bad; it's all the same.)
> > 
> > 2) It's crucial to remember that spam classification targets the 
> >    "bad" documents, while IR targets the "good" documents.  In each
> >    case, though, we have a category of things-we-want-to-find, and
> >    another category of things-we-want-to-ignore.  The process is
> >    the same, but the metrics are "backwards."
> > 
> > 3) The terms used in the article (Precision and Recall) to describe 
> >    classification performance correspond to accuracy and coverage 
> >    (respectively).  "Precision" can be thought of as 1-FPR, while 
> >    "Recall" is 1-FNR.
> > 
> > Because I suspect that many of your list readers won't be able to lay 
> > hands on the full text of the article, I've taken the article abstract 
> > and done some global search-and-replace operations, to replace the 
> > IR-specific vocabulary with a more general classification vocabulary.  
> > Selected portions of the _munged_ abstract follow:
> > 
> >    Summary: The traditional measures of classification 
> >    performance are coverage (completeness) and accuracy
> >    (purity). Empirical studies of classifier performance 
> >    have shown a tendency for accuracy to decline as 
> >    coverage increases.  ...A trade-off between accuracy 
> >    and coverage is entailed unless, as the total number 
> >    of documents classified increases, classifier 
> >    performance is equal to or better than overall 
> >    classification performance thus far. **
> > 
> >    ...If coverage is modeled by a polynomial function of 
> >    proportion of documents found, then accuracy is modeled 
> >    by a lower order polynomial function of the same variable.
> > 
> >    ...Two-stage, or, more generally, multistage 
> >    classification procedures, whereby a (larger document set) 
> >    is used for subsequent, more detailed analysis, is likely 
> >    to achieve the goal of improving both accuracy and coverage 
> >    simultaneously, even though the trade-off between them 
> >    cannot be avoided.
> > 
> > ** My note: this is the key "take-away" point.  The math doesn't lie: 
> > accuracy and coverage are *necessarily* inversely related unless 
> > classifier performance somehow manages to magically improve as a 
> > function of the number of items examined.  And yet, peformance for any 
> > single classifer/classification method is best conceived as a 
> > "constant."  By implication, the only possible way to simultaneously 
> > achieve both accuracy and coverage is to adopt a "breadth-first" 
> > approach, where a larger (and inevitably "grayer") pool of candidate 
> > documents are subject to a multistage classification regimen.  
> 
> and:
> 
> > I was looking at the email I sent you earlier, and it occurs to me that
> > something in it is "obvious" to me, but may not necessarily be obvious 
> > to others not as heavily steeped in automatic classification research...
> > 
> > The reason that the whole accuracy/coverage tradeoff is relevant to 
> > SURBL goes back to the notion that you're right up against the upper 
> > limit of coverage (1-FNR) for a given/predefined level of accuracy 
> > (1-FPR).  The Buckland & Gey article is pertinent because it 
> > demonstrates that the only way to increase coverage for any given 
> > (single) classifier is _at the expense of accuracy_.  Since accuracy (or 
> > its inverse, FPR) is something you want to hold constant, coverage 
> > necessarily suffers.
> 
> Jeff C.
> --
> "If it appears in hams, then don't list it."
> 
> _______________________________________________
> Discuss mailing list
> Discuss at lists.surbl.org
> http://lists.surbl.org/mailman/listinfo/discuss
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBopF6MJF5cimLx9ARAuTqAJsHAG8VsHg+qzahqMU1SuWVkLmN/wCfWGYE
iL2w76D0U5hpBQB9Y2W+qB4=
=jSoY
-----END PGP SIGNATURE-----



More information about the Discuss mailing list