[SURBL-Discuss] Setting SpamAssassin scores for SURBL lists

Ryan Thompson ryan at sasknow.com
Sun Sep 5 23:41:48 CEST 2004


Jeff Chan wrote to SURBL Discussion list and users at spamassassin.apache.org:

>>> Does anyone have other corpus stats to share, in particular
>>> FP rates?
>
> Thanks for sharing your data.

You're welcome.

>> HOWEVER... I decided to go through the ham hits (61 of them), and look
>> for false positive domains to submit.
>
> That kind of checking should become a policy.  For people who can
> do that kind of checking, they should do it every time.  Every
> tool we have for reducing FPs should be used.
>
> Letting FPs in just hurts the usefulness of the lists.

Agreed... some of these are really easy to catch.

> Thanks.  I agree those look like false positives and have
> whitelisted all of them across SURBLs.

Good. Thanks!

> Signing up for a newsletter then forgetting about it does not make a
> message spam.

;-) Worse yet, even *with* a carefully and correctly classified corpus
of *messages*, we all know that comes nowhere *near* guaranteeing a
correctly classified list of URIs. That's where spamtraps fall short,
and that's why we *need* hand-checking on every domain.

> Instead of having these go into SURBLs, they should be checked
> **before** they get added.  Hopefully they would be detected
> then and not get added to begin with.  Wouldn't that be better?
>
> Should hand-checking catch these as mostly legitimate?
>
> Are we hand-checking?  If not we should!

Speaking for myself, I hand check absolutely everything I submit. I've
spent at least half an hour digging up dirt on some single domains to
correctly classify them (though, in many cases, that time is now greatly
reduced thanks to GetURI), and, despite my best efforts, it's still
likely that I've misclassified a few that haven't been reported as FPs
yet.

But, yes, we really need to continue to look hard at sources and their
methods to make sure *every* submitter is doing the right thing. It
doesn't take many domains to seriously skew the FP rate, when we're
talking about hundredths of percentage points.

>>   OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>>     73335    54187    19148    0.739   0.00    0.00  (all messages)
>>   100.007  73.8897  26.1103    0.739   0.00    0.00
>>    60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL
>
>> Is that more like what you had in mind..? No, I'm not making that up.
>> :-)
>
> Looks good, but this corpus is perhaps too small to make
> representative measurements for emails in general.

Agreed. If any other SA users would like to send me their mass-check
spam.log and ham.log with SURBL tests, I'll gladly combine, analyze, and
post the hit frequencies.
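
For anyone who wants to try this on their own logs first, here is a
minimal sketch of the combining step. It is not the real SpamAssassin
hit-frequencies script. It assumes a simplified log layout: one message
per line, whitespace-separated, with the comma-separated rule names in
the fourth field and comment lines starting with "#". The function names
are hypothetical, and S/O is taken as SPAM% / (SPAM% + HAM%), which
agrees with the figures quoted in this thread.

```python
from collections import Counter

def rule_hits(log_lines):
    """Count messages and per-rule hits in one log (assumed layout)."""
    total = 0
    hits = Counter()
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a message line under the assumed layout
        total += 1
        tests = fields[3].split(",") if len(fields) > 3 else []
        # Ignore anything that looks like key=value metadata.
        hits.update({t for t in tests if t and "=" not in t})
    return total, hits

def hit_frequencies(spam_lines, ham_lines):
    """Return (name, overall%, spam%, ham%, S/O) rows, best spam% first."""
    n_spam, spam_hits = rule_hits(spam_lines)
    n_ham, ham_hits = rule_hits(ham_lines)
    rows = []
    for rule in set(spam_hits) | set(ham_hits):
        s, h = spam_hits[rule], ham_hits[rule]
        overall = 100.0 * (s + h) / (n_spam + n_ham)
        spam_pct = 100.0 * s / n_spam if n_spam else 0.0
        ham_pct = 100.0 * h / n_ham if n_ham else 0.0
        so = spam_pct / (spam_pct + ham_pct)  # at least one hit exists
        rows.append((rule, overall, spam_pct, ham_pct, so))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows
```

Logs from several users can be combined by concatenating their lines
before calling hit_frequencies(), provided the same rules were enabled
in every run.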

Here's my latest, without those whitelisted ones:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   73333    54186    19147    0.739   0.00    0.00  (all messages)
100.000  73.8903  26.1097    0.739   0.00    0.00  (all messages as %)
  62.906  85.1308   0.0104    1.000   1.00    1.00  URIBL_PJ_SURBL
  23.738  32.1245   0.0052    1.000   0.89    4.00  URIBL_SC_SURBL
  66.122  89.4327   0.1515    0.998   0.82    3.00  URIBL_WS_SURBL
  21.525  29.1293   0.0052    1.000   0.76    5.00  URIBL_AB_SURBL
  56.618  76.6194   0.0157    1.000   0.71    4.00  URIBL_OB_SURBL
   0.001   0.0018   0.0000    1.000   0.64    2.00  URIBL_PH_SURBL


BUT... If I exclude the messages with domains from today's whitelist:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   73310    54186    19124    0.739   0.00    0.00  (all messages)
100.000  73.9135  26.0865    0.739   0.00    0.00  (all messages as %)
  66.104  89.4327   0.0052    1.000   1.00    3.00  URIBL_WS_SURBL
  62.926  85.1308   0.0105    1.000   0.74    1.00  URIBL_PJ_SURBL
  23.746  32.1245   0.0052    1.000   0.67    4.00  URIBL_SC_SURBL
  21.532  29.1293   0.0052    1.000   0.57    5.00  URIBL_AB_SURBL
   0.001   0.0018   0.0000    1.000   0.50    2.00  URIBL_PH_SURBL
  56.636  76.6194   0.0157    1.000   0.48    4.00  URIBL_OB_SURBL
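
To make the scale concrete: the HAM% column is just ham hits divided by
ham messages, so the counts behind the two tables above can be
recovered with a little arithmetic. The sketch below uses only the
figures already posted; the helper names are hypothetical.

```python
def ham_hits(ham_pct, ham_total):
    """Approximate number of ham messages behind a HAM% figure."""
    return round(ham_pct / 100.0 * ham_total)

def ham_pct(hits, ham_total):
    """HAM% as reported in the tables, to four decimal places."""
    return round(100.0 * hits / ham_total, 4)

# URIBL_WS_SURBL before and after excluding the whitelisted domains.
before = ham_hits(0.1515, 19147)   # 29 ham messages
after = ham_hits(0.0052, 19124)    # 1 ham message

print(before, after)
print(ham_pct(29, 19147), ham_pct(1, 19124))   # 0.1515 0.0052
```

On a ham corpus of about 19,000 messages, each single ham hit is worth
roughly 0.005 percentage points, which is why a handful of
misclassified domains is enough to move a list's FP rate noticeably.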

I also found more to whitelist, but I'm working on a larger ham corpus
for those. Details to follow...

> That said, any reduction in FPs is important and welcome.

So why don't we hold our first 12-hour SURBL FP-a-thon?

- Ryan

-- 
   Ryan Thompson <ryan at sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America

