[SURBL-Discuss] Setting SpamAssassin scores for SURBL lists

Jeff Chan jeffc at surbl.org
Sun Sep 5 22:50:19 CEST 2004


On Sunday, September 5, 2004, 10:32:57 AM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

>> Basically the higher the FP rate, the less useful a list is.

> ... or, rather, the lower it ought to be scored.

Yes, but please remember that not everyone has the ability to
"score" their SURBL hits.  Not everyone using SURBLs is using
SpamAssassin.

>> Does anyone have other corpus stats to share, in particular
>> FP rates?

Thanks for sharing your data.  I know this can be a somewhat
painful subject for people, but it's very important to clean
up the false positives and make the lists better and more useful.

> Sure. All of these messages were received in the past 10 days. A lot has
> happened since June. :-)

> WS: 44004/54185s, 61/19150s

>   OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>     73335    54185    19150    0.739   0.00    0.00  (all messages)
>   100.000  73.8870  26.1130    0.739   0.00    0.00  (all messages as %)
>    60.087  81.2107   0.0836    0.999   0.00    0.00  WS_SURBL
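For readers who want to derive figures like these from raw hit counts, here is a minimal sketch. The column definitions are assumptions: OVERALL% is taken as hits over all messages, SPAM% and HAM% as hits within each class, and S/O as spam hits over total hits. The tool that produced the table may normalize or round differently, so the output need not match every quoted column. The function name `hit_stats` is hypothetical.

```python
def hit_stats(spam_hits, spam_total, ham_hits, ham_total):
    """Return (overall %, spam %, ham %, S/O) for one rule.

    Assumed definitions (not taken from any particular tool):
      overall % = all hits / all messages
      spam %    = spam hits / spam messages
      ham %     = ham hits / ham messages
      S/O       = spam hits / (spam hits + ham hits)
    """
    overall_pct = 100.0 * (spam_hits + ham_hits) / (spam_total + ham_total)
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    total_hits = spam_hits + ham_hits
    # Guard against a rule that never fired.
    so = spam_hits / total_hits if total_hits else 0.0
    return overall_pct, spam_pct, ham_pct, so


# Counts from the "WS:" summary line quoted above.
overall, spam_pct, ham_pct, so = hit_stats(44004, 54185, 61, 19150)
print(f"{overall:8.3f} {spam_pct:8.4f} {ham_pct:8.4f} {so:7.3f}  WS_SURBL")
```

A list with few ham hits pushes S/O toward 1.0, which is why removing confirmed FPs shows up directly in that column.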

> HOWEVER... I decided to go through the ham hits (61 of them), and look
> for false positive domains to submit.

That kind of checking should become a policy.  People who can
check their ham hits should do it every time.  Every tool we
have for reducing FPs should be used.

Letting FPs in just hurts the usefulness of the lists.

> I found several, but, for the most
> part, they've *already* been cleaned up and are no longer listed in WS.
> (30 out of the 61 were in a massive mailing list thread for a single
> domain that has since been whitelisted).

> And, in that 19K ham corpus, I found the following FPs still listed
> in WS:

> buckeye-express.com   -- Used in a personal email address, looks legit;
>                          7 examples
> nm.ru                 -- Used in a personal email address, looks legit
> advanstar.com         -- Legit uses; found in a well-known dental
>                          newsletter; also personal email address of
>                          one of the editors; 3 messages
> 00fun.com             -- Confirmed, more than one user on our system
>                          sent or received eCards from them
> northstarconferences.com -- Legit conference host site subscribed to
>                          by two users; 9 messages in this corpus
> mardox.com            -- Search engine; registered 1875 days ago, and
>                          *looks* like the user did actually submit
>                          their site to them.
> postsnet.com          -- Registered exactly one year ago, 51 NANAS,
>                          blank home page, ehh... but I have 4
>                          different legit newsletters with links to
>                          them.
> webspawner.com        -- Created in 1996; free host/email
> npdor.com             -- Surveys; been around since 1999. 103 NANAS,
>                          but they've been advertised by some reputable
>                          "word of the day" mailers (dictionary.com)
>                          Maybe a good candidate for UC. :-) 2
>                          examples
> imninc.com            -- Domain is 507 days old; they do newsletters.
>                          At least one of them is legit. :-)
> worldhealth.net       -- It's 3468 days old today (1995). One of our
>                          users attended a conference of theirs, and
>                          signed up for a newsletter.
> hoteldiscounts.com    -- 2459 days old (1997), found in actual room
>                          booking confirmations for Comfort Inn.

Thanks.  I agree those look like false positives and have
whitelisted all of them across SURBLs.  Signing up for a
newsletter and then forgetting about it does not make a
message spam.

Instead of having these go into SURBLs, they should be checked
**before** they get added.  Hopefully they would be detected
then and not get added to begin with.  Wouldn't that be better?

Would hand-checking have caught these as mostly legitimate?

Are we hand-checking?  If not we should!
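
As an illustration of checking candidates before they are added, here is a minimal sketch of a pre-listing filter. Everything in it is hypothetical: the function name `needs_hand_check`, the one-year age threshold, and the idea of keeping a set of domains already seen in ham are assumptions for illustration, not a description of how any SURBL data source actually works. It only decides whether a candidate should be held for a human to look at.

```python
def needs_hand_check(domain, age_days, ham_domains, min_age_days=365):
    """Return True if a candidate domain should be held for manual review.

    Hypothetical heuristics, loosely based on the evidence cited in
    this thread (domain age, legitimate uses seen in ham):
      - the domain has already been seen in known-ham mail, or
      - the domain has been registered for at least min_age_days.
    age_days may be None when the registration date is unknown.
    """
    normalized = domain.lower().rstrip(".")
    if normalized in ham_domains:
        return True
    return age_days is not None and age_days >= min_age_days


# Example: a domain registered years ago is held back for review,
# while a days-old domain never seen in ham is not.
seen_in_ham = {"example.com"}
print(needs_hand_check("worldhealth.net", 3468, seen_in_ham))
print(needs_hand_check("example.invalid", 10, seen_in_ham))
```

A filter like this does not whitelist anything by itself; it only routes borderline candidates to a person instead of straight onto a list.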

> (I'll re-post these in another thread, just so everybody sees them).

> AND, I found 2 spams that were incorrectly hand-classified as ham.

> So, if I take those out, the numbers look more like:

> WS: 44006/54187s, 0/19148s

>   OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>     73335    54187    19148    0.739   0.00    0.00  (all messages)
>   100.000  73.8897  26.1103    0.739   0.00    0.00  (all messages as %)
>    60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL

> Is that more like what you had in mind..? No, I'm not making that up.
> :-)

Looks good, but this corpus is perhaps too small to make
representative measurements for emails in general.  That
said, any reduction in FPs is important and welcome.

> Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
> hand-check.

> - Ryan

Thanks for your stats and checking.  And yes, anyone else
with ham corpora, please check them for FPs.
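
For anyone who wants to automate the first pass of that check, here is a minimal sketch that picks out messages whose SpamAssassin status header lists WS_SURBL. It assumes the messages were already scanned and carry an `X-Spam-Status` header with a `tests=` list, and that the rule is named `WS_SURBL` as in the stats above. The helper names `rules_hit` and `messages_hitting` are hypothetical.

```python
import email
import re

# Capture everything after "tests=" up to the next " key=" field or the
# end of the header. re.S lets the match span folded header lines.
TESTS_RE = re.compile(r"tests=(.*?)(?=\s+\w+=|$)", re.S)


def rules_hit(status_header):
    """Return the set of rule names in an X-Spam-Status header value."""
    match = TESTS_RE.search(status_header or "")
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def messages_hitting(messages, rule="WS_SURBL"):
    """Return the messages whose status header lists the given rule."""
    return [m for m in messages if rule in rules_hit(m.get("X-Spam-Status"))]


# Tiny demonstration with two hand-written messages.
raw_hit = (
    "Subject: newsletter\n"
    "X-Spam-Status: No, hits=1.1 required=5.0 tests=HTML_MESSAGE,\n"
    "\tWS_SURBL autolearn=no version=2.64\n"
    "\n"
    "body\n"
)
raw_miss = (
    "Subject: hello\n"
    "X-Spam-Status: No, hits=0.0 required=5.0 tests=HTML_MESSAGE autolearn=no\n"
    "\n"
    "body\n"
)
sample = [email.message_from_string(raw_hit), email.message_from_string(raw_miss)]
for msg in messages_hitting(sample):
    print(msg["Subject"])
```

With a real ham corpus in mbox format, the standard library's `mailbox.mbox(path)` yields message objects that can be passed to `messages_hitting` in the same way. The domains in each hit still need a hand-check; the script only narrows down which messages to read.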

Jeff C.


