Hi all,
I added some experimental code to GetURI to automatically determine the
age of a domain, which works for about 99.2% of the domains I've seen in
SURBL, and, with a little bit of Gaussian math, the results are fricken'
*amazing* for classification!
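To give a rough idea of what's involved (a simplified sketch, not the
actual GetURI code, and assuming the system whois client and a registry
that prints a "Creation Date:" line -- date formats vary by registry):

  use strict;
  use warnings;
  use Time::Local qw(timegm);

  # Return the domain's age in days, or undef if no date was found.
  sub domain_age_days {
      my ($domain) = @_;
      return undef unless $domain =~ /^[a-z0-9.-]+$/i;  # don't shell out junk
      my $whois = `whois $domain`;
      return undef
          unless $whois =~ /Creation Date:\s*(\d{4})-(\d{2})-(\d{2})/i;
      my ($y, $m, $d) = ($1, $2, $3);
      my $created = timegm(0, 0, 0, $d, $m - 1, $y);
      return int((time - $created) / 86400);
  }

  my $age = domain_age_days('example.com');
  print defined $age ? "example.com: $age days old\n"
                     : "no creation date found\n";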
Unfortunately, I need a way to be able to do this without violating
registry whois terms of service, because they don't allow automated
queries, "except as reasonably required to register and update domains"
or somesuch...
Any ideas?
- Ryan
--
Ryan Thompson <ryan(a)sasknow.com>
SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America
In order to reduce false positives in the SURBL data, we would
like to have access to ham corpora. Does anyone know of any
public ham corpora, including just the URI domain names from the
hams? Or is there anyone who would be willing to run our URI
domain lists against their ham?
Does anyone know if messages from the Enron corpus have been
categorized for ham and spam?
http://www-2.cs.cmu.edu/~enron/
Thanks in advance for any suggestions, comments, thoughts....
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
I've whitelisted x10.com. They are a frequent FP since they
"advertise" so much to their own customers.
I'm surprised it was not whitelisted earlier. x10.com is
probably a significant contributor to the ham scores.
Jeff C.
Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the
SpamAssassin 3.0 perceptron-generated scores as a possible guide:
> http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
>
> # The following block of scores were generated using the mass-checking
> # scripts, and a perceptron to determine the optimum scores which
> # resulted in minimum false positives or negatives. The scores are
> # weighted to produce roughly 1 false positive in 2500 non-spam messages
> # using the default threshold of 5.0.
> score URIBL_AB_SURBL 0 2.007 0 0.417
> score URIBL_OB_SURBL 0 1.996 0 3.213
> score URIBL_PH_SURBL 0 0.839 0 2.000
> score URIBL_SC_SURBL 0 3.897 0 4.263
> score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant,
and Theo Van Dinter cited the documentation:
> $ perldoc Mail::SpamAssassin::Conf
> [...]
> If four valid scores are listed, then the score that is used
> depends on how SpamAssassin is being used. The first score is used
> when both Bayes and network tests are disabled (score set 0). The
> second score is used when Bayes is disabled, but network tests are
> enabled (score set 1). The third score is used when Bayes is
> enabled and network tests are disabled (score set 2). The fourth
> score is used when Bayes is enabled and network tests are enabled
> (score set 3).
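In other words, the score set index is just two flags combined. A toy
illustration (not SpamAssassin's actual internals):

  use strict;
  use warnings;

  # bit 0 = network tests enabled, bit 1 = Bayes enabled
  my ($net, $bayes) = (1, 1);          # both on => score set 3
  my @scores = (0, 2.007, 0, 0.417);   # URIBL_AB_SURBL from 50_scores.cf
  my $set = ($net ? 1 : 0) | ($bayes ? 2 : 0);
  print "score set $set uses score $scores[$set]\n";  # set 3, 0.417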
We wondered if we could somehow use those scores with SpamCopURI,
but were unable to come up with a good answer.
Theo suggested looking at spam versus ham rates as a good way to
set scores, and I mentioned:
> We have these test results from Justin from 25 June:
>
> OVERALL% SPAM% HAM% S/O RANK SCORE NAME
> 121405 22516 98889 0.185 0.00 0.00 (all messages)
> 100.000 18.5462 81.4538 0.185 0.00 0.00 (all messages as %)
> 13.453 70.3766 0.4925 0.993 1.00 1.00 SURBL_WS
> 3.807 20.3811 0.0334 0.998 0.50 1.00 SURBL_SC
> 2.650 14.2565 0.0071 1.000 0.50 1.00 SURBL_AB
> 0.019 0.0933 0.0020 0.979 0.50 1.00 SURBL_PH
> 12.624 67.6275 0.1001 0.999 0.50 1.00 SURBL_OB
>
> which shows a pretty high FP rate for WS, less for the others.
> Do you happen to have access to any more recent corpus check data
> like this? Could be useful to have another snapshot for a more
> complete picture.
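As an aside for anyone decoding the columns: S/O appears to be the spam
hit rate divided by the total hit rate, which we can check against the
WS line above:

  use strict;
  use warnings;

  # SURBL_WS from Justin's stats: spam% 70.3766, ham% 0.4925
  my ($spam_pct, $ham_pct) = (70.3766, 0.4925);
  printf "S/O = %.3f\n", $spam_pct / ($spam_pct + $ham_pct);  # 0.993

That reproduces the 0.993 in the table.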
That was followed up with more data and discussion:
> On Saturday, September 4, 2004, 10:13:11 PM, Theo Dinter wrote:
>> high spam + low ham is good from an FP standpoint, but having a "significant"
>> (for your definition thereof) ham hitrate means the score shouldn't be too
>> high. My handwaving scores would be something like:
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
>> WS 1.2
>> SC 2.5
>> AB 3.5
>> OB 1.8
Theo then gave some of his own stats on a couple different corpora:
>> OVERALL% SPAM% HAM% S/O RANK SCORE NAME
>> 416072 365031 51041 0.877 0.00 0.00 (all messages)
>> 100.000 87.7327 12.2673 0.877 0.00 0.00 (all messages as %)
>> set1 30.923 35.2466 0.0000 1.000 0.99 0.00 URIBL_SC_SURBL
>> set1 72.231 82.3273 0.0274 1.000 0.98 1.00 URIBL_OB_SURBL
>> set1 19.375 22.0847 0.0000 1.000 0.98 1.00 URIBL_AB_SURBL
>> set1 74.883 85.2939 0.4310 0.995 0.74 0.00 URIBL_WS_SURBL
>> set1 0.001 0.0000 0.0059 0.000 0.48 0.00 URIBL_PH_SURBL
>
>> OVERALL% SPAM% HAM% S/O RANK SCORE NAME
>> 119215 67094 52121 0.563 0.00 0.00 (all messages)
>> 100.000 56.2798 43.7202 0.563 0.00 0.00 (all messages as %)
>> set3 39.217 69.6605 0.0288 1.000 0.98 1.00 URIBL_OB_SURBL
>> set3 10.340 18.3727 0.0000 1.000 0.97 0.00 URIBL_SC_SURBL
>> set3 5.998 10.6582 0.0000 1.000 0.94 1.00 URIBL_AB_SURBL
>> set3 42.730 75.5522 0.4797 0.994 0.73 0.00 URIBL_WS_SURBL
>> set3 0.008 0.0089 0.0058 0.608 0.49 0.00 URIBL_PH_SURBL
>
>> so for these results, I'd probably do something like:
>
>> WS 1.3
>> SC 4.0
>> AB 3.0
>> OB 2.2
>
>> since the hit rates and S/O are a bit higher for me, related to the fact I ran
>> more spam through than Justin did.
To which I added:
> Those final scores look like an excellent fit to the data to me.
and:
> Also while the PH spam hit rate [from Justin's stats] is low,
> the data is of hand checked phishing scams, which deserve to be
> blocked due to their potential danger and damage.
>
> Therefore I would tend to give PH a medium-high score like
> 3 to 5.
So we'll probably adjust the default scores on SpamCopURI
to something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5
and we recommend SpamCopURI users do likewise. Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
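For anyone setting these by hand, and assuming the SA 3.0 rule names
from the 50_scores.cf excerpt above (SpamCopURI setups may name the
rules differently), the equivalent local.cf lines would be:

  score URIBL_WS_SURBL 1.3
  score URIBL_SC_SURBL 4.0
  score URIBL_AB_SURBL 3.0
  score URIBL_OB_SURBL 2.2
  score URIBL_PH_SURBL 4.5

A single score on a line applies across all four score sets.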
One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora. That FP rate needs to be reduced for WS
to be more fully useful.
I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further. If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%. The other
lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Does anyone have other corpus stats to share, in particular
FP rates?
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
This is a forwarded message
From: Theo Van Dinter <felicity(a)kluge.net>
To: SURBL Discussion list <discuss(a)lists.surbl.org>, SpamAssassin Developers <spamassassin-dev(a)incubator.apache.org>
Date: Saturday, September 4, 2004, 10:36:53 AM
Subject: [SURBL-Discuss] checking plain domains in message bodies against SURBLs reportedly effective
===8<==============Original message text===============
On Sat, Sep 04, 2004 at 10:45:44AM -0600, Ryan Thompson wrote:
> Yep. Good idea, overall. There are a few gotchas:
>
TLD extensions sometimes map to file extensions. We might have to whitelist
> command.com, and the entire country of Poland. :-)
>
> Since the domain is in plain text and doesn't contain a protocol or
> subdomain (i.e., 'www'), I haven't yet seen a mail client that will
> display it as a clickable URL.
This is generally the tack we're taking in SpamAssassin -- if a general
MUA doesn't display it as a link, then we don't consider it a URL.
Another issue for the generic domains thing is performance -- lots of
messages have lots of things that could potentially look like a domain,
and querying for them all adds a bit of a load on the client and the
server.
For instance: /\b([a-zA-Z0-9_.-]{1,256}\.[a-zA-Z]{2,6})\b/
in theory (I haven't tested it) will grab anything that looks like a
generic domain name in text. If you check that list against a list of
valid TLDs, you'd probably end up with a decent list, but you'd hit the
issue quoted above, where it isn't clear whether "Go take a look at
command.com" refers to a URL or a filename.
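An untested sketch of that two-step approach, with an extra capture
added for the TLD (the TLD list here is obviously abbreviated):

  use strict;
  use warnings;

  my %valid_tld = map { $_ => 1 } qw(com net org info biz de pl);
  my $body = 'Go take a look at command.com or visit www.example.pl';

  my %candidates;
  while ($body =~ /\b([a-zA-Z0-9_.-]{1,256}\.([a-zA-Z]{2,6}))\b/g) {
      my ($domain, $tld) = (lc $1, lc $2);
      $candidates{$domain}++ if $valid_tld{$tld};
  }
  print "$_\n" for sort keys %candidates;
  # Still prints command.com -- a filename, not an intended URL.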
--
Randomly Generated Tagline:
"Brevity is the soul of lingerie." - Dorothy Parker
===8<===========End of original message text===========
Randy Brukardt of rrsoftware.com mentioned that checking
plain domains occurring in message bodies against SURBLs
was pretty productive. (E.g., look for domain.com in
addition to www.domain.com or http://www.domain.com).
Perhaps this would be something interesting to try experimentally,
or at least to think about.
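For anyone who wants to try it, a bare-bones SURBL lookup with Net::DNS
might look like this. The default domain below is, if I recall
correctly, SURBL's permanent test entry; the input should already be
reduced to its registrar-level form (domain.com, not www.domain.com):

  use strict;
  use warnings;
  use Net::DNS;

  my $domain = shift || 'surbl-org-permanent-test-point.com';
  my $res    = Net::DNS::Resolver->new;
  my $reply  = $res->query("$domain.multi.surbl.org", 'A');

  if ($reply) {
      printf "%s listed: %s\n", $domain, $_->address
          for grep { $_->type eq 'A' } $reply->answer;
  } else {
      print "$domain not listed\n";
  }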
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Today I processed a pile of 11,000 spams donated by Raymond, as I'm
gradually gearing up for large-scale spamtrap processing.
I ended up with many obvious spam domains, but an even bigger pile of
domains that aren't quite as suspicious. Many are from last year, and
their name servers are not listed by the SBL. I don't want to add any
grey domains to a black pile.
While investigating many of them to determine if they deserved listing, I
repeatedly came across the term "safelist". These seem to be some kind of
opt-in list. What exactly is the term supposed to mean?
Joe
--
http://www.joewein.de/sw/jwSpamSpy/
On Friday, September 3, 2004, 3:10:29 PM, Raymond Dijkxhoorn wrote:
> Currently we are running with a somewhat frozen ws.surbl.org list. We are
> experiencing hardware trouble with one of the SURBL machines. New updates
> will be processed, but most likely activated only after we restore full
> functionality.
> The main SURBL site is not affected; it's only the WS updates that are
> involved.
> We are working hard to get the processing box back online.
We made a temporary workaround to get updates from Raymond and
others directly into ws.surbl.org until the other server comes
back. Once the other server is working again we will undo that
workaround. So ws is now getting updated with at least some of
the new data.
Jeff C.