In order to reduce false positives in the SURBL data, we would like to have access to ham corpora. Does anyone know of any public ham corpora, or even just the URI domain names extracted from the hams? Or is there anyone who would be willing to run our URI domain lists against their ham?
Does anyone know if messages from the Enron corpus have been categorized for ham and spam?
http://www-2.cs.cmu.edu/~enron/
Thanks in advance for any suggestions, comments, thoughts....
Jeff C.
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:
> In order to reduce false positives in the SURBL data, we would like to have access to ham corpora. Does anyone know of any public ham corpora, or even just the URI domain names extracted from the hams? Or is there anyone who would be willing to run our URI domain lists against their ham?
> Does anyone know if messages from the Enron corpus have been categorized for ham and spam?
> http://www-2.cs.cmu.edu/~enron/
> Thanks in advance for any suggestions, comments, thoughts....
FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
Now, I know not everybody runs SpamAssassin, but it *does* have a really easy log format and hit-frequencies program. It's possible to concatenate ham and spam logs from different sources to effectively get statistics on a larger corpus... and only the test hits are stored in the log, so the results are effectively anonymous.
There's ham.log for ham, and spam.log for spam, and the entries look like this, one line per message:
Y 7 /spamdir/11710. URIBL_OB_SURBL,URIBL_WS_SURBL time=1089946124
Rather than re-invent the wheel, you can have your checkers output simplified mass-check logs. The only column that matters is the tests column. Something like this should work well enough for hit-frequencies:
N 0 <any_string> URIBL_TESTS_HIT,COMMA_DELIMITED time=<any_integer>
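Generating those lines is straightforward in any language. A minimal sketch in Python (the message id and rule names below are placeholders for illustration, not real corpus data):

```python
# Emit simplified mass-check log lines that hit-frequencies can read.
# `results` maps a message id to the list of URIBL rules it hit; both
# the id and the rules here are placeholder values.
def masscheck_line(is_spam, msg_id, rules, score=0):
    flag = "Y" if is_spam else "N"
    return "%s %d %s %s time=0" % (flag, score, msg_id, ",".join(rules))

results = {"msg0001": ["URIBL_OB_SURBL", "URIBL_WS_SURBL"]}
for msg_id, rules in results.items():
    print(masscheck_line(False, msg_id, rules))   # append to ham.log
```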
Then, grab hit-frequencies from the SA distribution and you can reproduce the output that others have been posting.
If you *do* have SA installed (even if you don't filter your mail with it), it's even easier. Just set up a simple .cf file with the URIBL rules (I'll provide one on request), and invoke mass-check in the tools directory like so:
./mass-check -p=../rules -c=../rules --net -j=20 --progress \
    spam:dir:${SPAMDIR} ham:dir:${HAMDIR}
Then run:
./hit-frequencies -s 3 -p
It's almost worth extracting Mail-SpamAssassin from CPAN just to gain that functionality. You don't even have to *use* SA. :-)
- Ryan
On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few, that could save much processing over running full SA on every message.
Yes, it doesn't get the full stats, and yes, it could miscategorize a few, but the hits are so few that it could be usable. On the other hand, because the hits *are* few, missing a few may be a bigger deal.
Might be interesting to try it both ways and see if the results differ much.
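A rough sketch of that shortcut in Python (the one-message-per-file corpus layout and the `surbl_listed` set are assumptions for illustration):

```python
import os
import re

# Rough sketch of the shortcut: pull URI domains out of a directory of
# ham messages (one message per file), dedupe via a set, then intersect
# with the set of domains listed in the SURBL under test.
DOMAIN_RE = re.compile(r'https?://([A-Za-z0-9.-]+)', re.I)

def ham_domains(ham_dir):
    domains = set()
    for name in os.listdir(ham_dir):
        with open(os.path.join(ham_dir, name), errors="replace") as fh:
            for match in DOMAIN_RE.finditer(fh.read()):
                domains.add(match.group(1).lower().rstrip("."))
    return domains

def candidate_fps(ham_dir, surbl_listed):
    # Any ham domain that is also on the list is a candidate FP.
    return sorted(ham_domains(ham_dir) & surbl_listed)
```

Each hit can then be traced back to its source message for a human check.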
Jeff C.
Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
> One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few that could save much processing over using full SA on every message.
Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
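Once both corpora are reduced to domain sets, the cross-check itself is just a set intersection; a sketch (the two sets are made-up examples, GetURI would supply the real ones):

```python
# Sketch of the cross-check: domains appearing in both the spam corpus
# and the ham corpus are the ones worth a human look as possible FPs.
# Both sets are made-up example data.
spam_domains = {"spamsite.example", "shared.example"}
ham_domains = {"friendly.example", "shared.example"}

possible_fps = sorted(spam_domains & ham_domains)
print(possible_fps)   # ['shared.example']
```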
Yeah. Next version.
;-)
- Ryan
On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
>> One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few that could save much processing over using full SA on every message.
> Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
> Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
But we should be able to use it singly against a directory of ham messages, right? The only difference is that the output would be ham domains and not spam domains.... We'd then compare that ham list to a SURBL and find the FPs....
I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
http://spamassassin.apache.org/publiccorpus/
Jeff C.
Hi!
> I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
The only 'problem' with this test set is that it's all spam from 2003-02-28, and most likely most of the domains used recently are newer ;) But perhaps it can help in some way to determine more FPs.
Bye, Raymond.
On Sunday, September 5, 2004, 4:31:17 PM, Raymond Dijkxhoorn wrote:
>> I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
> The only 'problem' with this test set is that it's all spam from 2003-02-28, and most likely most of the domains used recently are newer ;) But perhaps it can help in some way to determine more FPs.
> Bye, Raymond.
Aha, but I'm not too interested in their spams. I'm interested in their hams for use in FP detection. Hams probably don't change as rapidly as spams......
Jeff C.
Jeff Chan wrote to SURBL Discuss:
> Aha, but I'm not too interested in their spams. I'm interested in their hams for use in FP detection. Hams probably don't change as rapidly as spams......
I'm running a mass-check against the entire 2003 public corpus (about 6000 messages total). I'll post the results once it's done and I've collected easy groupings.
- Ryan
On Sunday, September 5, 2004, 5:31:54 PM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discuss:
>> Aha, but I'm not too interested in their spams. I'm interested in their hams for use in FP detection. Hams probably don't change as rapidly as spams......
> I'm running a mass-check against the entire 2003 public corpus (about 6000 messages total). I'll post the results once it's done and I've collected easy groupings.
> - Ryan
Thanks!! :-)
Jeff C.
Ryan Thompson wrote to Raymond Dijkxhoorn and SURBL Discussion list:
> Jeff Chan wrote to SURBL Discuss:
>> Aha, but I'm not too interested in their spams. I'm interested in their hams for use in FP detection. Hams probably don't change as rapidly as spams......
> I'm running a mass-check against the entire 2003 public corpus (about 6000 messages total). I'll post the results once it's done and I've collected easy groupings.
OK. Here it is:
OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
    6048     1898     4150    0.314   0.00    0.00  (all messages)
 100.000  31.3823  68.6177    0.314   0.00    0.00  (all messages as %)
   1.091   3.4773   0.0000    1.000   1.00    4.00  URIBL_OB_SURBL
   5.258   9.0622   3.5181    0.720   0.29    3.00  URIBL_WS_SURBL
   0.000   0.0000   0.0000    0.500   0.14    5.00  URIBL_AB_SURBL
   0.000   0.0000   0.0000    0.500   0.14    2.00  URIBL_PH_SURBL
   0.000   0.0000   0.0000    0.500   0.14    4.00  URIBL_SC_SURBL
   0.265   0.3688   0.2169    0.630   0.00    1.00  URIBL_PJ_SURBL
I don't have time to go through the results right now, but feel free:
Ham that hit any URIBL rule: http://ry.ca/geturi/pc-ham-uribl.log (14K)
Full ham log: http://ry.ca/geturi/pc-ham.log (340K)
Full spam log: http://ry.ca/geturi/pc-spam.log (159K)
Really, the first one is the interesting one, but the full logs might be interesting if you want to do your own frequency comparisons.
What you want to do is go through pc-ham-uribl.log, and check each message mentioned in the log in the SA public corpus to see if you have any FP candidates or not.
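That walk-through could be scripted; a sketch that groups the log entries by rule so each flagged message can be pulled from the corpus and eyeballed (the log format is the mass-check one shown earlier in the thread):

```python
# Sketch: group ham messages in a mass-check log by the URIBL rules they
# hit, so each message can be inspected in the corpus as a possible FP.
def fp_candidates(log_lines):
    hits = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        msg_path, rules = fields[2], fields[3].split(",")
        for rule in rules:
            hits.setdefault(rule, []).append(msg_path)
    return hits

# In practice: fp_candidates(open("pc-ham-uribl.log"))
sample = ["N 3 /ham/00042 URIBL_WS_SURBL,URIBL_PJ_SURBL time=1089946124"]
print(fp_candidates(sample))
```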
- Ryan
Hi Ryan,
>    1.091   3.4773   0.0000    1.000   1.00    4.00  URIBL_OB_SURBL
>    5.258   9.0622   3.5181    0.720   0.29    3.00  URIBL_WS_SURBL
>    0.000   0.0000   0.0000    0.500   0.14    5.00  URIBL_AB_SURBL
>    0.000   0.0000   0.0000    0.500   0.14    2.00  URIBL_PH_SURBL
>    0.000   0.0000   0.0000    0.500   0.14    4.00  URIBL_SC_SURBL
>    0.265   0.3688   0.2169    0.630   0.00    1.00  URIBL_PJ_SURBL
> I don't have time to go through the results right now, but feel free:
> Ham that hit any URIBL rule: http://ry.ca/geturi/pc-ham-uribl.log (14K)
> Full ham log: http://ry.ca/geturi/pc-ham.log (340K)
> Full spam log: http://ry.ca/geturi/pc-spam.log (159K)
There were 9 'hits' on the PJ list, and all 9 were from the exact same domain (partner2profit.com). I have whitelisted that one now; it was in WS. Besides that one, not a single FP in that set for PJ. Next! :)
> What you want to do is go through pc-ham-uribl.log, and check each message mentioned in the log in the SA public corpus to see if you have any FP candidates or not.
If someone has a couple of minutes, please lookup the ones that should be removed from WS.
Thanks! Raymond.
Jeff Chan wrote to SURBL Discussion list:
> On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
>> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
>>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
>>> One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few that could save much processing over using full SA on every message.
>> Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
>> Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
> But we should be able to use it singly against a directory of ham messages, right?
Well, yes, but GetURI excludes any domains that are listed in multi, so I'd almost need to throw in a switch to flip the logic. It's a good idea, but I have a friend's graduation party to attend, so it isn't going to happen tonight. :-)
That, and the sorting would be all back-asswards.
I have many good plans (and code) for GetURI 1.5 that'll make all of this easy. I really want to figure out a better way to get the domain age, though!
> The only difference is that the output would be ham domains and not spam domains.... We'd then compare that ham list to a SURBL and find the FPs....
> I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
Just for kicks, I ran it on the easy_ham corpus (about 2500h), and got the following:
http://ry.ca/geturi/easy_ham.html (1MB)
The output would be *much* cleaner with just a few optimizations/flags I have in mind for GetURI. Still, the above might be of some value.
- Ryan
On Sunday, September 5, 2004, 5:17:39 PM, Ryan Thompson wrote:
> Jeff Chan wrote to SURBL Discussion list:
>> On Sunday, September 5, 2004, 3:45:20 PM, Ryan Thompson wrote:
>>> Jeff Chan wrote to SURBL Discussion list and SpamAssassin Users:
>>>> On Sunday, September 5, 2004, 2:56:04 PM, Ryan Thompson wrote:
>>>>> FWIW, the mass-check I did on that 75K corpus took about 1.75h, on a beefy machine with rbldnsd running on localhost, with 20 concurrent jobs. (mass-check is slower than molasses for anything that blocks if you don't let it run concurrent jobs :-)
>>>> One shortcut, which may be adequate for purposes of cleaning up the SURBL data, might be to simply extract the URI domains from the ham corpus, sort and unique that list, then compare that ham URI domain list against the SURBL under test. Hits could be matched up against the source message. Since the hits are relatively few that could save much processing over using full SA on every message.
>>> Yeah. The *best* solution would be to have our own mass-checker. My GetURI (http://ry.ca/geturi/) could probably be extended for the task without much work, since it already extracts URIs and is capable of producing statistics.
>>> Maybe if I wired it to *also* accept a ham directory, it could cross-check domains in both corpora and list possible FPs.
>> But we should be able to use it singly against a directory of ham messages, right?
> Well, yes, but GetURI excludes any domains that are listed in multi, so I'd almost need to throw in a switch to flip the logic. It's a good idea, but I have a friend's graduation party to attend, so it isn't going to happen tonight. :-)
Hmm, could you make a leaner, separate program, say "ExtractURIDomain" that produced a list of URI hosts on standard output when fed mail text on standard input? That could be useful as a general filter program.
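Something along those lines might look like this (a hypothetical sketch, not GetURI itself; the domain regex is deliberately simplistic):

```python
import re

# Hypothetical "ExtractURIDomain"-style filter: given mail text, return
# the unique URI hosts. A real tool would also handle bare hosts and
# obfuscated URIs; this regex only catches plain http(s) URLs.
DOMAIN_RE = re.compile(r'https?://([A-Za-z0-9.-]+)', re.I)

def extract_hosts(text):
    return sorted({m.group(1).lower().rstrip(".") for m in DOMAIN_RE.finditer(text)})

# As a stdin-to-stdout filter it would be wired up as:
#   for host in extract_hosts(sys.stdin.read()): print(host)
print(extract_hosts("Visit http://Example.COM/page or https://ry.ca/geturi/"))
```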
>> I may try that against the SpamAssassin public corpora that Justin replied about, unless you beat me to it. :-)
> Just for kicks, I ran it on the easy_ham corpus (about 2500h), and got the following:
> The output would be *much* cleaner with just a few optimizations/flags I have in mind for GetURI. Still, the above might be of some value.
> - Ryan
Grabbing...
Jeff C.