New subject: (1) Another Possible FP, and (2) header p arsing issues

16 Aug 2004


      ...
-----Original Message-----
From: Jeff Chan [mailto:jeffc@surbl.org]
Sent: Saturday, August 14, 2004 2:24 AM
To: SURBL Discussion list
Subject: Re: [SURBL-Discuss] RE: (1) Another Possible FP, and 
(2) header
p arsing issues
On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
...
From: Rik van Riel
...
...
Once I have a working script to extract URLs from a spamtrap
feed, I'll make it available as free software.  Possibly even
bundled with Spamikaze ;)
...
Here is a script I run against my spamtrap mailboxes to
output a list of
...
domain names:
...
egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- |
cut -d "/" -f1 |
...
sed "s/=2E/./g" | grep "..*." | egrep -v "
|>|=|@|..*..*." | cut -d
...
"." -f2-3 | tr -d "<" | usort | uniq
...
Depending on your mailbox format, it may work for you as well.
Which will work on some plaintext URIs, but SpamAssassin and
others have code to "render" messages from MIME, multipart
messages, etc. that are not plaintext, in addition to a bunch of
other deobfuscation code.  Since spammers sometimes try to make
their messages harder for programs to read, the programs tend to
become more complex and capable.
Jeff C.
While I'm all for automation, but please be extremely careful. Take it from
someone who knows, harvesting domains from scripts will lead to headaches.
Like Jeff said, there is other encoding to deal with. URL poison. links that
don't have http or www in them. ("Paste this into your browser......", btw
SARE is working on that now.)
I'm thinking the best way would be to take the actual SURBL code, and use it
to rip out domains. But the SA code to unencode the email would be needed as
well. Putting this in the email pipe would be best.
If spam
    if not found in SURBL
    	run SURBL extract + append to file
:)
--Chris

RE: [SURBL-Discuss] RE: (1) Another Possible FP, and (2) header p arsing issues