-----Original Message----- From: Jeff Chan [mailto:jeffc@surbl.org] Sent: Saturday, August 14, 2004 2:24 AM To: SURBL Discussion list Subject: Re: [SURBL-Discuss] RE: (1) Another Possible FP, and (2) header parsing issues
On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
From: Rik van Riel
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
Here is a script I run against my spamtrap mailboxes to output a list of domain names:

egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | \
  cut -d "/" -f1 | sed "s/=2E/./g" | grep "\..*\." | \
  egrep -v " |>|=|@|\..*\..*\." | cut -d "." -f2-3 | \
  tr -d "<" | sort | uniq
Depending on your mailbox format, it may work for you as well.
That will work on some plaintext URIs, but SpamAssassin and others have code to "render" messages from MIME, multipart messages, etc. that are not plaintext, in addition to a bunch of other deobfuscation code. Since spammers sometimes try to make their messages harder for programs to read, the programs tend to become more complex and capable.
Jeff C.
While I'm all for automation, please be extremely careful. Take it from someone who knows: harvesting domains with scripts will lead to headaches. Like Jeff said, there is other encoding to deal with: URL poisoning, and links that don't have http or www in them ("Paste this into your browser......" -- btw, SARE is working on that now).
I'm thinking the best way would be to take the actual SURBL extraction code and use it to rip out domains. But the SA code to decode the email would be needed as well. Putting this in the email pipe would be best.
If spam is not found in SURBL, run the SURBL extraction and append to a file.
:)
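A sketch of that check, under these assumptions: SURBL's multi list is queried with a plain DNS A lookup on <domain>.multi.surbl.org, listed domains resolve into 127.0.0.0/8, unlisted ones return NXDOMAIN, and the output filename is made up:

```python
import socket

def in_surbl(domain: str, resolve=socket.gethostbyname) -> bool:
    """True if SURBL's multi list already knows the domain.
    The resolver is injectable so the lookup can be stubbed out."""
    try:
        resolve(domain + ".multi.surbl.org")
        return True            # any answer means "listed"
    except OSError:
        return False           # NXDOMAIN and friends: not listed

def append_new_domains(domains, listfile="candidates.txt"):
    """The pseudocode above: keep only domains SURBL does not
    already have, and append them to a file for review."""
    with open(listfile, "a") as fh:
        for d in domains:
            if not in_surbl(d):
                fh.write(d + "\n")
```

This only collects candidates; as the warnings above say, a human (or at least more checking) should sit between the spamtrap and the published list.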
--Chris