On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
From: Rik van Riel
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
Here is a script I run against my spamtrap mailboxes to output a list of domain names:
egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | cut -d "/" -f1 | sed "s/=2E/./g" | grep "\..*\." | egrep -v " |>|=|@|\..*\..*\." | cut -d "." -f2-3 | tr -d "<" | sort | uniq
Depending on your mailbox format, it may work for you as well.
That will work on some plaintext URIs, but SpamAssassin and other tools have code to "render" messages that are not plaintext (MIME-encoded, multipart, etc.), on top of a good deal of other deobfuscation code. Since spammers sometimes try to make their messages harder for programs to read, the programs tend to become more complex and capable.
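
As a rough illustration of what "rendering" means here, the sketch below uses Python's standard email package to walk every text part of a message, undo base64 or quoted-printable transfer encoding, and only then look for URLs. It is not how SpamAssassin does it, just a minimal example under stated assumptions: the names extract_hosts and build_sample and the sample message are made up for illustration, and the URL pattern is deliberately naive.

```python
import base64
import re
from email import message_from_string
from urllib.parse import urlparse

# Deliberately simple URL pattern (an assumption for this sketch);
# real filters cope with far more obfuscation than this.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_hosts(raw_message):
    """Return sorted, unique hostnames of http(s) URLs found in all text parts."""
    msg = message_from_string(raw_message)
    hosts = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images, attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # charset name Python does not know
            text = payload.decode("latin-1", errors="replace")
        for url in URL_RE.findall(text):
            try:
                host = urlparse(url).hostname
            except ValueError:  # malformed URL, e.g. a broken IPv6 literal
                continue
            if host:
                hosts.add(host.rstrip("."))
    return sorted(hosts)


def build_sample():
    """Hypothetical two-part message: one quoted-printable part, one base64 part."""
    html = '<a href="http://shop.example.org/buy">buy</a>'
    b64 = base64.b64encode(html.encode("ascii")).decode("ascii")
    return "\n".join([
        "From: sender@example.net",
        "Subject: sample",
        "MIME-Version: 1.0",
        'Content-Type: multipart/alternative; boundary="BOUND"',
        "",
        "--BOUND",
        "Content-Type: text/plain; charset=us-ascii",
        "Content-Transfer-Encoding: quoted-printable",
        "",
        "Visit http://www=2Eexample=2Ecom/offer today",
        "",
        "--BOUND",
        "Content-Type: text/html; charset=us-ascii",
        "Content-Transfer-Encoding: base64",
        "",
        b64,
        "--BOUND--",
        "",
    ])


if __name__ == "__main__":
    for host in extract_hosts(build_sample()):
        print(host)
```

In the sample, the base64 part contains no literal "http" or "www" for a line-based grep to find, and the quoted-printable part hides the dots as =2E. Decoding first makes both hosts visible. Reducing hostnames to registered domains, as the cut step in the pipeline above attempts, is left out.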
Jeff C.