>-----Original Message-----
>From: Jeff Chan [mailto:jeffc@surbl.org]
>Sent: Saturday, August 14, 2004 2:24 AM
>To: SURBL Discussion list
>Subject: Re: [SURBL-Discuss] RE: (1) Another Possible FP, and
>(2) header parsing issues
>
>
>On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
>> From: Rik van Riel
>
>>> Once I have a working script to extract URLs from a spamtrap
>>> feed, I'll make it available as free software. Possibly even
>>> bundled with Spamikaze ;)
>
>> Here is a script I run against my spamtrap mailboxes to output a
>> list of domain names:
>
>> egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | cut -d "/" -f1 |
>>   sed "s/=2E/\./g" | grep "\..*\." | egrep -v " |>|=|@|\..*\..*\." |
>>   cut -d "." -f2-3 | tr -d "<" | sort | uniq
>
>> Depending on your mailbox format, it may work for you as well.
>
>Which will work on some plaintext URIs, but SpamAssassin and
>others have code to "render" messages from MIME, multipart
>messages, etc. that are not plaintext, in addition to a bunch of
>other deobfuscation code. Since spammers sometimes try to make
>their messages harder for programs to read, the programs tend to
>become more complex and capable.
>
>Jeff C.
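For what it's worth, the quoted one-liner can be tightened up a bit. This is only a sketch, and like the original it assumes plaintext bodies -- MIME/quoted-printable parts would still need decoding first, as Jeff points out:

```shell
# Sketch: pull likely spamvertised domains out of an mbox file.
# Still plaintext-only; the two-label collapse at the end is naive
# and will mangle ccTLDs (example.co.uk becomes co.uk).
extract_domains() {
  grep -Eoi 'https?://[^<>" ]+|www\.[^<>" ]+' "$1" \
    | sed -E 's#^https?://##; s#/.*##; s#:.*##' \
    | tr 'A-Z' 'a-z' \
    | grep -E '^[a-z0-9.-]+\.[a-z]+$' \
    | awk -F. '{ print $(NF-1) "." $NF }' \
    | sort -u
}
```

Run it as `extract_domains main.mbx`; it prints one domain per line, deduplicated.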
While I'm all for automation, please be extremely careful. Take it from
someone who knows: harvesting domains with scripts will lead to headaches.
Like Jeff said, there are other encodings to deal with, URL poison, and links
that don't have http or www in them ("Paste this into your browser..." --
btw, SARE is working on that now).
I'm thinking the best way would be to take the actual SURBL code and use it
to rip out domains, but the SA code to decode the email would be needed as
well. Putting this in the email pipe would be best:
if spam:
    if not found in SURBL:
        run SURBL extract + append to file
:)
--Chris