-----Original Message----- From: Jeff Chan [mailto:jeffc@surbl.org] Sent: Saturday, August 14, 2004 2:24 AM To: SURBL Discussion list Subject: Re: [SURBL-Discuss] RE: (1) Another Possible FP, and (2) header parsing issues
On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
From: Rik van Riel
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
Here is a script I run against my spamtrap mailboxes to output a list of domain names:

    egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | \
        cut -d "/" -f1 | sed "s/=2E/./g" | grep "\..*\." | \
        egrep -v " |>|=|@|\..*\..*\." | cut -d "." -f2-3 | \
        tr -d "<" | sort | uniq
Depending on your mailbox format, it may work for you as well.
That will work on some plaintext URIs, but SpamAssassin and others have code to "render" messages from MIME, multipart messages, etc. that are not plaintext, in addition to a bunch of other deobfuscation code. Since spammers sometimes try to make their messages harder for programs to read, the programs tend to become more complex and capable.
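To illustrate why MIME rendering matters, here is a minimal sketch (Python, not the actual SpamAssassin code; the regex and function name are my own) that decodes quoted-printable/base64 parts before extracting host names, which is exactly where a plain grep over the raw mbox falls short:

```python
import email
import re

URL_RE = re.compile(r'https?://([A-Za-z0-9.-]+)', re.IGNORECASE)

def extract_domains(raw_message: bytes):
    """Decode each MIME text part, then pull host names out of the text."""
    msg = email.message_from_bytes(raw_message)
    domains = set()
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue
        # get_payload(decode=True) undoes base64/quoted-printable encoding,
        # so obfuscations like "=2E" for "." disappear before we grep.
        body = part.get_payload(decode=True) or b''
        text = body.decode(part.get_content_charset() or 'latin-1', 'replace')
        domains.update(m.group(1).lower() for m in URL_RE.finditer(text))
    return domains

raw = (b"MIME-Version: 1.0\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Transfer-Encoding: quoted-printable\r\n\r\n"
       b"Visit http://spammer=2Eexample=2Ecom/deal now!\r\n")
print(extract_domains(raw))
```

Run against the sample message, the quoted-printable dots are decoded first, so the domain comes out clean even though "http://spammer.example.com" never appears literally in the raw mbox bytes.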
Jeff C.
I'm all for automation, but please be extremely careful. Take it from someone who knows: harvesting domains with scripts will lead to headaches. Like Jeff said, there is other encoding to deal with, URL poison, and links that don't have http or www in them ("Paste this into your browser...", which SARE is working on now, btw).
I'm thinking the best way would be to take the actual SURBL code and use it to rip out domains. But the SA code to decode the email would be needed as well. Putting this in the email pipe would be best.
If a spam's domains are not found in SURBL, run the SURBL extractor and append them to a file.
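Chris's "check, then extract and append" step could be sketched like this (a hypothetical Python fragment, not actual SURBL code; multi.surbl.org is the public combined zone, and the function names are my own):

```python
import socket

def surbl_query_name(domain: str, zone: str = "multi.surbl.org") -> str:
    """Build the DNS name to look up: the domain prepended to the list zone."""
    return f"{domain}.{zone}"

def listed_in_surbl(domain: str) -> bool:
    """True if the lookup resolves (listed); NXDOMAIN means not listed."""
    try:
        socket.gethostbyname(surbl_query_name(domain))
        return True
    except socket.gaierror:
        return False

def collect_unlisted(domains, path: str) -> None:
    """Append domains that SURBL does not yet know about to a review file."""
    with open(path, "a") as f:
        for d in domains:
            if not listed_in_surbl(d):
                f.write(d + "\n")
```

The review file would then be hand-checked before anything is submitted, per the cautions above.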
:)
--Chris
Chris Santerre writes:
I'm thinking the best way would be to take the actual SURBL code, and use it to rip out domains. But the SA code to unencode the email would be needed as well. Putting this in the email pipe would be best.
BTW, what some people are doing is:
1. write a small plugin that spits out the decoded URLs or the decoded message body to a particular file
2a. mass-check using that plugin, or
2b. install it on your system-wide spamd installation
3. check the file every now and again for the URLs
4. profit!
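Step 3 above ("check the file every now and again") might look something like this hypothetical sketch, assuming the plugin appends one decoded URL per line:

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(path: str, n: int = 20):
    """Tally host names from a file of one-URL-per-line plugin output,
    most frequent first, for periodic hand review."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            host = urlparse(line.strip()).hostname
            if host:
                counts[host.lower()] += 1
    return counts.most_common(n)
```

Sorting by frequency surfaces the domains that keep showing up across many spams, which are the best candidates to check first.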
--j.
Chris Santerre wrote to 'SURBL Discussion list':
I'm thinking the best way would be to take the actual SURBL code, and use it to rip out domains. But the SA code to unencode the email would be needed as well. Putting this in the email pipe would be best.
If spam if not found in SURBL run SURBL extract + append to file
Here's some sample output from the script I made a couple of months ago:
http://ry.ca/spam/results.html
It already does exactly what you describe, using the SA3 plugin, ignoring any domains on the whitelists and bl[ao]cklists. It attempts to display the results in a very readable way to assist with hand-checking. The "Score" assigned is just a rough heuristic designed to separate likely spammer domains from poisoning attempts.
It needs a little work... like detecting valid IP URIs (http://24.0.0.1/) and checking them in reverse (1.0.0.24.*.surbl.org), and fixing some infrequent TLD-chopping issues.
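The octet reversal Ryan describes is mechanical; a small illustrative helper (hypothetical function name, Python rather than the script's own language) would be:

```python
def surbl_ip_query(ip: str, zone: str = "multi.surbl.org") -> str:
    """Reverse the octets of a dotted-quad IPv4 address for a
    SURBL-style DNS lookup, e.g. 24.0.0.1 -> 1.0.0.24.multi.surbl.org."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255
                                   for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return ".".join(reversed(octets)) + "." + zone
```

IP addresses are reversed for the lookup (like rDNS zones), whereas domain names are prepended as-is; conflating the two is an easy way to get silent misses.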
Even in its current form, it saves me *hours* of hand-checking. It also supports local whitelists and blacklists, so I generally feed it about 500-1000 spams at a time, depending on how ambitious I feel, and go from there. I make a few passes, first picking out the poisoning attempts (most of those are easy to spot in the second list) and malformed URIs, throwing those into the local whitelist. That usually weeds out over half of the remaining URIs. I keep making passes in increasing order of uncertainty until I'm left with about a dozen really icky ones that are tough to classify. It works well, because by then I'm usually sick of looking at URIs anyway, and usually those tough ones are best left alone, to avoid FPs. :-)
I'm really glad I added the NANAS links. I also do this from a "safe" browser so I can open up the message text (in the second list on the page, but my site won't give you guys access to that :-)), or click on the spammer URL to check out their site.
And, the way my mind works, I'll use this a few times and gradually add more automation as I become simultaneously annoyed with the repetition, and comfortable with my (previously human) algorithm for automation.
- Ryan
_______________________________________________
Discuss mailing list
Discuss@lists.surbl.org
http://lists.surbl.org/mailman/listinfo/discuss
Yikes. I really need to follow up that last post... :-)
Ryan Thompson wrote to SURBL Discussion list:
Here's some sample output from the script I made a couple of months ago:
http://ry.ca/spam/results.html
It already does exactly what you describe, using the SA3 plugin,
And there's a 2.63 version, too... if there were enough demand for both, consolidating the code with a version check and appropriate logic branching wouldn't be hard. Since I made the switch to SA3, I haven't looked back.
Even in its current form, it saves me *hours* of hand-checking.
So, that was a poor choice of words. :-) The point of this program is to *help* with hand-checking, not eliminate it. It saves hours of repetitive parsing and looking and going cross-eyed, so that you can spend more time just hand-checking.
- Ryan