-----Original Message-----
From: Rob McEwen [mailto:webmail@powerviewsystems.com]
Sent: Thursday, August 12, 2004 11:18 AM
To: discuss@lists.surbl.org
Subject: [SURBL-Discuss] RE: (1) Another Possible FP, and (2) header parsing issues
Importance: High
I have a two-part question:
(1) header parsing issues...
I was reading a web site discussing an implementation of SURBL on the IceWarp web server (using a third party add-on). One person complained that there are too many false positives when checking IPs and domains found in the header of the e-mail. They felt that ONLY the body of the message should be examined. I see good arguments both ways. For example, parsing the header can catch spam which was originally sent to one place but then forwarded to another. On the other hand, actual affiliate URLs would normally occur only in the body of the message. Any thoughts or suggestions?
(2) Another Possible FP...
This person was asked to give an example of a message which shouldn't have been blocked and which would have gone through if the header wasn't parsed. They provided an example which had the following line in the header:
Message-ID: 000b01c47f1a$e02f73e0$0200a8c0@MUNGED-callatg.com
The offending domain was MUNGED-callatg.com
Therefore, I must ask: could MUNGED-callatg.com be a FP? I suspect so because they mentioned that this company is a division of GE. Please check on this.
I'm confused. (There's a first!) SURBL only checks the body for URLs. How did the Message-ID get hit?
--Chris
Chris said:
I'm confused. (There's a first!) SURBL only checks the body for URLs. How did the Message-ID get hit?
It's simply an issue where someone's implementation of SURBL provided the option of extracting domains from (1) the header, and/or (2) the client's IP address, and/or (3) the body of the e-mail. Any combination was possible/configurable. The "default" setting was to use all three.
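To make the mechanics concrete, here is a minimal sketch (in Python, purely illustrative; not the add-on's actual code) of that kind of configurable extraction plus a SURBL lookup. The loose host-name regex and the multi.surbl.org query zone are assumptions for illustration; see surbl.org for the current zones.

    import re
    import socket
    from email import message_from_string

    # Deliberately loose "looks like a host name" pattern -- header scraping
    # of this kind is what turned the Message-ID above into a false positive.
    HOST_RE = re.compile(r'[\w-]+(?:\.[\w-]+)+')
    URL_RE = re.compile(r'https?://([^/\s"\'>]+)', re.I)

    def candidate_domains(raw, use_headers=True, use_body=True):
        msg = message_from_string(raw)
        found = set()
        if use_headers:
            for value in msg.values():          # Received, Message-ID, From, ...
                found.update(HOST_RE.findall(value))
        if use_body and isinstance(msg.get_payload(), str):
            found.update(URL_RE.findall(msg.get_payload()))
        return found

    def surbl_listed(domain):
        # An A record under the SURBL zone means "listed"; NXDOMAIN means clean.
        try:
            return socket.gethostbyname(domain + ".multi.surbl.org")
        except socket.gaierror:
            return None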
The following is the software package that I am using for SURBL filtering:
http://www.2150.com/regexfilter/
I chose this because it works well with the Merak IceWarp webmail software I have running on a Windows 2000 server.
The guy who wrote this is very smart. Because he uses the filter for himself and didn't have to worry about "clients", he was very aggressive with his default settings, both for SURBL and for other linguistic aspects of his filter. Just about everyone using it has had to "loosen" it in a number of ways to prevent false positives for their clients... but this was a small price to pay for a well designed and FREE software package.
Rob McEwen
On Thursday, August 12, 2004, 9:17:29 AM, Rob McEwen wrote:
The guy who wrote this is very smart. Because he uses the filter for himself and didn't have to worry about "clients", he was very aggressive with his default settings, both for SURBL and for other linguistic aspects of his filter. Just about everyone using it has had to "loosen" it in a number of ways to prevent false positives for their clients... but this was a small price to pay for a well designed and FREE software package.
FWIW I am in contact with the author and he's somewhat redesigning his use of SURBLs. Hopefully the results will be more in line with how we expect them to be used, especially in an ISP context.
Jeff C.
Hi!
I was reading a web site discussing an implementation of SURBL on the IceWarp web server (using a third party add-on). One person complained that there are too many false positives when checking IPs and domains found in the header of the e-mail.
I'm confused. (There's a first!) SURBL only checks the body for URLs. How did the Message-ID get hit?
It should not. It seems they did their own implementation of something and were surprised it didn't work. If you use a list on things it's not meant to be tested on (message headers), don't be surprised when it breaks.
Obviously the original poster, and/or the people who implemented this, missed the point of what SURBL actually does. Please read the documentation on the website.
Bye, Raymond.
Raymond,
Maybe I'm reading into things, but your tone seems a bit harsh. To be sure, let me say that nobody here or on that other forum has bitched or complained about this. SURBL got rave reviews on that other site. I did forward this information to that site, as Jeff requested, to help educate them.
As for my own server, rather than turning parsing of headers "off", I'm currently running a test where I SURBL-check headers ONLY if the SURBL-checking of the body doesn't mark the message as spam. Next, if the body of the message is "clean" and the header gets checked, I'm saving ALL of these into a special folder where I can investigate the benefits/drawbacks for myself using real-world data. I understand that this is not the prescribed way to use SURBL... but, even if I don't like the results, this may be beneficial as a "flagging" or "scoring" system where I could allow these particular messages through, but have them handy to see what is not getting blocked by other filtering.
Rob McEwen
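Rob's two-stage test is easy to picture in code. A sketch, reusing the hypothetical candidate_domains and surbl_listed helpers from the earlier example (the folder handling is left out):

    def classify(raw):
        # Stage 1: body URIs only -- the prescribed use of SURBL.
        if any(surbl_listed(d) for d in candidate_domains(raw, use_headers=False)):
            return "spam"
        # Stage 2: header domains, checked only when the body is clean;
        # hits are saved for manual review rather than blocked outright.
        if any(surbl_listed(d) for d in candidate_domains(raw, use_body=False)):
            return "review"
        return "clean"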
Hi!
Maybe I'm reading into things, but your tone seems a bit harsh. To be sure, let me say that nobody here or on that other forum has bitched or complained about this. SURBL got rave reviews on that other site. I did forward this information to that site, as Jeff requested, to help educate them.
Yes and no. I would be very disappointed if people misuse SURBL and then say it gives FPs. So if you helped out, thanks. The 'tone' was only to make clear that it's not smart to use SURBL on headers.
We put a lot of time into the project, hand-checking the items and so on, precisely to avoid FPs and bring FP rates down. This is really time-consuming, as you can understand.
benefits/drawbacks for myself using real-world data. I understand that this is not the prescribed way to use SURBL... but, even if I don't like the results, this may be beneficial as a "flagging" or "scoring" system where I could allow these particular messages through, but have them handy to see what is not getting blocked by other filtering.
Ok, clear.
Thanks for your explanation.
Bye, Raymond.
On Thursday, August 12, 2004, 12:53:26 PM, Rob McEwen wrote:
As for my own server, rather than turning parsing of headers "off", I'm currently running a test where I SURBL-check headers ONLY if the SURBL-checking of the body doesn't mark the message as spam. Next, if the body of the message is "clean" and the header gets checked, I'm saving ALL of these into a special folder where I can investigate the benefits/drawbacks for myself using real-world data. I understand that this is not the prescribed way to use SURBL... but, even if I don't like the results, this may be beneficial as a "flagging" or "scoring" system where I could allow these particular messages through, but have them handy to see what is not getting blocked by other filtering.
Seems like a reasonable thing to try, at least for curiosity's sake, but I think for the long term, or for a server with many users, checking only message bodies is definitely the preferred way.
Headers are too easily and too often forged. While there's a disincentive to forge URLs or add legitimate ones to a message body, since those would distract the human reader, there's an incentive to put legitimate domains in headers to try to fool automated or human header checking. So the potential for FPs is much greater, and the incentives are wrong, when checking headers.
Cheers,
Jeff C.
On Thursday, August 12, 2004, 4:26:09 PM, Jeff Chan wrote:
Headers are too easily and too often forged. .... there's an incentive to put legitimate domains in headers to try to fool automated or human header checking.
Correction: none of the domains in SURBLs should be legitimate, so there should be no value in adding them to headers. Still, SURBLs are best used on URIs only.
Jeff C.
Jeff Chan wrote to 'SURBL Discussion list':
On Thursday, August 12, 2004, 4:26:09 PM, Jeff Chan wrote:
Headers are too easily and too often forged. .... there's an incentive to put legitimate domains in headers to try to fool automated or human header checking.
Correction: none of the domains in SURBLs should be legitimate, so there should be no value in adding them to headers. Still, SURBLs are best used on URIs only.
Yes, and your earlier point is still valid, as headers would be an easy way to poison SURBL checks and make life a living hell for those who hand-check URIs.
- Ryan
On Thu, 12 Aug 2004, Raymond Dijkxhoorn wrote:
It should not. It seems they did their own implementation of something and
[snip]
Speaking of which, I'm curious whether the server-side software SURBL uses is available. I'm getting about 70k spams/day into my spamtraps here and it would be nice to turn it into a SURBL somehow...
If the SURBL whitelist is DNS queryable, it would even be possible for third parties to run low FP SURBL lists.
kind regards,
Rik
On Friday, August 13, 2004, 4:52:59 PM, Rik Riel wrote:
Speaking of which, I'm curious whether the server-side software SURBL uses is available. I'm getting about 70k spams/day into my spamtraps here and it would be nice to turn it into a SURBL somehow...
Hi Rik,
There isn't much code on the SURBL server side of things, just some scripts and small programs for processing name and IP lists. It sounds like you're looking for mail and URI parsing code, which can be non-trivial. Certainly some of the code in SpamAssassin or SpamCopURI could be of interest.
(SURBLs are built from data provided by third parties such as SpamCop, Outblaze, SARE, Bill Stearns, Raymond, Joe Wein, etc. As such we don't do any processing of actual spam messages, just the extracted URI contents.)
If the SURBL whitelist is DNS queryable, it would even be possible for third parties to run low FP SURBL lists.
We don't have really large whitelists, mostly just a few obvious or joe-jobbed entries. The key is to keep legitimate sites out of the data in the first place, and that is done mainly through good procedures and practices.
Jeff C.
On Fri, 13 Aug 2004, Jeff Chan wrote:
and URI parsing code, which can be non-trivial. Certainly some of the code in SpamAssassin or SpamCopURI could be of interest.
Good point, I'll have to look at those.
(SURBLs are built from data provided by third parties such as SpamCop, Outblaze, SARE, Bill Stearns, Raymond, Joe Wein, etc. As such we don't do any processing of actual spam messages, just the extracted URI contents.)
After expansion of the recipient lists, I get about 150k spamtrap mails per day. A bit much to check the URLs by hand, so I'm looking for a way to automatically extract them and make them available.
I'd be happy to put up a feed for SURBL and other interested parties. I'll take a look at the SpamAssassin code; if anybody has pointers to other software that might make my task easier, please let me know ;)
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
cheers,
Rik
----- Original Message -----
From: Rik van Riel
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
Here is a script I run against my spamtrap mailboxes to output a list of domain names:
egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | cut -d "/" -f1 | sed "s/=2E/./g" | grep "\..*\." | egrep -v " |>|=|@|\..*\..*\." | cut -d "." -f2-3 | tr -d "<" | sort | uniq
Depending on your mailbox format, it may work for you as well.
Bill
On Friday, August 13, 2004, 10:33:22 PM, Bill Landry wrote:
From: Rik van Riel
Once I have a working script to extract URLs from a spamtrap feed, I'll make it available as free software. Possibly even bundled with Spamikaze ;)
Here is a script I run against my spamtrap mailboxes to output a list of domain names:
egrep -i "http|www" main.mbx | cut -d ":" -f2 | cut -b 3- | cut -d "/" -f1 | sed "s/=2E/./g" | grep "\..*\." | egrep -v " |>|=|@|\..*\..*\." | cut -d "." -f2-3 | tr -d "<" | sort | uniq
Depending on your mailbox format, it may work for you as well.
That will work on some plaintext URIs, but SpamAssassin and others have code to "render" messages from MIME, multipart messages, etc. that are not plaintext, in addition to a bunch of other deobfuscation code. Since spammers sometimes try to make their messages harder for programs to read, the programs tend to become more complex and capable.
Jeff C.
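As a rough illustration of the "rendering" Jeff describes (a sketch with Python's standard email module, not SpamAssassin's actual code), something like this walks the MIME parts and undoes base64/quoted-printable before any URI extraction:

    from email import message_from_binary_file

    def rendered_text(path):
        # Collect the decoded text parts of a possibly-multipart message.
        with open(path, "rb") as f:
            msg = message_from_binary_file(f)
        chunks = []
        for part in msg.walk():
            if part.get_content_type() in ("text/plain", "text/html"):
                payload = part.get_payload(decode=True)  # undoes base64/QP
                if payload:
                    charset = part.get_content_charset() or "latin-1"
                    chunks.append(payload.decode(charset, "replace"))
        return "\n".join(chunks)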
"Rik van Riel" riel@surriel.com
After expansion of the recipient lists, I get about 150k spamtrap mails per day. A bit much to check the URLs by hand, so I'm looking for a way to automatically extract them and make them available.
I'd be happy to put up a feed for SURBL and other interested parties.
Hi Rik,
As Jeff mentioned, I am one of the contributors to WS. I run my own client filter on about a dozen mailboxes right now, yielding about 70 domains per day.
If you make your spamtrap mailboxes accessible to me, I could automatically parse any number of them as long as I can get POP3 access from here. That way you wouldn't spend any time checking by hand; I can apply my established technology and procedures to a larger data set, and everyone can use the output via SURBL without any further effort on your part. POP3 download or SMTP forwarding would use some bandwidth, though.
This doesn't preclude you from finding other approaches to harvest data from those spamtraps yourself, of course.
Joe Wein
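The POP3 pull Joe describes needs very little code. A minimal sketch with Python's stdlib poplib (host and credentials hypothetical):

    import poplib

    def drain_mailbox(host, user, password, handle):
        box = poplib.POP3(host)
        box.user(user)
        box.pass_(password)
        count, _size = box.stat()
        for i in range(1, count + 1):
            _resp, lines, _octets = box.retr(i)
            handle(b"\r\n".join(lines))   # hand the raw message to the filter
            box.dele(i)                   # keep the trap mailbox from growing
        box.quit()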
Rik:
I like Joe Wein's suggestion.
There are a lot of things that go into this kind of checking (factoring in domain registration dates, a domain's nameservers, etc.), plus other checks to make sure that legitimate mail doesn't get into the spamtrap. On top of all of this, you also have to watch out for innocent parties being blacklisted due to a spammer's "Joe Job" attack (where the spammer tries to make it look like the spam came from another, innocent party). Finally, Rik, even if you have the skills to do all this, Joe Wein may be able to do it faster and with less meticulous hand testing due to the extensive tools and knowledge he has developed. It is obvious from his web site that he has successfully dealt with all of these harder issues before.
I'd say that your spamtrap and Joe Wein's extraction/checking process would make a great combination. It would likely be a wonderful contribution to SURBL if you took him up on this offer.
Rob McEwen
On Sat, 14 Aug 2004, Rob McEwen wrote:
Finally, Rik, even if you have the skills to do all this, Joe Wein may be able to do it faster and with less meticulous hand testing due to the extensive tools and knowledge he has developed. It is obvious from his web site that he has successfully dealt with all of these harder issues before.
One question though: how many GB/day of spamtrap mail is Joe Wein able to handle? ;)
I may only be getting one GB/day now, but in the long run the only scalable solution will be to have the software that analyses the spam available to others.
E.g. it would be interesting to run it on the CBL spamtraps, which receive over an order of magnitude more spam than my spamtraps here...
I'd be more than happy to send my spamtrap mail to Joe Wein though, either by nntpsend or by having him pull it from my nntp server here. The only condition would be that the mail in question isn't made available to others, since that would expose the spamtrap addresses, breaking a promise I made to the guy who pointed one of the spamtrap domains at me...
kind regards,
Rik
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Rik van Riel writes:
I'd be more than happy to send my spamtrap mail to Joe Wein though, either by nntpsend or by having him pull it from my nntp server here. The only condition would be that the mail in question isn't made available to others, since that would expose the spamtrap addresses, breaking a promise I made to the guy who pointed one of the spamtrap domains at me...
interesting...
It would be nice to define an efficient spamtrap-delivery system -- possibly not even SMTP or NNTP, just a protocol where a client (i.e. the one doing the pull-down) connects to a port, authenticates, and gets a continual mbox-formatted stream of messages until it chooses to disconnect.
(After all, in the spamtrap case, you just want the mails ASAP -- not necessarily *all* the mails, just the freshest ones. Also, "from"/"to" is immaterial because they're all going to the same destination -- the spamtrap.)
Maybe flooding over NNTP, treating a spamtrap as a USENET feed, is the way to do it?
- --j.
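The stream-on-connect idea could be prototyped in a few lines. A hypothetical client (the port, AUTH line, and token are invented for illustration) that reads an endless mbox-formatted stream and splits it on "From " delimiters:

    import socket

    def stream_spamtraps(host, port, token, handle):
        with socket.create_connection((host, port)) as s:
            s.sendall(b"AUTH " + token.encode("ascii") + b"\r\n")
            buf, current = b"", []
            for chunk in iter(lambda: s.recv(65536), b""):
                buf += chunk
                while b"\n" in buf:
                    line, buf = buf.split(b"\n", 1)
                    # "From " at the start of a line is the mbox delimiter.
                    if line.startswith(b"From ") and current:
                        handle(b"\n".join(current))
                        current = []
                    current.append(line)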
Hi!
If you make your spamtrap mailboxes accessible to me, I could automatically parse any number of them as long as I can get POP3 access from here. That way you wouldn't spend any time checking by hand; I can apply my established technology and procedures to a larger data set, and everyone can use the output via SURBL without any further effort on your part. POP3 download or SMTP forwarding would use some bandwidth, though.
That would be an easy solution. Is that an option, to POP the spamtrap?
Bye, Raymond.
On Sun, 15 Aug 2004, Joe Wein wrote:
If you make your spamtrap mailboxes accessible to me, I could automatically parse any number of them as long as I can get POP3 access from here.
That would work, as long as you're willing to suck down about 1GB of spam per day. I already have my spam in a news spool, so it can just be sucked down...
No POP3 of course, since I'm not aware of any mail software that scales to mailboxes with over 100k pieces of mail a day.
Rik
On Saturday, August 14, 2004, 11:36:36 AM, Rik Riel wrote:
That would work, as long as you're willing to suck down about 1GB of spam per day. I already have my spam in a news spool, so it can just be sucked down...
No POP3 of course, since I'm not aware of any mail software that scales to mailboxes with over 100k pieces of mail a day.
I agree with Raymond, Joe and Rob. Using Joe's process would leverage his good work well.
I also agree with Rik that NNTP in some ways makes more sense for this kind of large feed (as would RSS). (POP of course is fine for "normal" individual users.)
Some sort of a general RSS or NNTP capability into Joe's suite could make a wonderful spamtrap receptacle for Joe and all of us.... ;-) If we had that we could solicit other spamtrap feeds from people we can trust.
Jeff C.
On Saturday, August 14, 2004, 1:40:46 PM, Jeff Chan wrote:
Some sort of a general RSS or NNTP capability into Joe's suite could make a wonderful spamtrap receptacle for Joe and all of us.... ;-) If we had that we could solicit other spamtrap feeds from people we can trust.
Or how about an authenticated spam recipient address at Joe's? In other words, a place to mail spam into Joe's system.
Jeff C.
"Jeff Chan" jeffc@surbl.org wrote:
Or how about an authenticated spam recipient address at Joe's? In other words, a place to mail spam into Joe's system.
Currently my filter can check data from two sources: 1) individual mail image files on disk, and 2) messages in a remote POP3 mailbox.
(Extending it to the mailbox archive format wouldn't be too difficult.)
Using 1), I currently accept submissions from third parties forwarded as mail attachments (Content-Type: message/rfc822). I first drag them to a folder and then run the filter on them.
For large numbers in real time, I would have to automate that part and verify that the senders are trusted.
If there was demand for submission by attachment, I could write the necessary code to extract attachments from mails received in special mailboxes and run the scanner on them.
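Extracting those message/rfc822 attachments is straightforward with Python's email module. A sketch of the kind of automation Joe mentions:

    from email import message_from_binary_file

    def extract_forwarded(path):
        # Yield the inner messages forwarded as message/rfc822 attachments.
        with open(path, "rb") as f:
            outer = message_from_binary_file(f)
        for part in outer.walk():
            if part.get_content_type() == "message/rfc822":
                # The payload of an rfc822 part is a list containing the
                # attached Message object(s).
                for inner in part.get_payload():
                    yield inner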
"Rik van Riel" riel@surriel.com wrote:
That would work, as long as you're willing to suck down about 1GB of spam per day. I already have my spam in a news spool, so it can just be sucked down...
No POP3 of course, since I'm not aware of any mail software that scales to mailboxes with over 100k pieces of mail a day.
Hi Rik,
thanks for your offer :-)
I won't be able to do much before September, as I'm about to go on vacation later this week. Initially I was interested in a small enough subset that I could handle as is, but I can also see a lot of potential in processing the whole data set in real time. I think I can extend my filter to cope with the kind of volume you describe. I'll have to rethink how I log and archive the data, as I probably don't want to archive all spam (as I currently do), but just the messages that caused new listings.
I will take a look at NNTP to see how much new code it would take to retrieve spams that way. I could probably reuse quite a few bits from my existing POP3 code.
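For what it's worth, pulling articles over NNTP is also only a few lines with Python's stdlib nntplib (since removed in the newest Python releases); the host and group names here are hypothetical:

    import nntplib

    def fetch_group(host, group, handle):
        with nntplib.NNTP(host) as nn:
            _resp, _count, first, last, _name = nn.group(group)
            for num in range(first, last + 1):
                try:
                    _resp, info = nn.article(str(num))
                except nntplib.NNTPTemporaryError:
                    continue                      # article expired or cancelled
                handle(b"\r\n".join(info.lines))  # raw message to the filter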
As for performance, I can currently handle about 60K messages per day, but I expect I could speed that up significantly. I currently check mails against SBL+XBL, and the necessary DNS lookups take up most of the elapsed time, but that wouldn't be absolutely necessary for "known bad" feed data.
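The SBL+XBL check Joe refers to is a conventional DNSBL query: reverse the IP's octets under the list's zone (sbl-xbl.spamhaus.org was the combined zone at the time). A sketch:

    import socket

    def ip_listed(ip, zone="sbl-xbl.spamhaus.org"):
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            return socket.gethostbyname(query)   # 127.0.0.x encodes the list
        except socket.gaierror:
            return None                          # NXDOMAIN: not listed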
One question though, how many GB/day of spamtrap mail is Joe Wein able to handle ? ;)
I may only be getting one GB/day now, but in the long run the only scalable solution will be to have the software that analyses the spam available to others.
I agree, it will have to be running on multiple hosts in the long term, otherwise it won't scale.
A million spams a day sounds interesting :-)
The only condition would be that the mail in question isn't made available to others, since that would expose the spamtrap addresses, breaking a promise I made to the guy who pointed one of the spamtrap domains at me...
No problem with that. The content of mails only becomes an issue when a listing is challenged, and I currently do not reveal the recipient address in those cases either.
Cheers!
Joe
Hi!
I will take a look at NNTP to see how much new code it would take to retrieve spams that way. I could probably reuse quite a few bits from my existing POP3 code.
As for performance, I can currently handle about 60K messages per day, but I expect I could speed that up significantly. I currently check mails against SBL+XBL, and the necessary DNS lookups take up most of the elapsed time, but that wouldn't be absolutely necessary for "known bad" feed data.
Do you already have a local copy of those zone files? That would also speed up processing, I guess...
I agree, it will have to be running on multiple hosts in the long term, otherwise it won't scale.
A million spams a day sounds interesting :-)
I could put an agent on one of my boxes to help share the load, if needed.
Bye, Raymond.
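Even without local zone files, an in-process cache cuts down repeated lookups considerably. A minimal sketch:

    import socket
    from functools import lru_cache

    @lru_cache(maxsize=100000)
    def cached_lookup(name):
        # One cached answer per queried name for the life of the process; a
        # local rbldnsd or caching resolver serving the zones goes further.
        try:
            return socket.gethostbyname(name)
        except socket.gaierror:
            return None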
"Raymond Dijkxhoorn" raymond@prolocation.net wrote:
Do you already have a local copy of those zone files? That would also speed up processing, I guess...
Actually, I don't even run a local DNS server at this point. All queries are routed via the DNS server of my broadband provider, which has been an adequate solution so far (600 spams/day + forwards).
After I get back from vacation I'll probably set up a DNS server to speed things up a bit (I might need some hand-holding, as I'm not a Linux admin by training...)
Joe
Hi!
Actually, I don't even run a local DNS server at this point. All queries are routed via the DNS server of my broadband provider, which has been an adequate solution so far (600 spams/day + forwards).
After I get back from vacation I'll probably set up a DNS server to speed things up a bit (I might need some hand-holding, as I'm not a Linux admin by training...)
If you need some help there, let me know. Have a nice vacation meanwhile.
Bye, Raymond.