"Jeff Chan" jeffc@surbl.org wrote:
Or how about an authenticated spam recipient address at Joe's? In other words, a place to mail spam into Joe's system.
Currently my filter can check data from two sources: 1) individual mail image files on disk and 2) messages on a remote POP3 mailbox.
(Extending it to the mailbox archive format wouldn't be too difficult.)
Using 1) I currently accept submissions from third parties forwarded as mail attachments (Content-Type: message/rfc822). I first drag them to a folder and then run the filter on it.
For large numbers in real time, I would have to automate that part and verify that the senders are trusted.
If there were demand for submission by attachment, I could write the necessary code to extract attachments from mails received in special mailboxes and run the scanner on them.
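A minimal sketch of that extraction step, using only Python's standard `email` package. The allow-list `TRUSTED_SENDERS`, the address in it, and the function name `extract_forwarded_spam` are hypothetical, and checking the `From` header is only a placeholder for real sender verification, since that header is trivially forged.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr

# Hypothetical allow-list of submitters; a From header alone proves nothing,
# so a real system would need stronger authentication than this.
TRUSTED_SENDERS = {"reporter@example.org"}


def extract_forwarded_spam(raw_mail: bytes) -> list[bytes]:
    """Return each message/rfc822 attachment of a submission as raw bytes."""
    outer = message_from_bytes(raw_mail, policy=policy.default)
    sender = parseaddr(str(outer.get("From", "")))[1].lower()
    if sender not in TRUSTED_SENDERS:
        return []  # ignore submissions from unknown senders

    spams = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the forwarded message itself.
            inner = part.get_payload(0)
            spams.append(inner.as_bytes())
    return spams
```

Each returned item could then be written to disk as an individual mail image file and handed to the existing scanner.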
"Rik van Riel" riel@surriel.com wrote:
On Sun, 15 Aug 2004, Joe Wein wrote:
If you make your spamtrap mailboxes accessible to me, I could automatically parse any number of them as long as I can get POP3 access from here.
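For illustration, draining one spamtrap mailbox over POP3 might look roughly like this with Python's standard `poplib`. The helper name `fetch_messages`, the host, the credentials, and `scan` are all hypothetical; this is a sketch, not the filter's actual code.

```python
import poplib
from typing import Iterator


def fetch_messages(conn: poplib.POP3, delete: bool = False) -> Iterator[bytes]:
    """Yield every message in the mailbox as raw bytes, oldest first."""
    count, _size = conn.stat()  # (message count, mailbox size in octets)
    for number in range(1, count + 1):
        _resp, lines, _octets = conn.retr(number)
        yield b"\r\n".join(lines)
        if delete:
            conn.dele(number)  # only takes effect once the session quits


# Hypothetical usage (host, login and scan() are placeholders):
# conn = poplib.POP3_SSL("pop.example.org")
# conn.user("spamtrap")
# conn.pass_("secret")
# for raw in fetch_messages(conn, delete=True):
#     scan(raw)
# conn.quit()
```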
That would work, as long as you're willing to suck down about 1GB of spam per day. I already have my spam in a news spool, so it can just be sucked down...
No POP3 of course, since I'm not aware of any mail software that scales to mailboxes with over 100k pieces of mail a day.
Hi Rik,
thanks for your offer :-)
I won't be able to do much before September, as I'm about to go on vacation later this week. Initially I was interested in a small enough subset that I can handle as is, but I can also see a lot of potential in processing the whole data set in real time. I think I can extend my filter to cope with the kind of volume you describe. I'll have to rethink how I log and archive the data, as I probably don't want to archive all spam (as I currently do), but just the ones that caused new listings.
I will take a look at NNTP, to see how much new code it would take to retrieve spams that way. Probably I could reuse quite a few bits from my existing POP3 code.
As for performance, I can currently handle about 60K messages per day, but I expect I could significantly speed that up. I currently check mails against SBL+XBL, and the necessary DNS lookups take up most of the elapsed time, but that wouldn't be absolutely necessary for "known bad" feed data.
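To show where that time goes: a DNS blocklist check is one DNS query per sender IP, built by reversing the IPv4 octets and appending the list's zone. The sketch below uses only the standard library; the function names and the zone `dnsbl.example.org` are placeholders, not the zones or code the filter actually uses.

```python
import ipaddress
import socket
from typing import Callable


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the lookup name: octets reversed, then the blocklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str,
              resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if the blocklist answers with a 127.x.x.x address."""
    try:
        answer = resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name: the IP is not listed
    return answer.startswith("127.")


# Hypothetical usage against a placeholder zone:
# is_listed("192.0.2.99", "dnsbl.example.org")
```

Each call blocks on a network round trip, which is why skipping the check for a feed that is already known to be spam would raise throughput.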
One question though: how many GB/day of spamtrap mail is Joe Wein able to handle? ;)
I may only be getting one GB/day now, but in the long run the only scalable solution will be to have the software that analyses the spam available to others.
I agree, it will have to be running on multiple hosts in the long term, otherwise it won't scale.
A million spams a day sounds interesting :-)
The only condition would be that the mail in question isn't made available to others, since that would expose the spamtrap addresses, breaking a promise I made to the guy who pointed one of the spamtrap domains at me...
No problem with that. The content of mails only becomes an issue when a listing is challenged, and even in those cases I currently do not reveal the recipient address.
Cheers!
Joe