On Thursday, February 10, 2005, 7:46:23 AM, John Delisle wrote:
> Couldn't this be automated using other spam detection techniques? I.e., SpamAssassin detects a message as 100% spam, but the URL is not in SURBL; SpamAssassin sends the email to a central repository, and any URLs are parsed and added to SURBL.
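(The flow being suggested would look roughly like the sketch below: pull the URL domains out of a message that already scored as certain spam, skip anything multi.surbl.org already knows about, and hand the rest off to some central collector. The domain extraction and the collector here are placeholders, not anything we actually run.)

import re
import socket

def extract_domains(body):
    # Very naive URL-domain extraction; real parsing (and reduction to
    # the registered base domain) is much fussier than this.
    return {d.lower() for d in re.findall(r'https?://([A-Za-z0-9.-]+)', body)}

def listed_in_surbl(domain, zone="multi.surbl.org"):
    # A domain is listed if <domain>.<zone> resolves to a 127.0.0.x address.
    try:
        addr = socket.gethostbyname(f"{domain}.{zone}")
        return addr.startswith("127.0.0.")
    except socket.gaierror:
        return False

def collect_candidates(spam_body):
    # Domains from a message already judged to be spam that SURBL doesn't
    # know about yet; these would go to the hypothetical central repository.
    return sorted(d for d in extract_domains(spam_body)
                  if not listed_in_surbl(d))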
It's very difficult to fully automate spam detection because not everyone agrees on what constitutes spam. In some borderline cases, one person's spam may be another person's ham.
As a global list, we need to be very conservative about adding records so as not to create false positives (FPs). For that reason, we seek to add only hosts that are pretty much universally agreed to be spam sources.
Hopefully it's somewhat clear from the website that we have different sources of data:
http://www.surbl.org/lists.html
JP and OB are based on large spam traps. They're mostly automated, but with some specific techniques for keeping out false positives. Outblaze, for example, only adds domains that were registered within the last 6 months and that have recently been sending a lot of spam. JP has an elaborate system for weeding out FPs before they get onto their list.
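To put the Outblaze rule in concrete terms, the gate is roughly the sketch below. The 180-day window is the "last 6 months" from above; the volume cutoff is a number I made up, since I don't know their actual figure, and how the registration date and trap counts are obtained is outside the sketch.

from datetime import datetime, timedelta

RECENT_REGISTRATION = timedelta(days=180)   # the "last 6 months" rule
MIN_RECENT_SPAM = 500                       # placeholder volume cutoff

def ob_style_candidate(registered_on, recent_trap_hits, now=None):
    # registered_on: registration date (e.g. from whois);
    # recent_trap_hits: spams seen from this domain in the traps lately.
    now = now or datetime.utcnow()
    newly_registered = (now - registered_on) <= RECENT_REGISTRATION
    spewing_now = recent_trap_hits >= MIN_RECENT_SPAM
    return newly_registered and spewing_now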
SC and AB are based on SpamCop reports. Both have inclusion thresholds, so only the most commonly reported spam domains get added. SpamCop reports have already been hand-checked, though the quality of the checking and reporting varies, so in a sense the data is multiply filtered before it gets published as SC and AB. (I'm redoing the way the SC data is handled in a way that should be even better, if I ever get around to it.)
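The threshold idea is simple enough to sketch: count how often each domain turns up in the (already human-checked) reports over some window and publish only the ones reported often enough. The cutoff below is a placeholder, not the real SC or AB value.

from collections import Counter

REPORT_THRESHOLD = 10    # placeholder; not the real SC or AB cutoff

def domains_over_threshold(reported_domains, threshold=REPORT_THRESHOLD):
    # reported_domains: one entry per (already human-checked) report.
    counts = Counter(d.lower() for d in reported_domains)
    return sorted(d for d, n in counts.items() if n >= threshold)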
WS is a manual list, meaning most of the entries are added by hand and checked by a human.
All the lists have FPs, some more than others. FPs are what prevent SURBLs from being used, say, at the MTA level at a telco, and it would be nice to eliminate them entirely. It's bad to have someone's ham marked as spam.
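The most basic guard is the rule in my signature: before a candidate goes onto a list, check it against the domains seen in known ham and drop it if it appears there. A tiny sketch of that check (where the ham corpus comes from is up to the list maintainer):

def safe_to_list(candidates, ham_domains):
    # "If it appears in hams, then don't list it": drop any candidate
    # that also shows up in known-good mail before publishing.
    ham = {d.lower() for d in ham_domains}
    return sorted({d.lower() for d in candidates} - ham)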
So yes, some parts of data collection can be automated, but quite a bit more engineering and thought needs to go into it than that.
Jeff C. -- "If it appears in hams, then don't list it."