On Friday, September 10, 2004, 10:40:39 AM, Pete McNeil wrote:
On Friday, September 10, 2004, 1:13:38 PM, Jeff wrote:
JC>> Thanks for your comments. By "recursive domain additions" do you
JC>> mean to initiate a proactive search of domains within a given
JC>> network? What I'm proposing is not to actively try to search,
JC>> but simply to bias the inclusion of domains that are *actually
JC>> reported to us as being in spams*.
What I mean by "recursive domain additions" (an internal name I use for this process) is something like this:
- Spamtrap sources the addition of a domain (URI) to the blacklist.
- A subset of domains in the blacklist is resolved to IPs, and those
  IPs are added to an internal reference list.
- Subsequent clean spamtrap sources are scanned for domain URIs that
  resolve to IPs on the reference list; if found, these new domains
  are added to the blacklist (or at least recommended as candidates).
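To make that concrete, here is a rough Python sketch of the loop as described above (the names, the socket-based resolution, and the single-pass structure are my own illustration, not anyone's actual implementation):

    import socket

    blacklist = {"spammer-domain.example"}   # hypothetical seed entry
    reference_ips = set()                    # IPs resolved from listed domains

    def resolve_ips(domain):
        """Resolve a domain to its IPv4 addresses; empty set on failure."""
        try:
            return {info[4][0]
                    for info in socket.getaddrinfo(domain, None, socket.AF_INET)}
        except socket.gaierror:
            return set()

    # Step 1: resolve (a subset of) blacklisted domains into the reference list.
    for domain in blacklist:
        reference_ips |= resolve_ips(domain)

    # Step 2: domains from clean spamtrap mail that land on referenced IPs
    # become candidates for the blacklist.
    def candidates_from_spamtrap(domains):
        return {d for d in domains
                if d not in blacklist and resolve_ips(d) & reference_ips}

In practice the resolution step would be batched and cached, and anything that second step returns would still need the hand-checking discussed below before being gated into the list.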
Aha, the space I was referring to was SpamCop reports, which AFAIK are human. SpamCop does get trap data, but I'm not exactly sure what they do with it.
That said, some of the same techniques might apply to our use of spamtrap data, provided hand-checking is also done.
Otherwise your description matches ours.
So this is not really a proactive search; rather, the capture of one domain predisposes the candidate generator to capture additional domains that resolve to the same IP(s).
Got it. That is similar to the principle I was proposing. ;-)
(Candidate generator = AI monitoring spamtrap data to extract URIs and recommend them as candidates for the blacklist.)
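For the extraction half of that, even a crude stand-in shows the idea; a real candidate generator would parse MIME parts and handle spammers' obfuscation tricks, so this regex is purely illustrative:

    import re

    # Naive hostname grab from http/https URIs; illustration only.
    URI_RE = re.compile(r'https?://([a-z0-9.-]+)', re.IGNORECASE)

    def extract_domains(message_body):
        """Pull candidate domains out of a spamtrap message body."""
        return {m.lower() for m in URI_RE.findall(message_body)}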
--- Sorry for the complexity here; I'm used to thinking in terms of our system, and it is sometimes difficult to describe the concepts outside of that context.
We all get accustomed to thinking in terms of our own systems, which is why explanations like this are sometimes needed to clear things up. I find it sometimes helps to step back and describe an outsider's view of things. I don't always succeed or remember to do that. ;-)
JC>> Hopefully my description of the difference makes some sense
JC>> and it can be seen why the potential for false inclusions
JC>> might be lower when the space is *actual spam reports*, and
JC>> not the space of all domains hosted in nearby networks.
Clearly. *actual spam reports* is analogous to clean spamtrap data - though I presume it may also include some non-spamtrap data submitted by users. You are definitely on the right track - that is, I think we're on the same page generally.
The SpamCop data I assume to be *human-sourced* reports. That's what I meant by "actual spam reports". "Human spam reports" would have been more descriptive.
The caution is: even with very strong spamtraps, errors occur in this process often enough to require some extra research before gating the new "candidates" into the blacklist, IME.
Our use of spamtraps (mostly feeding the WS and OB lists) is carefully tested. The WS entries are all supposed to be hand-checked, since we all agree that purely automatic methods let in too many FPs. Human checkers make mistakes too, though we're trying to cut down on those errors, for example by suggesting some requirements such as:
1. Domain age. Older domains should only be added with a lot of evidence. Most spammer domains are no more than a week or two old, often less than a few days old.
2. Only add domains that only appear in spams. Don't add domains that appear in hams.
The second seems the hardest to get across, even though it should be pretty obvious. The problem is that people say "yep, I've seen a spam with this domain, so I'm adding it". But seeing a domain in spam is not, by itself, the right criterion; the domain must also be absent from ham.
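Both requirements are easy to encode as a pre-check before a human reviewer ever sees a candidate. A minimal sketch, assuming we already have the domain's registration date (timezone-aware) and its spam/ham occurrence counts; the two-week threshold is illustrative, not policy:

    from datetime import datetime, timedelta, timezone

    MAX_AGE = timedelta(days=14)  # illustrative: most spammer domains are younger

    def passes_prechecks(creation_date, spam_count, ham_count):
        """Apply requirement 2 (never seen in ham) and requirement 1 (young)."""
        if ham_count > 0:
            return False  # appears in ham -> never add
        if datetime.now(timezone.utc) - creation_date > MAX_AGE:
            return False  # older domain -> defer pending stronger evidence
        return spam_count > 0  # must actually have been reported in spam

A domain that fails the age check isn't necessarily innocent; it just needs the extra human research mentioned above before it can be added.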
Thanks for comparing notes! :-)
Jeff C.