On Wednesday, September 8, 2004, 7:13:36 AM, Frank Ellermann wrote:
Jeff Chan wrote:
there must be some form of feedback or error correction, or other strategies to deal with misclassifications.
Whitelisting is one strategy.
ACK, but where and as far as possible I'd prefer a technical definition like the "BI" (Breidbarth Index) in Usenet.
Here's a definition (note there is no H in the name):
http://www.stopspam.org/usenet/mmf/breidbart.html
"The BI is a measure of how spammy a spammed news article is. It is the sum of the square root of the number of groups each copy of a spam article is posted to. So if you post 10 copies of an article, each cross-posted to 4 groups, the BI is 20. Other ways of reaching the BI=20 mark (a threshhold used by some cancellers) is to post 20 copies, each to just one group, 4 copies to 25 groups each, or 8 articles to 6 groups each and one more to just one group. (for BI=20.6)"
It's interesting, but probably does not apply in the mail spam area directly. I suppose we could say how often does a domain appear on multiple SURBLs, but some of the SURBL data feeds are unitary, i.e. we can't see how many reports went into the listing, only whether a domain is listed or not.
This sort of idea could perhaps be useful for categorizing spamtrap data however, especially across multiple spamtraps.
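The arithmetic in the quoted BI definition is easy to check. What follows is only a minimal sketch: the function name `breidbart_index` and its input (a list holding the number of groups each copy was posted to) are assumptions made for illustration and do not come from the stopspam.org page.

```python
import math

def breidbart_index(groups_per_copy):
    """Sum of sqrt(number of groups) over every posted copy of an article."""
    return sum(math.sqrt(n) for n in groups_per_copy)

# The four cases from the quoted definition:
print(breidbart_index([4] * 10))                  # 10 copies x 4 groups  -> 20.0
print(breidbart_index([1] * 20))                  # 20 copies x 1 group   -> 20.0
print(breidbart_index([25] * 4))                  # 4 copies x 25 groups  -> 20.0
print(round(breidbart_index([6] * 8 + [1]), 1))   # 8 x 6 groups + 1 x 1  -> 20.6
```

Note that the calculation needs the full list of copies and their cross-posts, which is exactly the information that is visible on Usenet and not generally visible for mail.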
But I think your complaint is that there are no objective criteria for whitelisting. That's fair, but there always must be some subjective judgement applied, especially when we can't see the entire universe of mail spam in the same way that the entire universe of Usenet spam *is easily visible*.
It's also definitely not the case that we can see the entire mail ham universe, so there really can't be a generally knowable measure of the spam/ham ratio of a given domain.
This is somewhat a question of philosophy and science: to know what is knowable and what is not, i.e. epistemology.
Since spammyness versus legitimacy is not easily measured purely objectively, we must reserve the right to make judgements.
If you have a BI or something similar for *mail* spam, then please share it.
E.g. whitelisting TLD .edu is almost as bad an idea as blacklisting TLD .biz.
Not to worry, neither is going to happen. Such things would be too powerful and probably unnecessary. Our focus is more on the spammyness of individual domains.
Another is trying to get enough spam reports or even trapped spam to be able to get some meaningful statistical impression about spammyness. If 1000 people report a domain as spammy, it probably is. If only 1 person says it's spammy, that is much weaker evidence.
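The "how many people reported it" idea can be sketched roughly as below. This is an illustration only: the `(reporter, domain)` input format, the function names, and the `min_reporters` cutoff are all hypothetical and do not describe how any SURBL data engine actually works.

```python
from collections import defaultdict

def reporter_counts(reports):
    """Count distinct reporters per domain.

    reports: iterable of (reporter_id, domain) pairs (hypothetical format).
    """
    seen = defaultdict(set)
    for reporter, domain in reports:
        seen[domain.lower()].add(reporter)
    return {domain: len(who) for domain, who in seen.items()}

def probably_spammy(reports, min_reporters=3):
    """Domains reported by at least min_reporters distinct reporters.

    The cutoff is an arbitrary illustrative value, not a real policy.
    """
    counts = reporter_counts(reports)
    return sorted(d for d, n in counts.items() if n >= min_reporters)

sample = [
    ("alice", "pills.example"), ("bob", "pills.example"),
    ("carol", "Pills.example"), ("alice", "pills.example"),
    ("dave", "oneoff.example"),
]
print(reporter_counts(sample))
print(probably_spammy(sample))
```

Counting distinct reporters, not raw reports, keeps one very active reporter from pushing a domain over the cutoff alone.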
You could combine these strategies using the SC input: If the SC data matches whitelisted domains, then something is wrong:
Either the domain shouldn't be whitelisted w.r.t. the SC zone, or it should be reported as "IB link" (innocent bystander) to deputies@admin.spamcop
That's fine, but reporting IB to SpamCop does not take them out of sc.surbl.org. That still must be done on our side.
In fact we should probably also be reporting whitelist hits back to SpamCop as innocent bystanders. The actual number of meaningful whitelist hits is much smaller than you may be assuming.
The feed from SpamCop into sc.surbl.org is one-way: from them to us.
Both cases require some manual intervention, unfortunately, but at least you would catch erroneous WL entries.
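The cross-check being proposed, comparing incoming SC data against the whitelist and setting the matches aside for a human to look at, might look something like this. It is a sketch under assumptions: `split_feed`, its inputs, and the example domains ending in `.example` are hypothetical, and the real feed and whitelist formats are not shown in this thread.

```python
def split_feed(reported_domains, whitelist):
    """Split reported domains into (to_list, whitelist_hits).

    Whitelisted domains are kept out of the list but returned separately,
    so that each hit can be reviewed by hand.
    """
    wl = {d.lower().rstrip(".") for d in whitelist}
    to_list, hits = [], []
    for domain in reported_domains:
        d = domain.lower().rstrip(".")
        (hits if d in wl else to_list).append(d)
    return to_list, hits

whitelist = ["yahoo.com", "ebay.com"]
feed = ["cheap-pills.example", "Yahoo.com", "refi-now.example"]

to_list, hits = split_feed(feed, whitelist)
print(to_list)  # ['cheap-pills.example', 'refi-now.example']
print(hits)     # ['yahoo.com']
```

Each entry in `hits` is then either a correct whitelisting (an innocent bystander) or a sign that the whitelist entry needs another look; deciding which still takes manual review.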
Take a look at the whitelist hit log for sc.surbl.org and tell me how many you think are erroneous:
http://spamcheck.freeapp.net/whitelist-hits.new.log
I see approximately zero. :-)
Does anyone have any ideas, research, etc. into this?
You're already using good ideas like "age of registration", and if this data isn't available (see *.whois.rfc-ignorant.org) it is their problem, treat it as "registered yesterday".
Yes, domain age is a good one. Most professional spammers register many fresh domains every day, use them for a few days at most, then change to another.
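The domain-age heuristic, including the suggestion to treat missing registration data as "registered yesterday", can be sketched as follows. The function names and the 7-day `max_age_days` cutoff are hypothetical values chosen for illustration, and the registration date is assumed to have been obtained already by some other means (no whois lookup is shown).

```python
from datetime import date, timedelta

def domain_age_days(registered, today):
    """Age of a domain in days.

    registered: a date, or None when no registration data is available.
    Missing data is treated as 'registered yesterday'.
    """
    if registered is None:
        registered = today - timedelta(days=1)
    return (today - registered).days

def looks_fresh(registered, today, max_age_days=7):
    """True if the domain is at most max_age_days old (illustrative cutoff)."""
    return domain_age_days(registered, today) <= max_age_days

today = date(2004, 9, 8)
print(looks_fresh(date(2004, 9, 6), today))   # True: two days old
print(looks_fresh(date(1995, 1, 18), today))  # False: years old
print(looks_fresh(None, today))               # True: unknown, so treated as new
```

A fresh registration is only a hint, since plenty of legitimate domains are new too, so a check like this would feed into a judgement and not replace it.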
In grey cases, we must sometimes apply some judgement in order to prevent false positives.
Sure, but that judgement should consider the source (that is, the zone) of the data. SC and SC.SURBL.ORG are not exactly the same as OB or WS. Minus obvious errors, abuses, and bugs, SC.SURBL.ORG is designed to run on auto-pilot.
Whitelisting is still needed across all SURBLs. Otherwise things like yahoo.com or ebay.com could get added.
we should be free to list any part of an organization that is mostly spammy, however, even if other parts are not.
Indeed, and being in TLD .edu, being hosted by Schlund, or having a NYSE ticker symbol has nothing to do with spam vs. ham. Anybody can be hit by an idiot spammer in his own domain, so what? As soon as the problem is solved the listings expire.
Yes, we do not whitelist every domain that Schlund ever registers. Any individual domain hosted at Schlund or any .edu domain can get listed if they spam.
The point about Schlund is that we should not consider them a blackhat registrar because they have a few abusers. That fact does not stop us from listing their customers' domains.
The registrar information was meant to be a little extra hint about whether a domain is spammy. There are some registrars that seem to register a lot of spam domains. Sometimes that can be a signal that a domain is spammy. Sometimes it is not.
Perhaps my obvious errors are not the same as your obvious errors. ;-)
Not sure. My definition of "obvious error" for the SC zone would be "I'd report it as innocent bystander to deputies@".
That's fine, but reporting innocent bystanders does not take them off any SURBL lists. Only whitelisting or taking them out of the source data can do that.
If your definition is very different, and if the reason for this difference is related to other SURBL zones, then maybe one general whitelist covering all zones is not good enough.
I disagree. If a domain is legit, we whitelist. Otherwise we allow them to get listed. It doesn't matter what the list is.
[rogue nations]
I assume most people are aware that many of the professional spammer sites seem to be hosted in China, Brazil and Korea, and that they continue to be hosted there. Therefore we can assume any anti-spam laws or abuse policies are not being enforced there.
TTBOMK that's no longer true for Korea. They have some kind of "anti-spam" law, it predates CAN-SPAM, and is not really worse.
Some ISPs and registrars are "rogue" (e.g. SpamCast, ChinaNet, DirectI), many are clueless or ignorant, but it's not related to "nations". Unless you're prepared to identify the U.S. as the top spammer nation of the known universe. ;-)
Most legitimate US or European ISPs will shut down spam sites or spam senders. One important point of SURBLs is to be able to catch spam sites that *don't* get shut down. It really doesn't matter where they are. All that matters is that there *are* hosts that allow them to stay up. Those we need to catch. And we do.
Most of the SpamCop reports get into sc.surbl.org.
That's good. Use the rest, which doesn't make it, to check your whitelists and automatic procedures.
I believe the question of whitelist hits is fully answered by looking at the actual ones:
http://spamcheck.freeapp.net/whitelist-hits.new.log
But I also agree that we should review whitelist hits to make sure they're legitimate. We're quite careful about what goes onto our whitelists in the first place, so it should not be a major problem.
When I re-write my data engine, it will handle all the lists in a consistent manner and we should be able to get better reporting across all lists about new additions, new whitelist hits, etc.
Jeff C.