[SURBL-Discuss] Whitelist Please

Thu Sep 9 04:18:12 CEST 2004

On Wednesday, September 8, 2004, 7:13:36 AM, Frank Ellermann wrote:
> Jeff Chan wrote:

>> there must be some form of feedback or error correction,
>> or other strategies to deal with misclassifications.

>> Whitelisting is one strategy.

> ACK, but where and as far as possible I'd prefer a technical
> definition like the "BI" (Breidbarth Index) in Usenet.

Here's a definition (note there is no H in the name):

  http://www.stopspam.org/usenet/mmf/breidbart.html

"The BI is a measure of how spammy a spammed news article is. It
is the sum of the square root of the number of groups each copy
of a spam article is posted to. So if you post 10 copies of an
article, each cross-posted to 4 groups, the BI is 20. Other ways
of reaching the BI=20 mark (a threshhold used by some cancellers)
is to post 20 copies, each to just one group, 4 copies to 25
groups each, or 8 articles to 6 groups each and one more to just
one group. (for BI=20.6)"
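
To check the arithmetic in that quote, here is a minimal sketch
of the BI as the page defines it: the sum, over every posted
copy, of the square root of the number of groups that copy was
cross-posted to.  The function name and the list-of-counts input
are my own choices for illustration, not something from the
page.

```python
import math

def breidbart_index(groups_per_copy):
    """Sum of sqrt(group count) over each posted copy of an article.

    groups_per_copy: one integer per copy, giving the number of
    newsgroups that copy was cross-posted to (hypothetical input shape).
    """
    return sum(math.sqrt(n) for n in groups_per_copy)

# The four cases from the quoted definition.
print(breidbart_index([4] * 10))                 # 10 copies, 4 groups each
print(breidbart_index([1] * 20))                 # 20 copies, 1 group each
print(breidbart_index([25] * 4))                 # 4 copies, 25 groups each
print(round(breidbart_index([6] * 8 + [1]), 1))  # 8 copies to 6 groups, plus 1 to one group
```

The first three cases come out at exactly 20, and the last one
rounds to 20.6, matching the quoted text.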

It's interesting, but probably does not apply directly to mail
spam.  I suppose we could ask how often a domain appears on
multiple SURBLs, but some of the SURBL data feeds are unitary,
i.e. we can't see how many reports went into a listing, only
whether a domain is listed or not.

This sort of idea could perhaps be useful for categorizing
spamtrap data, however, especially across multiple spamtraps.

But I think your complaint is that there are no objective
criteria for whitelisting.  That's fair, but some subjective
judgement must always be applied, especially when we can't
see the entire universe of mail spam in the same way that
the entire universe of Usenet spam *is easily visible*.

It's also definitely not the case that we can see the entire
mail ham universe, so there really can't be a generally knowable
measure of the spam/ham ratio of a given domain.

This is somewhat a question of philosophy and science: to
know what is knowable and what is not, i.e. epistemology.

Since spammyness versus legitimacy is not easily measured
purely objectively, we must reserve the right to make
judgements.

If you have a BI or something similar for *mail* spam, then
please share it.

> E.g. whitelisting TLD .edu is almost the same bad idea as
> blacklisting TLD .biz.

Not to worry, neither is going to happen.  Such things would be
too powerful and probably unnecessary.  Our focus is more on
the spammyness of individual domains.

>> Another is trying to get enough spam reports or even trapped
>> spam to be able to get some meaningful statistical impression
>> about spammyness.  If 1000 people report a domain as spammy,
>> it probably is.  If only 1 person says it's spammy it may be
>> less likely.

> You could combine these strategies using the SC input:  If the
> SC data matches whitelisted domains, then something is wrong:

> Either the domain shouldn't be whitelisted w.r.t. the SC zone.
> or it should be reported as "IB link" (innocent bystander) to
> deputies at admin.spamcop

That's fine, but reporting IB to SpamCop does not take them
out of sc.surbl.org.  That still must be done on our side.

In fact we should probably also be reporting whitelist hits back
to SpamCop as innocent bystanders.   The actual number of
meaningful whitelist hits is much smaller than you may be
assuming.

The feed from SpamCop into sc.surbl.org is one-way: from
them to us.

> Both cases require some manual intervention, unfortunately, but
> at least you would catch erroneous WL entries.

Take a look at the whitelist hit log for sc.surbl.org and
tell me how many you think are erroneous:

  http://spamcheck.freeapp.net/whitelist-hits.new.log

I see approximately zero.  :-)

>> Does anyone have any ideas, research, etc. into this?

> You're already using good ideas like "age of registration", and
> if this data isn't available (see *.whois.rfc-ignorant.org) it
> is their problem, treat it as "registered yesterday".

Yes, domain age is a good one.  Most professional spammers
register many fresh domains every day, use them for a few
days at most, then change to another.
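
A rough sketch of how domain age could be turned into a hint,
including Frank's suggestion of treating missing registration
data as "registered yesterday".  The function names and the
7-day cutoff are assumptions made up for illustration; they are
not what the SURBL data engine actually does.

```python
from datetime import date, timedelta

def domain_age_days(registered_on, today):
    """Age of a domain in days.

    registered_on is a date, or None when no registration date could be
    obtained; in that case the domain is treated as registered yesterday.
    """
    if registered_on is None:
        registered_on = today - timedelta(days=1)
    return (today - registered_on).days

def looks_fresh(registered_on, today, max_age_days=7):
    """True if the domain is young enough to count as a spam hint.

    max_age_days is an arbitrary illustrative cutoff, not a SURBL setting.
    """
    return domain_age_days(registered_on, today) <= max_age_days

today = date(2004, 9, 9)
print(looks_fresh(date(2004, 9, 5), today))  # registered four days ago
print(looks_fresh(date(2000, 1, 1), today))  # long-established domain
print(looks_fresh(None, today))              # no whois data available
```

A fresh registration is only a hint, of course, to be weighed
together with the other signals.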

>> In grey cases, we must sometime apply some judgement in order
>> to prevent false positives.

> Sure, but that judgement should consider the source resp. zone
> of the data.  SC and SC.SURBL.ORG are not exactly the same as
> OB or WS.  Minus obvious errors, abuses, and bugs SC.SURBL.ORG
> is designed to run on auto-pilot.

Whitelisting is still needed across all SURBLs.  Otherwise
things like yahoo.com or ebay.com could get added.
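
As an illustration only, applying one whitelist to every list
and keeping a log of the hits might look like the sketch below.
The function name, the data layout and the example spam domain
are hypothetical; this is not the actual data engine.

```python
def apply_whitelist(candidates_by_list, whitelist):
    """Drop whitelisted domains from every list and record each hit.

    candidates_by_list: dict mapping a list name (e.g. "sc", "ws", "ob")
    to the domains proposed for listing there (assumed layout).
    whitelist: set of lower-case domains that must never be listed.
    Returns (kept, hits): the filtered lists, plus (list_name, domain)
    pairs for every whitelist hit so they can be reviewed later.
    """
    kept = {}
    hits = []
    for list_name, domains in candidates_by_list.items():
        kept[list_name] = []
        for domain in domains:
            d = domain.lower().rstrip(".")  # normalize case and trailing dot
            if d in whitelist:
                hits.append((list_name, d))
            else:
                kept[list_name].append(d)
    return kept, hits

whitelist = {"yahoo.com", "ebay.com"}
candidates = {
    "sc": ["Yahoo.com", "spam.example"],
    "ws": ["ebay.com."],
}
kept, hits = apply_whitelist(candidates, whitelist)
print(kept)
print(hits)
```

The same whitelist is consulted no matter which list a domain
was headed for, and the hits are kept so they can be reviewed.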

>> we should be free to list any part of an organization that
>> is mostly spammy, however, even if other parts are not.

> Indeed, and TLD .edu, or hosted by Schlund, or a NYSE ticker
> symbol have nothing to do with spam vs. ham.  Anybody can be
> hit by an idiot spammer in his own domain, so what ?  As soon
> as the problem is solved the listings expire.

Yes, we do not whitelist every domain that Schlund ever
registers.  Any individual domain hosted at Schlund or
any .edu domain can get listed if it is used in spam.

The point about Schlund is that we should not consider
them a blackhat registrar because they have a few abusers.
That fact does not stop us from listing their customers'
domains.

The registrar information was meant to be a little extra
hint about whether a domain is spammy.  There are some
registrars that seem to register a lot of spam domains.
Sometimes that can be a signal that a domain is spammy.
Sometimes it is not.

>> Perhaps my obvious errors are not the same as your obvious
>> errors.  ;-)

> Not sure.  My definition of "obvious error" for the SC zone
> would be "I'd report it as innocent bystander to deputies@".

That's fine, but reporting innocent bystanders does not take
them off any SURBL lists.  Only whitelisting or taking them
out of the source data can do that.

> If your definition is very different, and if the reason for
> this difference is related to other SURBL zones, then maybe
> one general whitelist covering all zones is not good enough.

I disagree.  If a domain is legit, we whitelist.  Otherwise
we allow them to get listed.  It doesn't matter what the list is.

>  [rogue nations] 
>> I assume most people are aware that many of the professional
>> spammer sites seem to be hosted in China, Brazil and Korea,
>> and that they continue to do so.  Therefore we can assume
>> any anti-spam laws or abuse policies are not being enforced
>> there.

> TTBOMK that's not more true for Korea.  They have some kind of
> "anti-spam" law, it predates CAN-SPAM, and is not really worse.

> Some ISPs and registrars are "rogue" (e.g. SpamCast, ChinaNet,
> DirectI), many are clueless or ignorant, but it's not related
> to "nations".  Unless you're prepared to identify the U.S. as
> the top spammer nation of the known universe. ;-)

Most legitimate US or European ISPs will shut down spam sites
or spam senders.  One important point of SURBLs is to be able
to catch spam sites that *don't* get shut down.  It really
doesn't matter where they are.  All that matters is that
there *are* hosts that allow them to stay up.  Those we
need to catch.  And we do.

>> Most of the SpamCop reports get into sc.surbl.org.

> That's good.  Use the rest which doesn't make it to check your
> whitelists and automatical procedures.

I believe the question of whitelist hits is fully answered
by looking at the actual ones:

  http://spamcheck.freeapp.net/whitelist-hits.new.log

But I also agree that we should review whitelist hits to
make sure they're legitimate.  We're quite careful about
what goes onto our whitelists in the first place, so it
should not be a major problem.

When I re-write my data engine, it will handle all the
lists in a consistent manner and we should be able to get
better reporting across all lists about new additions,
new whitelist hits, etc.

Jeff C.