-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jeff Chan writes:
On Wednesday, September 8, 2004, 7:13:36 AM, Frank Ellermann wrote:
Jeff Chan wrote:
there must be some form of feedback or error correction, or other strategies to deal with misclassifications.
Whitelisting is one strategy.
ACK, but where and as far as possible I'd prefer a technical definition like the "BI" (Breidbarth Index) in Usenet.
Here's a definition (note there is no H in the name):
http://www.stopspam.org/usenet/mmf/breidbart.html
"The BI is a measure of how spammy a spammed news article is. It is the sum of the square root of the number of groups each copy of a spam article is posted to. So if you post 10 copies of an article, each cross-posted to 4 groups, the BI is 20. Other ways of reaching the BI=20 mark (a threshhold used by some cancellers) is to post 20 copies, each to just one group, 4 copies to 25 groups each, or 8 articles to 6 groups each and one more to just one group. (for BI=20.6)"
It's interesting, but probably does not apply directly to mail spam. I suppose we could ask how often a domain appears on multiple SURBLs, but some of the SURBL data feeds are unitary, i.e. we can't see how many reports went into a listing, only whether a domain is listed or not.
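Going back to the quoted definition for a moment, its arithmetic is easy to check with a few lines of Python. This is only a sketch: the function name `breidbart_index` and its input (a list holding the number of groups each copy was posted to) are made up here for illustration and are not taken from any canceller's actual tooling.

```python
import math

def breidbart_index(groups_per_copy):
    """Sum of sqrt(number of groups) over every posted copy.

    groups_per_copy is a hypothetical input format: one integer per
    copy of the article, giving how many groups that copy went to.
    """
    return sum(math.sqrt(n) for n in groups_per_copy)

# The examples from the quoted definition:
print(breidbart_index([4] * 10))                  # 10 copies x 4 groups
print(breidbart_index([1] * 20))                  # 20 copies x 1 group
print(breidbart_index([25] * 4))                  # 4 copies x 25 groups
print(round(breidbart_index([6] * 8 + [1]), 1))   # 8 x 6 groups, plus 1 x 1 group
```

The first three cases come out at 20 and the last at about 20.6, matching the figures in the definition.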
This sort of idea could perhaps be useful for categorizing spamtrap data however, especially across multiple spamtraps.
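To make the "how many lists name this domain" idea concrete, here is a rough sketch. It assumes each feed is unitary, so all we have per feed is a set of listed domains. The feed names, the domains, and the function `lists_naming` are all hypothetical; the same shape would apply to per-spamtrap sets.

```python
def lists_naming(domain, feeds):
    """Return the sorted names of the feeds whose set contains the domain.

    feeds maps a feed name to the set of domains it lists. Because the
    feeds are treated as unitary, membership is the only signal available.
    """
    return sorted(name for name, domains in feeds.items() if domain in domains)

# Hypothetical data, for illustration only.
feeds = {
    "list_a": {"spam.example", "pills.example"},
    "list_b": {"spam.example"},
    "list_c": {"pills.example", "spam.example", "other.example"},
}

for domain in ("spam.example", "other.example", "ham.example"):
    hits = lists_naming(domain, feeds)
    print(domain, len(hits), hits)
```

A count like this says how widely a domain is listed, not how much of its mail is spam, so it would not remove the need for judgement.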
But I think your complaint is that there are no objective criteria for whitelisting. That's fair, but some subjective judgement must always be applied, especially when we can't see the entire universe of mail spam in the way that the entire universe of Usenet spam *is easily visible*.
It's also definitely not the case that we can see the entire mail ham universe, so there really can't be a generally knowable measure of the spam/ham ratio of a given domain.
This is somewhat a question of philosophy and science: to know what is knowable and what is not, i.e. epistemology.
Since spamminess versus legitimacy is not easily measured in a purely objective way, we must reserve the right to make judgements.
If you have a BI or something similar for *mail* spam, then please share it.
As a matter of interest -- and I should just ask Seth Breidbart ;) -- does the BI deal with hashbusters? I.e., if a message is 80% hashbuster strings and 20% payload, it's not so easy to automate the BI calculation (cf. DCC, Pyzor, Razor, AOL's paper at CEAS, et al.).
- --j.