In order to assist people hand-classifying spam URI domains and IPs for inclusion or non-inclusion in SURBLs, I've made a draft policy document:
http://www.surbl.org/policy.html
Please read it and post your comments.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
Please read it and post your comments.
| Don't add domains or IPs that have legitimate, non-spam uses.
NAK (known issue, JFTR).
| For IP addresses look them up in reverse octet order against
| iddb.isipp.com .
s/iddb/iadb2/
| check them against iadb.isipp.com or wadb.isipp.com
s/wadb/iadb2/. WA is "withdrawn accreditation" (= bulk mailer decided to break the IADB rules); it's a kind of blacklist. You could mention WA elsewhere, e.g. together with Spamhaus.
| Visit the site or at least
Better remove this, it's too dangerous for the kids, and it can be misleading without JavaScript. If you need more interesting sources, you could add whois.sc (and maybe A9.com (?))
| 13.Apply common sense
ACK, much better than 2.
| but which other people might consider legitimate. This can
| include sites like topica, yahoogroups, joke-of-the-day, and
| similar things that people actually subscribe to. Do not list
| them, even if they get abused for spam.
NAK. Nobody knows what "other people might consider", let alone agrees with it blindly. That clause makes no sense, and it devalues the important first part before the "but".
Anything else is fine, but a bit long. "When in doubt don't list" could be added to the <title> and / or <h1> header.
Bye, Frank
On Friday, September 24, 2004, 1:32:09 PM, Frank Ellermann wrote:
| For IP addresses look them up in reverse octet order against
| iddb.isipp.com .
s/iddb/iadb2/
NAK, iddb is a domain list. Domains are resolved against this list to turn them into IP addresses, which can then be checked against the main lists (iadb, iadb2, wadb).
| check them against iadb.isipp.com or wadb.isipp.com
s/wadb/iadb2/. WA is "withdrawn accreditation" (= bulk mailer decided to break the IADB rules); it's a kind of blacklist. You could mention WA elsewhere, e.g. together with Spamhaus.
I've added iadb2 as an alternative to iadb. wadb is still useful, with the caveat you mentioned, so I copied the description of WADB.
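(For readers who haven't done these lookups by hand, here is a minimal sketch of the reversed-octet DNSBL query convention being discussed. The zone names are the ones mentioned in this thread; the example IP is a documentation address, not a real listing.)

    # Minimal sketch of a reversed-octet DNSBL lookup using only the Python
    # standard library. Zone names are those mentioned above; the example
    # IP (192.0.2.1) is a documentation address, purely hypothetical.
    import socket

    def dnsbl_lookup(ip, zone):
        """Return the A record for <reversed-ip>.<zone>, or None if unlisted."""
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            return socket.gethostbyname(query)   # e.g. "127.0.0.2" when listed
        except socket.gaierror:
            return None                          # NXDOMAIN -> not listed

    for zone in ("iadb.isipp.com", "iadb2.isipp.com", "wadb.isipp.com"):
        print(zone, dnsbl_lookup("192.0.2.1", zone))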
| Visit the site or at least
Better remove this, it's too dangerous for the kids, and it can be misleading without JavaScript. If you need more interesting sources, you could add whois.sc (and maybe A9.com (?))
True, visiting sites can sometimes be dangerous, so I added:
(I usually use Google's cache of the site, or a text browser like lynx. This is somewhat safer than using a full browser to go to a site, which could contain malicious code. Viewing Google summaries is often good enough.)
| 13.Apply common sense
ACK, much better than 2.
| but which other people might consider legitimate. This can
| include sites like topica, yahoogroups, joke-of-the-day, and
| similar things that people actually subscribe to. Do not list
| them, even if they get abused for spam.
NAK. Nobody knows what "other people might consider", let alone agrees with it blindly. That clause makes no sense, and it devalues the important first part before the "but".
In this case we need to try to consider what other people may use. It can be difficult, but not impossible. Anyone who works at an ISP or in an IT department, visits chatrooms, or knows novice Internet users, friends, relatives, etc. is probably aware of at least some of these kinds of sites.
Strictly speaking these may not always be personally knowable, but it's more of an external social or cultural awareness.
Anything else is fine, but a bit long. "When in doubt don't list" could be added to the <title> and / or <h1> header.
Bye, Frank
Thanks as always Frank,
Jeff C. -- "If it appears in hams, then don't list it."
On Friday, September 24, 2004, 6:24:54 PM, Jeff Chan wrote:
On Friday, September 24, 2004, 1:32:09 PM, Frank Ellermann wrote:
| For IP addresses look them up in reverse octet order against
| iddb.isipp.com .
s/iddb/iadb2/
NAK, iddb is a domain list. Domains are resolved against this list to turn them into IP addresses, which can then be checked against the main lists (iadb, iadb2, wadb).
Oops, I see you were referring to the first mention of iddb, which is indeed about IP addresses and should be checked against iadb.
You're right. :-)
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discuss:
In order to assist people hand-classifying spam URI domains and IPs for inclusion or non-inclusion in SURBLs, I've made a draft policy document:
Good, although I think there are a few redundant points, and it would read better in a top-down priority format, "firing your biggest guns first".
To me, the main points are (using roman numerals so as not to confuse your numbering system):
i) Add domains that appear *only* in spam. Do not add any domains that appear in ham.
ii) Beware of poisoning/joe job attempts; not every domain that appears in spam belongs to a spammer!
iii) Use these important sources of information as additional input: (List the IADB2, whois, etc., in decreasing order of usefulness)
The "not your personal blocklist" point (14), and "common sense" (13) are good points that I think are deserving of discussion in paragraph form beneath the "main points". They're not "criteria", per se, but should definitely be mentioned.
After the list of main points is, first, clearly defined and, second, *lightly* expanded upon (remember, we want to make sure people get the main points!), you can include the more general discussion from some of your points further down the page in paragraph format. Seeing a numbered list of more than 5-6 items raises some questions for me, indicating that perhaps the big picture could be lost on some people (especially those newcomers just learning of the SURBL policies).
So, in brief, what I'm suggesting is just a bit of restructuring to make the main points clearer, while still providing the detailed information you already have in the document. IMO, you've done a fine job with the information.
Hope this helps,
- Ryan
http://www.surbl.org/policy.html
Please read it and post your comments.
Jeff C.
"If it appears in hams, then don't list it."
On Friday, September 24, 2004, 5:49:08 PM, Ryan Thompson wrote:
To me, the main points are (using roman numerals so as not to confuse your numbering system):
i) Add domains that appear *only* in spam. Do not add any domains that appear in ham.
ii) Beware of poisoning/joe job attempts; not every domain that appears in spam belongs to a spammer!
iii) Use these important sources of information as additional input: (List the IADB2, whois, etc., in decreasing order of usefulness)
The "not your personal blocklist" point (14), and "common sense" (13) are good points that I think are deserving of discussion in paragraph form beneath the "main points". They're not "criteria", per se, but should definitely be mentioned.
After the list of main points is, first, clearly defined and, second, *lightly* expanded upon (remember, we want to make sure people get the main points!), you can include the more general discussion from some of your points further down the page in paragraph format. Seeing a numbered list of more than 5-6 items raises some questions for me, indicating that perhaps the big picture could be lost on some people (especially those newcomers just learning of the SURBL policies).
So, in brief, what I'm suggesting is just a bit of restructuring to make the main points clearer, while still providing the detailed information you already have in the document. IMO, you've done a fine job with the information.
All good points. Let me re-organize....
Jeff C. -- "If it appears in hams, then don't list it."
OK, I updated the policy page, incorporating Ryan's top rules and general organizational comments:
http://www.surbl.org/policy.html
Please let me/us know what you think of it now.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to Jeff Chan:
OK, I updated the policy page, incorporating Ryan's top rules and general organizational comments:
http://www.surbl.org/policy.html
Please let me/us know what you think of it now.
Hi Jeff,
Aha! I like it very much. I suspect it will still evolve a bit--as most good things do--but it gets the point across, and also provides a lot of good, useful information that will assist human classifiers in listing (only) the spammiest domains.
On a related note, do we want to say anything in this document (or possibly another document) about whitelisting criteria? There are really three main categories:
1. Blacklist material (that's what your policy addresses very well)
1.5. "Almost" blacklist material (the grey ones); ala the "UC" list, are the domains that are almost totally spammers, but may have a few borderline uses
2. Domains that should not be listed, but are not necessarily of "whitelist" merit. These are mostly the domains where insufficient data (or effort) exists to make a determination, which, for good or for ill, is where the bulk of our human efforts are currently focused.
3. Domains that are white; i.e., have definite legitimate uses
OK, that's four. If we really want to reduce FPs, we need to carefully consider *all* of these categories when analysing potential domains. I spend just as much time pulling domains out of ham as I do pulling domains out of spam.
The distinction between 2 and 3 is almost as difficult as the distinction between 1 and 2 sometimes.
- Ryan
On Friday, September 24, 2004, 11:24:57 PM, Ryan Thompson wrote:
do we want to say anything in this document (or possibly another document) about whitelisting criteria? There are really three main categories:
- Blacklist material (that's what your policy addresses very well)
1.5. "Almost" blacklist material (the grey ones); ala the "UC" list, are the domains that are almost totally spammers, but may have a few borderline uses
- Domains that should not be listed, but are not necessarily of "whitelist" merit. These are mostly the domains where insufficient data (or effort) exists to make a determination, which, for good or for ill, is where the bulk of our human efforts are currently focused.
- Domains that are white; i.e., have definite legitimate uses
OK, that's four. If we really want to reduce FPs, we need to carefully consider *all* of these categories when analysing potential domains. I spend just as much time pulling domains out of ham as I do pulling domains out of spam.
The distinction between 2 and 3 is almost as difficult as the distinction between 1 and 2 sometimes.
- Ryan
I agree with 1 and 3, but another way to look at the undecided middle ground might be to say that if a domain or IP has not proven to be blacklist material and has not been falsely listed and therefore in need of whitelisting, then it perhaps can be ignored until it gets into category 1 or 3.
I know that goes against the feelings of people who want to catch every spam, and I understand that feeling myself, but in *practical terms* it may be a *useful* solution.
Yes, that misses some marginal and probable spammers, but it lets us focus on the first category which are probably the most important to find in terms of the volume of spam they produce. The others can consume a lot of time and effort without producing the level of performance that catching the *major* spammers in the first category can.
I realize you guys are trying to sort out some of the stuff in the middle and I understand some of the reasons for wanting to do it, but I think working on the more clear cases gets us the most results for our efforts.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discussion list:
On Friday, September 24, 2004, 11:24:57 PM, Ryan Thompson wrote:
do we want to say anything in this document (or possibly another document) about whitelisting criteria? There are really three main categories:
- Blacklist material (that's what your policy addresses very well)
1.5. "Almost" blacklist material (the grey ones); ala the "UC" list, are 2. Domains that should not be listed, but are not necessarily of 3. Domains that are white; i.e., have definite legitimate uses
OK, that's four. If we really want to reduce FPs, we need to carefully consider *all* of these categories when analysing potential domains. I spend just as much time pulling domains out of ham as I do pulling domains out of spam.
Hi Jeff,
I agree with 1 and 3, but another way to look at the undecided middle ground might be to say that if a domain or IP has not proven to be blacklist material and has not been falsely listed and therefore in need of whitelisting, then it perhaps can be ignored until it gets into category 1 or 3.
I know that goes against the feelings of people who want to catch every spam, and I understand that feeling myself, but in *practical terms* it may be a *useful* solution.
Yes, that misses some marginal and probable spammers, but it lets us focus on the first category which are probably the most important to find in terms of the volume of spam they produce. The others can consume a lot of time and effort without producing the level of performance that catching the *major* spammers in the first category can.
I realize you guys are trying to sort out some of the stuff in the middle and I understand some of the reasons for wanting to do it, but I think working on the more clear cases gets us the most results for our efforts.
Well, suffice to say, I don't want to open up the "grey" can of worms again! I just wanted to identify the major categories which, in real life, we submitters are actually dealing with on a daily basis. :-)
I wrote:
The distinction between 2 and 3 is almost as difficult as the distinction between 1 and 2 sometimes.
Meaning, whitelisting is usually just about as difficult as blacklisting.
- Ryan
On Saturday, September 25, 2004, 9:42:35 AM, Ryan Thompson wrote:
Meaning, whitelisting is usually just about as difficult as blacklisting.
Whitelisting is sometimes harder than blocklisting. Most pure spams are extremely obvious. We've all seen the many nearly identical pill, mortgage, and warez spams, right? Those ones are clearly spams and easy to blocklist.
There are some legitimate sites mentioned in spammy-looking mail that are harder to identify as legitimate, like those that appear in stock newsletters, joke-of-the-day mailings, mailing lists, and the like.
Those require more research to find out whether the reporter forgot they were subscribed, whether the domain belongs to a spam gang, whether there is a Joe Job going on, or any number of other factors. But the decision needs to be made if we are to prevent or fix false positives.
The decision to whitelist is often difficult and usually requires at least some research. Fortunately some of our research tools like GetURI and others help quite a bit, but classification still requires human judgement and effort.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discussion list:
The decision to whitelist is often difficult and usually requires at least some research. Fortunately some of our research tools like GetURI and others help quite a bit,
Speaking of which, I've worked hard to convince GetURI (new version pending release!) to follow the SURBL inclusion criteria pretty closely; I've added SBL lookups on the forward IP(s) and nameservers, as well as IADB2 and WADB checks on the IP(s), although the IADB2/WADB checks rarely hit. The SBL lookups are extremely useful. And, of course, GetURI has had the --age option for a while now.
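(For anyone curious what those lookups involve, here is a rough sketch of that style of check: resolve the domain's forward IPs and nameserver IPs, then test each against SBL. This is not GetURI's actual code; it assumes the dnspython package is available.)

    # Rough sketch of the style of check described above: forward IP(s) and
    # nameserver IP(s) tested against SBL. Not GetURI's actual code; assumes
    # the dnspython package (dns.resolver).
    import dns.resolver

    def sbl_listed(ip, zone="sbl.spamhaus.org"):
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            dns.resolver.resolve(query, "A")
            return True
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False

    def sbl_hits(domain):
        hits = []
        for rr in dns.resolver.resolve(domain, "A"):        # forward IP(s)
            if sbl_listed(rr.address):
                hits.append(("A", rr.address))
        for ns in dns.resolver.resolve(domain, "NS"):       # nameserver IP(s)
            for rr in dns.resolver.resolve(str(ns.target), "A"):
                if sbl_listed(rr.address):
                    hits.append(("NS", str(ns.target), rr.address))
        return hits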
Here's what the output looks like now (this took 57s for ~900 messages, even with the great number of DNS queries needed to process the 116 domains not found in SURBL):
http://ry.ca/geturi/public/criteriatest.html (62K)
Feedback welcome!
These features are in the current development version (not available to the public, yet, sorry), but, once testing is complete, there'll be a new release.
but classification still requires human judgement and effort.
Agreed! Hopefully GetURI can reduce the human effort, so humans have more energy for judgement. :-)
- Ryan
On Saturday, September 25, 2004, 8:58:24 PM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discussion list:
The decision to whitelist is often difficult and usually requires at least some research. Fortunately some of our research tools like GetURI and others help quite a bit,
Speaking of which, I've worked hard to convince GetURI (new version pending release!) to follow the SURBL inclusion criteria pretty closely; I've added SBL lookups on the forward IP(s) and nameservers, as well as IADB2 and WADB checks on the IP(s), although the IADB2/WADB checks rarely hit. The SBL lookups are extremely useful. And, of course, GetURI has had the --age option for a while now.
Here's what the output looks like now (this took 57s for ~900 messages, even with the great number of DNS queries needed to process the 116 domains not found in SURBL):
http://ry.ca/geturi/public/criteriatest.html (62K)
Feedback welcome!
OK. It might help to have a legend, especially for people not familiar with the output. I assume the domains in white are the grey (uncertain) ones, and the ones in grey are the whitelisted ones. (A little ironic, eh?)
Jeff C. -- "If it appears in hams, then don't list it."
On Saturday, September 25, 2004, 9:38:00 PM, Jeff Chan wrote:
OK. It might help to have a legend, especially for people not familiar with the output. I assume the domains in white are the grey (uncertain) ones, and the ones in grey are the whitelisted ones. (A little ironic, eh?)
Or for that matter, why not make the whitelisted ones in white and the uncertain ones in grey..... Hmmm....
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discuss:
http://ry.ca/geturi/public/criteriatest.html (62K)
Feedback welcome!
OK. It might help to have a legend, especially for people not familiar with the output. I assume the domains in white are the grey (uncertain) ones, and the ones in grey are the whitelisted ones. (A little ironic, eh?)
Heh. Yeah, good point. In my mind, I never really associated the colours with the "white/grey/black" states of domains. Maybe I should give some thought to flipping those colours around, eh? :-) And, yes, before the next official release, there will be a "legend" of sorts; more on-line documentation, in other words, and some improvements to the output format itself to make it more readable.
Thanks for the feedback, Jeff!
- Ryan
Ryan Thompson wrote to Jeff Chan and SURBL Discussion list:
Thanks for the feedback, Jeff!
Hey everybody,
Does this look better? http://ry.ca/geturi/results.html
There are many improvements to the output. Even *I'm* impressed.
It's getting close to feature freeze/release time again, methinks, to put these improvements into a stable release. So, if anyone has anything they'd like to see right away, please speak now. :-)
- Ryan
Ryan Thompson wrote:
Ryan Thompson wrote to Jeff Chan and SURBL Discussion list:
Thanks for the feedback, Jeff!
Hey everybody,
Does this look better? http://ry.ca/geturi/results.html
There are many improvements to the output. Even *I'm* impressed.
It's getting close to feature freeze/release time again, methinks, to put these improvements into a stable release. So, if anyone has anything they'd like to see right away, please speak now. :-)
Great!
Why not add Spamhaus' XBL zone lookups?
Alex
Alex Broens wrote to SURBL Discussion list:
Ryan Thompson wrote:
Ryan Thompson wrote to Jeff Chan and SURBL Discussion list:
Thanks for the feedback, Jeff!
Hey everybody,
Does this look better? http://ry.ca/geturi/results.html
There are many improvements to the output. Even *I'm* impressed.
It's getting close to feature freeze/release time again, methinks, to put these improvements into a stable release. So, if anyone has anything they'd like to see right away, please speak now. :-)
Great!
Thanks!
Why not add Spamhaus' XBL zone lookups?
The thought had only briefly crossed my mind. Is XBL really a good resource for SURBL classification? I thought XBL just listed exploited systems and open HTTP proxies. Hmm. I suppose I could just code it up and run it on a bunch of mail to see what happens... Or I could just use the combined sbl-xbl.spamhaus.org list, I suppose.
- Ryan
Ryan Thompson wrote:
Alex Broens wrote to SURBL Discussion list:
Ryan Thompson wrote:
Ryan Thompson wrote to Jeff Chan and SURBL Discussion list:
Thanks for the feedback, Jeff!
Hey everybody,
Does this look better? http://ry.ca/geturi/results.html
There are many improvements to the output. Even *I'm* impressed.
It's getting close to feature freeze/release time again, methinks, to put these improvements into a stable release. So, if anyone has anything they'd like to see right away, please speak now. :-)
Great!
Thanks!
Why not add Spamhaus' XBL zone lookups?
The thought had only briefly crossed my mind. Is XBL really a good resource for SURBL classification? I thought XBL just listed exploited systems and open HTTP proxies. Hmm. I suppose I could just code it up and run it on a bunch of mail to see what happens... Or I could just use the combined sbl-xbl.spamhaus.org list, I suppose.
If a spammy-looking msg comes thru an exploited system, IMO it would qualify even more for SURBL inclusion, as a genuine "marketer" would not be expected to use exploited machines, right? (silently waiting for Jeff to bark at me :-)
Keeping the lookups separate would give us a bit more detail to evaluate.
Alex
On Sunday, September 26, 2004, 2:16:45 AM, Alex Broens wrote:
If a spammy-looking msg comes thru an exploited system, IMO it would qualify even more for SURBL inclusion, as a genuine "marketer" would not be expected to use exploited machines, right?
That's definitely true, and one of the things I usually look for in SURBL listing candidates. (I thought you were referring to checking URI domains against XBL, which probably would not catch much.)
XBL is an excellent list of spam senders, by far the biggest catcher of spam senders in my regular RBLs, so it probably would be good as a header check for GetURI also. Ryan, can we make this a feature request?
As we mentioned earlier, zombies are a major reason for SURBLs to exist. If someone uses fixed mail senders, those are easily blocked using regular RBLs. SURBLs are largely a response to zombies: without consistent mail senders to look for, content, specifically spam-advertised web sites, was the next logical thing to check, IMO.
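(To make the distinction concrete, here is a deliberately naive sketch: the relay IPs pulled from Received headers are what XBL speaks to, while the spamvertised domain's hosting IP is what SBL speaks to. The Received parsing below is a simplification; real tools such as SpamAssassin do far more careful trust-path analysis.)

    # Deliberately naive sketch of the two different checks discussed above.
    # Relay IPs from Received headers -> XBL; the spamvertised domain's
    # hosting IP -> SBL. Real tools do this far more carefully.
    import re, socket

    def listed(ip, zone):
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)
            return True
        except socket.gaierror:
            return False

    def relay_ips(raw_headers):
        # Grab bracketed IPv4 literals from Received: lines -- a simplification.
        return re.findall(r"Received:.*?\[(\d{1,3}(?:\.\d{1,3}){3})\]", raw_headers)

    # Header check (zombies / open proxies that sent the mail):
    #   any(listed(ip, "xbl.spamhaus.org") for ip in relay_ips(headers))
    # URI check (where the spamvertised site is hosted):
    #   listed(socket.gethostbyname("example.com"), "sbl.spamhaus.org")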
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discuss:
On Sunday, September 26, 2004, 2:16:45 AM, Alex Broens wrote:
If a spammy-looking msg comes thru an exploited system, IMO it would qualify even more for SURBL inclusion, as a genuine "marketer" would not be expected to use exploited machines, right?
That's definitely true, and one of the things I usually look for in SURBL listing candidates. (I thought you were referring to checking URI domains against XBL, which probably would not catch much.)
XBL is an excellent list of spam senders, by far the biggest catcher of spam senders in my regular RBLs, so it probably would be good as a header check for GetURI also. Ryan, can we make this a feature request?
Sure. Now it's making sense. :-) Fortunately, adding header checks will be easy, because I'm already using the SpamAssassin engine.
- Ryan
Ryan Thompson wrote to SURBL Discussion list:
XBL is an excellent list of spam senders, by far the biggest catcher of spam senders in my regular RBLs, so it probably would be good as a header check for GetURI also. Ryan, can we make this a feature request?
Sure. Now it's making sense. :-) Fortunately, adding header checks will be easy, because I'm already using the SpamAssassin engine.
OK, I've tried this, but it slows down the runs considerably. My 2K test corpus had 54 RCVD_IN_XBL hits, but for some reason *none* of those messages contained domains that were not already listed in SURBL. The run took 26 minutes, instead of the usual 2-3m for the 2K corpus.
Then, I used the new --surbl=hostname option to check against WS only (instead of the default multi), and found only 2/381 (0.5%) domains spamvertised by an XBL-listed host.
Hmm. Then I fed the --surbl option a local "dummy" SURBL list containing only test entries, effectively disabling the SURBL filter in GetURI, and got 52/3130 (1.6%) domains whose message was RCVD_IN_XBL.
So, given the low hit rate (especially in the usual case of only looking for new SURBL domains) and the tremendous amount of extra time required to do the XBL header/net test (the last run took 48 minutes, compared to ~16 minutes without the header tests), I'm going to make GetURI default to *not* doing the header checks, and let people enable them with the new --header option.
With all of these new DNS tests, network delays are now definitely the bottleneck in GetURI. Soon (not for 1.6, maybe 1.7), I think I'm going to have to go to a forked or threaded model.
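(As an illustration of why a forked or threaded model helps here: the lookups are I/O-bound, so running them concurrently hides most of the per-query latency. The sketch below is generic, not a proposal for GetURI's actual design; the domains are placeholders.)

    # Generic illustration of concurrent DNS lookups: the work is I/O-bound,
    # so a thread pool hides most of the per-query latency. Not GetURI's
    # design, just a sketch; the domains are placeholders.
    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolve(domain):
        try:
            return domain, socket.gethostbyname(domain)
        except socket.gaierror:
            return domain, None

    domains = ["example.com", "example.net", "example.org"]

    with ThreadPoolExecutor(max_workers=20) as pool:
        for domain, ip in pool.map(resolve, domains):
            print(domain, ip)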
- Ryan
On Sunday, September 26, 2004, 2:22:02 PM, Ryan Thompson wrote:
So, given the low hit rate (especially in the usual case of only looking for new SURBL domains) and the tremendous amount of extra time required to do the XBL header/net test (the last run took 48 minutes, compared to ~16 minutes without the header tests), I'm going to make GetURI default to *not* doing the header checks, and let people enable them with the new --header option.
Sounds reasonable to me. :-)
Jeff C. -- "If it appears in hams, then don't list it."
Hi!
Why not add Spamhaus' XBL zone lookups?
The thought had only briefly crossed my mind. Is XBL really a good resource for SURBL classification? I thought XBL just listed exploited systems and open HTTP proxies. Hmm. I suppose I could just code it up and run it on a bunch of mail to see what happens... Or I could just use the combined sbl-xbl.spamhaus.org list, I suppose.
I would really only use SBL. If you check against XBL you could also test against DSBL, but we want to get the hardcore non-proxy spammers. The zombies are stopped by DSBL/XBL and the like anyway. Any thoughts?
Bye, Raymond.
On Sunday, September 26, 2004, 3:39:29 AM, Raymond Dijkxhoorn wrote: (Alex wrote:)
Why not add Spamhaus' XBL zone lookups?
(Ryan replied:)
The thought had only briefly crossed my mind. Is XBL really a good resource for SURBL classification? I thought XBL just listed exploited systems and open HTTP proxies. Hmm. I suppose I could just code it up and run it on a bunch of mail to see what happens... Or I could just use the combined sbl-xbl.spamhaus.org list, I suppose.
I would really only use SBL. If you check against XBL you could also test against DSBL, but we want to get the hardcore non-proxy spammers. The zombies are stopped by DSBL/XBL and the like anyway. Any thoughts?
Yes, there was perhaps some confusion about what Alex meant in suggesting XBL. If he meant using it to check headers, then I agree it's a useful way to spot zombie and open server usage. If he meant to try XBL against spam URIs, then I agree it probably won't do much.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
On Sunday, September 26, 2004, 3:39:29 AM, Raymond Dijkxhoorn wrote: (Alex wrote:)
Why not add Spamhaus' XBL zone lookups?
(Ryan replied:)
The thought had only briefly crossed my mind. Is XBL really a good resource for SURBL classification? I thought XBL just listed exploited systems and open HTTP proxies. Hmm. I suppose I could just code it up and run it on a bunch of mail to see what happens... Or I could just use the combined sbl-xbl.spamhaus.org list, I suppose.
I would really only use SBL. If you check against XBL you could also test against DSBL, but we want to get the hardcore non-proxy spammers. The zombies are stopped by DSBL/XBL and the like anyway. Any thoughts?
Yes, there was perhaps some confusion about what Alex meant in suggesting XBL. If he meant using it to check headers, then I agree it's a useful way to spot zombie and open server usage.
yep that was the idea.....
If he meant to try XBL against spam URIs, then I agree it probably won't do much.
naaaaaaa... did I say that? :)
Alex
On Sunday, September 26, 2004, 1:46:03 AM, Alex Broens wrote:
Ryan Thompson wrote:
Hey everybody,
Does this look better? http://ry.ca/geturi/results.html
There are many improvements to the output. Even *I'm* impressed.
It's getting close to feature freeze/release time again, methinks, to put these improvements into a stable release. So, if anyone has anything they'd like to see right away, please speak now. :-)
Great!
Why not add Spamhaus' XBL zone lookups?
XBL is about mail senders, open relays, open proxies, etc. While it may be interesting to check header addresses for a given message against XBL, strictly speaking it's the URI domain servers and name servers that are most relevant to SURBLs. Those are in SBL.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
Please let me/us know what you think of it now.
13: ... </em</strong>
That should be </em></strong>, I got the whole page as <em> ;-)
69 - 75:
| <a href="http://www.isipp.com/iadbcodes.php">iadb.isipp.com
| or iadb2.isipp.com and wadb.isipp.com</a>. [...]
<a href="http://www.isipp.com/iadbcodes.php">iadb.isipp.com</a> or <a href="http://www.isipp.com/iadb2codes.php">iadb2.isipp.com</a>.
Don't mention WADB here; remove the explanation, it's only confusing (the linked ISIPP page does it, you don't need it). Or do you know any interesting WADB entries at the moment?
155: wierd news
Google has more hits for "weird news", and LEO's dictionary is down, please ignore me if "wierd" is correct or a joke ;-)
Bye, Frank
On Saturday, September 25, 2004, 4:26:54 AM, Frank Ellermann wrote:
Jeff Chan wrote:
Please let me/us know what you think of it now.
13: ... </em</strong>
That should be </em></strong>, I got the whole page as <em> ;-)
69 - 75:
| <a href="http://www.isipp.com/iadbcodes.php">iadb.isipp.com
| or iadb2.isipp.com and wadb.isipp.com</a>. [...]
<a href="http://www.isipp.com/iadbcodes.php">iadb.isipp.com</a> or <a href="http://www.isipp.com/iadb2codes.php">iadb2.isipp.com</a>.
Don't mention WADB here; remove the explanation, it's only confusing (the linked ISIPP page does it, you don't need it). Or do you know any interesting WADB entries at the moment?
155: wierd news
Fixed. Thanks!
Jeff C. -- "If it appears in hams, then don't list it."
"The older a domain is the less likely it should be listed. Most spam domains are used for 3 days then abandoned. Domains older than 90 days probably should not be added. A domain more than a few years old usually should not be added."
I would say, domains older than 90 days probably should not be added *unless* they use a blacklisted nameserver.
You really have to look at both the name servers and the date, in that order.
I want to give you some data on domain age for my recent blacklistings (last two weeks):
year     count
2004      4165
2003       582
2002        30
2001         6
2000         3
<=1999      12
total:    4830
There is a significant percentage of domains registered in 2003, but most of these still fall within one year of the listing. There are extremely few blacklistings for domains registered before 2003, about 1% of the total. Most of the 1999 ones are porn sites using a NS by wildrhino.com, plus one each by vendaregroup.com, webfinity.net, allproactive.com, rackhosters.com, all notorious spamhouses with SBL listings. These domains are exceptions to the rule that old domains usually don't merit listing.
About 11% of blacklisted domains were registered within 3 days of detection, 18% within 7 days, 34% within 2 weeks.
Then it gets interesting: I have no records in the set for 13-24 days, then a whole bunch of pill spam domains registered at least 25 days ago. These guys seem to wait a little before they strike.
50% of all blacklisted domains are registered no more than 35 days before listing, 60% within two months, 66% within three months, 70% within four months. As you see, the incremental gain per extra month gets smaller and smaller. Six months cover 80%, 12 months 90%, 24 months 97%.
A few comments in addition to those numbers:
1) There's a very small set of hardcore spammer NSs for which I list *all* domains that use them, regardless of age.
2) For other domains with SBL-listed NS, I routinely list them *if* they are recently registered.
3) For domains with SBL-listed NS older than a few months, I list them if they fit a pattern. Most of these will be porn and gambling sites from usual suspects, i.e. I'll see lots and lots of domains all sharing the same NS, advertised in similar spam mails. These guys stick around, so it doesn't matter much if you don't list them immediately, before you see a pattern. You can still get them later.
4) I also list sites without SBL records on the NS if they are very recently registered (usually < 6 weeks) and they fit a pattern with regard to naming or what kind of spam subject lines / sender names are used. That takes care of discardable spam domains registered with Joker.com such as these:
californiapassword.info coloradopassword.info coloradovodka.info dc-user.info dcpassword.info floridaadmin.info georgiapass.info georgiauser.info hawaii-vodka.info idahouser.info iowavodka.info kentucky-password.info
5) Recently registered domains with a name server from the same domain are more suspicious than those using a different server, because it means the name server has no track record to check.
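(Restated as a compressed, purely illustrative decision sketch. The predicate arguments are hypothetical helpers, not part of any real tool, and the day thresholds only approximate the rules as written above.)

    # Compressed, purely illustrative restatement of rules 1-4 above. The
    # predicate arguments are hypothetical helpers, and the day thresholds
    # only approximate the rules as written.
    def should_blacklist(age_days, ns_is_hardcore_spammer, ns_in_sbl,
                         fits_known_pattern):
        if ns_is_hardcore_spammer:                  # rule 1: list regardless of age
            return True
        if ns_in_sbl and age_days <= 90:            # rule 2: recently registered + SBL NS
            return True
        if ns_in_sbl and fits_known_pattern:        # rule 3: older, but fits a pattern
            return True
        if age_days <= 42 and fits_known_pattern:   # rule 4: very recent (~6 weeks) + pattern
            return True
        # Rule 5 (nameserver in the same domain) is only an extra suspicion signal.
        return False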
Joe
On Saturday, September 25, 2004, 7:52:13 AM, Joe Wein wrote:
I would say, domains older than 90 days probably should not be added *unless* they use a blacklisted nameserver.
You really have to look at both the name servers and the date, in that order.
I want to give you some data on domain age for my recent blacklistings (last two weeks):
year     count
2004      4165
2003       582
2002        30
2001         6
2000         3
<=1999      12
total:    4830
There is a significant percentage of domains registered in 2003, but most of these still fall within one year of the listing. There are extremely few blacklistings for domains registered before 2003, about 1% of the total.
[...]
About 11% of blacklisted domains were registered within 3 days of detection, 18% within 7 days, 34% within 2 weeks.
Then it gets interesting: I have no records in the set for 13-24 days, then a whole bunch of pill spam domains registered at least 25 days ago. These guys seem to wait a little before they strike.
50% of all blacklisted domains are registered no more than 35 days before listing, 60% within two months, 66% within three months, 70% within four months. As you see, the incremental gain per extra month gets smaller and smaller. Six months cover 80%, 12 months 90%, 24 months 97%.
A few comments in addition to those numbers:
- There's a very small set of hardcore spammer NSs for which I list *all* domains that use them, regardless of age.
- For other domains with SBL-listed NS, I routinely list them *if* they are recently registered.
- For domains with SBL-listed NS older than a few months, I list them if they fit a pattern. Most of these will be porn and gambling sites from usual suspects, i.e. I'll see lots and lots of domains all sharing the same NS, advertised in similar spam mails.
[...]
- I also list sites without SBL records on the NS if they are very recently registered (usually < 6 weeks) and they fit a pattern with regard to naming or what kind of spam subject lines / sender names are used. That takes care of discardable spam domains registered with Joker.com such as these:
Hi Joe,
All your observations and policies seem quite reasonable to me. :-)
There can be some lag in SBL detecting new domains and new spam gang name servers, so it's definitely true that non-inclusion in SBL should not give new domains a "free pass". New domains not matching SBL can be real spammers.
Thanks also for sharing your research into the age of spam domains! It's very useful data, though it might also be interesting to know how long a domain is used after it appears in the first spams we detect. Many are only used for a few days according to a well-placed spam statistician I spoke with before. It's also interesting that some domains don't get used immediately after registration. (Note that I said many spam domains only get used for a few days, not that they only get used for a few days after registration.)
I've updated the domain age guidelines, taking into account your research:
"The older a domain is the less likely it should be listed. Most spam domains are used for 3 days then abandoned. Domains older than 90 days probably should not be added. Domains more than 1 year old usually should not be added. However, domains that use name servers listed in SBL as belonging to known spam operators can be included, regardless of age. (See below.)"
How does that sound?
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
How does that sound?
Good. A link to UC somewhere (not necessarily on this page) would also be nice. About the "age" problem: sometimes spamvertized domains are used for weeks, more than 2 months in the case of aktion2004.net.multi.surbl.org = 127.0.0.118
Some days ago I got a feedback mail from ICANN's WDPRS; it was about a complaint in May. So attacking domains on the "whois data problem" track takes some time, certainly more than 90 days. At the moment Joe's idea "don't add anything older than 90 days" probably works, but the spammers will as always try to bypass any strict rules.
Therefore your wording ("should") is IMHO fine. Bye, Frank
On Monday, September 27, 2004, 4:51:27 PM, Frank Ellermann wrote:
Some days ago I got a feedback mail from ICANN's WDPRS; it was about a complaint in May. So attacking domains on the "whois data problem" track takes some time, certainly more than 90 days. At the moment Joe's idea "don't add anything older than 90 days" probably works, but the spammers will as always try to bypass any strict rules.
Actually it's Outblaze that tries to cut off domains at 90 days. Joe is more flexible, suggesting that domains older than 90 days can be included if, for example, they use name servers or hosting addresses in SBL.
Joe's statistics did show a large drop-off in spam domain registrations older than 1 year, however:
50% of all blacklisted domains are registered no more than 35 days before listing, 60% within two months, 66% within three months, 70% within four months. As you see, the incremental gain per extra month gets smaller and smaller. Six months cover 80%, 12 months 90%, 24 months 97%.
So there is a point of diminishing returns in going with the older domains. There is also perhaps an increasing chance of FPs with older domains.
(I didn't graph the above, but the numbers look like a nice exponential decay....)
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discuss:
So there is a point of diminishing returns in going with the older domains. There is also perhaps an increasing chance of FPs with older domains.
(I didn't graph the above, but the numbers look like a nice exponential decay....)
I have graphed similar numbers, but I don't have the results handy. It's more like a normal distribution ("bell curve"), with the mean at 0 days (actually slightly greater than zero, but that's a relatively constant skew due to lag between registration time and spam delivery/processing). GetURI uses a modified version of the normal distribution as part of its heuristic. The other parts of GetURI's heuristic are pretty much all additive, but I found that, statistically, domain age is good enough to be multiplicative, and it'll *reduce* rankings for domains that have been registered for a long time. It's so nice when math actually works. :-)
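(The specifics of GetURI's function aren't shown here, so the following is only a generic sketch of the idea: an additive evidence score multiplied by an age weight that is near 1.0 for freshly registered domains and falls toward 0 for old ones. The Gaussian shape and the 90-day scale are arbitrary choices, not GetURI's actual math.)

    # Generic sketch of the idea described above (not GetURI's actual math):
    # an additive evidence score multiplied by an age weight that is near 1.0
    # for freshly registered domains and falls toward 0 for old ones. The
    # Gaussian shape and the 90-day scale are arbitrary choices.
    import math

    def age_weight(age_days, scale=90.0):
        return math.exp(-(age_days / scale) ** 2 / 2.0)

    def rank(additive_score, age_days):
        return additive_score * age_weight(age_days)

    print(rank(10.0, 3))     # young domain: weight ~ 1.0, rank ~ 10
    print(rank(10.0, 400))   # old domain: weight ~ 0, rank pushed toward 0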
- Ryan
On Monday, September 27, 2004, 5:50:39 PM, Ryan Thompson wrote:
Jeff Chan wrote to SURBL Discuss:
So there is a point of diminishing returns in going with the older domains. There is also perhaps an increasing chance of FPs with older domains.
(I didn't graph the above, but the numbers look like a nice exponential decay....)
I have graphed similar numbers, but I don't have the results handy. It's more like a normal distribution ("bell curve"), with the mean at 0 days (actually slightly greater than zero, but that's a relatively constant skew due to lag between registration time and spam delivery/processing). GetURI uses a modified version of the normal distribution as part of its heuristic. The other parts of GetURI's heuristic are pretty much all additive, but I found that, statistically, domain age is good enough to be multiplicative, and it'll *reduce* rankings for domains that have been registered for a long time. It's so nice when math actually works. :-)
- Ryan
Heh, when I said "normal", statisticians jumped all over that.
It turns out the distributions may be more Zipf-like. Zipf curves have most of the data concentrated in a small part of the curve (e.g., young domains) and a small amount of the data spread over a larger part of the curve (e.g., old domains). I hope I'm explaining that correctly.
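(A tiny numeric illustration of that contrast, with arbitrary parameters: a normal-shaped weight still gives noticeable mass to 90-day-old domains, while a Zipf/power-law weight piles nearly everything onto the youngest ages and leaves only a thin tail for old domains.)

    # Tiny numeric illustration of the contrast described above; the
    # parameters are arbitrary, purely for comparing the two shapes.
    import math

    def normal_like(age_days, scale=90.0):
        return math.exp(-(age_days / scale) ** 2 / 2.0)

    def zipf_like(age_days, s=1.5):
        return 1.0 / (age_days + 1) ** s

    for age in (1, 10, 30, 90, 365):
        print(age, round(normal_like(age), 4), round(zipf_like(age), 6))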
That said, if you found some numerical heuristics that fit the data well, that's great!
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote to SURBL Discuss:
Heh, when I said "normal", statisticians jumped all over that.
:-)
It turns out the distributions may be more Zipf-like. Zipf curves have most of the data concentrated in a small part of the curve (e.g., young domains) and a small amount of the data spread over a larger part of the curve (e.g., old domains). I hope I'm explaining that correctly.
That said, if you found some numerical heuristics that fit the data well, that's great!
Yup, my function seems to fit quite nicely to the data I had at the time. However, I do plan to work on the scoring in more detail. GetURI is currently in a huge growth spurt with the advent of different relevant tests, and finally getting up to speed with what people are already doing to classify domains. Once that settles down a bit, I'll probably look more closely at scoring. Right now, though, it is definitely quite a useful metric at the extremes (top/bottom of output). It's weak in the middle ground, but, then again, we all know the middle ground is damned hard enough for humans. :-)
- Ryan
On Monday, September 27, 2004, 9:18:42 PM, Ryan Thompson wrote:
Once that settles down a bit, I'll probably look more closely at scoring. Right now, though, it is definitely quite a useful metric at the extremes (top/bottom of output). It's weak in the middle ground, but, then again, we all know the middle ground is damned hard enough for humans. :-)
Indeed. It's probably the extremes, and the FPs in the middle that are the most important, and GetURI is a nice tool for spotting those.
Jeff C. -- "If it appears in hams, then don't list it."
Jeff Chan wrote:
Joe's statistics did show a large drop-off in spam domain registrations older than 1 year, however:
Makes sense, if one year is the shortest period offered by registrars. Some spammers could try to use their registered domains as long as possible without renewing the registration. But I'm notoriously bad at guessing what spammers "think". Bye, Frank
On Friday, September 24, 2004, 9:35:14 PM, Jeff Chan wrote:
OK, I updated the policy page, incorporating Ryan's top rules and general organizational comments:
Please let me/us know what you think of it now.
Does anyone else have any comments on the updated policy page for adding new records to manual SURBL lists? It includes changes thanks to comments from Frank, Ryan, Joe and others.
Please reply if you have anything to add or change.
Jeff C. -- "If it appears in hams, then don't list it."