2004-09-30: GetURI 1.6 Released
I'm very pleased to announce the release of GetURI 1.6. Many new features have
been added to this quickly growing program, along with a few important bug
fixes. Everyone already using GetURI is strongly encouraged to upgrade as soon
as possible. If you haven't yet tried GetURI, now is a great time to start!
What is GetURI?
GetURI is a program designed to extract URIs from ham and spam messages, mbox
files, or lists of domains, and present them in a format designed to help
classify domains for anti-spam efforts such as SURBL, although it has other
uses, too. The included 'uricat' utility provides a simple way to extract URIs
from virtually any text file, regardless of how they are encoded. With the
help of the SpamAssassin libraries, GetURI attempts to ignore unclickable
domains (i.e., poisoning attempts), follow redirects, and otherwise simulate
the action of mail user agents (MUAs) as closely as possible.
Sample output: http://ry.ca/geturi/results.html
What's new?
Here are just a few of the most notable additions to GetURI 1.6:
- Support for SpamAssassin 2.6x has been re-introduced. Both 3.0 and 2.6x are
now officially supported.
- By popular demand, support for processing mbox files has been added.
- GetURI now does several forward lookup checks on domains, including SBL/XBL,
IADB2/WADB, as well as checks on nameservers, to aid classification.
- More documentation is now included in the output, and the output format has
been improved visually to be more intuitive.
- It is now possible to choose which SURBL host to query, instead of always
using the previous default of multi.surbl.org.
- A potentially large memory leak was discovered in the handling of SA3.0
objects. Consequently, SA3.0 users should upgrade immediately to enjoy
drastically reduced memory consumption.
Many more changes have been implemented; please see
http://ry.ca/geturi/CHANGELOG for details
To fetch the new version of GetURI, please visit http://ry.ca/geturi/
As always, your feedback will help improve GetURI!
Additional testers are always welcome.
--
Ryan Thompson <ryan(a)sasknow.com>
SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America
[Please post follow ups to the SURBL discuss list or to me.]
One of the distinct data sources currently feeding into
ws.surbl.org comes from Joe Wein and from Raymond Dijkxhoorn
and his colleagues at Prolocation. Raymond and Prolocation
are currently processing more than 300,000 potential spams per
day using Joe's jwSpamSpy server software and combining those
with Joe's own results. In addition to the data processing
software, Joe has an elaborate, thorough, and well-thought-out
set of inclusion criteria which includes age of domain
registration, manual checks, and other factors. The resulting
data are an extensive list of spam URI domains with a very
low false positive rate (hits on legitimate messages). We
are calling this resulting data JP for Joe Wein + Prolocation.
The bottom line is that JP (called PJ in the table below) has a
significantly lower false positive rate than WS while having
similar spam detection rates, for example as measured against a
large set of corpora belonging to Theo Van Dinter of SpamAssassin:
OVERALL%     SPAM%     HAM%    S/O   RANK  SCORE  NAME
 2424443   2357143    67300  0.972   0.00   0.00  (all messages)
 100.000   97.2241   2.7759  0.972   0.00   0.00  (all messages as %)
   7.595    7.8122   0.0045  0.999   1.00   0.00  URIBL_SC_SURBL
  76.754   78.9448   0.0178  1.000   0.80   0.00  URIBL_OB_SURBL
  77.230   79.4340   0.0208  1.000   0.60   1.00  URIBL_PJ_SURBL
   0.985    1.0126   0.0045  0.996   0.50   0.00  URIBL_AB_SURBL
  82.119   84.4600   0.1367  0.998   0.40   0.00  URIBL_WS_SURBL
   0.021    0.0216   0.0045  0.829   0.00   0.00  URIBL_PH_SURBL
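The S/O column in the table above appears to be the share of a rule's
hit rate that falls on spam, computed from the SPAM% and HAM% columns.
That formula is an assumption, since it is not stated in this message,
but it reproduces the printed values to within rounding. A minimal
Python sketch (the function name is hypothetical):

```python
def spam_over_overall(spam_pct, ham_pct):
    """Assumed S/O formula: spam hit rate over combined spam + ham hit rate."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule has no hits in either corpus")
    return spam_pct / total


# SPAM% and HAM% values copied from the PJ and WS rows of the table.
rows = {
    "URIBL_PJ_SURBL": (79.4340, 0.0208),
    "URIBL_WS_SURBL": (84.4600, 0.1367),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{name}: S/O = {spam_over_overall(spam_pct, ham_pct):.3f}")
```

On these figures the HAM% for WS (0.1367) is roughly 6.6 times the HAM%
for PJ (0.0208), while the SPAM% values differ by only about five
percentage points.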
So we feel the data could usefully be broken out into a
separate list which could safely be scored higher than
WS. We also continue to work on improving the False Positive
rate of WS of course. We propose making JP a separate list
within multi.surbl.org, but *not* a standalone list like
jp.surbl.org, since it's a major effort to set up entirely
new lists and most people should be using multi now.
The main reason for announcing this change ahead of time
is to allow developers of the many programs (in addition to
SpamAssassin) now using SURBL data to update their code or
configurations to take into account that the result codes in
multi will be changing as a result of adding JP. JP would get
the 64 bitmask, as in:
2 = comes from sc.surbl.org
4 = comes from ws.surbl.org
8 = comes from phishing list (labelled as [ph] in multi)
16 = comes from ob.surbl.org
32 = comes from ab.surbl.org
64 = comes from jp list
So a record in SC, WS, and JP would give a value 127.0.0.70.
One with WS, OB, and JP would resolve to 127.0.0.84, etc.
Programs using multi.surbl.org should be updated accordingly.
Since JP is currently included in WS, every record in JP will
also be in WS. As a result, about half of the WS records in
multi will have their values increase by 64 due to the overlap
with JP. WS itself will continue to use the 4 bit, as before.
If your programs are decoding the multi results using the bit
positions, they should need no adjustments to continue to
handle the WS data.
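As an illustration of decoding by bit position, here is a minimal
Python sketch that maps the last octet of a multi.surbl.org answer
back to list names, using the bit values given above. The function
and table names are hypothetical, not part of any SURBL tool.

```python
# Bit values as listed in this announcement, in ascending order.
SURBL_BITS = {
    2: "sc",
    4: "ws",
    8: "ph",
    16: "ob",
    32: "ab",
    64: "jp",
}


def decode_multi(address):
    """Return the list names encoded in a 127.0.0.x answer from multi."""
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        raise ValueError(f"not a 127.0.0.x result: {address!r}")
    value = int(octets[3])
    # Test each bit independently so unknown future bits are ignored.
    return [name for bit, name in SURBL_BITS.items() if value & bit]


print(decode_multi("127.0.0.70"))  # SC + WS + JP
print(decode_multi("127.0.0.84"))  # WS + OB + JP
```

Because each list is tested with its own mask, code written this way
keeps reporting WS correctly when the 64 bit is added to a record.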
We hope that 5 days is not too short notice for this kind of
change. I will try to contact the developers of the various
(non-SA) programs separately to make sure they're aware of the
coming change. Hopefully most of them are already on this
announcement list.
We were not able to get JP as a separate list in yesterday's
SpamAssassin 3.0.0 full release, but we have gotten it into
SA 3.1 development.
For now the JP data will continue to be included in WS,
but just before SpamAssassin 3.1 gets released (probably in
6 months to a year from now), we will remove JP data from WS
to make them separate lists within multi. This means that
SpamAssassin 3.0 and other current users of WS will continue
to get the benefits of JP under their default shipping
configurations, and that JP can also be used separately by
those who modify their configurations to take advantage of it.
In summary, we will:
1. Add JP to multi.surbl.org on Monday, September 27th.
(Note that like PH, JP would not be available as a separate
list, only as part of multi.)
2. Keep the JP data in WS for now, so that regular 3.0 users
get the advantages of JP also (as part of WS).
3. Ask the SpamAssassin developers to score JP separately in
SA 3.1.
4. Remove JP from WS before the final SA 3.1 mass check and
re-scoring is done, to make the two lists more separate
for 3.1. (Note that this separation means removing the
specific subset arrangement described in #2. Once that is
done, there will still be some minor overlap between the
records in WS and JP.)
5. Inform people about removing JP from WS before we do it,
so existing WS users can add JP, etc.
Please post follow up questions or comments to the SURBL discuss
list or to me personally.
Thanks,
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
On Saturday, September 18, 2004, 11:58:56 PM, Frank Ellermann wrote:
> [Konqueror list]
>>> | name,ai,au,bd,bh,ck,eg,et,fk,il,in,kh,kr,mk,mt,na,
>>> | np,nz,pg,pk,qa,sa,sb,sg,sv,ua,ug,uk,uy,vn,za,zw
I have updated the ccTLDs for the ones Frank mentioned,
removed some duplicates, and added some data for a few
other ccTLDs. The results are in:
http://spamcheck.freeapp.net/two-level-tlds
Really this is just for completeness since geographic
domains other than .us aren't used in spams too often.
Jeff C.
--
"If it appears in hams, then don't list it."
Released today, SpamAssassin 3.0.0 has support for SURBLs
built-in and enabled by default. It checks multi.surbl.org
using the command urirhssub in the URIDNSBL plugin.
http://spamassassin.apache.org/full/3.0.x/dist/doc/Mail_SpamAssassin_Plugin…
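For programs other than SpamAssassin, a lookup against
multi.surbl.org amounts to prepending the domain being checked to
the zone name and asking DNS for an address. A minimal Python sketch
follows; the helper names and the example domain are hypothetical,
and this is not how the URIDNSBL plugin itself is implemented.

```python
import socket

SURBL_ZONE = "multi.surbl.org"


def surbl_query_name(domain, zone=SURBL_ZONE):
    """Build the DNS name to look up for a domain found in a message."""
    return f"{domain.strip('.').lower()}.{zone}"


def surbl_lookup(domain, resolver=socket.gethostbyname, zone=SURBL_ZONE):
    """Return the 127.0.0.x answer as a string, or None if not listed.

    The resolver is injectable so the function can be exercised
    without touching the network.
    """
    try:
        return resolver(surbl_query_name(domain, zone))
    except socket.gaierror:
        # A failed resolution is treated as "not listed".
        return None


print(surbl_query_name("Example.COM"))
```

High-volume sites would point such lookups at their own local mirror
rather than the public name servers.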
If you administer a high volume mail server processing 100k
messages per day or more, please set up a local DNS mirror
for your own use of the SURBL zone files using rsync and rbldnsd
as described in some of the links on our rsync signup form:
http://www.surbl.org/rsync-signup.html
If you are able to host public DNS for the SURBL zones,
please let us know. Pre-release traffic was about 70-80k bits
per second, and we expect that to go up at least 5-fold, but
that's just an estimate.
Cheers,
Jeff C.
We'd like to welcome two new public SURBL name servers,
j3.surbl.org and k3.surbl.org, and to thank their administrator:
Arjen Wolfs of Wanadoo Nederland BV
Without all of our public nameservers and the help of their
administrators, SURBLs would not be possible.
Our thanks to all of them!
Jeff C.
We'd like to welcome a new public SURBL name server,
d3.surbl.org, and to thank its administrator:
Kevin A. McGrail of Peregrine Computer Consultants Corporation
Without all of our public nameservers and the help of their
administrators, SURBLs would not be possible.
Our thanks to all of them!
Jeff C.
After trying DNS TTLs of 1 hour and of 25, 20, 15, and 10 minutes,
it appears that a 15-minute TTL best balances name server
traffic against how quickly records are added to or deleted
from the lists. Therefore, we are standardizing on 15-minute
TTLs for all SURBLs.
Note that this result applies to SURBL data and uses of it.
It may or may not apply to other types of RBLs.
Jeff C.
We'd like to welcome a new public SURBL name server and to
thank its administrator:
Chris Stone of AxisInternet, Inc.
Without all of our public nameservers and the help of their
administrators, SURBLs would not be possible.
Our thanks to all of them!
Some additional name server news: all of the name servers which
we had temporarily commented out while they were being repaired
have been fully returned to service. A name server status page
can be found at:
http://www.surbl.org/nameservers-output.html
Also in case we haven't already mentioned it, the name servers
are now organized into letter names like a.surbl.org,
b.surbl.org, ... through n.surbl.org, and each name round robins
into two or three servers. As more are added, we will try to
balance them all at three servers per name, so that traffic is
balanced overall. Having more than one server per name lets us
comment one out when a particular name server needs maintenance,
without changing NS records in the subdomain zone files and
delegations. Having fewer name servers listed in the NS records
also keeps the SURBL DNS packets smaller.
Jeff C.
To follow up on an earlier announcement, in addition to a list of
the top SURBL DNS queries that hit whitelists:
http://www.surbl.org/dns-queries.whitelist.counts.txt
we've added a list of the top blocklist hits:
http://www.surbl.org/dns-queries.blocklist.counts.txt
Though the sample size is somewhat small at 32k queries over
the trailing 48 hours, the data may be useful for
expiring records from blocklists or checking whitelists.
We can increase the sample size if anyone thinks it worthwhile.
Jeff C.