Daniel Quinlan, one of the principal SpamAssassin architects, had
some good suggestions for reducing false positives in the SURBL
data. One was using public databases of URIs, particularly
hand-built ones like dmoz.org and wikipedia.org or even yahoo.com
as sources of mostly legitimate domains. (Wikipedia is not a
web directory in the conventional sense; it's more like an open
encyclopedia, but it has a relatively large collection of URIs.)
Presumably most of the URIs in these are legitimate and don't
belong to spammers, especially in DMOZ since it's hand-built.
So the question is: can these be useful as whitelist sources, or
perhaps as one of the checks on new SURBL additions?
The DMOZ open directory publishes its data in RDF form at:
http://rdf.dmoz.org/
So we downloaded the URL data, extracted the domains and
compared them against the SURBL block and whitelists:
% join dmoz.srt ../multi.domains.sort | wc
1338 1338 20533
% join dmoz.srt ../whitelist-domains.sort | wc
7375 7375 96720
% join dmoz.srt ../multi.domains.sort > dmoz-blocklist.txt
% join dmoz.srt ../whitelist-domains.sort > dmoz-whitelist.txt
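(For anyone who wants to reproduce this, here's a tiny self-contained
sketch of the same join-based comparison. The file names and domains
below are invented stand-ins; the real inputs are one domain per line,
already sorted, since join(1) requires sorted input:)

```shell
# Build two tiny sorted one-domain-per-line files (stand-ins for
# dmoz.srt and multi.domains.sort).
printf 'example.com\nfoo.org\nspam.biz\n' | sort > dmoz.demo
printf 'baz.net\nspam.biz\n' | sort > multi.demo

# join prints only the domains present in both files;
# wc -l then counts the overlap.
join dmoz.demo multi.demo | wc -l    # prints 1 (spam.biz)
```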
There were 1338 DMOZ hits against our blocklisted domains and
7375 against our whitelists. You can view those matches at:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-whitelist.txt
Of the 1338 DMOZ hits against our blocklists, which arguably
could be false positives, most are in WS. Here is a list with
the data from multi.surbl.org showing list membership included:
% join dmoz.srt ../multi.domains.summed > dmoz-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
And some list counts from those hits:
[ws] hits: 1173
[ob] hits: 165
[jp] hits: 61
[sc] hits: 8
[ab] hits: 4
[ph] hits: 2
These add up to more than 1338 since some records hit multiple
lists. The actual hits are in:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ob
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.jp
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.sc
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ab
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ph
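(Quick arithmetic on that overlap: the per-list counts sum to 1413
hits across 1338 distinct records, so at most 75 records sit on more
than one list:)

```shell
# Sum of the per-list hit counts above; it exceeds the 1338 distinct
# records because some domains appear on several lists.
echo $((1173 + 165 + 61 + 8 + 4 + 2))    # 1413
# 75 extra hits, i.e. at most 75 records are on more than one list.
echo $((1413 - 1338))                    # 75
```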
Data source folks, please review these and try to determine
which ones are FPs and which would result in false negatives
if they came off the lists. For ones that are FPs you may
want to eliminate them on your end. For the ones that could
cause FNs, we'd like to know about those as a measure of the
risk of using the DMOZ data for whitelisting. Right now I'm
leaning towards whitelisting all of these, so please speak
up!
The 7.4k DMOZ whitelist hits represent a majority of the 12.25k
whitelist entries that are not reserved .us geographic domains,
so there is significant overlap between DMOZ and our existing
whitelists, which probably speaks well for both lists.
% wc dotus_reservedlist_v3.lower.sort
52049 52049 1012735 dotus_reservedlist_v3.lower.sort
% wc ../whitelist-domains.sort
64299 64299 1169155 ../whitelist-domains.sort
% join dmoz.srt dotus_reservedlist_v3.lower.sort | wc
7 7 112
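(Spelling out the arithmetic behind "a majority of the 12.25k":)

```shell
# 64299 whitelist entries minus 52049 reserved .us geographic domains
# leaves about 12250 ordinary whitelist entries; 7375 of those also
# appear in DMOZ, i.e. roughly 60% (integer percentage).
echo $((64299 - 52049))            # 12250
echo $(( (7375 * 100) / 12250 ))   # 60
```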
The DMOZ data has about 2.3 million domains. How does anyone
feel about adding them to our whitelists? A 1.2 MB gzip of
the extracted domains is at:
http://spamcheck.freeapp.net/whitelists/dmoz.srt.gz
I think we can safely say that whitelisting DMOZ domains
will reduce FPs. Probably a more important question is: how many
FNs would that cause? In other words, how many purely spam
domains are in DMOZ, where whitelisting them would wrongly
exclude spam domains from SURBLs?
One way to answer that is to note that the lists ab, sc, jp, ph,
which have much lower FP rates than ws (measured by the
SpamAssassin corpora checks, for example, and also anecdotally by
human FP reports) appear relatively infrequently in the DMOZ
hits. In other words, SURBL lists that we know are quite spammy
like sc, jp, etc. don't match DMOZ often, so the DMOZ data may
not have too many spam domains.
Similar tests could be done against other proposed whitelists.
(We'll probably try the wikipedia data next.)
Another concern is that since these directories are relatively
open, spammers could simply add themselves and effectively get
whitelisted. However I intend to take a snapshot of these and
probably not try to refresh the data very often in future (if at
all), instead using them as relatively static snapshots of
established domains. Doing that would miss some new additions,
but could also prevent some future abuse by spammers. On the
other hand 2 million domains is a pretty good start.... :-)
Extraction scripts are not perfect, particularly in the
simplistic chopping to three levels of cctlds, but they're
probably adequate:
http://spamcheck.freeapp.net/whitelists/extract-dmoz-domains
http://spamcheck.freeapp.net/whitelists/chop-two-level-domains.sed
http://spamcheck.freeapp.net/whitelists/reduce-to-third-level.sed
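(For the curious, the chopping amounts to roughly this. This is an awk
rendering of the idea, not the actual sed scripts, and the two-letter
TLD test is my own simplification of the ccTLD heuristic:)

```shell
# Keep the last two labels of a hostname, or the last three when the
# TLD is two letters (a crude ccTLD test, e.g. .co.uk, .com.br).
chop() {
  echo "$1" | awk -F. '{
    n = NF
    if (length($n) == 2 && n >= 3)
      print $(n-2) "." $(n-1) "." $n
    else
      print $(n-1) "." $n
  }'
}
chop www.example.com            # example.com
chop www.shop.example.co.uk     # example.co.uk
```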
Comments please,
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Hi,
Got two different spams not getting caught by SURBL. The domains for
both of the spams look like this:
http://ss.net.yourstuffspotMUNGED.com?qd="and a tracking id"
AND
http://mq.net.yourstuffspotMUNGED.com?r="and a tracking id"
I checked the domain at rulesemporium and got a hit on
yourstuffspotMUNGED.com but not at the subdomains. Somehow SURBL doesn't
catch it even though it's listed on WS.
Are they trying to fool SURBL or what is going on?
Thanks in advance
/ Martin
>-----Original Message-----
>From: Martin [mailto:martin.lyberg@idkommunikation.com]
>Sent: Thursday, October 07, 2004 10:25 AM
>To: discuss(a)lists.surbl.org
>Subject: [SURBL-Discuss] submission to WS
>
>
>Hi,
>
>How long will it take for a submission made to WS at
>http://www.rulesemporium.com/cgi-bin/uribl.cgi to become active?
>
>I wonder because i got a new fresh spam that i submitted, but it's not
>yet listed to WS.
>
>Edit: i just noticed that it got listed on the OB-list.
>
These submissions used to go just to me; now they get split between a few
others. THANK GOODNESS!!! So it depends on the person. We have to hand check
them. I myself am a little behind on these. But I checked and don't have
yours. So I guess I can only quote game developers when I say "soon." ;)
--Chris
Hi,
How long will it take for a submission made to WS at
http://www.rulesemporium.com/cgi-bin/uribl.cgi to become active?
I wonder because I got a fresh new spam that I submitted, but it's not
yet listed on WS.
Edit: i just noticed that it got listed on the OB-list.
Thank you
Martin
Hi,
In some cases, it would be interesting to provide an alternative zone,
with a "spam signature info", for domains that could also be used for
legitimate purposes. This zone would feature a special TXT record with a
regexp or some encoded string that checking clients would use to
test the message.
A fake example:
buyziagra.com
TXT: listed in re.surbl.org (etc...) #click here.*buy [zvj]iagra#
The text between the #'s would be used as a regexp that, if matched
against the text in slurp mode (the whole buffer checked instead of
line-by-line), will make the tool report the e-mail as spam.
I can adapt my suriproxy to do that very easily. (By the way, there is a
new test version of suriproxy available, with domain whitelisting and a
better URI matching algorithm, at
http://sourceforge.net/projects/pf-aux. Any new feedback would be
appreciated)
The format I used is just an illustration. It would be ideal to
develop or find a "text matching" format simpler than regexp, yet
more powerful, that accepts different character encodings.
The idea of this "URIBL with spam sigs" is to avoid FPs and,
especially, to let us list domains under a less restrictive policy.
Even if a domain could be used for legitimate purposes, it could be
added to this special zone. I do agree with the current policy used
here, but I get a lot of spam every day, especially from Brazilian
domains, that could not get into the list if the policy is respected.
We still need to find a solution for that, and this is my suggestion.
This new feature would, then, combine two different types of
collaborative anti-spam solutions - URIBLs and online content checks
(Razor, DCC, etc.) - in a very efficient way, using existing
infrastructure, that is, DNS servers.
The odds are it would be a bit harder to maintain, and spam
gangs can change their text all the time. Even so, I believe this
could be interesting. Do you think it's worth trying?
Sorry for my bad English,
Yves
--
Yves Junqueira
http://www.lynx.com.br
Michele Solutions wrote:
>> Although directories such as DMOZ are manually edited there is a danger of
>> spammers "grabbing" expired domains and abusing them. I've seen a lot of
>> scripts for sale that track dmoz listed domains.....
Jeff Chan <jeffc(a)surbl.org> writes:
> Thanks. That's definitely good to know about.
Ah, you could also look at how long a domain has been registered. ;-)
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
Jeff Chan <jeffc(a)surbl.org> writes:
> Those are all good ideas. Do you know if spammer links do get
> deleted? How do the folks who maintain the sites find abusers or
> bots?
Wikipedia tends to find them eventually. Sometimes spam links can live
on a page or two, so higher-count links are going to be safer, as are
pages that get updated a lot (where the link survives each revision).
Also, while DMOZ might have some spammer links, I suspect most of the
spammer links are very stable, well-listed in SBL and your blacklists,
etc. It might be easiest to prioritize links by their S/O ratio:
(number of blacklist source hits)
---------------------------------
(number of whitelist source hits + number of blacklist source hits)
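(Taking that ratio literally, a quick sketch with invented counts:)

```shell
# S/O-style ratio: blacklist source hits over total source hits.
# A domain seen in 3 blacklist sources and 1 whitelist source
# scores 3 / (1 + 3) = 0.75; higher means more likely spam-only.
awk 'BEGIN { b = 3; w = 1; printf "%.2f\n", b / (w + b) }'   # 0.75
```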
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
Chris Santerre <csanterre(a)MerchantsOverseas.com> writes:
> Wow, it looks like some of the DMOZ data can't be trusted. Some of those
> domains in this WS blocklist are pure spammers.
DMOZ (and as far as I know, Wikipedia) don't filter URLs based on email
policies of those sites. However, the links *should* generally be
categorized correctly in the case of DMOZ and useful in the case of
Wikipedia.
I would not suggest using either to whitelist automatically, but if you
get several of these sources and count the number of hits for each
domain, then you should be able to prioritize and possibly automatically
whitelist the ones that hit in a large number of databases.
I would also take snapshots, but for a different reason than the one
Jeff suggested. I would take snapshots and take the intersection of two
snapshots for each source (two separate days of DMOZ, etc.) as the
authoritative list since some spammer links (especially if added by some
bot) will drop off once they are found.
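comm(1) makes that intersection a one-liner; here's a toy demo with
invented snapshot files (comm also requires sorted input):

```shell
# Two sorted snapshots of the same source, taken on different days.
# The short-lived (possibly bot-added) entry only appears in the first.
printf 'example.com\nfleeting-spam.biz\nstable.org\n' | sort > snap1
printf 'example.com\nstable.org\n' | sort > snap2

# comm -12 keeps only the lines common to both snapshots.
comm -12 snap1 snap2      # example.com, stable.org
```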
Clearly, given that most of the hits are in WS etc., you're in the tail
region of false positives. It'll be hard to find a lot. More sources
and looking at source counts seems like the best way.
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
>-----Original Message-----
>From: Jeff Chan [mailto:jeffc@surbl.org]
>Sent: Wednesday, October 06, 2004 5:58 AM
>To: SURBL Discuss
>Cc: SpamAssassin Developers
>Subject: Possible large whitelist from DMOZ data
>
*snip*
>
>And some list counts from those hits:
>
> [ws] hits: 1173
> [ob] hits: 165
> [jp] hits: 61
> [sc] hits: 8
> [ab] hits: 4
> [ph] hits: 2
>
>These add up to more than 1338 since some records hit multiple
>lists. The actual hits are in:
>
> http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
Wow, it looks like some of the DMOZ data can't be trusted. Some of those
domains in this WS blocklist are pure spammers.
adultmovienetwork.com has 135+ NANAS hits, is listed in Spamhaus, etc.
These need to ALL be checked carefully. Do not use DMOZ to autowhitelist. I
will check A-C in this list. Any takers to check the rest?
1173 FPs......I doubt it.
--Chris
Ok, as I was reviewing a spam domain to submit, I ran across a somewhat
disgusting policy I felt the need to share with everyone here.
Here is a snippet:
"
Additionally, when you open, preview or click on the advertising portion of
our e-mails and/or those of our marketing partners and/or affiliates of
GroovyUSA, you have agreed to the terms set forth in our Privacy Policy and
agree that as a function of opening, previewing or clicking on the
advertising portion of our e-mails, that you will receive new or additional
marketing communications from us, our marketing partners and/or affiliates
of GroovyUSA.
"
Right before this, they talk about using clear 1-pixel GIFs to track what
you view. So they are saying: if you view this message, we will send you
more, and you accept that we do whatever we want with your information.
http://www.groovyusa-MUNGED.net/
Click on the privacy button at the top of the page.