Daniel Quinlan, one of the principal SpamAssassin architects, had
some good suggestions for reducing false positives in the SURBL
data. One was using public databases of URIs, particularly
hand-built ones like dmoz.org and wikipedia.org or even yahoo.com
as sources of mostly legitimate domains. (Wikipedia is not a
web directory in the conventional sense; it's more like an open
encyclopedia, but it has a relatively large collection of URIs.)
Presumably most of the URIs in these are legitimate and don't
belong to spammers, especially in DMOZ since it's hand-built.
So the question is: can these be useful as whitelist sources, or
perhaps as one of the checks on new SURBL additions?
The DMOZ open directory publishes its data in RDF form at:
http://rdf.dmoz.org/
So we downloaded the URL data, extracted the domains and
compared them against the SURBL block and whitelists:
% join dmoz.srt ../multi.domains.sort | wc
1338 1338 20533
% join dmoz.srt ../whitelist-domains.sort | wc
7375 7375 96720
% join dmoz.srt ../multi.domains.sort > dmoz-blocklist.txt
% join dmoz.srt ../whitelist-domains.sort > dmoz-whitelist.txt
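(For anyone who wants to reproduce this, here's a tiny self-contained
sketch of the same join-based comparison. The file names and domains
below are invented stand-ins; the real inputs are one domain per line,
already sorted, since join(1) requires sorted input:)

```shell
# Build two tiny sorted one-domain-per-line files (stand-ins for
# dmoz.srt and multi.domains.sort).
printf 'example.com\nfoo.org\nspam.biz\n' | sort > dmoz.demo
printf 'baz.net\nspam.biz\n' | sort > multi.demo

# join prints only the domains present in both files;
# wc -l then counts the overlap.
join dmoz.demo multi.demo | wc -l    # prints 1 (spam.biz)
```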
There were 1338 DMOZ hits against our blocklisted domains and
7375 against our whitelists. You can view those matches at:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.txt
http://spamcheck.freeapp.net/whitelists/dmoz-whitelist.txt
Of the 1338 DMOZ hits against our blocklists, which arguably
could be false positives, most are in WS. Here is a list with
the data from multi.surbl.org showing list membership included:
% join dmoz.srt ../multi.domains.summed > dmoz-blocklist.summed.txt
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.summed.txt
And some list counts from those hits:
[ws] hits: 1173
[ob] hits: 165
[jp] hits: 61
[sc] hits: 8
[ab] hits: 4
[ph] hits: 2
These add up to more than 1338 since some records hit multiple
lists. The actual hits are in:
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ob
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.jp
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.sc
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ab
http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ph
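(Quick arithmetic on that overlap: the per-list counts sum to 1413
hits across 1338 distinct records, so at most 75 records sit on more
than one list:)

```shell
# Sum of the per-list hit counts above; it exceeds the 1338 distinct
# records because some domains appear on several lists.
echo $((1173 + 165 + 61 + 8 + 4 + 2))    # 1413
# 75 extra hits, i.e. at most 75 records are on more than one list.
echo $((1413 - 1338))                    # 75
```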
Data source folks, please review these and try to determine
which ones are FPs and which would result in false negatives
if they came off the lists. For ones that are FPs you may
want to eliminate them on your end. For the ones that could
cause FNs, we'd like to know about those as a measure of the
risk of using the DMOZ data for whitelisting. Right now I'm
leaning towards whitelisting all of these, so please speak
up!
The 7.4k DMOZ whitelist hits represent a majority of the 12.25k
whitelist entries that are not reserved .us geographic domains,
so there is significant overlap between DMOZ and our existing
whitelists, which probably speaks well for both lists.
% wc dotus_reservedlist_v3.lower.sort
52049 52049 1012735 dotus_reservedlist_v3.lower.sort
% wc ../whitelist-domains.sort
64299 64299 1169155 ../whitelist-domains.sort
% join dmoz.srt dotus_reservedlist_v3.lower.sort | wc
7 7 112
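(Spelling out the arithmetic behind "a majority of the 12.25k":)

```shell
# 64299 whitelist entries minus 52049 reserved .us geographic domains
# leaves about 12250 ordinary whitelist entries; 7375 of those also
# appear in DMOZ, i.e. roughly 60% (integer percentage).
echo $((64299 - 52049))            # 12250
echo $(( (7375 * 100) / 12250 ))   # 60
```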
The DMOZ data has about 2.3 million domains. How does anyone
feel about adding them to our whitelists? A 1.2 MB gzip of
the extracted domains is at:
http://spamcheck.freeapp.net/whitelists/dmoz.srt.gz
I think we can safely say that whitelisting DMOZ domains
will reduce FPs. Probably a more important question is: how many
FNs would that cause? In other words, how many purely spam
domains are in DMOZ, where whitelisting them would wrongly
exclude spam domains from SURBLs?
One way to answer that is to note that the lists ab, sc, jp, ph,
which have much lower FP rates than ws (measured by the
SpamAssassin corpora checks, for example, and also anecdotally by
human FP reports) appear relatively infrequently in the DMOZ
hits. In other words, SURBL lists that we know are quite spammy
like sc, jp, etc. don't match DMOZ often, so the DMOZ data may
not have too many spam domains.
Similar tests could be done against other proposed whitelists.
(We'll probably try the wikipedia data next.)
Another concern is that since these directories are relatively
open, spammers could simply add themselves and effectively get
whitelisted. However I intend to take a snapshot of these and
probably not try to refresh the data very often in future (if at
all), instead using them as relatively static snapshots of
established domains. Doing that would miss some new additions,
but could also prevent some future abuse by spammers. On the
other hand 2 million domains is a pretty good start.... :-)
Extraction scripts are not perfect, particularly in the
simplistic chopping to three levels of cctlds, but they're
probably adequate:
http://spamcheck.freeapp.net/whitelists/extract-dmoz-domains
http://spamcheck.freeapp.net/whitelists/chop-two-level-domains.sed
http://spamcheck.freeapp.net/whitelists/reduce-to-third-level.sed
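(For the curious, the chopping amounts to roughly this. This is an awk
rendering of the idea, not the actual sed scripts, and the two-letter
TLD test is my own simplification of the ccTLD heuristic:)

```shell
# Keep the last two labels of a hostname, or the last three when the
# TLD is two letters (a crude ccTLD test, e.g. .co.uk, .com.br).
chop() {
  echo "$1" | awk -F. '{
    n = NF
    if (length($n) == 2 && n >= 3)
      print $(n-2) "." $(n-1) "." $n
    else
      print $(n-1) "." $n
  }'
}
chop www.example.com            # example.com
chop www.shop.example.co.uk     # example.co.uk
```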
Comments please,
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Hi,
Got two different spams not getting caught by SURBL. The domains for
both of the spams look like this:
http://ss.net.yourstuffspotMUNGED.com?qd="and a tracking id"
AND
http://mq.net.yourstuffspotMUNGED.com?r="and a tracking id"
I checked the domain at rulesemporium and got a hit on
yourstuffspotMUNGED.com but not at the subdomains. Somehow SURBL doesn't
catch it even though it's listed on WS.
Are they trying to fool SURBL or what is going on?
Thanks in advance
/ Martin
>-----Original Message-----
>From: Martin [mailto:martin.lyberg@idkommunikation.com]
>Sent: Thursday, October 07, 2004 10:25 AM
>To: discuss(a)lists.surbl.org
>Subject: [SURBL-Discuss] submission to WS
>
>
>Hi,
>
>How long will it take for a submission made to WS at
>http://www.rulesemporium.com/cgi-bin/uribl.cgi to become active?
>
>I wonder because i got a new fresh spam that i submitted, but it's not
>yet listed to WS.
>
>Edit: i just noticed that it got listed on the OB-list.
>
These submissions used to go just to me; now they get split between a few
others. THANK GOODNESS!!! So it depends on the person. We have to hand check
them. I myself am a little behind on these. But I checked and don't have
yours. So I guess I can only quote game developers when I say "soon." ;)
--Chris
Hi,
How long will it take for a submission made to WS at
http://www.rulesemporium.com/cgi-bin/uribl.cgi to become active?
I wonder because I got a fresh new spam that I submitted, but it's not
yet listed on WS.
Edit: i just noticed that it got listed on the OB-list.
Thank you
Martin
Hi,
In some cases, it would be interesting to provide an alternative zone,
with a "spam signature info", for domains that could also be used for
legitimate purposes. This zone would feature a special TXT record with a
regexp or some encoded string that checking clients would use to
test the message.
A fake example:
buyziagra.com
TXT: listed in re.surbl.org (etc...) #click here.*buy [zvj]iagra#
The text between the #'s would be used as a regexp that, if matched
against the text in slurp mode (the whole buffer checked instead of
line-by-line), will make the tool report the e-mail as spam.
I can adapt my suriproxy to do that very easily. (By the way, there is a
new test version of suriproxy available, with domain whitelisting and a
better URI matching algorithm, at
http://sourceforge.net/projects/pf-aux. Any new feedback would be
appreciated)
The format I used is just an illustration. It would be ideal to
develop or find a "text matching" format simpler than regexp, yet
more powerful, that accepts different character encodings.
The idea of this "URIBL with spam sigs" is to avoid FPs and,
especially, to let us list domains under a less restrictive policy.
Even if a domain could be used for legitimate purposes, it could be
added to this special zone. I do agree with the current policy used
here, but I get a lot of spam every day, especially from Brazilian
domains, that could not get into the list if the policy is respected.
We still need to find a solution for that, and this is my suggestion.
This new feature would, then, combine two different types of
collaborative anti-spam solutions - URIBLs and online content checks
(Razor, DCC, etc.) - in a very efficient way, using existing
infrastructure, that is, DNS servers.
The odds are it would be a bit harder to maintain, and spam
gangs can change their text all the time. Even so, I believe this
could be interesting. Do you think it's worth trying?
Sorry for my bad English,
Yves
--
Yves Junqueira
http://www.lynx.com.br
Michele Solutions wrote:
>> Although directories such as DMOZ are manually edited there is a danger of
>> spammers "grabbing" expired domains and abusing them. I've seen a lot of
>> scripts for sale that track dmoz listed domains.....
Jeff Chan <jeffc(a)surbl.org> writes:
> Thanks. That's definitely good to know about.
Ah, you could also look at how long a domain has been registered. ;-)
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
Jeff Chan <jeffc(a)surbl.org> writes:
> Those are all good ideas. Do you know if spammer links do get
> deleted? How do the folks who maintain the sites find abusers or
> bots?
Wikipedia tends to find them eventually. Sometimes spam links can live
on a page or two, so higher-count links are going to be safer, as are
pages that get updated a lot (where the link survives each revision).
Also, while DMOZ might have some spammer links, I suspect most of the
spammer links are very stable, well-listed in SBL and your blacklists,
etc. It might be easiest to prioritize links by their S/O ratio:
(number of blacklist source hits)
---------------------------------
(number of whitelist source hits + number of blacklist source hits)
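(Taking that ratio literally, a quick sketch with invented counts:)

```shell
# S/O-style ratio: blacklist source hits over total source hits.
# A domain seen in 3 blacklist sources and 1 whitelist source
# scores 3 / (1 + 3) = 0.75; higher means more likely spam-only.
awk 'BEGIN { b = 3; w = 1; printf "%.2f\n", b / (w + b) }'   # 0.75
```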
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
Chris Santerre <csanterre(a)MerchantsOverseas.com> writes:
> Wow, it looks like some of the DMOZ data can't be trusted. Some of those
> domains in this WS blocklist are pure spammers.
DMOZ (and as far as I know, Wikipedia) don't filter URLs based on email
policies of those sites. However, the links *should* generally be
categorized correctly in the case of DMOZ and useful in the case of
Wikipedia.
I would not suggest using either to whitelist automatically, but if you
get several of these sources and count the number of hits for each
domain, then you should be able to prioritize and possibly automatically
whitelist the ones that hit in a large number of databases.
I would also take snapshots, but for a different reason than the one
Jeff suggested. I would take snapshots and take the intersection of two
snapshots for each source (two separate days of DMOZ, etc.) as the
authoritative list since some spammer links (especially if added by some
bot) will drop off once they are found.
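comm(1) makes that intersection a one-liner; here's a toy demo with
invented snapshot files (comm also requires sorted input):

```shell
# Two sorted snapshots of the same source, taken on different days.
# The short-lived (possibly bot-added) entry only appears in the first.
printf 'example.com\nfleeting-spam.biz\nstable.org\n' | sort > snap1
printf 'example.com\nstable.org\n' | sort > snap2

# comm -12 keeps only the lines common to both snapshots.
comm -12 snap1 snap2      # example.com, stable.org
```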
Clearly, given that most of the hits are in WS etc., you're in the tail
region of false positives. It'll be hard to find a lot. More sources
and looking at source counts seems like the best way.
Daniel
--
Daniel Quinlan ApacheCon! 13-17 November (3 SpamAssassin
http://www.pathname.com/~quinlan/  http://www.apachecon.com/  sessions & more)
>-----Original Message-----
>From: Jeff Chan [mailto:jeffc@surbl.org]
>Sent: Wednesday, October 06, 2004 5:58 AM
>To: SURBL Discuss
>Cc: SpamAssassin Developers
>Subject: Possible large whitelist from DMOZ data
>
*snip*
>
>And some list counts from those hits:
>
> [ws] hits: 1173
> [ob] hits: 165
> [jp] hits: 61
> [sc] hits: 8
> [ab] hits: 4
> [ph] hits: 2
>
>These add up to more than 1338 since some records hit multiple
>lists. The actual hits are in:
>
> http://spamcheck.freeapp.net/whitelists/dmoz-blocklist.ws
Wow, it looks like some of the DMOZ data can't be trusted. Some of those
domains in this WS blocklist are pure spammers.
adultmovienetwork.com has 135+ NANAS hits, is listed in Spamhaus, etc.
These need to ALL be checked carefully. Do not use DMOZ to autowhitelist. I
will check A-C in this list. Any takers to check the rest?
1173 FPs......I doubt it.
--Chris
Ok, as I was reviewing a spam domain to submit, I ran across a somewhat
disgusting policy I felt the need to share with everyone here.
Here is a snippet:
"
Additionally, when you open, preview or click on the advertising portion of
our e-mails and/or those of our marketing partners and/or affiliates of
GroovyUSA, you have agreed to the terms set forth in our Privacy Policy and
agree that as a function of opening, previewing or clicking on the
advertising portion of our e-mails, that you will receive new or additional
marketing communications from us, our marketing partners and/or affiliates
of GroovyUSA.
"
Right before this, they talk about using clear 1-pixel GIFs to track what
you view. So they are saying: if you view this message, we will send you
more, and you accept that we do whatever we want with your information.
http://www.groovyusa-MUNGED.net/
Click on the privacy button at the top of the page.