Hi, Currently URIDNSBL.pm uses SA's get_uri_list to get a list of URIs from a message. The current regex also seems to pick up URIs of the form cid:random_characters.
cid:.* seems to refer to content IDs, i.e. attachments in the same message. When these URIs are run through uri_to_domain, they come back unchanged as cid:.*
My feeling is that a message can contain artificial cid:.* URLs, which may skew the random set of domains used for SURBL lookups.
I am not sure whether cid:.* URLs should be returned from get_uri_list() at all, or whether they should instead be stripped in uri_to_domain. Quite a few of the values after cid: look like host names or domain names.
Regards, Yusuf
On Friday, April 23, 2004, 1:15:49 AM, Yusuf Goolamabbas wrote:
> Hi, Currently URIDNSBL.pm uses SA's get_uri_list to get a list of URIs from a message. The current regex also seems to pick up URIs of the form cid:random_characters.
> cid:.* seems to refer to content IDs, i.e. attachments in the same message. When these URIs are run through uri_to_domain, they come back unchanged as cid:.*
> My feeling is that a message can contain artificial cid:.* URLs, which may skew the random set of domains used for SURBL lookups.
> I am not sure whether cid:.* URLs should be returned from get_uri_list() at all, or whether they should instead be stripped in uri_to_domain. Quite a few of the values after cid: look like host names or domain names.
I'll leave a detailed response to those more familiar with URIDNSBL internals, but the goal is to remove all but the base domain before comparing it to an SURBL. So I'm hoping any deliberately randomized characters and any other extra stuff are discarded before the RBL comparison. Only the basic domain should be checked against the SURBL.
Jeff C.
> I'll leave a detailed response to those more familiar with URIDNSBL internals, but the goal is to remove all but the base domain before comparing it to an SURBL. So I'm hoping any deliberately randomized characters and any other extra stuff is discarded before RBL comparison. Only the basic domain should be checked against the SURBL.
Currently, SURBL relies on get_uri_list to grab the list of domains, but some URIs may not be an appropriate basis for extracting domains. If that list could be cut down, the pool from which the random selection is made would be more interesting.
E.g., I could write a message with maybe 25-30 cid:.* URLs and one real spamvertised URL. The probability of URIDNSBL.pm picking the spamvertised URL would be higher if the noise from the cid:.* URLs and other uninteresting URLs were removed.
PS: Does this list need to have the list name prefixed to the subject line? It wastes a lot of space, and I am sure there are other headers one can filter by.
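To put rough numbers on that argument, here is a small Python sketch. It assumes the checker picks a fixed number of URIs uniformly at random, without replacement, from the extracted list. The sample size of 5 is a hypothetical value chosen for illustration, not URIDNSBL.pm's actual limit, and hit_probability is a made-up helper name.

```python
from fractions import Fraction


def hit_probability(total_uris: int, sampled: int) -> Fraction:
    """Chance that one specific URI is among `sampled` URIs drawn
    uniformly at random, without replacement, from `total_uris`."""
    if total_uris <= 0:
        raise ValueError("total_uris must be positive")
    # If we sample at least as many URIs as exist, the URI is always included.
    return Fraction(min(sampled, total_uris), total_uris)


# Hypothetical message: 25 cid: URIs plus 1 real spamvertised URI.
SAMPLE_SIZE = 5  # assumed value, for illustration only

with_noise = hit_probability(25 + 1, SAMPLE_SIZE)
without_noise = hit_probability(1, SAMPLE_SIZE)

print(f"with cid: noise:    {with_noise} (about {float(with_noise):.0%})")
print(f"after filtering:    {without_noise}")
```

Under these assumptions the spamvertised URI is looked up in 5 out of 26 cases when the cid: entries stay in the pool, and always once they are filtered out.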
Regards, Yusuf
On Fri, Apr 23, 2004 at 04:15:49PM +0800, Yusuf Goolamabbas wrote:
> Hi, Currently URIDNSBL.pm uses SA's get_uri_list to get a list of URIs from a message. The current regex also seems to pick up URIs of the form cid:random_characters.
> cid:.* seems to refer to content IDs, i.e. attachments in the same message. When these URIs are run through uri_to_domain, they come back unchanged as cid:.*
> My feeling is that a message can contain artificial cid:.* URLs, which may skew the random set of domains used for SURBL lookups.
> I am not sure whether cid:.* URLs should be returned from get_uri_list() at all, or whether they should instead be stripped in uri_to_domain. Quite a few of the values after cid: look like host names or domain names.
I did a quick test and cid:.* urls are not checked against SURBL in SpamCopURI.
I use URI to do all the URI parsing and then check to see if it has a host method, which only schemes such as http, ftp, gopher, etc. actually implement. The cid scheme translates to an internal _foreign URI type, which has no host implementation.
--eric
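The host-based check described above can be illustrated with a rough Python analogue. This is not SpamCopURI's code, which uses Perl's URI module. It only shows the same idea with Python's standard urllib.parse: a cid: URI parses without a host component, so it drops out before any domain lookup. The helper name uris_with_host and the sample URIs are invented for the example.

```python
from urllib.parse import urlsplit


def uris_with_host(uris):
    """Keep only URIs whose parsed form carries a host component.

    Schemes such as cid: and mailto: have no '//authority' part,
    so urlsplit() reports hostname as None for them.
    """
    return [u for u in uris if urlsplit(u).hostname]


# Invented sample data: one cid: reference, one web URL, one mailto: link.
sample = [
    "cid:image001.png@example.com",
    "http://www.example.org/path",
    "mailto:someone@example.net",
]

print(uris_with_host(sample))
```

Filtering on the presence of a host, rather than on a list of unwanted schemes, means any other host-less scheme is excluded too, without having to be named explicitly.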
> Regards, Yusuf

_______________________________________________
Discuss mailing list
Discuss@lists.surbl.org
http://lists.surbl.org/mailman/listinfo/discuss