Hand-checking could make it feasible.
Definitely, that's the key. Even just checking the URLs (dumping the text with "lynx -dump", for example) would probably help.
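For what it's worth, here is a rough Python sketch of that URL-checking step. It assumes lynx is installed and on the PATH. The names extract_urls and dump_text, and the deliberately loose URL regex, are made up for illustration and are not part of any existing tool.

```python
import re
import subprocess

# Loose pattern for http(s) URLs; an illustrative assumption, not a full URL parser.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(body):
    """Return the unique http(s) URLs in a message body, in order of appearance."""
    seen = []
    for url in URL_RE.findall(body):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        url = url.rstrip(".,;:)")
        if url not in seen:
            seen.append(url)
    return seen


def dump_text(url, timeout=30):
    """Render a page to plain text by shelling out to `lynx -dump`."""
    result = subprocess.run(
        ["lynx", "-dump", url],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

The dumped text could then be eyeballed by hand or passed on to whatever scoring you settle on.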
Maybe the resulting HTML could be scored; perhaps SA could be adapted for that purpose. You could, for example, score based on whether the webserver's IP appears in DNS blacklists, etc. You would have to be careful about your user-agent, to avoid spammers presenting clean content to the checker.
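A minimal sketch of the DNS blacklist part, in Python, might look like the following. It is not how SA does it. The function names and the points_per_hit weight are invented for the example, it handles IPv4 only, and the zones you pass in are up to you. It relies on the usual DNSBL convention: reverse the octets of the address, append the list's zone, and treat any A record in the answer as a listing.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the lookup name resolves, i.e. the IP is listed in the zone."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or lookup failure): treat as not listed.
        return False


def score_ip(ip, zones, points_per_hit=1.0):
    """Add a fixed, arbitrary number of points for each zone that lists the IP."""
    return sum(points_per_hit for zone in zones if is_listed(ip, zone))
```

For example, dnsbl_query_name("192.0.2.1", "dnsbl.example") gives "1.2.0.192.dnsbl.example". Note that a lookup failure is counted as "not listed" here, so a broken resolver would silently score everything as clean.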