One of the (few) defining points of Web 2.0 is consuming remote data and services. That's great if your service provider is Amazon, Yahoo or Google, but not so great if it's your regional elected representatives, who may only just have arrived at Web 1.0. Being able to mine such sites for data is becoming more and more a part of everyday web development.
Anyway, while pondering what forummatrix or wikimatrix are lacking, I figured this was a good excuse to take BeautifulSoup for a spin; "a Python HTML/XML parser designed for quick turnaround projects like screen-scraping", one of the better (if not the best, depending on who you ask) tools of this kind (note there's also RubyfulSoup by the same author).
Beautiful Soup is capable of handling pretty much the worst HTML you can throw at it, and still gives you a usable data structure. For example, given some HTML like;
<i><b>Aargh!</i></b>
…and running through Beautiful Soup like;
from BeautifulSoup import BeautifulSoup
print BeautifulSoup('<i><b>Aargh!</i></b>').prettify()
…I get;
<i>
 <b>
  Aargh!
 </b>
</i>
…notice how it's corrected the order of the closing tags. This cleanup allows me to access the inner text like;
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<i><b>Aargh!</i></b>')
print soup.i.b.string
This isn't intended as a full tutorial – the documentation is extensive and excellent. Another link you should be aware of though is urllib2 – The Missing Manual, which describes Python's urllib2 library (which, among other things, provides an HTTP client).
Anyway, the mission was to mine MARC for the secunia advisories mailing list, to speed evaluating security records.
MARC provides a search interface which displays results in pages of up to 30 at a time. Aside from the fact it's all easily fetchable via HTTP GET requests, MARC doesn't seem to undergo regular HTML changes (it still looks the same as I remember, and those <font/> tags are a giveaway), which hopefully means anything mining its HTML won't be "broken" in the near future.
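The script below targets Python 2's urllib, but the same search URL can be built with Python 3's urllib.parse. A minimal sketch, assuming the query parameters MARC's search interface uses (l for the mailing list, s for the search string, r for the result page number):

```python
from urllib.parse import urlencode

def search_url(application, page=1):
    # l = mailing list, s = search string, r = result page number
    query = urlencode({'l': 'secunia-sec-adv', 's': application, 'r': page})
    return 'http://marc.theaimsgroup.com/?' + query

print(search_url('phpbb', 2))
# http://marc.theaimsgroup.com/?l=secunia-sec-adv&s=phpbb&r=2
```

urlencode takes care of escaping any characters in the application name that aren't safe in a query string.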
The result is advisories.py;
#!/usr/bin/python
"""
Pulls out secunia security advisories from
http://marc.theaimsgroup.com/?l=secunia-sec-adv

DO NOT overuse!

Make sure you read the following:
http://marc.theaimsgroup.com/?q=about#Robots

Also be aware that secunia _may_ feel you are making inappropriate
use of their advisories. For example they have strict rules regarding
content _on_ their site (http://secunia.com/terms_and_conditions/) but
these may not apply to the mailing list announcements.

License on the script is GPL: http://www.gnu.org/copyleft/gpl.html
"""
import urllib2, re, time
from urllib import urlencode
from BeautifulSoup import BeautifulSoup
def fetchPage(application, page = 1):
    """
    Fetches a page of advisories, using the marc search interface
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&%s' \
        % (urlencode({'s':application}), urlencode({'r':page}))
    return urllib2.urlopen(url)

def fetchMessage(mid):
    """
    Fetches a single advisory, given its marc message id
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&q=raw' \
        % urlencode({'m':mid})
    return urllib2.urlopen(url).read()
class LastPage(Exception):
    """
    Used to flag that there are no more pages of advisories to process
    """
    pass

class FlyInMySoup(Exception):
    """
    Used to indicate the HTML being parsed varies wildly from what
    was expected.
    """
    pass

class NotModified(Exception):
    """
    Used to indicate there are no new advisories
    """
    pass
class Advisories:
    """
    Controls BeautifulSoup, pulling out relevant information from a
    page of advisories and 'crawling' for additional pages as needed
    """
    maxpages = 10     # If there are more than this many pages, give up
    requestdelay = 1  # Delay between successive requests - be kind to marc!

    __nohits = re.compile('^No hits found.*')
    __addate = re.compile('.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
    __messageid = re.compile('.*m=([0-9]+).*')

    def __init__(self, application, lastMsgId = None):
        self.__application = application
        self.__lastMsgId = lastMsgId
        self.__advisories = []
        self.__pages = []
        self.__loaded = 0
    def __loadPage(self, page = 0):
        """
        Load a page and store it in mem as a BeautifulSoup instance
        """
        self.__pages.append(BeautifulSoup(fetchPage(self.__application, page+1)))
        time.sleep(self.requestdelay)

    def __hasAdvisories(self, page = 0):
        """
        Test whether a page has advisories. To be regarded as not having
        advisories, it must contain a font tag with the words "No hits
        found". Other input raises FlyInMySoup and will typically mean
        something is badly broken
        """
        font = self.__pages[page].body.find(name='font', size='+1')
        if not font:
            if self.__pages[page].body.pre is None:
                raise FlyInMySoup, "body > pre tag? advisories?\n%s" \
                    % self.__pages[page].prettify()
            return True
        if self.__nohits.match(font.string) == None:
            raise FlyInMySoup, "Nosir - dont like that font tag\n%s" \
                % font.prettify()
        return False
    def __hasAnotherPage(self, page = 0):
        """
        Hunts for an img src='images/arrright.gif' (Next) in the
        advisories page and if found returns a page number to make
        another request with. Otherwise raises a LastPage exception
        """
        if page >= self.maxpages: raise LastPage
        pre = self.__pages[page].body.pre
        imgs = pre.findAll(name='img', src='images/arrright.gif', limit=5)
        if len(imgs) > 0:
            return page + 1
        raise LastPage
    def __fetchAdvisories(self, page = 0):
        """
        Fetches a page of advisories, recursing if more pages of
        advisories were found
        """
        self.__loadPage(page)
        if self.__hasAdvisories(page):
            advisory = {}
            in_advisory = 0
            pre = self.__pages[page].body.pre
            for child in pre:
                if not in_advisory:
                    m = self.__addate.match(str(child))
                    if m is not None:
                        in_advisory = 1
                        advisory['date'] = m.group(1)
                else:
                    try:
                        advisory['mid'] = self.__messageid.match(child['href']).group(1)
                        advisory['desc'] = child.string.strip()
                        self.__advisories.append(advisory)
                        advisory = {}
                        in_advisory = 0
                    except:
                        pass

            # Some sanity checks...
            if len(self.__advisories) == 0:
                raise FlyInMySoup, "No advisories in body > pre!\n%s" % pre
            if in_advisory:
                raise FlyInMySoup, "Still looking for the last advisory"

            # More protection for marc
            if self.__lastMsgId and self.__advisories[0]['mid'] == str(self.__lastMsgId):
                raise NotModified, "Not modified - last message id: %s" \
                    % self.__lastMsgId

            try:
                nextpage = self.__hasAnotherPage(page)
            except:
                return
            self.__fetchAdvisories(nextpage)
    def __lazyFetch(self):
        """
        Fetch advisories but only when needed
        """
        if not self.__loaded:
            self.__fetchAdvisories()
            self.__loaded = 1

    def __iter__(self):
        self.__lazyFetch()
        return self.__advisories.__iter__()

    def __len__(self):
        self.__lazyFetch()
        return len(self.__advisories)
if __name__ == '__main__':
    import getopt, sys, csv
    from os import getcwd
    from os.path import isdir, isfile, realpath, join

    def usage():
        """
        advisories.py [-p=proxy_url] [-f] [-d=target_dir] <application>

        Pulls a list of security advisories for a given <application>
        Puts a summary list in <application>.csv and raw text in
        <application>_<msgid>.txt

        options:
            -d, --directory= (directory to write csv and raw msgs to)
            -f, --fetchmsgs (fetch raw message announcements as well)
            -h, --help (display this message)
            -p, --proxy=http://user:pass@proxy.isp.com:8080
        """
        print usage.__doc__

    def lastMsgId(csvfile):
        """
        Pull out the last message id from the csvfile. Used to test for
        changes to the advisories page
        """
        if not isfile(csvfile): return None
        try:
            fh = open(csvfile, 'rb')
            csvreader = csv.reader(fh, dialect='excel')
            csvreader.next()
            id = csvreader.next()[1]
            fh.close()
            return id
        except:
            return None
    app = None
    proxy = None
    fetchMsgs = 0
    dir = getcwd()

    try:
        opts, args = getopt.getopt(sys.argv[1:],
            "fhp:d:", ["help", "fetchmsgs", "proxy=", "directory="])
        for o, v in opts:
            if o in ("-h", "--help"):
                usage()
                sys.exit(0)
            if o in ("-f", "--fetchmsgs"):
                fetchMsgs = 1
            elif o in ("-p", "--proxy"):
                proxy = v
            elif o in ("-d", "--directory"):
                if isdir(realpath(v)):
                    dir = realpath(v)
                else:
                    raise getopt.error("Invalid dir %s" % v)
        if len(args) == 1:
            app = args[0]
        else:
            raise getopt.error("Supply an app name to fetch advisories for!")
    except getopt.error, msg:
        print msg
        print "for help use --help"
        sys.exit(2)
    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib2.ProxyHandler({"http" : proxy})
    else:
        # Prevent urllib2 from attempting to auto detect a proxy
        proxy_support = urllib2.ProxyHandler({})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    csvfile = join(dir, app+'.csv')
    advs = Advisories(app, lastMsgId(csvfile))
    if len(advs) > 0:
        fh = open(csvfile, 'wb')
        csvwriter = csv.writer(fh, dialect='excel')
        csvwriter.writerow(('date','mid','desc'))
        for a in advs:
            csvwriter.writerow((a['date'], a['mid'], a['desc']))
            if fetchMsgs:
                mfh = open(join(dir, "%s_%s.txt" % (app, a['mid'])), 'wb')
                mfh.write(fetchMessage(a['mid']))
                mfh.close()
        fh.close()
        print "%s advisories found for %s" % (len(advs), app)
    else:
        print "No advisories found for %s" % app
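The two regular expressions doing the heavy lifting in the Advisories class can be exercised standalone. The sample row and href below are made up for illustration, shaped like what MARC's results markup contains;

```python
import re

# Same patterns the Advisories class uses: one spots the date in a
# numbered result row, the other pulls the message id out of an href
addate = re.compile('.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
messageid = re.compile('.*m=([0-9]+).*')

# Made-up fragments shaped like a MARC results row and message link
row = '12. 2006-03-15'
href = '?l=secunia-sec-adv&m=114196033621661&w=2'

print(addate.match(row).group(1))      # 2006-03-15
print(messageid.match(href).group(1))  # 114196033621661
```

Note the re.DOTALL flag: it lets the leading .* swallow newlines, so the date is found even when the row text BeautifulSoup hands back spans several lines.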
Assuming you have a recent version of Python and Beautiful Soup 3.x+ installed (download the tarball, extract it somewhere and run $ python setup.py install
to install it into your Python library), you can run this script from the command line (it's intended for cron) like;
$ advisories.py phpbb
… and it will create a file phpbb.csv
containing all the advisories it found. There are a few other features, like proxy support and the ability to download the raw advisories, which you can read about by running $ advisories.py --help
. Make sure you read the warnings at the start of the script though!
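The proxy handling near the end of the script maps directly onto Python 3's urllib.request, where urllib2's handlers now live. A sketch of the same logic (no request is actually made here, and the proxy URL is a placeholder):

```python
import urllib.request

def build_proxy_opener(proxy=None):
    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib.request.ProxyHandler({'http': proxy})
    else:
        # An empty mapping stops urllib auto-detecting a proxy
        # from environment variables like http_proxy
        proxy_support = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(proxy_support,
                                       urllib.request.HTTPHandler)

opener = build_proxy_opener('http://user:pass@proxy.isp.com:8080')
```

Calling urllib.request.install_opener(opener) would then make every subsequent urlopen call go through that opener, which is exactly what the script does with urllib2.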
So mission basically complete. The interesting part is figuring out where to put checks in the code. While Beautiful Soup allows you to read pretty much anything SGML-like, a change in the HTML tag structure of MARC would break this script (it’s not an official API after all), so hopefully it’s primed to raise exceptions in the right places should manual intervention be required.
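That "fail loudly" approach boils down to a simple pattern: before extracting anything, check for a landmark the page is known to contain, and raise a dedicated exception the moment it's absent. A stdlib-only sketch of the idea (the landmark check and the date extraction here are illustrative, not lifted from the script):

```python
import re

class FlyInMySoup(Exception):
    """The page no longer looks the way the scraper expects."""

def extract_dates(html):
    # Landmark check: results are expected inside a <pre> block, so
    # its absence means the page layout has changed
    if '<pre>' not in html:
        raise FlyInMySoup('no <pre> block - has the page layout changed?')
    return re.findall(r'[0-9]{4}-[0-9]{2}-[0-9]{2}', html)

print(extract_dates('<pre>1. 2006-03-15 something</pre>'))  # ['2006-03-15']
try:
    extract_dates('<html><body>a totally different page</body></html>')
except FlyInMySoup as e:
    print('scraper needs attention:', e)
```

A cron job wrapping this can then distinguish "no new data" from "the scraper is broken" just by which exception propagates.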
Otherwise another project to investigate, if you’re getting into HTML mining, is webstemmer (Python again), which in some cases (e.g. a news site) may be smart enough to get you what you want with very little effort.
Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.