One of the (few) defining points of Web 2.0 is consuming remote data and services. That's great if your service provider is Amazon, Yahoo or Google, but not so great if it's your regional elected representatives, who may only just have arrived at Web 1.0. Being able to mine such sites for data is becoming more and more a part of everyday web development.
Anyway, while pondering what forummatrix or wikimatrix are lacking, I figured this was a good excuse to take BeautifulSoup for a spin; "a Python HTML/XML parser designed for quick turnaround projects like screen-scraping", one of the better (if not the best, depending on who you ask) tools of this kind (note there's also RubyfulSoup by the same author).
Beautiful Soup is capable of handling pretty much the worst HTML you can throw at it, and still gives you a usable data structure. For example, given some HTML like;
<i><b>Aargh!</i></b>
…and running through Beautiful Soup like;
from BeautifulSoup import BeautifulSoup
print BeautifulSoup('<i><b>Aargh!</i></b>').prettify()
…I get;
<i>
 <b>
  Aargh!
 </b>
</i>
…notice how it's corrected the order of the closing tags. This cleanup allows me to access the inner text like;
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<i><b>Aargh!</i></b>')
print soup.i.b.string
This isn't intended as a full tutorial – the documentation is extensive and excellent. Another link you should be aware of though is urllib2 – The Missing Manual, which describes Python's urllib2 library (which, among other things, provides an HTTP client).
Anyway, the mission was to mine MARC for the secunia advisories mailing list, to speed evaluating security records.
MARC provides a search interface which displays results in pages of up to 30 at a time. Aside from the fact it's all easily fetchable via HTTP GET requests, MARC doesn't seem to undergo regular HTML changes (it still looks the same as I remember, and those <font/> tags are a giveaway), which hopefully means anything mining its HTML won't be "broken" in the near future.
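The script below targets Python 2's urllib, but the same search URL can be built with Python 3's urllib.parse. A minimal sketch, assuming the query parameters MARC's search interface uses (l for the mailing list, s for the search string, r for the result page number):

```python
from urllib.parse import urlencode

def search_url(application, page=1):
    # l = mailing list, s = search string, r = result page number
    query = urlencode({'l': 'secunia-sec-adv', 's': application, 'r': page})
    return 'http://marc.theaimsgroup.com/?' + query

print(search_url('phpbb', 2))
# http://marc.theaimsgroup.com/?l=secunia-sec-adv&s=phpbb&r=2
```

urlencode takes care of escaping any characters in the application name that aren't safe in a query string.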
The result is advisories.py;
#!/usr/bin/python
"""
Pulls out secunia security advisories from
http://marc.theaimsgroup.com/?l=secunia-sec-adv

DO NOT overuse!

Make sure you read the following:
http://marc.theaimsgroup.com/?q=about#Robots

Also be aware that secunia _may_ feel you are making inappropriate
use of their advisories. For example they have strict rules regarding
content _on_ their site (http://secunia.com/terms_and_conditions/) but
these may not apply to the mailing list announcements.

License on the script is GPL: http://www.gnu.org/copyleft/gpl.html
"""
import urllib2, re, time
from urllib import urlencode
from BeautifulSoup import BeautifulSoup
def fetchPage(application, page = 1):
    """
    Fetches a page of advisories, using the marc search interface
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&%s' \
        % (urlencode({'s':application}), urlencode({'r':page}))
    return urllib2.urlopen(url)

def fetchMessage(mid):
    """
    Fetches a single advisory, given its marc message id
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&q=raw' \
        % urlencode({'m':mid})
    return urllib2.urlopen(url).read()
class LastPage(Exception):
    """
    Used to flag that there are no more pages of advisories to process
    """
    pass

class FlyInMySoup(Exception):
    """
    Used to indicate the HTML being parsed varies wildly from what
    was expected.
    """
    pass

class NotModified(Exception):
    """
    Used to indicate there are no new advisories
    """
    pass
class Advisories:
    """
    Controls BeautifulSoup, pulling out relevant information from a
    page of advisories and 'crawling' for additional pages as needed
    """
    maxpages = 10     # If there are more than this many pages, give up
    requestdelay = 1  # Delay between successive requests - be kind to marc!

    __nohits = re.compile('^No hits found.*')
    __addate = re.compile('.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
    __messageid = re.compile('.*m=([0-9]+).*')

    def __init__(self, application, lastMsgId = None):
        self.__application = application
        self.__lastMsgId = lastMsgId
        self.__advisories = []
        self.__pages = []
        self.__loaded = 0
    def __loadPage(self, page = 0):
        """
        Load a page and store it in mem as a BeautifulSoup instance
        """
        self.__pages.append(BeautifulSoup(fetchPage(self.__application, page+1)))
        time.sleep(self.requestdelay)

    def __hasAdvisories(self, page = 0):
        """
        Test whether a page has advisories. To be regarded as not having
        advisories, it must contain a font tag with the words "No hits
        found". Other input raises FlyInMySoup and will typically mean
        something is badly broken
        """
        font = self.__pages[page].body.find(name='font', size='+1')
        if not font:
            if self.__pages[page].body.pre is None:
                raise FlyInMySoup, "body > pre tag? advisories?\n%s" \
                    % self.__pages[page].prettify()
            return True
        if self.__nohits.match(font.string) == None:
            raise FlyInMySoup, "Nosir - dont like that font tag\n%s" \
                % font.prettify()
        return False
    def __hasAnotherPage(self, page = 0):
        """
        Hunts for an img src='images/arrright.gif' (Next) in the
        advisories page and if found returns a page number to make
        another request with. Otherwise raises a LastPage exception
        """
        if page >= self.maxpages: raise LastPage
        pre = self.__pages[page].body.pre
        imgs = pre.findAll(name='img', src='images/arrright.gif', limit=5)
        if len(imgs) > 0:
            return page + 1
        raise LastPage
    def __fetchAdvisories(self, page = 0):
        """
        Fetches a page of advisories, recursing if more pages of
        advisories were found
        """
        self.__loadPage(page)
        if self.__hasAdvisories(page):
            advisory = {}
            in_advisory = 0
            pre = self.__pages[page].body.pre
            for child in pre:
                if not in_advisory:
                    m = self.__addate.match(str(child))
                    if m is not None:
                        in_advisory = 1
                        advisory['date'] = m.group(1)
                else:
                    try:
                        advisory['mid'] = self.__messageid.match(child['href']).group(1)
                        advisory['desc'] = child.string.strip()
                        self.__advisories.append(advisory)
                        advisory = {}
                        in_advisory = 0
                    except:
                        pass

            # Some sanity checks...
            if len(self.__advisories) == 0:
                raise FlyInMySoup, "No advisories in body > pre!\n%s" % pre
            if in_advisory:
                raise FlyInMySoup, "Still looking for the last advisory"

            # More protection for marc
            if self.__lastMsgId and self.__advisories[0]['mid'] == str(self.__lastMsgId):
                raise NotModified, "Not modified - last message id: %s" \
                    % self.__lastMsgId

            try:
                nextpage = self.__hasAnotherPage(page)
            except:
                return
            self.__fetchAdvisories(nextpage)
    def __lazyFetch(self):
        """
        Fetch advisories but only when needed
        """
        if not self.__loaded:
            self.__fetchAdvisories()
            self.__loaded = 1

    def __iter__(self):
        self.__lazyFetch()
        return self.__advisories.__iter__()

    def __len__(self):
        self.__lazyFetch()
        return len(self.__advisories)
if __name__ == '__main__':
    import getopt, sys, csv
    from os import getcwd
    from os.path import isdir, isfile, realpath, join

    def usage():
        """
        advisories.py [-p=proxy_url] [-f] [-d=target_dir] <application>

        Pulls a list of security advisories for a given <application>
        Puts a summary list in <application>.csv and raw text in
        <application>_<msgid>.txt

        options:
            -d, --directory= (directory to write csv and raw msgs to)
            -f, --fetchmsgs (fetch raw message announcements as well)
            -h, --help (display this message)
            -p, --proxy=http://user:pass@proxy.isp.com:8080
        """
        print usage.__doc__

    def lastMsgId(csvfile):
        """
        Pull out the last message id from the csvfile. Used to test for
        changes to the advisories page
        """
        if not isfile(csvfile): return None
        try:
            fh = open(csvfile, 'rb')
            csvreader = csv.reader(fh, dialect='excel')
            csvreader.next()
            id = csvreader.next()[1]
            fh.close()
            return id
        except:
            return None
    app = None
    proxy = None
    fetchMsgs = 0
    dir = getcwd()

    try:
        opts, args = getopt.getopt(sys.argv[1:],
            "fhp:d:", ["help", "fetchmsgs", "proxy=", "directory="])
        for o, v in opts:
            if o in ("-h", "--help"):
                usage()
                sys.exit(0)
            if o in ("-f", "--fetchmsgs"):
                fetchMsgs = 1
            elif o in ("-p", "--proxy"):
                proxy = v
            elif o in ("-d", "--directory"):
                if isdir(realpath(v)):
                    dir = realpath(v)
                else:
                    raise getopt.error("Invalid dir %s" % v)
        if len(args) == 1:
            app = args[0]
        else:
            raise getopt.error("Supply an app name to fetch advisories for!")
    except getopt.error, msg:
        print msg
        print "for help use --help"
        sys.exit(2)
    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib2.ProxyHandler({"http" : proxy})
    else:
        # Prevent urllib2 from attempting to auto detect a proxy
        proxy_support = urllib2.ProxyHandler({})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    csvfile = join(dir, app+'.csv')
    advs = Advisories(app, lastMsgId(csvfile))
    if len(advs) > 0:
        fh = open(csvfile, 'wb')
        csvwriter = csv.writer(fh, dialect='excel')
        csvwriter.writerow(('date','mid','desc'))
        for a in advs:
            csvwriter.writerow((a['date'], a['mid'], a['desc']))
            if fetchMsgs:
                mfh = open(join(dir, "%s_%s.txt" % (app, a['mid'])), 'wb')
                mfh.write(fetchMessage(a['mid']))
                mfh.close()
        fh.close()
        print "%s advisories found for %s" % (len(advs), app)
    else:
        print "No advisories found for %s" % app
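The two regular expressions doing the heavy lifting in the Advisories class can be exercised standalone. The sample row and href below are made up for illustration, shaped like what MARC's results markup contains;

```python
import re

# Same patterns the Advisories class uses: one spots the date in a
# numbered result row, the other pulls the message id out of an href
addate = re.compile('.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
messageid = re.compile('.*m=([0-9]+).*')

# Made-up fragments shaped like a MARC results row and message link
row = '12. 2006-03-15'
href = '?l=secunia-sec-adv&m=114196033621661&w=2'

print(addate.match(row).group(1))      # 2006-03-15
print(messageid.match(href).group(1))  # 114196033621661
```

Note the re.DOTALL flag: it lets the leading .* swallow newlines, so the date is found even when the row text BeautifulSoup hands back spans several lines.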
Assuming you have a recent version of Python and Beautiful Soup 3.x+ installed (download the tarball, extract it somewhere and run $ python setup.py install
to install it into your Python library), you can run this script from the command line (it's intended for cron) like;
$ advisories.py phpbb
… and it will create a file phpbb.csv
containing all the advisories it found. There are a few other features, like proxy support and the ability to download the raw advisories, which you can read about by running $ advisories.py --help
. Make sure you read the warnings at the start of the script though!
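The proxy handling near the end of the script maps directly onto Python 3's urllib.request, where urllib2's handlers now live. A sketch of the same logic (no request is actually made here, and the proxy URL is a placeholder):

```python
import urllib.request

def build_proxy_opener(proxy=None):
    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib.request.ProxyHandler({'http': proxy})
    else:
        # An empty mapping stops urllib auto-detecting a proxy
        # from environment variables like http_proxy
        proxy_support = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(proxy_support,
                                       urllib.request.HTTPHandler)

opener = build_proxy_opener('http://user:pass@proxy.isp.com:8080')
```

Calling urllib.request.install_opener(opener) would then make every subsequent urlopen call go through that opener, which is exactly what the script does with urllib2.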
So mission basically complete. The interesting part is figuring out where to put checks in the code. While Beautiful Soup allows you to read pretty much anything SGML-like, a change in the HTML tag structure of MARC would break this script (it’s not an official API after all), so hopefully it’s primed to raise exceptions in the right places should manual intervention be required.
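That "fail loudly" approach boils down to a simple pattern: before extracting anything, check for a landmark the page is known to contain, and raise a dedicated exception the moment it's absent. A stdlib-only sketch of the idea (the landmark check and the date extraction here are illustrative, not lifted from the script):

```python
import re

class FlyInMySoup(Exception):
    """The page no longer looks the way the scraper expects."""

def extract_dates(html):
    # Landmark check: results are expected inside a <pre> block, so
    # its absence means the page layout has changed
    if '<pre>' not in html:
        raise FlyInMySoup('no <pre> block - has the page layout changed?')
    return re.findall(r'[0-9]{4}-[0-9]{2}-[0-9]{2}', html)

print(extract_dates('<pre>1. 2006-03-15 something</pre>'))  # ['2006-03-15']
try:
    extract_dates('<html><body>a totally different page</body></html>')
except FlyInMySoup as e:
    print('scraper needs attention:', e)
```

A cron job wrapping this can then distinguish "no new data" from "the scraper is broken" just by which exception propagates.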
Otherwise another project to investigate, if you’re getting into HTML mining, is webstemmer (Python again), which in some cases (e.g. a news site) may be smart enough to get you what you want with very little effort.
Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.