Snake Soup

One of the (few) defining points of Web 2.0 is consuming remote data and services, which is great if your service provider is Amazon, Yahoo or Google, but not so great if it’s your regional elected representatives, who may only just have arrived at Web 1.0. Being able to mine such sites for data is becoming more and more a part of everyday web development.

Anyway, while pondering what forummatrix or wikimatrix is lacking, I figured this was a good excuse to take BeautifulSoup for a spin: “a Python HTML/XML parser designed for quick turnaround projects like screen-scraping”, and one of the better (if not the best, depending on who you ask) tools of this kind (note there’s also RubyfulSoup by the same author).

Beautiful Soup is capable of handling pretty much the worst HTML you can throw at it and still gives you a usable data structure. For example, given some HTML like;

<i><b>Aargh!</i></b>

…and running it through Beautiful Soup like;


from BeautifulSoup import BeautifulSoup
print BeautifulSoup('<i><b>Aargh!</i></b>').prettify()

…I get;

<i>
 <b>
  Aargh!
 </b>
</i>

…notice how it’s changed the order of the closing tags so they nest properly. This cleanup allows me to access the inner text like;


from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<i><b>Aargh!</i></b>')
print soup.i.b.string

This isn’t intended as a full tutorial; the documentation is extensive and excellent. Another link you should be aware of, though, is urllib2 – The Missing Manual, which describes Python’s urllib2 library (which, among other things, provides an HTTP client).

Anyway, the mission was to mine MARC for the secunia advisories mailing list, to speed up evaluating security records.

MARC provides a search interface which displays results in pages of up to 30 at a time. Aside from the fact it’s all easily fetchable via HTTP GET requests, MARC doesn’t seem to undergo regular HTML changes (it still looks the same as I remember, and those <font/> tags are a giveaway), which hopefully means anything mining its HTML won’t be “broken” in the near future.
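As a rough sketch of what those GET requests look like, the snippet below only builds the search and raw-message URLs and fetches nothing. It assumes Python 3, where `urlencode` lives in `urllib.parse`, rather than the Python 2 used in the script that follows. The helper names `search_url` and `message_url` are made up for illustration; the query parameters (`l`, `s`, `r`, `m` and `q=raw`) are the ones the script itself sends.

```python
from urllib.parse import urlencode

BASE = "http://marc.theaimsgroup.com/"
LIST_NAME = "secunia-sec-adv"


def search_url(application, page=1):
    """Build the URL for one page of search results (hypothetical helper)."""
    # l = mailing list, s = search term, r = result page (1-based)
    query = urlencode({"l": LIST_NAME, "s": application, "r": page})
    return BASE + "?" + query


def message_url(mid):
    """Build the URL for the raw text of one message (hypothetical helper)."""
    # m = MARC message id, q=raw asks for the unformatted message
    query = urlencode({"l": LIST_NAME, "m": mid})
    return BASE + "?" + query + "&q=raw"


print(search_url("phpbb", 2))
print(message_url(12345))  # 12345 is a made-up message id
```

Letting `urlencode` do the escaping means an application name containing spaces or ampersands still produces a valid query string.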

The result in advisories.py;


#!/usr/bin/python
"""
Pulls out secunia security advisories from
http://marc.theaimsgroup.com/?l=secunia-sec-adv
DO NOT overuse!

Make sure you read the following:
    http://marc.theaimsgroup.com/?q=about#Robots

Also be aware that secunia _may_ feel you may be making inappropriate
use of their advisories. For example they have strict rules regarding
content _on_ their site (http://secunia.com/terms_and_conditions/) but
this may not apply to the mailing list announcements


License on the script is GPL: http://www.gnu.org/copyleft/gpl.html
"""
import urllib2, re, time
from urllib import urlencode
from BeautifulSoup import BeautifulSoup

def fetchPage(application, page = 1):
    """
    Fetches a page of advisories, using the marc search interface
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&%s' \
        % (urlencode({'s':application}), urlencode({'r':page}))
    return urllib2.urlopen(url)

def fetchMessage(mid):
    """
    Fetches a single advisory, given its marc message id
    """
    url = 'http://marc.theaimsgroup.com/?l=secunia-sec-adv&%s&q=raw' \
        % (urlencode({'m':mid}))
    return urllib2.urlopen(url).read()

class LastPage(Exception):
    """
    Used to flag that there are no more pages of advisories to process
    """
    pass

class FlyInMySoup(Exception):
    """
    Used to indicate the HTML being passed varies wildly from what
    was expected.
    """
    pass

class NotModified(Exception):
    """
    Used to indicate there are no new advisories
    """
    pass

class Advisories:
    """
    Controls BeautifulSoup, pulling out relevant information from a page of advisories
    and 'crawling' for additional pages as needed
    """
    
    maxpages = 10 # If there are more than this num pages, give up
    requestdelay = 1 # Delay between successive requests - be kind to marc!
    
    __nohits = re.compile('^No hits found.*')
    __addate = re.compile('.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
    __messageid = re.compile('.*m=([0-9]+).*')
    
    def __init__(self, application, lastMsgId = None):
        self.__application = application
        self.__lastMsgId = lastMsgId
        self.__advisories = []
        self.__pages = []
        self.__loaded = 0
    
    def __loadPage(self, page = 0):
        """
        Load a page and store it in mem as BeautifulSoup instance
        """
        self.__pages.append(BeautifulSoup(fetchPage(self.__application, page+1)))
        time.sleep(self.requestdelay)
    
    def __hasAdvisories(self, page = 0):
        """
        Test whether page has advisories. To be regarded as not having advisories,
        it must contain a font tag with the words "No hits found". Other input
        raises FlyInMySoup and will typically mean something is badly broken
        """
        font = self.__pages[page].body.find(name='font', size='+1')
        
        if not font:
            if self.__pages[page].body.pre is None:
                raise FlyInMySoup, "body > pre tag ? advisories?\n%s" \
                    % self.__pages[page].prettify()
            return True
        
        if self.__nohits.match(font.string) == None:
            raise FlyInMySoup, "Nosir - dont like that font tag?\n%s" \
                % font.prettify()
        
        return False
    
    def __hasAnotherPage(self, page = 0):
        """
        Hunts for a img src = 'images/arrright.gif' (Next) in
        the advisories page and if found returns a page number
        to make another request with. Otherwise raises a LastPage
        exception
        """
        if page >= self.maxpages: raise LastPage
        
        pre = self.__pages[page].body.pre
        imgs = pre.findAll(name='img', src='images/arrright.gif', limit=5)
        
        if len(imgs) > 0:
            return page + 1
        
        raise LastPage
    
    def __fetchAdvisories(self, page = 0):
        """
        Fetches a page of advisories, recursing if more pages of advisories
        were found
        """
        self.__loadPage(page)
        
        if self.__hasAdvisories(page):
            advisory = {}
            in_advisory = 0
            pre = self.__pages[page].body.pre
            for child in pre:
                if not in_advisory:
                    m = self.__addate.match(str(child))
                    if m is not None:
                        in_advisory = 1
                        advisory['date'] = m.group(1)
                else:
                    try:
                        advisory['mid'] = self.__messageid.match(child['href']).group(1)
                        advisory['desc'] = child.string.strip()
                        self.__advisories.append(advisory)
                        advisory = {}
                        in_advisory = 0
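                    except (AttributeError, KeyError, TypeError):
                        # Not the advisory link yet (a text node, or a tag
                        # without a matching href) - keep looking
                        pass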
            
            # Some sanity checks...
            if len(self.__advisories) == 0:
                raise FlyInMySoup, "No advisories in body > pre!\n%s" % pre
            
            if in_advisory:
                raise FlyInMySoup, "Still looking for the last advisory"
            
            # More protection for marc
            if self.__lastMsgId and self.__advisories[0]['mid'] == str(self.__lastMsgId):
                raise NotModified, "Not modified - last message id: %s" \
                    % self.__lastMsgId
            
            try:
                nextpage = self.__hasAnotherPage(page)
            except LastPage:
                return
            self.__fetchAdvisories(nextpage)
    
    def __lazyFetch(self):
        """
        Fetch advisories but only when needed
        """
        if not self.__loaded:
            self.__fetchAdvisories()
            self.__loaded = 1
    
    def __iter__(self):
        self.__lazyFetch()
        return self.__advisories.__iter__()
    
    def __len__(self):
        self.__lazyFetch()
        return len(self.__advisories)
    
    

if __name__ == '__main__':
    import getopt, sys, csv
    from os import getcwd
    from os.path import isdir, isfile, realpath, join
    
    def usage():
        """
    advisories.py [-p=proxy_url] [-f] [-d=target_dir] <application>
        
        Pulls a list of security advisories for a given <application>
        
        Puts a summary list in <application>.csv and raw text in
        <application>_<msgid>.txt
        
        options:
            -d, --directory= (directory to write csv and raw msgs to)
            -f, --fetchmsgs (fetch raw messages announcements as well)
            -h, --help (display this message)
            -p, --proxy=http://user:pass@proxy.isp.com:8080
        """
        print usage.__doc__
    
    def lastMsgId(csvfile):
        """
        Pull out the last message id from the csvfile. Used to test for
        changes in the advisories page
        """
        if not isfile(csvfile): return None
        try:
            fh = open(csvfile, 'rb')
            csvreader = csv.reader(fh, dialect='excel')
            csvreader.next()
            id = csvreader.next()[1]
            fh.close()
            return id
        except:
            return None
    
    app = None
    proxy = None
    fetchMsgs = 0
    dir = getcwd()
    
    try:
        
        opts, args = getopt.getopt(sys.argv[1:], 
            "fhp:d:", ["help", "fetchmsgs", "proxy=", "directory="])
        for o, v in opts:
            if o in ("-h", "--help"):
                usage()
                sys.exit(0)
            if o in ("-f", "--fetchmsgs"):
                fetchMsgs = 1
            elif o in ("-p", "--proxy"):
                proxy = v
            elif o in ("-d", "--directory"):
                if isdir(realpath(v)):
                    dir = realpath(v)
                else:
                    raise getopt.error("Invalid dir %s" % v)
        
        if len(args) == 1:
            app = args[0]
        else:
            raise getopt.error("Supply an app name to fetch advisories for!")
        
    except getopt.error, msg:
        print msg
        print "for help use --help"
        sys.exit(2)
    
    if proxy:
        # Use the explicit proxy passed as a CLI option
        proxy_support = urllib2.ProxyHandler({"http" : proxy})
    else:
        # Prevent urllib2 from attempting to auto detect a proxy
        proxy_support = urllib2.ProxyHandler({})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)
    
    csvfile = join(dir,app+'.csv')
    advs = Advisories(app, lastMsgId(csvfile))
    
    if len(advs) > 0:
        
        fh=open(csvfile, 'wb')
        csvwriter=csv.writer(fh, dialect='excel')
        csvwriter.writerow(('date','mid','desc'))
        
        for a in advs:
            csvwriter.writerow((a['date'], a['mid'], a['desc']))
            if fetchMsgs:
                mfh=open(join(dir, "%s_%s.txt" % (app, a['mid'])), 'wb')
                mfh.write(fetchMessage(a['mid']))
                mfh.close()
        
        fh.close()
        
        print "%s advisories found for %s" % (len(advs), app)
    
    else:
        print "No advisories found for %s" % app


Assuming you have a recent version of Python and Beautiful Soup 3.x installed (download the tarball, extract it somewhere and run $ python setup.py install to install it into your Python library), you can run this script from the command line (it’s intended for cron) like;

$ advisories.py phpbb

… and it will create a file phpbb.csv containing all the advisories it found. There are a few other features, like proxy support and the ability to download the raw advisories, which you can read about by running $ advisories.py --help. Make sure you read the warnings at the start of the script though!
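The CSV also doubles as the script’s memory: on the next run, the message id in the first data row is compared against the newest advisory, so an unchanged list stops after one request. Here is a minimal sketch of that lookup, assuming Python 3 and an in-memory file. The function name `last_msg_id` and the sample row are invented for illustration; the column order (date, mid, desc) is the one the script writes.

```python
import csv
import io


def last_msg_id(fh):
    """Return the message id from the first data row, or None if there is none."""
    reader = csv.reader(fh, dialect="excel")
    next(reader, None)       # skip the 'date,mid,desc' header row
    row = next(reader, None)  # newest advisory is written first
    if row is None or len(row) < 2:
        return None
    return row[1]


# Made-up sample content in the same layout as <application>.csv
sample = io.StringIO("date,mid,desc\r\n2006-10-05,12345,Example advisory\r\n")
print(last_msg_id(sample))
```

An empty or header-only file simply yields `None`, which the script treats as “fetch everything”.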

So mission basically complete. The interesting part is figuring out where to put checks in the code. While Beautiful Soup allows you to read pretty much anything SGML-like, a change in the HTML tag structure of MARC would break this script (it’s not an official API after all), so hopefully it’s primed to raise exceptions in the right places should manual intervention be required.
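To make the “fail loudly” idea concrete, here is a small sketch, assuming Python 3, that reuses the two regular expressions and the `FlyInMySoup` exception from the script. The helpers `parse_date` and `parse_mid` and the sample strings are hypothetical; they are not real MARC output.

```python
import re

# Patterns copied from the Advisories class in the script
ADDATE = re.compile(r'.*[0-9]+. ([0-9]{4}-[0-9]{2}-[0-9]{2}).*', re.DOTALL)
MESSAGEID = re.compile(r'.*m=([0-9]+).*')


class FlyInMySoup(Exception):
    """The HTML differs wildly from what was expected."""


def parse_date(text):
    """Extract the advisory date or complain (hypothetical helper)."""
    m = ADDATE.match(text)
    if m is None:
        raise FlyInMySoup("no advisory date in %r" % text)
    return m.group(1)


def parse_mid(href):
    """Extract the message id from a link or complain (hypothetical helper)."""
    m = MESSAGEID.match(href)
    if m is None:
        raise FlyInMySoup("no message id in %r" % href)
    return m.group(1)


# Made-up inputs shaped like a result row and its link
print(parse_date("  1. 2006-10-05  "))
print(parse_mid("?l=secunia-sec-adv&m=12345&w=2"))
```

Raising a dedicated exception rather than returning `None` means a redesign of the page shows up as an error in the cron mail instead of as a quietly empty CSV.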

Otherwise another project to investigate, if you’re getting into HTML mining, is webstemmer (Python again), which in some cases (e.g. a news site) may be smart enough to get you what you want with very little effort.

Harry Fuecks

Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.
