Simple Python Twitter Search API Crawler Class

I’ve been getting into Twitter (I’m @niallohiggins btw) a bit recently. One of the things I wanted to do was write a little program to periodically search for a specific tag and then process the results. The Twitter Search API is very easy to use, even if there are some annoying issues.

Here is a very simple class I wrote to issue searches and return the results. It also keeps track of the last high water mark (max_id) of the previous search, so you hopefully won’t get the same results twice – although you still want to code defensively for that in case there is a bug in Twitter. Feel free to use this code yourself. Note that you’ll have to implement your own ‘submit’ method.

import httlib
import json
import logging
import socket
import time
import urllib
 
SEARCH_HOST="search.twitter.com"
SEARCH_PATH="/search.json"
 
 
class TagCrawler(object):
    ''' Crawl twitter search API for matches to specified tag.  Use since_id to
    hopefully not submit the same message twice.  However, bug reports indicate
    since_id is not always reliable, and so we probably want to de-dup ourselves
    at some level '''
 
    def __init__(self, max_id, tag, interval):
        self.max_id = max_id
        self.tag = tag
        self.interval = interval
 
    def search(self):
        c = httplib.HTTPConnection(SEARCH_HOST)
        params = {'q' : self.tag}
        if self.max_id is not None:
            params['since_id'] = self.max_id
        path = "%s?%s" %(SEARCH_PATH, urllib.urlencode(params))
        try:
            c.request('GET', path)
            r = c.getresponse()
            data = r.read()
            c.close()
            try:
                result = json.loads(data)
            except ValueError:
                return None
            if 'results' not in result:
                return None
            self.max_id = result['max_id']
            return result['results']
        except (httplib.HTTPException, socket.error, socket.timeout), e:
            logging.error("search() error: %s" %(e))
            return None
 
    def loop(self):
        while True:
            logging.info("Starting search")
            data = self.search()
            if data:
                logging.info("%d new result(s)" %(len(data)))
                self.submit(data)
            else:
                logging.info("No new results")
            logging.info("Search complete sleeping for %d seconds"
                    %(self.interval))
            time.sleep(float(self.interval))
 
    def submit(self, data):
        pass

1 comment to Simple Python Twitter Search API Crawler Class

  • john conroy

    Thanks Niall — damn useful script ;)
    I need something similar and I’m glad I found some Python for a similar problem.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">