Python Web Scraping

I decided to do some web scraping in Python. The idea for this script is to aggregate all "HN: Who's Hiring" posts from the last few months to make them easily searchable. It's not super useful since the data only changes once per month, but I wanted to check out the PyQuery library, and the concepts can be extended for whatever scraping you need for your project. Let's get started.

This script is available on GitHub at https://github.com/smessimer/AHNWHScraper.

Clone the Git repo by doing:

git clone https://github.com/smessimer/AHNWHScraper.git

This will make a directory called AHNWHScraper with a file called scraper.py.

scraper.py depends on a web scraping library called PyQuery. PyQuery is nice because it lets you scrape web pages using jQuery-like selectors.
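
If you haven't used PyQuery before, here's a tiny standalone example (not part of scraper.py) of what those selectors look like:

# A quick taste of PyQuery's jQuery-like selectors.
from pyquery import PyQuery as pq

html = '<div><p class="greeting">Hello</p><p>World</p></div>'
doc = pq(html)

print(doc('p.greeting').text())           # "Hello" -- CSS class selector
print([pq(p).text() for p in doc('p')])   # iterate over all <p> elements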

So let's install the PyQuery dependency. I like to use virtualenv to handle dependencies because it gives every project its own virtual environment, so you can install packages without affecting other projects. This script uses Python 3, so run the following command in the directory you just cloned from GitHub:

virtualenv --python=python3 venv
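
If you don't have virtualenv installed, Python 3's built-in venv module should do the same job:

python3 -m venv venv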

Now that we have a new virtualenv, activate it and install PyQuery. We'll also need "requests" for making HTTP requests:

source venv/bin/activate
pip3 install PyQuery
pip3 install requests

Now, run the following:

python scraper.py

That will fetch the last six months' worth of "HN: Who's Hiring" posts and leave the output in results/results.html. Open it with your favorite web browser and search for job posts as needed.

As a side note: the output is ugly. It would be nice to extend this script to nicely format the output, and make things easily filterable and searchable. But this is a quick and dirty exercise, so I'll leave that for another time.

Let's look at the full code, then walk through it piece by piece.

# scraper.py
import sys, os
sys.path.append(os.path.dirname(__file__))

import requests
from pyquery import PyQuery as pq
import re

def ScrapeJobs():
    
    startUrl = "https://www.google.com/search?site=&source=hp&q=site%3Anews.ycombinator.com+hacker+news+who+is+hiring"
    response = requests.get(startUrl)
    doc = pq(response.content)

    regex = ".*item\?id=(\d+)"

    results = [pq(result).html() for result in doc('cite')]

    storyNums = [re.findall(regex, result) for result in results]

    print(storyNums)
    links = []
    for storyNum in storyNums:
        if storyNum:
            links.append(storyNum[0])

    print(links)

    links = ["https://news.ycombinator.com/item?id={}".format(link) for link in links]
    
    print(links)

    jobs = []

    for link in links:
        response = requests.get(link)
        doc = pq(response.content)
        for job in doc('div.comment'):
            jobs.append(pq(job).html())

    print(jobs)
    
    # Okay, we got the goods.  Now output to file that we'll open up on the site.
    try:
        if not os.path.exists("results"):
            os.makedirs("results")
    except OSError:
        pass
    filename = "results/results.html"
    try:
        os.remove(filename)
    except OSError:
        pass
    try:
        with open(filename, 'w') as f:
            html = "<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"news.css\">"
            html += "</head><body>"
            f.write(html)
            for j in jobs:
                f.write(j)
                f.write("<hr>")
            f.write("</body></html>")
    except IOError as e:
        print("I/O error({0}): {1}".format(e.errno, e.strerror))

    # Get the HN stylesheet
    cssUrl = "https://news.ycombinator.com/news.css"
    response = requests.get(cssUrl)
    filename = "results/news.css"
    try:
        os.remove(filename)
    except OSError:
        pass
    try:
        with open(filename, 'w') as f:
            f.write(response.text)
    except IOError as e:
        print("I/O error({0}): {1}".format(e.errno, e.strerror))


if __name__ == "__main__":
    ScrapeJobs()

Okay, let's go through the code.

import sys, os
sys.path.append(os.path.dirname(__file__))

import requests
from pyquery import PyQuery as pq
import re

First we append the script's own directory to sys.path, Python's module search path. Note that this doesn't control where the results directory ends up; that directory, which will hold the resulting web page with all of the job listings, gets created relative to whatever directory you run the script from.
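
If you wanted the output to always land next to scraper.py no matter where you run it from, one option (not something the script does) is to build the paths explicitly:

import os

# Hypothetical tweak: anchor the output paths to the script's own directory
# instead of the current working directory.
script_dir = os.path.dirname(os.path.abspath(__file__))
results_dir = os.path.join(script_dir, "results")
filename = os.path.join(results_dir, "results.html")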

Next we import requests, which we use to make HTTP requests to the sites we need; PyQuery, which we use to select elements in the DOM; and re, which we use for the regular expressions that pick the proper URLs out of Google's search results.

startUrl = "https://www.google.com/search?site=&source=hp&q=site%3Anews.ycombinator.com+hacker+news+who+is+hiring"
response = requests.get(startUrl)
doc = pq(response.content)

startUrl is a Google search URL for "HN: Who's Hiring" posts, restricted to news.ycombinator.com. We make the request to Google, convert the response into something PyQuery can work with, and store the result in doc.
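
As an aside, PyQuery can also fetch a URL for you directly, so something like this should be roughly equivalent; I stuck with requests since we use it for the other pages anyway:

# PyQuery fetching the page itself, instead of going through requests.
doc = pq(url=startUrl)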

regex = ".*item\?id=(\d+)"

results = [pq(result).html() for result in doc('cite')]

storyNums = [re.findall(regex, result) for result in results]

print(storyNums)

The regex string matches a Hacker News story URL inside a Google search result; the story's ID number is captured by (\d+). We then grab every cite element in the Google results page (these hold the displayed URL for each search result) and store the captured ID numbers in the list storyNums.
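
For example, if a cite element held a hypothetical HN URL, the regex would pull out just the numeric ID:

import re

regex = r".*item\?id=(\d+)"

# Hypothetical cite text from a Google result:
sample = "news.ycombinator.com/item?id=12345678"
print(re.findall(regex, sample))   # ['12345678']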

links = ["https://news.ycombinator.com/item?id={}".format(link) for link in links]

Now, we reassign links, using the HN story IDs, so that we have valid links to each of the last six "HN: Who's Hiring" posts.

jobs = []

for link in links:
    response = requests.get(link)
    doc = pq(response.content)
    for job in doc('div.comment'):
        jobs.append(pq(job).html())

Here is where we aggregate the job listings across all of the "HN: Who's Hiring" posts. For each link found above (there is one for each of the last six months), we fetch the page and append every comment on it to the jobs list.
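
One thing this loop doesn't do, but probably should: pause between requests so we're not hammering HN. A small tweak along these lines (not in the script) would take care of it:

import time

for link in links:
    response = requests.get(link)
    doc = pq(response.content)
    for job in doc('div.comment'):
        jobs.append(pq(job).html())
    time.sleep(1)  # be polite: wait a second between page fetches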

Now we need to make a folder and file to put the results into. I'll put it in ./results/results.html.

try:
    if not os.path.exists("results"):
        os.makedirs("results")
except OSError:
    pass
filename = "results/results.html"
try:
    os.remove(filename)
except OSError:
    pass

All the above code does is check whether a ./results directory exists and create it if it doesn't. Then it deletes any existing results.html file to make room for the new one.
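
As a side note, in Python 3 the existence check isn't strictly necessary, since os.makedirs can be told to ignore an already-existing directory:

# Equivalent one-liner (Python 3.2+):
os.makedirs("results", exist_ok=True)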

Now we'll write the actual results to file.

try:
    with open(filename, 'w') as f:
        html = "<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"news.css\">"
        html += "</head><body>"
        f.write(html)
        for j in jobs:
            f.write(j)
            f.write("<hr>")
        f.write("</body></html>")
except IOError as e:
    print("I/O error({0}): {1}".format(e.errno, e.strerror)

First, we open the file for writing ('w'). Next, we add a basic <html> wrapper for the job listings; we need this because the listings are just a bunch of <div>s, and the wrapper turns them into a proper HTML document. We write the first part of the wrapper to the file, iterate through each job listing in the jobs list and write it out, and finally close the <body> and <html> tags.

If you open results.html at this point it is really ugly, with no formatting. What we'll do next will make it only slightly less offensive to the eyes, but I'll call it good enough. Let's download the CSS from Hacker News and put it into the same folder.

# Get the HN stylesheet
cssUrl = "https://news.ycombinator.com/news.css"
response = requests.get(cssUrl)
filename = "results/news.css"
try:
    os.remove(filename)
except OSError:
    pass
   
try:
    with open(filename, 'w') as f:
        f.write(response.text)
except IOError as e:
    print("I/O error({0}): {1}".format(e.errno, e.strerror)

And the last two lines just run ScrapeJobs() when the file is invoked directly with python. And we're done!

Update #1: There really should be a try/except around the request operations.
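
Something along these lines would do it; this is just a sketch, with a hypothetical fetch() helper rather than anything in the repo:

def fetch(url):
    # Hypothetical helper: wrap each request so one bad URL doesn't kill the run.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as e:
        print("Request to {} failed: {}".format(url, e))
        return None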

Update #2: This grabs every comment in every "HN: Who's Hiring" post, replies included. It would be nice to keep only the actual job postings.
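
A rough heuristic I haven't implemented: top-level job posts in those threads usually start with a "Company | Location | Role" style header, so checking for pipe characters near the start of each comment would filter out most of the replies:

# Rough heuristic, not in scraper.py: keep comments that look like they start
# with the usual "Company | Location | Role" job-post header.
def looks_like_job_post(comment_html):
    if not comment_html:
        return False
    text = pq(comment_html).text()
    return "|" in text[:100]

job_posts = [j for j in jobs if looks_like_job_post(j)]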

-Seth
