Python Blog


JavaScript Web Scraping with PhantomJS and Selenium

In my previous post, I described how to create data structures for scraping stock market data from the web, using Barron's as an example. Since the data we wanted to scrape from Barron's website was fully rendered in HTML, we could simply use BeautifulSoup on its own to extract the required content. However, BeautifulSoup alone will not work if the content is embedded in JavaScript. For this we need to use PhantomJS and Selenium.

My primary motivation for this tutorial is that the documentation for Selenium and PhantomJS is quite convoluted, making a seemingly basic JavaScript scraping task seem complicated and unintuitive. Once I figured it out, however, I envisioned creating a step-by-step "cheat sheet" that would enable others to pick this up relatively quickly, without burying themselves in a web of countless Google and Stack Overflow searches like I did. Before we begin, install Selenium with pip and PhantomJS with Brew from your terminal:

$ pip install selenium
$ brew install phantomjs

Now that we have our required utilities set up, we need to inspect the HTML source code for the JavaScript-rendered site that we want to scrape. If you simply view the page source, you will only see the raw JavaScript under its script tags. We need to use the browser's Inspect tool instead, which shows the DOM after the JavaScript has executed and rendered its HTML. Then we must navigate through the web of nested HTML tags to find the content we are looking for. The HTML that results after JavaScript content is loaded is usually far more complex and intricate than a bare-bones HTML webpage (this makes intuitive sense, since JavaScript handles more complex tasks than HTML can on its own).

yahoo inspect screenshot 1

Hovering your cursor over different sections of the webpage will highlight the sections of code that correspond to them. Let's try to navigate our way to where the value for "Beta" is contained. Don't worry too much about the strange tag IDs; we're only concerned with finding the specific tags that we need. For now, we'll focus on grabbing just this one piece of information - the stock's beta value - to demonstrate the power of Selenium and PhantomJS. Afterwards, you can extrapolate the approach to whatever content you want to scrape. The easiest way to obtain the element we want is to use its XPath. Right-click the line of code corresponding to your desired element, select "Copy", and then select "Copy XPath". This copies the element's XPath to the clipboard, which we can then use in our Python script. I would suggest pasting the XPath somewhere safe, perhaps a simple text file, so that you can retrieve it later.
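To get a feel for how an XPath walks the nested tags, here is a small sketch using Python's built-in xml.etree.ElementTree on a simplified stand-in for the statistics table (the markup below is illustrative, not Yahoo's actual HTML):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for the statistics table we inspected on the page
snippet = """
<div>
  <table>
    <tbody>
      <tr><td>Beta</td><td>1.45</td></tr>
      <tr><td>52-Week Change</td><td>12.3%</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(snippet)

# The tail of the copied XPath: first row of the table body, second cell
elem = root.find('.//table/tbody/tr[1]/td[2]')
print(elem.text)  # -> 1.45
```

Note that ElementTree only supports a limited XPath subset and searches relative to the root (hence the leading `.//`), whereas the XPath Selenium copies for you starts from the document root - but the idea of descending tag by tag is the same.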

yahoo inspect screenshot 2

Now we can start writing our scraper. Unlike my last post, which contained more complex data structures, here we'll just define one StockQuote class to illustrate the power of Selenium and PhantomJS.

import sys

from selenium import webdriver


class StockQuote:
    def __init__(self, symbol, company):
        self.symbol = symbol
        self.company = company

    def scrape_yahoo(self):
        # Build Yahoo's key-statistics URL for the current stock symbol
        url = 'http://finance.yahoo.com/quote/%s/key-statistics?p=%s' % (self.symbol, self.symbol)

        # Instantiate the headless PhantomJS driver and load the page
        driver = webdriver.PhantomJS()
        driver.get(url)
        driver.implicitly_wait(20)

        # Use the XPath to get the Beta value
        elem = driver.find_element_by_xpath("//*[@id='main-0-Quote-Proxy']/section/div[2]/section/div/section/div[2]/div[2]/div/div[1]/table/tbody/tr[1]/td[2]")

        # Store the value
        self.beta = elem.text

        # Shut down the headless browser
        driver.quit()

The StockQuote class is relatively simple, with the constructor setting the self.symbol and self.company variables to the values that are passed in. The interesting part of the code begins with the scrape_yahoo() function. First, we need to define our url string. The Yahoo URL that stores key statistics for a particular stock contains the stock symbol twice, which is what makes each stock's URL unique. Therefore, we just use string formatting to build the unique url string for each stock.
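As a quick sanity check of that formatting step, here is the same URL construction run over a few example tickers (the tickers are arbitrary examples):

```python
# Yahoo's key-statistics URL contains the symbol twice, so we substitute it
# into both placeholders with the % operator
base = 'http://finance.yahoo.com/quote/%s/key-statistics?p=%s'

for symbol in ['AAPL', 'GOOG', 'MSFT']:
    print(base % (symbol, symbol))
# -> http://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
#    ... and so on for GOOG and MSFT
```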

Now for the tricky part. In order to utilize PhantomJS, we need to instantiate a driver with the base call to webdriver.PhantomJS(). PhantomJS is essentially a headless WebKit browser built for website testing. This means it uses a "virtual" browser rather than actually opening up a browser interface, which makes automated runs much faster, since the overhead of browser startup and shutdown is avoided. We then call the driver's get() method, passing our url variable as the main argument. It's generally good practice to add a specified wait time, especially if you are scraping many different URLs at once; implicitly_wait(20) tells the driver to keep polling for up to 20 seconds for an element to appear before giving up. Let's take a closer look at the next line of code.

# Use the XPath to get Beta value
elem = driver.find_element_by_xpath("//*[@id='main-0-Quote-Proxy']/section/div[2]/section/div/section/div[2]/div[2]/div/div[1]/table/tbody/tr[1]/td[2]")

We use the find_element_by_xpath() method to extract the element corresponding to a particular XPath. Take a second to notice the way the XPath string is structured - it's essentially the path through the HTML tree to the element we are scraping. This is why the last part of the XPath string is /table/tbody/tr[1]/td[2], which corresponds to the HTML source we inspected on Yahoo's website. The stock's Beta value is stored in the second td element of the table's first row, which is why the XPath ends in tr[1]/td[2].

Once we assign elem to the element we extracted, we can access its text member variable to get the string stored under that tag. We then simply store elem.text in the StockQuote member variable self.beta. Let's run a simple test case in our main method to see if the program correctly scrapes the Beta value for Apple stock (ticker symbol: AAPL).

def main():
    # Test out one quote to see if the correct text is returned
    my_quote = StockQuote('AAPL', 'Apple')
    my_quote.scrape_yahoo()
    print(my_quote.beta)

if __name__ == '__main__':
    sys.exit(main())

We create a new StockQuote object named my_quote and call its scrape_yahoo() method. We can then run the program and see the output printed to the console:

yahoo scraper console screenshot

Sure enough, the beta value printed out matches the value of 1.45 listed on the Yahoo page. Hopefully you can now see the power of Python with Selenium and PhantomJS for scraping JavaScript-embedded content from the web, among their many other uses. You can follow the same steps outlined in this post for each piece of data you need to scrape. If you wanted to scrape the beta value for many different stocks at once, for example, the corresponding XPath should be the same on each stock quote's Yahoo page. Thus, you could put everything in a for loop, making sure to change the value of url each time. Be sure to check out the GitHub repository if you want to download the source code for yourself. Thanks for reading!
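That multi-stock loop might look like the following sketch. Since scrape_yahoo() needs a live browser, the scrape_beta function here is a hypothetical stand-in returning made-up sample values; in a real script it would construct a StockQuote and call its scrape_yahoo() method:

```python
def scrape_beta(symbol):
    # Hypothetical stand-in for:
    #   quote = StockQuote(symbol, company_name)
    #   quote.scrape_yahoo()
    #   return quote.beta
    sample_values = {'AAPL': '1.45', 'MSFT': '1.20'}  # made-up sample data
    return sample_values.get(symbol, 'N/A')

# Scrape each symbol in turn; the XPath stays the same, only the URL changes
betas = {symbol: scrape_beta(symbol) for symbol in ['AAPL', 'MSFT', 'TSLA']}
print(betas)
```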