Web Scraping Dynamic Content with Python and Selenium

Introduction

Over the past several years, front-end design methods and technologies for websites have developed greatly, and frameworks such as React, Angular, and Vue have become extremely popular. These frameworks let front-end developers work efficiently and make websites, and the webpages they serve, far more usable and appealing for the user. Dynamically generated webpages can offer a faster user experience, because the elements on the page are created and modified in the browser itself. This contrasts with the more traditional method of server-based page generation, where the data and elements on a page are set once and a full round-trip to the web server is needed to fetch the next piece of data to show the user. When we scrape websites, the easiest targets are these traditional, simple, server-rendered ones: they are the most predictable and consistent.

While dynamic websites benefit the end user and the developer, they can be problematic when we want to scrape data from them. In a dynamic webpage, much of the functionality happens in response to user actions and to JavaScript code executing in the context of the browser. Data that appears 'on demand' as a result of user interaction with the page can be difficult to replicate programmatically at a low level; a browser is a sophisticated piece of software, after all!

As a result of this level of dynamic interaction and interface automation, a simple HTTP client struggles with these websites, and we need a different approach. The simplest solution to scraping data from dynamic websites is to use an automated web browser, such as Selenium, controlled by a programming language such as Python. In this guide, we will walk through setting up and using Selenium with Python to scrape dynamic websites, along with some of the useful features it offers that are not easily achieved with more traditional scraping methods.

Requirements

For this guide, we are going to use the Selenium library to both fetch and parse the data.

In general, once you have Python 3 installed correctly, you can install Selenium using the 'pip' utility:
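
    pip install selenium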

You will also need to install a driver for Selenium to control; Chrome works well for this. Install it using the chromedriver-install pip wrapper:

    pip install chromedriver-install

If pip is not installed, you can download and install it from pip.pypa.io.

For simple web scraping, an interactive editor like Microsoft Visual Studio Code (free to download and use) is a great choice, and it works on Windows, Linux, and Mac.

Getting Started Using Selenium

After running the pip installs, we can start writing some code. One of the initial blocks of code checks whether Chromedriver is installed and, if not, downloads everything required. I like to specify the folder that Chrome operates from, so I pass the download and install folder as an argument to the install function.

    import chromedriver_install as cdi

    # Download Chromedriver to the given folder if it is not already there
    path = cdi.install(file_directory='c:\\data\\chromedriver\\', verbose=True, chmod=True, overwrite=False, version=None)
    print('Installed chromedriver to path: %s' % path)

The main body of code is then called. This creates the Chromedriver instance, pointing it at the folder I installed the driver to.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome("c:\\data\\chromedriver\\chromedriver.exe")

Once this line executes, a version of Chrome will appear on the desktop. We can hide this (see the headless sketch below), but for our initial test purposes it's good to see what's happening. We direct the driver to open a webpage by calling the 'get' method, passing the page we want to visit as a parameter.

    driver.get("http://www.python.org")
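
As an aside, if you would rather the browser window never appear on screen, Chrome can be launched in headless mode. This is a minimal sketch, assuming the same Chromedriver path used above:

    from selenium import webdriver

    # Run Chrome without a visible window (assumes the same install path as earlier)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome("c:\\data\\chromedriver\\chromedriver.exe", options=options)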


The power of Selenium is that it allows Chromedriver to do the heavy lifting while acting as a virtual user, interacting with the webpage and sending your commands as required. To illustrate this, let's run a search on the Python website by adding some text to the search box. We first look for the element named 'q'; this is the input box used to send the search to the website. We clear it, then send in the keyboard string 'pycon'.

    elem = driver.find_element_by_name("q")
    elem.clear()
    elem.send_keys("pycon")


We can then virtually hit 'enter/return' by sending keystrokes to the input box; the webpage submits, and the search results are shown to us.

    elem.send_keys(Keys.RETURN)

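On a dynamic page, results like these may render a moment after the page itself loads, so it is common to wait for an element explicitly rather than assume it is already present. A minimal sketch using Selenium's WebDriverWait (the CSS selector here is purely illustrative, not taken from python.org):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 seconds for a (hypothetical) results list to appear
    results = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "ul.results"))
    )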

We have seen how simple it is to get up and running with Selenium. Next, we will look at how to navigate around a webpage, and indeed a full website, using navigation commands. As humans, when we want to carry out a task on a webpage we identify what we want to do visually: drag and drop, scroll, click a button, and so on. We then move the mouse and click, or use the keyboard, accordingly. Things are not that simple (yet!) with Selenium, so we need to give it some assistance. To navigate around a webpage, we need to tell Selenium which objects on the page to interact with. We do this by identifying page elements with XPaths and then calling functions appropriate to the task we wish to carry out.

In the case of our first example, the search box, we did the following:

• Tasked the driver to find a browser element named 'q'.
• Gave an instruction to send a series of characters to the element identified.
• Gave an instruction to send the key command for 'RETURN'.

This was the equivalent of us, as humans, clicking into the search box, entering the search term, and hitting RETURN or ENTER on our keyboard.

The pattern of navigation in Selenium is therefore:

• Identify the element you wish to interact with.
• Interact as required (set some text, extract a value, send a keystroke, etc.).

Elements can be located using an XPath with 'driver.find_element_by_xpath', or with more convenient methods such as 'find_element_by_id' and 'find_element_by_name':

    <input type="text" name="searchbox" id="someUniqueId" />

    element = driver.find_element_by_id("someUniqueId")
    element = driver.find_element_by_name("searchbox")
    element = driver.find_element_by_xpath("//input[@id='someUniqueId']")
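
Locating an element also lets us read data out of the page, which is ultimately what scraping is about. A small sketch, reusing the hypothetical input above:

    element = driver.find_element_by_id("someUniqueId")
    print(element.get_attribute("value"))   # the input's current value
    print(element.text)                     # the element's visible text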

Sending interaction instructions, such as setting text, selecting a radio button, or hitting 'RETURN' on the keyboard, can be achieved using the 'send_keys' method:

    1element.send_keys("Set some text")

    python

In addition to sending text, we can also send keystrokes, individually or combined with text.

    element.send_keys(Keys.RETURN)
    element.send_keys("Set text", Keys.ARROW_DOWN)

Working with Forms

Working with forms in Selenium is straightforward and combines what we have learned with some additional functionality. Filling in a form on a webpage generally involves setting the values of text boxes, perhaps selecting options from a drop-down or radio control, and clicking a submit button. We have already seen how to identify a text field and send data into it. Locating and selecting an option control requires us to:

• Iterate through its options.
• Mark the option we want as selected (here, by clicking it).

In the following example, we search a select control for the value 'Ms' and, when we find it, click it to select it:

    element = driver.find_element_by_xpath("//select[@name='Salutation']")
    all_options = element.find_elements_by_tag_name("option")
    for option in all_options:
        if option.get_attribute("value") == "Ms":
            option.click()
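
Selenium also ships a Select helper class that wraps this iterate-and-click pattern; the following sketch is an equivalent shortcut for the same (assumed) form:

    from selenium.webdriver.support.ui import Select

    # Select the 'Ms' option directly by its value attribute
    select = Select(driver.find_element_by_xpath("//select[@name='Salutation']"))
    select.select_by_value("Ms")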

The final part of working with forms is knowing how to send the data in the form back to the server. This is achieved either by locating the submit button and sending a click event, or by selecting any control within the form and calling 'submit' against it:

    # Either click the submit button directly...
    driver.find_element_by_id("SubmitButton").click()

    # ...or call submit() on any element within the form
    someElement = driver.find_element_by_name("searchbox")
    someElement.submit()

Smile! … Taking a Screenshot

One of the benefits of using Selenium is that you can take a screenshot of what the browser has rendered. This can be useful for debugging an issue, and also for keeping a record of what the webpage looked like when it was scraped.

Taking a screenshot could not be easier. We call the 'save_screenshot' method and pass in a location and filename to save the image:

    driver.save_screenshot('WebsiteScreenShot.png')
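
Finally, once the session is finished, it is good practice to close the browser explicitly, which also shuts down the Chromedriver process:

    driver.quit()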

Conclusion

Web scraping with Selenium can be a very useful tool in your bag of tricks, especially when faced with dynamic webpages. This guide has only scratched the surface; to learn more, please visit the Selenium website.

If you wish to learn more about web scraping, consider the courses Pluralsight has to offer.
