Introduction to Selenium IDE Plugin with Simple Web Data Scraping Examples

The BIAMI Dev Selenium IDE plugin can be used to scrape data from websites. In this article, we will explore how to use the plugin to scrape three types of data: URLs (from buttons and links), text (any text within the HTML markup), and HTML of elements or the whole page.

  • Access web browser

    First, configure the selenium_ide plugin with the webdriver you are using and the web browser's IP address and port. See the Selenium IDE documentation for more details. You can set these values as parameters so they can be reused conveniently throughout your process; a plain-Selenium sketch of what this configuration describes follows the table below.

    Stage:
    Business Task Name: Set up your webdriver
    Technical Task Name:
    Plugin: param_set p=p_browser_webdriver

    Stage:
    Business Task Name: Set up your browser IP
    Technical Task Name:
    Plugin: param_set p=p_browser_ip

    Stage:
    Business Task Name: Set up your browser port
    Technical Task Name:
    Plugin: param_set p=p_browser_port
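    To make this concrete, here is a minimal Python/Selenium sketch of the connection these parameters describe. This is an illustration, not the plugin's own code; the browser choice and the 127.0.0.1:4444 address are placeholder assumptions standing in for p_browser_webdriver, p_browser_ip and p_browser_port.

      # Minimal sketch: connect to a remote WebDriver (e.g. a chromedriver
      # or Selenium server) listening at the configured address and port.
      from selenium import webdriver

      options = webdriver.ChromeOptions()  # stands in for p_browser_webdriver
      driver = webdriver.Remote(
          command_executor="http://127.0.0.1:4444",  # p_browser_ip:p_browser_port
          options=options,
      )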
  • Open Page

    Navigate to the page where you will be collecting data. For this article, we will navigate to the https://www.biami.dev/setup page (a plain-Selenium equivalent of these tasks is sketched after the table).

    Stage:
    Business Task Name: Open https://www.biami.dev/ page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=open target=https://www.biami.dev/ value= capturedata= webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Click on "Learn" button
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=click target=xpath=//a[@title='Learn'] value= capturedata= webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Click on "Setup" button
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=click target=xpath=//a[@title='Setup'] value= capturedata= webdriver=|||p_browser_webdriver|||
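    For comparison, the open and click commands above map onto plain Selenium calls roughly as follows, continuing the session from the previous sketch:

      from selenium.webdriver.common.by import By

      # Open the page, then follow the "Learn" and "Setup" links using the
      # same XPath targets as the plugin tasks above.
      driver.get("https://www.biami.dev/")
      driver.find_element(By.XPATH, "//a[@title='Learn']").click()
      driver.find_element(By.XPATH, "//a[@title='Setup']").click()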
  • Store text from HTML elements

    Let’s extract and save the text of the page’s main heading, “DIY Automation Setup”. We will use the “store text” command; the text will be saved to your “./temp/process” directory in the 2_p_heading.txt file.

    Notice that the file is saved with a request ID prefix. This ID is randomly generated on each execution of your process, and the current request ID is available in the requestid parameter. We will use this parameter to assign the text from the file to our “p_heading” parameter so that the stored text can be used later (see the sketch after the table).

    Stage:
    Business Task Name: Store the text of "DIY Automation Setup" heading on the page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store text target=xpath=//div[@class='wpb_wrapper']/h1 value=p_heading capturedata=value webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Assign exported text to a parameter to use it later
    Technical Task Name:
    Plugin: param_set p=p_heading value=|||requestid|||_p_heading.txt
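    In plain Selenium terms, “store text” amounts to reading the element’s text; the sketch below also writes it to a requestid-prefixed file to mirror what the plugin does, assuming (for illustration) a request ID of 2 as in the file name mentioned above:

      # Read the heading text via the same XPath target.
      heading = driver.find_element(By.XPATH, "//div[@class='wpb_wrapper']/h1").text

      # Persist it the way the plugin does: <requestid>_p_heading.txt under
      # ./temp/process (request ID 2 is an assumed example).
      with open("./temp/process/2_p_heading.txt", "w") as f:
          f.write(heading)  # expected content: "DIY Automation Setup"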
  • Store URL of elements like links and buttons

    You can extract and save attribute values of HTML elements using the “store attribute” command. For example, saving an element’s “href” attribute gives you the URL the element points to (a plain-Selenium equivalent is sketched after the table).

    Let’s extract and save the URL of the “Download BIAMI Dev” link of the Setup page.

    Stage:
    Business Task Name: Store the URL of "Download BIAMI Dev" link on the page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store attribute target=xpath=//a[contains(.,'Download BIAMI Dev')]@href value=p_url capturedata=value webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Assign exported URL to a parameter to use it later
    Technical Task Name:
    Plugin: param_set p=p_url value=|||requestid|||_p_url.txt
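    The “store attribute” task corresponds to reading the element’s “href” attribute in plain Selenium, roughly like this (the requestid-prefixed file name is again an assumed example):

      # Locate the link by the same XPath and capture the URL it points to.
      link = driver.find_element(By.XPATH, "//a[contains(.,'Download BIAMI Dev')]")
      url = link.get_attribute("href")
      with open("./temp/process/2_p_url.txt", "w") as f:
          f.write(url)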
  • Extracting HTML from the page

    The “store innerhtml” command allows you to extract the whole HTML of an element or of the entire page. To save HTML, specify the XPath of the element in the “target” parameter - the plugin will extract the HTML inside that element.

    You can also target the HTML body element to extract the entire page’s HTML (see the sketch after the table).

    Stage:
    Business Task Name: Store the whole article inner HTML to parse it later
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store innerhtml target=xpath=//article[@role='main'] value=p_html capturedata=value webdriver=|||p_browser_webdriver|||
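    In plain Selenium, “store innerhtml” corresponds to reading the element’s innerHTML property, as in this sketch; swapping the XPath for //body would capture the entire page:

      # Capture everything inside the main article element.
      article = driver.find_element(By.XPATH, "//article[@role='main']")
      html = article.get_attribute("innerHTML")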
  • Conclusion

    The Selenium IDE plugin provides a quick and convenient way to extract data from web pages. You can use and manipulate this data throughout your process and save it to a database.

    A working example of the process built in this article is available on GitHub.