Introduction to Selenium IDE Plugin with Simple Web Data Scraping Examples

The BIAMI Dev Selenium IDE plugin can be used to scrape data from websites. In this article, we will explore how to use the plugin to scrape three types of data: URLs (from buttons and links), text (any text within the HTML markup), and HTML of elements or the whole page.

  • Access web browser

    First, configure the selenium_ide plugin with the webdriver you are using and the web browser's IP address and port. See the Selenium IDE documentation for more details. You can set these values as parameters so they can be reused conveniently throughout your process; a plain-Selenium sketch of what this configuration describes follows the table below.

    Stage:
    Business Task Name: Set up your webdriver
    Technical Task Name:
    Plugin: param_set p=p_browser_webdriver

    Stage:
    Business Task Name: Set up your browser IP
    Technical Task Name:
    Plugin: param_set p=p_browser_ip

    Stage:
    Business Task Name: Set up your browser port
    Technical Task Name:
    Plugin: param_set p=p_browser_port
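    To make this concrete, here is a minimal Python/Selenium sketch of the connection these parameters describe. This is an illustration, not the plugin's own code; the browser choice and the 127.0.0.1:4444 address are placeholder assumptions standing in for p_browser_webdriver, p_browser_ip and p_browser_port.

      # Minimal sketch: connect to a remote WebDriver (e.g. a chromedriver
      # or Selenium server) listening at the configured address and port.
      from selenium import webdriver

      options = webdriver.ChromeOptions()  # stands in for p_browser_webdriver
      driver = webdriver.Remote(
          command_executor="http://127.0.0.1:4444",  # p_browser_ip:p_browser_port
          options=options,
      )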
  • Open Page

    Navigate to the page where you will be collecting data. For this article, we will navigate to the https://www.biami.dev/setup page (a plain-Selenium equivalent of these tasks is sketched after the table).

    Stage:
    Business Task Name: Open https://www.biami.dev/ page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=open target=https://www.biami.dev/ value= capturedata= webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Click on "Learn" button
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=click target=xpath=//a[@title='Learn'] value= capturedata= webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Click on "Setup" button
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=click target=xpath=//a[@title='Setup'] value= capturedata= webdriver=|||p_browser_webdriver|||
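    For comparison, the open and click commands above map onto plain Selenium calls roughly as follows, continuing the session from the previous sketch:

      from selenium.webdriver.common.by import By

      # Open the page, then follow the "Learn" and "Setup" links using the
      # same XPath targets as the plugin tasks above.
      driver.get("https://www.biami.dev/")
      driver.find_element(By.XPATH, "//a[@title='Learn']").click()
      driver.find_element(By.XPATH, "//a[@title='Setup']").click()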
  • Store text from HTML elements

    Let’s extract and save the text of the page’s main heading, “DIY Automation Setup”. We will use the “store text” command; the text will be saved to your “./temp/process” directory in the 2_p_heading.txt file.

    Notice that the file is saved with a request ID prefix. This ID is randomly generated on each execution of your process, and the current request ID is available in the requestid parameter. We will use this parameter to assign the text from the file to our “p_heading” parameter so that the stored text can be used later (see the sketch after the table).

    Stage:
    Business Task Name: Store the text of "DIY Automation Setup" heading on the page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store text target=xpath=//div[@class='wpb_wrapper']/h1 value=p_heading capturedata=value webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Assign exported text to a parameter to use it later
    Technical Task Name:
    Plugin: param_set p=p_heading value=|||requestid|||_p_heading.txt
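    In plain Selenium terms, “store text” amounts to reading the element’s text; the sketch below also writes it to a requestid-prefixed file to mirror what the plugin does, assuming (for illustration) a request ID of 2 as in the file name mentioned above:

      # Read the heading text via the same XPath target.
      heading = driver.find_element(By.XPATH, "//div[@class='wpb_wrapper']/h1").text

      # Persist it the way the plugin does: <requestid>_p_heading.txt under
      # ./temp/process (request ID 2 is an assumed example).
      with open("./temp/process/2_p_heading.txt", "w") as f:
          f.write(heading)  # expected content: "DIY Automation Setup"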
  • Store URL of elements like links and buttons

    You can extract and save attribute values of HTML elements using the “store attribute” command. For example, saving an element’s “href” attribute gives you the URL the element points to (a plain-Selenium equivalent is sketched after the table).

    Let’s extract and save the URL of the “Download BIAMI Dev” link of the Setup page.

    Stage:
    Business Task Name: Store the URL of "Download BIAMI Dev" link on the page
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store attribute target=xpath=//a[contains(.,'Download BIAMI Dev')]@href value=p_url capturedata=value webdriver=|||p_browser_webdriver|||

    Stage:
    Business Task Name: Assign exported URL to a parameter to use it later
    Technical Task Name:
    Plugin: param_set p=p_url value=|||requestid|||_p_url.txt
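    The “store attribute” task corresponds to reading the element’s “href” attribute in plain Selenium, roughly like this (the requestid-prefixed file name is again an assumed example):

      # Locate the link by the same XPath and capture the URL it points to.
      link = driver.find_element(By.XPATH, "//a[contains(.,'Download BIAMI Dev')]")
      url = link.get_attribute("href")
      with open("./temp/process/2_p_url.txt", "w") as f:
          f.write(url)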
  • Extracting HTML from the page

    The “store innerhtml” command allows you to extract the whole HTML of an element or of the entire page. To save HTML, specify the XPath of the element in the “target” parameter - the plugin will extract the HTML inside that element.

    You can also target the HTML body element to extract the entire page’s HTML (see the sketch after the table).

    Stage:
    Business Task Name: Store the whole article inner HTML to parse it later
    Technical Task Name:
    Plugin: selenium_ide webbrowser=|||p_browser_ip|||:|||p_browser_port||| runcommand=store innerhtml target=xpath=//article[@role='main'] value=p_html capturedata=value webdriver=|||p_browser_webdriver|||
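    In plain Selenium, “store innerhtml” corresponds to reading the element’s innerHTML property, as in this sketch; swapping the XPath for //body would capture the entire page:

      # Capture everything inside the main article element.
      article = driver.find_element(By.XPATH, "//article[@role='main']")
      html = article.get_attribute("innerHTML")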
  • Conclusion

    The Selenium IDE plugin provides a quick and convenient way to extract data from web pages. You can use and manipulate this data throughout your process and save it to a database.

    A working example of the process built in this article is available on GitHub.