In this article, we will see how to control a web browser with Python using Selenium. Selenium is an open-source tool that automates web browsers. It provides a single interface that lets you write test scripts in programming languages such as Ruby, Java, NodeJS, PHP, Perl, Python, and C#.
To install this module, run this command in your terminal:
pip install selenium
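After installing, it can help to confirm that the package is importable from the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the pip install above reached this interpreter.
print("selenium installed:", is_installed("selenium"))
```

If this prints `False`, pip most likely installed the package into a different Python environment than the one running the script.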
For automation, download the latest Google Chrome along with the matching chromedriver.
Here we will automate the authorization at “//auth.geeksforgeeks.org” and extract the name, email, and institute name from the logged-in profile.
Initialization and Authorization
First, initiate the web driver using Selenium and send a GET request to the URL. Then inspect the HTML document and find the input tags that accept the username/email and password, along with the button tag used to sign in.
To send the user-supplied email and password to the respective input tags:
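To illustrate what "identifying the input and button tags" means, here is a small sketch that scans a piece of HTML with the standard-library parser. The markup in `SAMPLE_HTML` is a simplified, hypothetical stand-in for the login form (the real page is larger); only the field names `user` and `pass` and the button classes are taken from this article.

```python
from html.parser import HTMLParser

# Hypothetical, simplified login markup; the real page is more complex.
SAMPLE_HTML = """
<form>
  <input type="text" name="user">
  <input type="password" name="pass">
  <button class="btn btn-green signin-button">Sign In</button>
</form>
"""


class FormFieldFinder(HTMLParser):
    """Collect the name of every <input> and the classes of every <button>."""

    def __init__(self):
        super().__init__()
        self.input_names = []
        self.button_classes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and "name" in attrs:
            self.input_names.append(attrs["name"])
        elif tag == "button":
            # A missing class attribute yields an empty list of classes.
            self.button_classes.append((attrs.get("class") or "").split())


finder = FormFieldFinder()
finder.feed(SAMPLE_HTML)
print(finder.input_names)     # names usable for a lookup by name
print(finder.button_classes)  # classes usable in a CSS selector
```

In practice the same information is usually read off the browser's developer tools; the sketch only shows which attributes matter for the locators used in this article.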
# By is imported with: from selenium.webdriver.common.by import By
driver.find_element(By.NAME, 'user').send_keys(email)
driver.find_element(By.NAME, 'pass').send_keys(password)
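The email and password have to come from somewhere. One option, sketched here as an assumption rather than as this article's method, is to read them from environment variables so they are not stored in the script. The variable names `GFG_EMAIL` and `GFG_PASSWORD` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(environ=os.environ):
    """Read the login email and password from a mapping of environment variables."""
    # GFG_EMAIL and GFG_PASSWORD are hypothetical names chosen for this sketch.
    email = environ.get("GFG_EMAIL", "")
    password = environ.get("GFG_PASSWORD", "")
    if not email or not password:
        raise RuntimeError("set GFG_EMAIL and GFG_PASSWORD before running")
    return email, password


# Demonstration with an explicit mapping; call load_credentials() with no
# arguments to read the real environment instead.
email, password = load_credentials(
    {"GFG_EMAIL": "user@example.com", "GFG_PASSWORD": "not-a-real-password"}
)
print(email)
```

Failing early with a clear message is easier to debug than a login that silently submits empty fields.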
Identify the button tag by its CSS selector and click it through the Selenium webdriver:
driver.find_element(By.CSS_SELECTOR, 'button.btn.btn-green.signin-button').click()
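The selector `button.btn.btn-green.signin-button` reads as "a `button` tag that carries all three classes `btn`, `btn-green`, and `signin-button`". The sketch below mimics that rule in plain Python for this one simple selector shape. It is an illustration only, not a CSS engine, and the extra class `wide` is invented for the example.

```python
def matches(selector, tag, classes):
    """Check a `tag.class1.class2` selector against one element's tag and classes."""
    wanted_tag, *wanted_classes = selector.split(".")
    # Every class in the selector must be present; extra classes are allowed.
    return wanted_tag == tag and set(wanted_classes) <= set(classes)


selector = "button.btn.btn-green.signin-button"
print(matches(selector, "button", ["btn", "btn-green", "signin-button", "wide"]))  # True
print(matches(selector, "button", ["btn", "btn-green"]))                           # False
print(matches(selector, "a", ["btn", "btn-green", "signin-button"]))               # False
```

The order of the classes does not matter, which is why a selector copied from the page keeps working even if the site reorders its class attribute.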
Scraping Data
Scraping Basic Information from GFG Profile
After clicking on Sign in, a new page should be loaded containing the Name, Institute Name, and Email id. Identify the tags containing the above data and select them.
container = driver.find_elements(By.CSS_SELECTOR, 'div.mdl-cell.mdl-cell--9-col.mdl-cell--12-col-phone.textBold')
Get the text of each element in the returned list:
name = container[0].text
try:
    # The institution may be wrapped in a link
    institution = container[1].find_element(By.CSS_SELECTOR, 'a').text
except Exception:
    # No link present: fall back to the element's own text
    institution = container[1].text
email_id = container[2].text
Finally, print the output:
print({"Name": name, "Institution": institution, "Email ID": email_id})
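Indexing the list directly fails with an `IndexError` if the profile page did not load, for example after a failed login. As a hedged sketch, the scraped strings can be paired with their labels in a helper that reports the problem explicitly. The helper `build_profile` and the sample values are hypothetical.

```python
FIELDS = ("Name", "Institution", "Email ID")


def build_profile(texts):
    """Pair the scraped strings with their labels, failing loudly if any are missing."""
    if len(texts) < len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(texts)}")
    # zip stops after the last label, so extra strings are ignored.
    return dict(zip(FIELDS, (text.strip() for text in texts)))


# Hypothetical values standing in for the text of each selected element.
profile = build_profile(["Jane Doe", " Example University ", "jane@example.com"])
print(profile)
```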
Scraping Information from Practice tab
Click on the Practice tab and wait a few seconds for the page to load.
driver.find_elements(By.CSS_SELECTOR, 'a.mdl-navigation__link')[1].click()
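A fixed pause either wastes time or is too short on a slow connection. Selenium provides `WebDriverWait` for waiting on a condition; the helper below is a plain-Python stand-in that shows the underlying polling idea without needing a browser. The names `wait_until` and `page_ready` are made up for this sketch.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "has the page loaded yet?": becomes true on the third check.
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=2.0, poll=0.01)
print(len(attempts))  # 3
```

With a real driver, the condition could be a function that looks up the expected elements and returns the (possibly empty) list, so the wait ends as soon as they appear.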
Find the container that holds all the information, then select the grids inside it using a CSS selector.
container = driver.find_element(By.CSS_SELECTOR, 'div.mdl-cell.mdl-cell--7-col.mdl-cell--12-col-phone.whiteBgColor.mdl-shadow--2dp.userMainDiv')
grids = container.find_elements(By.CSS_SELECTOR, 'div.mdl-grid')
Iterate over the selected grids, extract the text from each one, and add it to a set or list for output.
res = set()
for grid in grids:
    res.add(grid.text.replace('\n', ':'))
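To make the effect of the loop concrete, here is the same transformation applied to plain strings. The grid texts are hypothetical placeholders for `grid.text`; the real labels and numbers depend on the profile.

```python
# Hypothetical grid texts standing in for `grid.text`.
grid_texts = [
    "Overall Coding Score\n350",
    "Problems Solved\n120",
    "Problems Solved\n120",  # a duplicate, dropped by the set
]

res = set()
for text in grid_texts:
    # Each grid renders as "label<newline>value"; join the two with a colon.
    res.add(text.replace("\n", ":"))

# A set has no defined order, so sort it for stable output.
print(sorted(res))  # ['Overall Coding Score:350', 'Problems Solved:120']
```

A set removes duplicates but loses the on-page order; a list is the better choice when the order of the grids matters.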
Below is the full implementation:
Python3
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

if __name__ == '__main__':

    email = ''
    password = 'password'

    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    options.add_argument('--log-level=3')

    # Selenium 4 takes the chromedriver path through a Service object
    driver = webdriver.Chrome(
        service=Service("C:/chromedriver/chromedriver.exe"),
        options=options)
    driver.set_window_size(1920, 1080)

    # Open the login page (https scheme assumed; the text gives "//auth.geeksforgeeks.org")
    driver.get('https://auth.geeksforgeeks.org')
    time.sleep(5)

    # Fill in the credentials and sign in
    driver.find_element(By.NAME, 'user').send_keys(email)
    driver.find_element(By.NAME, 'pass').send_keys(password)
    driver.find_element(
        By.CSS_SELECTOR, 'button.btn.btn-green.signin-button').click()
    time.sleep(5)

    # Basic profile information
    container = driver.find_elements(
        By.CSS_SELECTOR,
        'div.mdl-cell.mdl-cell--9-col.mdl-cell--12-col-phone.textBold')
    name = container[0].text
    try:
        # The institution may be wrapped in a link
        institution = container[1].find_element(By.CSS_SELECTOR, 'a').text
    except Exception:
        # No link present: fall back to the element's own text
        institution = container[1].text
    email_id = container[2].text

    print("Basic Info")
    print({"Name": name, "Institution": institution, "Email ID": email_id})

    # Switch to the Practice tab
    driver.find_elements(By.CSS_SELECTOR, 'a.mdl-navigation__link')[1].click()
    time.sleep(5)

    container = driver.find_element(
        By.CSS_SELECTOR,
        'div.mdl-cell.mdl-cell--7-col.mdl-cell--12-col-phone.'
        'whiteBgColor.mdl-shadow--2dp.userMainDiv')
    grids = container.find_elements(By.CSS_SELECTOR, 'div.mdl-grid')

    # Collect each grid's text as "label:value"
    res = set()
    for grid in grids:
        res.add(grid.text.replace('\n', ':'))

    print("Practice Info")
    print(res)

    driver.close()
    driver.quit()
Output: