Scrape google search results python beautifulsoup

  1. The default Google search URL doesn't use the # symbol. Instead, it should use the /search pathname with a ? query string:
---> https://google.com/#q=
---> https://www.google.com/search?q=cake
  2. Make sure you're passing a user-agent in the HTTP request headers. The default requests user-agent is python-requests, so sites can identify the request as coming from a bot and block it. When that happens you receive different HTML (some sort of error page) with different elements/selectors, which is why you were getting an empty result.
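As a sketch of point 1, the /search URL can be built with a properly encoded query string using only the standard library (the parameter names mirror the ones used later in this answer):

```python
from urllib.parse import urlencode

# Build a /search URL with a properly encoded query string;
# spaces in the query become '+' automatically.
params = {'q': 'tesla model 3', 'hl': 'en', 'gl': 'us'}
url = 'https://www.google.com/search?' + urlencode(params)
print(url)  # https://www.google.com/search?q=tesla+model+3&hl=en&gl=us
```

Note that requests does the same encoding for you when you pass a dict via the params= argument, as shown in the full code below.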

Check what your user-agent is, and look up lists of user-agents for mobile, tablets, etc.
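A quick way to see what requests sends when you don't set a user-agent yourself:

```python
import requests

# The default user-agent requests attaches to every request;
# it plainly identifies the client as a Python script.
default_ua = requests.Session().headers['User-Agent']
print(default_ua)  # e.g. python-requests/2.28.1
```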

Pass user-agent in request headers:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('YOUR_URL', headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, json  # lxml only needs to be installed, not imported

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'tesla',  # query 
  'gl': 'us',    # country to search from
  'hl': 'en',    # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

data = []

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']

  # sometimes there's no description, so handle the missing element
  try:
    snippet = result.select_one('#rso .lyLwlc').text
  except AttributeError:
    snippet = None

  # append inside the loop so every result is captured
  data.append({
     'title': title,
     'link': link,
     'snippet': snippet,
  })

print(json.dumps(data, indent=2, ensure_ascii=False))

-------------
'''
[
  {
    "title": "Tesla: Electric Cars, Solar & Clean Energy",
    "link": "https://www.tesla.com/",
    "snippet": "Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ..."
  },
  {
    "title": "Tesla, Inc. - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
    "snippet": "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ..."
  },
  {
    "title": "Nikola Tesla - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Nikola_Tesla",
    "snippet": "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the ..."
  }
]
'''

Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan just to test the API.

The difference in your case is that you don't have to figure out why the output is empty and what causes this to happen, bypass blocks from Google or other search engines, or maintain the parser over time. Instead, you only need to grab the data you want from the structured JSON, fast.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "tesla",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

----------
'''
Title: Tesla: Electric Cars, Solar & Clean Energy
Summary: Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ...
Link: https://www.tesla.com/

Title: Tesla, Inc. - Wikipedia
Summary: Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ...
Link: https://en.wikipedia.org/wiki/Tesla,_Inc.
'''

Disclaimer, I work for SerpApi.

More and more frequently, data science projects (and others) require additional data that can be obtained by web scraping. Google search is a common starting point.

In this guide we will walk through the script that obtains links from the google search results.

Let’s start with the imports. To obtain links from the top n pages of Google search results, I am using selenium and BeautifulSoup.

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

I am also using the webdriver_manager package, which comes in quite handy at times. With this package there is no need to download a web driver to your local machine if you don’t have one, and it also avoids manually specifying a custom path to the web driver. The package supports most browsers.

Next, we set up some preferences for the web browser. To avoid the browser window popping up when you run your code, I use the ‘headless’ argument. There are also a handful of other options that let you customise the browser for the task at hand.

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

We can now start the ChromeDriver. The first input argument requires a path to the driver; with webdriver_manager we can call install() instead, which downloads the driver and returns its path.

driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

Once the web driver is set up, we can move on to the main part of the code where we obtain web links for google search results.

from urllib.parse import quote_plus

# Query to obtain links
query = 'comprehensive guide to web scraping in python'
links = []  # Initiate empty list to capture final results
# Specify number of pages on google search, each page contains 10 links
n_pages = 20
for page in range(1, n_pages + 1):
    url = "http://www.google.com/search?q=" + quote_plus(query) + "&start=" + str((page - 1) * 10)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # each organic result link sits inside a div with class "yuRUbf"
    search = soup.find_all('div', class_="yuRUbf")
    for h in search:
        links.append(h.a.get('href'))

The code requires two inputs: the query of interest and the number of Google search pages to go through. Each page contains 10 search results.
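The page-to-offset arithmetic used in the loop above can be sketched as follows (start_offset is a hypothetical helper name, not part of the script):

```python
# Google paginates organic results 10 per page; the "start" URL
# parameter is the zero-based index of the first result on a page.
def start_offset(page, results_per_page=10):
    return (page - 1) * results_per_page

offsets = [start_offset(p) for p in range(1, 4)]
print(offsets)  # [0, 10, 20]
```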

Once the parameters are in place, we load the URL with the selenium webdriver, then parse the page source with BeautifulSoup using html.parser. The website data comes in HTML format; we can view the markup behind the page by inspecting it in the browser.


We are interested in the hyperlinks to the search results, which are stored in div elements with the class "yuRUbf".
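Since the same link can occasionally appear on more than one result page, it may be worth deduplicating the collected list while preserving order (the sample links here are hypothetical):

```python
# dict.fromkeys deduplicates while keeping first-seen order,
# unlike set(), which would scramble the ranking.
links = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',
]
unique_links = list(dict.fromkeys(links))
print(unique_links)  # ['https://example.com/a', 'https://example.com/b']
```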