How Web Scraping Is Used To Extract Amazon Prime Data Using Selenium And BeautifulSoup?

by Webscreenscraping Web Data
Posted: Jun 03, 2022

Selenium is a capable tool for web scraping, but it has some limitations, which is understandable because it was designed primarily for testing web applications. BeautifulSoup, on the other hand, was created specifically for web scraping and is also an excellent tool.

But BeautifulSoup has limitations of its own: it cannot reach data that sits behind a login wall, or data that only appears after user authentication or some other user action.

This is where Selenium comes in: it automates the user interactions with the website, and BeautifulSoup then scrapes the data once we are past the login wall.

Combining BeautifulSoup and Selenium gives you a very effective web scraping setup. Selenium can also extract data on its own, but BeautifulSoup is better suited to parsing the page's HTML.

We will use BeautifulSoup and Selenium to scrape movie details from Amazon Prime Video, namely the name, description, and rating of each movie, and then filter the movies by their IMDb ratings.

Let’s discuss the process of scraping Amazon Prime data.

First, import the necessary modules.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from time import sleep
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

Make three empty lists to keep track of the movie information.

movie_names = []
movie_descriptions = []
movie_ratings = []

ChromeDriver must be installed for this program to work properly. Make sure you install the driver version that matches your version of the Chrome browser.
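As a rough illustration of the version check, the sketch below compares the major version numbers in two version strings. ChromeDriver generally needs the same major version as the installed Chrome browser. The helper names `major_version` and `versions_match` and the sample version strings are hypothetical and only there for illustration.

```python
import re


def major_version(version_text):
    """Return the major version from text such as 'Google Chrome 114.0.5735.198'."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_text)
    if match is None:
        raise ValueError(f"No version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_version, driver_version):
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)
```

You could paste in the version shown on the browser's About page and the output of `chromedriver --version`, and download a different driver build if `versions_match` returns False.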

Now, define a function called open_site() that opens the sign-in page of Amazon Prime.

def open_site():
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-notifications")
    # Selenium 3 style call; newer Selenium 4 releases take a Service object instead of executable_path
    driver = webdriver.Chrome(executable_path='PATH/TO/YOUR/CHROME/DRIVER', options=options)
    # Open the Amazon sign-in page that redirects back to Prime Video
    driver.get(r'https://www.amazon.com/ap/signin?accountStatusPolicy=P1&clientContext=261-1149697-3210253&language=en_US&openid.assoc_handle=amzn_prime_video_desktop_us&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.primevideo.com%2Fauth%2Freturn%2Fref%3Dav_auth_ap%3F_encoding%3DUTF8%26location%3D%252Fref%253Ddv_auth_ret')
    sleep(5)
    # find_element_by_id was removed in newer Selenium 4 releases; there, use find_element(By.ID, ...)
    driver.find_element_by_id('ap_email').send_keys('ENTER YOUR EMAIL ID')
    driver.find_element_by_id('ap_password').send_keys('ENTER YOUR PASSWORD', Keys.ENTER)
    sleep(2)
    search(driver)
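Rather than typing the email and password into the script, the sign-in step could read them from environment variables. The following is only a sketch under a few assumptions: the variable names `PRIME_EMAIL` and `PRIME_PASSWORD` and the helpers `get_credentials` and `sign_in` are hypothetical, and the element IDs are the ones used in the function above, which Amazon may change at any time. It uses the generic `find_element(by, value)` call, where "id" is the string value of Selenium's `By.ID`.

```python
import os


def get_credentials(env=None):
    """Read the sign-in details from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    missing = [name for name in ("PRIME_EMAIL", "PRIME_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["PRIME_EMAIL"], env["PRIME_PASSWORD"]


def sign_in(driver, email, password):
    """Fill in the sign-in form on an already opened sign-in page."""
    # "id" is the string value of selenium's By.ID, so no extra import is needed
    driver.find_element("id", "ap_email").send_keys(email)
    password_box = driver.find_element("id", "ap_password")
    password_box.send_keys(password)
    # Submit the enclosing form, which should behave much like pressing Enter
    password_box.submit()
```

With this in place, the two send_keys lines in open_site() could be replaced by `sign_in(driver, *get_credentials())`, which keeps the password out of the source file.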

Let's create a search() function that looks for the genre specified.

def search(driver):
    driver.find_element_by_id('pv-search-nav').send_keys('Comedy Movies', Keys.ENTER)
    # Keep scrolling until the page height stops growing, so every lazily loaded tile is present
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("scrollTo(0, document.body.scrollHeight);")
        sleep(5)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    # Hand the fully loaded page over to BeautifulSoup for parsing
    html = driver.page_source
    Soup = soup(html, 'lxml')
    tiles = Soup.find_all('div', attrs={"class": "av-hover-wrapper"})
    for tile in tiles:
        movie_name = tile.find('h1', attrs={"class": "_1l3nhs tst-hover-title"})
        movie_description = tile.find('p', attrs={"class": "_36qUej _1TesgD tst-hover-synopsis"})
        movie_rating = tile.find('span', attrs={"class": "dv-grid-beard-info"})
        rating = (movie_rating.span.text)
        try:
            if float(rating[-3:])> 8.0 and float(rating[-3:])
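Two steps of search() can be sketched as small standalone helpers: the scroll-until-stable loop and the rating filter. This is a minimal sketch, not the article's own code. The names `scroll_to_bottom`, `parse_rating`, and `keep_movie` are hypothetical, the `max_rounds` cap is an added safeguard, and the sample rating text "IMDb 8.3" is an assumption about what the rating element contains. Only the lower bound of 8.0 comes from the listing above; the rest of that condition is not reproduced here.

```python
import re
from time import sleep


def scroll_to_bottom(driver, pause=5, max_rounds=50):
    """Scroll until the document height stops changing or max_rounds is reached."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded tiles time to appear
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


def parse_rating(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = re.findall(r"\d+(?:\.\d+)?", text or "")
    return float(matches[-1]) if matches else None


def keep_movie(rating_text, minimum=8.0):
    """True when the rating can be parsed and is above the minimum."""
    rating = parse_rating(rating_text)
    return rating is not None and rating > minimum
```

Compared with slicing the last three characters of the text, the regular expression also copes with missing or differently formatted ratings, and the cap on scroll rounds keeps the loop from running forever on a page that never stops growing.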

About the Author

Sam Morris writes articles and blogs related to data analytics and the data extraction process.

