How to Extract Amazon Reviews with Python Scrapy?
We search online every day before buying something: to compare one product with another, or to decide which of two products is better. We go straight to the reviews to see the star ratings and the positive feedback a product has received, right?
In this tutorial we will see how to extract Amazon reviews with Python Scrapy and save the data to an Excel spreadsheet or a CSV file. These are the data fields we will extract:
- Review’s Title
- Ratings
- Reviewer’s Name
- Review’s Description
- Review’s Content
- Helpful Counts
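To make the field list concrete, here is a minimal sketch of a record type holding these six fields and a helper that writes such records as CSV. It uses only the standard library; the names `Review` and `reviews_to_csv`, the snake_case field names, and the sample values are all assumptions made for illustration, not part of the original scraper.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Review:
    """One scraped review; field names are illustrative, not Amazon's."""
    title: str
    rating: str
    reviewer_name: str
    description: str
    content: str
    helpful_count: str


def reviews_to_csv(reviews):
    """Serialize Review records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Review)])
    writer.writeheader()
    for review in reviews:
        writer.writerow(asdict(review))
    return buffer.getvalue()


# Made-up sample record, only to show the shape of the output.
sample = Review(
    title="Great value",
    rating="5.0",
    reviewer_name="Sample Reviewer",
    description="Made-up description",
    content="Made-up review text.",
    helpful_count="3",
)
csv_text = reviews_to_csv([sample])
print(csv_text)
```

Keeping every value as a string at this stage leaves the cleaning decisions (numeric ratings, helpful counts) to the analysis step.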
Then we will do some basic analysis with Pandas on the dataset we have extracted. Some data cleaning will be needed, and at the end we will present price comparisons in a simple visual chart using Seaborn and Matplotlib.
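As a rough idea of what that cleaning can involve, the sketch below turns raw rating and helpful-count strings into numbers before they are loaded into Pandas. The input formats ("4.0 out of 5 stars", "12 people found this helpful") are assumptions about how the scraped text looks, and the helper names `parse_rating` and `parse_helpful_count` are hypothetical.

```python
import re


def parse_rating(raw):
    """Return the first number in a rating string as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def parse_helpful_count(raw):
    """Return the helpful-vote count as an int; 0 when no count is present."""
    text = raw.strip().lower()
    if text.startswith("one "):  # assumed wording for a single vote
        return 1
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else 0


print(parse_rating("4.0 out of 5 stars"))
print(parse_helpful_count("1,234 people found this helpful"))
```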
Between these two platforms, we found Shopee harder to extract data from, for two reasons: (1) it shows frustrating popup boxes when you enter its pages, and (2) its page elements do not have well-defined classes (a few elements have different classes).
For that reason, we will start by extracting from Lazada first. We will work with Shopee in Part 2!
First, we import the required packages:
# Web scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
We choose Scrapy, a Python framework for large-scale data scraping. Along with it, a few other packages are needed to extract Amazon product reviews.
- Requests: For sending a URL request
- Pandas: For exporting to CSV
- Pymysql: For connecting to a MySQL server and storing data there
- Math: For implementing mathematical operations
You can install these packages with pip or conda, as shown below.
pip install scrapy
OR
conda install -c conda-forge scrapy
Let’s Outline the Start URL for Scraping Seller’s Links
Let’s see what it looks like to extract reviews for a product. We have taken the URL: https://www.amazon.com/dp/B07N9255CG, which looks like this:
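A start URL for the spider can be derived from the product URL. The sketch below pulls the ten-character product ID (ASIN) out of the `/dp/` path and builds a reviews URL from it. The `product-reviews/<ASIN>?pageNumber=<n>` pattern is an assumption about Amazon's URL layout and should be checked in the browser; `extract_asin` and `build_reviews_url` are hypothetical helper names.

```python
import re

# An ASIN is assumed to be 10 uppercase letters or digits after "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})")


def extract_asin(product_url):
    """Return the ASIN embedded in a product URL, or raise ValueError."""
    match = ASIN_PATTERN.search(product_url)
    if match is None:
        raise ValueError(f"No ASIN found in {product_url!r}")
    return match.group(1)


def build_reviews_url(asin, page=1):
    """Build a reviews-page URL (assumed pattern, verify before use)."""
    return f"https://www.amazon.com/product-reviews/{asin}?pageNumber={page}"


start_url = build_reviews_url(extract_asin("https://www.amazon.com/dp/B07N9255CG"))
print(start_url)
```

In a Scrapy spider, a URL built this way would typically go into `start_urls` or be yielded from `start_requests`.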
When we go to its review section, it looks like the image below. The names shown in the reviews might differ.
However, if you carefully inspect the requests made in the background while the page loads, and play a bit with the next and previous pages of the reviews, you may notice that a POST request is loaded that carries the content of the page.
Here, we looked at the payload and headers needed for a successful response. If you have inspected the pages properly, you will identify what changes when you move from one page to another and how that is reflected in the requests.
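Since the page number is what changes between requests, one way to prepare the pagination is to compute how many pages exist and build one payload per page. This is only a sketch: the payload keys (`asin`, `pageNumber`, `pageSize`) and the page size of 10 are hypothetical placeholders, so copy the real field names and values from the request you see in the browser's network tab.

```python
import math

REVIEWS_PER_PAGE = 10  # assumed page size; confirm against the real request


def page_count(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Number of pages needed to cover all reviews."""
    return math.ceil(total_reviews / per_page)


def build_payloads(asin, total_reviews):
    """One form payload per review page (hypothetical field names)."""
    return [
        {"asin": asin, "pageNumber": str(page), "pageSize": str(REVIEWS_PER_PAGE)}
        for page in range(1, page_count(total_reviews) + 1)
    ]


payloads = build_payloads("B07N9255CG", 47)
print(len(payloads), payloads[0])
```

Each payload could then be sent as the form data of a POST request, together with the headers observed in the browser.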
source code: https://www.retailgators.com/how-to-extract-amazon-reviews-with-python-scrapy.php