How to Scrape E-commerce Sites Using Web Scraping to Compare Pricing Using Python — Part 1
Posted: Aug 05, 2021
We have often been told that, of Malaysia’s two big e-commerce platforms (Shopee and Lazada), one is usually cheaper and attracts bargain hunters, whereas the other caters to less price-sensitive shoppers.
So we decided to find out for ourselves… in the battle of the e-commerce platforms!
For that, we wrote a Python script that uses Selenium with the Chrome driver to automate the scraping and build a dataset. We will extract the following for each listing:
- Product’s Name
- Product’s Price
We will then do some basic analysis with Pandas on the extracted dataset. Some data cleaning is needed along the way, and at the end we present the price comparison in a simple visual chart using Seaborn and Matplotlib.
Of the two platforms, we found Shopee harder to scrape for two reasons: (1) it shows frustrating popup boxes when you enter its pages, and (2) its element classes are not well defined (some elements have several different classes).
For that reason, we start with Lazada. We will tackle Shopee in Part 2!
First, we import the required packages:
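Before importing anything, it can help to confirm that the packages this walkthrough names (Selenium, Pandas, Seaborn and Matplotlib) are actually installed. The snippet below is only a minimal sketch of such a pre-flight check; the helper name `missing_packages` and the `REQUIRED_PACKAGES` tuple are our own illustration, not part of the original script.

```python
import importlib.util

# Third-party packages named in this walkthrough; adjust to your own setup.
REQUIRED_PACKAGES = ("selenium", "pandas", "seaborn", "matplotlib")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install these first: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

If the list comes back empty, the usual imports (for example `import pandas as pd`, `import seaborn as sns` and `import matplotlib.pyplot as plt`, matching the aliases used later in this article) should succeed.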
Amazing! We still need to do some additional cleaning, though. You may have noticed something odd in the dataset: one of the items is actually a twin pack, which we need to remove.
Data cleaning is important for any kind of data analysis. Here we remove the entries we don’t need with the following code:
# This removes any entry with 'x2' in its title
dfL = dfL[dfL['ItemName'].str.contains('x2') == False]
For instance, had we searched for ‘nescafe gold refill’ rather than ‘nescafe gold refill 170g’, 117 items would have appeared instead of only the 9 we scraped earlier. These extra items are not the refill packs we were looking for, but capsule filter cups instead.
Nevertheless, it doesn’t hurt to filter the dataset again by the search term:
dfL = dfL[dfL['ItemName'].str.contains('170g') == True]
Finally, we add a column called ‘Platform’ and assign ‘Lazada’ to every entry. This lets us group the entries by platform (Shopee and Lazada) when we later set up the price comparison between the two.
dfL['Platform'] = 'Lazada'
Hurrah! Our dataset is finally clean and ready!
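To see the three cleaning steps working together, here is a small self-contained sketch. The rows in `dfL` are made-up sample data (not our scraped results), and the helper `clean_listings` is a hypothetical wrapper of our own; it simply combines the 'x2' filter, the '170g' filter and the ‘Platform’ column into one function.

```python
import pandas as pd

# Hypothetical sample rows; real scraped titles and prices will differ.
dfL = pd.DataFrame({
    "ItemName": [
        "Nescafe Gold Refill 170g",
        "Nescafe Gold Refill 170g x2 Twin Pack",
        "Nescafe Gold Capsule Filter Cup",
        "NESCAFE GOLD Refill Pack 170g",
    ],
    "Price": [27.90, 52.00, 15.50, 26.50],
})


def clean_listings(df, platform, must_contain="170g", exclude="x2"):
    """Keep rows whose title has `must_contain` but not `exclude`, then tag them."""
    names = df["ItemName"]
    # regex=False treats both patterns as plain substrings (case-sensitive).
    keep = (names.str.contains(must_contain, regex=False)
            & ~names.str.contains(exclude, regex=False))
    cleaned = df[keep].copy()  # copy so the new column is not set on a view
    cleaned["Platform"] = platform
    return cleaned.reset_index(drop=True)


clean = clean_listings(dfL, "Lazada")
print(clean)
```

Wrapping the steps in a function like this would also let the same cleaning be reused for the Shopee data, assuming it ends up with the same ‘ItemName’ and ‘Price’ columns.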
Now we visualize the data with Seaborn and Matplotlib. We use a box plot because it shows the following key statistics (known as a five-number summary) in a single chart:
- Highest price
- Lowest price
- Median price
- 25th and 75th percentile prices
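The five numbers can also be computed directly, which is a handy check against what the chart shows. Below is a minimal sketch using only Python's standard `statistics` module; the `prices` list is made-up sample data and `five_number_summary` is our own helper name. Note that libraries differ slightly in how they interpolate quartiles, so the values may not match a plotting library exactly.

```python
import statistics


def five_number_summary(prices):
    """Return min, 25th percentile, median, 75th percentile and max."""
    # n=4 splits the data into quartiles; "inclusive" treats the data
    # as the whole population rather than a sample of it.
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    return {
        "min": min(prices),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(prices),
    }


# Hypothetical prices in RM, for illustration only.
prices = [21.0, 25.0, 27.0, 27.5, 28.0]
summary = five_number_summary(prices)
print(summary)
```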
# Plot the chart
sns.set()
_ = sns.boxplot(x='Platform', y='Price', data=dfL)
_ = plt.title('Comparison of Nescafe Gold Refill 170g prices between e-commerce platforms in Malaysia')
_ = plt.ylabel('Price (RM)')
_ = plt.xlabel('E-commerce Platform')
# Show the plot
plt.show()
We can see that item prices range from RM21 to RM28, with the median price between RM27 and RM28. The box also has short ‘whiskers’, indicating that prices are fairly consistent with no significant outliers. To learn more about reading box plots, just go through this great summary!
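For readers wondering how a box plot decides what counts as an outlier: by default, Matplotlib and Seaborn draw each whisker out to the furthest data point that lies within 1.5 times the interquartile range of the box, and plot anything beyond that as a separate point. The sketch below reproduces that rule with the standard library; the sample prices and the helper name `whisker_bounds` are our own assumptions, and the quartile interpolation may differ slightly from the plotting library's.

```python
import statistics


def whisker_bounds(prices, whis=1.5):
    """Return (lower whisker, upper whisker, outliers) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [p for p in prices if low_fence <= p <= high_fence]
    outliers = [p for p in prices if p < low_fence or p > high_fence]
    return min(inside), max(inside), outliers


# Hypothetical prices in RM; 45.0 stands in for a mispriced or bundled listing.
sample_prices = [24.0, 26.0, 27.0, 27.5, 28.0, 45.0]
low, high, outliers = whisker_bounds(sample_prices)
print(low, high, outliers)
```

An empty outlier list on your own data would be consistent with the short whiskers and absence of stray points described above.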
Looking to scrape price data from e-commerce websites? Contact Retailgators for eCommerce Data Scraping Services.
source code: https://www.retailgators.com/how-to-scrape-e-commerce-sites-using-web-scraping-to-compare-pricing-using-python.php
ECommerce Web Scraping Tools & Services | Retailgators https://www.retailgators.com/index.