How Is Web Scraping Used to Scrape Google Play Store Data?
Apps have changed the way we interact with the world. Shopping, music, news, and dating are just a few of the things you can now do through an app. If you can think of it, there's probably an app for it. Some apps are better than others. You can learn what people like and dislike about an app by analyzing the language of its user reviews. Sentiment Analysis and Topic Modeling are two areas of Natural Language Processing (NLP) that can help with this, but not if you don't have any reviews to examine!
Before we get ahead of ourselves, though, we need to scrape and store some reviews. This blog will show you how to do just that with Python, using the google-play-scraper and PyMongo packages. You have several options for storing or saving your scraped reviews.
google-play-scraper provides real-time APIs for crawling the Google Play Store. It can be used to obtain:
- App information, including the app's title and description, as well as its price, genre, and current version
- App reviews
You can use the app function to retrieve app information, and the reviews or reviews_all functions to get reviews. We will go over how to use app briefly before concentrating on how to get the most out of reviews. While reviews_all is convenient in some situations, we prefer working with reviews. Once we get there, we will explain why and how, with plenty of code.
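To make the distinction concrete, here is a minimal sketch of both functions (using co.thefabulous.app, the app ID that appears later in this post, purely as an example):

## Minimal sketch: app() for app details, reviews_all() for every review
from google_play_scraper import app, reviews_all, Sort

## app() returns a dict of app details (title, description, score, installs, ...)
info = app('co.thefabulous.app', lang='en', country='us')
print(info['title'], info['score'])

## reviews_all() keeps paginating until it has fetched every matching review,
## which can take a long time for popular apps
all_rvws = reviews_all(
    'co.thefabulous.app',
    lang='en',
    country='us',
    sort=Sort.NEWEST
)
print(f'{len(all_rvws)} reviews fetched')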
Getting Started with Google-Play-Scraper
Step 1: Obtain App IDs
To scrape a mobile app, you'll need one piece of information: the app's ID. This can be found in the URL of the app's Google Play Store page. The part you need comes just after "id=", as illustrated in the image below.
In other cases, the URL does not end with the app ID; additional parameters follow it. In situations like these, you only need the section between "id=" and "&".
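If you prefer to pull the ID out of the URL programmatically, a small helper like this (the URL below is just a hypothetical example) does the trick:

## Sketch: extract the app ID from a Play Store URL using the standard library
from urllib.parse import urlparse, parse_qs

url = 'https://play.google.com/store/apps/details?id=co.thefabulous.app&hl=en_US'
app_id = parse_qs(urlparse(url).query)['id'][0]
print(app_id)  # co.thefabulous.app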
For this project, we will work with a collection of apps for mental health, mindfulness, and self-care. While exploring apps, we kept track of a lot of different information in a spreadsheet, which also seemed like a reasonable place to store each app's ID.
Step 2: Installing and Importing
Here, we will import everything we need, including PyMongo. You will also need to install MongoDB first; a guide for installing the Community Edition can be found here.
To be able to import each of the following, install the packages with pip as needed:
import pandas as pd

# for scraping app info and reviews from Google Play
from google_play_scraper import app, Sort, reviews

# for pretty printing data structures
from pprint import pprint

# for storing in MongoDB
import pymongo
from pymongo import MongoClient

# for keeping track of timing
import datetime as dt
from tzlocal import get_localzone

# for building in wait times
import random
import time

With MongoDB installed, we will establish a new database for the project and add new collections (essentially MongoDB's equivalent of tables in a relational database): one collection for app information and another for app reviews.
## Set up Mongo client
client = MongoClient(host='localhost', port=27017)

## Database for project
app_proj_db = client['app_proj_db']

## Set up new collection within project db for app info
info_collection = app_proj_db['info_collection']

## Set up new collection within project db for app reviews
review_collection = app_proj_db['review_collection']
Everything is now ready for scraping. All we need is the list of app IDs. Download a CSV copy of the spreadsheet and read it into a pandas DataFrame.
## Read in file containing app names and IDs
app_df = pd.read_csv('Data/app_ids.csv')
app_df.head()
## Get list of app names and app IDs
app_names = list(app_df['app_name'])
app_ids = list(app_df['android_appID'])
## Loop through app IDs to get app info
app_info = []

for i in app_ids:
    info = app(i)
    del info['comments']
    app_info.append(info)

## Pretty print the data for the first app
pprint(app_info[0])
Let's use PyMongo's insert_many method to save the app details in our info_collection. insert_many expects a list of dictionaries, which is exactly what we've just created.
## Insert app details into info_collection
info_collection.insert_many(app_info)
## Query the collection and create DataFrame from the list of dicts
info_df = pd.DataFrame(list(info_collection.find({})))
info_df.head()

Scraping App Reviews
As mentioned earlier, we prefer the reviews function over reviews_all, for a couple of reasons:
- If you really do want all of the reviews, you can still get them this way.
- Instead of doing everything for a single app in one go, you can break the process into batches for each app. This is helpful because it gives you options. You can do the following:
  - Get regular updates on how many reviews you've scraped.
  - Save scraped data as you go instead of waiting until the end.
The reviews function returns two variables. The first variable holds the review data we're after. The second variable is a continuation token, which is required to scrape more reviews than the count number.
rvws, token = reviews(
    'co.thefabulous.app',  # app's ID, found in app's url
    lang='en',             # defaults to 'en'
    country='us',          # defaults to 'us'
    sort=Sort.NEWEST,      # defaults to Sort.MOST_RELEVANT
    filter_score_with=5,   # defaults to None (get all scores)
    count=100              # defaults to 100
    # , continuation_token=token
)

The app ID is the first argument you need to pass to reviews. Reviews can be sorted in one of two ways: by most recent, or by whatever Google Play considers most relevant. You can also filter reviews based on their score.
The count parameter's main purpose is to tell the function how many reviews it should retrieve before ending. The following is taken from the google-play-scraper documentation:
"An excessively high count can pose complications. Because Google Play supports a limit of 200 reviews per page, it is designed to paginate and recrawl by 200 until the number of results reaches count."
As a side note, using reviews_all is essentially the same as setting count to infinity, which seems excessive to me.
In my opinion, it is better to think of count as a batch size. Simply set count to 200, get the reviews back along with a continuation token, and pass that token to the next call of the reviews function.
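As a rough sketch of that idea (again using co.thefabulous.app as an example ID), the first call returns a token, and the second call picks up where the first one stopped:

## Sketch: fetch reviews in batches of 200 using the continuation token
from google_play_scraper import Sort, reviews

batch_1, token = reviews(
    'co.thefabulous.app',
    lang='en',
    country='us',
    sort=Sort.NEWEST,
    count=200
)

## Pass the token back in to continue from where the first batch left off
batch_2, token = reviews(
    'co.thefabulous.app',
    lang='en',
    country='us',
    sort=Sort.NEWEST,
    count=200,
    continuation_token=token
)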
Review Scraping
Let us break down the code, which:
- Scrapes Google Play reviews by iterating through a list of app IDs.
- Stores the reviews in a MongoDB collection on a regular basis.
- Prints progress updates about the scraping operation.
Step 1: Setting Up the Loop
We previously saved our lists of app names and app IDs. The app names list isn't strictly required for scraping, but it will make it easier to identify which app each review belongs to later. In this block of code, we start the for loop that goes through all of our apps. Just double-check that your lists of names and IDs line up with each other.
## Loop through apps to get reviews
for app_name, app_id in zip(app_names, app_ids):

    # Get start time
    start = dt.datetime.now(tz=get_localzone())
    fmt = "%m/%d/%y - %T %p"

    # Print starting output for app
    print('---'*20)
    print('---'*20)
    print(f'***** {app_name} started at {start.strftime(fmt)}')
    print()

    # Empty list for storing reviews
    app_reviews = []

    # Number of reviews to scrape per batch
    count = 200

    # To keep track of how many batches have been completed
    batch_num = 0
We also need to add two keys to each newly obtained review dictionary. The data returned for each review does not explicitly identify which app the review is for, so attaching these identifiers now saves trouble later. A potential crisis averted!
    # Retrieve reviews (and continuation_token) with reviews function
    rvws, token = reviews(
        app_id,           # found in app's url
        lang='en',        # defaults to 'en'
        country='us',     # defaults to 'us'
        sort=Sort.NEWEST, # start with most recent
        count=count       # batch size
    )

    # For each review obtained
    for r in rvws:
        r['app_name'] = app_name # add key for app's name
        r['app_id'] = app_id     # add key for app's id

    # Add the list of review dicts to overall list
    app_reviews.extend(rvws)

    # Increase batch count by one
    batch_num += 1
    print(f'Batch {batch_num} completed.')

    # Wait 1 to 5 seconds to start next batch
    time.sleep(random.randint(1,5))
Each review has a unique identifier. We need to save these IDs before gathering our next batch of reviews so that we can compare them later.
    # Append review IDs to list prior to starting next batch
    pre_review_ids = []

    for rvw in app_reviews:
        pre_review_ids.append(rvw['reviewId'])
Now that we have received a token with the first batch of reviews, we can loop through the remaining batches of 200 reviews. In the code below, we set the maximum number of batches to 5,000 by using range(4999) (we already got our first batch). This means we'll get at most the first million reviews, assuming there are that many.
    # Loop through at most max number of batches
    for batch in range(4999):
        rvws, token = reviews( # store continuation_token
            app_id,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            count=count,
            # using token obtained from previous batch
            continuation_token=token
        )

        # Append unique review IDs from current batch to new list
        new_review_ids = []
        for r in rvws:
            new_review_ids.append(r['reviewId'])

            # And add keys for name and id to ea review dict
            r['app_name'] = app_name # add key for app's name
            r['app_id'] = app_id     # add key for app's id

        # Add the list of review dicts to main app_reviews list
        app_reviews.extend(rvws)

        # Increase batch count by one
        batch_num += 1
Next, we compare the set of review IDs we had before scraping the current batch with the set of review IDs we have now, after adding the current batch. If the two sets are the same length, the batch added no new reviews, so we break out of the loop and move on to the next app.
        # Break loop and stop scraping for current app if most recent batch
        # did not add any unique reviews
        all_review_ids = pre_review_ids + new_review_ids
        if len(set(pre_review_ids)) == len(set(all_review_ids)):
            print(f'No reviews left to scrape. Completed {batch_num} batches.\n')
            break

        # all_review_ids becomes pre_review_ids to check against
        # for next batch
        pre_review_ids = all_review_ids
Step 6: Save the Data and Print an Update After Every ith Batch
It's nice to get an update on how things are going when you're scraping tens of thousands or even millions of reviews. Perhaps more importantly, it's reassuring to know that your data is being safely stored as you go. The following code accomplishes both every 100 batches.
        # At every 100th batch
        if batch_num % 100 == 0:

            # print update on number of batches
            print(f'Batch {batch_num} completed.')

            # insert reviews into collection
            review_collection.insert_many(app_reviews)

            # print update about num of reviews inserted
            store_time = dt.datetime.now(tz=get_localzone())
            print(f"""
            Successfully inserted {len(app_reviews)} {app_name} reviews into
            collection at {store_time.strftime(fmt)}.\n
            """)

            # empty our list for next round of 100 batches
            app_reviews = []

        # Wait 1 to 5 seconds to start next batch
        time.sleep(random.randint(1,5))
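One detail worth keeping in mind: because reviews are only inserted on every 100th batch, any reviews collected since the last insert will still be sitting in app_reviews when the batch loop ends. A small addition like the sketch below, placed after the batch loop but still inside the outer loop over apps, makes sure those leftover reviews are stored too:

    # Sketch: after the batch loop ends, insert any reviews still waiting
    # in app_reviews since the last 100th-batch insert
    if app_reviews:
        review_collection.insert_many(app_reviews)
        store_time = dt.datetime.now(tz=get_localzone())
        print(f"""
        Successfully inserted the final {len(app_reviews)} {app_name} reviews into
        collection at {store_time.strftime(fmt)}.
        """)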