Web Scraper | Web Browser Data Crawler Using Automation.
Posted: Jan 19, 2024
Web Scraper is the primary Real Data API tool for scraping and crawling websites. It crawls arbitrary websites in the Chrome browser and collects the desired data using a JavaScript page function you supply. The actor supports both lists of URLs and recursive crawling, and it manages concurrency to optimize performance. Use it to scrape web data in the USA, UK, UAE, Australia, France, Canada, Germany, Singapore, Spain, Mexico, and other countries.
Web Scraper is an easy-to-use data scraping tool that crawls arbitrary web pages and collects structured data using a JavaScript program. It uses the Chromium browser to crawl pages, so it can access dynamically rendered content. The collected data is stored in a downloadable dataset in several digestible formats, such as CSV, JSON, or XML. You can set up the web data extractor and run it manually or automatically through the scraping API.
If you need more experience with front-end web development or web data scraping, you can take the web scraping course in our documentation. It walks you through many examples step by step. After that, you can continue with the Web Scraper tool.
What is the cost of using a Web Scraper?
Visit the pricing page to learn about the usage cost of Web Scraper. The cost projections depend on the size of the pages you scrape; estimates are given separately for simple HTML pages and for full web pages.
Usage
You need two things to start using this web scraping tool: first, tell the scraper which pages it should crawl by providing links, and second, tell it how to extract data from those pages.
The tool starts by crawling the pages listed in the Start URLs input parameter. You can use the Link selector, Glob patterns, or Pseudo-URLs to make the scraper follow links and add specific pages to the crawling queue. That way you can perform recursive crawling and, for example, discover every product in an online store.
To tell the scraper how to collect data from the loaded pages, you provide a Page function written in JavaScript. Because the tool uses a full-featured Chromium browser, writing the page function is similar to writing front-end code, and you can use jQuery and similar client-side libraries.
In short, the web scraper works in the following steps (a minimal page function sketch follows the list below).
- It adds all start URLs to the crawling queue.
- The tool takes the next URL from the queue and loads it in the Chromium browser.
- It executes the page function code on the loaded page and stores the output.
- Optionally, it finds all links on the page using the Link selector. If a link matches any of the Glob patterns or Pseudo-URLs and has not been visited yet, it adds that link to the crawling queue.
- If the queue still contains items, it repeats from the second step; otherwise, it ends the process.
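To make these steps concrete, here is a minimal, illustrative page function. The CSS selectors and the commented-out URL are placeholders, not part of the tool itself, and the example assumes jQuery injection is enabled:

// Illustrative sketch only - the selectors below are placeholders for whatever the target site uses.
async function pageFunction(context) {
    // context.jQuery is the jQuery instance injected into the page by the scraper.
    const $ = context.jQuery;

    // Collect some data from the loaded page.
    const title = $('title').text();
    const price = $('.product-price').first().text().trim();

    // Optionally, enqueue another page discovered in the page content:
    // await context.enqueueRequest({ url: 'https://www.example.com/next-page' });

    // Whatever is returned here is stored as one record in the dataset.
    return {
        url: context.request.url,
        title,
        price,
    };
}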
The tool has many settings you can configure to improve performance, set up cookies for logging in to websites, and so on. See the input configuration below for the list of settings.
Limitations
We designed the web data scraper to be easy to use and generic. It may not be the right choice if you need maximum flexibility or advanced performance.
As mentioned above, the scraper uses a full-featured Chromium browser. That can be overkill for websites that do not render content dynamically with JavaScript. For those sites, you can reach a much better level of performance with Cheerio Scraper, which extracts, processes, and exports data from raw HTML pages without the overhead of a browser.
Because the page function of the web scraping automation tool is executed in the context of the target web page, it only supports client-side JavaScript. If you want to use server-side libraries or control the Chromium browser through the underlying Puppeteer library directly, choose Puppeteer Scraper; if you prefer Playwright, explore our Playwright Scraper. Beyond these scrapers, you can also develop your own customized actor with Crawlee, our SDK, and Node.js for more control and flexibility.
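As a rough illustration of that last option, a minimal custom crawler built with Crawlee might look like the sketch below. It assumes Node.js with the crawlee package installed; the start URL is a placeholder:

// Minimal Crawlee sketch (assumptions noted above); run as an ES module.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData }) {
        // Store the page title and follow links discovered on the page.
        await pushData({ url: request.url, title: $('title').text() });
        await enqueueLinks();
    },
});

await crawler.run(['https://www.example.com/']);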
Input Configuration of Web Scraper
The web scraping tool accepts a number of configuration settings in its input. You can enter them as JSON or fill them in manually in the Real Data API actor page or Console. See the Input tab for the full list of input fields and their types.
Run Mode
The Run mode lets you switch between the scraper's two operating modes.
Production mode gives you full performance and control. Once you finish developing the tool, switch the scraper to production mode.
While you develop your tool, you will want to inspect events in the browser to debug your code. Development mode lets you do exactly that: it allows you to use Chrome DevTools to control the browser directly, prevents timeouts, and restricts concurrency to improve the DevTools experience. To access DevTools, open the Live View tab. Other debugging options can be configured in the Advanced configuration section.
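For reference, the run mode is just another input field. The exact field name and values in this sketch are assumptions based on common actor input schemas and may differ in your version:

{
    // Field name and values assumed; switch to "PRODUCTION" once development is finished.
    "runMode": "DEVELOPMENT"
}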
Start URLs
The Start URLs field represents the list of initial page links the tool will load. You can enter the links one by one, or, to add many links at once, compile them in a Google Sheet or CSV file and upload it. Every URL must start with the http:// or https:// protocol.
The scraper can also add new links to the queue on the fly, using the Glob patterns, Link selector, or Pseudo-URLs options, or by calling context.enqueueRequest() from the page function.
It is often useful to know which URL the scraper is currently loading so you can take different actions; for instance, you may treat product listing pages and product detail pages differently when scraping an online store. You can associate each URL with custom user data, a JSON object that is available to the page function's JavaScript code as context.request.userData. Check the tutorial for this scraper in our documentation to learn more.
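For illustration, here is a hedged sketch of how userData labels could distinguish listing pages from detail pages. The label values and CSS selectors are invented for the example, and jQuery injection is assumed to be enabled:

// Sketch only - the "label" values and selectors are placeholders.
async function pageFunction(context) {
    const $ = context.jQuery;
    const { label } = context.request.userData;

    if (label === 'LISTING') {
        // Enqueue every product detail link found on the listing page.
        const links = $('a.product-link').map((_, el) => $(el).attr('href')).get();
        for (const href of links) {
            await context.enqueueRequest({
                url: new URL(href, context.request.url).href,
                userData: { label: 'DETAIL' },
            });
        }
        return null; // nothing to store for listing pages
    }

    // Detail pages: return one record per product.
    return {
        url: context.request.url,
        name: $('h1').text().trim(),
    };
}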
Link Selector
The Link selector field contains a CSS selector used to find links to other pages to crawl; only links that also match the Glob patterns or Pseudo-URLs are added to the crawling queue.
Proxy Configuration
The Proxy configuration option determines which proxy servers the scraper uses:
- Apify Proxy (automatic) - the scraper loads all web pages through Apify Proxy in automatic mode. In this mode, the proxy uses all proxy groups available to the user, and for each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available proxy groups on the Proxy page in Apify Console.
- Apify Proxy (selected groups) - the scraper loads all web pages using Apify Proxy with the selected groups of proxy servers.
- Custom proxies - the scraper uses a custom list of proxy servers. The proxies must be specified in the scheme://user:password@host:port format; multiple proxies should be separated by a space or a new line. The URL scheme can be either http or socks5. User and password may be omitted, but the port must always be present.
Example:
http://bob:password@proxy1.example.com:8000
http://bob:password@proxy2.example.com:8000

You can set up the proxy configuration programmatically when calling the scraper via our API by setting the proxyConfiguration field. It accepts a JSON object in the following format:
{
    // Indicates whether to use Apify Proxy or not.
    "useApifyProxy": Boolean,
    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, Apify Proxy will use the automatic mode.
    "apifyProxyGroups": String[],
    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}

Using Web Scraper to Log In to Websites
Using the Initial cookies field, you can set the cookies the scraper will use to log in to the target website. Cookies are small text files that the web browser stores on your device; many websites use them to save information about your current login session. You can transfer this login information to the scraper input and access the target website as a logged-in user. Learn more about how browser automation tools log in to websites by transferring session cookies in our dedicated tutorial.
Note that cookies have a limited lifetime and expire after a certain time. You must update the cookies frequently so the scraper can keep logging in to the website. Alternatively, you can use the page function to actively keep the scraper logged in. To learn more, check out our guide on logging in to websites with Puppeteer.
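As one possible approach, the page function can check for signs of being logged out and fail the request so it is retried once fresh cookies are supplied. The "Log in" link selector below is a made-up placeholder, and jQuery injection is assumed to be enabled:

// Sketch only - the selector is a placeholder for whatever the target site shows to anonymous visitors.
async function pageFunction(context) {
    const $ = context.jQuery;

    if ($('a:contains("Log in")').length > 0) {
        // Throwing makes the scraper retry the request instead of storing data scraped while logged out.
        throw new Error('Session expired - update the initial cookies in the input.');
    }

    return { url: context.request.url, title: $('title').text() };
}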
The web scraping tool expects the Initial cookies field to contain the cookies as a JSON array, with one object per cookie, as in the following example.
[ { "Name": " Ga", "Value": "GA1.1.689972112. 1627459041", "Domain": ".Apify.Com", "HostOnly": False, "Path": "/", "Secure": False, "HttpOnly": False, "SameSite": "No_restriction", "Session": False, "FirstPartyDomain": "", "ExpirationDate": 1695304183, "StorelId": "Firefox-Default", "Id": 1 } ]Advanced ConfigurationPre-Navigation Hooks
This is an array of functions that are executed before the main page function runs. All of these functions receive the same context object that is passed to the page function, plus a second object with DirectNavigationOptions.
You can see the available options here:
preNavigationHooks: [
    async ({ id, request, session, proxyInfo }, { timeout, waitUntil, referer }) => {}
]

Unlike in Cheerio, Puppeteer, and Playwright Scrapers, the hook does not receive an Actor object in its parameters, because in Web Scraper the hook is executed inside the web browser.
To learn more, check the documentation on Puppeteer hooks and the pre-navigation hook types to see how the objects are passed into the functions.
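Purely as an illustration, a pre-navigation hook could look like the following. The body is a placeholder that only logs which page is about to be opened; since the hook runs inside the browser, console.log writes to the browser console:

preNavigationHooks: [
    async ({ request }, { waitUntil }) => {
        // Placeholder body: record which page is about to be opened and with which waitUntil setting.
        console.log(`Opening ${request.url} (waitUntil: ${waitUntil})`);
    },
]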
Post-Navigation Hooks
This is an array of functions that are executed after the main page function completes. The only parameter the functions receive is the crawlingContext object.
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response }) => {}
]

As with the pre-navigation hooks, the hook does not receive an Actor object in its parameters, because in Web Scraper the hook is executed inside the web browser.
To learn more, check the documentation on Puppeteer hooks and the post-navigation hook types to see how the objects are passed into the functions.
Insert Breakpoint
This property has no effect when the Run mode is set to Production. When set to Development, it injects a breakpoint at the chosen location on every page the scraper crawls. Execution stops at the breakpoint until you resume it manually in the DevTools window, which you can access via the Live View tab or the container URL. You can add extra breakpoints by placing debugger statements in the page function.
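For example, a debugger statement placed in the page function pauses execution at that point whenever the scraper runs in development mode:

async function pageFunction(context) {
    // Execution pauses here in development mode so the page can be inspected in DevTools.
    debugger;

    return { url: context.request.url };
}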
Debug Log
If set to true, debug messages are included in the actor log. You can log your own debug messages with context.log.debug('message').
Web Browser Log
If set to true, console messages from the web browser are included in the actor log. You may see error messages and other low-value messages, and the volume can grow large at high concurrency.
Custom Data
Since the user interface input is fixed, it cannot be extended with additional fields for specific use cases. If you have arbitrary data you want to pass to the scraper, use the Custom data field in the Advanced configuration. Its contents are available to the page function under the customData key of the context object.
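For instance, if the Custom data field contained {"label": "test-run"}, the page function could read it as sketched below; the "label" key is arbitrary and exists only because it was put into Custom data:

// Sketch only - the "label" key exists only if you define it in the Custom data input.
async function pageFunction(context) {
    const { label } = context.customData;

    return {
        url: context.request.url,
        label,
    };
}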
Custom Storage Names
With the last three options in the Advanced configuration, you can set custom names for the following storages:
- Key-value store
- Dataset
- Request queue
Named storages are retained on our platform. If you do not name the storages, their data persists only for the number of days allowed by your plan. Using named storages also lets you share them across many runs (for instance, instead of using a separate dataset for each run, you can write all runs into a single dataset). Check it out here.
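As a rough sketch, naming the storages in the JSON input could look like this; the exact field names are assumptions and may differ in your version of the input schema:

{
    // Field names assumed; the values are example storage names.
    "datasetName": "all-products",
    "keyValueStoreName": "scraper-state",
    "requestQueueName": "product-queue"
}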
Results
The output of the page function is stored in the default dataset associated with the actor run, from where you can export it to several usable formats, such as XML, JSON, Excel, or CSV. For each object returned from the page function, Web Scraper pushes one record into the dataset and extends it with metadata, such as the URL of the web page the data came from.
For example, if the page function returns this object:
{
    message: 'Hello world!'
}

then the record stored in the dataset will look like this:
{ "Message": "Hello World!", "#Error": False, "#Debug": { "RequestId": "FvwscO2UJLdr10B", "Url": "Https://Www.Example.Com/", "LoadedUrl": "Https://Www.Example.Com/", "Method": "GET", "RetryCount": 0, "ErrorMessages": Null, "StatusCode": 200 } }You Can Call The API Endpoint Get Dataset Items To Download Outputs.
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json

where [DATASET_ID] is the ID of the dataset for the scraper run. The ID is returned in the run object when the scraper is started, and you can also find the export URLs for the results in your Console account.
Add the query parameter clean=true to the API URL (or select the Clean items option in your Console account) to skip the #debug and #error metadata fields from the output and remove empty results.
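For example, a minimal sketch of downloading the cleaned results from Node.js (18+, which provides a built-in fetch); [DATASET_ID] stays a placeholder for the real dataset ID:

// Sketch: download cleaned dataset items as JSON once the run has finished.
const response = await fetch(
    'https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json&clean=true'
);
const items = await response.json();
console.log(`Downloaded ${items.length} records`);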
Extra Resources
Congratulations! You have learned how Web Scraper works. You can now explore the following resources:
- Data Scraping using Web Scraper - a step-by-step tutorial on how to use the web scraping tool, with detailed descriptions and examples.
- Web Scraping Guide - an introduction to web data extraction with Real Data API.
- Actors Documentation - documentation for our cloud computing platform.
- Puppeteer Scraper - a web scraping tool similar to Web Scraper that can use server-side libraries and offers lower-level control of the Puppeteer library.
- Cheerio Scraper - another data scraping tool that downloads and processes raw HTML for better performance.
- Playwright Scraper - a similar web scraping tool that offers lower-level control of the Playwright library and can access server-side libraries.
- Crawlee Documentation - learn how to build a new data scraping project from scratch with the popular Node.js web scraping and crawling library.
- Real Data API SDK Documentation - learn more about the tools required to run your actor on our platform.
Upgrading
You can learn about the minor breaking changes in v2 in the migration guide.
V3 introduced further breaking changes; you can explore them in the v3 migration guide.
Breaking changes specific to this scraper:
- Using a proxy is now required.