Web scraping for non-programmers - what, why and how

Author: Abou Tarek
Posted: Jul 07, 2018

Web scraping is a powerful technology that can accelerate your business growth. However, people without a tech background often struggle to understand what web scraping really is and how they can benefit from it.

Let's understand the web

Web pages are built using text-based markup languages and contain a wealth of useful data. However, web pages are designed for human end users to read in a web browser, not for ease of automated use. This human-friendly design makes the data difficult for software to access because it is unstructured.

Markup languages: Markup languages are designed for the processing, definition and presentation of text. The language specifies the rules for formatting, both layout and style, within a text file. The codes used to specify the formatting are called tags. HTML is an example of a widely known and used markup language.
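
To make the idea of tags concrete, here is a minimal sketch in Python. The HTML fragment is made up for illustration, and the class name TagLister is hypothetical; the parser itself is Python's standard html.parser module. The sketch lists the tags it finds and, separately, the text a human reader would see.

```python
from html.parser import HTMLParser

# A made-up HTML fragment: the tags (<h1>, <p>, <b>) describe layout and
# style, while the text between them is the content a reader sees.
SAMPLE_HTML = "<h1>Books</h1><p>Price: <b>$12</b></p>"


class TagLister(HTMLParser):
    """Collects the opening tags and the visible text of a document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag such as <h1> or <p>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only pieces.
        if data.strip():
            self.text.append(data.strip())


lister = TagLister()
lister.feed(SAMPLE_HTML)
lister.close()
print(lister.tags)  # ['h1', 'p', 'b']
print(lister.text)  # ['Books', 'Price:', '$12']
```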

OK, cool. What is an API?

Think of an API as an alternative user interface that software uses to interact with other software. An example would be Zomato using the Google Maps API to integrate location services within their app. Only a small number of websites have APIs because building and maintaining one takes considerable cost and effort.
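
As a rough illustration of why software prefers an API, the sketch below parses the kind of structured reply an API might send. The response text and its field names are invented for this example and are not taken from any real API; only Python's standard json module is used.

```python
import json

# Hypothetical response body from a restaurant-location API.
# The field names are assumptions made up for this illustration.
API_RESPONSE = '{"name": "Cafe Example", "location": {"lat": 12.97, "lng": 77.59}}'

# Because the reply is already structured, each value can be read by name.
place = json.loads(API_RESPONSE)
latitude = place["location"]["lat"]
longitude = place["location"]["lng"]
print(f"{place['name']} is at ({latitude}, {longitude})")
```

There is no markup to dig through here, which is the main difference from reading the same information off a web page.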

What is web scraping and how is it done?

Web scraping is an automated way of copying data from a website and turning it into a useful format that computers can work on. The typical process of web scraping is as follows:

  1. Fetch page - If you are on Amazon and you need the information in the books category, you ask the web scraper to request that page. The web scraper will go and fetch the page.
  2. Parse - If you right-click on a page and choose the "view page source" option in your browser, you can see the markup language. Parsing is a way to extract the information we need that is locked within the markup.
  3. Format - To turn the extracted information into a useful format, a set of transformations needs to be applied. This is called formatting.
  4. Storing the data - The formatted data needs to be stored in a database so it can be accessed later.
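
The four steps can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a production scraper: the page layout (a span with class "title" and a span with class "price" for each book), the sample page, and all function and class names are invented for the example. The fetch step is defined but not called, so the sketch runs without a network connection.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_page(url):
    """Step 1, fetch: download the raw markup of a page."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class BookParser(HTMLParser):
    """Step 2, parse: pull title and price text out of the markup.

    Assumes each book is marked up as
    <span class="title">...</span><span class="price">...</span>.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # one [title, price] pair of raw strings per book
        self._field = None  # "title" or "price" while inside such a span

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("title", "price"):
            self._field = css_class
            if css_class == "title":
                self.rows.append(["", ""])  # a new book starts here

    def handle_data(self, data):
        if self._field == "title":
            self.rows[-1][0] += data
        elif self._field == "price" and self.rows:
            self.rows[-1][1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def format_rows(rows):
    """Step 3, format: tidy the text and turn prices into numbers."""
    return [(title.strip(), float(price.strip().lstrip("$")))
            for title, price in rows]


def store_rows(connection, rows):
    """Step 4, store: save the formatted rows in a database table."""
    connection.execute("CREATE TABLE IF NOT EXISTS books (title TEXT, price REAL)")
    connection.executemany("INSERT INTO books VALUES (?, ?)", rows)
    connection.commit()


# A stand-in for what fetch_page() would return, so the sketch runs offline.
SAMPLE_PAGE = """
<ul>
  <li><span class="title">Example Book One</span><span class="price">$39.99</span></li>
  <li><span class="title">Example Book Two</span><span class="price">$24.50</span></li>
</ul>
"""

parser = BookParser()
parser.feed(SAMPLE_PAGE)
parser.close()
books = format_rows(parser.rows)

db = sqlite3.connect(":memory:")  # an in-memory database, just for the demo
store_rows(db, books)
saved = db.execute("SELECT title, price FROM books ORDER BY price").fetchall()
print(saved)  # [('Example Book Two', 24.5), ('Example Book One', 39.99)]
```

Against a real site, the text returned by fetch_page would replace SAMPLE_PAGE, and the parser would have to be adapted to that site's actual markup.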

What is a web scraper?

A web scraper, sometimes called a crawler, is a computer program that does web scraping.

How do I use the scraped data?

You can get the scraped data in computer-friendly formats such as CSV or JSON. There are many self-service tools, like Power BI and Pentaho, which you can use to analyse and transform the data without writing any code.
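
For readers curious what these formats look like, here is a small sketch that writes the same two records as CSV and as JSON. The records are made up for illustration; the csv, io and json modules are part of Python's standard library.

```python
import csv
import io
import json

# Made-up scraped records, one dictionary per book.
records = [
    {"title": "Example Book One", "price": 39.99},
    {"title": "Example Book Two", "price": 24.5},
]

# JSON keeps the field names next to every value.
json_text = json.dumps(records, indent=2)

# CSV is a plain table: one header row, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(csv_text)
print(json_text)
```

CSV files open directly in spreadsheet and analytics tools, while JSON is the more common choice when the data is fed into an app.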

What are my options to get data from the web?

Code it yourself

If you have a capable technical team, you can build web scrapers yourself in many programming languages, using frameworks like the ones listed below.

  1. Scrapy

  2. Nokogiri

  3. Apache Nutch

Costs:

  1. You need to pay for developers, servers, etc.

  2. On average, a developer needs 10 hours to code a web scraper, and it takes 4-6 months to build a stable infrastructure to run these web scrapers.

  3. You need to build systems for maintenance and QA.

In layman's terms, coding it yourself is like building a car from parts bought online. You hire a team of mechanics to build the car and appoint a driver to run it.

Key Benefits:

  1. You have control over data extraction

  2. You have ownership and access to source code

Drawbacks:

  1. Very costly compared to DaaS and DIY tools

  2. Time to market is slow

  3. Lack of expertise can hurt

  4. Need a lot of human resources

Do it yourself tools

DIY tools make it possible for non-technical people to get data from websites. In theory, anyone with basic computer skills should be able to configure a DIY tool. In most cases, though, you'll end up hiring a developer to modify the data and write scripts to get the data the way you need it. How much customization and modification is necessary depends on what you do with the data.

Costs

  1. You need to pay a monthly or yearly subscription to get a license.

  2. Customization of the data requires a developer. It can take anywhere from a few hours to a few days to get it done properly.

  3. You need people and tools to do QA.

  4. You need a full-time technical person to monitor the health of data extraction.

  5. DIY tools won't work well on websites with heavy Ajax or JavaScript (these are technologies that make websites interactive). In those cases you need to write custom scripts, and for this you need a developer.

  6. You need custom programming to extract data from websites with anti-scraping technologies. This also requires a full-time developer to function smoothly.

Key Benefits:

  1. You have control over the data extraction process

  2. DIY tools reduce the technical barrier to extracting data from websites


Drawbacks:

  1. Steep learning curve

  2. It doesn't work with complex websites

  3. You need tech resources to manage and monitor data extraction

  4. Costly compared to DaaS

  5. No access to source code

Self-service tools are like renting a car. You rent a car and drive it yourself or appoint a driver to run it. You pay a monthly rent and the driver's salary. If the car breaks down, you wait until the company fixes the problem.

Data as a Service

Data as a Service (DaaS) is the cousin of Software as a Service. DaaS enables people to get data in a ready-to-use format. This is the best option for those who want to focus on using the data rather than managing the data extraction. You can plug data streams directly into your analytics tools or your apps.

Cost:

You need to pay a monthly subscription for getting data.

Benefits:

  1. Most cost-effective option.

  2. No resources required.

  3. Pay only for what you actually get.

Drawbacks:

Little or no control over the data extraction process

In layman's terms, Data as a Service is like Uber: you get a ride when you need it and pay for the distance travelled. There is no need to take care of maintenance or anything else.

Conclusion

There are different ways to get web data for your business, and it is up to you which one to choose. List your priorities and choose wisely: do you want to book an Uber, rent a car, or build a car?

I wouldn't buy a coffee shop just to drink tea every day. Would you?

Having trouble understanding any of this? Visit this website: http://datahut.co/
