
Data Science Training in Bangalore

Author: Saro Ji
Posted: Feb 19, 2019

Integrated Development Environment:

You can also run Python from a graphical user interface (GUI) environment, provided you have a GUI application on your system that supports Python.

Techniques for scrubbing or cleaning Data in Data Science:

Data science is a field of Big Data geared toward providing meaningful information based on large amounts of complex data. Data science, or data-driven science, combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making.

Cleaning data:

Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.


Raw data often contains inconsistencies, errors, stray characters, missing values, and other problems. In such cases you have to scrub, or clean, the data before using it. Common scrubbing techniques include:

  • Filter lines
  • Extract certain columns or words
  • Replace values
  • Handle missing values
  • Convert data from one format to another

Filtering lines:

The first scrubbing operation is filtering lines: every line of the input data is evaluated to determine whether it should be passed on as output.

  • Based on location

Filtering lines by their location is the simplest approach. It is useful when you want to inspect, say, the top 5 lines of a file, or when you want to extract a particular row from the output of another command-line tool.
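As a minimal sketch, location-based filtering can be done with the standard tools head and sed (the file people.csv and its contents are made-up examples):

```shell
# Create a small hypothetical file to demonstrate.
printf 'id,name\n1,alice\n2,bob\n3,carol\n4,dave\n5,eve\n6,frank\n' > people.csv

# Inspect the top 5 lines of the file.
head -n 5 people.csv

# Extract a particular row by its line number (here, line 3).
sed -n '3p' people.csv
```

The same idea works on the output of another tool by piping into head or sed instead of naming a file.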

  • Based on pattern

If you want to extract or remove lines based on their contents, use grep, the canonical command-line tool for filtering lines. It can print every line that matches a given pattern or regular expression.
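A short sketch of pattern-based filtering with grep, using a made-up log file app.log:

```shell
# Hypothetical input data.
printf 'error: disk full\ninfo: started\nerror: timeout\ninfo: done\n' > app.log

# Print every line that matches a pattern.
grep 'error' app.log

# Invert the match with -v to remove matching lines instead.
grep -v 'error' app.log

# Use -E for an extended regular expression.
grep -E '^(error|info):' app.log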

  • Based on randomness

When you’re in the process of formulating your data pipeline and have a bulk of data, then debugging your pipeline can be cumbersome. In that case, sampling from the data might be useful. The core reason of the command-line tool sample is to get a subset of the data by outputting only a particular percentage of the input on a line-by-line basis.
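The sample tool is a third-party utility and may not be installed on every system; as a sketch, the same effect can be had with standard tools — awk for a percentage-based sample, or GNU shuf for a fixed-size one:

```shell
# Hypothetical bulk data: 1000 lines.
seq 1 1000 > big.txt

# Keep roughly 1% of the lines, decided line by line (like sample).
awk 'BEGIN { srand() } rand() < 0.01' big.txt

# Or draw a fixed-size random sample of 10 lines with GNU shuf.
shuf -n 10 big.txt
```

The awk variant is probabilistic, so the exact number of output lines varies from run to run; shuf always returns exactly the requested count.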

Replacing and Deleting Values:

The command-line tool tr, which stands for translate, can be used to replace individual characters.
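A few minimal tr examples on made-up input strings:

```shell
# Translate individual characters: lowercase to uppercase.
echo 'data science' | tr 'a-z' 'A-Z'
# DATA SCIENCE

# Replace one delimiter with another, e.g. commas with tabs.
echo '1,alice,30' | tr ',' '\t'

# Delete characters entirely with -d.
echo '$1,000' | tr -d '$,'
# 1000
```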

Handle missing values:

Data mining methods vary in how they treat missing values. Typically, they either ignore the missing values, exclude any records containing missing values, replace missing values with the mean, or infer missing values from existing values.
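Two of these strategies — dropping records with missing values, and replacing missing values with the mean — can be sketched with awk, using a made-up file scores.csv whose second column has empty fields:

```shell
# Hypothetical data: the second column is sometimes missing.
printf 'alice,4\nbob,\ncarol,8\ndave,\n' > scores.csv

# Strategy 1: exclude any records containing missing values.
awk -F, '$2 != ""' scores.csv

# Strategy 2: replace missing values with the mean of the observed ones.
# Two passes: first compute the mean, then fill the gaps with it.
mean=$(awk -F, '$2 != "" { s += $2; n++ } END { print s / n }' scores.csv)
awk -F, -v m="$mean" 'BEGIN { OFS = "," } $2 == "" { $2 = m } { print }' scores.csv
```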

About the Author

SEO analyst
