
How to Predict Housing Prices with Linear Regression?

Author: Scraping Intelligence
Posted: Feb 19, 2022

The final objective is to estimate the price of a given house in a Boston suburb, using data provided in 1970 by the Boston Standard Metropolitan Statistical Area. To examine and prepare the data, we will use techniques such as data pre-processing and feature engineering. After that, we'll fit a regression model to predict prices and monitor the real estate market.

Project Outline:

  • EDA
  • Feature Engineering
  • Pick and Train a Model
  • Interpret
  • Conclusion
EDA

Before using a statistical model, exploratory data analysis (EDA) is a good step to go through in order to:

  • Get to know the data set
  • Check whether any information is missing
  • Find outliers
  • Add, alter, or eliminate features to get more out of the data
Importing the Libraries
#Dataframe/Numerical libraries
import pandas as pd
import numpy as np

#Data visualization
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#Machine learning model
from sklearn.linear_model import LinearRegression

Reading the Dataset with Pandas

#Reading the data
path='./housing.csv'
housing_df=pd.read_csv(path,header=None,delim_whitespace=True)

The first and last five rows of the resulting data frame:

     CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  MEDV
0    0.00632  18.0  2.31   0     0.538  6.575  65.2  4.0900  1    296.0  15.3     396.90  4.98   24.0
1    0.02731  0.0   7.07   0     0.469  6.421  78.9  4.9671  2    242.0  17.8     396.90  9.14   21.6
2    0.02729  0.0   7.07   0     0.469  7.185  61.1  4.9671  2    242.0  17.8     392.83  4.03   34.7
3    0.03237  0.0   2.18   0     0.458  6.998  45.8  6.0622  3    222.0  18.7     394.63  2.94   33.4
4    0.06905  0.0   2.18   0     0.458  7.147  54.2  6.0622  3    222.0  18.7     396.90  5.33   36.2
...
501  0.06263  0.0   11.93  0     0.573  6.593  69.1  2.4786  1    273.0  21.0     391.99  9.67   22.4
502  0.04527  0.0   11.93  0     0.573  6.120  76.7  2.2875  1    273.0  21.0     396.90  9.08   20.6
503  0.06076  0.0   11.93  0     0.573  6.976  91.0  2.1675  1    273.0  21.0     396.90  5.64   23.9
504  0.10959  0.0   11.93  0     0.573  6.794  89.3  2.3889  1    273.0  21.0     393.45  6.48   22.0
505  0.04741  0.0   11.93  0     0.573  6.030  80.8  2.5050  1    273.0  21.0     396.90  7.88   11.9
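The file is read with header=None, yet later code indexes columns by name (for example housing_df['TAX']). The step that assigns those names did not survive extraction; a minimal sketch of it, assuming the standard 14-column order shown above:

# Not shown in the extracted article: name the columns so they can
# be referenced as housing_df['TAX'], housing_df['MEDV'], and so on.
housing_df.columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE',
                      'DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']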
Have a Look at the Columns

CRIM: It refers to a town's per capita crime rate.

ZN: It is the percentage of residential land zoned for lots over 25,000 square feet.

INDUS: The proportion of non-retail business acres per town is referred to as INDUS.

CHAS: CHAS denotes whether or not the tract borders the Charles River (1 if it does, 0 otherwise).

NOX: NOX stands for the nitric oxides concentration (parts per 10 million).

    RM: The average number of rooms per home is referred to as RM.

    AGE: The percentage of owner-occupied housing built before 1940 is referred to as AGE.

DIS: Weighted distances to five Boston employment centers are referred to as DIS.

    RAD: Accessibility to radial highways index

TAX: The TAX column denotes the full-value property tax rate per $10,000.

B: B = 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents in each town.

    PTRATIO: It refers to the student-to-teacher ratio in each community.

LSTAT: It refers to the percentage of the population with lower socioeconomic status.

MEDV: It refers to the median value of owner-occupied homes, in thousands of dollars.
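With the columns defined, a quick look at how each feature correlates with the target can guide the later feature-engineering step. This check is a sketch, not code from the article, and it assumes the column names assigned above:

# Correlation of every column with the target MEDV, sorted
# from the strongest positive to the strongest negative.
print(housing_df.corr()['MEDV'].sort_values(ascending=False))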

Data Preprocessing

# Check if there are any missing values.
housing_df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

No missing values are found.

    We examine our data's mean, standard deviation, and percentiles.

housing_df.describe()

Summary statistics per column (transposed for readability):

         count  mean        std         min      25%         50%         75%         max
CRIM     506    3.613524    8.601545    0.00632  0.082045    0.256510    3.677083    88.976200
ZN       506    11.363636   23.322453   0.00000  0.000000    0.000000    12.500000   100.000000
INDUS    506    11.136779   6.860353    0.46000  5.190000    9.690000    18.100000   27.740000
CHAS     506    0.069170    0.253994    0.00000  0.000000    0.000000    0.000000    1.000000
NOX      506    0.554695    0.115878    0.38500  0.449000    0.538000    0.624000    0.871000
RM       506    6.284634    0.702617    3.56100  5.885500    6.208500    6.623500    8.780000
AGE      506    68.574901   28.148861   2.90000  45.025000   77.500000   94.075000   100.000000
DIS      506    3.795043    2.105710    1.12960  2.100175    3.207450    5.188425    12.126500
RAD      506    9.549407    8.707259    1.00000  4.000000    5.000000    24.000000   24.000000
TAX      506    408.237154  168.537116  187.000  279.000000  330.000000  666.000000  711.000000
PTRATIO  506    18.455534   2.164946    12.6000  17.400000   19.050000   20.200000   22.000000
B        506    356.674032  91.294864   0.32000  375.377500  391.440000  396.225000  396.900000
LSTAT    506    12.653063   7.141062    1.73000  6.950000    11.360000   16.955000   37.970000
MEDV     506    22.532806   9.197104    5.00000  17.025000   21.200000   25.000000   50.000000

The CRIM, ZN, INDUS, NOX, and B columns appear to have multiple outliers at first glance because their minimum and maximum values are so far apart. In the AGE column, the mean and the median (50th percentile) do not match.

We can double-check this by examining the distribution of each column.
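The distribution plots themselves did not survive extraction. A minimal sketch of how such histograms could be drawn with the matplotlib import above; the bin count and figure size are arbitrary choices:

# One histogram per column to inspect skew and spot outliers.
housing_df.hist(bins=30, figsize=(16, 12))
plt.tight_layout()
plt.show()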

Inferences
1. The rate of crime is rather low; the majority of values are in the range of 0 to 25, with a few extreme values and many zeros.
2. The majority of residential land is zoned for less than 25,000 square feet. Land zoned for larger lots represents a small portion of the dataset.
3. The percentage of non-retail business acres is mostly split between two ranges: 0-13 and 13-23.
4. The majority of the properties are not bordered by the river; per the CHAS mean above, only about 7% of them are.
5. The nitric oxides concentration trends lower from 0.3 to 0.7, with a little bump toward 0.8. All values fall within the range 0.1-1.
6. The number of rooms tends to cluster around the average.
7. The proportion of owner-occupied units built before 1940 skews high, so most of the housing stock is older.
8. The weighted distance to the five employment centers skews toward small values. This could indicate that people choose to live close to high-employment areas.
9. People tend to live in places with limited access to radial highways (index 0-10), with an outlying cluster at index 24.
10. The majority of property-tax rates are in the range of $200-450 per $10,000, with large outliers around $700.
11. The percentage of lower-status residents tends to cluster around the median.

Removing all outliers will underfit the model, making it overly generic. Keeping all outliers causes the model to overfit: it learns the data's noise and becomes excessively tuned to the training set.

The approach is to strike a happy medium that prevents the model from becoming overly precise, so that it generalizes well when faced with new data.
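The article does not show how it decides which points count as outliers. One common heuristic (an assumption here, not the article's method) is the 1.5 x IQR rule:

# Count how many points the classic 1.5*IQR rule would flag per column.
q1 = housing_df.quantile(0.25)
q3 = housing_df.quantile(0.75)
iqr = q3 - q1
outliers = (housing_df < q1 - 1.5 * iqr) | (housing_df > q3 + 1.5 * iqr)
print(outliers.sum())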

We'll keep values below 600 because there's a huge anomaly in the TAX column above 600 (a cluster of values at 666-711).

new_df=housing_df[housing_df['TAX']<600]
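The extracted article ends here, before the "Pick and Train a Model" step promised in the outline. A minimal sketch of how the LinearRegression imported earlier might be fit on the filtered data; the 80/20 split, the feature set, and the random seed are assumptions rather than the article's code:

from sklearn.model_selection import train_test_split

# MEDV is the target; every other column serves as a feature.
X = new_df.drop(columns=['MEDV'])
y = new_df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out data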
    About the Author

Scraping Intelligence provides web scraping tools and software, data extraction, data mining, and web scraping services in the USA to scrape data from websites.
