
Multiple Linear Regression With scikit-learn

Last Updated : 11 Jul, 2022

In this article, let's learn about multiple linear regression using scikit-learn in the Python programming language.

Regression is a statistical method for determining the relationship between features and an outcome variable. In machine learning, it is used for predictive modeling, in which an algorithm is trained to predict continuous outcomes. Multiple linear regression, often known as multiple regression, is a statistical method that predicts the value of a response variable by combining several explanatory variables. It extends simple linear regression (fitted by ordinary least squares), in which just one explanatory variable is used.

Mathematical Formulation:

To improve prediction, several independent variables are combined. The linear relationship between the dependent and independent variables is:

y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

Here, y is the dependent variable.

  • x1, x2, x3, ... are independent variables.
  • b0 is the intercept.
  • b1, b2, ... are coefficients.

By comparison, a simple linear regression line has the form:

y = mx + c

where m is the slope and c is the intercept. As an example, consider a dataset with three features and one output variable:

feature 1: TV

feature 2: radio

feature 3:  Newspaper

output variable: sales

The independent variables are feature 1, feature 2 and feature 3, and the dependent variable is sales. The equation for this problem is:

y = b0+b1x1+b2x2+b3x3

x1, x2 and x3 are the feature variables. 
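To make the equation concrete, here is a minimal sketch that evaluates y = b0 + b1x1 + b2x2 + b3x3 for one observation. The intercept, coefficients, and spend values are hypothetical numbers chosen only for illustration; they are not fitted from any dataset.

```python
import numpy as np

# Hypothetical intercept and coefficients (illustration only, not fitted values).
b0 = 3.0
b = np.array([0.05, 0.2, 0.01])  # b1 (TV), b2 (radio), b3 (newspaper)

# Hypothetical feature values x1, x2, x3 for a single observation.
x = np.array([100.0, 20.0, 10.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3, written as a dot product.
y = b0 + np.dot(b, x)
print(y)
```

Writing the sum as a dot product is what lets the same expression scale to any number of features.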

In the implementation below, we use scikit-learn to perform linear regression. As we have multiple feature variables and a single outcome variable, it is a multiple linear regression. Let's see how to do this step by step.

Stepwise Implementation

Step 1: Import the necessary packages

The necessary packages, such as pandas, NumPy, and scikit-learn, are imported.

Python3
# importing modules and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing

Step 2: Import the CSV file:

The CSV file is imported using the pd.read_csv() function. The 'No' column is dropped because the dataframe already has an index. The df.head() method returns the first five rows of the dataframe, and the df.columns attribute returns the column names. The columns whose names start with 'X' are the independent features in our dataset, and the column 'Y house price of unit area' is the dependent variable. As there is more than one independent (explanatory) variable, this is a multiple linear regression.

To view and download the CSV file click here.

Python3
# importing data
df = pd.read_csv('Real estate.csv')
# drop the 'No' column, since the dataframe already has an index
df.drop('No', inplace=True, axis=1)

print(df.head())
print(df.columns)

Output:

   X1 transaction date  X2 house age  ...  X6 longitude  Y house price of unit area
0             2012.917          32.0  ...     121.54024                        37.9
1             2012.917          19.5  ...     121.53951                        42.2
2             2013.583          13.3  ...     121.54391                        47.3
3             2013.500          13.3  ...     121.54391                        54.8
4             2012.833           5.0  ...     121.54245                        43.1

[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
      'X3 distance to the nearest MRT station',
      'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
      'Y house price of unit area'],
     dtype='object')
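The loading step can be tried without the real file. The sketch below reads a tiny in-memory CSV, whose column names and rows are hypothetical stand-ins, and drops the redundant 'No' column. It uses the non-inplace form of drop, which returns a new dataframe rather than modifying the original.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for 'Real estate.csv'.
csv_text = (
    "No,X1 house age,Y price\n"
    "1,32.0,37.9\n"
    "2,19.5,42.2\n"
    "3,13.3,47.3\n"
)

df_demo = pd.read_csv(io.StringIO(csv_text))

# Drop the running-number column; pandas already provides an index.
df_demo = df_demo.drop(columns="No")

print(df_demo.head())
print(df_demo.columns.tolist())
```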

Step 3: Create a scatterplot to visualize the data:

A scatterplot is created to visualize the relation between the independent variable 'X4 number of convenience stores' and the dependent variable 'Y house price of unit area'.

Python3
# plotting a scatterplot
sns.scatterplot(x='X4 number of convenience stores',
                y='Y house price of unit area', data=df)

Output:

[Figure: scatterplot of 'Y house price of unit area' against 'X4 number of convenience stores']

Step 4: Create feature variables: 

To model the data we need to create feature variables: X contains the independent variables and y contains the dependent variable. X and y are printed to inspect the data.

Python3
# creating feature variables
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area']
print(X)
print(y)

Output:

    X1 transaction date  X2 house age  ...  X5 latitude  X6 longitude
0               2012.917          32.0  ...     24.98298     121.54024
1               2012.917          19.5  ...     24.98034     121.53951
2               2013.583          13.3  ...     24.98746     121.54391
3               2013.500          13.3  ...     24.98746     121.54391
4               2012.833           5.0  ...     24.97937     121.54245
..                   ...           ...  ...          ...           ...
409             2013.000          13.7  ...     24.94155     121.50381
410             2012.667           5.6  ...     24.97433     121.54310
411             2013.250          18.8  ...     24.97923     121.53986
412             2013.000           8.1  ...     24.96674     121.54067
413             2013.500           6.5  ...     24.97433     121.54310

[414 rows x 6 columns]
0      37.9
1      42.2
2      47.3
3      54.8
4      43.1
      ... 
409    15.4
410    50.0
411    40.6
412    52.5
413    63.9
Name: Y house price of unit area, Length: 414, dtype: float64

Step 5: Split data into train and test sets:

Here, the train_test_split() function is used to create the train and test sets; the feature variables X and the target y are passed to it. test_size is set to 0.3, which means 30% of the data goes into the test set and the remaining 70% into the train set. random_state is set so that the split is reproducible.

Python3
# creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)
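As a quick check on what the split produces, the sketch below applies the same test_size and random_state to synthetic data with the same shape as the dataset (414 rows, 6 features). The random arrays are placeholders rather than the real estate data; the point is only the resulting set sizes and the reproducibility of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the real data (414 rows, 6 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(414, 6))
y_demo = rng.normal(size=414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)
print(np.array_equal(X_te, X_te2))
```

With 414 rows, the test set size is rounded up to 125 rows and the train set receives the remaining 289.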

Step 6: Create a linear regression model

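A linear regression model is created using the LinearRegression class, which is imported from the sklearn.linear_model module. The same class handles both simple and multiple linear regression; the number of columns in the training data determines the number of coefficients.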

Python3
# creating a regression model
model = LinearRegression()

Step 7: Fit the model with training data.

After the model is created, it is fitted to the training data with the fit() method. During fitting, the model learns the intercept and the coefficients from the training set.

Python3
# fitting the model
model.fit(X_train, y_train)
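What fitting learns can be inspected through the intercept_ and coef_ attributes of a fitted LinearRegression. The sketch below is a small illustration on synthetic, noise-free data generated from known values (true_b0 and true_b are hypothetical), so the fitted parameters should match them up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from known (hypothetical) parameters, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_b0 = 2.0
true_b = np.array([1.5, -0.7, 0.3])
y_demo = true_b0 + X_demo @ true_b

demo_model = LinearRegression().fit(X_demo, y_demo)

# intercept_ corresponds to b0; coef_ holds b1, b2, b3 in column order.
print(demo_model.intercept_)
print(demo_model.coef_)
```

On real data the relationship is noisy, so the fitted values are estimates rather than exact recoveries.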

Step 8: Make predictions on the test data set.

Here, the model.predict() method is used to make predictions on X_test. The test data is unseen: the model was not fitted on it, so these predictions indicate how well the model generalizes.

Python3
# making predictions
predictions = model.predict(X_test)
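For a linear model, a prediction is just the regression equation applied to new rows. The sketch below, on synthetic placeholder data, compares predict() with the hand-computed b0 + b1x1 + b2x2 + b3x3 using the fitted intercept_ and coef_; the two should agree up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data: first 30 rows for fitting, last 10 held out.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = rng.normal(size=40)

demo_model = LinearRegression().fit(X_demo[:30], y_demo[:30])
X_new = X_demo[30:]

# Predictions from the method and from the equation written out by hand.
by_method = demo_model.predict(X_new)
by_hand = demo_model.intercept_ + X_new @ demo_model.coef_

print(np.allclose(by_method, by_hand))
```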

Step 9: Evaluate the model with metrics.

The multiple linear regression model is evaluated with the mean_squared_error and mean_absolute_error metrics. Comparing these errors with the mean of the target variable gives a sense of how well the model is predicting. mean_squared_error is the mean of the squared residuals, and mean_absolute_error is the mean of the absolute residuals. The lower the error, the better the model performs.

mean absolute error = the mean of the absolute values of the residuals:

MAE = (1/n) Σ |y_i - ŷ_i|

mean squared error = the mean of the squares of the residuals:

MSE = (1/n) Σ (y_i - ŷ_i)²

  • y_i = actual value
  • ŷ_i (y hat) = predicted value
  • n = number of observations
Python3
# model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))

Output:

mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571
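Both metrics are simple enough to compute by hand, which makes the definitions easier to check. The sketch below uses four made-up actual and predicted values (not taken from the dataset) and compares the NumPy calculation with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

# Mean of absolute residuals and mean of squared residuals.
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
```

Here the residuals are 0.5, -0.5, 0 and -1, so the MAE is 2/4 = 0.5 and the MSE is 1.5/4 = 0.375.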

MAE and MSE weight errors differently. If you want the metric to be less sensitive to outliers, MAE is the preferable choice; if you want large errors to be penalized more heavily, use MSE or RMSE. MSE is in squared units of the target, so it is RMSE (the square root of MSE) that is directly comparable with MAE: RMSE is always greater than or equal to MAE, and the two are equal only when all errors have the same magnitude. Here the RMSE is about 6.80, against an MAE of about 5.39.
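The different sensitivity to outliers can be seen in a small sketch. The two error vectors below are made up so that both have the same MAE; only the second contains one large error, and only the squared-error metric reacts to it. The helper names mae and rmse are defined here for the illustration.

```python
import numpy as np

def mae(errors):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Made-up residuals: same MAE, but the second set has one large error.
equal_errors = np.array([2.0, -2.0, 2.0, -2.0])
with_outlier = np.array([1.0, -1.0, 1.0, -5.0])

print(mae(equal_errors), rmse(equal_errors))
print(mae(with_outlier), rmse(with_outlier))
```

When all errors have the same magnitude, RMSE and MAE coincide; the single large error pushes RMSE above MAE while leaving MAE unchanged.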

Code:

Here is the full code, combining the above steps.

Python3
# importing modules and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing

# importing data
df = pd.read_csv('Real estate.csv')
df.drop('No', inplace=True, axis=1)

print(df.head())

print(df.columns)

# plotting a scatterplot
sns.scatterplot(x='X4 number of convenience stores',
                y='Y house price of unit area', data=df)

# creating feature variables
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area']

print(X)
print(y)

# creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# creating a regression model
model = LinearRegression()

# fitting the model
model.fit(X_train, y_train)

# making predictions
predictions = model.predict(X_test)

# model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))

Output:

   X1 transaction date  X2 house age  ...  X6 longitude  Y house price of unit area
0             2012.917          32.0  ...     121.54024                        37.9
1             2012.917          19.5  ...     121.53951                        42.2
2             2013.583          13.3  ...     121.54391                        47.3
3             2013.500          13.3  ...     121.54391                        54.8
4             2012.833           5.0  ...     121.54245                        43.1

[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
      'X3 distance to the nearest MRT station',
      'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
      'Y house price of unit area'],
     dtype='object')
    X1 transaction date  X2 house age  ...  X5 latitude  X6 longitude
0               2012.917          32.0  ...     24.98298     121.54024
1               2012.917          19.5  ...     24.98034     121.53951
2               2013.583          13.3  ...     24.98746     121.54391
3               2013.500          13.3  ...     24.98746     121.54391
4               2012.833           5.0  ...     24.97937     121.54245
..                   ...           ...  ...          ...           ...
409             2013.000          13.7  ...     24.94155     121.50381
410             2012.667           5.6  ...     24.97433     121.54310
411             2013.250          18.8  ...     24.97923     121.53986
412             2013.000           8.1  ...     24.96674     121.54067
413             2013.500           6.5  ...     24.97433     121.54310

[414 rows x 6 columns]
0      37.9
1      42.2
2      47.3
3      54.8
4      43.1
      ... 
409    15.4
410    50.0
411    40.6
412    52.5
413    63.9
Name: Y house price of unit area, Length: 414, dtype: float64
mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571


Author: isitapol2002