Interpreting the results of Linear Regression using OLS Summary

Last Updated : 29 Nov, 2024

Linear regression is a popular method for understanding how different factors (independent variables) affect an outcome (dependent variable). The Ordinary Least Squares (OLS) method finds the best-fitting line that predicts the outcome from the data we have. In this article, we break down the key parts of the OLS summary and explain how to interpret them in a way that's easy to understand. Many statistical software options, such as MATLAB, Minitab, SPSS, and R, are available for regression analysis; this article focuses on Python.

Understanding Components of OLS Summary

The OLS summary report is a detailed output that provides various metrics and statistics for evaluating the model's performance and interpreting its results. The summary table of a regression is given below for reference, with detailed information on the model's fit, the significance of each variable, and other key statistics. Here are the key components of the OLS summary:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.669
Model:                            OLS   Adj. R-squared:                  0.667
Method:                 Least Squares   F-statistic:                     299.2
Date:                Mon, 01 Mar 2021   Prob (F-statistic):           2.33e-37
Time:                        16:19:34   Log-Likelihood:                -88.686
No. Observations:                 150   AIC:                             181.4
Df Residuals:                     148   BIC:                             187.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.2002      0.257    -12.458      0.000      -3.708      -2.693
x1             0.7529      0.044     17.296      0.000       0.667       0.839
==============================================================================
Omnibus:                        3.538   Durbin-Watson:                   1.279
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                3.589
Skew:                           0.357   Prob(JB):                        0.166
Kurtosis:                       2.744   Cond. No.                         43.4
==============================================================================

1. The Header Section: Dependent Variable and Model Information

  • Dependent Variable and Model: The dependent variable (also known as the explained variable) is the variable we aim to predict or explain using the independent variables. The model section indicates that the method used is Ordinary Least Squares (OLS), which minimizes the sum of squared errors between the observed and predicted values.
  • Number of observations: The number of observations is the size of our sample, i.e., N = 150.
  • Degrees of freedom (df): The degrees of freedom is the number of independent observations remaining after the model's parameters have been estimated: $Df = N - K$

where N = sample size (number of observations) and K = number of variables + 1 (including the intercept). In the table above, Df Residuals = 150 - 2 = 148. These header quantities can also be read directly from a fitted model, as the sketch below shows.
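A minimal sketch of reading the header quantities from a fitted statsmodels results object (the data here are synthetic and purely illustrative, chosen to mirror the table above):

Python
import numpy as np
import statsmodels.api as sm

# Synthetic data: 150 observations, one predictor, mirroring the table above
rng = np.random.default_rng(0)
x = rng.normal(5, 2, 150)
y = -3.2 + 0.75 * x + rng.normal(0, 0.5, 150)

X = sm.add_constant(x)       # adds the intercept column, so K = 2
results = sm.OLS(y, X).fit()

print(results.nobs)          # N = 150
print(results.df_model)      # number of predictors excluding the intercept = 1
print(results.df_resid)      # N - K = 148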

2. Coefficient Interpretation: Standard Error, T-Statistics, and P-Value Insights


  • Constant term: The constant term is the intercept of the regression line. A regression omits some factors that have little impact on the dependent variable, and the intercept captures the average effect of these omitted variables plus the noise present in the model. For example, in the regression equation $Y = 5.0 + 0.75X$, the constant term (intercept) of 5.0 indicates that when X = 0, the predicted value of Y is 5.0, representing the baseline level influenced by omitted factors.
  • Coefficient term: The coefficient tells the change in Y for a unit change in X. For example, if X rises by 1 unit, Y rises by 0.7529 units (the x1 row in the table above).
  • Standard Error of Parameters: The standard error of a coefficient is the standard deviation of its sampling distribution. It measures how much the coefficient estimate would vary if the same model were estimated on different samples from the same population; larger standard errors indicate less precise estimates. For the slope of a simple regression, the standard error is calculated as:

$\text{Standard Error} = \sqrt{\frac{\text{Residual Sum of Squares}}{N - K}} \cdot \sqrt{\frac{1}{\sum{(X_i - \bar{X})^2}}}$

where:

  • Residual Sum of Squares is the sum of the squared differences between the observed values and the predicted values.
  • N is the number of observations.
  • K is the number of independent variables in the model, including the intercept.
  • $X_i$ represents each independent variable value, and $\bar{X}$ is the mean of those values.

This formula provides a measure of how much the coefficient estimates vary from sample to sample.

  • T-Statistics and P-Values:
    • The t-statistics are calculated by dividing the coefficient by its standard error. These values are used to test the null hypothesis that the coefficient is zero (i.e., the independent variable has no effect on the dependent variable).
    • The p-values associated with these t-statistics indicate the probability of observing the estimated coefficient (or a more extreme value) if the null hypothesis were true. A p-value below a certain significance level (usually 0.05) suggests that the coefficient is statistically significant, meaning the independent variable has a significant effect on the dependent variable.
  • Confidence Intervals: The confidence intervals give a range within which the true coefficient likely falls, with a certain level of confidence (usually 95%).
    • If a confidence interval includes zero, it means there’s a chance the variable might not actually impact the outcome.
    • If zero isn’t in the interval, it’s more likely that the variable genuinely affects the outcome. (The sketch below shows how to extract all of these quantities from a fitted model.)
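Every number in the coefficient table can be pulled out of a fitted statsmodels results object, and the standard-error formula above can be verified by hand. A minimal sketch, again on synthetic single-predictor data (the variable names are illustrative):

Python
import numpy as np
import statsmodels.api as sm

# Synthetic single-predictor fit, mirroring the summary table above
rng = np.random.default_rng(1)
x = rng.normal(5, 2, 150)
y = -3.2 + 0.75 * x + rng.normal(0, 0.5, 150)
results = sm.OLS(y, sm.add_constant(x)).fit()

print(results.params)                 # coefficient estimates (const, x1)
print(results.bse)                    # standard errors
print(results.tvalues)                # t-statistics = params / bse
print(results.pvalues)                # two-sided p-values for H0: coefficient = 0
print(results.conf_int(alpha=0.05))   # 95% confidence intervals

# Verify the slope's standard error using the formula given earlier
n, k = 150, 2
rss = np.sum(results.resid ** 2)
se_slope = np.sqrt(rss / (n - k)) * np.sqrt(1 / np.sum((x - x.mean()) ** 2))
print(se_slope, results.bse[1])       # the two values should match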

3. Evaluating Model Performance: Goodness of Fit Metrics

  • R-Squared (R²): R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, where 1 indicates that the model explains all the variance.
  • Adjusted R-Squared: Adjusted R-squared is a modified version of R-squared that adjusts for the number of independent variables in the model, providing a more accurate measure of the model’s explanatory power when comparing models with different numbers of variables.
    • For example, an R-squared value of 0.669 means that about 66.9% of the variance in the dependent variable is explained by the model.
    • If the adjusted R-squared decreases when adding more variables, it suggests that the additional variables do not contribute significantly to the model and may be omitted.
  • F-Statistic and Prob(F-Statistic): The F-statistic tests the overall significance of the model. The null hypothesis is that all coefficients (except the intercept) are zero, meaning the model explains none of the variance in the dependent variable. The p-value associated with the F-statistic gives the probability of observing an F-statistic this large (or larger) if the null hypothesis were true; a small p-value (typically less than 0.05) indicates that the model is statistically significant, i.e., at least one independent variable has a significant effect on the dependent variable. The sketch below recomputes these goodness-of-fit metrics by hand.
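As a cross-check, the goodness-of-fit metrics can be computed directly from the residuals and compared against what statsmodels reports. A minimal sketch on synthetic data:

Python
import numpy as np
import statsmodels.api as sm

# Synthetic data purely for illustration
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 150)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 150)
results = sm.OLS(y, sm.add_constant(x)).fit()

n, k = 150, 2                             # k counts the intercept
rss = np.sum(results.resid ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
f_stat = ((tss - rss) / (k - 1)) / (rss / (n - k))

print(r2, results.rsquared)               # manual vs. statsmodels
print(adj_r2, results.rsquared_adj)
print(f_stat, results.fvalue)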

4. Testing Model Assumptions with Diagnostics

The remaining terms are used less often. The OLS summary provides several diagnostic checks to help assess specific assumptions about the data; terms like skewness and kurtosis describe the distribution of the residuals. Below are the key diagnostics included in the OLS summary:

  • Omnibus: The Omnibus test evaluates the joint normality of the residuals. A higher value suggests a deviation from normality.
  • Prob(Omnibus): This p-value indicates the probability of observing the test statistic under the null hypothesis of normality. A value above 0.05 suggests that we do not reject the null hypothesis, implying that the residuals may be normally distributed.
  • Jarque-Bera (JB): The Jarque-Bera test is another test for normality that assesses whether the sample skewness and kurtosis match those of a normal distribution.
  • Prob(JB): Similar to the Prob(Omnibus), this p-value assesses the null hypothesis of normality. A value greater than 0.05 indicates that we do not reject the null hypothesis.
  • Skew: Skewness measures the asymmetry of the distribution of residuals. A skewness value close to zero indicates a symmetrical distribution, while positive or negative values indicate right or left skewness, respectively.
  • Kurtosis: Kurtosis measures the “tailedness” of the distribution. A kurtosis value of 3 indicates a normal distribution, while values above or below suggest heavier or lighter tails, respectively.
  • Durbin-Watson: This statistic tests for autocorrelation in the residuals of a regression. Values close to 2 suggest no autocorrelation; values well below 2 (e.g., less than 1) indicate positive autocorrelation, and values well above 2 (e.g., greater than 3) indicate negative autocorrelation.
  • Cond. No.: The condition number assesses multicollinearity; values above 30 suggest potential multicollinearity issues among the independent variables.

Skewness and kurtosis for the normal distribution are 0 and 3 respectively. These diagnostic tests are essential for validating the reliability of a linear regression model, helping ensure that the model’s assumptions are satisfied.
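All of these diagnostics can be recomputed from the residuals with statsmodels' helper functions. A minimal sketch on synthetic data (not part of the case study below):

Python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

# Synthetic fit, just to have residuals to diagnose
rng = np.random.default_rng(7)
x = rng.normal(0, 1, 150)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 150)
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
resid = results.resid

jb_stat, jb_pvalue, skew, kurt = jarque_bera(resid)
omni_stat, omni_pvalue = omni_normtest(resid)

print(f"Omnibus: {omni_stat:.3f}, Prob(Omnibus): {omni_pvalue:.3f}")
print(f"Jarque-Bera: {jb_stat:.3f}, Prob(JB): {jb_pvalue:.3f}")
print(f"Skew: {skew:.3f}, Kurtosis: {kurt:.3f}")
print(f"Durbin-Watson: {durbin_watson(resid):.3f}")
print(f"Cond. No.: {np.linalg.cond(X):.1f}")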

Practical Application: Case Study Interpretation

In this section, we explore a practical application of Ordinary Least Squares (OLS) regression through a case study on predicting house prices. We break down the OLS summary output step by step and offer insights on how to refine the model, using Python code that performs OLS regression with the statsmodels library. The code below creates a sample dataset, fits the model, and displays the summary output so we can interpret the key metrics.

Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset creation
data = {
    'Size': np.random.randint(500, 4000, 200),  # Size in square feet
    'Bedrooms': np.random.randint(1, 6, 200),   # Number of bedrooms
    'Age': np.random.randint(0, 30, 200),       # Age of the house
}
data['Price'] = (15000 + data['Size'] * 200 + data['Bedrooms'] * 7500
                 - data['Age'] * 300 + np.random.normal(0, 10000, 200))

df = pd.DataFrame(data)

# Define independent variables (X) and dependent variable (y)
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']

# Add a constant term (the intercept) to the independent variables
X = sm.add_constant(X)

# Fit the OLS regression model and print the summary
model = sm.OLS(y, X).fit()
print(model.summary())

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 3.385e+04
Date:                Tue, 05 Nov 2024   Prob (F-statistic):          9.03e-266
Time:                        07:05:46   Log-Likelihood:                -2111.6
No. Observations:                 200   AIC:                             4231.
Df Residuals:                     196   BIC:                             4244.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.602e+04   2417.705      6.626      0.000    1.13e+04    2.08e+04
Size         199.5895      0.629    317.423      0.000     198.349     200.829
Bedrooms    7570.0430    475.547     15.919      0.000    6632.198    8507.888
Age         -257.4872     81.980     -3.141      0.002    -419.163     -95.811
==============================================================================
Omnibus:                        2.464   Durbin-Watson:                   1.816
Prob(Omnibus):                  0.292   Jarque-Bera (JB):                2.125
Skew:                          -0.241   Prob(JB):                        0.346
Kurtosis:                       3.151   Cond. No.                      9.22e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Step-by-Step Breakdown of Each Component

  1. Dependent Variable: In this case, the dependent variable is “Price,” which is what we aim to predict.
  2. R-Squared and Adjusted R-Squared:
    • R-squared (0.998) indicates that 99.8% of the variance in house prices is explained by the model. This is an extremely strong fit, which is expected here because the data were generated synthetically with relatively little noise.
    • Adjusted R-squared (0.998) accounts for the number of predictors, providing a more conservative estimate of model performance, particularly when comparing models with different numbers of variables.
  3. F-Statistic and Prob (F-Statistic):
    • The F-statistic (3.385e+04) assesses the overall significance of the regression model. A high value indicates that at least one predictor variable is significantly related to house prices.
    • The Prob (F-statistic) of 9.03e-266 is essentially zero, showing that the model is statistically significant.
  4. Coefficients:
    • Intercept (16,020) suggests that when all predictors are zero (hypothetically), the base price is about $16,020.
    • Size (199.59) indicates that each additional square foot increases the price by about $199.59.
    • Bedrooms (7,570.04) suggests that adding a bedroom increases the house price by about $7,570.
    • Age (-257.49) indicates that each additional year of age decreases the price by about $257.
  5. Standard Errors, t-Statistics, and P-Values:
    • The standard errors provide a measure of the accuracy of the coefficient estimates.
    • The t-statistics and p-values indicate that all predictors are statistically significant (p < 0.05), meaning they have a significant impact on house prices. Once interpreted, the fitted model can also be used for prediction, as the sketch below shows.
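A minimal prediction sketch continuing from the case-study code above (the feature values here are hypothetical):

Python
# Predict the price of a hypothetical 2,000 sq ft, 3-bedroom, 10-year-old house.
# Column names and order must match the design matrix used to fit the model,
# including the 'const' column added by sm.add_constant.
new_house = pd.DataFrame({
    'const': [1.0],
    'Size': [2000],
    'Bedrooms': [3],
    'Age': [10],
})
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price.iloc[0]:,.0f}")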

Insights on Model Adjustment

Based on the interpretation of the OLS output:

  • If any of the p-values were greater than 0.05, it would suggest that the respective variable does not significantly contribute to the model and could potentially be removed.
  • Additionally, examining residual plots can help identify patterns that suggest the need for variable transformations or the inclusion of interaction terms, as in the sketch below.
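A minimal residual-plot sketch, continuing from the case-study code: a patternless cloud of points around zero supports the linearity and constant-variance assumptions, while curvature or a funnel shape suggests transformations or interaction terms are needed.

Python
import matplotlib.pyplot as plt

# Residuals vs. fitted values from the house-price model above
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()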

Conclusion

Interpreting the results of an OLS summary involves a thorough examination of various statistical metrics and diagnostic checks. Understanding the coefficients, standard errors, t-statistics, p-values, R-squared, F-statistic, and other diagnostics is crucial for evaluating the model’s performance and making informed decisions. Additionally, ensuring that the model meets the OLS assumptions and considering both statistical and practical significance are essential steps in the interpretation process. By carefully analyzing these components, researchers and analysts can gain valuable insights into the relationships between variables and make more accurate predictions and decisions.


