Interpreting the results of Linear Regression using OLS Summary
Last Updated : 29 Nov, 2024
Linear regression is a popular method for understanding how different factors (independent variables) affect an outcome (dependent variable). The Ordinary Least Squares (OLS) method finds the best-fitting line that predicts the outcome based on the data we have. In this article, we break down the key parts of the OLS summary and explain how to interpret them in a way that's easy to understand. Many statistical software packages, such as MATLAB, Minitab, SPSS, and R, support regression analysis; this article focuses on Python.
Understanding Components of OLS Summary
The OLS summary report is a detailed output that provides various metrics and statistics to help evaluate the model’s performance and interpret its results. Understanding each one can reveal valuable insights into your model’s performance and accuracy. The summary table of the regression is given below for reference, providing detailed information on the model’s performance, the significance of each variable, and other key statistics that help in interpreting the results. Here are the key components of the OLS summary:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.669
Model: OLS Adj. R-squared: 0.667
Method: Least Squares F-statistic: 299.2
Date: Mon, 01 Mar 2021 Prob (F-statistic): 2.33e-37
Time: 16:19:34 Log-Likelihood: -88.686
No. Observations: 150 AIC: 181.4
Df Residuals: 148 BIC: 187.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3.2002 0.257 -12.458 0.000 -3.708 -2.693
x1 0.7529 0.044 17.296 0.000 0.667 0.839
==============================================================================
Omnibus: 3.538 Durbin-Watson: 1.279
Prob(Omnibus): 0.171 Jarque-Bera (JB): 3.589
Skew: 0.357 Prob(JB): 0.166
Kurtosis: 2.744 Cond. No. 43.4
==============================================================================
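The summary above comes from a simple one-predictor regression. As a rough sketch (the data below are synthetic and purely illustrative, so the numbers will not match the table exactly), a report in this format can be generated with statsmodels as follows:

Python

import numpy as np
import statsmodels.api as sm

# Synthetic, illustrative data: one predictor and a linear signal plus noise
rng = np.random.default_rng(0)
x = rng.normal(5, 2, 150)                       # independent variable, N = 150
y = -3.0 + 0.75 * x + rng.normal(0, 1, 150)     # dependent variable

X = sm.add_constant(x)        # adds the intercept (const) column
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.summary())        # prints a report in the same format as above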
1. The Header Section: Dependent Variable and Model Information
- Dependent Variable and Model: The dependent variable (also known as the explained variable) is the variable that we aim to predict or explain using the independent variables. The model section indicates that the method used is Ordinary Least Squares (OLS), which minimizes the sum of the squared errors between the observed and predicted values.
- Number of observations: The number of observations is the size of our sample, i.e., N = 150.
- Degrees of freedom (df): The degrees of freedom are the number of independent pieces of information available for estimating the residual variance after fitting the model. Degrees of freedom, [Tex]Df = N - K[/Tex]
Where N = sample size (number of observations) and K = number of estimated parameters, i.e., the number of independent variables plus one for the intercept. In the summary above, Df Residuals = 150 - 2 = 148.
2. Coefficient Interpretation: Standard Error, T-Statistics, and P-Value Insights
- Constant term: The constant term is the intercept of the regression line. It is the predicted value of Y when every independent variable is zero, and it also absorbs the average effect of omitted variables and noise in the model. For example, in the regression equation [Tex]Y=5.0+0.75X[/Tex], the constant term (intercept) of 5.0 indicates that when X=0, the predicted value of Y is 5.0, representing the baseline level influenced by omitted factors.
- Coefficient term: The coefficient term tells the change in Y for a one-unit change in X. For example, in the summary above, if X rises by 1 unit then Y rises by 0.7529 on average.
- Standard Error of Parameters: The standard error of a coefficient is the estimated standard deviation of its sampling distribution. It is a measure of how much the coefficient estimate would vary if the same model were estimated on different samples from the same population; larger standard errors indicate less precise estimates (a hand computation is sketched after this list). For a simple regression with one predictor, the standard error of the slope is calculated as:
[Tex]\text{SE}(\hat{\beta}_1) = \sqrt{\frac{\text{Residual Sum of Squares}/(N - K)}{\sum_{i}(X_i - \bar{X})^2}}[/Tex]
where:
- Residual Sum of Squares is the sum of the squared differences between the observed values and the predicted values.
- N is the number of observations.
- K is the number of estimated parameters, i.e., the independent variables plus the intercept.
- [Tex]X_i[/Tex] represents each value of the independent variable, and [Tex]\bar{X}[/Tex] is the mean of those values.
This formula provides a measure of how much the coefficient estimates vary from sample to sample.
- T-Statistics and P-Values:
- The t-statistics are calculated by dividing the coefficient by its standard error. These values are used to test the null hypothesis that the coefficient is zero (i.e., the independent variable has no effect on the dependent variable).
- The p-values associated with these t-statistics indicate the probability of observing the estimated coefficient (or a more extreme value) if the null hypothesis were true. A p-value below a certain significance level (usually 0.05) suggests that the coefficient is statistically significant, meaning the independent variable has a significant effect on the dependent variable.
- Confidence Intervals: The confidence intervals give a range within which the true coefficient likely falls, with a certain level of confidence (usually 95%).
- If a confidence interval includes zero, it means there’s a chance the variable might not actually impact the outcome.
- If zero isn’t in the interval, it’s more likely that the variable genuinely affects the outcome.
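To make the formulas above concrete, here is a minimal sketch (again with synthetic data, so the exact numbers are illustrative) that computes the slope's standard error, t-statistic, p-value, and 95% confidence interval by hand and checks them against statsmodels:

Python

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic single-predictor data
rng = np.random.default_rng(1)
x = rng.normal(5, 2, 150)
y = -3.0 + 0.75 * x + rng.normal(0, 1, 150)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

n, k = len(y), X.shape[1]                     # N observations, K estimated parameters
rss = np.sum(res.resid ** 2)                  # residual sum of squares
se_slope = np.sqrt((rss / (n - k)) / np.sum((x - x.mean()) ** 2))
t_stat = res.params[1] / se_slope             # coefficient divided by its standard error
p_value = 2 * stats.t.sf(abs(t_stat), df=n - k)
ci = res.params[1] + np.array([-1, 1]) * stats.t.ppf(0.975, df=n - k) * se_slope

print(se_slope, res.bse[1])        # hand-computed vs statsmodels standard error
print(t_stat, res.tvalues[1])      # t-statistic
print(p_value, res.pvalues[1])     # two-sided p-value
print(ci, res.conf_int()[1])       # 95% confidence interval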
3. Evaluating Model Performance: Goodness of Fit Metrics

- R-Squared (R²) : R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, where 1 indicates that the model explains all the variance.
- Adjusted R-Squared: Adjusted R-squared is a modified version of R-squared that adjusts for the number of independent variables in the model, providing a more accurate measure of the model’s explanatory power when comparing models with different numbers of variables.
- For example, an R-squared value of 0.669 means that about 66.9% of the variance in the dependent variable is explained by the model.
- If the adjusted R-squared decreases when adding more variables, it suggests that the additional variables do not contribute significantly to the model and may be omitted.
- F-Statistic and Prob(F-Statistic): The F-statistic is used to test the overall significance of the model. The null hypothesis is that all coefficients (except the intercept) are zero, meaning the model does not explain any variance in the dependent variable. The p-value associated with the F-statistic indicates the probability of observing the F-statistic (or a more extreme value) if the null hypothesis were true. A small p-value (typically less than 0.05) indicates that the model is statistically significant, meaning at least one of the independent variables has a significant effect on the dependent variable.
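All of these goodness-of-fit numbers are also exposed as attributes of the fitted results object, so they can be inspected without parsing the printed table. A small sketch (synthetic data, illustrative values):

Python

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(5, 2, 150)
y = -3.0 + 0.75 * x + rng.normal(0, 1, 150)
res = sm.OLS(y, sm.add_constant(x)).fit()

print(res.rsquared)          # R-squared
print(res.rsquared_adj)      # adjusted R-squared
print(res.fvalue)            # F-statistic for overall significance
print(res.f_pvalue)          # Prob (F-statistic)
print(res.aic, res.bic)      # AIC and BIC from the summary header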
4. Testing Model Assumptions with Diagnostics
The remaining terms are used less often, but the Ordinary Least Squares (OLS) summary provides several diagnostic checks to help assess specific assumptions about the data. Terms like skewness and kurtosis describe the distribution of the residuals. Below are the key diagnostics included in the OLS summary:
- Omnibus: The Omnibus test evaluates the joint normality of the residuals. A higher value suggests a deviation from normality.
- Prob(Omnibus): This p-value indicates the probability of observing the test statistic under the null hypothesis of normality. A value above 0.05 suggests that we do not reject the null hypothesis, implying that the residuals may be normally distributed.
- Jarque-Bera (JB): The Jarque-Bera test is another test for normality that assesses whether the sample skewness and kurtosis match those of a normal distribution.
- Prob(JB): Similar to the Prob(Omnibus), this p-value assesses the null hypothesis of normality. A value greater than 0.05 indicates that we do not reject the null hypothesis.
- Skew: Skewness measures the asymmetry of the distribution of residuals. A skewness value close to zero indicates a symmetrical distribution, while positive or negative values indicate right or left skewness, respectively.
- Kurtosis: Kurtosis measures the “tailedness” of the distribution. A kurtosis value of 3 indicates a normal distribution, while values above or below suggest heavier or lighter tails, respectively.
- Durbin-Watson: This statistic tests for autocorrelation in the residuals from a regression analysis. It ranges from 0 to 4; values close to 2 suggest no autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.
- Cond. No.: The condition number assesses multicollinearity; values above 30 suggest potential multicollinearity issues among the independent variables.
Skewness and kurtosis for the normal distribution are 0 and 3 respectively. These diagnostic tests are essential for validating the reliability of a linear regression model, helping ensure that the model’s assumptions are satisfied.
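The same diagnostics can be computed directly from the residuals using helper functions in statsmodels. The sketch below fits the same kind of synthetic single-predictor model as the earlier sketches and runs the tests on its residuals (values are illustrative):

Python

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

rng = np.random.default_rng(3)
x = rng.normal(5, 2, 150)
y = -3.0 + 0.75 * x + rng.normal(0, 1, 150)
res = sm.OLS(y, sm.add_constant(x)).fit()

omni_stat, omni_p = omni_normtest(res.resid)          # Omnibus and Prob(Omnibus)
jb_stat, jb_p, skew, kurt = jarque_bera(res.resid)    # JB, Prob(JB), Skew, Kurtosis
dw = durbin_watson(res.resid)                         # Durbin-Watson

print(omni_stat, omni_p)
print(jb_stat, jb_p, skew, kurt)
print(dw)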
Practical Application: Case Study Interpretation
In this section, we will explore a practical application of Ordinary Least Squares (OLS) regression through a case study on predicting house prices. We will break down the OLS summary output step by step and offer insights on how to refine the model based on our interpretation, with the help of Python code that performs OLS regression using the statsmodels library. The code below fits the model, displays the summary output, and is followed by an interpretation of the key metrics.
Python

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset creation
data = {
    'Size': np.random.randint(500, 4000, 200),   # Size in square feet
    'Bedrooms': np.random.randint(1, 6, 200),    # Number of bedrooms
    'Age': np.random.randint(0, 30, 200),        # Age of the house
}
data['Price'] = (15000 + data['Size'] * 200 + data['Bedrooms'] * 7500
                 - data['Age'] * 300 + np.random.normal(0, 10000, 200))
df = pd.DataFrame(data)

# Define independent variables (X) and dependent variable (y)
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']

# Add a constant term to the independent variables
X = sm.add_constant(X)

# Fit the OLS regression model
model = sm.OLS(y, X).fit()
print(model.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3.385e+04
Date: Tue, 05 Nov 2024 Prob (F-statistic): 9.03e-266
Time: 07:05:46 Log-Likelihood: -2111.6
No. Observations: 200 AIC: 4231.
Df Residuals: 196 BIC: 4244.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.602e+04 2417.705 6.626 0.000 1.13e+04 2.08e+04
Size 199.5895 0.629 317.423 0.000 198.349 200.829
Bedrooms 7570.0430 475.547 15.919 0.000 6632.198 8507.888
Age -257.4872 81.980 -3.141 0.002 -419.163 -95.811
==============================================================================
Omnibus: 2.464 Durbin-Watson: 1.816
Prob(Omnibus): 0.292 Jarque-Bera (JB): 2.125
Skew: -0.241 Prob(JB): 0.346
Kurtosis: 3.151 Cond. No. 9.22e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Step-by-Step Breakdown of Each Component
- Dependent Variable: In this case, the dependent variable is “Price,” which is what we aim to predict.
- R-Squared and Adjusted R-Squared:
- R-squared (0.998) indicates that 99.8% of the variance in house prices is explained by the model. Such a high value is expected here because the data were generated from an almost exactly linear relationship.
- Adjusted R-squared (0.998) accounts for the number of predictors, providing a more conservative estimate of model performance, particularly when comparing models with different numbers of variables.
- F-Statistic and Prob (F-Statistic):
- F-statistic (33,850) assesses the overall significance of the regression model. A high value indicates that at least one predictor variable is significantly related to house prices.
- Prob (F-statistic) of [Tex]9.03 \times 10^{-266}[/Tex] is essentially zero, showing that the model is statistically significant.
- Coefficients:
- Intercept (about 16,020) suggests that when all predictors are zero (hypothetically), the base price is about $16,020, close to the $15,000 baseline used to generate the data.
- Size (199.59) indicates that each additional square foot increases the price by about $200.
- Bedrooms (7,570.04) suggests that adding a bedroom increases the house price by about $7,570.
- Age (-257.49) indicates that for each year increase in the age of the house, the price decreases by about $257.
- Standard Errors, t-Statistics, and P-Values:
- The standard errors provide a measure of the accuracy of the coefficient estimates.
- The t-statistics and p-values indicate that all predictors are statistically significant (p < 0.05), meaning they have a significant impact on house prices.
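The values discussed above do not have to be read off the printed table. Assuming the case-study code earlier in this section has been run (so that `model` exists), they can be accessed programmatically:

Python

# Accessing the quantities from the fitted `model` object of the case study
print(model.params)        # coefficients: const, Size, Bedrooms, Age
print(model.bse)           # standard errors
print(model.tvalues)       # t-statistics
print(model.pvalues)       # p-values
print(model.conf_int())    # 95% confidence intervals
print(model.rsquared, model.rsquared_adj)   # R-squared and adjusted R-squared
print(model.fvalue, model.f_pvalue)         # F-statistic and its p-value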
Insights on Model Adjustment
Based on the interpretation of the OLS output:
- If any of the p-values were greater than 0.05, it would suggest that the respective variable does not significantly contribute to the model and could potentially be removed.
- Additionally, examining the residual plots can help identify patterns that suggest the need for transformation of variables or the inclusion of interaction terms (a quick way to produce such plots is sketched below).
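As a rough sketch of that residual check (assuming the case-study code above has been run so that `model` is available), the two most common plots are residuals versus fitted values and a normal Q-Q plot:

Python

import matplotlib.pyplot as plt
import statsmodels.api as sm

fitted = model.fittedvalues
residuals = model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: a patternless cloud around zero supports linearity and
# constant variance; a funnel or curve suggests transformations or new terms.
axes[0].scatter(fitted, residuals, alpha=0.6)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')

# Q-Q plot: points near the 45-degree line support the normality assumption
sm.qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Normal Q-Q')

plt.tight_layout()
plt.show()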
Conclusion
Interpreting the results of an OLS summary involves a thorough examination of various statistical metrics and diagnostic checks. Understanding the coefficients, standard errors, t-statistics, p-values, R-squared, F-statistic, and other diagnostics is crucial for evaluating the model’s performance and making informed decisions. Additionally, ensuring that the model meets the OLS assumptions and considering both statistical and practical significance are essential steps in the interpretation process. By carefully analyzing these components, researchers and analysts can gain valuable insights into the relationships between variables and make more accurate predictions and decisions.