Gaussian Distribution In Machine Learning

Last Updated : 30 Jul, 2024

The Gaussian distribution, also known as the normal distribution, plays a fundamental role in machine learning. It is a key concept used to model the distribution of real-valued random variables and is essential for understanding various statistical methods and algorithms.

Table of Content

  • Gaussian Distribution
  • Gaussian Distribution Curve
  • Gaussian Distribution Table
  • Properties of Gaussian Distribution
  • Machine Learning Methods that Use Gaussian Distribution
  • Implementation of Gaussian Distribution in Machine Learning

Gaussian Distribution

The Gaussian distribution, also known as the normal distribution, is a continuous probability distribution that is symmetric about its mean. About 68% of its probability mass lies within one standard deviation of the mean, and its density has the characteristic bell shape.

Gaussian Distribution Formula

The PDF (probability density function) of the Gaussian distribution is given by the formula:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

where:

  • x is the value of the random variable
  • μ is the mean
  • σ is the standard deviation (σ > 0)
  • exp(·) is the exponential function, that is, e raised to the given power, where e is the base of the natural logarithm
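The formula can be evaluated term by term. The following is a minimal sketch using only the Python standard library; the helper name `gaussian_pdf` is introduced here for illustration, and `statistics.NormalDist` serves only as a cross-check.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Evaluate the Gaussian density f(x) for mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # 1 / (sigma * sqrt(2*pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)         # -(x - mu)^2 / (2*sigma^2)
    return coefficient * math.exp(exponent)


# Peak of the standard normal density, which equals 1 / sqrt(2*pi)
peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)

# Cross-check against the standard library implementation
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(round(peak, 4), round(reference, 4))  # 0.3989 0.3989
```

Because the coefficient is 1 / (σ√(2π)), halving σ doubles the height of the peak, which matches the narrower and taller curve described for small standard deviations.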

Gaussian Distribution Curve

The curve is symmetric and bell-shaped, and it mathematically represents the probability distribution of a continuous random variable. The Gaussian distribution is characterized by two parameters: the mean (μ) and the standard deviation (σ), which determine the location and the spread of the curve.

[Figure: probability distribution curve]

  • Standard deviations subdivide the area under the normal curve. Each section corresponds to the percentage of data that falls in that region of the graph.
  • Analysis: A smaller standard deviation results in a narrower and taller bell curve, indicating that data points are clustered closely around the mean. Conversely, a larger standard deviation leads to a wider and shorter bell curve, indicating that data points are more spread out from the mean.
  • The Empirical Rule, also known as the 68-95-99.7 rule, gives the proportion of data falling within one, two, and three standard deviations of the mean in a normal distribution. It provides a quick way to estimate the spread of data without detailed calculations:
  • Within one standard deviation of the mean (Mean ± 1 SD), approximately 68% of the data is expected to fall.
  • Within two standard deviations of the mean (Mean ± 2 SD), approximately 95% of the data is expected to fall.
  • Within three standard deviations of the mean (Mean ± 3 SD), approximately 99.7% of the data is expected to fall.
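The three percentages of the Empirical Rule can be reproduced from the cumulative distribution function. This is a small sketch using `statistics.NormalDist` from the standard library; the helper name `mass_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def mass_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {mass_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The result does not depend on the particular mean and standard deviation, because any normal variable can be standardized to the standard normal.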

Gaussian Distribution Table

  • A Gaussian distribution table, also known as a standard normal table or z-table, tabulates probabilities for the standard normal distribution.
  • The standard normal distribution has a mean (central value) of 0 and a standard deviation of 1.
  • A z-value is the number of standard deviations a point lies from the mean. In principle it ranges from negative infinity to positive infinity, although a printed table covers only a finite range.
  • Z-tables come in two common forms. A cumulative table lists P(Z ≤ z), the probability that a standard normal random variable is less than or equal to z. The table below is the other form: each entry is the area between the mean (z = 0) and z, so the cumulative probability for a positive z is 0.5 plus the entry.

Note:

  • Rows = z-value to one decimal place, ranging from 0.0 to 2.0 in increments of 0.1.
  • Columns = second decimal place of z, ranging from 0.00 to 0.09 in increments of 0.01.


Z-Value  0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0      0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1      0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2      0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3      0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4      0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5      0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6      0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7      0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8      0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9      0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0      0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1      0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2      0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3      0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4      0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5      0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6      0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7      0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8      0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9      0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
2.0      0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817

The z-table is often used in statistical calculations and hypothesis testing to determine the probabilities associated with specific z-values.

For example, to look up z = 1.96, read the row for 1.9 and the column for 0.06, which gives 0.4750. Adding 0.5 for the left half of the curve gives a cumulative probability of approximately 0.975, so about 97.5% of the area under the standard normal curve lies to the left of z = 1.96.
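The entries in the table above start at 0 for z = 0, so they correspond to the area between the mean and z rather than to the full cumulative probability. The sketch below reproduces the z = 1.96 lookup with `statistics.NormalDist`; the helper name `area_mean_to_z` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_mean_to_z(z: float) -> float:
    """Area under the standard normal curve between 0 and z (the quantity tabulated above)."""
    return standard_normal.cdf(z) - 0.5


z = 1.96
table_entry = area_mean_to_z(z)      # row 1.9, column 0.06 of the table
cumulative = standard_normal.cdf(z)  # P(Z <= 1.96)

print(f"{table_entry:.4f} {cumulative:.4f}")  # 0.4750 0.9750
```

For a negative z, symmetry gives the cumulative probability as 0.5 minus the table entry for |z|.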

Properties of Gaussian Distribution

Some of the important properties are:

  • The Gaussian distribution is symmetric about its mean: the probability density at μ + d equals the density at μ − d for any distance d.
  • By the central limit theorem, the suitably normalized sum of many independent, identically distributed random variables with finite variance converges in distribution to a Gaussian.
  • When the mean and variance of a Gaussian distribution are estimated from data, the maximum likelihood estimators have simple closed forms: the sample mean and the sample variance computed with divisor n.
  • Under linear transformations, if X follows a Gaussian distribution with mean μ and standard deviation σ, then aX + b follows a Gaussian distribution with mean aμ + b and standard deviation |a|σ for constants a ≠ 0 and b. This closure property makes the Gaussian distribution convenient for modeling real-world phenomena that involve linear transformations.
  • In multiple dimensions, the Gaussian distribution extends naturally. It describes how multiple variables can be jointly Gaussian, meaning that any linear combination of these variables also follows a Gaussian distribution. This property is valuable for modeling complex systems with multiple interacting variables.
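Two of these properties are easy to check numerically. The sketch below, using only the standard library, draws samples from a Gaussian, computes the maximum likelihood estimates (the sample mean and the divisor-n standard deviation, which is what `statistics.pstdev` returns), and compares a transformed sample against the closed-form result for aX + b. The parameter values, the constants `a` and `b`, and the sample size are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean, pstdev

random.seed(0)  # fixed seed so the sketch is reproducible

mu, sigma = 5.0, 2.0  # illustrative parameters (assumed)
a, b = 3.0, -1.0      # illustrative constants for the map x -> a*x + b

# Draw samples from a Gaussian with mean mu and standard deviation sigma
samples = [random.gauss(mu, sigma) for _ in range(20_000)]

# Maximum likelihood estimates: sample mean and divisor-n standard deviation
mu_hat = fmean(samples)
sigma_hat = pstdev(samples)

# Apply the linear transformation to every sample
transformed = [a * x + b for x in samples]

# Closed form: aX + b is Gaussian with mean a*mu + b and standard deviation |a|*sigma
expected = NormalDist(mu, sigma) * a + b

print(f"estimated:          mean={mu_hat:.2f}, sd={sigma_hat:.2f}")
print(f"transformed sample: mean={fmean(transformed):.2f}, sd={pstdev(transformed):.2f}")
print(f"closed form:        mean={expected.mean:.2f}, sd={expected.stdev:.2f}")
```

The sample-based numbers vary with the seed and sample size, while the closed-form line is exact: a mean of 14 and a standard deviation of 6 for these assumed values.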

Machine Learning Methods that Use Gaussian Distribution

  • Likelihood Modeling: Many models assume that the observed data or its noise is Gaussian. Least-squares linear regression corresponds to assuming Gaussian noise on the targets, and Gaussian mixture models assume that each component is Gaussian. This assumption simplifies the model and allows for efficient parameter estimation. Logistic regression, by contrast, uses a Bernoulli likelihood rather than a Gaussian one.
  • Bayesian Inference: In Bayesian machine learning, the Gaussian distribution is commonly used as a prior distribution over model parameters. This prior reflects beliefs about the parameters before observing any data and is updated to a posterior distribution using Bayes' theorem.
  • Clustering: Gaussian mixture models (GMMs) represent the data as a weighted mixture of Gaussian components. They can model complex data distributions and are often used in image segmentation and data compression.
  • Anomaly Detection: The Gaussian distribution is often used in anomaly detection algorithms, where the goal is to identify rare events or outliers in the data. Anomalies are detected based on how unlikely a data point is under the fitted Gaussian distribution.
  • Dimensionality Reduction: Principal Component Analysis (PCA) finds the directions of maximum variance in the data, which correspond to the principal components. For jointly Gaussian data, these components are not only uncorrelated but also independent.
  • Kernel Methods: The Gaussian (RBF) kernel is commonly used in kernelized machine learning algorithms, such as Support Vector Machines (SVMs) and Gaussian Processes (GPs), to define the similarity between data points.
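As one concrete illustration of the list above, Gaussian-based anomaly detection can be sketched in a few lines. The readings below are made-up values, the 3-standard-deviation threshold is a common convention rather than a fixed rule, and fitting on a window assumed to be anomaly-free is a simplification; the helper name `is_anomaly` is hypothetical.

```python
from statistics import NormalDist

# Hypothetical sensor readings; the last value is deliberately extreme
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.1, 10.0, 14.5]

# Fit a Gaussian to a reference window assumed to be anomaly-free
reference = readings[:-1]
model = NormalDist.from_samples(reference)


def is_anomaly(value: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the fitted mean."""
    z = (value - model.mean) / model.stdev
    return abs(z) > threshold


flags = [is_anomaly(v) for v in readings]
print(flags)
# [False, False, False, False, False, False, False, False, False, True]
```

Fitting the Gaussian on data that already contains extreme values would inflate the estimated standard deviation and could hide the very points being searched for, which is why the sketch fits on a separate reference window.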

Implementation of Gaussian Distribution in Machine Learning

Consider the famous Iris dataset, which consists of 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. We can examine the distribution of one of these features, such as sepal length, using a histogram to see whether it approximately follows a Gaussian distribution.

  • x = np.linspace(np.min(sepal_length), np.max(sepal_length), 100) : The np.linspace function creates an array of 100 evenly spaced numbers between the minimum and maximum values of the sepal length feature (sepal_length). This array is used to plot the Gaussian distribution curve.
Python
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset
iris = load_iris()
sepal_length = iris.data[:, 0]  # Extract sepal length (feature at index 0)

mu, std = np.mean(sepal_length), np.std(sepal_length)
x = np.linspace(np.min(sepal_length), np.max(sepal_length), 100)
y = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / std)**2)

plt.figure(figsize=(8, 6))
plt.hist(sepal_length, bins=20, color='skyblue', edgecolor='black', alpha=0.7, density=True)
plt.plot(x, y, color='red', label='Gaussian Fit')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Density')
plt.title('Distribution of Sepal Length in Iris Dataset with Gaussian Fit')
plt.legend()
plt.show()


Output:

Figure 1: Histogram of sepal length in the Iris dataset with the fitted Gaussian curve.


  • Central Tendency: The fitted curve peaks at the sample mean, about 5.8 centimeters, so sepal lengths in the dataset cluster around that value.
  • Variability: The spread of the distribution (standard deviation) indicates how much the sepal lengths vary from the mean. A larger standard deviation would imply more variability in sepal lengths among the iris flowers.
  • Normality: The distribution roughly follows a bell-shaped curve, which is characteristic of a normal (Gaussian) distribution. This suggests that sepal lengths in the Iris dataset may be approximately normally distributed, although a histogram alone is not a formal test.
  • Outliers: Observations far out in the tails, particularly the right tail, correspond to iris flowers with unusually long sepals compared to the rest of the dataset. Such values could be due to measurement errors or could represent a distinct subgroup of iris flowers.
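The visual judgement of normality can be complemented with simple numbers. One rough check, sketched below with the standard library, compares the sample skewness and excess kurtosis to 0, the values a Gaussian would give. The helper names are hypothetical, and the data is synthetic (a Gaussian sample with assumed mean 5.8 and standard deviation 0.8, plus a skewed exponential sample for contrast) so that the sketch runs without loading the Iris dataset; an array such as sepal_length could be passed to the same functions.

```python
import random
from statistics import fmean


def central_moment(data, order):
    """Return the central moment of the given order (divisor n)."""
    mean = fmean(data)
    return fmean([(x - mean) ** order for x in data])


def skewness(data):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3; 0 for a Gaussian."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


random.seed(42)  # fixed seed so the sketch is reproducible
gaussian_sample = [random.gauss(5.8, 0.8) for _ in range(10_000)]
skewed_sample = [random.expovariate(1.0) for _ in range(10_000)]

for name, sample in (("gaussian", gaussian_sample), ("exponential", skewed_sample)):
    print(f"{name}: skew={skewness(sample):+.2f}, "
          f"excess kurtosis={excess_kurtosis(sample):+.2f}")
```

Values near 0 for both statistics are consistent with a Gaussian shape but do not prove it; a clearly non-zero skewness, as in the exponential sample, indicates an asymmetric distribution.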

The stability of Gaussian distributions under linear combinations makes it possible to derive analytical solutions for the behavior of random variables and to make predictions from data, which is why the Gaussian distribution is a cornerstone of statistical modeling and analysis.


selvi1977rctbc