Expectation-Maximization Algorithm - ML
Last Updated: 16 May, 2025
The Expectation-Maximization (EM) algorithm is an iterative method used in unsupervised machine learning to estimate unknown quantities in statistical models. It finds the best values for unknown parameters, especially when some data is missing or hidden. It works in two steps:
- E-step (Expectation Step): Estimates missing or hidden values using current parameter estimates.
- M-step (Maximization Step): Updates model parameters to maximize the likelihood based on the estimated values from the E-step.
This process repeats until the model reaches a stable solution, improving the estimates with each iteration. EM is widely used in clustering, notably in Gaussian Mixture Models, and in handling missing data (a quick scikit-learn illustration follows below).
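As a quick illustration before diving into the details, scikit-learn's GaussianMixture class fits a Gaussian Mixture Model using EM internally. A minimal sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two Gaussians
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2, 1, 200),
                    rng.normal(-1, 0.8, 600)]).reshape(-1, 1)

# GaussianMixture estimates its means, variances and mixing
# weights with the EM algorithm internally
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_.ravel())   # estimated component means (close to 2 and -1)
print(gm.weights_)         # estimated mixing proportions (close to 0.25 and 0.75)
```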
[Figure: Expectation-Maximization in EM Algorithm]
By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of the observed data.
Key Terms in Expectation-Maximization (EM) Algorithm
Let's go over some of the most commonly used key terms in the Expectation-Maximization (EM) Algorithm:
- Latent Variables: These are hidden parts of the data that we can’t see directly but they still affect what we do see. We try to guess their values using the visible data.
- Likelihood: This refers to the probability of seeing the data we have based on certain assumptions or parameters. The EM algorithm tries to find the best parameters that make the data most likely.
- Log-Likelihood: This is just the natural log of the likelihood function. It's used to make calculations easier and measure how well the model fits the data. The EM algorithm tries to maximize the log-likelihood to improve the model fit.
- Maximum Likelihood Estimation (MLE): This is a method for finding the best values for a model’s settings, called parameters. It looks for the values that make the observed data most likely to occur (see the sketch after this list).
- Posterior Probability: In Bayesian methods this is the probability of the parameters given both prior knowledge and the observed data. In EM it helps estimate the "best" parameters when there's uncertainty about the data.
- Expectation (E) Step: In this step the algorithm estimates the missing or hidden information (latent variables) based on the observed data and current parameters. It calculates probabilities for the hidden values given what we can see.
- Maximization (M) Step: This step updates the parameters by finding the values that maximize the likelihood, based on the estimates from the E-step.
- Convergence: Convergence happens when the algorithm has reached a stable point. This is checked by seeing if the changes in the model's parameters or the log-likelihood are small enough to stop the process.
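To make likelihood, log-likelihood and MLE concrete, here is a minimal sketch (using NumPy and SciPy, as in the implementation below) showing that the sample mean maximizes the Gaussian log-likelihood:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # observed data

def log_likelihood(mu, sigma=1.0):
    # Log of the probability of seeing this data for parameters (mu, sigma)
    return np.sum(norm.logpdf(x, mu, sigma))

# Scan candidate means: the maximizer matches the sample mean,
# which is the closed-form MLE for a Gaussian mean
mus = np.linspace(0, 2, 201)
best_mu = mus[np.argmax([log_likelihood(m) for m in mus])]
print(best_mu, np.mean(x))   # both are approximately 1.1
```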
Working of Expectation-Maximization (EM) Algorithm
So far, we've discussed the key terms in the EM algorithm. Now, let's dive into how the EM algorithm works. Here's a step-by-step breakdown of the process:
[Figure: EM Algorithm Flowchart]
1. Initialization: The algorithm starts with initial parameter values and assumes the observed data comes from a specific model.
2. E-Step (Expectation Step):
- Find the missing or hidden data based on the current parameters.
- Calculate the posterior probability of each latent variable based on the observed data.
- Compute the log-likelihood of the observed data using the current parameter estimates.
3. M-Step (Maximization Step):
- Update the model parameters by maximizing the log-likelihood.
- The better the model fits the data, the higher this value becomes.
4. Convergence:
- Check if the model parameters are stable and converging.
- If the changes in log-likelihood or parameters are below a set threshold, stop. If not, repeat the E-step and M-step until convergence is reached (a generic sketch of this loop follows below).
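The following is a generic sketch of this loop, assuming the caller supplies model-specific e_step, m_step and log_likelihood functions (hypothetical names used here for illustration); the Gaussian-mixture versions of these computations appear inline in the implementation section below:

```python
def run_em(X, params, e_step, m_step, log_likelihood,
           max_iters=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the change
    in log-likelihood falls below a tolerance."""
    prev_ll = -float('inf')
    for _ in range(max_iters):
        # E-step: posterior over latent variables given current parameters
        responsibilities = e_step(X, params)
        # M-step: parameters that maximize the expected log-likelihood
        params = m_step(X, responsibilities)
        # Convergence check on the log-likelihood
        ll = log_likelihood(X, params)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return params
```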
Implementation of Expectation-Maximization Algorithm
Step 1: Import the necessary libraries
First, we import the necessary Python libraries: NumPy, Seaborn, SciPy and Matplotlib.
```python
import numpy as np
import seaborn as sns
from scipy.stats import norm
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
```
Step 2: Generate a dataset with two Gaussian components
We generate two sets of data values from two different normal distributions:
- One centered around 2 (with more spread).
- Another around -1 (with less spread).
These two sets are then combined to form a single dataset. We plot this dataset to visualize how the values are distributed.
```python
# True parameters of the two components
mu1, sigma1 = 2, 1
mu2, sigma2 = -1, 0.8

# Sample 200 points from the first Gaussian and 600 from the second
X1 = np.random.normal(mu1, sigma1, size=200)
X2 = np.random.normal(mu2, sigma2, size=600)
X = np.concatenate([X1, X2])

sns.kdeplot(X)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.show()
```
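Note that np.random.normal draws different samples on every run, so the exact plot will vary; calling np.random.seed (e.g. np.random.seed(0)) before sampling makes the results reproducible.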
Output:
[Figure: Density Plot]
Step 3: Initialize parameters
We make initial guesses for each group’s:
- Mean (average),
- Standard deviation (spread),
- Proportion (how much each group contributes to the total data).
```python
# Initial guesses for each component's mean, standard deviation
# and mixing proportion
mu1_hat, sigma1_hat = np.mean(X1), np.std(X1)
mu2_hat, sigma2_hat = np.mean(X2), np.std(X2)
pi1_hat, pi2_hat = len(X1) / len(X), len(X2) / len(X)
```
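Note that this initialization uses the true component memberships X1 and X2 to keep the example simple; with genuinely unlabeled data you would instead start from random guesses or, for example, k-means cluster centers.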
Step 4: Run the EM iterations
We run a loop for 20 rounds, called epochs. In each round:
- The E-step calculates the responsibilities (gamma values) by evaluating the Gaussian probability densities for each component and weighting them by the corresponding proportions.
- The M-step updates the parameters by computing the weighted mean and standard deviation for each component.
We also calculate the log-likelihood in each round to check if the model is getting better. This is a measure of how well the model explains the data.
```python
num_epochs = 20
log_likelihoods = []

for epoch in range(num_epochs):
    # E-step: Compute responsibilities
    gamma1 = pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
    gamma2 = pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)
    total = gamma1 + gamma2
    gamma1 /= total
    gamma2 /= total

    # M-step: Update parameters
    mu1_hat = np.sum(gamma1 * X) / np.sum(gamma1)
    mu2_hat = np.sum(gamma2 * X) / np.sum(gamma2)
    sigma1_hat = np.sqrt(np.sum(gamma1 * (X - mu1_hat)**2) / np.sum(gamma1))
    sigma2_hat = np.sqrt(np.sum(gamma2 * (X - mu2_hat)**2) / np.sum(gamma2))
    pi1_hat = np.mean(gamma1)
    pi2_hat = np.mean(gamma2)

    # Compute log-likelihood
    log_likelihood = np.sum(np.log(pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
                                   + pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)))
    log_likelihoods.append(log_likelihood)

plt.plot(range(1, num_epochs + 1), log_likelihoods)
plt.xlabel('Epoch')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood vs. Epoch')
plt.show()
```
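Here we run a fixed 20 epochs for simplicity; in practice you would typically also stop early once the change in log-likelihood falls below a small tolerance, as in the generic loop sketched earlier.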
Output:
[Figure: Epoch vs Log-likelihood]
Step 5: Visualize the Final Result
Now we visualize the result, comparing the final estimated mixture density (in red) with a smooth kernel density estimate of the original data (in green).
```python
X_sorted = np.sort(X)
density_estimation = (pi1_hat * norm.pdf(X_sorted, mu1_hat, sigma1_hat)
                      + pi2_hat * norm.pdf(X_sorted, mu2_hat, sigma2_hat))

plt.plot(X_sorted, gaussian_kde(X_sorted)(X_sorted), color='green', linewidth=2)
plt.plot(X_sorted, density_estimation, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.legend(['Kernel Density Estimation', 'Mixture Density'])
plt.show()
```
Output:
[Figure: Estimated density]
The above image compares the Kernel Density Estimation (green) and the fitted Mixture Density (red) for the variable X. Both show similar patterns, with a main peak near -1.5 and a smaller bump around 2, indicating two data clusters. The red curve is slightly smoother and sharper than the green one.
Advantages of EM algorithm
- Always improves results: Each iteration is guaranteed not to decrease the likelihood of the observed data, so the algorithm steadily moves toward a better solution.
- Simple to implement: The two steps (E-step and M-step) are often easy to code for many problems.
- Quick math solutions: In many cases, the M-step has a direct mathematical (closed-form) solution, making it efficient.
Disadvantages of EM algorithm
- Takes time to finish: It converges slowly, meaning it may take many iterations to reach the best solution.
- Gets stuck in local optima: Instead of finding the absolute best solution, it might settle for a "good enough" one (a sketch using random restarts follows this list).
- Needs extra probabilities: In some applications, such as Hidden Markov Models, EM requires both forward and backward probabilities, unlike some optimization methods that need only forward probability, making it slightly more complex.
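A common remedy for local optima is to run EM from several random initializations and keep the best fit. scikit-learn's GaussianMixture supports this through its n_init parameter; a minimal sketch (the dataset is regenerated here so the snippet stands alone):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Same shape of data as in the implementation above
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2, 1, 200), rng.normal(-1, 0.8, 600)])

# Run EM from 10 different random initializations and keep
# the run with the highest log-likelihood
gm = GaussianMixture(n_components=2, n_init=10, random_state=0)
gm.fit(X.reshape(-1, 1))
print(gm.means_.ravel(), gm.weights_)
```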
The EM algorithm iteratively estimates missing data and updates model parameters to improve accuracy. By alternating between the E-step and M-step, it refines the model until it converges, making it a widely used tool for handling hidden or incomplete data in machine learning.