
Dirichlet Process Mixture Models (DPMMs)

Last Updated : 11 Mar, 2025

Clustering is the process of grouping similar data points together. The goal is to identify natural patterns within the dataset, where points in the same cluster are more similar to each other than to those in different clusters. It is an unsupervised learning technique, meaning it doesn't require predefined labels or targets.

Key Features of Clustering

  • Similarity Measure: Clustering relies on measuring how similar or dissimilar data points are, using methods like Euclidean distance or cosine similarity.
  • Grouping Criteria: Clusters are formed based on specific rules defined by the chosen algorithm.
  • Unsupervised Nature: The algorithm explores data patterns without prior knowledge of labels.

Flexible Clustering

Traditional clustering methods like K-means or Gaussian Mixture Models (GMMs) require the number of clusters to be specified beforehand. However, in real-world scenarios the number of clusters is often unknown. Flexible clustering methods determine the number of clusters automatically from the data itself.

Dirichlet Process Mixture Models (DPMMs) offer a probabilistic and nonparametric approach to clustering. They dynamically adjust the number of clusters based on the complexity of the data.

Understanding DPMMs

To grasp DPMMs, it's important to understand two key concepts:

1. Beta Distribution

The Beta distribution models probabilities for two possible outcomes (success or failure). It is defined by two parameters α and β that shape the distribution. The probability density function (PDF) is given by:

f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}

Where B(α, β) is the beta function.
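For intuition, the short sketch below (a minimal illustration assuming SciPy is available; the (α, β) pairs are chosen only as examples) evaluates this PDF at a few points in (0, 1):

Python
# Minimal sketch: evaluate the Beta PDF for a few illustrative (alpha, beta) pairs.
import numpy as np
from scipy.stats import beta

x = np.linspace(0.1, 0.9, 5)              # a few points inside (0, 1)
for a, b in [(1, 1), (2, 5), (5, 2)]:     # a uniform shape and two skewed shapes
    print(f"alpha={a}, beta={b}:", np.round(beta.pdf(x, a, b), 3))

With α = β = 1 the distribution is uniform on (0, 1); unequal parameters skew the density toward one end of the interval.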

2. Dirichlet Distribution

The Dirichlet distribution is a generalization of the Beta distribution to multiple outcomes. It represents the probabilities of different categories, like rolling a die with unknown probabilities for each side.

The PDF of the Dirichlet distribution is:

f(p, \alpha) = \frac{1}{B(\alpha)}\prod_{i=1}^{K}p_i^{\alpha_i - 1}

  • p = (p₁, p₂, …, p_K) is a vector representing a probability distribution over K categories. Each pᵢ is a probability and ∑ᵢ pᵢ = 1.
  • α = (α₁, α₂, …, α_K) is a vector of positive shape parameters that determines the shape of the distribution.
  • B(α) is the multivariate beta function (the normalizing constant).

How α Affects the Distribution

  • Higher α values result in probabilities concentrated around the mean.
  • Equal α values produce symmetric distributions.
  • Different α values create skewed distributions.
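To see this concretely, the following sketch draws samples with NumPy's Dirichlet sampler for a few illustrative choices of α and prints the sample mean and spread:

Python
# Minimal sketch: sample from Dirichlet distributions with different concentration vectors.
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([1, 1, 1], [50, 50, 50], [1, 5, 10]):
    samples = rng.dirichlet(alpha, size=2000)       # each sampled row sums to 1
    print("alpha =", alpha,
          "| mean:", np.round(samples.mean(axis=0), 2),
          "| std:", np.round(samples.std(axis=0), 2))

Equal α values give a symmetric distribution around (1/3, 1/3, 1/3), larger values concentrate the samples around the mean (smaller standard deviation), and unequal values skew the distribution toward the categories with larger αᵢ.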

Dirichlet Process (DP)

A Dirichlet Process is a stochastic process whose draws are themselves probability distributions, defined over a potentially infinite number of categories. It enables clustering without specifying the number of clusters in advance.

The Dirichlet Process is defined as:

DP(α, G₀)

Where:

  • α: Concentration parameter controlling cluster diversity.
  • G₀: Base distribution representing the prior belief about cluster parameters.

Stick-Breaking Process

The stick-breaking process is a constructive way to generate the mixture weights of a Dirichlet Process. The idea is as follows:

  • We take a stick of unit length 1, representing the total probability mass of our base distribution.
  • Using the marginal property of the Dirichlet distribution, we break off a piece whose fraction follows a Beta distribution. Suppose the length obtained is p₁.
  • Conditioned on this break, the distribution over the remaining categories is again a Dirichlet distribution.
  • The remaining stick has length 1 − p₁, so we can apply the marginal property again.
  • We repeat these steps to obtain enough pieces pᵢ that their sum is close to 1.
  • Mathematically, with each βₖ drawn independently from Beta(1, α):
    • For k = 1: p₁ = β₁
    • For k = 2: p₂ = β₂(1 − p₁)
    • For k = 3: p₃ = β₃(1 − p₁ − p₂), and so on.

For each piece of the stick we also sample a mean μₖ from the base distribution G₀; these become our cluster parameters.
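The sketch below implements this construction directly. It assumes a standard normal base distribution G₀ and a simple tolerance-based truncation; both are illustrative choices, not part of the definition.

Python
# Minimal sketch of the stick-breaking construction with G0 = N(0, 1).
import numpy as np

def stick_breaking(alpha, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    weights, params = [], []
    remaining = 1.0                       # length of the stick not yet broken off
    while remaining > tol:
        b = rng.beta(1, alpha)            # fraction to break off ~ Beta(1, alpha)
        weights.append(b * remaining)     # p_k = b_k * (1 - p_1 - ... - p_{k-1})
        params.append(rng.normal(0, 1))   # cluster parameter mu_k drawn from G0
        remaining *= (1 - b)
    return np.array(weights), np.array(params)

weights, mus = stick_breaking(alpha=2.0)
print("pieces:", len(weights), "| total weight:", round(weights.sum(), 3))

Smaller α makes the first few pieces large, so most of the probability mass goes to a handful of clusters; larger α spreads the weight across many small pieces.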

Dirichlet Process Mixture Models (DPMMs)

A DPMM is an extension of Gaussian Mixture Models where the number of clusters is not fixed. It uses the Dirichlet Process as a prior for the mixture components.

How DPMMs Work

  1. Initialize: Assign random clusters to data points.
  2. Iteration:
    • Pick one data point.
    • Fix other cluster assignments.
    • Assign the point to an existing cluster or a new cluster based on probabilities.
  3. Repeat: Continue until cluster assignments no longer change.

The probability of assigning a point x to an existing cluster k is proportional to: \frac{n_k}{n-1+\alpha}\,\mathcal{N}(x \mid \mu_k, 1)

The probability of assigning it to a new cluster is proportional to: \frac{\alpha}{n-1+\alpha}\,\mathcal{N}(x \mid 0, 1)

Where:

  • nₖ: Number of points currently in cluster k.
  • n: Total number of data points.
  • α: Concentration parameter.
  • 𝒩(x | μ, 1): Gaussian density with mean μ and unit variance.
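As a concrete, hypothetical illustration of these two formulas, the sketch below computes the normalised assignment probabilities for a single one-dimensional point, given the current cluster means and sizes (all numbers are made up for the example):

Python
# Minimal sketch: assignment probabilities for one 1-D point x, using the
# unit-variance Gaussian likelihoods from the formulas above.
import numpy as np
from scipy.stats import norm

def assignment_probs(x, cluster_means, cluster_counts, alpha):
    n = sum(cluster_counts) + 1                                   # include the point itself
    probs = [nk / (n - 1 + alpha) * norm.pdf(x, loc=mu, scale=1)  # existing cluster k
             for nk, mu in zip(cluster_counts, cluster_means)]
    probs.append(alpha / (n - 1 + alpha) * norm.pdf(x, loc=0, scale=1))  # new cluster
    probs = np.array(probs)
    return probs / probs.sum()                                    # normalise to sum to 1

# Two existing clusters with means 0 and 3, sizes 10 and 5, alpha = 1
print(np.round(assignment_probs(1.2, [0.0, 3.0], [10, 5], alpha=1.0), 3))

The last entry is the probability of opening a new cluster; it grows with α and shrinks as existing clusters accumulate more points.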


Implementing Dirichlet Process Mixture Models using Sklearn

Now let us implement a DPMM in scikit-learn using the Mall Customers Segmentation dataset. Let's go through it step by step:

Step 1: Import Libraries and Load Dataset

In this step we import the necessary libraries and load the dataset, which contains customer information including age, annual income and spending score.

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
from sklearn.decomposition import PCA

# Load the Mall Customers dataset
url = 'https://raw.githubusercontent.com/vihar/Customer-Segmentation-Tutorial-in-Python/master/Mall_Customers.csv'
data = pd.read_csv(url)
print(data.head())

Step 2: Feature Selection

In this step we select features that are likely to influence customer clusters.

Python
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values 


Step 3: Dimensionality Reduction

We will use the PCA algorithm to reduce the data to 2 dimensions for easy visualization.

Python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

Step 4: Fit Bayesian Gaussian Mixture Model

scikit-learn's BayesianGaussianMixture places a (truncated) Dirichlet Process prior over the mixture weights by default, so n_components only sets an upper bound and the model automatically determines how many clusters the data actually needs.

Python
# The default weight_concentration_prior_type='dirichlet_process' makes this a truncated DPMM
bgm = BayesianGaussianMixture(n_components=10, covariance_type='full', random_state=42)
bgm.fit(X)
labels = bgm.predict(X)
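As an optional check (not part of the original tutorial), counting the components whose learned weight exceeds a small threshold shows how many clusters the model actually uses; the 0.01 cutoff is an arbitrary illustrative choice.

Python
# Components with negligible weight are effectively switched off by the model.
effective = np.sum(bgm.weights_ > 0.01)
print("Effective number of clusters:", effective)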

Step 5: Visualization

Clusters are visualized with different colors, making patterns easier to interpret.

Python
# Visualize the clusters in the 2-D PCA space computed in Step 3
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, edgecolors='k',
            cmap=plt.cm.Paired, marker='o', s=100, linewidth=2)
plt.title('Bayesian Gaussian Mixture Model Clustering')
plt.show()

Output: a scatter plot of the PCA-projected customers, coloured by the cluster each point was assigned to.

Advantages over traditional methods

  • One of the primary advantages of DPMMs is their ability to automatically determine the number of clusters in the data. Traditional methods often require the number of clusters to be pre-specified (e.g., in k-means), which can be challenging in real-world applications.
  • DPMMs operate within a probabilistic framework, allowing uncertainty to be quantified. Traditional methods often provide "hard" assignments of data points to clusters, while DPMMs give probabilistic cluster assignments that capture the uncertainty inherent in the data.
  • DPMMs find applications in a wide range of fields including natural language processing, computer vision, bioinformatics and finance. Their flexibility makes them applicable to diverse datasets and problem domains.
