Determining the Number of Clusters in Data Mining

Last Updated : 13 Feb, 2022

In clustering algorithms such as K-Means, we have to determine the right number of clusters for our dataset so that the data is divided properly and efficiently. An appropriate value of 'k', i.e. the number of clusters, ensures proper granularity of the clusters and maintains a good balance between compressibility and accuracy.

Let us consider two extreme cases:

Case 1: Treat the entire dataset as one cluster.
Case 2: Treat each data point as its own cluster.

Case 1 gives maximum compression, but a single cluster is too coarse to reveal any structure in the data. Case 2 gives the most accurate clustering, because the distance between every data point and its cluster center is zero, but it does not help in predicting new inputs and does not enable any kind of data summarization.

So, we can conclude that it is very important to determine the 'right' number of clusters for any dataset. This is a challenging task, but it becomes quite approachable if we consider the shape and scale of the data distribution. A simple rule of thumb is to set the number of clusters to about √(n/2) for a dataset of n points. In the rest of the article, two methods for determining the number of clusters are described and implemented in Python.
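As a minimal sketch of this rule of thumb (using n = 200, the size of the dataset used below), the computation looks like this:

Python3
# rule-of-thumb upper bound on the number of clusters
# for a dataset of n points: about sqrt(n / 2)
n = 200  # number of rows in the Mall Customers dataset
limit = int((n // 2) ** 0.5)
print(limit)  # prints 10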

1. Elbow Method:

This method is based on the observation that increasing the number of clusters reduces the total within-cluster variance: more clusters allow us to extract finer groups of data objects that are more similar to each other. To choose the 'right' number of clusters, we plot the sum of within-cluster variances against the number of clusters and look for the turning point of the curve. The first clear turning point (the 'elbow') suggests the right value of 'k'. Let us implement the elbow method in Python.

Step 1: Importing the libraries

Python3
# importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

  

Step 2: Loading the dataset


 

We have used the Mall Customers dataset (Mall_Customers.csv).


 

Python3
# loading the dataset
dataset = pd.read_csv('Mall_Customers.csv')

# printing the first five rows of the dataset
print(dataset.head(5))

  

Output:


 

First five rows of the dataset


 

Step 3: Checking for any null values


 

The dataset has 200 rows and 5 columns. It has no null values.


 

Python3
# printing the shape of the dataset
print(dataset.shape)

# checking for any null values
print(dataset.isnull().sum())

  

Output:


 

Shape of the dataset along with count of null values


 

Step 4: Extracting 2 columns from the dataset for clustering


 

Let us extract two columns, namely 'Annual Income (k$)' and 'Spending Score (1-100)', for further processing.


 

Python3
# extracting the values of the two
# columns used for clustering
dataset_new = dataset[['Annual Income (k$)',
                       'Spending Score (1-100)']].values

  

Step 5: Determining the number of clusters using the elbow method and plotting the graph


 

Python3
# determining the maximum number of clusters
# using the simple rule of thumb
limit = int((dataset_new.shape[0] // 2) ** 0.5)

# selecting the optimal value of 'k' using the elbow method
# wcss - within-cluster sum of squared distances
wcss = {}

for k in range(2, limit + 1):
    model = KMeans(n_clusters=k)
    model.fit(dataset_new)
    wcss[k] = model.inertia_

# plotting the WCSS values to find the elbow
# (dict views are converted to lists so matplotlib can plot them)
plt.plot(list(wcss.keys()), list(wcss.values()), 'gs-')
plt.xlabel('Values of "k"')
plt.ylabel('WCSS')
plt.show()

Output:

Plot of Elbow Method

From the above plot, we can observe that the turning point (elbow) of the curve is at k = 5. Therefore, we can say that the 'right' number of clusters for this data is 5.
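If we want to confirm the elbow programmatically instead of reading it off the plot, one simple heuristic (a sketch that is not part of the original article, and which assumes the wcss dictionary computed above) is to look at the second differences of the WCSS curve; the sharpest bend produces the largest second difference:

Python3
import numpy as np

# x and y values of the WCSS curve from the loop above
ks = np.array(list(wcss.keys()))
wcss_values = np.array(list(wcss.values()))

# second differences of the curve; the elbow is where the
# slope changes most abruptly, i.e. the largest second difference
second_diff = np.diff(wcss_values, n=2)
elbow_k = ks[1:-1][np.argmax(second_diff)]
print('Estimated elbow at k =', elbow_k)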

2. Silhouette Score:

The silhouette score is used to evaluate the quality of clusters created by clustering algorithms such as K-Means, in terms of how well each data point fits within its own cluster compared with the other clusters. This method can be used to find the optimal value of 'k'. The score lies in the range [-1, 1], and the value of 'k' whose silhouette score is nearest to 1 can be considered the 'right' number of clusters. In Python, sklearn.metrics.silhouette_score() is used to compute the score. Let us implement this for the same dataset used in the elbow method.
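As a minimal, self-contained sketch of how the function is called (on a tiny hand-made array rather than the Mall Customers data), silhouette_score() takes the feature matrix and the predicted cluster labels and returns a single value in [-1, 1]:

Python3
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# tiny illustrative dataset with two well-separated groups
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# a score close to +1 indicates compact, well-separated clusters
print(silhouette_score(X, labels))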

Step 1: Importing libraries

Python3
# importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

  

Step 2: Loading the dataset


 

We have used the same Mall Customers dataset.


 

Python3
# loading the dataset
dataset = pd.read_csv('Mall_Customers.csv')

# printing the first five rows of the dataset
print(dataset.head(5))

  

Output:


 

First five rows of the dataset


 

Step 3: Checking for any null values


 

The dataset has 200 rows and 5 columns. It has no null values.


 

Python3
# printing the shape of the dataset
print(dataset.shape)

# checking for any null values
print(dataset.isnull().sum())

  

Output:


 

Shape of the dataset along with count of null values


 

Step 4: Extracting 2 columns from the dataset for clustering


 

Let us extract two columns, namely 'Annual Income (k$)' and 'Spending Score (1-100)', for further processing.


 

Python3
# extracting the values of the two
# columns used for clustering
dataset_new = dataset[['Annual Income (k$)',
                       'Spending Score (1-100)']].values

  

Step 5: Determining the number of clusters using silhouette score


 

The minimum number of clusters required for calculating the silhouette score is 2, so the loop starts from 2.


 

Python3
# determining the maximum number of clusters
# using the simple rule of thumb
limit = int((dataset_new.shape[0] // 2) ** 0.5)

# determining the number of clusters
# using the silhouette score method
for k in range(2, limit + 1):
    model = KMeans(n_clusters=k)
    model.fit(dataset_new)
    pred = model.predict(dataset_new)
    score = silhouette_score(dataset_new, pred)
    print('Silhouette Score for k = {}: {:<.3f}'.format(k, score))

 
 

Silhouette scores for k = [2,..,10]


 

As we can observe, k = 5 has the highest silhouette score, i.e. the one nearest to +1. So, we can say that the optimal value of 'k' is 5.


 

We have now determined and validated the number of clusters for the Mall Customers dataset using two methods, the elbow method and the silhouette score. In both cases, k = 5. Let us now perform K-Means clustering on the dataset with k = 5 and plot the clusters.


 

Python3
# clustering the data using KMeans with k = 5
model = KMeans(n_clusters=5)

# fitting the model and predicting the clusters
pred = model.fit_predict(dataset_new)

# plotting all the clusters
colours = ['red', 'blue', 'green', 'yellow', 'orange']

for i in np.unique(model.labels_):
    plt.scatter(dataset_new[pred == i, 0],
                dataset_new[pred == i, 1],
                c=colours[i])

# plotting the cluster centroids
plt.scatter(model.cluster_centers_[:, 0],
            model.cluster_centers_[:, 1],
            s=200,  # marker size
            c='black')

plt.title('K-Means clustering')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()

 
 

Final Clusters so formed


 

From the above plot, we can see that five well-separated clusters have been formed. The cluster centroids are shown in black.


 

