Project | Scikit-learn – Whisky Clustering

Last Updated : 10 Apr, 2025

Introduction | Scikit-learn

Scikit-learn is a machine learning library for Python. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. See the scikit-learn documentation to learn more.
 

Case Study | Clustering Whisky


Aim & Description: Scotch whisky is prized for its complexity and variety of flavors, and the regions of Scotland where it is produced are believed to have distinct flavor profiles. In this case study, we will group Scotch whiskies based on their flavor characteristics. The dataset we'll be using contains a selection of Scotch whiskies from several distilleries, and we'll attempt to cluster them into groups that are similar in flavor. This case study will deepen your understanding of Pandas, NumPy, and scikit-learn, and perhaps of Scotch whisky.
Source: Download the whisky regions dataset and the whisky varieties dataset, and place both files in the working directory. The dataset consists of tasting ratings of one readily available single malt Scotch whisky from almost every active whisky distillery in Scotland. It covers 86 malt whiskies, each scored between 0 and 4 in 12 different taste categories. The scores have been aggregated from 10 different tasters. The taste categories describe whether the whiskies are sweet, smoky, medicinal, spicy, and so on.
 

Pairwise Correlation


The whisky variety dataset contains 86 rows, one per malt whisky, and 17 columns, 12 of which hold the taste-category scores. We add another column with whisky["Region"] = pd.read_csv("regions.txt"), so the DataFrame now has 86 rows and 18 columns (the new column holds the region information). All 18 column names can be listed with whisky.columns. We then narrow our scope to the 12 taste columns, keeping all 86 rows, with whisky.iloc[:, 2:14], and store the result in a variable called flavors. Using the corr() method, we compute the pairwise correlation between the columns of flavors. We code,
 

Python
import numpy as np
import pandas as pd

whisky = pd.read_csv("whiskies.txt")
whisky["Region"] = pd.read_csv("regions.txt")

# whisky.head() shows the first rows; iloc indexes a DataFrame by position.
# whisky.iloc[0:10]       -> rows 0 to 9
# whisky.iloc[0:10, 0:5]  -> rows 0 to 9 and columns 0 to 4
# whisky.columns          -> all column names

# Keep only the 12 taste-category columns (positions 2 to 13).
flavors = whisky.iloc[:, 2:14]

# Pairwise correlation between the taste categories.
corr_flavors = flavors.corr()
print(corr_flavors)

Output: The correlation DataFrame is a 12 × 12 table of pairwise correlations between the taste categories.
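
To make the meaning of corr() concrete, here is a minimal sketch on a made-up table. The DataFrame toy_flavors and its scores are invented for illustration and are not taken from the whisky dataset; only the column names echo the taste categories mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up scores (0-4) for five whiskies in three taste categories.
# These numbers are illustrative only, not real data.
toy_flavors = pd.DataFrame({
    "Smoky":     [4, 3, 0, 1, 2],
    "Medicinal": [4, 2, 0, 0, 1],
    "Sweetness": [1, 1, 4, 3, 2],
})

# corr() compares every column with every other column (Pearson by default).
toy_corr = toy_flavors.corr()
print(toy_corr.round(2))
```

In this toy table, Smoky and Medicinal rise and fall together, so their correlation is positive, while Smoky and Sweetness move in opposite directions, so theirs is negative. The diagonal is always 1 because each column is perfectly correlated with itself.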
 


 

Plotting Pairwise Correlation


We now plot the correlation DataFrame using Matplotlib. For convenience, we display a colorbar alongside each plot. We code,
 

Python
import matplotlib.pyplot as plt

# Plot 1: correlation between the 12 taste categories.
plt.figure(figsize=(10, 10))
plt.pcolor(corr_flavors)
plt.colorbar()
# plt.savefig("corlate-whisky1.pdf")

# Plot 2: correlation between the 86 whiskies (the rows of flavors).
corr_whisky = flavors.transpose().corr()
plt.figure(figsize=(10, 10))
plt.pcolor(corr_whisky)
plt.axis("tight")
plt.colorbar()
# plt.savefig("corlate-whisky2.pdf")

plt.show()

Output: In plots 1 and 2, blue represents the minimum correlation and red the maximum. (The colors depend on the colormap; Matplotlib's current default, viridis, runs from dark purple to yellow instead.) The first plot shows the correlation between the 12 taste categories, and the second shows the correlation between the 86 malt whiskies, obtained by transposing flavors. The second plot looks more complex than the first because its matrix is much larger (86 × 86 instead of 12 × 12).
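
The effect of the transpose can be checked on a small synthetic table. This is only a sketch: the DataFrame toy below is filled with random integers and merely mimics the layout of flavors (rows as whiskies, columns as taste categories).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for flavors: 8 "whiskies" x 12 "categories", scores 0-4.
toy = pd.DataFrame(rng.integers(0, 5, size=(8, 12)))

# corr() always correlates columns, so:
by_category = toy.corr()            # columns are categories -> 12 x 12
by_whisky = toy.transpose().corr()  # columns are whiskies   -> 8 x 8

print(by_category.shape, by_whisky.shape)
```

Because corr() always works column by column, transposing first is what turns a category-by-category comparison into a whisky-by-whisky one.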
 


 


 

Spectral Co-clustering


The goal of co-clustering is to simultaneously cluster the rows and columns of an input data matrix. Here, the whisky-by-whisky correlation matrix corr_whisky is passed to the Spectral Co-Clustering algorithm.
 

Python
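# Recent scikit-learn releases expose SpectralCoclustering from sklearn.cluster;
# the old sklearn.cluster.bicluster module has been removed.
from sklearn.cluster import SpectralCoclustering
import numpy as np

model = SpectralCoclustering(n_clusters=6, random_state=0)
model.fit(corr_whisky)

# model.rows_ is a boolean array of shape (n_clusters, n_rows):
# entry [i, j] is True if whisky j belongs to cluster i.
print(model.rows_)
# np.sum(model.rows_, axis=1)  -> number of whiskies in each cluster
# np.sum(model.rows_, axis=0)  -> number of clusters each whisky belongs to

# model.row_labels_ holds the cluster label of each whisky.
print(model.row_labels_)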

Output: SpectralCoclustering() clusters the rows and columns of the array. The code yields the boolean cluster-membership array model.rows_ and the per-whisky cluster labels model.row_labels_.
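
The attributes used above can be explored on synthetic data before running the model on the whisky correlations. The sketch below assumes a 30 × 30 matrix with three planted biclusters generated by scikit-learn's make_biclusters; the names data and toy_model are illustrative and do not come from the article.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Synthetic 30 x 30 matrix with 3 planted biclusters (not the whisky data).
data, _, _ = make_biclusters(
    shape=(30, 30), n_clusters=3, noise=1, shuffle=True, random_state=0
)

toy_model = SpectralCoclustering(n_clusters=3, random_state=0)
toy_model.fit(data)

# rows_ has one boolean row per cluster and one column per data row.
print(toy_model.rows_.shape)

# Cluster sizes: how many data rows fall in each cluster.
print(np.sum(toy_model.rows_, axis=1))

# Each data row belongs to exactly one cluster, so every entry here is 1.
print(np.sum(toy_model.rows_, axis=0))

# The same assignment as integer labels, one per data row.
print(toy_model.row_labels_)
```

Summing rows_ along axis=1 gives the cluster sizes, while summing along axis=0 confirms that each observation is assigned to a single cluster.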
 


 

Comparing Correlated Data


We sort the whiskies by their cluster group and then compare the rearranged correlation matrix with the original one, side by side. We code,
 

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Attach each whisky's cluster label, then sort the rows by cluster.
whisky["Group"] = pd.Series(model.row_labels_, index=whisky.index)
# .ix was removed from pandas; argsort returns positions, so use .iloc.
whisky = whisky.iloc[np.argsort(model.row_labels_)]
whisky = whisky.reset_index(drop=True)

# Recompute the whisky-by-whisky correlations in the sorted order.
correlations = whisky.iloc[:, 2:14].transpose().corr()
correlations = np.array(correlations)

plt.figure(figsize=(14, 7))
plt.subplot(121)
plt.pcolor(corr_whisky)
plt.title("Original")
plt.axis("tight")
plt.subplot(122)
plt.pcolor(correlations)
plt.title("Rearranged")
plt.axis("tight")
# Save before show(): after the window is closed the figure may be empty.
plt.savefig("correlations.pdf")
plt.show()

Output: The first plot shows the original correlation matrix and the second shows the sorted, rearranged one. The stark red diagonal in both figures represents a correlation coefficient of 1, since each whisky is perfectly correlated with itself.
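
Why sorting by cluster label makes the groups visible can be seen on a toy example. The labels and the 0/1 similarity matrix below are invented for illustration; they stand in for model.row_labels_ and the correlation matrix.

```python
import numpy as np

# Hypothetical cluster labels for six items, in shuffled order.
labels = np.array([2, 0, 1, 0, 2, 1])

# Toy similarity matrix: 1.0 where two items share a label, 0.0 otherwise.
similarity = (labels[:, None] == labels[None, :]).astype(float)

# argsort returns the positions that would put the labels in order.
order = np.argsort(labels, kind="stable")
print(labels[order])  # -> [0 0 1 1 2 2]

# Apply the same permutation to both rows and columns.
rearranged = similarity[np.ix_(order, order)]
print(rearranged)
```

After reordering, items from the same cluster sit next to each other, so the similar pairs collect into square blocks along the diagonal. The rearranged whisky plot shows the same effect, only less cleanly because real correlations are not exactly 0 or 1.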
 


Reference : 

  • Scikit-learn clustering


 



Amartya Ranjan Saikia
