What is Standardization in Machine Learning

Last Updated : 11 Jan, 2023

In machine learning we train a model on data so that it can predict or classify new examples without the answers being hard-coded. The first requirement is therefore a dataset, which must be pre-processed and manipulated to suit our desired outcome. Building any ML model follows this general procedure:

  • Collect the data
  • Perform data munging/cleaning
  • Pre-process the data (this is where feature scaling happens)
  • Apply visualizations

Feature Scaling is a method of bringing the features of a dataset onto a common, fixed scale. It is performed during data pre-processing and has two main forms: Standardization and Normalization.
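The difference between the two can be sketched in a few lines of plain Python (the values below are purely illustrative, not this article's dataset):

```python
# Feature scaling, two ways, on a tiny illustrative sample.
data = [10.0, 20.0, 30.0, 40.0, 50.0]

# Standardization: z = (x - mean) / std  -> result has mean 0, std 1
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standardized = [(x - mean) / std for x in data]

# Normalization (min-max): (x - min) / (max - min) -> result lies in [0, 1]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

print([round(z, 3) for z in standardized])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
print(normalized)                           # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Standardization recentres the data around 0 with unit spread, while min-max normalization squeezes it into [0, 1]; this article focuses on the former.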

Standardization

The steps to be followed are : 

Data collection

Our data can arrive in various formats, e.g. numbers (integers, floats) and words (strings); for now we'll consider only the numeric values in our dataset.

Assume our dataset contains numeric values in the range 1 to 95,000, in random order. For illustration, consider a small dataset of just 10 values in that range:

1) 99 2) 789 3) 1 4) 541 5) 5 6) 6859 7) 94142 8) 7 9) 50826 10) 35464

The spread of these values is enormous. Many algorithms (especially gradient-based and distance-based ones) struggle when features span such different magnitudes; with 10,000 such raw values, training would be slow and unstable. That's where the problem arises.

Understanding standardization

Standardization solves this problem by:

  • scaling the values down to a common scale centred on 0 (for roughly bell-shaped data, most standardized values fall within about -3 to +3);
  • keeping the relative spacing between the values intact.

So, how do we do that? There is a mathematical formula for it, the Z-score:

Z-Score = (Current_value - Mean) / Standard Deviation

Using this formula we replace every input value with its Z-score. The resulting values have mean 0 and standard deviation 1, and the relative spacing between them stays intact.

Standardization performs the following:

  • Converts the Mean (μ) to 0
  • Converts the S.D. (σ) to 1

Why does this happen? Subtracting the mean shifts the data so that it is centred on 0, and dividing by the standard deviation rescales its spread to exactly 1.

NOTE : (Just for Better Understanding)

For the Mean:

When we subtract a value smaller than the mean, we get a negative result.
When we subtract a value larger than the mean, we get a positive result.

By the definition of the mean, these negative and positive deviations cancel exactly: their sum is 0, so the mean of the shifted data is 0.

For the S.D.: the shifted data still has its original spread σ, so dividing every value by σ leaves a spread of exactly σ/σ = 1.
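As a quick sanity check (a self-contained sketch, using the same 10 values as this article), standardizing the data and re-measuring its statistics should give a mean of ~0 and a standard deviation of ~1:

```python
# Standardize the article's 10-value sample, then verify the result
# has mean ~0 and (population) standard deviation ~1.
data = [1, 99, 789, 5, 6859, 541, 94142, 7, 50826, 35464]

n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

z = [(x - mean) / std for x in data]

z_mean = sum(z) / n
z_std = (sum((v - z_mean) ** 2 for v in z) / n) ** 0.5
print(z_mean, z_std)  # ~0.0 and ~1.0 (up to floating-point error)
```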

Implementation of Code

Here we will do the following:

  • Calculate the Z-score for every value
  • Compare the original values with the standardized values
  • Compare the range of both using scatter plots
Python3
# Importing libraries
import matplotlib.pyplot as plt

# We are using just 10 values for our dataset.
# dataset_0 (range 1 - 10) serves as the x-axis and is left as-is;
# dataset_1 (range 1 - 95,000) is the data we will scale down.
dataset_0 = [10, 5, 6, 1, 3, 7, 9, 4, 8, 2]
dataset_1 = [1, 99, 789, 5, 6859, 541,
             94142, 7, 50826, 35464]

n = len(dataset_1)
mean_ans = 0
ans = 0
j = 0

Now, to calculate the summations of X and X², run the code below:

Python3
# Calculating the summations of X and X^2
for i in dataset_1:
    j = j + i
    ans = ans + i * i

print('n : ', n)
print("Summation (X)   : ", j)
print("Summation (X^2) : ", ans)

Output : 

n :  10
Summation (X)   :  188733
Summation (X^2) :  12751664695

Then calculate the mean:

Python3
# Calculating the mean
mean = j / n
print("Mean : ", mean)

Output : 

Mean :  18873.3

And from it the standard deviation, using σ = √(Σ(X²)/n − mean²):

Python3
# Calculating the standard deviation:
# variance = E[X^2] - mean^2, and sigma is its square root
variance = ans / n - mean * mean
standard_deviation = variance ** 0.5
print("Standard Deviation : ", round(standard_deviation, 2))

Output : 

Standard Deviation :  30314.44

Calculate the Z-score for each value of dataset_1:

Python3
# Calculating the Z-score for each value of dataset_1
final_z_score = []

for i in dataset_1:
    z_score = (i - mean) / standard_deviation
    final_z_score.append(z_score)

Comparison between the values of the original dataset and the scaled-down dataset:

Python3
# Comparing the original dataset with the scaled-down dataset
print("\nOriginal DataSet   |   Z-Score ")
print()
for i in range(len(dataset_1)):
    print("    ", dataset_1[i], "          |     ",
          "{:.5f}".format(float(final_z_score[i])))

Output :

Original DataSet   |   Z-Score 

     1           |      -0.62255
     99          |      -0.61932
     789         |      -0.59656
     5           |      -0.62242
     6859        |      -0.39632
     541         |      -0.60474
     94142       |      2.48293
     7           |      -0.62235
     50826       |      1.05404
     35464       |      0.54729
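For comparison, the whole loop collapses to a single vectorized expression with NumPy (assuming NumPy is available; `np.std` defaults to the population standard deviation, which is what the Z-score formula here uses):

```python
import numpy as np

dataset_1 = np.array([1, 99, 789, 5, 6859, 541,
                      94142, 7, 50826, 35464], dtype=float)

# Standardize the whole column in one expression
z = (dataset_1 - dataset_1.mean()) / dataset_1.std()

print(np.round(z, 5))
# The largest raw value (94142) lands ~2.48 standard deviations above the mean.
```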

Now let's compare the graphs of the original values and the standardized values.

Python3
# Graph of the original values
plt.scatter(dataset_0, dataset_1, label="stars",
            color="blue", marker="*", s=40)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('Original Values')
plt.legend()
plt.show()

# Graph of the standardized values
plt.scatter(dataset_0, final_z_score, label="stars",
            color="blue", marker="*", s=30)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('Standardized Values')
plt.legend()
plt.show()

Output : 

Graph of the Original Values

Graph of the Standardized Values
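In practice, this transform is usually done with scikit-learn's StandardScaler rather than by hand (a sketch assuming scikit-learn is installed; fit learns the column mean and standard deviation, and transform applies z = (x - mean) / σ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with the article's 10 values
X = np.array([1, 99, 789, 5, 6859, 541,
              94142, 7, 50826, 35464], dtype=float).reshape(-1, 1)

scaler = StandardScaler()
Z = scaler.fit_transform(X)

print(scaler.mean_[0])     # 18873.3 -- the mean computed earlier
print(Z.ravel().round(5))  # the standardized column
```

When the data is split into training and test sets, fit the scaler on the training set only and reuse it to transform the test set, so no test-set statistics leak into training.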

We have now reviewed the concept of standardization in machine learning, understood why it works, and implemented it from scratch.


brian_thoms