Validation Curve using Scikit-learn

Last Updated : 24 Jun, 2024

A validation curve shows how a model's training and cross-validation scores change as a single hyperparameter is varied, which makes it a practical tool for diagnosing underfitting and overfitting. This article explains what validation curves are, why they matter, and how to implement them using Scikit-learn in Python.

Table of Content

  • What is a Validation Curve?
  • Understanding Bias and Variance
  • Implementing Validation Curves with Scikit-learn
  • Interpreting the Validation Curve
  • Validation Curves with Machine Learning Models
    • Example 1: Validation Curve with Random Forest
    • Example 2: Validation Curve with Ridge Regression

What is a Validation Curve?

A validation curve is a graphical representation that shows the relationship between a model's performance and a specific hyperparameter. It helps in understanding how changes in hyperparameters affect the training and validation scores of a model. The curve typically plots the model performance metric (such as accuracy, F1-score, or mean squared error) on the y-axis and a range of hyperparameter values on the x-axis.

The table below summarizes how to read each combination of training and validation scores.

Training Score    Validation Score    Estimator is:
Low               Low                 Underfitting
High              Low                 Overfitting
Low               High                (Usually not possible)
High              High                Good fit
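The table can also be written as a small rule-of-thumb check. The sketch below is illustrative only: the function name `diagnose_fit` and the thresholds `low` and `max_gap` are assumptions made for this example, not part of Scikit-learn, and sensible cut-offs depend on the metric and the problem.

```python
def diagnose_fit(train_score, val_score, low=0.7, max_gap=0.1):
    """Classify a (training, validation) score pair using hypothetical thresholds."""
    # Both scores low: the model cannot even fit the training data
    if train_score < low and val_score < low:
        return "underfitting"
    # Training score far above validation score: the model memorizes noise
    if train_score - val_score > max_gap:
        return "overfitting"
    # Validation score far above training score: rare, worth investigating
    if val_score - train_score > max_gap:
        return "unusual: validation score well above training score"
    return "reasonable fit"


print(diagnose_fit(0.55, 0.50))  # underfitting
print(diagnose_fit(0.99, 0.70))  # overfitting
print(diagnose_fit(0.93, 0.90))  # reasonable fit
```

In practice the same comparison is made by eye on the plotted curve rather than with fixed thresholds.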

Importance of Validation Curves

Validation curves are crucial for several reasons:

  1. Hyperparameter Tuning: They help in selecting the optimal hyperparameter values that balance bias and variance.
  2. Diagnosing Overfitting and Underfitting: By analyzing the training and validation scores, one can identify whether the model is overfitting or underfitting.
  3. Model Improvement: They provide insights into how to improve the model by adjusting hyperparameters.

Understanding Bias and Variance

Before diving into the implementation, it's essential to understand the concepts of bias and variance:

  • Bias: Error due to overly simplistic models that do not capture the underlying patterns in the data (underfitting).
  • Variance: Error due to overly complex models that capture noise in the training data (overfitting).
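Both failure modes can be seen by varying a complexity hyperparameter. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` and a decision tree whose `max_depth` controls complexity; the dataset sizes and the three depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for this sketch, not a real dataset)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# A very shallow, a moderate, and a very deep tree
depths = [1, 3, 20]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds for each depth
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:2d}  train={tr:.3f}  validation={va:.3f}")
```

A shallow tree tends to score modestly on both sets (high bias), while a very deep tree typically scores near 1.0 on the training folds but noticeably lower on the validation folds (high variance).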

Implementing Validation Curves with Scikit-learn

Step 1: Import Required Libraries

First, we need to import the necessary libraries. We'll use NumPy for array handling, Scikit-learn for model building, and Matplotlib for plotting.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import validation_curve

Step 2: Load the Dataset

For this example, we'll use the digits dataset from Scikit-learn.

Python
# Load the digits dataset
dataset = load_digits()
X, y = dataset.data, dataset.target

Step 3: Define the Hyperparameter Range

We'll define the range of hyperparameter values we want to evaluate. In this case, we'll vary the number of neighbors (n_neighbors) for the K-Nearest Neighbors (KNN) classifier.

Python
# Define the range for the hyperparameter
param_range = np.arange(1, 10, 1)
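When building a hyperparameter range, it helps to check which values it actually contains. The short sketch below prints a linear range like the one above and, as an additional illustration not taken from this step, a logarithmic range of the kind often used for regularization strengths.

```python
import numpy as np

# np.arange excludes the stop value, so this yields 1, 2, ..., 9
linear_range = np.arange(1, 10, 1)

# np.logspace(start, stop, num) yields `num` values from 10**start to 10**stop,
# with both endpoints included
log_range = np.logspace(-3, 3, 7)

print(linear_range)
print(log_range)
```

Note that the third argument means different things in the two functions: a step size for `np.arange` and a number of points for `np.logspace`.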

Step 4: Calculate Training and Validation Scores

We'll use the validation_curve function from Scikit-learn to calculate the training and validation scores for each value of the hyperparameter.

Python
# Calculate accuracy on training and test set using the validation curve
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    X, y,
    param_name="n_neighbors",
    param_range=param_range,
    cv=5,
    scoring="accuracy"
)
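It can help to inspect what `validation_curve` returns before aggregating anything. The sketch below uses a deliberately small setup (two values of `n_neighbors` and 3 folds, both chosen only to keep the example quick) to show that each returned array has one row per hyperparameter value and one column per cross-validation fold.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Two hyperparameter values and three folds, for illustration only
k_values = [1, 5]
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    X, y,
    param_name="n_neighbors",
    param_range=k_values,
    cv=3,
    scoring="accuracy",
)

print(train_scores.shape)  # (2, 3): 2 parameter values x 3 folds
print(test_scores.shape)   # (2, 3)
```

This layout is why the fold-wise statistics are taken along `axis=1`: each row is collapsed into one number per hyperparameter value.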

Step 5: Calculate Mean and Standard Deviation

Next, we'll calculate the mean and standard deviation of the training and validation scores.

Python
# Calculate mean and standard deviation of training scores
mean_train_score = np.mean(train_scores, axis=1)
std_train_score = np.std(train_scores, axis=1)

# Calculate mean and standard deviation of validation scores
mean_test_score = np.mean(test_scores, axis=1)
std_test_score = np.std(test_scores, axis=1)

Step 6: Plot the Validation Curve

Finally, we'll plot the validation curve using Matplotlib.

Python
# Plot mean accuracy scores for training and testing scores
plt.plot(param_range, mean_train_score, label="Training Score", color='b')
plt.plot(param_range, mean_test_score, label="Cross Validation Score", color='g')

# Plot the accuracy bands
plt.fill_between(param_range, mean_train_score - std_train_score,
                 mean_train_score + std_train_score, alpha=0.2, color='blue')
plt.fill_between(param_range, mean_test_score - std_test_score,
                 mean_test_score + std_test_score, alpha=0.2, color='green')

# Create the plot
plt.title("Validation Curve with KNN")
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()

Output:

Figure: Validation Curve with KNN (x-axis: Number of Neighbors, y-axis: Accuracy)

Interpreting the Validation Curve

Interpreting the results of a validation curve can sometimes be tricky. Here are some key points to keep in mind:

  1. Underfitting: If both the training and validation scores are low, the model is likely underfitting. This means the model is too simple or is informed by too few features.
  2. Overfitting: If the training score is high and the validation score is low, the model is overfitting. This means the model is too complex and is capturing noise in the training data.
  3. Optimal Hyperparameter: The optimal value of the hyperparameter is usually the one with the highest validation score. A small gap between the training and validation scores at that value indicates that the model generalizes well.
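The points above can be turned into a small numeric summary. The following sketch is an illustration under stated assumptions: the helper name `summarize_curve` is hypothetical, and the score arrays are made-up numbers shaped like the output of `validation_curve` (one row per hyperparameter value, one column per fold).

```python
import numpy as np


def summarize_curve(param_range, train_scores, val_scores):
    """Return the best parameter, its mean validation score, and the train/validation gap."""
    train_mean = np.mean(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    best = int(np.argmax(val_mean))  # index of the highest mean validation score
    gap = train_mean[best] - val_mean[best]
    return param_range[best], val_mean[best], gap


# Hypothetical scores: 3 hyperparameter values x 2 folds
param_range = np.array([1, 3, 5])
train_scores = np.array([[1.00, 1.00], [0.96, 0.94], [0.90, 0.88]])
val_scores = np.array([[0.84, 0.86], [0.92, 0.90], [0.87, 0.85]])

best_param, best_val, gap = summarize_curve(param_range, train_scores, val_scores)
print(f"best parameter: {best_param}")
print(f"mean validation score: {best_val:.2f}")
print(f"train/validation gap: {gap:.2f}")
```

Here the first value shows a large gap (overfitting), while the middle value has the highest validation score and a small gap, so it is the one selected.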

Validation Curves with Machine Learning Models

Example 1: Validation Curve with Random Forest

Let's consider another example using the RandomForestClassifier and varying the number of trees (n_estimators).

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)

param_range = np.arange(1, 250, 2)

# Calculate accuracy on training and test set using the validation curve
train_scores, test_scores = validation_curve(
    RandomForestClassifier(),
    X, y,
    param_name="n_estimators",
    param_range=param_range,
    cv=4,
    scoring="accuracy",
    n_jobs=-1
)

# Calculate mean and standard deviation of training scores across folds
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation of test scores across folds
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot mean accuracy scores for training and test scores
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

# Plot the accuracy bands for training and test scores
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std,
                 alpha=0.2, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std,
                 alpha=0.2, color="gainsboro")

plt.title("Validation Curve with Random Forest")
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.tight_layout()
plt.show()

Output:

Figure: Validation Curve with Random Forest (x-axis: Number of Trees, y-axis: Accuracy Score)

Example 2: Validation Curve with Ridge Regression

This example varies the regularization strength (alpha) of a Ridge regression model on a synthetic regression dataset. Because alpha spans several orders of magnitude, the x-axis uses a logarithmic scale.

Python
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Define the parameter range for alpha
# np.logspace(-7, 3, 3) yields three values: 1e-7, 1e-2 and 1e3
param_range = np.logspace(-7, 3, 3)

# Compute the validation curve
# No scoring is given, so the estimator's default score (R^2 for Ridge) is used
train_scores, valid_scores = validation_curve(Ridge(), X, y, param_name="alpha",
                                              param_range=param_range, cv=5)

# Calculate mean and standard deviation of training and validation scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)

# Plotting the validation curve
plt.figure()
plt.title("Validation Curve with Ridge Regression")
plt.xlabel("Alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score",
             color="darkorange", lw=2)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=2)
plt.semilogx(param_range, valid_scores_mean, label="Cross-validation score",
             color="navy", lw=2)
plt.fill_between(param_range, valid_scores_mean - valid_scores_std,
                 valid_scores_mean + valid_scores_std, alpha=0.2,
                 color="navy", lw=2)
plt.legend(loc="best")
plt.show()

Output:

Figure: Validation Curve with Ridge Regression (x-axis: Alpha, log scale; y-axis: Score)
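The Ridge example relies on the estimator's default score, which is R² for regressors. To plot an error metric such as mean squared error instead, one option is Scikit-learn's `"neg_mean_squared_error"` scorer, which returns negated values so that higher is still better. The sketch below is a variation on the example above; the finer alpha grid is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Seven alpha values from 1e-3 to 1e3 (illustrative grid)
alphas = np.logspace(-3, 3, 7)

train_scores, valid_scores = validation_curve(
    Ridge(), X, y,
    param_name="alpha",
    param_range=alphas,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Flip the sign to obtain ordinary (non-negative) mean squared errors
train_mse = -train_scores.mean(axis=1)
valid_mse = -valid_scores.mean(axis=1)

# With an error metric, the best alpha minimizes the validation error
best_alpha = alphas[np.argmin(valid_mse)]
print(f"alpha with lowest validation MSE: {best_alpha:g}")
```

When reading such a curve, remember that lower is better, so underfitting shows up as high error on both the training and validation sets.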

Conclusion

Validation curves are powerful tools for diagnosing model performance and understanding the impact of hyperparameters. With Scikit-learn's validation_curve function and Matplotlib for plotting, you can create and interpret validation curves to choose hyperparameter values that avoid both underfitting and overfitting, which helps you build models that generalize well to unseen data.


pritamauddy25