Cross-Validation Using K-Fold With Scikit-Learn

Last Updated : 27 May, 2024

Cross-validation involves repeatedly splitting data into training and testing sets to evaluate the performance of a machine-learning model. One of the most commonly used cross-validation techniques is K-Fold Cross-Validation. In this article, we will explore the implementation of K-Fold Cross-Validation using Scikit-Learn, a popular Python machine-learning library.

Table of Content

  • What is K-Fold Cross Validation?
  • K-Fold With Scikit-Learn
  • Visualizing K-Fold Cross-Validation Behavior
  • Logistic Regression Model & K-Fold Cross Validating
  • Cross-Validating Different Regression Models Using K-Fold (California Housing Dataset)
  • Advantages & Disadvantages of K-Fold Cross Validation
  • Additional Information
  • Conclusions
  • Frequently Asked Questions (FAQs)

What is K-Fold Cross Validation?

In K-Fold cross-validation, the input data is divided into 'K' folds, hence the name K-Fold. The model is trained on K-1 folds and evaluated on the remaining fold. This procedure is repeated K times, with each fold used as the test set exactly once. The performance metrics are then averaged across the K iterations to give a more reliable estimate of the model's performance.

Example: Suppose we specify k = 10. K-Fold cross-validation then splits the input data into 10 folds, giving us 10 train/test combinations. In every iteration the model uses one fold as test data and the remaining 9 folds as training data. Each time a different fold is held out for evaluation, and the result is an array of evaluation scores, one per fold.
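
To make this concrete, here is a minimal sketch (the synthetic dataset and logistic regression model are only placeholders, and cross_val_score is covered in detail later in this article) showing that a 10-fold run produces an array of 10 scores whose mean we report:

Python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# toy dataset and model, used only to illustrate the idea
X_demo, y_demo = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# k = 10: ten scores, one per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=10))
print(len(scores))       # 10
print(np.mean(scores))   # averaged performance estimate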

K-Fold With Scikit-Learn

Let's look at how to implement K-Fold cross-validation using Scikit-Learn. To achieve this, we need to import the KFold class from sklearn.model_selection. Let's look at the KFold class from Scikit-Learn, its parameters, and its methods.

sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)

PARAMETERS:

  • n_splits (int, default=5): Number of folds. Must be at least 2.
  • shuffle (bool, default=False): Whether to shuffle the data before splitting it into batches. Note that the samples within each split will not be shuffled.
  • random_state (int, default=None): When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect (a short sketch illustrating shuffle and random_state follows this list).
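
As a small illustrative sketch (the eight-sample array is arbitrary), the snippet below contrasts the default unshuffled behaviour with shuffle=True and a fixed random_state:

Python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(8).reshape(-1, 1)  # 8 dummy samples, 1 feature

# shuffle=False (default): folds are consecutive blocks of indices
for _, test_index in KFold(n_splits=4).split(X_toy):
    print("test fold:", test_index)

# shuffle=True with a fixed random_state: indices are mixed, but reproducibly
for _, test_index in KFold(n_splits=4, shuffle=True, random_state=0).split(X_toy):
    print("test fold:", test_index)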

METHODS:

  • get_metadata_routing(): Get metadata routing of this object.
  • get_n_splits(X=None, y=None, groups=None): Returns the number of splitting iterations in the cross-validator. The X, y and groups arguments are ignored and exist only for API compatibility.
  • split(X, y=None, groups=None): Generates indices to split the data into training and test sets. Here X is an array of shape (n_samples, n_features), y is the target variable for supervised learning problems, and groups contains group labels for the samples (ignored by plain KFold).

Let's create a synthetic regression dataset to analyse how the K-Fold split works. The code is as follows:

Python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold

# synthetic regression dataset
X, y = datasets.make_regression(
    n_samples=10, n_features=1, n_informative=1,
    noise=0, random_state=0)

# KFold split
kf = KFold(n_splits=4)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}:")
    print(f"  Training dataset index: {train_index}")
    print(f"  Test dataset index: {test_index}")

Output:

Fold 0:
Training dataset index: [3 4 5 6 7 8 9]
Test dataset index: [0 1 2]
Fold 1:
Training dataset index: [0 1 2 6 7 8 9]
Test dataset index: [3 4 5]
Fold 2:
Training dataset index: [0 1 2 3 4 5 8 9]
Test dataset index: [6 7]
Fold 3:
Training dataset index: [0 1 2 3 4 5 6 7]
Test dataset index: [8 9]

In the above code, we created a synthetic regression dataset using the make_regression() method from sklearn. Here X is the input set and y is the target data (label). The KFold class divides the input data into four folds via the split() method, so there are four iterations in total. Notice that the train and test indices differ in every iteration, every sample appears in a test set exactly once, and all of the data is eventually used for training. Let's check the number of splits using the get_n_splits() method.

Python
kf.get_n_splits(X) 

Output:

4

Visualizing K-Fold Cross-Validation Behavior

We can create a classification dataset and visualize the behaviour of K-Fold cross-validation. The code is as follows:

Python
from sklearn.datasets import make_classification

# define dataset
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=15, n_redundant=5)

# prepare the K-Fold cross-validation procedure
n_splits = 10
cv = KFold(n_splits=n_splits)


Using the make_classification() method, we created a synthetic binary classification dataset of 100 samples with 20 features and prepared a K-Fold cross-validation procedure for the dataset with 10 folds.
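
To see how the data is divided between the training and test sets in each fold, the short sketch below (following the same pattern as the regression example earlier) prints the fold sizes; with 100 samples and 10 folds, every fold holds 90 training samples and 10 test samples.

Python
# print how each of the 10 folds partitions the 100 samples
for fold, (train_index, test_index) in enumerate(cv.split(X)):
    print(f"Fold {fold}: train size = {len(train_index)}, "
          f"test size = {len(test_index)}")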

Let's visualise K-Fold cross validation behavior in Sklearn. The code is as follows:

Python
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np


def plot_kfold(cv, X, y, ax, n_splits, xlim_max=100):
    """
    Plots the indices for a cross-validation object.

    Parameters:
    cv: Cross-validation object
    X: Feature set
    y: Target variable
    ax: Matplotlib axis object
    n_splits: Number of folds in the cross-validation
    xlim_max: Maximum limit for the x-axis
    """

    # Set color map for the plot
    cmap_cv = plt.cm.coolwarm
    cv_split = cv.split(X=X, y=y)

    for i_split, (train_idx, test_idx) in enumerate(cv_split):
        # Create an array of NaNs and fill in training/testing indices
        indices = np.full(len(X), np.nan)
        indices[test_idx], indices[train_idx] = 1, 0

        # Plot the training and testing indices
        ax_x = range(len(indices))
        ax_y = [i_split + 0.5] * len(indices)
        ax.scatter(ax_x, ax_y, c=indices, marker="_",
                   lw=10, cmap=cmap_cv, vmin=-0.2, vmax=1.2)

    # Set y-ticks and labels
    y_ticks = np.arange(n_splits) + 0.5
    ax.set(yticks=y_ticks, yticklabels=range(n_splits),
           xlabel="X index", ylabel="Fold",
           ylim=[n_splits, -0.2], xlim=[0, xlim_max])

    # Set plot title and create legend
    ax.set_title("KFold", fontsize=14)
    legend_patches = [Patch(color=cmap_cv(0.8), label="Testing set"),
                      Patch(color=cmap_cv(0.02), label="Training set")]
    ax.legend(handles=legend_patches, loc=(1.03, 0.8))


# Create figure and axis
fig, ax = plt.subplots(figsize=(6, 3))
plot_kfold(cv, X, y, ax, n_splits)
plt.tight_layout()
fig.subplots_adjust(right=0.6)

Output

[Figure: K-Fold visualization showing, for each of the 10 folds, which sample indices form the training set and which form the test set]

In the above code, we used matplotlib to visualize the fold indices of a K-Fold cross-validation object, generating one row of training/test markers for each CV split. We filled an index array with training or test labels using NumPy and plotted it with the scatter() method. The cmap parameter controls the colors of the training and test sets, and the lw parameter sets the thickness of each fold's row. Finally, the set() method formats the X and Y axes.

Logistic Regression Model & K-Fold Cross Validating

Now let's create a logistic regression model and cross-validate it using K-Fold. The code is as follows:

Python
from numpy import mean
from numpy import std

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# create model
log_reg = LogisticRegression()
# evaluate model
scores = cross_val_score(
    log_reg, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# accuracy
print('Accuracy: %.3f ,\nStandard Deviations :%.3f' %
      (mean(scores), std(scores)))

Output:

Accuracy: 0.810 ,
Standard Deviations :0.114

In the above code, we use the cross_val_score() method to evaluate the model with K-Fold cross-validation. We passed the logistic regression model and the evaluation procedure (the KFold object) as parameters, and accuracy as the evaluation metric (the scoring parameter).
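
Conceptually, cross_val_score() does roughly what the manual loop below sketches (the variable names here are ours): clone the model, fit it on the training folds, score it on the held-out fold, and collect the per-fold scores.

Python
import numpy as np
from sklearn.base import clone

fold_scores = []
for train_index, test_index in cv.split(X):
    fold_model = clone(log_reg)                      # fresh, unfitted copy for this fold
    fold_model.fit(X[train_index], y[train_index])   # train on the other 9 folds
    # accuracy on the held-out fold
    fold_scores.append(fold_model.score(X[test_index], y[test_index]))

print('Accuracy: %.3f' % np.mean(fold_scores))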

Cross-Validating Different Regression Models Using K-Fold (California Housing Dataset)

Now it's time to cross-validate different regression models using K-Fold, and we can analyze the performance of each model. Let's make use of the California Housing dataset from Sklearn. The code is as follows:

Python
from sklearn.datasets import fetch_california_housing

# fetch california housing data
housing = fetch_california_housing()
print("Dataset Shape:", housing.data.shape, housing.target.shape)
print("Dataset Features:", housing.feature_names)

Output

Dataset Shape: (20640, 8) (20640,)
Dataset Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']

Here we make use of the fetch_california_housing() method from the sklearn datasets module. The dataset consists of 20,640 samples with 8 features, plus the target (the median house value, expressed in units of $100,000).

Here, the dataset contains only numerical features, and there are no missing values. So we don't need to deal with text attributes or missing values; all we need to do is scale the features.

Let's scale the features and apply K-Fold to the dataset.

Python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
import numpy as np

X_housing = housing.data
y_housing = housing.target

# Scaling the data
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X_housing)

# K-Fold split
cnt = 0
n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X_scaler, y_housing):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1

Output

Fold:0, Train set: 18576, Test set:2064
Fold:1, Train set: 18576, Test set:2064
Fold:2, Train set: 18576, Test set:2064
Fold:3, Train set: 18576, Test set:2064
Fold:4, Train set: 18576, Test set:2064
Fold:5, Train set: 18576, Test set:2064
Fold:6, Train set: 18576, Test set:2064
Fold:7, Train set: 18576, Test set:2064
Fold:8, Train set: 18576, Test set:2064
Fold:9, Train set: 18576, Test set:2064

Here, we scaled the features with a StandardScaler from Sklearn, passing the feature matrix to its fit_transform() method. We then prepared the K-Fold validation procedure with 10 folds, shuffling the dataset by setting the shuffle parameter to True.
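
One caveat: fitting the scaler on the full dataset before splitting lets information from the test folds leak into the preprocessing. A common remedy, sketched below as an optional refinement rather than part of the original walkthrough, is to wrap the scaler and the model in a Pipeline so the scaling is re-fit inside each training fold.

Python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# the scaler is re-fit on each fold's training data only
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_housing, y_housing,
                         scoring="neg_mean_squared_error", cv=kf)
print("RMSE:", np.sqrt(-scores).mean())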

Let's visualise the split using matplotlib.

Python
fig, ax = plt.subplots(figsize=(6, 3))
plot_kfold(kf, X_scaler, y_housing, ax, n_splits, xlim_max=2000)
# Make the legend fit
plt.tight_layout()
fig.subplots_adjust(right=0.7)

Output

[Figure: K-Fold with shuffle=True — the training and test indices in each fold are drawn from different sections of the dataset]

We reuse the plot_kfold() function defined above to visualize the data split. Notice that in this plot the training and test indices are shuffled together. This is because we set the shuffle parameter of KFold to True, so each fold draws samples from different sections of the dataset.

Now let's create different regression models and apply K-fold cross validation. The code is as follows:

Python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_val_score


def cross_validation(reg_model, housing_prepared, housing_labels, cv):
    scores = cross_val_score(
        reg_model, housing_prepared,
        housing_labels,
        scoring="neg_mean_squared_error", cv=cv)
    rmse_scores = np.sqrt(-scores)
    print("Scores:", rmse_scores)
    print("Mean:", rmse_scores.mean())
    print("StandardDeviation:", rmse_scores.std())


print("----- Linear Regression Model Cross Validation ------")
lin_reg = LinearRegression()
cross_validation(lin_reg, X_scaler, y_housing, kf)
print("")
print("----- Decision Tree Regression Model Cross Validation ------")
tree_reg = DecisionTreeRegressor()
cross_validation(tree_reg, X_scaler, y_housing, kf)
print("")
print("----- Random Forest Regression Model Cross Validation ------")
forest_reg = RandomForestRegressor()
cross_validation(forest_reg, X_scaler, y_housing, kf)

Output

----- Linear Regression Model Cross Validation ------
Scores: [0.74766431 0.74372259 0.6936579 0.75776228 0.69926807 0.72690314
0.74241379 0.68908607 0.75124511 0.74163695]
Mean: 0.7293360220706322
StandardDeviation: 0.02440550831772841

----- Decision Tree Regression Model Cross Validation ------
Scores: [0.69024329 0.71299152 0.72902583 0.74687543 0.73311366 0.70912615
0.71031728 0.70438177 0.71907938 0.74508813]
Mean: 0.7200242426779767
StandardDeviation: 0.01731035436143824

----- Random Forest Regression Model Cross Validation ------
Scores: [0.50050277 0.49624521 0.47534694 0.522097 0.48679587 0.51611116
0.48861124 0.46187822 0.50740703 0.50927282]
Mean: 0.4964268280240172
StandardDeviation: 0.017721367101897926

In the above code, we created three different regression models (Linear, Decision Tree and Random Forest regression) and estimated each model's prediction error using cross-validation. The cross_val_score() method uses neg_mean_squared_error as the evaluation metric (scoring parameter) and K-Fold as the cross-validation procedure. Here, the data is randomly split into 10 distinct folds, so each model is trained and evaluated 10 times, evaluating on a different fold each time and training on the other 9.

Since the target is the median house value in units of $100,000, an RMSE of 0.72 corresponds to a typical prediction error of about $72,000. You can notice that the Decision Tree has a mean prediction error of roughly $72,002, whereas Linear Regression scores about $72,933. The Random Forest Regressor looks the most promising, with a prediction error of about $49,642.

Once you have identified a promising model, you can fine-tune it to improve performance further, for example with a grid search over its hyperparameters (sketched below).
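
For example, GridSearchCV can reuse the same K-Fold splitter to tune the promising Random Forest model. The sketch below is only illustrative and the parameter grid is arbitrary:

Python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# illustrative parameter grid; the values are arbitrary
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           scoring="neg_mean_squared_error",
                           cv=kf)   # reuse the same 10-fold splitter
grid_search.fit(X_scaler, y_housing)
print("Best parameters:", grid_search.best_params_)
print("Best RMSE:", np.sqrt(-grid_search.best_score_))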

Advantages & Disadvantages of K-Fold Cross Validation

Advantages of K-Fold Cross Validation

  • It gives a more reliable estimate of generalization, helping to detect underfitting and overfitting, because every sample is used for both training and validation (in different iterations).
  • Model performance analysis on each fold helps to understand the variation of input data and also provides more insights to fine-tune the model.
  • Combined with stratification (see Stratified K-Fold below), it can handle imbalanced data, and it is widely used for hyperparameter tuning.

Disadvantage of K-Fold Cross Validation

  • The approach can be computationally expensive, since the model is trained and evaluated K times.

Additional Information

Apart from plain K-Fold cross-validation, Scikit-Learn offers a few variations of the technique (a short usage sketch follows this list):

  • Repeated K-Fold: Runs K-Fold n times, producing different splits in each repetition.
  • Stratified K-Fold: A variation of K-Fold that returns stratified folds, preserving the class proportions in each fold.
  • Group K-Fold: A variation of K-Fold that ensures the same group is not represented in both the training and test sets.
  • StratifiedGroupKFold: A cross-validation scheme that combines StratifiedKFold and GroupKFold.
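
As a brief sketch (the fold counts here are arbitrary), all of these variants live in sklearn.model_selection and expose the same split() interface as KFold:

Python
from sklearn.model_selection import (GroupKFold, RepeatedKFold,
                                     StratifiedGroupKFold, StratifiedKFold)

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5 folds, repeated 3 times
skf = StratifiedKFold(n_splits=5)        # preserves class proportions in every fold
gkf = GroupKFold(n_splits=5)             # keeps each group entirely in train or test
sgkf = StratifiedGroupKFold(n_splits=5)  # stratified and group-aware

# Each exposes the same split() interface as KFold, e.g.:
# skf.split(X, y)  or  gkf.split(X, y, groups=groups)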

Conclusions

We have discussed the importance of the K-Fold cross-validation technique in machine learning and gone through how it can be implemented using Sklearn. We hope you now understand how the K-Fold methodology provides a more trustworthy estimate of model performance and helps guard against overfitting and underfitting. We also compared the performance of different regression models, which helped us choose the most promising one for prediction.

