How to interpret cross validation output from cv.kknn (kknn package)

Last Updated : 19 Jul, 2024

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The most common type is k-fold cross-validation, where the dataset is divided into k subsets (folds). The model is trained on k-1 of these folds and validated on the remaining one. This process is repeated k times, with each fold being used exactly once as the validation data. The results are then averaged to produce a single performance metric.
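The procedure described above can be sketched by hand in base R. This is only an illustration of the idea, not how kknn implements it: the helper predict_1nn is a hypothetical 1-nearest-neighbour classifier written for this example, and the choice of 5 folds is arbitrary.

```r
set.seed(1)                        # fold assignment is random
n_folds <- 5
n <- nrow(iris)

# Assign every row to one of n_folds folds of equal size (150 / 5 = 30 rows each)
fold_id <- sample(rep(seq_len(n_folds), length.out = n))

# Hypothetical helper: 1-nearest-neighbour prediction with Euclidean distance
predict_1nn <- function(train_x, train_y, test_x) {
  apply(test_x, 1, function(row) {
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))  # distance to every training row
    as.character(train_y[which.min(d)])           # label of the closest one
  })
}

x <- as.matrix(iris[, 1:4])
y <- iris$Species

fold_accuracy <- numeric(n_folds)
for (i in seq_len(n_folds)) {
  test_idx <- fold_id == i         # fold i is held out, the rest is training data
  pred <- predict_1nn(x[!test_idx, , drop = FALSE], y[!test_idx],
                      x[test_idx, , drop = FALSE])
  fold_accuracy[i] <- mean(pred == as.character(y[test_idx]))
}

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # averaged into a single performance metric
```

Each observation is used for validation exactly once, and the final number is the average over the folds.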

What is cv.kknn?

The cv.kknn function in the kknn package performs k-fold cross-validation for a k-nearest neighbor model. It splits the dataset into k folds, trains the model on k-1 folds, and predicts the remaining fold. This is repeated until every fold has served once as the test set, so each observation receives exactly one out-of-fold prediction. The function returns a list with two elements: a matrix of the true (y) and predicted (yhat) values for every observation, and a summary error measure. It does not report confusion matrices or per-fold accuracy; these have to be derived from the prediction matrix.

The function syntax is:

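cv.kknn(formula, data, kcv = 10, ...)
Where:
formula: A symbolic description of the model to be fit.
data: The dataset to be used.
kcv: The number of folds for cross-validation (default is 10).
...: Further arguments passed on to kknn, which fits the model in each fold. The most relevant ones are:
k: The number of neighbors considered (kknn default is 7). Note that this is not the number of folds.
distance: The parameter of the Minkowski distance (default is 2, which is Euclidean distance).
kernel: The kernel function for weighting neighbors (kknn default is "optimal"; "rectangular" gives standard unweighted k-NN).
ykernel: Window width of a y-kernel, mainly for ordinal responses (default is NULL).
scale: Logical, whether to scale the variables to equal standard deviation (default is TRUE).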
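As a minimal sketch of how these arguments fit together, the call below assumes the kknn package is installed and that, as in the kknn versions I am aware of, the fold count is given through kcv while k, distance, and kernel are forwarded to kknn. The particular values are illustrative only.

```r
library(kknn)

set.seed(42)  # fold assignment is random

# kcv sets the number of folds; k, distance and kernel describe the k-NN model
cv_10fold <- cv.kknn(Species ~ ., data = iris, kcv = 10,
                     k = 7, distance = 2, kernel = "rectangular")

length(cv_10fold)    # number of elements in the returned list
dim(cv_10fold[[1]])  # rows and columns of the y / yhat matrix
```

Keeping kcv and k apart matters because a call such as cv.kknn(..., k = 5) changes the number of neighbors, not the number of folds.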

Let's go through a practical example using the iris dataset to illustrate how to interpret the cross-validation output from cv.kknn in R Programming Language.

Step 1: Load Necessary Libraries and Data

First, load the kknn package and the iris dataset. If kknn is not installed yet, install it once with install.packages("kknn").

library(kknn)
data(iris)

Step 2: Perform Cross-Validation

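We will perform 5-fold cross-validation on the iris dataset using cv.kknn. The number of folds is set with the kcv argument; an argument named k would instead be passed on to kknn as the number of neighbors.

set.seed(123) # For reproducibility
cv_results <- cv.kknn(Species ~ ., data = iris, kcv = 5)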

Step 3: Understanding the Output

The output of cv.kknn is a list with two elements. Let's inspect them:

print(cv_results)

Output:

[[1]]
    y yhat
1   1    1
2   1    1
3   1    1
4   1    1
5   1    1
6   1    1
...

[[2]]
[1] 1

The first element, [[1]], is a matrix with one row per observation in the iris dataset, holding the actual and the predicted class. The classes appear as the integer codes of the Species factor levels: 1 = setosa, 2 = versicolor, 3 = virginica. Here’s how to interpret it:

Column y: The true class labels for the observations.
Column yhat: The class labels predicted by the k-NN model while the observation was in the held-out fold.
- Rows 1 to 50: true labels (y) and predicted labels (yhat) are all 1, indicating that all setosa samples were correctly predicted.
- Rows 51 to 100: true labels are all 2 (versicolor), but there are a few misclassifications where yhat is 3 (virginica).
- Rows 101 to 150: true labels are all 3 (virginica), but there are a few misclassifications where yhat is 2 (versicolor).
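The pattern described in the list above is easier to read as a confusion matrix. The sketch below assumes the kknn package is installed and that the y and yhat columns hold the integer codes 1 to 3 of the Species levels, as in the printed output. The names pred_table, actual, predicted, and confusion are chosen for this example.

```r
library(kknn)

set.seed(123)
cv_results <- cv.kknn(Species ~ ., data = iris, kcv = 5)

# First list element: matrix with the columns y (true) and yhat (predicted)
pred_table <- as.data.frame(cv_results[[1]])

# Map the integer codes back to the species names
species <- levels(iris$Species)
actual    <- factor(pred_table$y,    levels = 1:3, labels = species)
predicted <- factor(pred_table$yhat, levels = 1:3, labels = species)

# Rows are the true classes, columns the predicted classes
confusion <- table(Actual = actual, Predicted = predicted)
confusion
```

Counts on the diagonal are correct predictions. Off-diagonal counts show which classes are confused with each other, which for iris is typically versicolor and virginica.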

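The second element, [[2]], is a single number, here 1. It is tempting to read it as a perfect accuracy of 100%, but that contradicts the misclassifications visible in the prediction table. This element appears to be intended as a summary error measure rather than an accuracy, and its value does not look reliable for a factor response such as Species. The accuracy should therefore be computed directly from the y and yhat columns of the first element.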
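A minimal sketch of computing the accuracy directly from the prediction matrix, assuming the kknn package is installed and the matrix columns are named y and yhat as in the printed output:

```r
library(kknn)

set.seed(123)
cv_results <- cv.kknn(Species ~ ., data = iris, kcv = 5)

pred_matrix <- cv_results[[1]]

# Share of observations whose out-of-fold prediction matches the true class
accuracy <- mean(pred_matrix[, "y"] == pred_matrix[, "yhat"])
accuracy

# The misclassification rate is the complement
1 - accuracy
```

Because both columns hold the same integer coding of the classes, a plain element-wise comparison is enough here.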

Conclusion

Cross-validation is an essential step in the model evaluation process, providing insights into how well your model generalizes to unseen data. The cv.kknn function in the kknn package offers a convenient way to perform k-fold cross-validation for k-nearest neighbors models in R. By understanding and interpreting the output of cv.kknn, you can make informed decisions about your model's performance and potential improvements. This ensures a robust and reliable machine learning workflow.


nyadavxenc
