K-Fold Cross Validation Technique and its Essentials (2024)

Introduction

Welcome to this comprehensive guide on model evaluation and selection techniques in machine learning, particularly focusing on K-fold cross-validation and its application in time series analysis. Before delving into the specifics, let’s consider the importance of these techniques in monitoring model performance before deployment. Understanding the performance metrics such as mean squared error, which evaluates the deviation between predicted and observed values, is crucial in ensuring model accuracy. We will explore how K-fold cross-validation, especially in the context of time series data, helps in training and validating models using multiple train-test splits.


By employing K-fold cross-validation, and working with the train_index and test_index arrays that each split produces, we can detect overfitting and understand how the model generalizes to unseen data. Furthermore, we will touch on classification tasks, evaluating models on subsamples of the data and their ability to learn complex patterns. Join us on this journey to optimize your machine learning models and enhance their performance.

Learning Outcomes

  • Understand the concept of n_splits in cross-validation (for example, 5-fold cross-validation) and implement K-fold cross-validation with different values of n_splits.
  • Discuss how the choice of n_splits affects the model evaluation.
  • Explain the significance of random_state in machine learning models.
  • Discuss how setting random_state ensures reproducibility of results.
  • Implement random_state in scikit-learn for various classifiers and regression models.
  • Implement various machine learning algorithms using scikit-learn.
  • Understand the importance of stratified k-fold cross-validation in classification problems.
  • Discuss the advantages and limitations of train-test split compared to other validation techniques.
  • Implement various classifiers (e.g., SVM, Random Forest, Logistic Regression) using scikit-learn.
  • Discuss strategies for handling new data in machine learning models.

This article was published as a part of the Data Science Blogathon.

Table of contents

  • What is Accuracy of the Model and Performance?
  • ML Engineers and Business Team Agreement
  • What is K-Fold Cross Validation?
  • Life Cycle of K-Fold Cross-Validation
  • Thumb Rules Associated with K Fold
  • Basic Example
  • Model Selection using K-Fold
  • Parameter Tuning Using K-Fold
  • K-Fold in Visual form
  • Frequently Asked Questions

What is Model Performance and Why is it Necessary?

Assessing the performance of a machine learning model is much like assessing exam scores: in high school and college, our scores decided whether we met the eligibility criteria for the best courses, were selected in campus interviews, or cleared the cut-off in competitive exams. A GOOD score is taken as evidence that the candidate is good. The same is expected of a machine learning model: it should achieve the expected results on prediction, forecasting, and classification problems. In the ML world, that performance depends on the data, the model, and the code the model was trained with.

What is Accuracy of the Model and Performance?

The accuracy of a model in data science is a metric for how well it predicts outcomes. It measures the proportion of correct predictions made by a model built from the available data records. To achieve robust performance, models are trained across various combinations of the data, so that they generalize well to new data.

ML Engineers and Business Team Agreement

As we know, there are various methods to evaluate model performance. It is our team’s responsibility to construct a robust and generalized model that meets production expectations. Additionally, we need to effectively communicate its performance and the business benefits to stakeholders and customers, guided by SMEs, to achieve our goals.

As an ML engineering team, we must state the performance of the model as a numeric range, say 85-90%. Sometimes a model does not behave in production the way it did in training and testing; in many cases, overfitting or underfitting only shows up in the production environment.

Yes! Of course, this is really threatening to junior data scientists and ML engineers, but the challenge is a reason to improve your technical capabilities, right? Only after many iterations and CI/CD involvement (MLOps) will the model achieve the expected accuracy in a generalised way. One step further, we always have to monitor the performance and apply the necessary changes to the model algorithm and code.

We will see how to overcome this in a real-world scenario.

For the RANGE factor mentioned earlier, we have different techniques to evaluate, of which K-fold cross-validation (for example, 5-fold) is among the best and easiest to understand. It is simple in nature and is a typical resampling technique without replacement. It is also easy to understand and visualise while implementing.


Image designed by the author

What is K-Fold Cross Validation?

K-fold cross-validation is a powerful technique for evaluating predictive models in data science. It involves splitting the dataset into k subsets, or folds; each fold is used as the validation set in turn while the remaining k-1 folds are used for training. This process is repeated k times, and performance metrics such as accuracy, precision, and recall are computed for each fold. By averaging these metrics, we obtain an estimate of the model’s generalization performance. This method is essential for model assessment, selection, and hyperparameter tuning, offering a reliable measure of a model’s effectiveness. Compared to leave-one-out cross-validation, which uses k equal to the number of samples, K-fold cross-validation with a small k is computationally cheaper and widely used in practice.

Each fold is used for testing exactly once during the entire process. This helps us detect overfitting: a model trained and scored on all of the data in a single shot can report an accuracy that is too optimistic. K-fold cross-validation guards against this and helps us build a generalized model.

In practice, the data is viewed as three sets: Training, Validation, and Testing, which is a challenge when the volume of data is limited.

Here the training and validation folds support model building and hyperparameter assessment, while the test set is held back for a final check.

The model is validated multiple times, based on the value assigned to the parameter K, which must be an INTEGER.

Put simply, the data set is divided according to the K value, and training and testing are carried out in sequence K times.
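The splitting described above can be sketched from scratch with NumPy. This is only an illustration: the helper name k_fold_indices and the 10-sample size are assumptions made for the example, not part of any library.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds.

    Hypothetical helper written for this illustration only.
    """
    if not isinstance(k, int) or k < 2 or k > n_samples:
        raise ValueError("k must be an integer with 2 <= k <= n_samples")
    indices = np.arange(n_samples)
    # array_split also copes with n_samples that is not divisible by k
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Iteration {i}: train={train_idx} test={test_idx}")
```

Every index lands in the test portion exactly once and in the training portion K-1 times, which is the property the rest of the article relies on.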

Life Cycle of K-Fold Cross-Validation


Image designed by the author

Let’s take a concrete K value. If K=5, the given dataset is split into 5 folds and the train and test steps are run 5 times. During each run, one fold is held out for testing and the rest are used for training; the pictorial representation below gives an idea of how the folds rotate across iterations.

Image designed by the author

Each data point is used once in the hold-out set and K-1 times in training. So, over the full set of iterations, every fold is used exactly once for testing and otherwise for training.

In the above setup, the five iterations use 5 test folds and 20 training folds in total (4 per iteration). Each iteration yields an accuracy score; we sum the scores and take the mean. The spread of the scores also shows how consistently the model behaves across the data, which helps us decide whether to go to production with this model or not.
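The per-iteration scoring and averaging can be sketched as follows. The synthetic dataset from make_classification and the LogisticRegression model are assumptions chosen only to keep the example small; any estimator with fit and score would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumption: a small synthetic dataset stands in for real data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                      # train on 4 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on 1 fold

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", fold_scores.mean())
print("Std across folds:", fold_scores.std())
```

The mean is the headline number, while the standard deviation indicates how consistent the model is from fold to fold.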


Thumb Rules Associated with K Fold

Now, we will discuss a few rules of thumb for working with K-fold.

  • K should always be >= 2 and at most equal to the number of records (K = number of records is LOOCV)
    • If K=2, there are just 2 iterations
    • If K=number of records in the dataset, then 1 record is used for testing and n-1 for training
  • A commonly used value for K is 10, with data of good size.
  • If the K value is too large, the training sets overlap heavily, so the fitted models differ little across iterations, while each test fold is small and its individual score is noisier.
  • The number of folds is inversely related to the size of the data set: if the dataset is too small, the number of folds can increase.
  • Larger values of K eventually increase the running time of the cross-validation process.
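The two boundary cases in the first rule can be checked with scikit-learn's KFold and LeaveOneOut. The six-record array is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # six illustrative records

# K = 2: two iterations, each half used once for testing
two_fold = list(KFold(n_splits=2).split(X))
print("K=2 iterations:", len(two_fold))

# K = n: one record for testing, n-1 for training (LOOCV)
kfold_n = list(KFold(n_splits=len(X)).split(X))
loo = list(LeaveOneOut().split(X))
print("K=n iterations:", len(kfold_n))
print("test size:", len(kfold_n[0][1]), "train size:", len(kfold_n[0][0]))

# Compare the two splitters pair by pair
same = all(
    np.array_equal(a_tr, b_tr) and np.array_equal(a_te, b_te)
    for (a_tr, a_te), (b_tr, b_te) in zip(kfold_n, loo)
)
print("KFold(n_splits=n) matches LeaveOneOut:", same)
```

With K equal to the number of records, the model is refit once per record, which is why the running time grows with K.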

Keep K-fold cross-validation in mind for the following purposes in the ML workflow.

  • Model selection
  • Parameter tuning
  • Feature selection

So far, we have discussed K-fold and how it is implemented; let’s do some hands-on now.

Basic Example

I am creating a simple array, setting K to 5, and splitting the array. A simple loop prints the train and test portions. Here we can see clearly that the data points in the train/test buckets change from cycle to cycle and that the test data is unique in each cycle.

Python Code:
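A minimal sketch of this example, assuming an illustrative 10-element array and scikit-learn's KFold. The array values, shuffle=True, and random_state=1 are assumptions made for the sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: an illustrative 10-element array
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# K = 5; shuffle and random_state are assumptions that make the split
# random but reproducible
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

test_parts = []
for train_index, test_index in kfold.split(data):
    print("Train:", data[train_index], "Test:", data[test_index])
    test_parts.append(data[test_index])
```

Changing random_state gives a different but equally valid partition; leaving shuffle at its default of False produces consecutive folds instead.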

You can see the Train and Test array and how the array got split in every iteration.

Let’s do this with the dataset.

Model Selection using K-Fold

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)

We have imported the required libraries and loaded the digits dataset (hand-written digits, open source); now let’s apply different algorithms.

Logistic Regression

I am using liblinear, a library for “Large Linear Classification”. It uses a coordinate-descent algorithm, which minimizes a multivariate function by solving a sequence of univariate optimization problems in a loop.

# Note: the multi_class argument is deprecated in scikit-learn 1.5 and later
lr = LogisticRegression(solver='liblinear', multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

Output

Score :0.972222

SVC

gamma is the kernel coefficient that controls how non-linear the separating boundary is. A larger gamma fits the training data set more tightly; gamma='auto' uses 1/n_features.

svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

Output

Score :0.62037

Random Forest

For the Random Forest classifier, I am setting n_estimators to 40.

rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

Output

Score:0.96666

From the scores above, Logistic Regression and Random Forest are doing comparatively better than SVM.

Now we will use the cross_val_score function to get the scores, passing the different algorithms together with the dataset and cv.

from sklearn.model_selection import cross_val_score

Set LogisticRegression, CV =3

score_lr = cross_val_score(LogisticRegression(solver='liblinear', multi_class='ovr'), digits.data, digits.target, cv=3)
print(score_lr)
print("Avg :", np.average(score_lr))

Output: for 3 fold we have 3 scores

[0.89482471 0.95325543 0.90984975]
Avg : 0.9193099610461881

Set SVM and CV=3

score_svm = cross_val_score(SVC(gamma='auto'), digits.data, digits.target, cv=3)
print(score_svm)
print("Avg :", np.average(score_svm))

Output: Scores

[0.38063439 0.41068447 0.51252087]
Avg : 0.4346132442960489

Set Random Forest and CV=3

score_rf = cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=3)
print(score_rf)
print("Avg :", np.average(score_rf))

Output:Scores

[0.92821369 0.95325543 0.92320534]
Avg : 0.9348914858096827

Algorithm | Before K-Fold applied | After K-Fold applied (Avg)
Logistic Regression | 97% | 91%
SVM | 62% | 43%
Random Forest | 96% | 93%

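Based on the above table, we will go with Random Forest for this dataset for production. Note that every cross-validated average is lower than the corresponding single-split score, which shows how optimistic one favourable train-test split can be. We still have to monitor the model performance for data drift, and as the business case changes, we have to revisit the model and redeploy.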
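The same comparison can be run in one loop with an explicit, reproducible splitter. The sketch below assumes StratifiedKFold with shuffling and a fixed random_state, which the runs above did not use, so the numbers will differ from the table. Logistic regression is wrapped in OneVsRestClassifier here as a substitute for the multi_class='ovr' argument.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

digits = load_digits()

# Assumption: shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

models = {
    "Logistic Regression": OneVsRestClassifier(LogisticRegression(solver="liblinear")),
    "SVM": SVC(gamma="auto"),
    "Random Forest": RandomForestClassifier(n_estimators=40, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, digits.data, digits.target, cv=cv)
    results[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")

best_model = max(results, key=results.get)
print("Best by mean CV accuracy:", best_model)
```

Passing the same cv object to every model means all candidates are scored on identical folds, and fixing random_state makes the comparison repeatable from run to run.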

Parameter Tuning Using K-Fold

Let us consider the RandomForestClassifier for this analysis, with n_estimators as the parameter to tune and CV as 10 (commonly used).

scores1 = cross_val_score(RandomForestClassifier(n_estimators=5), digits.data, digits.target, cv=10)
print("Avg Score for Estimators=5 and CV=10 :", np.average(scores1))

Output

Avg Score for Estimators=5 and CV=10 : 0.87369

scores2 = cross_val_score(RandomForestClassifier(n_estimators=20), digits.data, digits.target, cv=10)
print("Avg Score for Estimators=20 and CV=10 :", np.average(scores2))

Output

Avg Score for Estimators=20 and CV=10 : 0.93377

scores3 = cross_val_score(RandomForestClassifier(n_estimators=30), digits.data, digits.target, cv=10)
print("Avg Score for Estimators=30 and CV=10 :", np.average(scores3))

Output

Avg Score for Estimators=30 and CV=10 : 0.94879

scores4 = cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=10)
print("Avg Score for Estimators=40 and CV=10 :", np.average(scores4))

Output

Avg Score for Estimators=40 and CV=10 : 0.94824

Run | n_estimators | Avg score (CV=10)
scores1 | 5 | 87.36%
scores2 | 20 | 93.33%
scores3 | 30 | 94.87%
scores4 | 40 | 94.82%

Based on the above observation, we will go with Estimators=30.
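The four manual runs can also be expressed as a single grid search. This is a sketch rather than the procedure used above: GridSearchCV is a substituted technique, and random_state=0 is an assumption added so that repeated runs agree.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()

# Same candidate values as the manual runs above
param_grid = {"n_estimators": [5, 20, 30, 40]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # assumption: fixed seed
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(digits.data, digits.target)

for n, mean_score in zip(
    search.cv_results_["param_n_estimators"],
    search.cv_results_["mean_test_score"],
):
    print(f"n_estimators={n}: mean accuracy={mean_score:.4f}")

print("Best setting:", search.best_params_)
```

When two settings score within a fraction of a percent of each other, as 30 and 40 do above, the gap is usually smaller than the fold-to-fold variation, so the cheaper setting is a reasonable choice.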

K-Fold in Visual form

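A visual representation is often the clearest evidence of how a score changes along an axis.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: X and y are not defined in the original listing; the Iris dataset is used here for illustration
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores.mean())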

Output

0.9666666666666668

# K from 1 to 30, matching the 30 scores listed in the output
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)

Output

[0.96, 0.95333, 0.96666, 0.96666, 0.966668, 0.96666, 0.966666, 0.966666, 0.97333, 0.96666, 0.96666, 0.97333, 0.9800, 0.97333, 0.97333, 0.97333, 0.97333, 0.98000, 0.9733333, 0.980000, 0.966666, 0.96666, 0.973333, 0.96, 0.96666, 0.96, 0.96666, 0.953333, 0.95333, 0.95333]

import matplotlib.pyplot as plt
# Jupyter magic; omit outside a notebook
%matplotlib inline
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated-Accuracy')

Output: With a simple plot, X=> value of K and Y=> Accuracy for respective CV


The above visual representation helps us to understand that the accuracy peaks at about 98% for KNN; in the output listed above, this happens at K=13, 18, and 20.
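The peak can also be located programmatically from the scores listed in the output above. The sketch assumes the list starts at K = 1, so list position i corresponds to K = i + 1.

```python
import numpy as np

# Scores copied from the output listed above (assumed to cover K = 1 .. 30)
k_scores = [0.96, 0.95333, 0.96666, 0.96666, 0.966668, 0.96666, 0.966666,
            0.966666, 0.97333, 0.96666, 0.96666, 0.97333, 0.9800, 0.97333,
            0.97333, 0.97333, 0.97333, 0.98000, 0.9733333, 0.980000,
            0.966666, 0.96666, 0.973333, 0.96, 0.96666, 0.96, 0.96666,
            0.953333, 0.95333, 0.95333]

k_values = np.arange(1, len(k_scores) + 1)
scores = np.array(k_scores)

best_score = scores.max()
# isclose guards against tiny floating-point differences between equal scores
best_k = k_values[np.isclose(scores, best_score)].tolist()

print("Best accuracy:", best_score)
print("K values reaching it:", best_k)
```

When several K values tie, a common tie-break for KNN is to prefer the larger K, since it gives a smoother, less complex decision boundary.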

Conclusion

Employing K-fold cross-validation enables a comprehensive evaluation of model performance by partitioning the entire dataset into K roughly equal-sized subsets. Its stratified variant preserves class proportions in each fold, which helps with imbalanced data, and the approach gives reliable estimates for many kinds of models, deep learning models included. By selecting hyperparameters based on these results, we can optimize model performance and enhance its generalization ability.

Key Takeaways

  • The test dataset is crucial for evaluating the performance of a trained model on unseen data, ensuring it generalizes well beyond the training set.
  • After training a model on the training data, it’s essential to evaluate its performance on both the validation and test datasets to ensure it meets performance expectations.
  • Validation data helps in tuning model hyperparameters and assessing the model’s performance before finalizing it for deployment.
  • The KFold class from the sklearn.model_selection module is instrumental in splitting the data into K folds for cross-validation, ensuring robust model evaluation and helping to detect overfitting.

Frequently Asked Questions

Q1. What is the k-fold cross-validation method?

A. K-fold cross-validation splits data into k equal parts; each part serves as a test set while the others form the training set, rotating until every part has been tested.

Q2. Why is k-fold cross-validation useful?

A. It is useful because it maximizes the use of limited data, reduces variance in performance estimates, and provides a more reliable model evaluation.

Q3. What does K mean in k-fold cross-validation?

A. K represents the number of splits or folds into which the data is divided, determining how many times the model is trained and tested.

Q4. What is the difference between K-fold and V-fold cross-validation?

A. K-fold and V-fold cross-validation are essentially the same; both involve dividing the data into k or v folds. The terms are often used interchangeably.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.


Shanthababu Pandian, 13 Jun, 2024


FAQs

What considerations must be made when applying k-fold cross-validation?

The data is usually viewed as three sets: Training, Validation, and Testing, which is a challenge when the volume of data is limited. The training and validation folds support model building and hyperparameter assessment, while the test set is held back for a final check.

What is the k-fold cross-validation method?

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Rather than relying on a single split into a training and a test set, the data is split into k folds and every fold takes a turn as the test set.

What is the goal of k-fold cross-validation?

Given the training data set, k-fold cross-validation is done to estimate beforehand how well the model would perform. Given the randomization, a dramatic change from one iteration of the cross-validation loop to the next is unlikely.

What is 5-fold cross-validation?

Five-fold cross-validation is the case k = 5: all data is randomly split into five folds, the model is trained on four of them, and the remaining fold is left to test the model. This procedure is repeated five times.

What is the downside of k-fold cross-validation?

K-fold cross-validation is valuable for assessing model performance, but it has limitations. The main drawbacks include increased computational cost and time due to multiple model trainings.

When should you use k-fold cross-validation?

Cross-validation is usually used when there is not enough data to apply other, cheaper methods such as the 3-way split (train, validation, and test) or a separate holdout dataset.

How do I choose the best k-fold cross-validation?

Usually, k is 5 or 10, but you can choose any number that is at least 2 and no larger than the dataset's length. The train-and-test steps are repeated k times, each time using a different fold as the test set. In the end, the model has been validated on every fold.

How does k-fold cross-validation prevent overfitting?

With k-fold cross-validation, we evaluate the model several times on distinct subsets of the data, resulting in a more trustworthy estimate of performance and aiding in the detection of overfitting or model instability. Without cross-validation, we only assess the model's performance on one split of the data.

Does k-fold cross-validation increase accuracy?

Not directly. K-fold cross-validation is a reliable way to evaluate performance and to spot overfitting or underfitting. By reducing the variance of the estimates, it supports better model and hyperparameter choices, which in turn can improve accuracy.

What is the result of k-fold cross-validation?

In K-fold cross-validation, the data set is divided into K folds and used to assess the model's ability on new data. K represents the number of groups into which the data sample is divided. For example, if K is 5, it is called 5-fold cross-validation.

Why use k-fold cross-validation instead of leave one out?

K-fold cross-validation strikes a balance between bias and variance by partitioning data into k subsets, whereas leave-one-out cross-validation provides low bias but can be computationally expensive for large datasets.

What is true for k-fold cross-validation?

K-fold cross-validation provides a more robust and reliable performance estimate because it reduces the impact of data variability. By using multiple training and testing cycles, it minimizes the risk of overfitting to a particular data split.

What are the values for k-fold cross-validation?

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds into which a given dataset is split. Common values are k=3, k=5, and k=10, and by far the most popular value used in applied machine learning to evaluate models is k=10.

How many folds for k-fold cross-validation?

A common choice is 10 folds (k=10). To make results robust to this choice, one can try different numbers of folds, assess the differences in performance, and average the results of the different settings.

What is the difference between K-fold and V-fold cross-validation?

V-fold cross-validation (also known as k-fold cross-validation) randomly splits the data into V groups of roughly equal size (called "folds").

Does a higher value of K always give higher confidence in the result?

No. It is not true that higher values of K always result in higher confidence in the cross-validation result than lower values of K. In K-fold cross-validation, the data is first divided into K subsets, and the validation process is then carried out K times.

Does k-fold cross-validation cause overfitting?

K-fold cross-validation can help avoid overfitting or underfitting by providing a more reliable estimate of the model's performance on unseen data.

What is the choice of K in k-fold cross-validation based on?

In most cases, k is 5 or 10, but there is no formal rule. The value of k depends on the size of the dataset, and larger values of k increase the runtime and computational cost of cross-validation.
