SVM with Univariate Feature Selection in Scikit Learn

Last Updated : 24 Apr, 2025

The Support Vector Machine (SVM) is a powerful machine learning algorithm used for classification and regression. It is based on the idea of finding the boundary between classes that maximizes the margin between them. However, SVMs can be computationally expensive and are sensitive to the choice of features: irrelevant or redundant features make the model slower to train, more complex, and harder to interpret.

Univariate feature selection is a method for selecting the most important features in a dataset. The idea is to evaluate each individual feature's relationship with the target variable in isolation and keep the features that score best under a defined criterion, such as the strongest correlation or the highest statistical significance.

In univariate feature selection, the focus is on individual features and their contribution to the target variable, rather than considering the relationships between features. This method is simple and straightforward, but it does not take into account any interactions or dependencies between features.

Univariate feature selection is useful when working with a large number of features, where the goal is to reduce the dimensionality of the data and simplify the modeling process. It works best when the relationship between each individual feature and the target is simple enough to be captured by a basic statistical test.

Syntax of SelectKBest():

Select features according to the k highest scores.

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

score_func : The scoring function. Common choices are f_classif, f_regression, chi2, mutual_info_classif, and mutual_info_regression. The default is f_classif, which is meant for classification data. The function takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores.

k : An integer giving the number of features to keep, or "all" to keep every feature. The default value is 10.
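
As a quick, hedged illustration of these two parameters (a minimal sketch on the iris data, which the main example below also uses), passing k='all' keeps every feature so that the per-feature scores and p-values can be inspected before deciding how many features to keep:

Python3
# Fit SelectKBest with k='all' just to inspect per-feature statistics.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
selector = SelectKBest(score_func=f_classif, k='all').fit(iris.data, iris.target)

for name, score, p in zip(iris.feature_names, selector.scores_, selector.pvalues_):
    print(f'{name}: F = {score:.2f}, p = {p:.2e}')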

ANOVA stands for Analysis of Variance and is a statistical technique used to determine the relationship between a dependent variable (label) and one or more independent variables (features). It measures the variability between different groups of data and helps to identify which independent variable has a significant impact on the dependent variable.

In machine learning, ANOVA can serve as a univariate feature selection criterion: an F-test between each feature and the label identifies the features in a dataset that have the greatest impact on the target variable.

Univariate statistical tests analyze one variable at a time: they determine whether a single feature varies significantly across the values of the target and quantify the strength of that relationship. The test at the heart of ANOVA is the F-test, described next.

The F-score, also known as the F-statistic, is a ratio of two variances used in ANOVA. It is calculated as the ratio of the variance between the groups to the variance within the groups. The F-score is used to test the hypothesis that the means of the groups are equal.

Formula: The F-score is calculated as

F = MSB / MSW

where:
MSB = Mean Square Between (variance between groups)
MSW = Mean Square Within (variance within groups)

The F-score is used to test the null hypothesis, which states that the means of the groups are equal. If the calculated F-score is larger than the critical value from the F-distribution, the null hypothesis is rejected, and it is concluded that there is a significant difference between the means of the groups.
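
To make the formula concrete, the hedged sketch below (toy numbers chosen purely for illustration) computes F = MSB / MSW by hand and checks the result against scipy.stats.f_oneway, the same ANOVA F-test that f_classif applies to each feature:

Python3
# Hand-computed ANOVA F-score vs. scipy, on three toy groups.
import numpy as np
from scipy.stats import f_oneway

groups = [np.array([4.9, 5.0, 5.2, 4.8]),
          np.array([5.9, 6.1, 6.0, 5.8]),
          np.array([6.9, 7.0, 7.2, 6.8])]

grand_mean = np.concatenate(groups).mean()
k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total observations

# Mean Square Between: spread of the group means around the grand mean
msb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
# Mean Square Within: pooled spread inside each group
msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

f_manual = msb / msw
f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy)  # the two F values agree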

Here is how these tests appear in Scikit-learn as choices for score_func:

f_classif : Used for classification problems. It calculates the ANOVA (analysis of variance) F-value between each feature and the target variable, and the features with the highest F-values are selected as the top k features. This helps to identify the most important features for making accurate predictions; we use it in Example 1 below as SelectKBest(f_classif, k=2).

f_regression : Used for regression problems. It calculates the F-value between each feature and a continuous target variable, and again the features with the highest F-values are selected as the top k features. We use it in Example 2 below as SelectKBest(f_regression, k=3).

chi2 : Used to test whether there is a significant association between two categorical variables, by comparing the observed frequency of occurrences with the frequency expected under independence. Note that chi2 requires non-negative feature values, such as counts or frequencies.
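
For instance, here is a minimal hedged sketch of using chi2 with SelectKBest (the count-like feature matrix below is invented for illustration; the features must be non-negative):

Python3
# Minimal sketch with invented non-negative "count" features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[1, 20, 3],
              [2, 21, 0],
              [9,  2, 1],
              [8,  3, 0]])
y = np.array([0, 0, 1, 1])

sel = SelectKBest(chi2, k=2).fit(X, y)
print(sel.scores_)        # chi-squared statistic for each feature
print(sel.get_support())  # boolean mask of the selected columns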

Example 1:

In this example, we will use the iris dataset from the scikit-learn library and apply univariate feature selection to the data before training an SVM. The iris dataset contains 150 samples of iris flowers, with four features: sepal length, sepal width, petal length, and petal width. The goal is to use SVM to classify the iris flowers into three different species based on these features.

Step 1: Load the iris dataset and split the data into training and test sets:

Python3
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

Step 2: Univariate Feature Selection

We will use the SelectKBest class from the sklearn.feature_selection module to perform univariate feature selection.

In this case, SelectKBest(f_classif, k=2), the scoring function is f_classif: as described above, it computes the ANOVA F-value between each feature and the target, and the k features with the highest F-values are kept.

We will set the k parameter to 2, which means that we will keep the two best features from the dataset.

Python3
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())

Output:

Number of input features: 4
Input features Names  : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Input features scores : [ 84.80836804  41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']

The selector kept petal length and petal width. We now call selector.transform to reduce the training and test features to just these two columns.

Python3
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

Step 3: Train the Support Vector Machine classifier.

Now that we have selected the best two features, we will train an SVM classifier using these features:

Python3
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)

Step 4: Evaluate the performance of the SVM classifier

Finally, we will evaluate the performance of the SVM classifier by calculating its accuracy on the test set:

Python3
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:

Accuracy: 1.0

This means that the SVM classifier was able to classify 100% of the test samples correctly, using only two features. By reducing the number of features in the model, we have made it simpler and more interpretable, while still achieving good performance.
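
As a sanity check, we can compare against an SVM trained on all four features. The short sketch below reuses the variables defined in the steps above; it is an illustrative addition, not part of the original walkthrough:

Python3
# Illustrative check: train on all four features and compare with
# the two-feature model fitted above.
clf_all = SVC(kernel='linear', C=1, random_state=42)
clf_all.fit(X_train, y_train)

print("Accuracy, all 4 features :", accuracy_score(y_test, clf_all.predict(X_test)))
print("Accuracy, top 2 features :", accuracy_score(y_test, y_pred))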

Full code:

Python3
# Import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis=1)
y = df['target']

# Split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Select the best features
selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Train the classifier
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)

# Prediction
y_pred = clf.predict(X_test_selected)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)

Output:

Number of input features: 4
Input features Names  : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
 'petal width (cm)']
Input features scores : [ 84.80836804  41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']

Accuracy: 1.0

Example 2: 

In this example, we again use the SelectKBest class from the sklearn.feature_selection module, this time on the diabetes dataset with f_regression as the scoring function, which is the appropriate choice for a continuous target: it computes the F-value between each feature and the target, and the features with the highest F-values are selected. The fit method fits the selector to the data, the scores_ attribute holds each feature's score, and get_feature_names_out() returns the names of the features with the greatest impact on the target variable.

The value of k determines how many features are selected. Here k=3, so the top 3 features are kept based on their F-values (in Example 1 we used k=2). We then train a support vector regressor (SVR) on the selected features and evaluate it with the mean squared error.

Python3
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
data = load_diabetes(as_frame=True)
df = data.frame

X = df.drop(['target'], axis=1)
y = df['target']

# Split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Create the feature selector
selector = SelectKBest(f_regression, k=3)

# Fit the selector to the data
selector.fit(X_train, y_train)

print('Number of input features:', selector.n_features_in_)
print('Input features Names  :', selector.feature_names_in_)
# Get the scores for each feature
print('Input features scores :', selector.scores_)
# Get the p-values for each feature
print('Input features pvalues:', selector.pvalues_)
# Print the names of the best features
print('Output features Names :', selector.get_feature_names_out())

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Train the regressor
reg = SVR(kernel='rbf')
reg.fit(X_train_selected, y_train)

# Prediction
y_pred = reg.predict(X_test_selected)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
print("\nMean Squared Error :", mse)

Output:

Number of input features: 10
Input features Names  : ['age' 'sex' 'bmi' 'bp' 's1' 's2' 's3' 's4' 's5' 's6']
Input features scores : [1.40986700e+01 1.77755064e-02 2.02386965e+02 8.65580384e+01
 1.45561098e+01 8.63143031e+00 6.07087750e+01 7.74171182e+01
 1.53967806e+02 6.31023038e+01]
Input features pvalues: [2.02982942e-04 8.94012908e-01 1.39673719e-36 1.49839640e-18
 1.60730187e-04 3.52250747e-03 7.56195523e-14 6.36582277e-17
 1.45463546e-29 2.69104622e-14]
Output features Names : ['bmi' 'bp' 's5']

Mean Squared Error : 3668.63356096246

In conclusion, univariate feature selection is a simple and effective way to reduce the dimensionality of the data before training an SVM, making the model faster to train and easier to interpret while, as the examples above show, often preserving predictive performance.
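
A practical closing note: in real projects it is safer to wrap the selector and the SVM in a Pipeline, so that feature selection is re-fit inside each cross-validation fold and k can be tuned like any other hyperparameter. The sketch below shows one common way to do this; the parameter grid and k values are illustrative, not prescriptive:

Python3
# A hedged sketch: a Pipeline re-fits SelectKBest inside every CV fold,
# avoiding selection bias, and GridSearchCV tunes k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('select', SelectKBest(f_classif)),
    ('svm', SVC(kernel='linear', C=1)),
])

# Try every possible number of kept features for this 4-feature dataset.
grid = GridSearchCV(pipe, param_grid={'select__k': [1, 2, 3, 4]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)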

