
Pipelines - Python and scikit-learn

Last Updated : 13 Jul, 2021

The workflow of a machine learning project covers all the steps required to build it. A typical ML project consists of four main parts:
 

  • Gathering data: 
    How data is gathered depends on the project. It can be real-time data or data collected from sources such as files, databases and surveys.
  • Data pre-processing: 
    Collected data usually contains missing values, extremely large values, unorganized text or noise, so it cannot be fed to the model directly and needs some pre-processing first.
  • Training and testing the model: 
    Once the data is pre-processed, it is ready to be fed to a machine learning model. Before that, it helps to have an idea of which model is likely to perform well. The data set is divided into three basic sections: the training set, the validation set and the test set. The aim is to train the model on the training set, tune its parameters using the validation set, and then measure its performance on the test set.
  • Evaluation: 
    Evaluation is part of the model development process. It helps to find the model that best represents the data and to estimate how well the chosen model will work in the future. It is done after models have been trained with different algorithms, and its outcome is used to choose the final model.
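
The three-way split described above can be sketched with two calls to train_test_split. This is only an illustration: the 60/20/20 proportions, the random_state values and the stratified sampling are assumptions made for the example, not choices from the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% of the samples as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then split the remainder 75/25, giving 60% train and 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

The model would be fitted on X_train, tuned against X_val, and scored once on X_test at the very end.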


ML Workflow in Python 
The workflow executes in a pipe-like manner, i.e. the output of one step becomes the input of the next. Scikit-learn, a powerful machine learning library, provides a class for handling such pipes: Pipeline, in the sklearn.pipeline module.  
It takes two important parameters: 
 

  • steps: 
    List of (name, transform) tuples that are chained in the order given. Every object except the last must implement fit and transform; the last object is an estimator.
  • verbose: 
    If True, the time elapsed while fitting each step is printed as the step completes.


Code: 
 

python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# import some data within sklearn for iris classification
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Splitting data into train and testing part
# 25 % of the data is held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# importing Pipeline for making the pipe flow
from sklearn.pipeline import Pipeline

# pipe flow is :
# PCA(Dimension reduction to two) -> Scaling the data -> DecisionTreeClassification
pipe = Pipeline([('pca', PCA(n_components = 2)),
                 ('std', StandardScaler()),
                 ('decision_tree', DecisionTreeClassifier())],
                verbose = True)

# fitting the data in the pipe
pipe.fit(X_train, y_train)

# scoring data
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pipe.predict(X_test)))

Output (the accuracy varies between runs because the train/test split is random): 
 

[Pipeline] ............... (step 1 of 3) Processing pca, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing std, total=   0.0s
[Pipeline] ..... (step 3 of 3) Processing decision_tree, total=   0.0s
0.9736842105263158
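
To make the "output of one step becomes the input of the next" idea concrete, the sketch below fits the same three steps by hand and compares the result with a Pipeline. It is an illustration under assumptions: it fits on the full iris data rather than a split, and fixes random_state=0 on the tree so that both variants are comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Chaining by hand: each step is fitted on the output of the previous one.
pca = PCA(n_components=2)
scaler = StandardScaler()
tree = DecisionTreeClassifier(random_state=0)

X_pca = pca.fit_transform(X)
X_std = scaler.fit_transform(X_pca)
tree.fit(X_std, y)
manual_pred = tree.predict(scaler.transform(pca.transform(X)))

# The same chain expressed as a Pipeline.
pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("std", StandardScaler()),
    ("decision_tree", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both approaches should yield identical predictions.
print(np.array_equal(manual_pred, pipe_pred))
```

The Pipeline version is shorter and also keeps the transform calls at prediction time in the right order, which is easy to get wrong when chaining by hand.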


Important property: 
 

  • pipe.named_steps: a dictionary-like object that maps each step name to the corresponding object in the pipe. For example:
pipe.named_steps['decision_tree'] # returns the decision tree classifier object  
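
As a small sketch of what named_steps is useful for, the example below reads attributes learned by individual steps after fitting. The pipeline mirrors the one above; fitting on the full iris data and random_state=0 are assumptions made to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("std", StandardScaler()),
    ("decision_tree", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X, y)

# Look up fitted steps by name and read their learned attributes.
fitted_pca = pipe.named_steps["pca"]
print(fitted_pca.explained_variance_ratio_)  # variance share of each component

fitted_tree = pipe.named_steps["decision_tree"]
print(fitted_tree.get_depth())  # depth of the fitted tree

# Indexing the pipeline by step name returns the same object.
print(pipe["pca"] is fitted_pca)
```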


Hyperparameters: 
Each object passed to the pipeline has its own set of hyperparameters. To view them, use the pipe.get_params() method. It returns a dictionary that maps parameter names to their current values; the parameters of an individual step appear under keys of the form <step name>__<parameter name>. 
Example: 
 

python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# import some data within sklearn for iris classification
iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

from sklearn.pipeline import Pipeline
pipe = Pipeline([('pca', PCA(n_components = 2)),
                 ('std', StandardScaler()),
                 ('Decision_tree', DecisionTreeClassifier())],
                verbose = True)

pipe.fit(X_train, y_train)

# to see all the hyper parameters
pipe.get_params()

Output (recorded on an older scikit-learn release; the parameters listed depend on the installed version): 
 

{'memory': None,
 'steps': [('pca',
   PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
       svd_solver='auto', tol=0.0, whiten=False)),
  ('std', StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('Decision_tree',
   DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                          max_depth=None, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, presort='deprecated',
                          random_state=None, splitter='best'))],
 'verbose': True,
 'pca': PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False),
 'std': StandardScaler(copy=True, with_mean=True, with_std=True),
 'Decision_tree': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best'),
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': 2,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'std__copy': True,
 'std__with_mean': True,
 'std__with_std': True,
 'Decision_tree__ccp_alpha': 0.0,
 'Decision_tree__class_weight': None,
 'Decision_tree__criterion': 'gini',
 'Decision_tree__max_depth': None,
 'Decision_tree__max_features': None,
 'Decision_tree__max_leaf_nodes': None,
 'Decision_tree__min_impurity_decrease': 0.0,
 'Decision_tree__min_impurity_split': None,
 'Decision_tree__min_samples_leaf': 1,
 'Decision_tree__min_samples_split': 2,
 'Decision_tree__min_weight_fraction_leaf': 0.0,
 'Decision_tree__presort': 'deprecated',
 'Decision_tree__random_state': None,
 'Decision_tree__splitter': 'best'}
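
The double-underscore keys in this dictionary are also the names used to change hyperparameters. The sketch below, which is an illustration rather than part of the original example, sets one value with set_params and then searches a small grid with GridSearchCV; the grid values, cv=5 and random_state=0 are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("std", StandardScaler()),
    ("Decision_tree", DecisionTreeClassifier(random_state=0)),
])

# Change a nested hyperparameter through its "<step>__<parameter>" key.
pipe.set_params(Decision_tree__max_depth=3)
print(pipe.get_params()["Decision_tree__max_depth"])  # 3

# The same keys define a search grid; the values below are arbitrary.
param_grid = {
    "pca__n_components": [2, 3],
    "Decision_tree__max_depth": [2, 3, 4],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the whole pipeline is passed to the search, the PCA and scaling steps are refitted inside each cross-validation fold, so the held-out fold does not influence the pre-processing.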


 


author
piyush25pv