Skip to content
geeksforgeeks
  • Tutorials
    • Python
    • Java
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
    • Practice Coding Problems
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Python | Create Test DataSets using Sklearn
Next article icon

Python | Create Test DataSets using Sklearn

Last Updated : 21 Apr, 2023
Comments
Improve
Suggest changes
Like Article
Like
Report

Python's Sklearn library provides a great sample dataset generator which will help you to create your own custom dataset. It's fast and very easy to use. Following are the types of samples it provides.
For all the above methods you need to import sklearn.datasets.samples_generator. 
 

Python3
# importing libraries from sklearn.datasets import make_blobs  # matplotlib for plotting from matplotlib import pyplot as plt  from matplotlib import style 

sklearn.datasets.make_blobs

Python3
# Creating Test DataSets using sklearn.datasets.make_blobs from sklearn.datasets import make_blobs from matplotlib import pyplot as plt  from matplotlib import style  style.use("fivethirtyeight")  X, y = make_blobs(n_samples = 100, centers = 3,                 cluster_std = 1, n_features = 2)  plt.scatter(X[:, 0], X[:, 1], s = 40, color = 'g') plt.xlabel("X") plt.ylabel("Y")  plt.show() plt.clf() 

Output: 
 

make_blobs with 3 centers


 

sklearn.datasets.make_moon


 

Python3
# Creating Test DataSets using sklearn.datasets.make_moon from sklearn.datasets import make_moons from matplotlib import pyplot as plt  from matplotlib import style  X, y = make_moons(n_samples = 1000, noise = 0.1) plt.scatter(X[:, 0], X[:, 1], s = 40, color ='g') plt.xlabel("X") plt.ylabel("Y")  plt.show() plt.clf() 

Output: 
 

make_moons with 1000 data points


 

sklearn.datasets.make_circle


 

Python3
# Creating Test DataSets using sklearn.datasets.make_circles from sklearn.datasets import make_circles from matplotlib import pyplot as plt  from matplotlib import style  style.use("fivethirtyeight")  X, y = make_circles(n_samples = 100, noise = 0.02) plt.scatter(X[:, 0], X[:, 1], s = 40, color ='g') plt.xlabel("X") plt.ylabel("Y")  plt.show() plt.clf() 

Output: 
 

make _circle with 100 data points

Scikit-learn (sklearn) is a popular machine learning library for Python that provides a wide range of functionalities, including data generation. In order to create test datasets using Sklearn, you can use the following code:

Advantages of creating test datasets using Sklearn:

  1. Time-saving: Sklearn provides a quick and easy way to generate test datasets for machine learning tasks, which saves time compared to manually creating datasets.
  2. Consistency: The datasets generated by Sklearn are consistent and reproducible, which helps ensure consistency in your experiments and results.
  3. Flexibility: Sklearn provides a wide range of functions for generating datasets, including functions for classification, regression, clustering, and more, which makes it a flexible tool for generating test datasets for different types of machine learning tasks.
  4. Control over dataset parameters: Sklearn allows you to customize the generation of datasets by specifying parameters such as the number of samples, the number of features, and the level of noise, which gives you greater control over the test datasets you create.

Disadvantages of creating test datasets using Sklearn:

  1. Limited dataset complexity: The datasets generated by Sklearn are typically simple and may not reflect the complexity of real-world datasets. Therefore, it may not be suitable for testing the performance of machine learning algorithms on complex datasets.
  2. Lack of diversity: Sklearn datasets may not reflect the diversity of real-world datasets, which may limit the generalizability of your machine learning models.
  3. Overfitting risk: If you generate test datasets that are too similar to your training datasets, there is a risk of overfitting your machine learning models, which can result in poor performance on new and unseen data.
  4. Overall, Sklearn provides a useful tool for generating test datasets quickly and efficiently, but it's important to keep in mind the limitations and potential drawbacks of using synthetic datasets for machine learning testing. It's recommended to use real-world datasets whenever possible to ensure the most accurate representation of the problem you are trying to solve.
     

Next Article
Python | Create Test DataSets using Sklearn

P

Praveen Sinha
Improve
Article Tags :
  • Machine Learning
Practice Tags :
  • Machine Learning

Similar Reads

    Python Sklearn – sklearn.datasets.load_breast_cancer() Function
    In this article, we are going to see how to convert sklearn dataset to a pandas dataframe in Python. Sklearn is a python library that is used widely for data science and machine learning operations. Sklearn library provides a vast list of tools and functions to train machine learning models. The lib
    2 min read
    Linear Regression using Boston Housing Dataset - ML
    Boston Housing Data: This dataset was taken from the StatLib library and is maintained by Carnegie Mellon University. This dataset concerns the housing prices in the housing city of Boston. The dataset provided has 506 instances with 13 features.The Description of the dataset is taken from the below
    3 min read
    How to split a Dataset into Train and Test Sets using Python
    One of the most important steps in preparing data for training a ML model is splitting the dataset into training and testing sets. This simply means dividing the data into two parts: one to train the machine learning model (training set), and another to evaluate how well it performs on unseen data (
    3 min read
    How To Do Train Test Split Using Sklearn In Python
    In this article, let's learn how to do a train test split using Sklearn in Python. Train Test Split Using Sklearn The train_test_split() method is used to split our data into train and test sets.  First, we need to divide our data into features (X) and labels (y). The dataframe gets divided into X_t
    5 min read
    Python | Linear Regression using sklearn
    Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and forecasting. Different regression models
    3 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences