
Generate Test Datasets for Machine learning

Last Updated : 11 Apr, 2023

Whenever we think of machine learning, the first thing that comes to mind is a dataset. Many datasets are available on websites such as Kaggle, but sometimes it is useful to generate your own. A synthetic dataset gives you control over its size, dimensionality, noise, and class structure, so you can test how a machine-learning model behaves under known conditions. In this article, we will generate random datasets using the sklearn.datasets module in Python.

Generate test datasets for Classification:

Binary Classification

Example 1: make_circles() generates 2D binary classification data as a large circle containing a smaller one, so the classes are separated by a circular decision boundary.

Python3
# Import necessary libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_circles(n_samples=200, shuffle=True,
                    noise=0.1, random_state=42)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

Figure: scatter plot of the make_circles() dataset
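To see why the boundary is circular, it can help to look at the noise-free data. The following is a minimal sketch rather than part of the original example: it sets noise=0.0 and uses the factor parameter of make_circles() (the ratio of the inner radius to the outer radius; 0.5 is an arbitrary choice here), then measures each point's distance from the origin.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=0.0 places every point exactly on one of two concentric circles;
# factor=0.5 is an illustrative choice for the inner/outer radius ratio
X, y = make_circles(n_samples=200, noise=0.0, factor=0.5, random_state=42)

# Distance of each point from the origin
radii = np.hypot(X[:, 0], X[:, 1])

outer_radii = radii[y == 0]  # label 0: points on the outer circle
inner_radii = radii[y == 1]  # label 1: points on the inner circle

print("outer radius range:", outer_radii.min(), outer_radii.max())
print("inner radius range:", inner_radii.min(), inner_radii.max())
```

Any classifier that can only draw a straight line will struggle on this data, which is what makes it a useful test case for non-linear models.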

Example 2: make_moons() generates 2D binary classification data in the shape of two interleaving half circles.

Python3
# Import the necessary libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_moons(n_samples=500, shuffle=True,
                  noise=0.15, random_state=42)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

Figure: scatter plot of the make_moons() dataset
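A generated dataset is typically split into training and test portions before a model is evaluated on it. The sketch below is one possible way to do that with the moons data; the k-nearest-neighbors classifier, the 25% test size, and the value n_neighbors=5 are illustrative assumptions, not choices made by the article.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same generator settings as the example above
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)

# Hold out 25% of the samples for testing, keeping the class ratio (stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# A non-linear classifier (assumed here for illustration)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Raising the noise argument makes the two moons overlap more, which should lower the score and gives a simple way to stress-test a model.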

Multi-Class Classification

Example 1: make_blobs() generates isotropic Gaussian blobs of points, which are useful for testing clustering as well as multi-class classification algorithms.

Python3
# Import the necessary libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2, random_state=23)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

Figure: scatter plot of the make_blobs() dataset
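Because make_blobs() returns the true cluster label of every point, it can be used to check whether a clustering algorithm recovers the groups. The following sketch assumes explicitly chosen, well-separated centers and a small cluster_std (both illustrative values), runs KMeans, and compares its labels with the true ones using the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Explicit, well-separated centers (illustrative values)
centers = [[-5, -5], [0, 0], [5, 5]]
X, y = make_blobs(n_samples=300, centers=centers,
                  cluster_std=0.5, random_state=23)

# Cluster the points without looking at the true labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=23)
labels = kmeans.fit_predict(X)

# The adjusted Rand index is 1.0 when both labelings agree
# up to a renaming of the clusters
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

Increasing cluster_std or moving the centers closer together makes the blobs overlap, so the score should drop.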

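Example 2: make_classification() generates a random n-class classification problem. Its arguments must be consistent with each other: n_informative + n_redundant + n_repeated cannot exceed n_features, and n_classes * n_clusters_per_class cannot exceed 2 ** n_informative. When shuffle=False, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].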

Python3
# Import the necessary libraries
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_classification(n_samples=100,
                           n_features=2,
                           n_redundant=0,
                           n_informative=2,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=1)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Output:

Figure: scatter plot of the make_classification() dataset
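The relationship between the feature-count arguments is easier to see on a slightly larger dataset. The sketch below uses parameter values picked only for illustration: with shuffle=False the columns are laid out as informative, then redundant, then repeated features, so the repeated column should be an exact copy of one of the earlier ones. It also shows the ValueError raised when n_classes * n_clusters_per_class is larger than 2 ** n_informative.

```python
import numpy as np
from sklearn.datasets import make_classification

# 2 informative + 1 redundant + 1 repeated + 1 noise feature = 5 features
X, y = make_classification(n_samples=100, n_features=5,
                           n_informative=2, n_redundant=1, n_repeated=1,
                           n_classes=3, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

# Without shuffling, column 3 is the repeated feature: a copy of one of
# the informative/redundant columns 0, 1 or 2
repeated_matches = [bool(np.array_equal(X[:, 3], X[:, j])) for j in range(3)]
print("column 3 duplicates column 0/1/2:", repeated_matches)

# 3 classes * 1 cluster = 3, but 2 ** 1 informative feature = 2
try:
    make_classification(n_samples=100, n_features=2,
                        n_informative=1, n_redundant=0, n_repeated=0,
                        n_classes=3, n_clusters_per_class=1,
                        random_state=0)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```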

Example 3: make_multilabel_classification() generates a random multi-label classification dataset, in which each sample can carry several labels at once.

Python3
# Import necessary libraries
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import matplotlib.pyplot as plt

# Generate a multi-label dataset with 2 features and 2 labels
X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# Create a pandas DataFrame from the generated dataset
df = pd.concat([pd.DataFrame(X, columns=['X1', 'X2']),
                pd.DataFrame(y, columns=['Label1', 'Label2'])],
               axis=1)

# print() also works outside a notebook, where display() is not defined
print(df.head())

# Plot the two features, colored by the first label
plt.scatter(df['X1'], df['X2'], c=df['Label1'])
plt.show()

Output:

     X1    X2  Label1  Label2
0  14.0  34.0       0       1
1  30.0  22.0       1       1
2  29.0  19.0       1       1
3  21.0  19.0       1       1
4  16.0  32.0       0       1

Figure: scatter plot of the make_multilabel_classification() dataset
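The label output of a multi-label generator differs from the single-column targets used so far. As a small sketch (the variable names Y and labels_per_sample are chosen here for illustration), the code below inspects the label indicator matrix and counts how many labels each sample carries.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# Y is a binary indicator matrix: one row per sample, one column per label
print("Y shape:", Y.shape)

# Number of labels attached to each sample (0, 1 or 2 here)
labels_per_sample = Y.sum(axis=1)
label_counts = np.bincount(labels_per_sample, minlength=3)
print("samples with 0, 1, 2 labels:", label_counts)

# The features are non-negative whole-number counts stored as floats
print("features are whole numbers:", np.array_equal(X, np.round(X)))
```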

Generate test datasets for Regression:

Example 1: Generate a 1-dimensional feature and target for linear regression using make_regression().

Python3
# Import necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate 1d Regression dataset
X, y = make_regression(n_samples=50, n_features=1, noise=20, random_state=23)

# Plot the generated datasets
plt.scatter(X, y)
plt.show()

Output:

Figure: scatter plot of the make_regression() dataset
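One advantage of generated regression data is that the true model is known. The sketch below is an illustration under assumed settings: it passes coef=True so that make_regression() also returns the underlying coefficient, sets noise=0.0 so the relationship is exact, and checks that LinearRegression recovers that coefficient.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the coefficient of the underlying linear model
X, y, true_coef = make_regression(n_samples=50, n_features=1,
                                  noise=0.0, coef=True, random_state=23)

model = LinearRegression().fit(X, y)

print("true coefficient:  ", float(true_coef))
print("fitted coefficient:", float(model.coef_[0]))
print("fitted intercept:  ", float(model.intercept_))
```

With a non-zero noise value, as in the example above, the fitted coefficient would only approximate the true one.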

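Example 2: Generate a regression dataset with multiple features using make_sparse_uncorrelated(). Only the first four features influence the target; any further features are uninformative.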

Python3
# Import necessary libraries
from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt

# Generate a regression dataset with 4 features
X, y = make_sparse_uncorrelated(n_samples=100, n_features=4, random_state=23)

# Plot the target against each feature
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()

Output:

Figure: scatter plots of the target against each make_sparse_uncorrelated() feature
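In make_sparse_uncorrelated(), the target is drawn around the linear combination X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3] with unit-variance noise. The following sketch tries to confirm this empirically; the sample size of 5000 and the 10 features are assumptions chosen so that the estimates are stable and the extra features are visible.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import LinearRegression

# 10 features, of which only the first 4 contribute to y
X, y = make_sparse_uncorrelated(n_samples=5000, n_features=10,
                                random_state=23)

model = LinearRegression().fit(X, y)

# Coefficients used by the generator for the 4 informative features,
# followed by zeros for the 6 uninformative ones
expected_coef = np.array([1.0, 2.0, -2.0, -1.5] + [0.0] * 6)

print("fitted:  ", np.round(model.coef_, 2))
print("expected:", expected_coef)
```

A dataset like this is convenient for testing feature selection, since the set of relevant features is known in advance.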

Example 3: Generate a regression dataset with four features using make_friedman2(). The target is a nonlinear function of the features.

Python3
# Import necessary libraries
from sklearn.datasets import make_friedman2
import matplotlib.pyplot as plt

# Generate a regression dataset with 4 features
X, y = make_friedman2(n_samples=100, random_state=23)

# Plot the target against each feature
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()

Output:

Figure: scatter plots of the target against each make_friedman2() feature
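The nonlinear relationship behind make_friedman2() can be written out directly. As a sketch, assuming noise=0.0 so that the target is deterministic, the code below recomputes y from the four features with the formula given in the scikit-learn documentation and compares it with the generated values.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# noise=0.0 removes the random term so the formula can be checked exactly
X, y = make_friedman2(n_samples=100, noise=0.0, random_state=23)

# Unpack the four feature columns
x0, x1, x2, x3 = X.T

# y = sqrt(x0^2 + (x1 * x2 - 1 / (x1 * x3))^2)
y_formula = np.sqrt(x0 ** 2 + (x1 * x2 - 1.0 / (x1 * x3)) ** 2)

print("formula matches generated target:", np.allclose(y, y_formula))
print("x3 range:", x3.min(), x3.max())
```

Since a linear model cannot represent this target exactly, the dataset is a reasonable benchmark for nonlinear regressors.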


 



Adith Bharadwaj