Generate Test Datasets for Machine Learning
Last Updated : 11 Apr, 2023
Most machine learning work starts with a dataset. Many ready-made datasets are available on websites such as Kaggle, but it is sometimes more useful to generate your own. A generated dataset gives you control over its size, number of features, noise level and class structure, which makes it convenient for training and testing a machine learning model. In this article, we will generate random datasets using the sklearn.datasets module in Python.
Generate test datasets for Classification:
Binary Classification
Example 1: The 2D binary classification data generated by make_circles() consists of a large circle containing a smaller one, so the two classes are separated by a circular decision boundary.
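To make the circular structure concrete, here is a minimal sketch that generates noise-free circles and measures each point's distance from the origin. It assumes scikit-learn's documented behavior that label 0 is the outer circle (radius 1) and label 1 is the inner circle (radius equal to factor); the value factor=0.5 is only an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise=None keeps every point exactly on its circle.
# factor is the inner-to-outer radius ratio (0.5 is an illustrative value).
X, y = make_circles(n_samples=200, noise=None, factor=0.5, random_state=42)

# Distance of every point from the origin
radius = np.linalg.norm(X, axis=1)

outer_radius = radius[y == 0]  # label 0: outer circle
inner_radius = radius[y == 1]  # label 1: inner circle

print("outer radius range:", outer_radius.min(), outer_radius.max())
print("inner radius range:", inner_radius.min(), inner_radius.max())
```

Because the classes differ only in radius, no straight line separates them, which is why this dataset is often used to test non-linear classifiers.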
Python3
# Import necessary libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_circles(n_samples=200, shuffle=True,
                    noise=0.1, random_state=42)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_circles()
Example 2: The 2D binary classification data produced by make_moons() forms two interlocking half circles.
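The "interlocking half circles" can be checked numerically. The sketch below assumes scikit-learn's construction of the noise-free moons: label 0 is the upper half of a unit circle centered at (0, 0), and label 1 is the lower half of a unit circle centered at (1, 0.5). The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons

# noise=None keeps every point exactly on its arc
X, y = make_moons(n_samples=500, noise=None, random_state=42)

upper = X[y == 0]  # upper half circle
lower = X[y == 1]  # lower half circle, shifted right and up

# Distance of each point from the centre of its own arc
upper_dist = np.linalg.norm(upper - np.array([0.0, 0.0]), axis=1)
lower_dist = np.linalg.norm(lower - np.array([1.0, 0.5]), axis=1)

print("upper arc radius range:", upper_dist.min(), upper_dist.max())
print("lower arc radius range:", lower_dist.min(), lower_dist.max())
```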
Python3
# Import the necessary libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_moons(n_samples=500, shuffle=True,
                  noise=0.15, random_state=42)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_moons()
Multi-Class Classification
Example 1: The make_blobs() function generates isotropic Gaussian blobs, which are useful for testing clustering and multi-class classification algorithms.
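When you need to know where the blobs are, make_blobs() also accepts explicit center coordinates. The following is a minimal sketch; the center positions and cluster_std value are illustrative assumptions, not values from the example in this article.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical blob centres chosen for illustration
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])

# With an array of centres, label k is drawn around centers[k]
X, y = make_blobs(n_samples=600, centers=centers,
                  cluster_std=0.5, random_state=23)

# The mean of each blob should land close to its requested centre
blob_means = np.array([X[y == k].mean(axis=0)
                       for k in range(len(centers))])
print(np.round(blob_means, 2))
```

Fixing the centers this way makes it easy to control how much the classes overlap: move the centers closer together or raise cluster_std for a harder problem.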
Python3
# Import the necessary libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3,
                  n_features=2, random_state=23)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
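make_blobs()
Example 2: When generating data with make_classification(), the n_informative, n_redundant, n_repeated and n_classes arguments must be consistent with each other: n_informative + n_redundant + n_repeated cannot exceed n_features, and n_classes * n_clusters_per_class cannot exceed 2**n_informative. When shuffle=False, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].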
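As a rough sketch of how these arguments interact, the snippet below calls make_classification() with one valid and two invalid combinations. It assumes scikit-learn rejects inconsistent arguments with a ValueError; try_make is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from sklearn.datasets import make_classification


def try_make(**kwargs):
    """Return the shape of X, or the error text if the arguments are rejected."""
    try:
        X, y = make_classification(n_samples=100, random_state=0, **kwargs)
        return X.shape
    except ValueError as err:
        return f"ValueError: {err}"


# Valid: 3 classes * 1 cluster = 3 <= 2**2 informative combinations
ok = try_make(n_features=2, n_informative=2, n_redundant=0,
              n_repeated=0, n_classes=3, n_clusters_per_class=1)

# Invalid: 3 classes * 2 clusters = 6 > 2**2
too_many_clusters = try_make(n_features=2, n_informative=2, n_redundant=0,
                             n_repeated=0, n_classes=3, n_clusters_per_class=2)

# Invalid: 2 informative + 1 redundant = 3 > n_features = 2
too_many_features = try_make(n_features=2, n_informative=2, n_redundant=1,
                             n_repeated=0, n_classes=3, n_clusters_per_class=1)

print(ok)
print(too_many_clusters)
print(too_many_features)
```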
Python3
# Import the necessary libraries
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_classification(n_samples=100,
                           n_features=2,
                           n_redundant=0,
                           n_informative=2,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=1)

# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_classification()
Example 3: The make_multilabel_classification() function creates a random multi-label classification dataset.
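In a multi-label dataset each sample can carry several labels at once, or none at all. The sketch below reuses the arguments from this example and counts how many labels each sample received; it assumes the default dense indicator output, where the target is a 0/1 matrix with one column per class. The name Y is used here only to stress that the target is two-dimensional.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# Y has one row per sample and one 0/1 column per class
print("target shape:", Y.shape)

# Number of labels assigned to each sample (0, 1 or 2 here)
labels_per_sample = Y.sum(axis=1)
counts = np.bincount(labels_per_sample, minlength=Y.shape[1] + 1)
for n_labels, n_rows in enumerate(counts):
    print(f"samples with {n_labels} label(s): {n_rows}")
```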
Python3
# Import necessary libraries
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import matplotlib.pyplot as plt

# Generate 2d multi-label classification dataset
X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# Create pandas dataframe from generated dataset
df = pd.concat([pd.DataFrame(X, columns=['X1', 'X2']),
                pd.DataFrame(y, columns=['Label1', 'Label2'])],
               axis=1)

# print() works in plain Python; display() is only available in notebooks
print(df.head())

# Plot the generated datasets, colored by the first label
plt.scatter(df['X1'], df['X2'], c=df['Label1'])
plt.show()
Output:
     X1    X2  Label1  Label2
0  14.0  34.0       0       1
1  30.0  22.0       1       1
2  29.0  19.0       1       1
3  21.0  19.0       1       1
4  16.0  32.0       0       1
make_multilabel_classification()
Generate test datasets for Regression:
Example 1: Generate a 1-dimensional feature and target for linear regression using make_regression().
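make_regression() can also return the coefficient it used to build the target, which is handy when checking whether a model recovers the true relationship. This is a minimal sketch, assuming coef=True, no noise and the default bias of 0; the slope and residual names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the underlying linear coefficient(s)
X, y, coef = make_regression(n_samples=50, n_features=1, noise=0.0,
                             coef=True, random_state=23)

# With a single feature the coefficient may come back as a 0-d array,
# so flatten it before reading the slope
slope = float(np.ravel(coef)[0])

# Without noise or bias, the target is exactly slope * feature
residual = y - slope * X[:, 0]
print("true slope:", slope)
print("largest absolute residual:", np.abs(residual).max())
```

Raising noise (as in the example in this article, which uses noise=20) scatters the points around this line.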
Python3
# Import necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate 1d regression dataset
X, y = make_regression(n_samples=50, n_features=1,
                       noise=20, random_state=23)

# Plot the generated datasets
plt.scatter(X, y)
plt.show()
Output:
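make_regression()
Example 2: Generate a multi-feature regression dataset using make_sparse_uncorrelated(). The target depends only on the first four features, so with n_features=4 every plotted feature is informative.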
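A minimal sketch of what "sparse" means here, assuming scikit-learn's documented generating model in which the target is X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3] plus unit-variance Gaussian noise. With ten features, an ordinary least-squares fit should give coefficients near those values for the first four columns and near zero for the rest. The sample size and feature count are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# Ten features, but only the first four influence the target
X, y = make_sparse_uncorrelated(n_samples=2000, n_features=10,
                                random_state=23)

# Ordinary least-squares fit (no intercept) using NumPy
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("informative coefficients:", np.round(coef[:4], 2))
print("remaining coefficients:  ", np.round(coef[4:], 2))
```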
Python3
# Import necessary libraries
from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt

# Generate a regression dataset with 4 features
X, y = make_sparse_uncorrelated(n_samples=100, n_features=4,
                                random_state=23)

# Plot each feature against the target
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
Output:
make_sparse_uncorrelated()
Example 3: Generate a multi-feature, non-linear regression dataset using make_friedman2(), which always produces four features.
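The target of make_friedman2() is a fixed non-linear function of its four features. The sketch below assumes the formula given in the scikit-learn documentation, y = sqrt(X0^2 + (X1 * X2 - 1 / (X1 * X3))^2) plus optional noise, and checks it against noise-free output; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# noise=0.0 (the default) gives the exact, noise-free target
X, y = make_friedman2(n_samples=100, noise=0.0, random_state=23)

# Unpack the four feature columns
x0, x1, x2, x3 = X.T

# Recompute the target from the documented formula
expected = np.sqrt(x0 ** 2 + (x1 * x2 - 1.0 / (x1 * x3)) ** 2)

print("matches formula:", np.allclose(y, expected))
```

Because the relationship is non-linear, the scatter plots of individual features against the target will not look like straight lines.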
Python3
# Import necessary libraries
from sklearn.datasets import make_friedman2
import matplotlib.pyplot as plt

# Generate a non-linear regression dataset with 4 features
X, y = make_friedman2(n_samples=100, random_state=23)

# Plot each feature against the target
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
Output:
make_friedman2()