Skip to content
geeksforgeeks
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Tutorials
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
  • Practice
    • Build your AI Agent
    • GfG 160
    • Problem of the Day
    • Practice Coding Problems
    • GfG SDE Sheet
  • Contests
    • Accenture Hackathon (Ending Soon!)
    • GfG Weekly [Rated Contest]
    • Job-A-Thon Hiring Challenge
    • All Contests and Events
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Implementation of neural network from scratch using NumPy
Next article icon

Implementation of K-Nearest Neighbors from Scratch using Python

Last Updated : 14 Oct, 2020
Comments
Improve
Suggest changes
Like Article
Like
Report

Instance-Based Learning

K Nearest Neighbors Classification is one of the classification techniques based on instance-based learning. Models based on instance-based learning to generalize beyond the training examples. To do so, they store the training examples first. When it encounters a new instance (or test example), then they instantly build a relationship between stored training examples and this new instant to assign a target function value for this new instance. Instance-based methods are sometimes called lazy learning methods because they postponed learning until the new instance is encountered for prediction.

Instead of estimating the hypothetical function (or target function) once for the entire space, these methods will estimate it locally and differently for each new instance to be predicted.

K-Nearest Neighbors Classifier Learning

Basic Assumption:

  1. All instances correspond to points in the n-dimensional space where n represents the number of features in any instance.
  2. The nearest neighbors of an instance are defined in terms of the Euclidean distance.
An instance can be represented by < x1, x2, .............., xn >.  Euclidean distance between two instances xa and xb is given by d( xa, xb ) :  
\sqrt{\sum_{j=1}^{n}\left(x_{j}^{a}-x_{j}^{b}\right)^{2}}

How does it work?

K-Nearest Neighbors Classifier first stores the training examples. During prediction, when it encounters a new instance (or test example) to predict, it finds the K number of training instances nearest to this new instance.  Then assigns the most common class among the K-Nearest training instances to this test instance.

The optimal choice for K is by validating errors on test data. K can also be chosen by the square root of m, where m is the number of examples in the dataset.

KNN Graphical Working Representation

In the above figure, “+” denotes training instances labelled with 1. “-” denotes training instances with 0. Here we classified for the test instance xt as the most common class among K-Nearest training instances to it. Here we choose K = 3, so xt is classified as “-” or 0.

Pseudocode:

  1. Store all training examples.
  2. Repeat steps 3, 4, and 5 for each test example.
  3. Find the K number of training examples nearest to the current test example.
  4. y_pred for current test example =  most common class among K-Nearest training instances.
  5. Go to step 2.

Implementation:

Diabetes Dataset used in this implementation can be downloaded from link.

It has 8 features columns like i.e “Age”, “Glucose” e.t.c, and the target variable “Outcome” for 108 patients. So in this, we will create a link Neighbors Classifier model to predict the presence of diabetes or not for patients with such information.

# Importing libraries
  
import pandas as pd
  
import numpy as np
  
from sklearn.model_selection import train_test_split
  
from scipy.stats import mode
  
from sklearn.neighbors import KNeighborsClassifier
  
# K Nearest Neighbors Classification
  
class K_Nearest_Neighbors_Classifier() : 
      
    def __init__( self, K ) :
          
        self.K = K
          
    # Function to store training set
          
    def fit( self, X_train, Y_train ) :
          
        self.X_train = X_train
          
        self.Y_train = Y_train
          
        # no_of_training_examples, no_of_features
          
        self.m, self.n = X_train.shape
      
    # Function for prediction
          
    def predict( self, X_test ) :
          
        self.X_test = X_test
          
        # no_of_test_examples, no_of_features
          
        self.m_test, self.n = X_test.shape
          
        # initialize Y_predict
          
        Y_predict = np.zeros( self.m_test )
          
        for i in range( self.m_test ) :
              
            x = self.X_test[i]
              
            # find the K nearest neighbors from current test example
              
            neighbors = np.zeros( self.K )
              
            neighbors = self.find_neighbors( x )
              
            # most frequent class in K neighbors
              
            Y_predict[i] = mode( neighbors )[0][0]    
              
        return Y_predict
      
    # Function to find the K nearest neighbors to current test example
            
    def find_neighbors( self, x ) :
          
        # calculate all the euclidean distances between current 
        # test example x and training set X_train
          
        euclidean_distances = np.zeros( self.m )
          
        for i in range( self.m ) :
              
            d = self.euclidean( x, self.X_train[i] )
              
            euclidean_distances[i] = d
          
        # sort Y_train according to euclidean_distance_array and 
        # store into Y_train_sorted
          
        inds = euclidean_distances.argsort()
          
        Y_train_sorted = self.Y_train[inds]
          
        return Y_train_sorted[:self.K]
      
    # Function to calculate euclidean distance
              
    def euclidean( self, x, x_train ) :
          
        return np.sqrt( np.sum( np.square( x - x_train ) ) )
  
# Driver code
  
def main() :
      
    # Importing dataset
      
    df = pd.read_csv( "diabetes.csv" )
  
    X = df.iloc[:,:-1].values
  
    Y = df.iloc[:,-1:].values
      
    # Splitting dataset into train and test set
  
    X_train, X_test, Y_train, Y_test = train_test_split( 
      X, Y, test_size = 1/3, random_state = 0 )
      
    # Model training
      
    model = K_Nearest_Neighbors_Classifier( K = 3 )
      
    model.fit( X_train, Y_train )
      
    model1 = KNeighborsClassifier( n_neighbors = 3 )
      
    model1.fit( X_train, Y_train )
      
    # Prediction on test set
  
    Y_pred = model.predict( X_test )
      
    Y_pred1 = model1.predict( X_test )
      
    # measure performance
      
    correctly_classified = 0
      
    correctly_classified1 = 0
      
    # counter
      
    count = 0
      
    for count in range( np.size( Y_pred ) ) :
          
        if Y_test[count] == Y_pred[count] :
              
            correctly_classified = correctly_classified + 1
          
        if Y_test[count] == Y_pred1[count] :
              
            correctly_classified1 = correctly_classified1 + 1
              
        count = count + 1
          
    print( "Accuracy on test set by our model       :  ", ( 
      correctly_classified / count ) * 100 )
    print( "Accuracy on test set by sklearn model   :  ", ( 
      correctly_classified1 / count ) * 100 )
      
      
if __name__ == "__main__" : 
      
    main()
                      
                       

Output :

Accuracy on test set by our model       :   63.888888888888886  Accuracy on test set by sklearn model   :   63.888888888888886  

The accuracy achieved by our model and sklearn is equal which indicates the correct implementation of our model.

Note: Above Implementation is for model creation from scratch, not to improve the accuracy of the diabetes dataset.

K Nearest Neighbors Regression:

K Nearest Neighbors Regression first stores the training examples. During prediction, when it encounters a new instance ( or test example ) to predict,  it finds the K number of training instances nearest to this new instance. Then predicts the target value for this instance by calculating the mean of the target values of these nearest neighbors.

The optimal choice for K is by validating errors on test data. K can also be chosen by the square root of m, where m is the number of examples in the dataset.

Pseudocode :

  1. Store all training examples.
  2. Repeat steps 3, 4, and 5 for each test example.
  3. Find the K number of training examples nearest to the current test example.
  4. y_pred for current test example =  mean of the true target values of these K neighbors.
  5. Go to step 2.

Implementation:

Dataset used in this implementation can be downloaded from link

It has 2 columns — “YearsExperience” and “Salary” for 30 employees in a company. So in this, we will create a K Nearest Neighbors Regression model to learn the correlation between the number of years of experience of each employee and their respective salary.

The model, we created predicts the same value as the sklearn model predicts for the test set.

Python3

# Importing libraries
  
import pandas as pd
  
import numpy as np
  
from sklearn.model_selection import train_test_split
  
from sklearn.neighbors import KNeighborsRegressor
  
# K Nearest Neighbors Regression
  
class K_Nearest_Neighbors_Regressor() : 
      
    def __init__( self, K ) :
          
        self.K = K
          
    # Function to store training set
          
    def fit( self, X_train, Y_train ) :
          
        self.X_train = X_train
          
        self.Y_train = Y_train
          
        # no_of_training_examples, no_of_features
          
        self.m, self.n = X_train.shape
      
    # Function for prediction
          
    def predict( self, X_test ) :
          
        self.X_test = X_test
          
        # no_of_test_examples, no_of_features
          
        self.m_test, self.n = X_test.shape
          
        # initialize Y_predict
          
        Y_predict = np.zeros( self.m_test )
          
        for i in range( self.m_test ) :
              
            x = self.X_test[i]
              
            # find the K nearest neighbors from current test example
              
            neighbors = np.zeros( self.K )
              
            neighbors = self.find_neighbors( x )
              
            # calculate the mean of K nearest neighbors
              
            Y_predict[i] = np.mean( neighbors )
              
        return Y_predict
      
    # Function to find the K nearest neighbors to current test example
            
    def find_neighbors( self, x ) :
          
        # calculate all the euclidean distances between current test
        # example x and training set X_train
          
        euclidean_distances = np.zeros( self.m )
          
        for i in range( self.m ) :
              
            d = self.euclidean( x, self.X_train[i] )
              
            euclidean_distances[i] = d
          
        # sort Y_train according to euclidean_distance_array and 
        # store into Y_train_sorted
          
        inds = euclidean_distances.argsort()
          
        Y_train_sorted = self.Y_train[inds]
          
        return Y_train_sorted[:self.K]
      
    # Function to calculate euclidean distance
              
    def euclidean( self, x, x_train ) :
          
        return np.sqrt( np.sum( np.square( x - x_train ) ) )
   
# Driver code
  
def main() :
      
    # Importing dataset
      
    df = pd.read_csv( "salary_data.csv" )
  
    X = df.iloc[:,:-1].values
  
    Y = df.iloc[:,1].values
      
    # Splitting dataset into train and test set
  
    X_train, X_test, Y_train, Y_test = train_test_split( 
      X, Y, test_size = 1/3, random_state = 0 )
      
    # Model training
      
    model = K_Nearest_Neighbors_Regressor( K = 3 )
  
    model.fit( X_train, Y_train )
      
    model1 = KNeighborsRegressor( n_neighbors = 3 )
      
    model1.fit( X_train, Y_train )
      
    # Prediction on test set
  
    Y_pred = model.predict( X_test )
      
    Y_pred1 =  model1.predict( X_test )
      
    print( "Predicted values by our model     :  ", np.round( Y_pred[:3], 2 ) ) 
      
    print( "Predicted values by sklearn model :  ", np.round( Y_pred1[:3], 2 ) )
      
    print( "Real values                       :  ", Y_test[:3] )
  
  
if __name__ == "__main__" : 
      
    main()
                      
                       

Output :

Predicted values by our model     :   [ 43024.33 113755.33  58419.  ]  Predicted values by sklearn model :   [ 43024.33 113755.33  58419.  ]  Real values                       :   [ 37731 122391  57081]

Disadvantage: Instance Learning models are computationally very costly because all the computations are done during prediction. It also considers all the training examples for the prediction of every test example.



Next Article
Implementation of neural network from scratch using NumPy
author
mohitbaliyan
Improve
Article Tags :
  • AI-ML-DS
  • Machine Learning
  • python
Practice Tags :
  • Machine Learning
  • python

Similar Reads

  • Implementation of Radius Neighbors from Scratch in Python
    Radius Neighbors is also one of the techniques based on instance-based learning. Models based on instance-based learning generalize beyond the training examples. To do so, they store the training examples first. When it encounters a new instance (or test example), then they instantly build a relatio
    8 min read
  • Implementation of neural network from scratch using NumPy
    Neural networks are a core component of deep learning models, and implementing them from scratch is a great way to understand their inner workings. we will demonstrate how to implement a basic Neural networks algorithm from scratch using the NumPy library in Python, focusing on building a three-lett
    7 min read
  • Implementation of K Nearest Neighbors
    Prerequisite: K nearest neighbors Introduction Say we are given a data set of items, each having numerically valued features (like Height, Weight, Age, etc). If the count of features is n, we can represent the items as points in an n-dimensional grid. Given a new item, we can calculate the distance
    10 min read
  • Implementation of KNN classifier using Scikit - learn - Python
    K-Nearest Neighbors is a most simple but fundamental classifier algorithm in Machine Learning. It is under the supervised learning category and used with great intensity for pattern recognition, data mining and analysis of intrusion. It is widely disposable in real-life scenarios since it is non-par
    3 min read
  • ML | Naive Bayes Scratch Implementation using Python
    Naive Bayes is a probabilistic machine learning algorithms based on the Bayes Theorem. It is a simple yet powerful algorithm because of its understanding, simplicity and ease of implementation. It is popular method for classification applications such as spam filtering and text classification. In th
    7 min read
  • Implementation of Ridge Regression from Scratch using Python
    Prerequisites: Linear Regression Gradient Descent Introduction: Ridge Regression ( or L2 Regularization ) is a variation of Linear Regression. In Linear Regression, it minimizes the Residual Sum of Squares ( or RSS or cost function ) to fit the training examples perfectly as possible. The cost funct
    4 min read
  • Implementing SVM from Scratch in Python
    Support Vector Machines (SVMs) is a supervised machine learning algorithms used for classification and regression tasks. They work by finding the optimal hyperplane that separates data points of different classes with the maximum margin. We can use Scikit library of python to implement SVM but in th
    4 min read
  • k-nearest neighbor algorithm using Sklearn - Python
    K-Nearest Neighbors (KNN) works by identifying the 'k' nearest data points called as neighbors to a given input and predicting its class or value based on the majority class or the average of its neighbors. In this article we will implement it using Python's Scikit-Learn library. Implementation of K
    3 min read
  • ANN - Implementation of Self Organizing Neural Network (SONN) from Scratch
    Prerequisite: ANN | Self Organizing Neural Network (SONN) Learning Algorithm To implement a SONN, here are some essential consideration- Construct a Self Organizing Neural Network (SONN) or Kohonen Network with 100 neurons arranged in a 2-dimensional matrix with 10 rows and 10 columns Train the netw
    4 min read
  • K Nearest Neighbors with Python | ML
    K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining, and intrusion detection. The K-Nearest Neighbors (KNN) algorithm is a simple, easy
    5 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences