
Implementation of K Nearest Neighbors

Last Updated : 09 Nov, 2022

Prerequisite: K nearest neighbors 
 

Introduction

Say we are given a data set of items, each having numerically valued features (like Height, Weight, Age, etc). If the count of features is n, we can represent the items as points in an n-dimensional grid. Given a new item, we can calculate the distance from it to every other item in the set. We then pick the k closest neighbors, look at which class most of those neighbors belong to, and classify the new item as that class.
The problem then becomes how to calculate the distances between items. The answer depends on the data set: if the values are real-valued, we usually use the Euclidean distance; if the values are categorical or binary, we usually use the Hamming distance.
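For illustration, here is a minimal sketch of a Hamming distance for categorical features (the rest of this tutorial sticks to numeric features and the Euclidean distance, so the function below is an assumption for illustration only, not part of the code that follows):

Python3

def HammingDistance(x, y):
    # Count the positions at which
    # the two items disagree
    return sum(1 for a, b in zip(x, y) if a != b)

# e.g. HammingDistance(['red', 'S'], ['red', 'M']) returns 1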
Algorithm: 
 

Given a new item:
    1. Find the distances between the new item and all other items
    2. Pick the k shortest distances
    3. Find the most common class among these k neighbors
    4. Classify the new item as that class

 

Reading Data

Let our input file be in the following format:
 

Height, Weight, Age, Class
1.70, 65, 20, Programmer
1.90, 85, 33, Builder
1.78, 76, 31, Builder
1.73, 74, 24, Programmer
1.81, 75, 35, Builder
1.73, 70, 75, Scientist
1.80, 71, 63, Scientist
1.75, 69, 25, Programmer

Each item is a line, and the entry under “Class” is the class the item belongs to. The values under the feature names (“Height” etc.) are the values the item has for that feature. All the values and features are separated by commas.
Save the data above (or your own data in the same format) as-is into a text file named data.txt in the working directory.
We will read from the file (named “data.txt”) and split the input by lines:
 

f = open('data.txt', 'r')
lines = f.read().splitlines()
f.close()

The first line of the file holds the feature names, with the keyword “Class” at the end. We want to store the feature names into a list:
 

# Split the first line by commas,
# remove the last element ("Class")
# and save the rest into a list.
# The list now holds the feature
# names of the data set.
features = lines[0].split(', ')[:-1]

Then we move on to the data set itself. We will save the items into a list, named items, whose elements are dictionaries (one for each item). The keys to these item-dictionaries are the feature names, plus “Class” to hold the item class. In the end, we want to shuffle the items in the list (this is a safety measure, in case the items are in a weird order). 
 

Python3




from random import shuffle

items = []

for i in range(1, len(lines)):

    line = lines[i].split(', ')

    # The last element in the line
    # is the item's class
    itemFeatures = {"Class": line[-1]}

    # Iterate through the features
    for j in range(len(features)):

        # Get the feature at index j
        f = features[j]

        # Convert the feature value
        # to a float
        v = float(line[j])

        # Add feature to dict
        itemFeatures[f] = v

    # Append temp dict to items
    items.append(itemFeatures)

shuffle(items)
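As an aside, the same file could be parsed with Python's standard csv module; the sketch below is a hypothetical alternative to the manual splitting above, assuming the same layout (a header row ending in "Class" and comma-separated values):

Python3

import csv
from random import shuffle

def ReadDataCsv(fileName):
    with open(fileName, 'r') as f:

        # skipinitialspace strips the blank
        # after each comma in the file
        reader = csv.DictReader(f, skipinitialspace=True)

        items = []
        for row in reader:
            item = {'Class': row['Class']}

            # Convert every non-class value to float
            for key, value in row.items():
                if key != 'Class':
                    item[key] = float(value)

            items.append(item)

    shuffle(items)
    return items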
 
 

Classifying the data

With the data stored in items, we can now start building our classifier. For the classifier, we will create a new function, Classify. It will take as input the item we want to classify, the items list, and k, the number of nearest neighbors to consider.
If k is greater than the length of the data set, we do not go ahead with the classification, as we cannot have more nearest neighbors than there are items in the data set. (Alternatively, we could clamp k to the length of the items list instead of returning an error message.)
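If we preferred that clamping alternative, a one-line guard would do (a sketch, not what this tutorial uses):

k = min(k, len(Items))

The tutorial instead aborts with an error message: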
 

if k > len(Items):

    # k is larger than list
    # length, abort
    return "k larger than list length"

We want to calculate the distance between the item to be classified and all the items in the training set, keeping in the end only the k shortest distances. To keep the current closest neighbors we use a list, called neighbors. Each element in the list holds two values: the distance from the item to be classified, and the class of that neighbor. We will calculate distance via the generalized Euclidean formula (for n dimensions). Then, we will pick the class that appears most often in neighbors, and that will be our pick. In code: 
 

Python3




def Classify(nItem, k, Items):
    if k > len(Items):

        # k is larger than list
        # length, abort
        return "k larger than list length"

    # Hold nearest neighbors.
    # First item is distance,
    # second class
    neighbors = []

    for item in Items:

        # Find Euclidean Distance
        distance = EuclideanDistance(nItem, item)

        # Update neighbors, either adding
        # the current item in neighbors
        # or not.
        neighbors = UpdateNeighbors(neighbors, item, distance, k)

    # Count the number of each
    # class in neighbors
    count = CalculateNeighborsClass(neighbors, k)

    # Find the max in count, aka the
    # class with the most appearances.
    return FindMax(count)
 
 

The external functions we need to implement are EuclideanDistance, UpdateNeighbors, CalculateNeighborsClass, and FindMax.
 

Finding Euclidean Distance

The generalized Euclidean formula for two vectors x and y is this: 

\text{distance} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}

In code: 

Python3




import math

def EuclideanDistance(x, y):

    # The sum of the squared
    # differences of the elements
    S = 0

    # x is the item to classify, so its keys
    # are feature names only (no "Class")
    for key in x.keys():
        S += math.pow(x[key] - y[key], 2)

    # The square root of the sum
    return math.sqrt(S)
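As a quick sanity check, here is the function applied to two items from the sample data (note that the item passed as x carries no "Class" key, since the loop runs over x's keys):

a = {'Height': 1.70, 'Weight': 65, 'Age': 20}
b = {'Height': 1.90, 'Weight': 85, 'Age': 33}
print(EuclideanDistance(a, b))  # sqrt(0.2**2 + 20**2 + 13**2), about 23.85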
 
 

Updating Neighbors

We have our neighbors list (which should have a length of at most k) and we want to add an item to it with a given distance. First, we check whether neighbors already has a length of k. If it has fewer elements, we add the item regardless of the distance (we need to fill the list up to k before we can start rejecting items). If the list is full, we check whether the new item has a shorter distance than the item with the maximum distance in the list. If it does, we replace the max-distance item with the new one.
To find the max-distance item quickly, we keep the list sorted in ascending order, so the last item in the list always has the maximum distance. After replacing it with a new item, we sort the list again.
To speed this process up, we could implement an insertion-sort-style update, inserting each new item directly into its sorted position instead of re-sorting the entire list. The original code for this is rather long and, although simple, would bog the tutorial down.
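For the curious, here is a compact sketch of that idea built on Python's standard bisect module (a sketch, not the tutorial's own code; the simpler re-sorting version follows below):

Python3

import bisect

def UpdateNeighborsSorted(neighbors, item, distance, k):

    # Insert at the sorted position (by distance)
    # without re-sorting the whole list
    bisect.insort(neighbors, [distance, item['Class']])

    # Keep only the k closest neighbors
    return neighbors[:k]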
 

Python3




def UpdateNeighbors(neighbors, item, distance, k):

    if len(neighbors) < k:

        # List is not full, add
        # new item and sort
        neighbors.append([distance, item["Class"]])
        neighbors = sorted(neighbors)
    else:

        # List is full; check if the
        # new item should be entered
        if neighbors[-1][0] > distance:

            # If yes, replace the last
            # element with new item
            neighbors[-1] = [distance, item["Class"]]
            neighbors = sorted(neighbors)

    return neighbors
 
 

CalculateNeighborsClass

Here we will calculate the class that appears most often in neighbors. For that, we will use another dictionary, called count, where the keys are the class names appearing in neighbors. If a key doesn’t exist, we will add it, otherwise, we will increment its value. 
 

Python3




def CalculateNeighborsClass(neighbors, k):
    count = {}

    for i in range(k):

        if neighbors[i][1] not in count:

            # The class at the ith index
            # is not in the count dict.
            # Initialize it to 1.
            count[neighbors[i][1]] = 1
        else:

            # Found another item of class
            # c[i]. Increment its counter.
            count[neighbors[i][1]] += 1

    return count
 
 

FindMax

This function takes as input the dictionary count we built in CalculateNeighborsClass and returns the class with the maximum count, along with that count.
 

Python3




def FindMax(countList):

    # Hold the max
    maximum = -1

    # Hold the classification
    classification = ""

    for key in countList.keys():

        if countList[key] > maximum:
            maximum = countList[key]
            classification = key

    return classification, maximum
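For reference, the counting and max-finding steps can also be collapsed into a few lines with collections.Counter from the standard library (a sketch equivalent to CalculateNeighborsClass followed by FindMax, not the tutorial's code):

Python3

from collections import Counter

def MajorityClass(neighbors, k):

    # Tally the class of each of the k neighbors
    count = Counter(n[1] for n in neighbors[:k])

    # most_common(1) returns [(class, count)]
    # for the most frequent class
    return count.most_common(1)[0]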
 
 

Conclusion

With that, this kNN tutorial is finished.
You can now classify new items, setting k as you see fit. Usually an odd number is used for k (to reduce ties), but that is not necessary. To classify a new item, create a dictionary whose keys are the feature names and whose values characterize the item. An example of classification:
 

newItem = {'Height': 1.74, 'Weight': 67, 'Age': 22}
print(Classify(newItem, 3, items))

The complete code of the above approach is given below:
 

Python3




# Python Program to illustrate
# KNN algorithm
 
# For pow and sqrt
import math
from random import shuffle
 
###_Reading_###
def ReadData(fileName):
 
    # Read the file, splitting by lines
    f = open(fileName, 'r')
    lines = f.read().splitlines()
    f.close()
 
    # Split the first line by commas,
    # remove the last element ("Class")
    # and save the rest into a list.
    # The list holds the feature names
    # of the data set.
    features = lines[0].split(', ')[:-1]
 
    items = []
 
    for i in range(1, len(lines)):
         
        line = lines[i].split(', ')
 
        itemFeatures = {'Class': line[-1]}
 
        for j in range(len(features)):
             
            # Get the feature at index j
            f = features[j] 
 
            # Convert feature value to float
            v = float(line[j])
             
             # Add feature value to dict
            itemFeatures[f] = v
         
        items.append(itemFeatures)
 
    shuffle(items)
 
    return items
 
 
###_Auxiliary Function_###
def EuclideanDistance(x, y):
     
    # The sum of the squared differences
    # of the elements
    S = 0 
     
    for key in x.keys():
        S += math.pow(x[key] - y[key], 2)
 
    # The square root of the sum
    return math.sqrt(S)
 
def CalculateNeighborsClass(neighbors, k):
    count = {}

    # The training fold may hold fewer
    # than k items, so cap the loop
    for i in range(min(k, len(neighbors))):
        if neighbors[i][1] not in count:

            # The class at the ith index is
            # not in the count dict.
            # Initialize it to 1.
            count[neighbors[i][1]] = 1
        else:

            # Found another item of class
            # c[i]. Increment its counter.
            count[neighbors[i][1]] += 1

    return count
 
def FindMax(Dict):
 
    # Find max in dictionary, return
    # max value and max index
    maximum = -1
    classification = ''
 
    for key in Dict.keys():
         
        if Dict[key] > maximum:
            maximum = Dict[key]
            classification = key
 
    return (classification, maximum)
 
 
###_Core Functions_###
def Classify(nItem, k, Items):
 
    # Hold nearest neighbours. First item
    # is distance, second class
    neighbors = []
 
    for item in Items:
 
        # Find Euclidean Distance
        distance = EuclideanDistance(nItem, item)
 
        # Update neighbors, either adding the
        # current item in neighbors or not.
        neighbors = UpdateNeighbors(neighbors, item, distance, k)
 
    # Count the number of each class
    # in neighbors
    count = CalculateNeighborsClass(neighbors, k)
 
    # Find the max in count, aka the
    # class with the most appearances
    return FindMax(count)
 
 
def UpdateNeighbors(neighbors, item, distance, k):
    if len(neighbors) < k:
 
        # List is not full, add
        # new item and sort
        neighbors.append([distance, item['Class']])
        neighbors = sorted(neighbors)
    else:
 
        # List is full Check if new
        # item should be entered
        if neighbors[-1][0] > distance:
 
            # If yes, replace the
            # last element with new item
            neighbors[-1] = [distance, item['Class']]
            neighbors = sorted(neighbors)
 
    return neighbors
 
###_Evaluation Functions_###
def K_FoldValidation(K, k, Items):
     
    if K > len(Items):
        return -1
 
    # The number of correct classifications
    correct = 0 
     
    # The total number of classifications
    total = len(Items) * (K - 1) 
     
    # The length of a fold
    l = int(len(Items) / K) 
 
    for i in range(K):
 
        # Split the data: here the i-th fold
        # serves as the training set and the
        # remaining items as the test set, so
        # each item is tested K - 1 times
        # (which matches total above)
        trainingSet = Items[i * l:(i + 1) * l]
        testSet = Items[:i * l] + Items[(i + 1) * l:]
 
        for item in testSet:
            itemClass = item['Class']
 
            itemFeatures = {}
 
            # Get feature values
            for key in item:
                if key != 'Class':
 
                    # If key isn't "Class", add
                    # it to itemFeatures
                    itemFeatures[key] = item[key]
 
            # Categorize item based on
            # its feature values
            guess = Classify(itemFeatures, k, trainingSet)[0]
 
            if guess == itemClass:
 
                # Guessed correctly
                correct += 1
 
    accuracy = correct / float(total)
    return accuracy
 
 
def Evaluate(K, k, items, iterations):
 
    # Run algorithm the number of
    # iterations, pick average
    accuracy = 0
     
    for i in range(iterations):
        shuffle(items)
        accuracy += K_FoldValidation(K, k, items)
 
    print(accuracy / float(iterations))
 
 
###_Main_###
def main():
    items = ReadData('data.txt')
 
    Evaluate(5, 5, items, 100)
 
if __name__ == '__main__':
    main()
 
 

Output: 

0.9375

The output can vary from run to run, since the items are shuffled before each run. The code also includes a K-fold validation routine (K_FoldValidation and Evaluate); it is not part of the kNN algorithm itself, but is included to measure the classifier's accuracy.
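If scikit-learn is available, the classifier can also be sanity-checked against its built-in implementation; the sketch below assumes scikit-learn is installed and reuses the items list returned by ReadData, with the sample features Height, Weight, and Age:

Python3

from sklearn.neighbors import KNeighborsClassifier

# Convert the item dictionaries into a
# feature matrix and a label vector
X = [[it['Height'], it['Weight'], it['Age']] for it in items]
y = [it['Class'] for it in items]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# Should agree with Classify(newItem, 3, items)
print(clf.predict([[1.74, 67, 22]]))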


