Introduction to Hadoop Distributed File System(HDFS)

Last Updated : 04 Apr, 2025

With growing data velocity, data volumes quickly outgrow the storage capacity of a single machine. A solution is to store the data across a network of machines; such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come into play. 
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is designed to store extremely large files with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on these terms:  

  • Extremely large files: Here, we are talking about data in the range of petabytes (1 PB = 1000 TB).
  • Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times (see the sketch after this list).
  • Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
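
As a concrete illustration of the write-once, read-many pattern, here is a minimal sketch using Hadoop's Java FileSystem API. It assumes a running HDFS cluster reachable through fs.defaultFS; the path /user/demo/data.txt is purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/data.txt"); // illustrative path

            // Write once: create the file, write, and close it.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello hdfs");
            }

            // Read many times: the same file can now be scanned repeatedly.
            for (int i = 0; i < 3; i++) {
                try (FSDataInputStream in = fs.open(file)) {
                    System.out.println(in.readUTF());
                }
            }
            fs.close();
        }
    }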

Nodes: An HDFS cluster typically follows a master-slave architecture. 

  1. NameNode (master node): 
    • Manages all the slave nodes and assigns work to them.
    • Executes filesystem namespace operations such as opening, closing, and renaming files and directories (see the sketch after this list).
    • Should be deployed on reliable, high-end hardware, not on commodity hardware, since the whole cluster depends on it.
  2. DataNode (slave node): 
    • Worker nodes that do the actual work: reading, writing, processing, etc.
    • Also perform block creation, deletion, and replication upon instruction from the master.
    • Can be deployed on commodity hardware.
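
The namespace operations listed above are exactly what a client requests from the NameNode. Here is a minimal sketch, again assuming a reachable cluster and illustrative paths; note that only metadata flows through the master, never file contents.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Each call below is a metadata operation served by the NameNode.
            fs.mkdirs(new Path("/user/demo/reports"));       // create a directory
            fs.rename(new Path("/user/demo/reports"),
                      new Path("/user/demo/archive"));       // rename it
            fs.delete(new Path("/user/demo/archive"), true); // recursive delete
            fs.close();
        }
    }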

HDFS daemons: Daemons are the processes running in the background. 

  • NameNode: 
    • Runs on the master node.
    • Stores metadata (data about data) such as file paths, the number of blocks, block IDs, etc.
    • Requires a large amount of RAM.
    • Keeps the metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is kept on disk.
  • DataNodes: 
    • Run on slave nodes.
    • Require large disk capacity, as the actual data is stored here.

Data storage in HDFS: Now let's see how the data is stored in a distributed manner. 

Suppose a 100 TB file is inserted. The file is first divided into blocks (the default block size is 128 MB in Hadoop 2.x and above), and these blocks are stored across different DataNodes (slave nodes). The DataNodes replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, meaning 3 copies of each block exist in the cluster (the original included). We can increase or decrease the replication factor by editing the configuration in hdfs-site.xml, as sketched below. 
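
A minimal sketch of tuning these settings programmatically, assuming the standard property names dfs.replication and dfs.blocksize (the same keys you would set in hdfs-site.xml); the path in the per-file call is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same keys as in hdfs-site.xml:
            conf.set("dfs.replication", "3");       // 3 copies of every block
            conf.set("dfs.blocksize", "134217728"); // 128 MB, in bytes

            FileSystem fs = FileSystem.get(conf);
            // Replication can also be changed per file after creation:
            fs.setReplication(new Path("/user/demo/data.txt"), (short) 5);
            fs.close();
        }
    }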

Note: The master node (NameNode) has the record of everything: it knows the location and details of every single DataNode and the blocks they contain, i.e., nothing is done without the permission of the master node. 

Why divide the file into blocks? 

Answer: Let's assume we don't divide; it is very difficult to store a 100 TB file on a single machine. Even if we could store it, each read and write operation on that whole file would take a very high seek time. But if we have multiple blocks of size 128 MB, it becomes easy to perform various read and write operations on them compared to doing it on the whole file at once. (At 128 MB per block, a 100 TB file splits into 100 × 1024 × 1024 / 128 = 819,200 blocks, which can be read and written in parallel.) So we divide the file to get faster data access, i.e., reduced seek time. 

Why replicate the blocks across DataNodes while storing? 

Answer: Let's assume we don't replicate, and a given block is present on only one DataNode, D1. If D1 crashes, we lose the block, which makes the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance. 


Terms related to HDFS:  

  • Heartbeat: The signal that a DataNode continuously sends to the NameNode. If the NameNode stops receiving heartbeats from a DataNode for long enough, it considers that node dead (see the sketch after this list).
  • Balancing: If a DataNode crashes, the blocks it held are gone too, and those blocks become under-replicated compared to the rest. The master node (NameNode) then signals the DataNodes holding replicas of the lost blocks to re-replicate them, so that the overall distribution of blocks stays balanced.
  • Replication: Carried out by the DataNodes, on instruction from the NameNode.
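
To make the heartbeat mechanism concrete, here is a simplified, hypothetical sketch of the bookkeeping a NameNode might do; it is not Hadoop's actual implementation. (In real HDFS, DataNodes send heartbeats every 3 seconds by default, and a node is declared dead after roughly 10 minutes of silence.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical, simplified heartbeat tracker -- not Hadoop's actual code.
    public class HeartbeatTracker {
        private static final long DEAD_AFTER_MS = 10 * 60 * 1000; // ~10 min, as in HDFS defaults
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        // Called whenever a DataNode's heartbeat arrives (every ~3 s in real HDFS).
        public void onHeartbeat(String dataNodeId) {
            lastSeen.put(dataNodeId, System.currentTimeMillis());
        }

        // Periodically checked by the NameNode; a dead node's blocks
        // must be re-replicated elsewhere to restore the replication factor.
        public boolean isDead(String dataNodeId) {
            Long t = lastSeen.get(dataNodeId);
            return t == null || System.currentTimeMillis() - t > DEAD_AFTER_MS;
        }
    }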


Note: No two replicas of the same block are ever placed on the same DataNode. 

Features:  

  • Distributed data storage.
  • Blocks reduce seek time.
  • The data is highly available, as the same block is present on multiple DataNodes.
  • Even if multiple DataNodes are down, we can still do our work, which makes the system highly reliable.
  • High fault tolerance.


Limitations: Though HDFS provides many features, there are some areas where it does not work well. 

  • Low-latency data access: Applications that require low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed with high throughput in mind, even at the cost of latency.
  • Small-file problem: Having lots of small files results in lots of seeks and lots of hops from one DataNode to another to retrieve each small file, which is a very inefficient data access pattern. Moreover, every file, block, and directory occupies metadata in the NameNode's RAM, so millions of tiny files can exhaust the master's memory long before disk space runs out.
