Hadoop - Mapper In MapReduce

In Hadoop's MapReduce framework, the Mapper is the core component of the Map phase. It processes raw input data and converts it into structured key-value pairs that Hadoop can handle efficiently.

A Mapper is a user-defined Java class that takes input splits (chunks of data from HDFS), processes each record, and emits intermediate key-value pairs. These pairs are then shuffled and sorted before being passed to the Reducer (or written directly to HDFS in the case of a Map-only job).

For example:

class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Parameters:
KEYIN: Input key (e.g., line offset in a file).
VALUEIN: Input value (e.g., a line of text).
KEYOUT: Output key (e.g., word).
VALUEOUT: Output value (e.g., integer count).

Mapper Workflow

The Mapper's task is completed with the help of five key components (a driver-side configuration sketch follows these steps):

1. Input
The Mapper process starts with the input, which consists of raw datasets stored in HDFS. An InputFormat locates and interprets this data so it can be processed properly.

2. Input Splits
The input is divided into input splits, allowing Hadoop to process data in parallel. Each split is handled by a separate Mapper task. The maximum split size can be configured with mapred.max.split.size (mapreduce.input.fileinputformat.split.maxsize in newer releases), and the number of Mappers is calculated as:

Number of Mappers = Total Data Size / Input Split Size

For example, a 10 TB file (10,485,760 MB) with 128 MB splits results in about 81,920 Mappers.

3. RecordReader
Each split is then converted into key-value pairs by a RecordReader. By default, Hadoop uses TextInputFormat, where the key is the byte offset of a line and the value is the text of that line.

4. Map Function
The map() function contains the user-defined logic. It processes each key-value pair and produces intermediate key-value pairs, which serve as input for the Reduce phase.

5. Intermediate Output Disk
The Mapper's output is stored temporarily, first in an in-memory buffer (100 MB by default, configurable via io.sort.mb, or mapreduce.task.io.sort.mb in newer releases). When the buffer fills up, the data is spilled to the local disk. These results are not written to HDFS unless the job is a Map-only job with no Reducer.
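The split size and spill-buffer settings mentioned in the steps above are set on the job configuration rather than inside the Mapper class. The following is a minimal driver-side sketch, not taken from the original article: the class name MapperConfigSketch, the 200 MB buffer value, and the args[0]/args[1] paths are illustrative assumptions; the property name and the FileInputFormat/TextInputFormat calls are standard Hadoop MapReduce API.

Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapperConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Spill buffer for intermediate Mapper output (100 MB by default); 200 MB is an example value.
        conf.set("mapreduce.task.io.sort.mb", "200");

        Job job = Job.getInstance(conf, "mapper config sketch");
        job.setJarByClass(MapperConfigSketch.class);

        // TextInputFormat is the default: key = byte offset of the line, value = line text.
        job.setInputFormatClass(TextInputFormat.class);

        // Cap the split size at 128 MB; one Mapper task runs per split.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // The Mapper (and optionally Reducer) classes would be set here,
        // as shown with WordCountMapper in the example below.

        // Illustrative paths -- replace with real HDFS locations.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Lowering the maximum split size raises the Mapper count for the same input, which increases parallelism at the cost of extra task-scheduling overhead.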
Example: WordCount Mapper

The WordCount program demonstrates the Mapper's role clearly.

Java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+"); // Split line into words
        for (String w : words) {
            word.set(w);
            context.write(word, one);                    // Emit (word, 1)
        }
    }
}

Input:
Hello Hadoop
Hello Mapper

Mapper Output (Intermediate Data):
(Hello, 1)
(Hadoop, 1)
(Hello, 1)
(Mapper, 1)

Explanation:
Mapper Definition: extends Mapper<LongWritable, Text, Text, IntWritable> defines the input types (line offset, line text) and the output types (word, count).
Setup: IntWritable one = new IntWritable(1) holds the constant count of 1, and Text word stores each word.
In map(), each line (value) is split into words on whitespace.
For every word, context.write(word, one) emits (word, 1) as an intermediate key-value pair.

Key Features of Hadoop Mapper
Parallelism: Each input split is handled by a separate Mapper task running in parallel.
Intermediate Data: Produces temporary key-value pairs for the Reducer.
Flexibility: The logic can be customized to the use case (filtering, parsing, transformation).
Map-Only Jobs: If no Reducer is needed, the Mapper output itself can be written to HDFS (see the driver sketch at the end of this article).
Local Storage of Output: To avoid replication overhead, intermediate results are kept on local disk until they are shuffled.

How to Calculate the Number of Mappers in Hadoop

The number of Mappers is determined by the input split size, not directly by the number of HDFS blocks. Each split is handled by one Mapper task. By default, the split size equals the HDFS block size (e.g., 128 MB), but it can be configured.

Formula:
Number of Mappers = Total Data Size / Input Split Size

Example: For a dataset of 10 TB (10 x 1024 x 1024 MB = 10,485,760 MB) with a split size of 128 MB:
10,485,760 / 128 = 81,920 Mappers
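To show where WordCountMapper plugs in, and how the Map-only job mentioned under Key Features is produced, here is a minimal driver sketch. It is not part of the original article: the class name WordCountDriver and the command-line paths are assumptions, and the commented-out setNumReduceTasks(0) line is the Map-only variant; the Job and FileInputFormat/FileOutputFormat calls are standard Hadoop MapReduce API.

Java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the Mapper shown above; its output types must match the map output types below.
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // A Reducer (and optionally a Combiner) would normally be set here, e.g. job.setReducerClass(...).
        // For a Map-only job, drop the Reducer entirely:
        // job.setNumReduceTasks(0);   // Mapper output is then written straight to HDFS

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would be launched with two HDFS paths, for example: hadoop jar wordcount.jar WordCountDriver /input /output (paths illustrative).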