Apache Hadoop in Big Data - At A Glance


Apache Hadoop is open source software. The Apache Hadoop software library is a framework that enables the distributed storage and distributed processing of large data sets (i.e. Big Data) on clusters of commodity servers using simple programming models.

Commodity computing

Commodity computing refers to low-budget cluster computing. Multiple computers, multiple storage devices, and redundant interconnections together give the user the equivalent of a single highly available system. Large numbers of computing components are used in parallel to get the greatest amount of computation at very low cost.

Commodity computing is preferred here for its inexpensive, modestly performing hardware components that work in parallel. Commodity computing systems are conceived, designed, and built to minimize the cost per unit of performance.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage, with a very high degree of fault tolerance. Rather than rely on hardware to deliver high availability, the Hadoop software detects and handles failures at the application layer, making the clusters extremely resilient.

It delivers a highly available service on top of a cluster of computers, each of which may be prone to failure. Hadoop can handle high volumes of complex data, on the order of petabytes.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common 

         Contains the libraries and utilities needed by the other Hadoop modules.

  • Hadoop Distributed File System (HDFS) 

         A distributed file system that stores data on commodity machines. HDFS spans all the nodes in a Hadoop cluster, linking the file systems on many local nodes into one big file system.

  • Hadoop YARN 

         Yet Another Resource Negotiator (YARN) is the resource-management platform. It assigns compute resources such as CPU and memory to applications running on a Hadoop cluster and schedules users' applications on them. YARN also enables other application frameworks (like Spark) to run on Hadoop.

  • Hadoop MapReduce 

         A programming model for large-scale data processing. In Hadoop 1.x its JobTracker and TaskTracker daemons also scheduled and tracked jobs; from Hadoop 2 onward YARN handles that role.
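To make the MapReduce programming model concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the three phase functions and the sensor-reading job are hypothetical names chosen for illustration, and the whole thing runs on one machine, whereas Hadoop runs the same phases across many nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key together with all of its values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical job: highest reading per sensor from "sensor,reading" records.
def max_map(record):
    sensor, reading = record.split(",")
    return [(sensor, float(reading))]


def max_reduce(key, values):
    return max(values)


def run_job(records):
    """Chain the three phases for the hypothetical max-reading job."""
    return reduce_phase(shuffle_phase(map_phase(records, max_map)), max_reduce)
```

Calling run_job(["a,1.5", "b,2.0", "a,3.25"]) returns {"a": 3.25, "b": 2.0}. The user only writes the map and reduce functions; grouping by key is left to the framework.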

Hadoop Distributed File System

HDFS splits files into large blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce ships code (JAR files) to the nodes that hold the required data, and those nodes then process the data in parallel.

Additional software packages can be installed on top of or alongside Hadoop, such as:

  • Apache Pig
  • Apache Hive
  • Apache HBase
  • Apache Spark

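As a rough illustration of the block splitting described in the Hadoop Distributed File System section, the Python sketch below cuts a file of a given size into fixed-size blocks and spreads them over nodes. The function names and the round-robin placement are assumptions made for this example; real HDFS placement also replicates each block and takes rack layout into account.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Toy placement: hand blocks to nodes in turn (not the real HDFS policy)."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement
```

With these assumptions a 300 MB file becomes three blocks of 128 MB, 128 MB, and 44 MB. The placement dictionary also shows the idea behind shipping code to the data: each node would run the job only over the blocks listed under its own name.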

YARN stands for "Yet Another Resource Negotiator". YARN separates resource management from the processing engine so that new engines can use it, while MapReduce continues to run on top of it for data processing. With YARN, you can run multiple applications in Hadoop, all sharing common resource management.

MapReduce code is commonly written in Java, but any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
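A minimal sketch of what the "map" and "reduce" parts might look like in Python for Hadoop Streaming follows. Streaming passes records to each program as lines on standard input and reads tab-separated key/value lines from standard output, and the reducer sees its input sorted by key. The word-count job and the function names mapper, reducer, and main are assumptions for illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Run the chosen step over stdin and write its output lines to stdout."""
    step = mapper if role == "map" else reducer
    for out in step(stdin):
        stdout.write(out + "\n")
```

In a real script, main would be called under an `if __name__ == "__main__":` guard with the role taken from the command line, and the resulting commands would be given to Hadoop Streaming through its -mapper and -reducer options. The same pipeline can be tried locally by sorting the mapper output before feeding it to the reducer.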

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.

Hadoop has changed the dynamics of large-scale computing with its four salient characteristics.

Hadoop enables a computing solution that is:

  • Cost effective 

Hadoop brings parallel computing to commodity servers. The result is a considerable decrease in the cost per terabyte of storage.

  • Scalable

New nodes can be added as needed, without changing the data formats, how data is loaded, how jobs are written, or the applications on top.

  • Flexible

Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses.

  • Fault tolerant

When you lose a node, the system redirects work to another location of the data and continues processing.
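The fault-tolerance behaviour can be sketched as a small Python simulation. It assumes that each block is stored on several nodes (HDFS keeps three copies of a block by default) and that the scheduler knows which nodes are alive. The names pick_replica and schedule, and the block and node identifiers, are hypothetical.

```python
def pick_replica(block_id, replicas, live_nodes):
    """Return the first live node holding block_id, or None if every copy is lost."""
    for node in replicas.get(block_id, []):
        if node in live_nodes:
            return node
    return None


def schedule(block_ids, replicas, live_nodes):
    """Map each block to a live node that holds it; fail if a block is unreachable."""
    plan = {}
    for block_id in block_ids:
        node = pick_replica(block_id, replicas, live_nodes)
        if node is None:
            raise RuntimeError(f"no live replica for {block_id}")
        plan[block_id] = node
    return plan


# Hypothetical cluster state: block id -> nodes holding a copy.
REPLICAS = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}
```

With all four nodes alive, schedule(["blk_1", "blk_2"], REPLICAS, {"node1", "node2", "node3", "node4"}) sends blk_1 to node1 and blk_2 to node2. If node1 is removed from the live set, the same call sends blk_1 to node2 instead, and processing continues without the lost node.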