Wednesday 16 November 2016

Apache Hadoop

Hadoop is a framework from the Apache Software Foundation for the storage and large-scale processing of data sets on clusters of commodity hardware. It is an open-source framework. Commodity hardware consists of readily available computer hardware that is inexpensive, easy to maintain and works on a plug-and-play basis with other hardware. The principle behind using commodity hardware is that it is better to gain computing power by pooling low-cost, low-power hardware in parallel than by relying on fewer high-cost, high-power machines. The advantages are the use of components based on open standards, easy switching between different hardware and linear scaling under increased load. All the modules of Hadoop are designed with the assumption of hardware failures, because commodity hardware is more prone to failure.

Apache Hadoop has its genesis in the Google File System paper (2003) and the MapReduce paper (2004). Around that time Doug Cutting was working on an open-source web crawler project named “Nutch”. To meet the parallel-processing needs of crawling and indexing, the project implemented a distributed file system and MapReduce. In 2006 Hadoop was born out of Nutch, and it was named after the yellow toy elephant belonging to Doug Cutting’s son.

Apache Hadoop Framework: Basic Modules
1.      Hadoop Distributed File System (HDFS)
2.      Hadoop MapReduce
3.      Hadoop Common
4.      Hadoop YARN

In HDFS, a data file is divided into blocks, and copies of these blocks are stored on other nodes in the Hadoop cluster. The default block size is 64 MB (128 MB in Hadoop 2). This redundancy offers high availability. A larger block size is recommended because it requires the NameNode to hold less metadata. By comparison, a typical on-disk file system uses a block size of 512 bytes, and relational database block sizes vary from 4 KB to 32 KB. Each data block is stored by default on three different servers; HDFS works behind the scenes to make sure the copies are spread across at least two racks, which increases reliability in the event you lose an entire rack of servers. All of Hadoop's data-placement logic is managed by a special server called the NameNode. The NameNode keeps a record of all the files in HDFS, such as where their blocks are kept, and it holds this metadata in memory for quick response.
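As a minimal sketch (not from the original post), the standard org.apache.hadoop.fs API can be used to ask the NameNode where the blocks of a file live; the path /user/demo/input.txt below is only a hypothetical example.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);         // the NameNode answers all metadata requests made through this handle
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));  // hypothetical file
        // Ask the NameNode for the location of every replica of every block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}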

Hadoop works on servers grouped into very large clusters. MapReduce tries to assign computation to the nodes where the data to be processed is stored; this is known as data locality. Because of this principle, SAN or NAS storage is not recommended for Hadoop: with SAN or NAS, the extra network communication overhead can cause performance bottlenecks. In Hadoop MapReduce we do not have to deal with the NameNode directly. When a Hadoop job is fired and the application has to read data and run the programmed MapReduce tasks, Hadoop contacts the NameNode, locates the disks that hold the parts of the data needed for the job, and then sends the application to run locally on those nodes. MapReduce consists of two distinct phases. The first is the map phase, which takes a set of data and converts it into individual elements broken down into rows (key/value pairs). The reduce phase takes the output from a map as its input and combines those data rows into a smaller set of rows.
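A minimal word-count sketch in the org.apache.hadoop.mapreduce API makes the two phases concrete; the class names TokenizerMapper and IntSumReducer are illustrative choices, not taken from the post above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word, producing a smaller set of (word, total) rows.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(word, total);
        }
    }
}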

In Hadoop, a MapReduce program is known as a job. A job is executed by breaking it down into pieces called tasks. An application submits a job to a specific node in a Hadoop cluster, which is running a program called the JobTracker. The JobTracker contacts the NameNode to find out where all of the data required for the job is stored across the cluster, breaks the job down into map and reduce tasks, and schedules those tasks on the nodes in the cluster where the data lives. Note that a node might be assigned a task whose required data is not local to that node; in such a case, the data is transferred to that node. On each node in the cluster, a continually running program referred to as a TaskTracker agent monitors the status of each task. If a task fails, the status is sent back to the JobTracker.
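A hypothetical driver for the word-count classes sketched above shows what job submission looks like from the application side; the input and output paths come from the command line. Under classic MapReduce the JobTracker receives this submission, under YARN the ResourceManager does.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);               // ships this jar to the nodes that run the tasks
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);    // optional local pre-aggregation on the map side
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
        // Blocks until every map and reduce task scheduled for this job has finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}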

Hadoop Common is a set of libraries and utilities used by the other Hadoop modules. YARN (Yet Another Resource Negotiator) is a resource-management platform responsible for managing the resources in the Hadoop cluster and allocating them to users' applications. The difference between classic MapReduce and YARN is that in classic MapReduce there is a single JobTracker, whereas in YARN each application gets its own ApplicationMaster, which manages the application by running its work in containers on different nodes.
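As a rough sketch, the properties below are what switch a client from the classic JobTracker model to YARN; in practice they normally live in mapred-site.xml and yarn-site.xml rather than in code, and the hostname "rm-host" is a placeholder.

import org.apache.hadoop.conf.Configuration;

public class YarnSettings {
    public static Configuration yarnConfiguration() {
        Configuration conf = new Configuration();
        // Run MapReduce jobs on YARN instead of the classic (JobTracker-based) framework.
        conf.set("mapreduce.framework.name", "yarn");
        // ResourceManager that grants containers to each application's ApplicationMaster
        // ("rm-host" is a placeholder hostname).
        conf.set("yarn.resourcemanager.hostname", "rm-host");
        return conf;
    }
}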


 Mr. Vijay Gupta
Assistant Professor
Dept. of Information Technology





