Hadoop is an open-source framework from the Apache Software Foundation for the storage and large-scale processing of data sets on clusters of commodity hardware. Commodity hardware is readily available computer hardware that is inexpensive, easy to maintain, and works on a plug-and-play basis with other hardware. The principle behind using commodity hardware is that it is better to gain computing power by pooling low-cost, low-power hardware in parallel than by relying on a few high-cost, high-power machines. The advantages are the use of components based on open standards, easy switching between different hardware, and linear scaling under increased load. All the modules of Hadoop are designed with the assumption of hardware failures, because commodity hardware is more prone to failure.
Apache Hadoop has its genesis in Google's papers on the Google File System (2003) and MapReduce (2004). Around that time, Doug Cutting was working on a web crawler project named "Nutch" at the Apache Foundation. To meet the parallel-processing needs of crawling and indexing, the project implemented a distributed file system and MapReduce. In 2006, Hadoop was born out of Nutch; it is named after the yellow toy elephant belonging to Doug Cutting's son.
Apache Hadoop Framework: Basic Modules
1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Hadoop Common
4. Hadoop YARN
In HDFS, a data file is divided into blocks, and copies of these blocks are created and stored on other nodes in the Hadoop cluster. The default block size is 64 MB (128 MB in Hadoop 2 and later). This redundancy offers high availability. A higher block size is recommended because it requires less metadata on the NameNode. By contrast, a typical on-disk file system uses a block size of 512 bytes, and RDBMS block sizes vary from 4 KB to 32 KB. Each data block is stored by default on three different servers; HDFS works behind the scenes to make sure at least one replica is stored on a different rack, so the data stays available even if you lose an entire rack of servers. All of Hadoop's data-placement logic is managed by a special server called the NameNode. The NameNode keeps a record of all the files in HDFS, such as where their blocks are kept, and holds this metadata in memory for quick response.
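To make this concrete, here is a minimal sketch using the HDFS Java API that asks the NameNode where the blocks of a file live. It assumes a configured Hadoop client (core-site.xml/hdfs-site.xml on the classpath); the class name and the file path passed on the command line are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: query the NameNode for the block locations of an HDFS file.
public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);            // e.g. a path like /user/data/input.txt
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}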
Hadoop runs on servers in very large clusters. MapReduce tries to assign computation to the nodes where the data to be processed is stored; this is known as data locality. Because of this principle, SAN or NAS storage is not recommended in a Hadoop cluster: with SAN or NAS, the extra network-communication overhead can cause performance bottlenecks. In Hadoop MapReduce, we don't have to deal with the NameNode directly. When a Hadoop job is fired and the application has to read data and start work on the programmed MapReduce tasks, Hadoop contacts the NameNode, locates the disks that hold the parts of the data needed to carry out the job, and then sends the application to run locally on those nodes. MapReduce consists of two distinct phases. The first is the map phase, which takes a set of data and converts it into individual elements broken down into rows (key/value pairs). The reduce phase takes the output from a map as input and combines those rows into a smaller set of rows.
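To make the map and reduce phases concrete, here is a minimal word-count sketch against the Hadoop MapReduce Java API. The class names TokenMapper and SumReducer are illustrative, not part of Hadoop: the mapper emits a (word, 1) row for every word it sees, and the reducer combines all rows for a word into a single (word, count) row.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: split each input line into words, emitting (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one key/value row per word
            }
        }
    }
}

// Reduce task: combine all rows for one word into a single (word, count) row.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}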
In Hadoop, a MapReduce program is known as a job. A job is executed by breaking it down into pieces called tasks. An application submits a job to a specific node in the Hadoop cluster, which runs a program called the JobTracker. The JobTracker contacts the NameNode to find out where all of the data required for the job is stored across the cluster, breaks the job down into map and reduce tasks, and schedules these tasks on the nodes in the cluster where the data exists. Note that a node might be assigned a task for which the required data is not local to that node; in such a case, the data is transferred to that node. In a Hadoop cluster, a set of continually running programs, referred to as TaskTracker agents, monitors the status of each task. If a task fails, the status is sent back to the JobTracker.
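Continuing the word-count sketch above, a minimal driver shows how an application hands a job to the cluster; the framework then performs the splitting into tasks and the data-local scheduling described here. The class name WordCountDriver is illustrative, and the input and output paths are assumed to come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: package the mapper and reducer above into a job and submit it.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion submits the job; the framework breaks it into
        // map and reduce tasks and schedules them near the data blocks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}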
Hadoop Common is a set of libraries and utilities used by the other Hadoop subprojects. YARN (Yet Another Resource Negotiator) is a resource-management platform responsible for managing the resources in the Hadoop cluster and allocating them to schedule user applications. The difference between classic MapReduce and YARN is that in classic MapReduce there is a single JobTracker, whereas in YARN there are multiple ApplicationMasters, one per application, each managing its application by assigning it to containers on different nodes.
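As a small illustration of YARN's role as resource manager, the following sketch uses the YarnClient API to list the applications the cluster is currently tracking, each of which has its own ApplicationMaster. The class name is hypothetical, and a reachable cluster configuration is assumed.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: list the applications YARN is currently managing.
public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        // Each application has its own ApplicationMaster, unlike classic
        // MapReduce, where a single JobTracker managed every job.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}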
Mr. Vijay Gupta
Assistant Professor
Dept. of Information Technology