Hadoop Ecosystem
Hadoop is an open source framework that supports the processing and storage of very large data sets in a distributed computing environment. Hadoop was created by Doug Cutting, and it got its name from a stuffed yellow toy elephant. The Hadoop Ecosystem consists of the following components:
1. Hadoop Distributed Filesystem (HDFS)
2. Hadoop MapReduce
3. Hadoop Common
4. Hadoop YARN
The Hadoop Distributed File System (HDFS) is an open source implementation of the Google File System (GFS) for storing data.
In HDFS, a data file is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. The default block size is 64 MB (128 MB in Hadoop 2.x and later). This redundancy offers high availability. A larger block size is recommended because it reduces the amount of metadata the NameNode has to keep. Each data block is stored by default on three different servers; HDFS works behind the scenes to make sure the replicas are spread across more than one server rack, so the data stays available even if you lose an entire rack of servers. All of HDFS's data placement logic is managed by a special server called the NameNode. The NameNode keeps track of all the data files in HDFS, including where each block is stored, and it holds this metadata in memory for quick response.
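As an illustrative sketch (assuming the Hadoop client libraries are on the classpath and an HDFS cluster is reachable; the path /data/example.txt is hypothetical), the Java FileSystem API below asks the NameNode which hosts hold the blocks of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);           // client handle backed by the NameNode
            Path file = new Path("/data/example.txt");      // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // This is a metadata query answered by the NameNode: which DataNodes hold each block?
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }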
Hadoop works on servers grouped into very large clusters. MapReduce tries to assign workloads to the servers where the data to be processed is stored; this is known as data locality. When a Hadoop job is launched and the application needs to read data and run its programmed MapReduce tasks, Hadoop contacts the NameNode, finds the servers that hold the parts of the data needed for the job, and then sends the application to run locally on those nodes. MapReduce consists of two distinct jobs. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those tuples into a smaller set of tuples.
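The canonical word-count job shows both phases. The sketch below is a minimal version of that standard example (input and output HDFS directories are passed on the command line): the mapper emits a (word, 1) tuple for every word, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: break each input line into (word, 1) tuples.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word into a smaller set of tuples.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Using the reducer as a combiner pre-aggregates counts on the map side, which cuts down the data shuffled across the network.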
Hadoop Common is the base/core of the Hadoop framework: it provides essential services and basic processes such as abstraction of the underlying operating system and its file system. Hadoop Common also contains the Java Archive (JAR) files and scripts required to start Hadoop. The Hadoop Common package also provides source code and documentation, as well as a contribution section that includes different projects from the Hadoop community.
YARN (Yet Another Resource Negotiator) is a cluster management solution. YARN was introduced in Hadoop 2.0 and works as a resource manager responsible for assigning resources to applications. YARN sits above the HDFS layer, and the other Hadoop systems such as Pig, Hive, HBase, and Spark use its services to access HDFS and the underlying hardware. YARN improves Hadoop in the following ways: multi-tenancy, cluster utilization, scalability, and compatibility.
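As a hedged sketch of the YARN client API (assuming the hadoop-yarn-client library is on the classpath and a ResourceManager is reachable through yarn-site.xml), the code below asks YARN for a report of the cluster's running nodes and their resources:

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterNodes {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
            yarn.start();
            // Ask the ResourceManager which nodes are currently running and what resources they offer.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + " capability=" + node.getCapability());
            }
            yarn.stop();
        }
    }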
Some of the Hadoop-related projects are:
Pig: Pig was initially developed at Yahoo! so that developers could focus more on analyzing large data sets and spend less time writing MapReduce programs. Pig can handle any data type. It has two components: the first is a language known as Pig Latin, and the second is the runtime environment where Pig Latin programs are executed. Think of it as the relationship between Java and the JVM.
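As an illustrative sketch only (assuming the pig dependency is available and a hypothetical local file input.txt with one word per line), Pig Latin statements can be executed from Java through the PigServer runtime:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
        public static void main(String[] args) throws Exception {
            // Run the Pig Latin runtime locally; ExecType.MAPREDUCE would target the cluster instead.
            PigServer pig = new PigServer(ExecType.LOCAL);
            // Pig Latin statements: load lines, group identical words, count each group.
            pig.registerQuery("lines = LOAD 'input.txt' AS (word:chararray);");
            pig.registerQuery("grouped = GROUP lines BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(lines);");
            pig.store("counts", "word_counts");   // triggers execution and writes the result
            pig.shutdown();
        }
    }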
Hive: Pig can be powerful and simple, but the downside is that it is a new language to learn. Hive was created at Facebook and is based on SQL; its language is called HQL (Hive Query Language). HQL can be used from the Hive shell, JDBC, ODBC, or the Hive Thrift client. The Hive Thrift client is much like any database client installed on a user's machine: it communicates with the Hive services running on the server. Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.
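Since JDBC is one of the listed access paths, here is a minimal sketch (assuming the hive-jdbc driver is on the classpath; the HiveServer2 address and the page_views table are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; the endpoint below is a made-up example.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://hive-server:10000/default", "user", "");
                 Statement stmt = con.createStatement();
                 // HQL looks like SQL; this assumes a hypothetical 'page_views' table.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT country, COUNT(*) FROM page_views GROUP BY country")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }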
Flume: Flume is an Apache project. It allows you to flow data from a source into your Hadoop environment.
ZooKeeper: An Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster.
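A minimal sketch of the ZooKeeper Java client (the ensemble address is hypothetical, and the /workers parent znode is assumed to exist): it registers an ephemeral znode, the kind of primitive used for leader election and service discovery across a cluster.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a hypothetical ensemble; the watcher just logs session events.
            // A real client would wait for the SyncConnected event before issuing requests.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000,
                    event -> System.out.println("ZooKeeper event: " + event.getState()));
            // An ephemeral sequential znode disappears automatically if this client dies,
            // which is the building block for leader election and worker registration.
            String path = zk.create("/workers/worker-", "payload".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("Registered at " + path);
            zk.close();
        }
    }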
HBase: A column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many Big Data use cases. HBase does not support SQL, and HBase applications are written in Java.
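Instead of SQL, an HBase application talks to the cluster through the Java client API. The sketch below assumes a hypothetical table users with a column family info already exists; it writes one cell and reads it back:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write: row key "user1", column family "info", qualifier "email".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                        Bytes.toBytes("user1@example.com"));
                table.put(put);
                // Read the same cell back.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }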
Avro: An Apache project that provides data serialization services, used to serialize data structures within the context of Hadoop MapReduce jobs.
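A minimal serialization sketch (assuming the avro library; the User schema and its fields are made up for illustration) showing a record being encoded into Avro's compact binary form:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema describing one record type.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                  + "{\"name\":\"name\",\"type\":\"string\"},"
                  + "{\"name\":\"age\",\"type\":\"int\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);
            // Serialize the record to compact Avro binary, as MapReduce jobs do for their I/O.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
            encoder.flush();
            System.out.println("Serialized " + out.size() + " bytes");
        }
    }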
Sqoop: A tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into HDFS, and to export data from the Hadoop file system back to relational databases.
Spark: Apache Spark is a lightning-fast cluster computing framework designed for fast computation. It can run on top of Hadoop and extends the MapReduce model; it is a general processing engine for streaming, SQL, machine learning, and graph processing.
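A minimal Java sketch (assuming Spark 2.x with the spark-core dependency; the HDFS input path is hypothetical) showing how the word count written earlier as a MapReduce job becomes a few chained transformations in Spark:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");  // hypothetical path
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);
                counts.take(10).forEach(t -> System.out.println(t._1() + "\t" + t._2()));
            }
        }
    }

Running with master local[*] keeps the example self-contained; on a cluster, the same job would typically be submitted to YARN.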