Sunday 5 November 2017

Hadoop Ecosystem

Hadoop is an open-source framework that supports the processing and storage of very large data sets in a distributed computing environment. Hadoop was created by Doug Cutting, and it got its name from a stuffed yellow toy elephant belonging to Cutting's son. The Hadoop ecosystem consists of the following components:

1. Hadoop Distributed Filesystem (HDFS)
2. Hadoop MapReduce
3. Hadoop Common
4. Hadoop YARN

The Hadoop Distributed File System (HDFS) is an open-source implementation of the Google File System (GFS) for storing data. In HDFS, a data file is divided into blocks, and copies of these blocks are stored on different servers in the Hadoop cluster. The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2. This redundancy offers high availability, and a larger block size is recommended because it means less metadata for the NameNode to keep. Each data block is stored by default on three different servers; HDFS works behind the scenes to place replicas on more than one server rack, so the data stays available even if an entire rack of servers is lost. All of Hadoop's data placement logic is managed by a special server called the NameNode, which keeps track of all the data files in HDFS, including where their blocks are stored. The NameNode holds this metadata in memory for quick response.
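
To make the NameNode's role concrete, here is a minimal Java sketch that asks HDFS where the blocks of a file live; the path /data/sample.txt is a hypothetical example, and the code assumes the cluster configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this metadata query; the DataNodes hold the actual blocks.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}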

Hadoop runs on servers grouped into very large clusters. MapReduce tries to assign workloads to the servers where the data to be processed is already stored; this is known as data locality. When a Hadoop job is fired and the application has to read data and run its programmed MapReduce tasks, Hadoop contacts the NameNode, finds the servers that hold the parts of the data needed for the job, and sends the application code to run locally on those nodes. MapReduce consists of two distinct tasks. The first is the map task, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task takes the output of the map task as input and combines those tuples into a smaller set of tuples.
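
The classic word count illustrates both tasks. The sketch below follows the standard Hadoop Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts per word; the input and output paths are supplied as hypothetical command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);           // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hypothetical output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}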

Hadoop Common is the base/core of the Hadoop framework: it provides essential services and basic processes such as abstraction of the underlying operating system and its file system. Hadoop Common also contains the Java Archive (JAR) files and scripts required to start Hadoop, along with source code, documentation, and a contribution section that includes different projects from the Hadoop community.

YARN (Yet Another Resource Negotiator) is a cluster management solution. It was introduced in Hadoop 2.0 and works as a resource manager whose duty is to assign resources to operations. YARN sits above the HDFS layer, and the other Hadoop systems such as Pig, Hive, HBase and Spark use its services to access HDFS and the underlying hardware. YARN improves Hadoop in the following ways: multi-tenancy, cluster utilization, scalability and compatibility.
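
As a small illustration of how a client talks to the resource manager, the sketch below uses the YarnClient Java API to list the applications the ResourceManager knows about; it assumes it runs on a machine whose yarn-site.xml points at the cluster:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // The ResourceManager tracks every application and its resource usage.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}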

Some of the Hadoop-related projects are:

Pig: Pig was initially developed at Yahoo to let analysts focus on analyzing large data sets and spend less time writing MapReduce programs. Pig can handle any data type. It has two components: the first is the language, known as Pig Latin, and the second is the runtime environment where Pig Latin programs are executed (think of the relationship between Java and the JVM).
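
For a rough idea of how the two components fit together, the sketch below drives a small Pig Latin word-count script from Java through the PigServer API; the input.txt and output paths are hypothetical, and MAPREDUCE mode assumes a working Hadoop cluster:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // The runtime environment that compiles Pig Latin into MapReduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin statements registered one by one (hypothetical input path).
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        pig.store("counts", "output");   // triggers execution and writes the result
        pig.shutdown();
    }
}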
                                         
HIVE: Pig can be powerful and simple, but the downside is that it is something new to learn. Hive, created at Facebook, is instead based on SQL, and its language is called HQL (Hive Query Language). We can use HQL from the Hive shell, JDBC, ODBC or the Hive Thrift client. The Hive Thrift client is much like any database client that gets installed on a user's machine: it communicates with the Hive services running on the server. Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.
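
As an example of the JDBC route, the sketch below runs one HQL query through the Hive JDBC driver; the host and port (localhost:10000) and the employees table are hypothetical and assume a running HiveServer2:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");   // hypothetical HiveServer2

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");   // hypothetical table
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}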

FLUME: Flume is an Apache project. It allows you to flow data from a source into your Hadoop environment.

ZOOKEEPER: It is an Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster.
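
A minimal sketch of the ZooKeeper Java client follows; it assumes a ZooKeeper server on localhost:2181 and creates one znode at the hypothetical path /app-config, which other nodes in the cluster could read or watch:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();   // session established
            }
        });
        connected.await();

        // Create a znode holding a small piece of shared configuration.
        String path = zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println("Created znode: " + path);
        zk.close();
    }
}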

HBase: It is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many Big Data use cases. HBase does not support SQL, and HBase applications are written in Java.
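
Since HBase applications are written in Java, the sketch below writes and reads one cell with the HBase client API; the users table and the info column family are hypothetical and assumed to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Columns live inside column families; rows are addressed by a row key.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}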

Avro: It is an Apache project that provides data serialization services, so that data structures can be exchanged and stored within the context of Hadoop MapReduce jobs.
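
A minimal sketch of Avro's generic Java API is shown below; the User schema and the users.avro file are hypothetical, and the point is that the schema travels with the serialized records:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical schema describing one record type.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);   // the schema is written alongside the data
            writer.append(user);
        }

        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " / " + record.get("age"));
            }
        }
    }
}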

Sqoop: Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL or Oracle into HDFS, and to export data from the Hadoop file system back to relational databases.

Spark: Apache Spark is a lightning-fast cluster computing framework designed for fast computation. It builds on the MapReduce model and extends it to cover more kinds of workloads, acting as a general processing engine for streaming, SQL, machine learning and graph processing.
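
For comparison with the MapReduce word count earlier, here is the same computation sketched with Spark's Java API (Spark 2.x); the input.txt and output paths are hypothetical:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs on the local machine for testing; a real cluster would use YARN.
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt");          // hypothetical input
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))           // (word, 1)
                .reduceByKey(Integer::sum);                         // sum per word

        counts.saveAsTextFile("output");                            // hypothetical output
        sc.close();
    }
}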



Mr. Vijay Gupta
Assistant Professor
Dept of Information Technology



