What Is Hadoop
Hadoop is a free, open-source, Java-based framework for writing and running distributed applications that process large data sets across clusters of computers.
Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both academia and industry.
History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: The Java libraries and utilities required by the other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
A small Hadoop cluster includes a single master node and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only and compute-only worker nodes; these are normally used only in nonstandard applications.
Hadoop data types
Hadoop comes with a number of predefined classes that implement WritableComparable, including wrapper classes for all the basic data types. Frequently used types for the key/value pairs include BooleanWritable, ByteWritable, IntWritable, LongWritable, FloatWritable, DoubleWritable, Text, and NullWritable; these classes all implement the WritableComparable interface, as illustrated in the sketch below.
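As a quick illustration (a sketch added here, not from the original text), the snippet below exercises a few of these wrapper classes; the class name WritableTypesDemo and the sample values are made up for the example.

// A minimal sketch exercising a few of Hadoop's predefined WritableComparable
// wrapper classes. Other frequently used types include BooleanWritable,
// ByteWritable, FloatWritable, and LongWritable.
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableTypesDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(42);         // wraps an int
        Text word = new Text("hadoop");                  // wraps a UTF-8 string
        DoubleWritable ratio = new DoubleWritable(0.75); // wraps a double

        // WritableComparable types can be compared; the framework relies on this
        // when it sorts map output keys before handing them to the reducers.
        IntWritable other = new IntWritable(7);
        System.out.println(count.compareTo(other) > 0);  // true: 42 > 7

        System.out.println(word + " appears " + count.get() + " times");
        System.out.println(ratio.get());

        // NullWritable is a singleton placeholder used when a key or value carries no data.
        System.out.println(NullWritable.get());
    }
}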
Advantages of Hadoop
· Distributes data and computation. Keeping the computation local to the data avoids network overload.
· Linear scaling in the ideal case; it is designed to run on cheap, commodity hardware.
· Simple programming model: the end-user programmer only writes map-reduce tasks.
· HDFS stores large amounts of information.
· HDFS has a simple and robust coherency model.
· HDFS is scalable and provides fast access to this information; it is also possible to serve a large number of clients by simply adding more machines to the cluster.
· HDFS provides streaming read performance.
Disadvantages of Hadoop
· Rough around the edges: MapReduce and HDFS are still rough in places because the software is under active development.
· The programming model is very restrictive: the lack of central, shared data can be limiting.
· Joins of multiple datasets are tricky and slow: there are no indices, and often the entire dataset gets copied in the process.
· Cluster management is hard: operations such as debugging, distributing software, and collecting logs are difficult across a cluster.
· There is still a single master, which requires care and may limit scaling.
· Managing job flow isn't trivial when intermediate data should be kept.
Summary
Hadoop is a software framework that demands a different perspective on data processing. It has its own filesystem, HDFS, which stores data in a way optimized for data-intensive processing. You need specialized Hadoop tools to work with HDFS, but fortunately most of those tools follow familiar Unix or Java syntax.
The data processing part of the Hadoop framework is better known as MapReduce. Although the highlight of a MapReduce program is, not surprisingly, the Map and the Reduce operations, other operations done by the framework, such as data splitting and shuffling, are crucial to how it works. You can customize other operations, such as partitioning and combining. Hadoop also provides options for reading input data and writing output data in different formats.
Developer(s): Apache Software Foundation
Initial release: December 10, 2011
Stable release: 2.7.1 / July 6, 2015
Written in: Java
Operating system: Cross-platform
What is HDFS
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are spread across multiple machines and stored in a redundant fashion to protect the system from possible data loss in the event of failure. HDFS also makes applications available for parallel processing. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
Features of HDFS
· It is suitable for distributed storage and processing.
· Hadoop provides a command interface to interact with HDFS (a programmatic sketch of the equivalent Java API follows this list).
· The built-in servers of the namenode and datanode help users easily check the status of the cluster.
· Streaming access to file system data.
· HDFS provides file permissions and authentication.
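As a rough sketch of what that interface looks like from Java code, the example below uses the standard FileSystem API to list a directory and print permissions, owner, and replication, roughly what the command shell shows. The path /user/demo and the class name are hypothetical, and the snippet assumes a reachable cluster configured via core-site.xml on the classpath.

// Hedged sketch of programmatic HDFS access via the Java FileSystem API.
// The directory listed here is a hypothetical example path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List a (hypothetical) directory and print each entry's permissions,
        // owner, group, replication factor, and path.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s %s %s rep=%d %s%n",
                    status.getPermission(),
                    status.getOwner(),
                    status.getGroup(),
                    status.getReplication(),
                    status.getPath());
        }
        fs.close();
    }
}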
HDFS Architecture
The architecture of the Hadoop Distributed File System is described below.
Namenode
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.
The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes.
The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.
A cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.
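The per-file block size and replication factor mentioned above can be set when a file is created. The sketch below is only an illustration under that assumption; the path, buffer size, and values are hypothetical.

// Illustrative sketch: creating a file with a per-file block size and
// replication factor, which the text notes are user selectable file-by-file.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/events.log");  // hypothetical path
        boolean overwrite = true;
        int bufferSize = 4096;
        short replication = 2;              // replicate each block on 2 DataNodes
        long blockSize = 64L * 1024 * 1024; // 64 MB blocks for this file only

        // The NameNode records the file's blocks and where their replicas live;
        // the actual bytes are written to the chosen DataNodes.
        try (FSDataOutputStream out =
                 fs.create(file, overwrite, bufferSize, replication, blockSize)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}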
DataNodes
A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.
Datanodes perform read-write operations on the file system, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
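To make the read path concrete, here is a small, hedged sketch of a client read through the FileSystem API: the NameNode supplies block locations while the bytes themselves stream from DataNodes, a detail the API hides. The file path is a hypothetical example.

// Hedged sketch of a client read from HDFS via the FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");  // hypothetical path

        // Open the file and print it line by line; block locations are resolved
        // behind the scenes and the data is streamed from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}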
Goals of HDFS
- Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick, automatic fault detection and recovery.
- Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications with huge datasets.
- Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
What is MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
According to the Apache Software Foundation, the primary objective of MapReduce is to split the input data set into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.
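To make the two phases concrete, the classic word-count example is sketched below using the org.apache.hadoop.mapreduce API; the class and field names are illustrative additions, not part of the original text. The map phase emits (word, 1) pairs and the reduce phase sums the counts for each word.

// Word-count sketch: map emits (word, 1); reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}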
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
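The sense in which scaling is "merely a configuration change" shows up in the job driver: the mapper and reducer code stays the same while parallelism is configuration. The driver below is a hedged sketch that assumes the WordCount classes sketched earlier; the input and output paths come from the command line, and the reduce-task count is an arbitrary example value.

// Hedged driver sketch wiring the WordCount mapper and reducer into a job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(4); // tuning parallelism is configuration, not new code

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}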
Terminology
- PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
- Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
- NameNode - Node that manages the Hadoop Distributed File System (HDFS).
- DataNode - Node where data is presented in advance before any processing takes place.
- MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
- SlaveNode - Node where the Map and Reduce programs run.
- JobTracker - Schedules jobs and tracks the assigned jobs with the TaskTracker.
- TaskTracker - Tracks the tasks and reports status to the JobTracker.
- Job - An execution of a Mapper and Reducer across a dataset.
- Task - An execution of a Mapper or a Reducer on a slice of data.
- Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Why Hadoop Is Important In Handling Big Data
Hadoop is changing the perception of handling Big Data, especially unstructured data. The Apache Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to provide high availability, the library itself is built to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Activities performed on Big Data:
Store – Big data needs to be collected in a seamless repository, and it does not necessarily have to be stored in a single physical database.
Process – The process becomes more tedious than with traditional data in terms of cleansing, enriching, calculating, transforming, and running algorithms.
Access – The data has little business value if it cannot be searched and retrieved easily and presented along the business lines.