Saturday, 21 November 2015

Basic Learning of Hadoop

What Is Hadoop


Hadoop is a free, open-source, Java-based programming framework for writing and running distributed applications that process large data sets in a distributed computing environment.

Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:

■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).

■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

■ Simple—Hadoop allows users to quickly write efficient parallel code.





Hadoop’s accessibility and simplicity give it an edge when it comes to writing and running large distributed programs, while its robustness and scalability make it suitable for even the most demanding jobs at Yahoo! and Facebook. These features make Hadoop popular in both academia and industry.

History :

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.


Hadoop Architecture :


The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

The following diagram depicts these four components of the Hadoop framework.




A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally used only in nonstandard applications.


Hadoop Data Types :


Hadoop comes with a number of predefined classes that implement WritableComparable, including wrapper classes for all the basic data types, as shown in the table below.

BooleanWritable     Wrapper for a standard Boolean variable
ByteWritable        Wrapper for a single byte
DoubleWritable      Wrapper for a Double
FloatWritable       Wrapper for a Float
IntWritable         Wrapper for an Integer
LongWritable        Wrapper for a Long
Text                Wrapper to store text in UTF-8 format
NullWritable        Placeholder when the key or value is not needed

These are the frequently used types for the key/value pairs; all of these classes implement the WritableComparable interface.
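
To show how these wrapper types are used in practice, here is a minimal, self-contained sketch (the class name WritableDemo and the sample values are made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrap plain Java values in Hadoop's Writable wrapper types
        IntWritable count = new IntWritable(42);
        Text word = new Text("hadoop");

        // Because these types are WritableComparable, the framework can
        // sort them as keys between the map and reduce phases
        int cmp = word.compareTo(new Text("hive"));   // negative: "hadoop" sorts before "hive"

        System.out.println(word + " -> " + count.get() + ", compareTo(\"hive\") = " + cmp);
    }
}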


Advantages and Disadvantages

Advantages of Hadoop

·         Distributes data and computation. Keeping computation local to the data avoids network overload.
·         Linear scaling in the ideal case. It is designed to run on cheap, commodity hardware.
·         Simple programming model. The end-user programmer only writes MapReduce tasks.
·         HDFS stores large amounts of information.
·         HDFS has a simple and robust coherency model.
·         HDFS is scalable and provides fast access to this information; it can also serve a large number of clients simply by adding more machines to the cluster.

·         HDFS provides streaming read performance.

Disadvantages  of Hadoop

·         Rough edges:- Hadoop MapReduce and HDFS are still rough around the edges, because the software is under active development.
·         Programming model is very restrictive:- The lack of centrally shared data can be limiting.
·         Joins of multiple datasets are tricky and slow:- There are no indices, and often the entire dataset is copied in the process.
·         Cluster management is hard:- Operations such as debugging, distributing software, and collecting logs across the cluster are difficult.
·         There is still a single master, which requires care and may limit scaling.

·         Managing job flow is not trivial when intermediate data must be kept.


Summary :


Hadoop is a software framework that demands a different perspective on data processing.
It has its own filesystem, HDFS, that stores data in a way optimized for data-intensive
processing. You need specialized Hadoop tools to work with HDFS, but fortunately
most of those tools follow familiar Unix or Java syntax.

The data processing part of the Hadoop framework is better known as MapReduce.
Although the highlight of a MapReduce program is, not surprisingly, the Map and the
Reduce operations, other operations done by the framework, such as data splitting
and shuffling, are crucial to how the framework works. You can customize the other
operations, such as Partitioning and Combining. Hadoop also provides options for reading
data and for writing output in different formats.
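
As a small sketch of that kind of customization (the class name and key/value types here are assumptions chosen for the example), a custom Partitioner that routes keys to reducers by their first character could look like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: partitions keys by their first character rather than
// by the default hash of the whole key.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;                                   // empty keys all go to reducer 0
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(FirstCharPartitioner.class); a Combiner is set in the same way with job.setCombinerClass(...).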


Developer(s)                Apache Software Foundation
Initial release               December 10, 2011
Stable release               2.7.1 / July 6, 2015
Written in                     Java
Operating system        Cross-platform



What is HDFS



The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster has nominally a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality.

HDFS holds a very large amount of data and provides easy access to it. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
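
To make this concrete, here is a minimal sketch of reading one of those files through Hadoop's Java FileSystem API (the path /user/demo/input/sample.txt is hypothetical, and the cluster address is assumed to come from core-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the namenode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration
        Path file = new Path("/user/demo/input/sample.txt");

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}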


Features of HDFS


·        It is suitable for distributed storage and processing.

·        Hadoop provides a command interface to interact with HDFS.

·        The built-in servers of the namenode and datanode help users easily check the status of the cluster.

·        Streaming access to file system data.

·        HDFS provides file permissions and authentication.

                                                                                                                                                       


HDFS Architecture

Given below is the architecture of a Hadoop File System.






Namenode :

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. 

Inodes record attributes like permissions, modification and access times, namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes.

The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.

A cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.
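
Because the block size and replication factor are selectable file by file, a client can request them explicitly when it creates a file. The following is a rough sketch (the output path and the chosen values are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical output path, 3 replicas, 128 MB blocks
        Path out = new Path("/user/demo/output/data.bin");
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        int bufferSize = 4096;

        try (FSDataOutputStream stream =
                 fs.create(out, true, bufferSize, replication, blockSize)) {
            stream.writeUTF("hello hdfs");
        }
    }
}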

                

DataNodes :

A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.

DataNodes perform read-write operations on the file system, as per client requests.
They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.

                                                               

Goals of HDFS


  • Fault detection and recovery : Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets : HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.
  • Hardware at data : A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.


What is MapReduce



Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.

According to the Apache Software Foundation, the primary objective of Map/Reduce is to split the input data set into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce task then takes the output of a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
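
The canonical word-count job illustrates these primitives. The sketch below follows the org.apache.hadoop.mapreduce API; the class names are illustrative. The mapper emits a (word, 1) pair for every token, and the reducer sums the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: break each input line into words and emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}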


Terminology


  • PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
  • Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
  • NameNode - Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode - Node where data is presented in advance before any processing takes place.
  • MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
  • SlaveNode - Node where the Map and Reduce programs run.
  • JobTracker - Schedules jobs and tracks the assigned jobs on the TaskTrackers.
  • TaskTracker - Tracks tasks and reports status to the JobTracker.
  • Job - An execution of a Mapper and a Reducer across a dataset.
  • Task - An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
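
To tie the terminology together, a job is normally submitted by a small driver program. This sketch reuses the hypothetical WordCount classes from the earlier example, with made-up input and output paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths used only for illustration
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}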




Why Hadoop Is Important In Handling Big Data



Hadoop is changing the perception of handling Big Data, especially unstructured data. Let’s see how the Apache Hadoop software library, which is a framework, plays a vital role in handling Big Data. Apache Hadoop enables massive amounts of data to be processed across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.



Activities performed on Big Data:

Store  –  Big data needs to be collected in a seamless repository, and it is not necessary to store it in a single physical database.

Process  –  The process becomes more tedious than the traditional one in terms of cleansing, enriching, calculating, transforming, and running algorithms.

Access  –  The data has little business value if it cannot be searched and retrieved easily and virtually showcased along the business lines.







