Sunday, 20 December 2015

Let's Start With Hive - Hadoop

Java, Pig and R are all programming languages, but what if you are not comfortable with those regular programming languages and only know SQL? There is still a way out for you in Hadoop: it is called Hive. It is familiar SQL in a different package, called HiveQL (Hive Query Language).
As an elementary task in Hive, we are going to do the same kind of data processing task as we did with Pig.

Steps we will follow:
  1. We have several files of baseball statistics that we are going to upload into Hive.
  2. Do some simple computing with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
The Batting.csv file has all the statistics from 1871–2011 and contains more than 90,000 rows.
Input file path:
Step 1 –  Load input file:
We need to unzip the dataset into a directory. We will upload just the Master.csv and Batting.csv files from the dataset through the File Browser.
In Hue there is a button called “Hive”, and inside it there are options such as “Query Editor”, “My Queries” and “Tables”.
On the left is the Query Editor. A query may span multiple lines, and there are buttons to execute the query, explain it, save it under a name, and open a new window for another query.
Pig is a scripting language, so all data objects are operated on within the script; once the script completes, all data objects are deleted unless you stored them.
In the case of Hive we are operating directly on the Apache Hadoop data store: any query you run, any table you create, and any data you copy persists from query to query.

Step 2 – Create empty table and load data in Hive
In “Tables” we need to select “Create a new table from a file”, which leads us to the File Browser, where we select the Batting.csv file and name the new table “temp_batting”.
Alternatively, we can open the Query Editor and run a CREATE TABLE statement:
Create table temp_batting (col_value STRING);

Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed through the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file (Batting.csv) into its warehouse directory, so it will no longer be seen in the File Browser.
hive6
Now that we have loaded the data, we should verify it. To do so we execute the following query, which shows the first 100 rows of the table.

SELECT * from temp_batting LIMIT 100;

The result should show 100 rows, each a raw comma-separated line from Batting.csv stored in the single col_value column.

Step 3 – Create a batting table and transfer data from the temporary table to the batting table
Now we will extract the contents of temp_batting into a new table called ‘batting’ which should contain the following columns:
a)  player_id
b)  year
c)  runs
The next objective is to create the ‘batting’ table and populate it from ‘temp_batting’ (player_id, year and runs) using regular expressions. Each regexp_extract call below pulls one comma-separated field out of the raw CSV line: the first field is the player id, the second is the year, and the ninth is the runs.
create table batting (player_id STRING, year INT, runs INT);

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
from temp_batting;

Step 4 – Create a query to show the highest score per year
Next is a simple GROUP BY query on ‘batting’ by year, so that we get the highest run total for each year.
    SELECT year, max(runs) FROM batting GROUP BY year;
Executing this query returns one row per year with the maximum runs scored in that year.
Step 5 – Get final result (who scored the maximum runs, year-wise)
Now that the year-wise maximum runs are ready, we execute the final query, which joins back to the batting table to find the player who scored the maximum runs in each year.
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;

The result of this query lists, for each year, the player who scored the maximum runs that year.

Sunday, 6 December 2015

Thinking like a Pig

                                

Introduction



Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability. Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig.
Twitter is another well-known user of Pig.

Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin.

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages.

Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System.


Pig has two major components:



  1.  A high-level data processing language called Pig Latin .
  2.  A compiler that compiles and runs your Pig Latin script in a choice of evaluation mechanisms.

The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.

Pig simplifies programming because of the ease of expressing your code in Pig Latin.


Thinking like a Pig



Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More unique and crucial to understanding Pig are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs) as first-class citizens.


Data types




We can summarize Pig’s philosophy toward data types in its slogan of “Pigs eat anything.”
Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn't require metadata or schema on data, but it can take advantage of them if they’re provided.

Pig can operate on data that is relational, nested, semistructured, or unstructured.
To support this diversity of data, Pig provides complex data types, such as bags and tuples, that can be nested to form fairly sophisticated data structures.




  • Pig Latin Data types




Pig has six simple atomic types (int, long, float, double, chararray and bytearray) and three complex types.

The three complex types are tuple, bag, and map.
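To make the complex types concrete, here is a minimal Java sketch using Pig's org.apache.pig.data classes, which are the same classes a UDF receives; the field values are made-up examples.

import java.util.HashMap;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigTypesDemo {
    public static void main(String[] args) throws Exception {
        // A tuple is an ordered set of fields, e.g. (playerX, 1995, 102); example values only.
        Tuple t = TupleFactory.getInstance().newTuple(3);
        t.set(0, "playerX");   // chararray
        t.set(1, 1995);        // int
        t.set(2, 102);         // int

        // A bag is an unordered collection of tuples.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(t);

        // A map is a set of key/value pairs with chararray keys.
        Map<String, Object> m = new HashMap<>();
        m.put("year", 1995);

        System.out.println(t + " | bag size = " + bag.size() + " | " + m);
    }
}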




User-defined functions



Pig was designed with many applications in mind—processing log data, natural language processing, analyzing network graphs, and so forth. It’s expected that many of the computations will require custom processing.
Knowing how to write UDFs is a big part of learning to use Pig.
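As a rough sketch of what an eval UDF looks like in Java, the class below extends org.apache.pig.EvalFunc and upper-cases its single chararray argument. The class name UpperCase is an example for illustration, not a built-in.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Example eval UDF: upper-cases its first (chararray) argument.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null when there is nothing to work on; Pig records a null result for that row.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

Once packaged into a jar, such a function would be called from Pig Latin roughly as REGISTER myudf.jar; followed by B = FOREACH A GENERATE UpperCase(name); where the jar name, aliases and field name are placeholders.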



Basic Idea of Running Pig



We can run Pig Latin commands in three ways—via the Grunt interactive shell, through a script file, and as embedded queries inside Java programs. Each way can work in one of two modes—local mode and Hadoop mode. (Hadoop mode is sometimes called MapReduce mode in the Pig documentation.)

You can think of Pig programs as similar to SQL queries, and Pig provides a PigServer class that allows any Java program to execute Pig queries.
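For example, here is a small sketch of embedding Pig in Java via PigServer, assuming local mode; the input path, aliases and output directory are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; ExecType.MAPREDUCE would target a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register Pig Latin statements; nothing runs until a STORE (or DUMP) is requested.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("upper_lines = FOREACH lines GENERATE UPPER(line);");

        // Compiles the data flow and executes it, writing results to the given directory.
        pig.store("upper_lines", "output_dir");
    }
}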

Running Pig in Hadoop mode means the compiled Pig program will physically execute in a Hadoop installation. Typically the Hadoop installation is a fully distributed cluster.

The execution mode is specified to the pig command via the -x or -exectype option. 
You can enter the Grunt shell in local mode through:


pig -x local

You enter the Grunt shell in Hadoop mode through:

pig -x mapreduce

or use the pig command without arguments, as it chooses the Hadoop mode by default.



Expressions and functions



You can apply expressions and functions to data fields to compute various values, for example arithmetic on numeric fields or built-in functions such as UPPER and COUNT inside a FOREACH ... GENERATE statement.





Summary



Pig is a higher-level data processing layer on top of Hadoop. Its Pig Latin language provides programmers a more intuitive way to specify data flows. It supports schemas in processing structured data, yet it is flexible enough to work with unstructured text or semistructured XML data. It is extensible through the use of UDFs.
It vastly simplifies data joining and job chaining—two aspects of MapReduce programming that many developers find overly complicated. A task such as computing patent co-citations, which would be a complex MapReduce program, can be written in a dozen lines of Pig Latin.












Saturday, 21 November 2015

Basic Learning of Hadoop

What Is Hadoop


Hadoop is a free, open source, Java-based programming framework for writing and running distributed applications that process large data sets in a distributed computing environment.

 Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is

■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.





Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both academia and industry.

History :

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.


Hadoop Architecture :


Hadoop framework includes following four modules:

Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

These four modules together make up the core of the Hadoop framework.




A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications.


Hadoop data type :


Hadoop comes with a number of predefined classes that implement WritableComparable, including wrapper classes for all the basic data types, as listed below.






Frequently used types for the key/value pairs include BooleanWritable, ByteWritable, IntWritable, LongWritable, FloatWritable, DoubleWritable, Text (UTF-8 text) and NullWritable (a placeholder when a key or value is not needed). These classes all implement the WritableComparable interface.
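As a quick illustration, here is a small Java sketch using two of these wrapper classes; the values are arbitrary.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrapper types box ordinary Java values so Hadoop can serialize them between nodes.
        IntWritable year = new IntWritable(2011);
        Text name = new Text("example");

        // WritableComparable also gives key types a natural ordering, used during the sort/shuffle.
        System.out.println(year.compareTo(new IntWritable(1871))); // positive, since 2011 > 1871
        System.out.println(name + " -> " + year.get());
    }
}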


Advantages and Disadvantages

Advantages of Hadoop

·         Distributes data and computation. Keeping computation local to the data prevents network overload.
·         Linear scaling in the ideal case. Hadoop is designed for cheap, commodity hardware.
·         Simple programming model. The end-user programmer only writes map-reduce tasks.
·         HDFS stores large amounts of information.
·         HDFS has a simple and robust coherency model.
·         HDFS gives scalable, fast access to this information, and it is possible to serve a large number of clients by simply adding more machines to the cluster.
·         HDFS provides streaming read performance.

Disadvantages  of Hadoop

·         Rough edges: Hadoop MapReduce and HDFS are rough around the edges, because the software is under active development.
·         Restrictive programming model: the lack of shared, central data can be limiting.
·         Joins of multiple datasets are tricky and slow: there are no indices, and often the entire dataset gets copied in the process.
·         Cluster management is hard: in the cluster, operations like debugging, distributing software and collecting logs are difficult.
·         There is still a single master, which requires care and may limit scaling.
·         Managing job flow is not trivial when intermediate data should be kept.


Summary :


Hadoop is a software framework that demands a different perspective on data processing. It has its own filesystem, HDFS, that stores data in a way optimized for data-intensive processing. You need specialized Hadoop tools to work with HDFS, but fortunately most of those tools follow familiar Unix or Java syntax.

The data processing part of the Hadoop framework is better known as MapReduce. Although the highlight of a MapReduce program is, not surprisingly, the Map and the Reduce operations, other operations done by the framework, such as data splitting and shuffling, are crucial to how it works. You can customize the other operations, such as Partitioning and Combining. Hadoop provides options for reading data and for outputting data in different formats.


Developer(s):            Apache Software Foundation
Initial release:         December 10, 2011
Stable release:          2.7.1 / July 6, 2015
Written in:              Java
Operating system:        Cross-platform


What is HDFS



The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster has nominally a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, the files are stored across multiple machines. The files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.


Features of HDFS


·        It is suitable for distributed storage and processing.

·        Hadoop provides a command interface to interact with HDFS.

·        The built-in servers of the namenode and datanodes help users easily check the status of the cluster.

·        Streaming access to file system data.

·        HDFS provides file permissions and authentication.

                                                                                                                                                       


HDFS Architecture

The architecture of the Hadoop Distributed File System is described below.






Namenode :

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. 

Inodes record attributes like permissions, modification and access times, namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes.

The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.

The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.

                

DataNodes :

A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.

Datanodes perform read-write operations on the file system, as per client request.
They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
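To show how a client talks to HDFS through the NameNode and DataNodes, here is a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API; it assumes a default configuration, and the /tmp/hdfs-demo.txt path is just an example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration; on a cluster this points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the NameNode records the metadata, DataNodes store the blocks.
        Path path = new Path("/tmp/hdfs-demo.txt");   // example path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read it back through the same FileSystem abstraction.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}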

                                                               

Goals of HDFS


  • Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications that have huge datasets.
  • Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.


What is Map Reduce 



Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.

According to the Apache Software Foundation, the primary objective of Map/Reduce is to split the input data set into independent chunks that are processed in a completely parallel manner. The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
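The classic illustration of this model is word count. The sketch below follows the standard Hadoop MapReduce (new API) word count pattern: the mapper emits (word, 1) pairs, the reducer sums them per word, and the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: break each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word after the framework has sorted and grouped by key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner is optional but cuts shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}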


Terminology


  • PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
  • Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
  • NamedNode - Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode - Node where data is presented in advance before any processing takes place.
  • MasterNode - Node where JobTracker runs and which accepts job requests from clients.
  • SlaveNode - Node where Map and Reduce program runs.
  • JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
  • Task Tracker - Tracks the task and reports status to JobTracker.
  • Job - An execution of a Mapper and Reducer across a dataset.
  • Task - An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.




Why Hadoop Is Important In Handling Big Data



Hadoop is changing the perception of handling Big Data, especially unstructured data. Let's look at how the Apache Hadoop software library, which is a framework, plays a vital role in handling Big Data. Apache Hadoop allows large data sets to be processed in a distributed fashion across clusters of computers using simple programming models. It is designed to scale up from single servers to a large number of machines, each offering local computation and storage. Rather than relying on hardware to provide high availability, the library itself is built to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.



Activities performed on Big Data:

Store – Big data needs to be collected in a seamless repository, and it is not necessary to store it in a single physical database.

Process – Processing becomes more tedious than with traditional data in terms of cleansing, enriching, calculating, transforming, and running algorithms.

Access – There is no business sense in it at all if the data cannot be searched and retrieved easily and virtually showcased along the business lines.