Sunday, 20 December 2015

Let's Start With Hive - Hadoop

Java, Pig, and R are all programming languages, but what if you are not comfortable with those regular programming languages and only know SQL? There is still a way into Hadoop for you: it's called Hive. It is the old SQL in a different package, called HQL (Hive Query Language).
As an elementary task in Hive we are going to do the same kind of data processing we did with Pig.

Steps we will follow:
  1. We have several files of baseball statistics that we are going to upload into Hive.
  2. Do some simple computing with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
The dataset has all the statistics from 1871–2011 and contains more than 90,000 rows.
Input file path:
Step 1 –  Load input file:
We need to unzip the dataset into a directory. We will upload just the Master.csv and Batting.csv files from the dataset through the "File Browser", as shown below.
[Screenshot: uploading the files through the File Browser]
In Hue there is a button called "Hive", and inside it there are options such as "Query Editor", "My Queries", and "Tables".
On the left there is the Query Editor. A query may span multiple lines, and there are buttons to execute the query, explain it, save it with a name, and open a new window for another query.
Pig is a scripting language, so all data objects are operated on within the script; once the script completes, those data objects are deleted unless you explicitly stored them.
In the case of Hive we are operating on the Apache Hadoop data store, so any query you run, any table you create, and any data you copy persists from query to query.

Step 2 – Create empty table and load data in Hive
Under "Tables" we select "Create a new table from a file", which leads us to the File Browser, where we select the Batting.csv file and name the new table "temp_batting".
Alternatively, we can open the Query Editor and run a CREATE statement to create the table:
CREATE TABLE temp_batting (col_value STRING);

Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed from the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file (Batting.csv) into its own warehouse location, so it will no longer be seen in the File Browser.
Now that the data is loaded, we need to verify it. To do so we execute the following command, which shows the first 100 rows of the table.

SELECT * from temp_batting LIMIT 100;

The results of the query should look like:

[Screenshot: the first rows of temp_batting returned by the query]

Step 3 – Create a batting table and transfer data from the temporary table to the batting table
Now we will extract the contents of temp_batting into a new table called 'batting', which should contain the following columns:
a)  player_id
b)  year
c)  runs
The next objective is to create the 'batting' table and populate it from 'temp_batting' (player_id, year, and runs) using regular expressions.
create table batting (player_id STRING, year INT, runs INT);

-- each regexp_extract call pulls out the Nth comma-separated field of the raw CSV line
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
FROM temp_batting;
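To sanity-check the extraction before moving on, we can look at a few rows of the new table (a quick query sketch using the columns defined above):

SELECT player_id, year, runs FROM batting LIMIT 10;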

Step 4 – Create a query to show the highest score per year
Next is a simple command that groups 'batting' by year, so that we get the highest score for each year.
    SELECT year, max(runs) FROM batting GROUP BY year;
The result of executing the above query is shown below:
[Screenshots: highest runs per year]
Step 5 – Get final result (who scored the maximum runs, year-wise)
Now that the year-wise maximum runs are ready, we execute the final query, which shows the player who scored the maximum runs in each year.
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;

The result of the above query is shown below.
[Screenshots: year, player id, and runs for each year's top scorer]
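Step 4 of the original outline also asked us to translate player_id into the first and last names of the players using Master.csv. That part is not shown in the screenshots above, but it follows the same pattern. The sketch below assumes Master.csv was uploaded to the same /user/admin directory and loaded into a one-column temp_master table; the {14} and {15} field positions for nameFirst and nameLast are placeholders and should be adjusted to match your copy of Master.csv.

create table temp_master (col_value STRING);
LOAD DATA INPATH '/user/admin/Master.csv' OVERWRITE INTO TABLE temp_master;

create table master (player_id STRING, first_name STRING, last_name STRING);
-- adjust {14} and {15} to the actual positions of nameFirst and nameLast in Master.csv
insert overwrite table master
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) first_name,
  regexp_extract(col_value, '^(?:([^,]*),?){15}', 1) last_name
FROM temp_master;

SELECT m.first_name, m.last_name, b.year, b.runs
FROM batting b
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) y
  ON (b.year = y.year AND b.runs = y.runs)
JOIN master m ON (b.player_id = m.player_id);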

Sunday, 6 December 2015

Thinking like a Pig

                                

Introduction



Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability. Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop core and Pig), runs 40 percent of all its Hadoop jobs with Pig.
Twitter is another well-known user of Pig.

Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin.

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Pig can invoke code in many languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages.

Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).


Pig has two major components:



  1.  A high-level data processing language called Pig Latin .
  2.  A compiler that compiles and runs your Pig Latin script in a choice of evaluation mechanisms.

The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.

Pig simplifies programming because of the ease of expressing your code in Pig Latin.
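To give a flavour of what that looks like, here is a minimal Pig Latin sketch; the file name, field names, and types are made up for illustration:

-- load a tab-delimited file, keep the large requests, and write the result out
logs  = LOAD 'weblogs.tsv' AS (user:chararray, url:chararray, bytes:long);
big   = FILTER logs BY bytes > 1024;
pairs = FOREACH big GENERATE user, url;
STORE pairs INTO 'big_requests';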


Thinking like a Pig



Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More unique and crucial to understanding Pig are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs ) as first-class citizens.


Data types




We can summarize Pig’s philosophy toward data types in its slogan of “Pigs eat anything.”
Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn't require metadata or schema on data, but it can take advantage of them if they’re provided.

Pig can operate on data that is relational, nested, semi-structured, or unstructured.
To support this diversity of data, Pig supports complex data types, such as bags and tuples that can be nested to form fairly sophisticated data structures.




  • Pig Latin Data types




Pig has six simple atomic types (int, long, float, double, chararray, and bytearray) and three complex types.


The three complex types are tuple, bag, and map.
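A small sketch of how these types look in a schema (the relation, fields, and file are invented for illustration):

-- name is a tuple, seasons is a bag of tuples, attributes is a map
players = LOAD 'players.dat' AS (
    name:tuple(first:chararray, last:chararray),
    seasons:bag{t:tuple(year:int, runs:int)},
    attributes:map[chararray]
);
-- project a field out of the tuple and look up a key in the map
info = FOREACH players GENERATE name.first, attributes#'team';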




User-defined functions



Pig was designed with many applications in mind—processing log data, natural language processing, analyzing network graphs, and so forth. It’s expected that many of the computations will require custom processing.
Knowing how to write UDFs is a big part of learning to use Pig.
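The UDF itself is written in Java (or one of the other supported languages); on the Pig Latin side, using it only takes registering the jar and calling the function. A sketch with a made-up jar and class name:

-- register the jar containing the UDF and give the class a short alias (names are hypothetical)
REGISTER myudfs.jar;
DEFINE NORMALIZE com.example.pig.NormalizeName();

players = LOAD 'players.tsv' AS (player_id:chararray, name:chararray);
clean   = FOREACH players GENERATE player_id, NORMALIZE(name);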



Basic Idea of Running Pig



We can run Pig Latin commands in three ways: via the Grunt interactive shell, through a script file, or as embedded queries inside Java programs. Each way can work in one of two modes, local mode and Hadoop mode. (Hadoop mode is sometimes called MapReduce mode in the Pig documentation.)

You can think of Pig programs as similar to SQL queries, and Pig provides a PigServer class that allows any Java program to execute Pig queries.

Running Pig in Hadoop mode means the compiled Pig program will physically execute in a Hadoop installation. Typically the Hadoop installation is a fully distributed cluster.

The execution mode is specified to the pig command via the -x or -exectype option. 
You can enter the Grunt shell in local mode through:


pig -x local

You enter the Grunt shell in Hadoop mode with:

pig -x mapreduce

or use the pig command without arguments, as it chooses the Hadoop mode by default.
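The same -x switch applies when running a script file instead of the interactive shell (the script name here is hypothetical):

pig -x local top_runs.pig
pig -x mapreduce top_runs.pig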



Expressions and functions



You can apply expressions and functions to data fields to compute various values.
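For example, a FOREACH ... GENERATE statement can mix arithmetic expressions, a conditional, and built-in functions such as UPPER (the relation and fields below are invented for illustration):

batting = LOAD 'batting.tsv' AS (player_id:chararray, year:int, runs:int, games:int);
derived = FOREACH batting GENERATE
            UPPER(player_id) AS player_id,
            year,
            (double)runs / games AS runs_per_game,
            (runs > 100 ? 'high' : 'low') AS bucket;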





Summary



Pig is a higher-level data processing layer on top of Hadoop. Its Pig Latin language provides programmers a more intuitive way to specify data flows. It supports schemas in processing structured data, yet it's flexible enough to work with unstructured text or semi-structured XML data. It's extensible with the use of UDFs.
It vastly simplifies data joining and job chaining, two aspects of MapReduce programming that many developers find overly complicated. A task like computing patent co-citations, which takes a complex MapReduce program, can be expressed in a dozen lines of Pig Latin.
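As a concrete illustration of that joining and grouping, the highest-runs-per-year task from the Hive walkthrough above would look roughly like this in Pig Latin; the positional field indexes ($0, $1, $8) mirror the fields the Hive regular expressions extracted and are assumptions about Batting.csv, not a tested script:

-- max runs per year, then join back to the detail rows to recover the player id
raw      = LOAD 'Batting.csv' USING PigStorage(',');
batting  = FOREACH raw GENERATE $0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
clean    = FILTER batting BY year IS NOT NULL;   -- drops the CSV header row
by_year  = GROUP clean BY year;
max_runs = FOREACH by_year GENERATE group AS year, MAX(clean.runs) AS runs;
top      = JOIN clean BY (year, runs), max_runs BY (year, runs);
result   = FOREACH top GENERATE clean::year, clean::player_id, clean::runs;
DUMP result;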