Sunday, 20 December 2015

Let's Start With Hive - Hadoop

Java, Pig, and R are all programming languages, but what if you are not comfortable with those and only know SQL? There is still a way into Hadoop for you: it's called Hive. It is good old SQL in a different package, called HQL (Hive Query Language).
As an elementary task in Hive, we are going to do the same kind of data processing we did with Pig.

Steps we will follow:
  1. We have several files of baseball statistics that we are going to upload into Hive.
  2. Do some simple computing with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
This file has all the statistics from 1871–2011 and contains more than 90,000 rows.
Input file path:
Step 1 –  Load input file:
We need to unzip it into a directory. We will upload just the Master.csv and Batting.csv files from the dataset in the “file browser”, as shown below.
[screenshot: hive3]
In Hue there is a button called “Hive”, and inside Hive there are query options such as “Query Editor”, “My Queries” and “Tables”.
On the left there is the query editor. A query may span multiple lines, and there are buttons to execute the query, explain it, save it under a name, and open a new window for another query.
Pig is a scripting language, so all data objects are operated on within the script; once the script completes, all data objects are deleted unless you stored them. In Hive, by contrast, we operate on the Apache Hadoop data store: any query you make, any table you create and any data you copy persists from query to query.

Step 2 – Create empty table and load data in Hive
Under “Tables” we select “Create a new table from a file”, which leads us to the file browser, where we select the Batting.csv file and name the new table “temp_batting”.
Alternatively, we can open the Query Editor and run a CREATE query to create the table:
CREATE TABLE temp_batting (col_value STRING);

[screenshot: hive4]
Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed in the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file (Batting.csv) into its warehouse directory, so it will no longer be seen in the file browser.
[screenshot: hive6]
Now that the data is loaded, we should verify it. To do so, we execute the following command, which shows the first 100 rows of the table:

SELECT * from temp_batting LIMIT 100;
[screenshot: hive7]

The results of the query should look like:

[screenshot: hive8]

Step 3 – Create a batting table and transfer data from the temporary table to batting table
Now we will extract the contents of temp_batting into a new table called ‘batting’ which should contain the following columns:
a)  player_id
b)  year
c)  runs
[screenshot: hive9]
The next objective is to create the ‘batting’ table and populate it from ‘temp_batting’ with player_id, year and runs, using regular expressions.
CREATE TABLE batting (player_id STRING, year INT, runs INT);

INSERT OVERWRITE TABLE batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
FROM temp_batting;
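To see how those regular expressions pick out the Nth comma-separated field, here is a small Python sketch (Hive's regexp_extract uses Java regexes, but this pattern behaves the same in Python's re module): the group (?:([^,]*),?) is repeated N times, and the capturing group keeps only its last match. The sample row below is made up for illustration, not taken from the real Batting.csv.

```python
import re

def extract_field(line, n):
    # Repeat the field-matching group n times; the capture holds the nth field.
    m = re.match(r'^(?:([^,]*),?){%d}' % n, line)
    return m.group(1)

# Made-up CSV row in the (player_id, year, ...) layout assumed by the script
row = "playerX01,2004,1,SFN,NL"
print(extract_field(row, 1))  # -> playerX01  (field 1: player_id)
print(extract_field(row, 2))  # -> 2004       (field 2: year)
```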

Step 4 – Create a query to show the highest score per year
Next is a simple GROUP BY on ‘batting’ by year, so that we get the highest score for each year:
    SELECT year, max(runs) FROM batting GROUP BY year;
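What this GROUP BY computes can be sketched in plain Python. The (player_id, year, runs) rows below are invented sample data, not real baseball statistics:

```python
# Made-up (player_id, year, runs) rows standing in for the batting table
batting = [
    ("playerA", 2010, 90),
    ("playerB", 2010, 115),
    ("playerA", 2011, 105),
    ("playerC", 2011, 80),
]

# Equivalent of SELECT year, max(runs) ... GROUP BY year
max_runs = {}
for player_id, year, runs in batting:
    max_runs[year] = max(max_runs.get(year, 0), runs)

print(sorted(max_runs.items()))  # [(2010, 115), (2011, 105)]
```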
Result of executing the above query is shown below:
[screenshot: hive10]

[screenshot: hive11]
Step 5 – Get final result (who scored the maximum runs, year-wise)
Now that the year-wise maximum runs are ready, we execute the final query, which shows the player who scored the maximum runs in each year:
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;
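The self-join above matches each row against its year's maximum. A Python sketch of the same logic, again on invented (player_id, year, runs) rows:

```python
# Made-up rows standing in for the batting table
batting = [
    ("playerA", 2010, 90),
    ("playerB", 2010, 115),
    ("playerA", 2011, 105),
    ("playerC", 2011, 80),
]

# Inner query: per-year maximum runs
max_runs = {}
for _, year, runs in batting:
    max_runs[year] = max(max_runs.get(year, 0), runs)

# Join: keep only rows whose (year, runs) equals that year's maximum
result = [(y, p, r) for (p, y, r) in batting if max_runs[y] == r]
print(sorted(result))  # [(2010, 'playerB', 115), (2011, 'playerA', 105)]
```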

The result of the above query:
[screenshot: Hive_final1]
[screenshot: Hive_final2]
