Sunday, 20 December 2015

Let's Start With Hive - Hadoop

Java, Pig, and R are all programming languages, but what if you are not comfortable with those and only know SQL? There is still a way into Hadoop for you: it's called Hive. It is good old SQL in a different package, called HQL (Hive Query Language).
As an elementary task in Hive, we are going to do the same kind of data processing we did with Pig.

Steps we will follow:
  1. We have several files of baseball statistics that we are going to upload into Hive.
  2. Do some simple computing with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
This file has all the statistics from 1871–2011 and contains more than 90,000 rows.
Input file path:
Step 1 –  Load input file:
We need to unzip it into a directory. We will upload just the Master.csv and Batting.csv files from the dataset in the “file browser”, as shown below.
[screenshot: hive3]
In Hue there is a button called “Hive”, and inside Hive there are query options such as “Query Editor”, “My Queries” and “Tables”.
On the left there is the query editor. A query may span multiple lines, and there are buttons to execute the query, explain it, save it under a name, and open a new window for another query.
Pig is a scripting language, so all data objects are operated on within the script; once the script completes, all data objects are deleted unless you stored them. In Hive, by contrast, we operate on the Apache Hadoop data store: any query you make, any table you create and any data you copy persists from query to query.

Step 2 – Create empty table and load data in Hive
Under “Tables” we select “Create a new table from a file”, which leads us to the file browser, where we select the Batting.csv file and name the new table “temp_batting”.
Alternatively, we can open the Query Editor and run a CREATE query to create the table:
CREATE TABLE temp_batting (col_value STRING);

[screenshot: hive4]
Next we load the contents of Batting.csv into the temp_batting table with the following command, which needs to be executed in the Query Editor:
LOAD DATA INPATH '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data has been loaded, Hive moves the file (Batting.csv) into its warehouse directory, so it will no longer be seen in the file browser.
[screenshot: hive6]
Now that the data is loaded, we should verify it. To do so, we execute the following command, which shows the first 100 rows of the table:

SELECT * from temp_batting LIMIT 100;
[screenshot: hive7]

The results of the query should look like:

[screenshot: hive8]

Step 3 – Create a batting table and transfer data from the temporary table to batting table
Now we will extract the contents of temp_batting into a new table called ‘batting’ which should contain the following columns:
a)  player_id
b)  year
c)  runs
[screenshot: hive9]
The next objective is to create the ‘batting’ table and populate it from ‘temp_batting’ with player_id, year and runs, using regular expressions.
CREATE TABLE batting (player_id STRING, year INT, runs INT);

INSERT OVERWRITE TABLE batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
FROM temp_batting;
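To see how those regular expressions pick out the Nth comma-separated field, here is a small Python sketch (Hive's regexp_extract uses Java regexes, but this pattern behaves the same in Python's re module): the group (?:([^,]*),?) is repeated N times, and the capturing group keeps only its last match. The sample row below is made up for illustration, not taken from the real Batting.csv.

```python
import re

def extract_field(line, n):
    # Repeat the field-matching group n times; the capture holds the nth field.
    m = re.match(r'^(?:([^,]*),?){%d}' % n, line)
    return m.group(1)

# Made-up CSV row in the (player_id, year, ...) layout assumed by the script
row = "playerX01,2004,1,SFN,NL"
print(extract_field(row, 1))  # -> playerX01  (field 1: player_id)
print(extract_field(row, 2))  # -> 2004       (field 2: year)
```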

Step 4 – Create a query to show the highest score per year
Next is a simple GROUP BY on ‘batting’ by year, so that we get the highest score for each year:
    SELECT year, max(runs) FROM batting GROUP BY year;
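What this GROUP BY computes can be sketched in plain Python. The (player_id, year, runs) rows below are invented sample data, not real baseball statistics:

```python
# Made-up (player_id, year, runs) rows standing in for the batting table
batting = [
    ("playerA", 2010, 90),
    ("playerB", 2010, 115),
    ("playerA", 2011, 105),
    ("playerC", 2011, 80),
]

# Equivalent of SELECT year, max(runs) ... GROUP BY year
max_runs = {}
for player_id, year, runs in batting:
    max_runs[year] = max(max_runs.get(year, 0), runs)

print(sorted(max_runs.items()))  # [(2010, 115), (2011, 105)]
```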
Result of executing the above query is shown below:
[screenshot: hive10]

[screenshot: hive11]
Step 5 – Get final result (who scored the maximum runs, year-wise)
Now that the year-wise maximum runs are ready, we execute the final query, which shows the player who scored the maximum runs in each year:
    SELECT a.year, a.player_id, a.runs from batting a 
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
    ON (a.year = b.year AND a.runs = b.runs) ;
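The self-join above matches each row against its year's maximum. A Python sketch of the same logic, again on invented (player_id, year, runs) rows:

```python
# Made-up rows standing in for the batting table
batting = [
    ("playerA", 2010, 90),
    ("playerB", 2010, 115),
    ("playerA", 2011, 105),
    ("playerC", 2011, 80),
]

# Inner query: per-year maximum runs
max_runs = {}
for _, year, runs in batting:
    max_runs[year] = max(max_runs.get(year, 0), runs)

# Join: keep only rows whose (year, runs) equals that year's maximum
result = [(y, p, r) for (p, y, r) in batting if max_runs[y] == r]
print(sorted(result))  # [(2010, 'playerB', 115), (2011, 'playerA', 105)]
```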

The result of the above query:
[screenshot: Hive_final1]
[screenshot: Hive_final2]
