Sunday, 6 December 2015

Thinking like a Pig

                                

Introduction



Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability. Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig. Twitter is another well-known user of Pig.

Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin.

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Pig can invoke code in many languages, such as JRuby, Jython, and Java. You can also embed Pig scripts in other languages.

Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Distributed File System (HDFS).
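As a minimal sketch (the file paths and field names below are assumed for illustration), a typical Pig Latin script loads data, transforms it, and stores the result back into HDFS:

raw = LOAD '/data/logs/access.log' USING PigStorage('\t')
      AS (ip:chararray, url:chararray, bytes:long);   -- tab-delimited input
big = FILTER raw BY bytes > 1024;                     -- keep large responses only
STORE big INTO '/data/output/big_requests';           -- written to HDFS in Hadoop mode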


Pig has two major components:



  1.  A high-level data processing language called Pig Latin.
  2.  A compiler that compiles and runs your Pig Latin script in a choice of evaluation mechanisms.

The main evaluation mechanism is Hadoop. Pig also supports a local mode for development purposes.

Pig simplifies programming because of the ease of expressing your code in Pig Latin.


Thinking like a Pig



Pig has a certain philosophy about its design. We expect ease of use, high performance, and massive scalability from any Hadoop subproject. More unique and crucial to understanding Pig are the design choices of its programming language (a data flow language called Pig Latin), the data types it supports, and its treatment of user-defined functions (UDFs) as first-class citizens.


Data types




We can summarize Pig’s philosophy toward data types in its slogan of “Pigs eat anything.”
Input data can come in any format. Popular formats, such as tab-delimited text files, are natively supported. Users can add functions to support other data file formats as well. Pig doesn't require metadata or a schema for its data, but it can take advantage of them if they're provided.
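For example (the file name data.txt is assumed), the same file can be loaded with or without a schema; without one, fields are referenced by position and default to bytearray:

a = LOAD 'data.txt' AS (name:chararray, age:int);  -- schema declared up front
adults = FILTER a BY age > 18;

b = LOAD 'data.txt';                               -- no schema at all
adults2 = FILTER b BY (int)$1 > 18;                -- positional reference with a cast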

Pig can operate on data that is relational, nested, semistructured, or unstructured.
To support this diversity of data, Pig offers complex data types, such as bags and tuples, that can be nested to form fairly sophisticated data structures.




  • Pig Latin Data types




Pig has six simple atomic types (int, long, float, double, chararray, and bytearray) and three complex types.


The three complex types are tuple, bag, and map.
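As a brief sketch (the file and field names are assumed), a single relation can combine all three, nested inside one another:

-- each tab-separated line might look like: (alice,30)  {(hadoop),(pig)}  [city#london]
users = LOAD 'users.txt'
        AS (info:tuple(name:chararray, age:int),  -- tuple: an ordered set of fields
            skills:bag{t:(skill:chararray)},      -- bag: a collection of tuples
            props:map[]);                         -- map: a set of key/value pairs
names = FOREACH users GENERATE info.name, props#'city';
DUMP names;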




User-defined functions



Pig was designed with many applications in mind—processing log data, natural language processing, analyzing network graphs, and so forth. It’s expected that many of the computations will require custom processing.
Knowing how to write UDFs is a big part of learning to use Pig.
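As a sketch (the jar and class names below are hypothetical placeholders, not real artifacts), a Java UDF is registered once and then used like any built-in function:

REGISTER myudfs.jar;                              -- jar holding the compiled UDF (assumed name)
DEFINE NORMALIZE com.example.pig.NormalizeUrl();  -- hypothetical UDF class
logs = LOAD 'access.log' AS (ip:chararray, url:chararray);
clean = FOREACH logs GENERATE ip, NORMALIZE(url); -- UDF called like a built-in
DUMP clean;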



Basic Idea of Running Pig



We can run Pig Latin commands in three ways—via the Grunt interactive shell, through a script file, and as embedded queries inside Java programs. Each way can work in one of two modes—local mode and Hadoop mode. (Hadoop mode is sometimes called MapReduce mode in the Pig documentation.)

You can think of Pig programs as similar to SQL queries, and Pig provides a PigServer class that allows any Java program to execute Pig queries.

Running Pig in Hadoop mode means the compiled Pig program will physically execute in a Hadoop installation. Typically the Hadoop installation is a fully distributed cluster.

The execution mode is specified to the pig command via the -x or -exectype option.
You can enter the Grunt shell in local mode with:


pig -x local

To enter the Grunt shell in Hadoop mode:

pig -x mapreduce

or simply run the pig command without arguments, since Hadoop mode is the default.
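You can also run a script file in either mode; the script name below is just an illustration:

pig -x local myscript.pig

pig myscript.pig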



Expressions and functions



You can apply expressions and functions to data fields to derive new values: arithmetic and comparison operators, as well as built-in functions such as SUM, COUNT, and UPPER.
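As a small sketch (file and field names are assumed), an arithmetic expression combined with a built-in aggregate function:

sales = LOAD 'sales.txt' AS (store:chararray, price:double, qty:int);
totals = FOREACH sales GENERATE store, price * qty AS revenue;  -- arithmetic expression
bystore = GROUP totals BY store;
summary = FOREACH bystore GENERATE group AS store,
          SUM(totals.revenue) AS total_revenue;                 -- built-in function
DUMP summary;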





Summary



Pig is a higher-level data processing layer on top of Hadoop. Its Pig Latin language provides programmers a more intuitive way to specify data flows. It supports schemas in processing structured data, yet it's flexible enough to work with unstructured text or semistructured XML data. It's extensible through UDFs.
It vastly simplifies data joining and job chaining, two aspects of MapReduce programming that many developers find overly complicated: a task such as computing patent cocitations, which takes a complex MapReduce program in Java, can be expressed in about a dozen lines of Pig Latin.











