Pig is a platform for analyzing large datasets that uses a high-level language to express data analysis programs. It compiles programs into MapReduce jobs that can run in parallel on a Hadoop cluster. Pig provides built-in functions for common tasks and allows users to define their own custom functions (UDFs). Programs can be run locally or on a Hadoop cluster by placing commands in a script or entering them in the interactive Grunt shell.
2. What is PIG?
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Pig generates and compiles Map/Reduce programs on the fly.
3. Why PIG?
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
6. Running PIG
Grunt Shell: enter Pig commands manually using Pig's interactive shell, Grunt.
Script File: place Pig commands in a script file and run the script.
Embedded Program: embed Pig commands in a host language and run the program.
7. Run Modes
Local Mode: to run Pig in local mode, you need access to a single machine.
Hadoop (MapReduce) Mode: to run Pig in Hadoop (MapReduce) mode, you need access to a Hadoop cluster and an HDFS installation.
8. Sample PIG script
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';
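The script above splits each line of the passwd file on ':' and keeps only the first field. As a rough plain-Java sketch of that same projection (class and method names here are illustrative, not Pig API):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PasswdProjection {
    // Mirrors: A = load 'passwd' using PigStorage(':');
    //          B = foreach A generate $0 as id;
    public static List<String> extractIds(Stream<String> lines) {
        return lines
                .map(line -> line.split(":", -1)[0]) // $0 is the first colon-delimited field
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = extractIds(Stream.of(
                "root:x:0:0:root:/root:/bin/bash",
                "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin"));
        System.out.println(ids); // [root, daemon]
    }
}
```

The difference, of course, is that Pig runs the equivalent logic as parallel MapReduce tasks over HDFS files rather than over an in-memory stream.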
9. Sample Script With Schema
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
13. Sample CW PIG script
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, TagId, URL, Impressions;
GroupedInput = GROUP input BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
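The GROUP ... BY followed by SUM in this script is the classic group-and-aggregate pattern. A rough plain-Java sketch of the same computation over an in-memory list (the composite string key stands in for Pig's `group` tuple; row layout is assumed from the script):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAndSum {
    // Mirrors: GroupedInput = GROUP input BY (Category, TagId, URL);
    //          result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions);
    // Each row is {Category, TagId, URL, Impressions}.
    public static Map<String, Long> sumImpressions(List<String[]> rows) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "|" + row[1] + "|" + row[2]; // (Category, TagId, URL)
            long impressions = Long.parseLong(row[3]);
            totals.merge(key, impressions, Long::sum);       // SUM per group
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> r = sumImpressions(List.of(
                new String[]{"12", "7", "example.com", "100"},
                new String[]{"12", "7", "example.com", "50"},
                new String[]{"9", "3", "other.com", "10"}));
        System.out.println(r); // {12|7|example.com=150, 9|3|other.com=10}
    }
}
```

In Pig, the GROUP step maps to the MapReduce shuffle and the SUM runs in the reducers, so the same logic scales across a cluster.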
14. Sample PIG script (Filtering)
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
15. What is PIG UDF?
UDF - User Defined Function
Types of UDFs:
- Eval Functions (extends EvalFunc<String>)
- Aggregate Functions (extends EvalFunc<Long> implements Algebraic)
- Filter Functions (extends FilterFunc)
UDFContext:
- Allows UDFs to get access to the JobConf object
- Allows UDFs to pass configuration information between instantiations of the UDF on the front and back ends.
16. Sample UDF
public class TopLevelDomain extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        Object o = tuple.get(0);
        if (o == null) {
            return null;
        }
        return Validator.getTLD(o.toString());
    }
}
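Validator.getTLD is ContextWeb's own helper, not shown on the slides. As a naive stand-in for what it might do, the sketch below returns the text after the last dot; real top-level-domain extraction needs the public-suffix list to handle cases like `co.uk`:

```java
public class NaiveTld {
    // Naive stand-in for Validator.getTLD (hypothetical behavior): returns the
    // text after the last dot, or null when there is no usable dot.
    public static String getTLD(String domain) {
        if (domain == null) {
            return null;
        }
        int lastDot = domain.lastIndexOf('.');
        if (lastDot < 0 || lastDot == domain.length() - 1) {
            return null; // no dot, or the domain ends with a dot
        }
        return domain.substring(lastDot + 1);
    }

    public static void main(String[] args) {
        System.out.println(getTLD("www.example.com")); // com
    }
}
```

Note how the UDF's null check in `exec` matters: Pig passes a null field through as a null tuple element, and returning null (rather than throwing) lets the rest of the pipeline keep flowing.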
17. UDF In Action
REGISTER '$WORK_DIR/pig-support.jar';
DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
A = FOREACH input GENERATE TagId, getTopLevelDomain(PublisherDomain) AS RootDomain;