Pig is a platform for analyzing large datasets that uses a high-level language to express data analysis programs. It compiles programs into MapReduce jobs that can run in parallel on a Hadoop cluster. Pig provides built-in functions for common tasks and allows users to define their own custom functions (UDFs). Programs can be run locally or on a Hadoop cluster by placing commands in a script or entering them in the interactive Grunt shell.
2. What is PIG?
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Pig generates and compiles Map/Reduce programs on the fly.
3. Why PIG?
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
6. Running PIG
Grunt Shell: enter Pig commands manually using Pig's interactive shell, Grunt.
Script File: place Pig commands in a script file and run the script.
Embedded Program: embed Pig commands in a host language and run the program.
7. Run Modes
Local Mode: to run Pig in local mode, you need access to a single machine.
Hadoop (MapReduce) Mode: to run Pig in Hadoop (MapReduce) mode, you need access to a Hadoop cluster and an HDFS installation.
8. Sample PIG script
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';
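The script above splits each line of the passwd file on ':' and keeps only the first field. As a rough plain-Java sketch of that same projection (class and method names here are illustrative, not Pig API):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PasswdProjection {
    // Mirrors: A = load 'passwd' using PigStorage(':');
    //          B = foreach A generate $0 as id;
    public static List<String> extractIds(Stream<String> lines) {
        return lines
                .map(line -> line.split(":", -1)[0]) // $0 is the first colon-delimited field
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = extractIds(Stream.of(
                "root:x:0:0:root:/root:/bin/bash",
                "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin"));
        System.out.println(ids); // [root, daemon]
    }
}
```

The difference, of course, is that Pig runs the equivalent logic as parallel MapReduce tasks over HDFS files rather than over an in-memory stream.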
9. Sample Script With Schema
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
13. Sample CW PIG script
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, TagId, URL, Impressions;
GroupedInput = GROUP input BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
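The GROUP ... BY followed by SUM in this script is the classic group-and-aggregate pattern. A rough plain-Java sketch of the same computation over an in-memory list (the composite string key stands in for Pig's `group` tuple; row layout is assumed from the script):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAndSum {
    // Mirrors: GroupedInput = GROUP input BY (Category, TagId, URL);
    //          result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions);
    // Each row is {Category, TagId, URL, Impressions}.
    public static Map<String, Long> sumImpressions(List<String[]> rows) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "|" + row[1] + "|" + row[2]; // (Category, TagId, URL)
            long impressions = Long.parseLong(row[3]);
            totals.merge(key, impressions, Long::sum);       // SUM per group
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> r = sumImpressions(List.of(
                new String[]{"12", "7", "example.com", "100"},
                new String[]{"12", "7", "example.com", "50"},
                new String[]{"9", "3", "other.com", "10"}));
        System.out.println(r); // {12|7|example.com=150, 9|3|other.com=10}
    }
}
```

In Pig, the GROUP step maps to the MapReduce shuffle and the SUM runs in the reducers, so the same logic scales across a cluster.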
14. Sample PIG script (Filtering)
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
15. What is PIG UDF?
UDF - User Defined Function
Types of UDFs:
- Eval Functions (extends EvalFunc<String>)
- Aggregate Functions (extends EvalFunc<Long> implements Algebraic)
- Filter Functions (extends FilterFunc)
UDFContext:
- Allows UDFs to get access to the JobConf object
- Allows UDFs to pass configuration information between instantiations of the UDF on the front and back ends.
16. Sample UDF
public class TopLevelDomain extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        Object o = tuple.get(0);
        if (o == null) {
            return null;
        }
        return Validator.getTLD(o.toString());
    }
}
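Validator.getTLD is ContextWeb's own helper, not shown on the slides. As a naive stand-in for what it might do, the sketch below returns the text after the last dot; real top-level-domain extraction needs the public-suffix list to handle cases like `co.uk`:

```java
public class NaiveTld {
    // Naive stand-in for Validator.getTLD (hypothetical behavior): returns the
    // text after the last dot, or null when there is no usable dot.
    public static String getTLD(String domain) {
        if (domain == null) {
            return null;
        }
        int lastDot = domain.lastIndexOf('.');
        if (lastDot < 0 || lastDot == domain.length() - 1) {
            return null; // no dot, or the domain ends with a dot
        }
        return domain.substring(lastDot + 1);
    }

    public static void main(String[] args) {
        System.out.println(getTLD("www.example.com")); // com
    }
}
```

Note how the UDF's null check in `exec` matters: Pig passes a null field through as a null tuple element, and returning null (rather than throwing) lets the rest of the pipeline keep flowing.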
17. UDF In Action
REGISTER '$WORK_DIR/pig-support.jar';
DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
A = FOREACH input GENERATE TagId, getTopLevelDomain(PublisherDomain) AS RootDomain;