Big Data Certifications
Jan 2018
Introduction

Adam Doyle
• Co-Organizer, St. Louis HUG
• Big Data Community Lead, Daugherty Business Solutions
• Formerly Lead Big Data Developer at Mercy
• Speaker at local and national Big Data conferences

Adam Riggs
• Sr. Recruiter at Daugherty Business Solutions
• 6 years of technical recruiting experience specializing in application development and Big Data
• Has hired candidates for most of the local Fortune 500 companies, including Wells Fargo, Express Scripts, MasterCard, Charter, and Anheuser-Busch
Agenda

• Why get certified?
• What certifications are available?
• What is the certification test like?
• Now what?
Why get certified?

• For consulting services organizations, there are reasons to get certified:
  – Partnering with vendors
  – Discounts
  – Publicity
• For companies, there are also reasons to get certified:
  – Publicity
  – Recognition
Why get certified?

• Potential salary increase of 7-9% with Hadoop certification.
• The big data market is still immature. Few companies have defined their big data strategy, and hiring has been slow due to the lack of qualified candidates. As this market matures, the value of a certification will increase, in my opinion.

(Charts: Big Data Supply and Demand; Certification Demand)
Evaluating Certifications

• What do I want out of the certification?
• How does my current employer value the credential, and do they understand the business case?
• Does this credential complement my experience or add depth to my skills?
• Do the opportunities I'm seeking require this credential?
• What is my plan to continue my education after this certification?
Certification Coverage

Topics versus certifications. Columns, in order: CCP Data Engineer, HDPCD, HDPCD-Java, MCHD, MCHBD, CCA Data Analyst, HCA, MCDA, AWSSA, CCDK, HDPCD-Spark, MCSD, CCA Spark and Hadoop Developer, DCD, CCA Administrator, HDPCA, MCCA, CCAK. Each x marks a certification that covers the topic:
Sqoop x x x
HDFS x x x x x
Hive x x x x x x
Impala x x x
Hive DDL x x x x
Hive QL x x x x
Spark x x x x
Spark SQL x x x
Spark MLlib x x
Spark Streaming x x
Python x x x
Java x x x
Scala x x x x
Cloudera Manager x x
Hadoop Admin Utils x x
Flume x x
Avro x
Parquet x
Oozie x
Pig x x x x
Tez x
HDP Ambari x
Knox x
Ranger x
MapReduce x x
MapR Admin x
MapR FS x
MapR DB x
HBase x
Drill x
AWS x
Kafka x
Kafka Admin x
Hadoop Developer
Administrator
Data Analyst
Spark Developer
Other certifications
Exam Objectives Example: Cloudera Certified Professional

• Data Ingest (HDFS, Sqoop, Flume)
  – Import and export from an RDBMS
  – Ingest streaming data
  – Use HDFS commands
• Transform, Stage, Store (Hive, Pig)
  – Convert formats
  – Use compression
  – Transform values
  – Purge bad values, deduplicate
  – Denormalize data
  – Evolve an Avro/Parquet schema
  – Partition data
  – Tune for query performance
• Data Analysis (Hive)
  – Aggregate queries, statistics
  – Filter
  – Rank/sort data
  – Join data sets
  – Create a Hive table from existing data on HDFS
• Workflow (Oozie)
  – Linear workflow
  – Branching workflow
  – Schedule a workflow
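To make the objectives concrete, here is a minimal sketch of the kind of "Transform, Stage, Store" task the exam poses. It uses PySpark (the exam also accepts Hive or Pig solutions); the HDFS paths and column names are our own hypothetical examples, and Spark 2.x is assumed:

# Minimal sketch, assuming Spark 2.x and a hypothetical /data/raw/orders.csv on HDFS
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-stage-store").getOrCreate()

raw = spark.read.option("header", "true").csv("hdfs:///data/raw/orders.csv")

cleaned = (raw
    .where(F.col("order_id").isNotNull())                # purge bad values
    .dropDuplicates(["order_id"])                        # deduplication
    .withColumn("order_date", F.to_date("order_date")))  # transform values

# Partition and compress on write to tune for query performance
(cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")
    .parquet("hdfs:///data/staged/orders"))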
Exam Objectives Example: Hortonworks HDPCD Certification

• Data Ingestion (HDFS, Sqoop, Flume)
– Import data from a table in a relational database into HDFS
– Import the results of a query from a relational database into HDFS
– Import a table from a relational database into a new or existing Hive table
– Insert or update data from HDFS into a table in a relational database
– Given a Flume configuration file, start a Flume agent
– Given a configured sink and source, configure a Flume memory channel with a specified capacity
• Data Transformation (Pig)
– Write and execute a Pig script
– Load data into a Pig relation without a schema
– Load data into a Pig relation with a schema
– Load data from a Hive table into a Pig relation
– Use Pig to transform data into a specified format
– Transform data to match a given Hive schema
– Group the data of one or more Pig relations
– Use Pig to remove records with null values from a relation
– Store the data from a Pig relation into a folder in HDFS
– Store the data from a Pig relation into a Hive table
– Sort the output of a Pig relation
– Remove the duplicate tuples of a Pig relation
– Specify the number of reduce tasks for a Pig MapReduce job
– Join two datasets using Pig
– Perform a replicated join using Pig
– Run a Pig job using Tez
– Within a Pig script, register a JAR file of User Defined Functions
– Within a Pig script, define an alias for a User Defined Function
– Within a Pig script, invoke a User Defined Function
• Data Analysis (Hive)
– Write and execute a Hive query
– Define a Hive-managed table
– Define a Hive external table
– Define a partitioned Hive table
– Define a bucketed Hive table
– Define a Hive table from a select query
– Define a Hive table that uses the ORCFile format
– Create a new ORCFile table from the data in an existing non-ORCFile Hive table
– Specify the storage format of a Hive table
– Specify the delimiter of a Hive table
– Load data into a Hive table from a local directory
– Load data into a Hive table from an HDFS directory
– Load data into a Hive table as the result of a query
– Load a compressed data file into a Hive table
– Update a row in a Hive table
– Delete a row from a Hive table
– Insert a new row into a Hive table
– Join two Hive tables
– Run a Hive query using Tez
– Run a Hive query using vectorization
– Output the execution plan for a Hive query
– Use a subquery within a Hive query
– Output data from a Hive query that is totally ordered across multiple reducers
– Set a Hadoop or Hive configuration property from within a Hive query
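On the exam, the Hive objectives above are exercised in the Hive shell. Purely as an illustration, here is the same style of DDL and loading issued through a Spark session with Hive support; this is our own sketch, not exam material, and every table and column name is hypothetical:

# Minimal sketch, assuming Spark 2.x built with Hive support; all table names are hypothetical
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hive-objectives")
    .enableHiveSupport()
    .getOrCreate())

# Define a partitioned, ORC-backed managed table
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Create a new ORCFile table from an existing non-ORCFile table
spark.sql("CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text")

# Load data into a table as the result of a query
spark.sql("""
    INSERT INTO TABLE sales PARTITION (sale_date = '2018-01-01')
    SELECT id, amount FROM staging_sales WHERE sale_date = '2018-01-01'
""")

# Join two tables and output the execution plan
spark.sql("EXPLAIN SELECT s.id, c.name FROM sales s JOIN customers c ON s.id = c.id") \
    .show(truncate=False)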
Exam Objectives Example: HDPCD Spark

• Core Spark
  – Write a Spark Core application in Python or Scala
  – Initialize a Spark application
  – Run a Spark job on YARN
  – Create an RDD
  – Create an RDD from a file or directory in HDFS
  – Persist an RDD in memory or on disk
  – Perform Spark transformations on an RDD
  – Perform Spark actions on an RDD
  – Create and use broadcast variables and accumulators
  – Configure Spark properties
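For a feel of the Core Spark objectives, here is a minimal PySpark sketch of our own (the HDFS path and lookup data are hypothetical) that initializes an application, creates an RDD from a file, uses a broadcast variable, chains transformations, persists, and triggers an action:

# Minimal sketch, assuming a hypothetical hdfs:///data/events.txt
from pyspark import SparkConf, SparkContext

# Initialize a Spark application and configure Spark properties
conf = SparkConf().setAppName("hdpcd-core-spark")
sc = SparkContext(conf=conf)

# Create an RDD from a file in HDFS
lines = sc.textFile("hdfs:///data/events.txt")

# Broadcast a small lookup table to every executor
codes = sc.broadcast({"ERR": "error", "WRN": "warning"})

# Transformations are lazy; nothing runs until an action is called
counts = (lines
    .map(lambda line: line.split(",")[0])
    .filter(lambda code: code in codes.value)
    .map(lambda code: (codes.value[code], 1))
    .reduceByKey(lambda a, b: a + b))

counts.persist()         # persist the RDD in memory (MEMORY_ONLY by default)
print(counts.collect())  # the action triggers the computation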
• Spark SQL
  – Create Spark DataFrames from an existing RDD
  – Perform operations on a DataFrame
  – Write a Spark SQL application
  – Use Hive with ORC from Spark SQL
  – Write a Spark SQL application that reads and writes data from Hive tables
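And for the Spark SQL objectives, a companion sketch (Spark 2.x assumed; the rows below are invented sample data) that builds a DataFrame from an existing RDD, performs DataFrame operations, and runs SQL over a temporary view:

# Minimal sketch, assuming Spark 2.x; the rows below are invented sample data
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("hdpcd-spark-sql").getOrCreate()

# Create a DataFrame from an existing RDD of Row objects
rdd = spark.sparkContext.parallelize([
    Row(name="alice", dept="eng", salary=100),
    Row(name="bob", dept="ops", salary=90),
])
df = spark.createDataFrame(rdd)

# Perform operations on the DataFrame
df.groupBy("dept").avg("salary").show()

# Run SQL against the same data through a temporary view
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, MAX(salary) AS top_salary FROM employees GROUP BY dept").show()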
Practice Exams

• Some vendors have practice exams and study guides:
  – Hortonworks
    • https://2xbbhjxc6wk3v21p62t8n4d4-wpengine.netdna-ssl.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf
  – MapR
    • http://learn.mapr.com/mapr-certified-data-analyst-mcda-study-guide?_ga=2.160653181.131562425.1514478687-888544134.1514478687
Taking the test

• Caveat
• Register at examslocal.com
• Cost:
  – Cloudera: $400
  – MapR: $250
  – Hortonworks: $250
• Remotely proctored
• ~4-hour time limit
Preparing your test space
Available documentation

Exam delivery and cluster information:

The CCP Data Engineer Exam (DE575) is a remote-proctored exam available anywhere, anytime. It is a hands-on, practical exam using Cloudera technologies. Each user is given their own CDH cluster (currently 5.10.1) pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others. In addition, the cluster comes with Python 2.7 and 3.4, Perl 5.16, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, and NetBeans.

Documentation available online during the exam:
• Cloudera Product Documentation
• Apache Hadoop
• Apache Hive
• Apache Impala (Incubating)
• Apache Sqoop
• Spark
• Apache Crunch
• Apache Pig
• Kite SDK
• Apache Avro
• Apache Parquet
• Cloudera HUE
• Apache Oozie
• Apache Flume
• DataFu
• JDK 7 API Docs
• Python 2.7 Documentation
• Python 3.4 Documentation
• Scala Documentation
Getting your results

• In my experience, the results have come back the same day.
• The results include a pass/fail result for the test as a whole and a pass/fail marker for each problem. If you failed, there is a high-level description of why.
So you've got your certification, now what?

• Share the accomplishment with your current employer and explain the value add.
  – It's annual assessment time for many employers, which is an excellent time to discuss the investment in your career.
• Update your LinkedIn profile and other sites (GitHub, Stack Overflow) to let your professional network know.
  – You can update your title or add the credential to your name and skills summary.
• Keep investing in your career. Have a plan for your professional development beyond certifications (user groups, trade shows, conferences, open-source projects, hackathons).
• If you are seeking the certification to increase your salary, understand the market and how to present the value add to employers.
  – Leverage online resources like salary.com or payscale.com to get market averages.
  – Good recruiters should also be able to assess your market value after you achieve certifications, and most are willing to conduct resume reviews with you.
  – Discuss your reasons for achieving the certification and how it adds to your value.
• Do not let your certifications expire.
Join Our Team
Contact:
Your.name@daugherty.com