Big Data Certifications
Jan 2018
Introduction

Adam Doyle
• Co-Organizer, St. Louis HUG
• Big Data Community Lead, Daugherty Business Solutions
• Formerly Lead Big Data Developer at Mercy
• Speaker at local and national Big Data conferences

Adam Riggs
• Sr. Recruiter at Daugherty Business Solutions
• 6 years of technical recruiting experience specializing in application development and Big Data
• Has hired candidates for most of the local Fortune 500 companies, including Wells Fargo, Express Scripts, MasterCard, Charter, and Anheuser-Busch
Agenda

• Why get certified?
• What certifications are available?
• What is the certification test like?
• Now what?
Why get certified?

• For consulting services organizations, there are reasons to get certified:
  – Partnering with vendors
  – Discounts
  – Publicity
• For companies, there are also reasons to get certified:
  – Publicity
  – Recognition
Why get certified?

• Potential salary increase of 7-9% with Hadoop certification.
• The big data market is still immature. Few companies have defined their big data strategy, and hiring has been slow due to the lack of qualified candidates. As this market matures, the value of a certification will increase, in my opinion.

(Charts: Big Data Supply and Demand; Certification Demand)
Evaluating Certifications

• What do I want out of the certification?
• How does my current employer value the credential, and do they understand the business case?
• Does this credential complement my experience or add depth to my skills?
• Do the opportunities I'm seeking require this credential?
• What is my plan to continue my education after this certification?
Certification Coverage

Topics versus certifications. Columns, in order: CCP Data Engineer, HDPCD, HDPCD-Java, MCHD, MCHBD, CCA Data Analyst, HCA, MCDA, AWSSA, CCDK, HDPCD-Spark, MCSD, CCA Spark and Hadoop Developer, DCD, CCA Administrator, HDPCA, MCCA, CCAK. Each x marks a certification that covers the topic:
Sqoop x x x
HDFS x x x x x
Hive x x x x x x
Impala x x x
Hive DDL x x x x
Hive QL x x x x
Spark x x x x
Spark SQL x x x
Spark MLlib x x
Spark Streaming x x
Python x x x
Java x x x
Scala x x x x
Cloudera Manager x x
Hadoop Admin Utils x x
Flume x x
Avro x
Parquet x
Oozie x
Pig x x x x
Tez x
HDP Ambari x
Knox x
Ranger x
MapReduce x x
MapR Admin x
MapR FS x
MapR DB x
HBase x
Drill x
AWS x
Kafka x
Kafka Admin x
Hadoop Developer
Administrator
Data Analyst
Spark Developer
Other certifications
Exam Objectives Example: Cloudera Certified Professional

• Data Ingest (HDFS, Sqoop, Flume)
  – Import and export from an RDBMS
  – Ingest streaming data
  – Use HDFS commands
• Transform, Stage, Store (Hive, Pig)
  – Convert formats
  – Use compression
  – Transform values
  – Purge bad values, deduplicate
  – Denormalize data
  – Evolve an Avro/Parquet schema
  – Partition data
  – Tune for query performance
• Data Analysis (Hive)
  – Aggregate queries, statistics
  – Filter
  – Rank/sort data
  – Join data sets
  – Create a Hive table from existing data on HDFS
• Workflow (Oozie)
  – Linear workflow
  – Branching workflow
  – Schedule a workflow
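To make the objectives concrete, here is a minimal sketch of the kind of "Transform, Stage, Store" task the exam poses. It uses PySpark (the exam also accepts Hive or Pig solutions); the HDFS paths and column names are our own hypothetical examples, and Spark 2.x is assumed:

# Minimal sketch, assuming Spark 2.x and a hypothetical /data/raw/orders.csv on HDFS
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-stage-store").getOrCreate()

raw = spark.read.option("header", "true").csv("hdfs:///data/raw/orders.csv")

cleaned = (raw
    .where(F.col("order_id").isNotNull())                # purge bad values
    .dropDuplicates(["order_id"])                        # deduplication
    .withColumn("order_date", F.to_date("order_date")))  # transform values

# Partition and compress on write to tune for query performance
(cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")
    .parquet("hdfs:///data/staged/orders"))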
Exam Objectives Example: Hortonworks HDPCD Certification

• Data Ingestion (HDFS, Sqoop, Flume)
– Import data from a table in a relational database into HDFS
– Import the results of a query from a relational database into HDFS
– Import a table from a relational database into a new or existing Hive table
– Insert or update data from HDFS into a table in a relational database
– Given a Flume configuration file, start a Flume agent
– Given a configured sink and source, configure a Flume memory channel with a specified capacity
• Data Transformation (Pig)
– Write and execute a Pig script
– Load data into a Pig relation without a schema
– Load data into a Pig relation with a schema
– Load data from a Hive table into a Pig relation
– Use Pig to transform data into a specified format
– Transform data to match a given Hive schema
– Group the data of one or more Pig relations
– Use Pig to remove records with null values from a relation
– Store the data from a Pig relation into a folder in HDFS
– Store the data from a Pig relation into a Hive table
– Sort the output of a Pig relation
– Remove the duplicate tuples of a Pig relation
– Specify the number of reduce tasks for a Pig MapReduce job
– Join two datasets using Pig
– Perform a replicated join using Pig
– Run a Pig job using Tez
– Within a Pig script, register a JAR file of User Defined Functions
– Within a Pig script, define an alias for a User Defined Function
– Within a Pig script, invoke a User Defined Function
• Data Analysis (Hive)
– Write and execute a Hive query
– Define a Hive-managed table
– Define a Hive external table
– Define a partitioned Hive table
– Define a bucketed Hive table
– Define a Hive table from a select query
– Define a Hive table that uses the ORCFile format
– Create a new ORCFile table from the data in an existing non-ORCFile Hive table
– Specify the storage format of a Hive table
– Specify the delimiter of a Hive table
– Load data into a Hive table from a local directory
– Load data into a Hive table from an HDFS directory
– Load data into a Hive table as the result of a query
– Load a compressed data file into a Hive table
– Update a row in a Hive table
– Delete a row from a Hive table
– Insert a new row into a Hive table
– Join two Hive tables
– Run a Hive query using Tez
– Run a Hive query using vectorization
– Output the execution plan for a Hive query
– Use a subquery within a Hive query
– Output data from a Hive query that is totally ordered across multiple reducers
– Set a Hadoop or Hive configuration property from within a Hive query
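On the exam, the Hive objectives above are exercised in the Hive shell. Purely as an illustration, here is the same style of DDL and loading issued through a Spark session with Hive support; this is our own sketch, not exam material, and every table and column name is hypothetical:

# Minimal sketch, assuming Spark 2.x built with Hive support; all table names are hypothetical
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hive-objectives")
    .enableHiveSupport()
    .getOrCreate())

# Define a partitioned, ORC-backed managed table
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Create a new ORCFile table from an existing non-ORCFile table
spark.sql("CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text")

# Load data into a table as the result of a query
spark.sql("""
    INSERT INTO TABLE sales PARTITION (sale_date = '2018-01-01')
    SELECT id, amount FROM staging_sales WHERE sale_date = '2018-01-01'
""")

# Join two tables and output the execution plan
spark.sql("EXPLAIN SELECT s.id, c.name FROM sales s JOIN customers c ON s.id = c.id") \
    .show(truncate=False)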
Exam Objectives Example: HDPCD Spark

• Core Spark
  – Write a Spark Core application in Python or Scala
  – Initialize a Spark application
  – Run a Spark job on YARN
  – Create an RDD
  – Create an RDD from a file or directory in HDFS
  – Persist an RDD in memory or on disk
  – Perform Spark transformations on an RDD
  – Perform Spark actions on an RDD
  – Create and use broadcast variables and accumulators
  – Configure Spark properties
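For a feel of the Core Spark objectives, here is a minimal PySpark sketch of our own (the HDFS path and lookup data are hypothetical) that initializes an application, creates an RDD from a file, uses a broadcast variable, chains transformations, persists, and triggers an action:

# Minimal sketch, assuming a hypothetical hdfs:///data/events.txt
from pyspark import SparkConf, SparkContext

# Initialize a Spark application and configure Spark properties
conf = SparkConf().setAppName("hdpcd-core-spark")
sc = SparkContext(conf=conf)

# Create an RDD from a file in HDFS
lines = sc.textFile("hdfs:///data/events.txt")

# Broadcast a small lookup table to every executor
codes = sc.broadcast({"ERR": "error", "WRN": "warning"})

# Transformations are lazy; nothing runs until an action is called
counts = (lines
    .map(lambda line: line.split(",")[0])
    .filter(lambda code: code in codes.value)
    .map(lambda code: (codes.value[code], 1))
    .reduceByKey(lambda a, b: a + b))

counts.persist()         # persist the RDD in memory (MEMORY_ONLY by default)
print(counts.collect())  # the action triggers the computation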
• Spark SQL
  – Create Spark DataFrames from an existing RDD
  – Perform operations on a DataFrame
  – Write a Spark SQL application
  – Use Hive with ORC from Spark SQL
  – Write a Spark SQL application that reads and writes data from Hive tables
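And for the Spark SQL objectives, a companion sketch (Spark 2.x assumed; the rows below are invented sample data) that builds a DataFrame from an existing RDD, performs DataFrame operations, and runs SQL over a temporary view:

# Minimal sketch, assuming Spark 2.x; the rows below are invented sample data
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("hdpcd-spark-sql").getOrCreate()

# Create a DataFrame from an existing RDD of Row objects
rdd = spark.sparkContext.parallelize([
    Row(name="alice", dept="eng", salary=100),
    Row(name="bob", dept="ops", salary=90),
])
df = spark.createDataFrame(rdd)

# Perform operations on the DataFrame
df.groupBy("dept").avg("salary").show()

# Run SQL against the same data through a temporary view
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, MAX(salary) AS top_salary FROM employees GROUP BY dept").show()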
Practice Exams

• Some vendors have practice exams and study guides:
  – Hortonworks
    • https://2xbbhjxc6wk3v21p62t8n4d4-wpengine.netdna-ssl.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf
  – MapR
    • http://learn.mapr.com/mapr-certified-data-analyst-mcda-study-guide?_ga=2.160653181.131562425.1514478687-888544134.1514478687
Taking the test

• Caveat
• Register at examslocal.com
• Cost:
  – Cloudera: $400
  – MapR: $250
  – Hortonworks: $250
• Remotely proctored
• ~4-hour time limit
Preparing your test space
Available documentation

Exam delivery and cluster information:

The CCP Data Engineer Exam (DE575) is a remote-proctored exam available anywhere, anytime. It is a hands-on, practical exam using Cloudera technologies. Each user is given their own CDH cluster (currently 5.10.1) pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others. In addition, the cluster comes with Python 2.7 and 3.4, Perl 5.16, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, and NetBeans.

Documentation available online during the exam:
• Cloudera Product Documentation
• Apache Hadoop
• Apache Hive
• Apache Impala (Incubating)
• Apache Sqoop
• Spark
• Apache Crunch
• Apache Pig
• Kite SDK
• Apache Avro
• Apache Parquet
• Cloudera HUE
• Apache Oozie
• Apache Flume
• DataFu
• JDK 7 API Docs
• Python 2.7 Documentation
• Python 3.4 Documentation
• Scala Documentation
Getting your results

• In my experience, the results have come back the same day.
• The results include a pass/fail result for the test as a whole and a pass/fail marker for each problem. If you failed, there is a high-level description of why.
So you've got your certification, now what?

• Share the accomplishment with your current employer and explain the value add.
  – It's annual assessment time for many employers, which is an excellent time to discuss the investment in your career.
• Update your LinkedIn profile and other sites (GitHub, Stack Overflow) to let your professional network know.
  – You can update your title or add the credential to your name and skills summary.
• Keep investing in your career. Have a plan for your professional development beyond certifications (user groups, trade shows, conferences, open-source projects, hackathons).
• If you are seeking the certification to increase your salary, understand the market and how to present the value add to employers.
  – Leverage online resources like salary.com or payscale.com to get market averages.
  – Good recruiters should also be able to assess your market value after you achieve certifications, and most are willing to conduct resume reviews with you.
  – Discuss your reasons for achieving the certification and how it adds to your value.
• Do not let your certifications expire.
Join Our Team
Contact:
Your.name@daugherty.com