Presentation by Austin Sun, who has earned this certification himself and is helping others do the same.
Adobe’s Unified Profile System is the heart of its Experience Platform: it ingests terabytes of data a day and is petabytes in size. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing. We want to share some of the lessons we learned the hard way as we reached this scale, specifically with Structured Streaming. Know thy lag: while consuming from a Kafka topic that sees sporadic loads, it is very important to monitor consumer lag; it also makes you respect what a beast backpressure is. Other topics covered: reading data in a fan-out pattern using minPartitions to use Kafka efficiently; overload protection using maxOffsetsPerTrigger; more Apache Spark settings used to optimize throughput; micro-batching best practices; map() + forEach() vs. mapPartitions() + forEachPartition(); Spark speculation and its effects; and calculating streaming statistics, including windowing, the importance of the state store, RocksDB FTW, broadcast joins, custom aggregators, and off-heap counters using Redis pipelining.
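A minimal PySpark sketch of the Kafka source options called out above (minPartitions for fan-out, maxOffsetsPerTrigger for overload protection) and of the per-partition write pattern; broker addresses, topic name, and checkpoint path are placeholders, not Adobe's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("profile-ingest-sketch").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "profile-events")
    # Fan-out: request more input partitions than the topic has, so a small
    # Kafka partition count does not cap Spark parallelism.
    .option("minPartitions", "64")
    # Overload protection: cap how many records each micro-batch pulls.
    .option("maxOffsetsPerTrigger", "500000")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

def write_batch(batch_df, batch_id):
    # foreachPartition amortizes connection setup per partition instead of
    # per record (the mapPartitions / forEachPartition point above).
    def flush(rows):
        for row in rows:
            pass  # send to the downstream sink here
    batch_df.foreachPartition(flush)

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/profile-ingest")
    .start()
)
```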
Hadoop has become synonymous with Big Data. Oracle has released the latest addition to the Java EE stack: Batch Processing, JSR 352. Batch processing has been around for decades, and many Java frameworks are already available, such as Spring Batch. This talk provides a perspective on Hadoop and JSR 352, and on knowing when to use one, the other, or both together.
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
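A hedged sketch of the Oracle-to-Delta ETL path described above; the JDBC URL, table names, and output path are illustrative placeholders, and the actual claims business rules live in the team's reusable Java library, not here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims-etl-sketch").getOrCreate()

# Extract from the legacy Oracle system over JDBC.
claims = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/CLAIMS")
    .option("dbtable", "CLAIMS.REIMBURSEMENTS")
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", "10000")   # larger fetch size speeds up bulk extracts
    .load()
)

# Load into the Delta Lake data lake, partitioned for downstream queries.
(
    claims.write.format("delta")
    .mode("append")
    .partitionBy("claim_date")
    .save("/mnt/datalake/claims/reimbursements")
)
```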
This document discusses Devon Energy's efforts to modernize its data landscape by implementing a data hub architecture. The data hub consolidates various data sources and tools on cloud services like Snowflake, Databricks and Azure. This has improved agility, reduced costs, and allowed various teams to access and analyze data. Devon Energy is working to improve continuous integration/deployment, testing, and monitoring across its data engineering and analytics workflows on the data hub platform.
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations. Come hear how Asurion used Delta, Structured Streaming, Auto Loader, and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion's technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion's data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in its production data lake on AWS.
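A minimal Auto Loader sketch of the ingestion pattern named above (Databricks runtime only); bucket paths, table name, and the availableNow trigger (Spark 3.3+/recent DBR) are assumptions, not Asurion's actual setup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/raw/events/")
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)                 # incremental, batch-style run
    .toTable("bronze.events")                   # lands in a Delta table
)
```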
Gojek, Indonesia’s first billion-dollar startup, has seen explosive growth in both users and data over the past three years.
Accelerating Data Science Initiatives with Databricks’ Rapid SQL Analytics and Privacera’s Centralized Data Access Governance. Databricks’ SQL Analytics helps data teams consolidate and simplify their data architectures. With SQL Analytics, data teams can perform BI and SQL workloads on the same multi-cloud lakehouse architecture, enabling data scientists to perform advanced analytics on unstructured and large-scale data. This session will explore how Privacera’s advanced security, privacy, and governance capabilities seamlessly integrate with Databricks’ unified SQL Analytics approach to provide single-pane visibility of data analytics from a centralized location. Attendees will learn how to: rapidly access data to run high-fidelity analytics; implement a fully secure solution that ensures productivity while controlling data access at fine-grained levels (row, column, and file); easily enable consistent access policies across all systems and applications; support true data transparency across enterprises; and comply with stringent industry and privacy regulations like GDPR, LGPD, HIPAA, CCPA, PCI DSS, RTBF, and more, with rich auditing and reporting.
Driverless AI can run on various cloud platforms and on-premises servers. It supports Linux environments with CUDA GPUs. The document provides step-by-step instructions for setting up Driverless AI on an IBM Power P9 system, including installing prerequisites, running experiments through a web interface, and automating training with Python. It also addresses common customer questions about installation, deployment, and productionizing Driverless AI models and pipelines.
This document discusses ETL practices and opportunities for improving data integration processes. It presents ELT and RIT approaches to extract, load, and transform data in Hadoop/MPP systems for better performance and scalability. While data modeling is still important, the document questions how to balance normalization with ease of querying for analytics. Integration is noted as key to bringing value from distributed data sources, and challenges of unique identifiers and cross-referencing data are discussed. The document also emphasizes best practices like profiling, prototyping, deploying to sandboxes before production, and ensuring tools for performance monitoring, problem detection and education are in place.
This document provides an introduction to Google BigQuery, a cloud-based data warehouse that allows users to interactively query and analyze massive datasets. It begins with background on big data and technologies like Hadoop, Hive, and Spark. It then explains the differences between row-based and column-based data stores, with BigQuery using a columnar approach. The rest of the document demonstrates BigQuery through an example query on public datasets and provides pricing and resource information.
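A small sketch of the kind of public-dataset query demonstrated in the document, here issued from Python; it assumes the google-cloud-bigquery client is installed and credentials are already configured.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# BigQuery's columnar storage means the query is billed by the bytes it scans,
# not by the number of rows returned.
for row in client.query(sql).result():
    print(row.name, row.total)
```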
An introduction to using R in Power BI via its various touch points, such as R script data sources, R transformations, custom R visuals, and the community gallery of R visualizations.
This document provides an overview of using Polybase for data virtualization in SQL Server. It discusses installing and configuring Polybase, connecting external data sources like Azure Blob Storage and SQL Server, using Polybase DMVs for monitoring and troubleshooting, and techniques for optimizing performance like predicate pushdown and creating statistics on external tables. The presentation aims to explain how Polybase can be leveraged to virtually access and query external data using T-SQL without needing to know the physical data locations or move the data.
This document provides an introduction to Azure SQL Data Warehouse. It discusses the architecture of ASDW including how it is built on Azure SQL Database and Analytics Platform System (APS). It covers various topics like database design, querying, data loading, tooling, and maintenance for ASDW. The goals are to understand the basic infrastructure, learn design/querying/migration methods, and investigate available tooling for automation and monitoring of ASDW.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
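A minimal sketch of what that ACID layer looks like in practice: every write goes through a transaction log, so readers always see a consistent snapshot. It assumes a Spark session already configured with the Delta Lake package (e.g., Databricks or delta-spark); the path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Appends and overwrites are atomic; a failed job leaves no partial files visible.
spark.range(1000, 2000).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save("/tmp/delta/events")

print(spark.read.format("delta").load("/tmp/delta/events").count())  # 2000
```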
This document discusses using MongoDB as an agile NoSQL database for big data applications. It describes MongoDB's schema-less design, horizontal scaling, and replication capabilities which make it a good fit for frequently changing agile projects. The document includes examples of using MongoDB for an e-commerce catalog with dynamic product data and reviews from multiple sources.
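A short pymongo sketch of the schema-less catalog idea described above; the connection string, database, and field names are illustrative, not taken from the document.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
catalog = client.shop.products

# Two products with different attribute sets live in the same collection --
# no ALTER TABLE needed when a new product type appears.
catalog.insert_many([
    {"sku": "TV-100", "type": "tv", "screen_inches": 55,
     "reviews": [{"source": "site", "rating": 5}]},
    {"sku": "BOOK-7", "type": "book", "author": "N. Example", "pages": 320,
     "reviews": [{"source": "partner", "rating": 4}]},
])

# Query across heterogeneous documents, e.g. anything with a 4+ star review.
for product in catalog.find({"reviews.rating": {"$gte": 4}}, {"sku": 1, "_id": 0}):
    print(product["sku"])
```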
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and has expanded our capabilities in the Data Warehouse. Since all our writes to the Data Warehouse go through Apache Spark, we took advantage of that to add modules that supplement ETL writing. Config-driven and purposeful, these modules perform tasks on a Spark DataFrame destined for a Hive table, organized as a sequence of transformations applied to the DataFrame before it is written. One such module is journalizing, a process that maintains a non-duplicated historical record of mutable data associated with different parts of our business. Data quality, another module, is enabled on the fly: using Apache Spark we calculate metrics, and an adjacent service runs quality tests on the incoming data for a table. Finally, we cleanse data based on the provided configurations, then validate and write it into the warehouse. An internal versioning strategy in the Data Warehouse lets us distinguish new data from old data for a table. Applying these modules at write time means data is cleaned, validated, and tested before it enters the Data Warehouse, programmatically relieving us of most data problems. This talk focuses on ETL writing at Stitch Fix and describes the modules that help our Data Scientists on a daily basis.
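A hypothetical sketch of the "sequence of transformations before the write" idea; the function names, config keys, and table names below are invented for illustration and are not Stitch Fix's internal APIs.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def cleanse(df: DataFrame, config: dict) -> DataFrame:
    # e.g. drop rows missing required keys
    return df.dropna(subset=config["required_columns"])

def check_quality(df: DataFrame, config: dict) -> DataFrame:
    # fail fast if the incoming batch is implausibly small
    if df.count() < config["min_rows"]:
        raise ValueError("data quality check failed: too few rows")
    return df

def journalize(df: DataFrame, config: dict) -> DataFrame:
    # tag each row with a batch version so a history of mutable records is kept
    return df.withColumn("batch_version", lit(config["batch_version"]))

config = {"required_columns": ["client_id"], "min_rows": 1, "batch_version": 42}
df = spark.table("staging.shipments")

# Config-driven pipeline: each module transforms the DataFrame in turn,
# and only the final result is written to the destination Hive table.
for step in (cleanse, check_quality, journalize):
    df = step(df, config)

df.write.mode("append").saveAsTable("warehouse.shipments")
```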
This document summarizes Mark Kromer's presentation on using Azure Data Factory and Azure Databricks for ETL. It discusses using ADF for nightly data loads, slowly changing dimensions, and loading star schemas into data warehouses. It also covers using ADF for data science scenarios with data lakes. The presentation describes ADF mapping data flows for code-free data transformations at scale in the cloud without needing expertise in Spark, Scala, Python or Java. It highlights how mapping data flows allow users to focus on business logic and data transformations through an expression language and provides debugging and monitoring of data flows.
This document discusses techniques for optimizing Power BI performance. It recommends tracing queries using DAX Studio to identify slow queries and refresh times. Tracing tools like SQL Profiler and log files can provide insights into issues occurring in the data sources, Power BI layer, and across the network. Focusing on optimization by addressing wait times through a scientific process can help resolve long-term performance problems.
What is Data Warehousing? Who needs Data Warehousing? Why is a Data Warehouse required? Types of Systems: OLTP vs. OLAP. Maintenance of a Data Warehouse. The Data Warehousing Life Cycle.
The document describes BlueData's Big Data Lab Accelerator solution which provides software and professional services to deploy a multi-tenant Hadoop and Spark lab environment in two weeks for evaluation of Big Data tools. BlueData's EPIC software simplifies Big Data infrastructure deployment using containers and virtualization. The Accelerator solution includes deployment of the EPIC platform, setup of Hadoop and Spark clusters, data pipeline workshops and implementation of sample use cases to get started with Big Data.
This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.