SlideShare a Scribd company logo
DATA ENGINEER CERTIFICATION
Austin Sun
June 27th 2017
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Data Scientist:
a person employed to analyze and interpret complex digital data,
such as the usage statistics of a website, especially in order to assist
a business in its decision-making.
• Data Engineer:
a worker whose primary job responsibilities involve preparing data for
analytical or operational uses. Data engineers enable data scientists
to do their jobs more effectively.
DATA ENGINEER & DATA SCIENTIST

Recommended for you

How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale

Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing. We want to share some of our learnings and hard earned lessons and as we reached this scale specifically with Structured Streaming. Know thy Lag While consuming off a Kafka topic which sees sporadic loads, its very important to monitor the Consumer lag. Also makes you respect what a beast backpressure is. Reading Data In Fan Out Pattern using minPartitions to Use Kafka Efficiently Overload protection using maxOffsetsPerTrigger More Apache Spark Settings used to optimize Throughput MicroBatching Best Practices Map() +ForEach() vs MapPartitons + forEachPartition Adobe Spark Speculation and its Effects Calculating Streaming Statistics Windowing Importance of the State Store RocksDB FTW Broadcast joins Custom Aggegators OffHeap Counters using Redis Pipelining

Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352

Hadoop has become synonymous to Big Data. Oracle has release the latest standard to Java EE stack: Batch Processing JSR 352. Batch processing has been around for decades and there are many Java framework already available such Spring Batch. This talks provides a perspective about Hadoop and JSR352. Knowing when to use or the other or both together.

apache hadoophadoopjava
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark

The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.

spark + ai summit

 *
How to obtain the Cloudera Data Engineer Certification
“An experienced open-source developer who earns
the Cloudera Certified Data Engineer credential is
able to perform core competencies required to
ingest, transform, store, and analyze data in
Cloudera's CDH environment. The credential is
earned after successfully passing the CCP Data
Engineer Exam (DE575).” -- Cloudera
CCP DATA ENGINEER
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Data Ingest
• Transform, Stage, Store
• Data Analysis
• Workflow
SKILL SET

Recommended for you

Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...

This document discusses Devon Energy's efforts to modernize its data landscape by implementing a data hub architecture. The data hub consolidates various data sources and tools on cloud services like Snowflake, Databricks and Azure. This has improved agility, reduced costs, and allowed various teams to access and analyze data. Devon Energy is working to improve continuous integration/deployment, testing, and monitoring across its data engineering and analytics workflows on the data hub platform.

Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming

Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decision, adjust to the market, meet needs of their customers or run effective supply chain operations. Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time Asurion’s technical team will share battle tested tips and tricks you only get with certain scale. Asurion data lake executes 4000+ streaming jobs and hosts over 4000 tables in production Data Lake on AWS.

Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast

Gojek, Indonesia’s first billion-dollar startup, has seen an explosive growth in both users and data over the past three years.

spark + ai summit
The skills to transfer data between external systems and
your cluster. This includes the following:
• Import and export data between an external RDBMS and your cluster, including
the ability to import specific subsets, change the delimiter and file format of
imported data during ingest, and alter the data access pattern or privileges.
• Ingest real-time and near-real time (NRT) streaming data into HDFS, including the
ability to distribute to multiple data sources and convert data on ingest from one
format to another.
• Load data into and out of HDFS using the Hadoop File System (FS) commands.
DATA INGEST
Convert a set of data values in a given format stored in HDFS
into new data values and/or a new data format and write
them into HDFS or Hive/HCatalog. This includes the following
skills:
• Convert data from one file format to another
• Write your data with compression
• Convert data from one set of values to another (e.g., Lat/Long to Postal Address
using an external library)
• Change the data format of values in a data set
TRANSFORM, STAGE, STORE
TRANSFORM, STAGE, STORE
• Purge bad records from a data set, e.g., null values
• Deduplication and merge data
• De-normalize data from multiple disparate data sets
• Evolve an Avro or Parquet schema
• Partition an existing data set according to one or more partition keys
• Tune data for optimal query performance
Filter, sort, join, aggregate, and/or transform one or more data
sets in a given format stored in HDFS to produce a specified
result. All of these tasks may include reading from Parquet, Avro,
JSON, delimited text, and natural language text. The queries will
include complex data types (e.g., array, map, struct), the
implementation of external libraries, partitioned data,
compressed data, and require the use of metadata from
Hive/HCatalog.
DATA ANALYSIS

Recommended for you

Accelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraAccelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & Privacera

Accelerating Data Science Initiatives with Databricks’ Rapid SQL Analytics and Privacera’s Centralized Data Access Governance. Databricks’ SQL Analytics helps data teams consolidate and simplify their data architectures. With SQL Analytics, data teams can perform BI and SQL workloads on the same multi-cloud lakehouse architecture enabling data scientists to perform advanced analytics on unstructured and large-scale data. This session will explore how Privacera’s advanced security, privacy, and governance capabilities seamlessly integrate with Databricks’ unified SQL Analytics approach to provide single pane visibility of data analytics from a centralized location. Attendees will learn how to: Rapidly access data to run high-fidelity analytics Implement a fully secure solution that ensures productivity, while controlling data access at fine-grained levels (row, column, and file) Easily enable consistent access policies across all systems and applications Support true data transparency across enterprises Comply with stringent industry and privacy regulations like GDPR, LGPD, HIPAA, CCPA, PCI DSS, RTBF, and more with rich auditing and reporting

Lessons from Driverless AI going to Production
Lessons from Driverless AI going to ProductionLessons from Driverless AI going to Production
Lessons from Driverless AI going to Production

Driverless AI can run on various cloud platforms and on-premises servers. It supports Linux environments with CUDA GPUs. The document provides step-by-step instructions for setting up Driverless AI on an IBM Power P9 system, including installing prerequisites, running experiments through a web interface, and automating training with Python. It also addresses common customer questions about installation, deployment, and productionizing Driverless AI models and pipelines.

ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse

This document discusses ETL practices and opportunities for improving data integration processes. It presents ELT and RIT approaches to extract, load, and transform data in Hadoop/MPP systems for better performance and scalability. While data modeling is still important, the document questions how to balance normalization with ease of querying for analytics. Integration is noted as key to bringing value from distributed data sources, and challenges of unique identifiers and cross-referencing data are discussed. The document also emphasizes best practices like profiling, prototyping, deploying to sandboxes before production, and ensuring tools for performance monitoring, problem detection and education are in place.

eltsqlhadoop
• Write a query to aggregate multiple rows of data
• Write a query to calculate aggregate statistics (e.g., average or sum)
• Write a query to filter data
• Write a query that produces ranked or sorted data
• Write a query that joins multiple data sets
• Read and/or create a Hive or an HCatalog table from existing data in HDFS
DATA ANALYSIS
The ability to create and execute various jobs and actions that
move data towards greater value and use in a system. This
includes the following skills:
• Create and execute a linear workflow with actions that include Hadoop jobs, Hive
jobs, Pig jobs, custom actions, etc.
• Create and execute a branching workflow with actions that include Hadoop jobs,
Hive jobs, Pig jobs, custom action, etc.
• Orchestrate a workflow to execute regularly at predefined times, including
workflows that have data dependencies
WORK FLOW
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Familiar with all related command tools
• Use Cloudera quickstart VW to practice
• Take sample test

Recommended for you

Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery

This document provides an introduction to Google BigQuery, a cloud-based data warehouse that allows users to interactively query and analyze massive datasets. It begins with background on big data and technologies like Hadoop, Hive, and Spark. It then explains the differences between row-based and column-based data stores, with BigQuery using a columnar approach. The rest of the document demonstrates BigQuery through an example query on public datasets and provides pricing and resource information.

bigqueryhadoopmapreduce
R in Power BI
R in Power BIR in Power BI
R in Power BI

An introduction to using R in Power BI via the various touch points such as: R script data sources, R transformations, custom R visuals, and the community gallery of R visualizations

rpower bidata
Data virtualization using polybase
Data virtualization using polybaseData virtualization using polybase
Data virtualization using polybase

This document provides an overview of using Polybase for data virtualization in SQL Server. It discusses installing and configuring Polybase, connecting external data sources like Azure Blob Storage and SQL Server, using Polybase DMVs for monitoring and troubleshooting, and techniques for optimizing performance like predicate pushdown and creating statistics on external tables. The presentation aims to explain how Polybase can be leveraged to virtually access and query external data using T-SQL without needing to know the physical data locations or move the data.

antonios chatzipavlissqlschool.grsql server
• Hive, Impala, Sqoop, Spark, Crunch, Pig, Kite, Avro,
Parquet, Cloudera HUE, oozie, Flume, DataFu, JDK 7
API Docs, Python 2.7 , Python 3.4 , Scala
Only the above documentation are accessible during the exam.
FAMILIAR WITH ALL RELATED
COMMAND TOOLS
DOWNLOAD QUICKSTART VM
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certification

Recommended for you

Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse

This document provides an introduction to Azure SQL Data Warehouse. It discusses the architecture of ASDW including how it is built on Azure SQL Database and Analytics Platform System (APS). It covers various topics like database design, querying, data loading, tooling, and maintenance for ASDW. The goals are to understand the basic infrastructure, learn design/querying/migration methods, and investigate available tooling for automation and monitoring of ASDW.

databaseazure sql data warehouseazure
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

delta lakesparkbig data
MongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseMongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL Database

This document discusses using MongoDB as an agile NoSQL database for big data applications. It describes MongoDB's schema-less design, horizontal scaling, and replication capabilities which make it a good fit for frequently changing agile projects. The document includes examples of using MongoDB for an e-commerce catalog with dynamic product data and reviews from multiple sources.

agile mongodb
How to obtain the Cloudera Data Engineer Certification
SAMPLE EXAM QUESTION
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Create an account at www.examslocal.com.
• Select the exam
• Choose a date and time
• Select a time slot for exam
• Pass the compatibility tool and install the screen
sharing Chrome Extension
STEPS TO SCHEDULE EXAM

Recommended for you

Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark

Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse. Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table. These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business. Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data. And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table. Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.

ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks

This document summarizes Mark Kromer's presentation on using Azure Data Factory and Azure Databricks for ETL. It discusses using ADF for nightly data loads, slowly changing dimensions, and loading star schemas into data warehouses. It also covers using ADF for data science scenarios with data lakes. The presentation describes ADF mapping data flows for code-free data transformations at scale in the cloud without needing expertise in Spark, Scala, Python or Java. It highlights how mapping data flows allow users to focus on business logic and data transformations through an expression language and provides debugging and monitoring of data flows.

Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI

This document discusses techniques for optimizing Power BI performance. It recommends tracing queries using DAX Studio to identify slow queries and refresh times. Tracing tools like SQL Profiler and log files can provide insights into issues occurring in the data sources, Power BI layer, and across the network. Focusing on optimization by addressing wait times through a scientific process can help resolve long-term performance problems.

power bioptimizationmicrosoft
• The exam is remote, it takes less then 2 hour
• Partly open book exam.
• Some documentation are available online during
the exam
• All other websites, including Google/search
functionality is disabled. No notes or other exam
aids.
WHEN EXAM
• Introduction
• Skill set
• How to prepare
• Register test
• Reference & etc.
OUTLINE
• Apache & Cloudera official documents
• My website:
https://godataengineer.wordpress.com/
USEFUL LINKS
How to obtain the Cloudera Data Engineer Certification

Recommended for you

Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing

What is Data Warehousing? , Who needs Data Warehousing? , Why Data Warehouse is required? , Types of Systems , OLTP OLAP Maintenance of Data Warehouse Data Warehousing Life Cycle

datastage business intelligence developer course icomputersdatastage crash course in mumbai
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator

The document describes BlueData's Big Data Lab Accelerator solution which provides software and professional services to deploy a multi-tenant Hadoop and Spark lab environment in two weeks for evaluation of Big Data tools. BlueData's EPIC software simplifies Big Data infrastructure deployment using containers and virtualization. The Accelerator solution includes deployment of the EPIC platform, setup of Hadoop and Spark clusters, data pipeline workshops and implementation of sample use cases to get started with Big Data.

big datahadoopdevops
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight

This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.

ssis hdinsight pass
• Data Warehouse for Machine Learning app
• Using Flume, Hive, HDFS, Spark and Phoenix
USE CASE
THANK YOU

More Related Content

What's hot

GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
The Plan Cache Whisperer - Performance Tuning SQL Server
The Plan Cache Whisperer - Performance Tuning SQL ServerThe Plan Cache Whisperer - Performance Tuning SQL Server
The Plan Cache Whisperer - Performance Tuning SQL Server
Jason Strate
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The Move
IBM Cloud Data Services
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
Databricks
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352
Armel Nene
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Databricks
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
Databricks
 
Accelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraAccelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & Privacera
Databricks
 
Lessons from Driverless AI going to Production
Lessons from Driverless AI going to ProductionLessons from Driverless AI going to Production
Lessons from Driverless AI going to Production
Sri Ambati
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
Eric Sun
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
Csaba Toth
 
R in Power BI
R in Power BIR in Power BI
R in Power BI
Eric Bragas
 
Data virtualization using polybase
Data virtualization using polybaseData virtualization using polybase
Data virtualization using polybase
Antonios Chatzipavlis
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
Grant Fritchey
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
MongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseMongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL Database
Gaurav Awasthi
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
Databricks
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks
Databricks
 

What's hot (20)

GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
The Plan Cache Whisperer - Performance Tuning SQL Server
The Plan Cache Whisperer - Performance Tuning SQL ServerThe Plan Cache Whisperer - Performance Tuning SQL Server
The Plan Cache Whisperer - Performance Tuning SQL Server
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The Move
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Scaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and FeastScaling Data and ML with Apache Spark and Feast
Scaling Data and ML with Apache Spark and Feast
 
Accelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & PrivaceraAccelerate Data Science Initiatives: Databricks & Privacera
Accelerate Data Science Initiatives: Databricks & Privacera
 
Lessons from Driverless AI going to Production
Lessons from Driverless AI going to ProductionLessons from Driverless AI going to Production
Lessons from Driverless AI going to Production
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
R in Power BI
R in Power BIR in Power BI
R in Power BI
 
Data virtualization using polybase
Data virtualization using polybaseData virtualization using polybase
Data virtualization using polybase
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
MongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseMongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL Database
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
ETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure DatabricksETL Made Easy with Azure Data Factory and Azure Databricks
ETL Made Easy with Azure Data Factory and Azure Databricks
 

Similar to How to obtain the Cloudera Data Engineer Certification

Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
Kellyn Pot'Vin-Gorman
 
Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing
Vibrant Technologies & Computers
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
BlueData, Inc.
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
Tillmann Eitelberg
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
Eli Singer
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
BizTalk360
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
Antonios Chatzipavlis
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training
CHENNAKESHAVAKATAGAR
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabad
Rajitha D
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
Rick van den Bosch
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
Adam Doyle
 
Pallavi_Resume
Pallavi_ResumePallavi_Resume
Pallavi_Resume
pallavi Mahajan
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
Adam Doyle
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data Applications
Cloudera, Inc.
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
RTTS
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Saurabh's_profile
Saurabh's_profileSaurabh's_profile
Saurabh's_profile
Saurabh Srivastava
 
Pankaj_Kumar_~3 yr exp.docx
Pankaj_Kumar_~3  yr exp.docxPankaj_Kumar_~3  yr exp.docx
Pankaj_Kumar_~3 yr exp.docx
Kumar Pankaj
 
Cnam azure ze cloud resource manager
Cnam azure ze cloud  resource managerCnam azure ze cloud  resource manager
Cnam azure ze cloud resource manager
Aymeric Weinbach
 

Similar to How to obtain the Cloudera Data Engineer Certification (20)

Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Sap bods Training in Hyderabad | Sap bods Online Training
Sap bods Training in Hyderabad | Sap bods  Online Training Sap bods Training in Hyderabad | Sap bods  Online Training
Sap bods Training in Hyderabad | Sap bods Online Training
 
Sap bods training in hyderabad
Sap bods training in hyderabadSap bods training in hyderabad
Sap bods training in hyderabad
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Pallavi_Resume
Pallavi_ResumePallavi_Resume
Pallavi_Resume
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data Applications
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Saurabh's_profile
Saurabh's_profileSaurabh's_profile
Saurabh's_profile
 
Pankaj_Kumar_~3 yr exp.docx
Pankaj_Kumar_~3  yr exp.docxPankaj_Kumar_~3  yr exp.docx
Pankaj_Kumar_~3 yr exp.docx
 
Cnam azure ze cloud resource manager
Cnam azure ze cloud  resource managerCnam azure ze cloud  resource manager
Cnam azure ze cloud resource manager
 

More from elephantscale

AI for Kids
AI for KidsAI for Kids
AI for Kids
elephantscale
 
Building a Big Data Team
Building a Big Data TeamBuilding a Big Data Team
Building a Big Data Team
elephantscale
 
Petrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultinPetrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultin
elephantscale
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
elephantscale
 
Oil & Gas Big Data use cases
Oil & Gas Big Data use casesOil & Gas Big Data use cases
Oil & Gas Big Data use cases
elephantscale
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
elephantscale
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Things
elephantscale
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
elephantscale
 

More from elephantscale (8)

AI for Kids
AI for KidsAI for Kids
AI for Kids
 
Building a Big Data Team
Building a Big Data TeamBuilding a Big Data Team
Building a Big Data Team
 
Petrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultinPetrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultin
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
Oil & Gas Big Data use cases
Oil & Gas Big Data use casesOil & Gas Big Data use cases
Oil & Gas Big Data use cases
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Reference architecture for Internet Of Things
Reference architecture for Internet Of ThingsReference architecture for Internet Of Things
Reference architecture for Internet Of Things
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 

Recently uploaded

DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
Yevgen Sysoyev
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Bert Blevins
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
Toru Tamaki
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
Enterprise Wired
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
Andrey Yasko
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Bert Blevins
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
UiPathCommunity
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 

Recently uploaded (20)

DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 

How to obtain the Cloudera Data Engineer Certification

  • 2. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 3. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 4. • Data Scientist: a person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making. • Data Engineer: a worker whose primary job responsibilities involve preparing data for analytical or operational uses. Data engineers enable data scientists to do their jobs more effectively. DATA ENGINEER & DATA SCIENTIST
  • 6. “An experienced open-source developer who earns the Cloudera Certified Data Engineer credential is able to perform core competencies required to ingest, transform, store, and analyze data in Cloudera's CDH environment. The credential is earned after successfully passing the CCP Data Engineer Exam (DE575).” -- Cloudera CCP DATA ENGINEER
  • 7. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 8. • Data Ingest • Transform, Stage, Store • Data Analysis • Workflow SKILL SET
  • 9. The skills to transfer data between external systems and your cluster. This includes the following: • Import and export data between an external RDBMS and your cluster, including the ability to import specific subsets, change the delimiter and file format of imported data during ingest, and alter the data access pattern or privileges. • Ingest real-time and near-real time (NRT) streaming data into HDFS, including the ability to distribute to multiple data sources and convert data on ingest from one format to another. • Load data into and out of HDFS using the Hadoop File System (FS) commands. DATA INGEST
  • 10. Convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/HCatalog. This includes the following skills: • Convert data from one file format to another • Write your data with compression • Convert data from one set of values to another (e.g., Lat/Long to Postal Address using an external library) • Change the data format of values in a data set TRANSFORM, STAGE, STORE
  • 11. TRANSFORM, STAGE, STORE • Purge bad records from a data set, e.g., null values • Deduplication and merge data • De-normalize data from multiple disparate data sets • Evolve an Avro or Parquet schema • Partition an existing data set according to one or more partition keys • Tune data for optimal query performance
  • 12. Filter, sort, join, aggregate, and/or transform one or more data sets in a given format stored in HDFS to produce a specified result. All of these tasks may include reading from Parquet, Avro, JSON, delimited text, and natural language text. The queries will include complex data types (e.g., array, map, struct), the implementation of external libraries, partitioned data, compressed data, and require the use of metadata from Hive/HCatalog. DATA ANALYSIS
  • 13. • Write a query to aggregate multiple rows of data • Write a query to calculate aggregate statistics (e.g., average or sum) • Write a query to filter data • Write a query that produces ranked or sorted data • Write a query that joins multiple data sets • Read and/or create a Hive or an HCatalog table from existing data in HDFS DATA ANALYSIS
  • 14. The ability to create and execute various jobs and actions that move data towards greater value and use in a system. This includes the following skills: • Create and execute a linear workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc. • Create and execute a branching workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom action, etc. • Orchestrate a workflow to execute regularly at predefined times, including workflows that have data dependencies WORK FLOW
  • 15. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 16. • Familiar with all related command tools • Use Cloudera quickstart VW to practice • Take sample test
  • 17. • Hive, Impala, Sqoop, Spark, Crunch, Pig, Kite, Avro, Parquet, Cloudera HUE, oozie, Flume, DataFu, JDK 7 API Docs, Python 2.7 , Python 3.4 , Scala Only the above documentation are accessible during the exam. FAMILIAR WITH ALL RELATED COMMAND TOOLS
  • 23. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 24. • Create an account at www.examslocal.com. • Select the exam • Choose a date and time • Select a time slot for exam • Pass the compatibility tool and install the screen sharing Chrome Extension STEPS TO SCHEDULE EXAM
  • 25. • The exam is remote, it takes less then 2 hour • Partly open book exam. • Some documentation are available online during the exam • All other websites, including Google/search functionality is disabled. No notes or other exam aids. WHEN EXAM
  • 26. • Introduction • Skill set • How to prepare • Register test • Reference & etc. OUTLINE
  • 27. • Apache & Cloudera official documents • My website: https://godataengineer.wordpress.com/ USEFUL LINKS
  • 29. • Data Warehouse for Machine Learning app • Using Flume, Hive, HDFS, Spark and Phoenix USE CASE