Big Data Tools in AWS
Scott (考特)
Sep. 29th, 2021 (Wed)
AWS User Group
Oh. So. CDK
Scott (考特)
Shu-Jeng, Hsieh
● Sr. Data Engineer, the 104
● AWS Community Builder
Agenda
Amazon EMR
AWS Glue 3.0
Amazon EMR
HCatalog
Transient or long-running clusters
Long-running and auto scaling
1. Great for lines of business leaders
2. Great for short-running jobs or ad hoc queries
3. Ideal to save costs for multi-tenanted data science and data engineering jobs
Example use cases:
● Notebooks
● Ad hoc jobs and experimentation
● Streaming

Transient and job scoped
1. Works well for job-scoped pipelines
2. Reduces blast radius
3. Easier to upgrade clusters and restart jobs
Example use cases:
● Large-scale transformation
● ETL to other DWH or Data Lake
● Building ML jobs
Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech
Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>
Automation
● API requests via the AWS SDK
● API requests via AWS Lambda
● State machine (AWS Step Functions)
● AWS Data Pipeline
Deployment options
● Amazon EC2
● Amazon EKS
● AWS Outposts
Ad hoc workloads, data pipelines, data science notebooks
Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to
Unify 200 Billion Records 5x Faster than On-Premises
[Architecture diagram. Labels: Amazon S3 (source data), Amazon EMR (prepare data), Amazon S3 (marts), Launch Service, Use data (JDBC access, Zeppelin, Airflow pipelines)]
Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How
Nielsen built a multi-petabyte data platform using Amazon EMR.
import * as cdk from '@aws-cdk/core';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
  /**
   * Specifies the step concurrency level to allow multiple steps to run in parallel.
   *
   * Requires EMR release label 5.28.0 or above.
   * Must be in range [1, 256].
   *
   * @default 1 - no step concurrency allowed
   */
  readonly stepConcurrencyLevel?: number;
}

class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
  protected readonly stepConcurrencyLevel: number;

  constructor(
    scope: cdk.Construct,
    id: string,
    props: ExtendedEmrCreateClusterProps
  ) {
    super(scope, id, props);
    this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
  }

  // Override the rendered task so that its Parameters also carry StepConcurrencyLevel.
  protected _renderTask(): any {
    const originalObject = super._renderTask();
    const extensionObject = {};
    Object.assign(extensionObject, originalObject, {
      Parameters: {
        StepConcurrencyLevel: cdk.numberToCloudFormation(
          this.stepConcurrencyLevel
        ),
        // Spread last: parameters already rendered by the base class take precedence.
        ...originalObject.Parameters,
      },
    });
    return extensionObject;
  }
}
CDK issues
● #15223
● #15242
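The merge performed in `_renderTask()` above can be tried out without CDK. The following is a minimal sketch on plain objects: `RenderedTask`, `withStepConcurrency`, and the sample cluster values are hypothetical names made up for illustration, and the range check is an addition based on the `[1, 256]` note in the doc comment (the slide code itself does not validate).

```typescript
// Hypothetical stand-in for the object returned by super._renderTask();
// the shape is simplified for illustration and is not copied from CDK output.
interface RenderedTask {
  Resource: string;
  Parameters: Record<string, unknown>;
}

// Returns a copy of the rendered task with StepConcurrencyLevel added.
function withStepConcurrency(original: RenderedTask, level: number): RenderedTask {
  // Assumption: enforce the [1, 256] range mentioned in the doc comment.
  if (!Number.isInteger(level) || level < 1 || level > 256) {
    throw new RangeError(`stepConcurrencyLevel must be an integer in [1, 256], got ${level}`);
  }
  return {
    ...original,
    Parameters: {
      StepConcurrencyLevel: level,
      // Spread last, so a key already present in the original wins.
      ...original.Parameters,
    },
  };
}

// Illustrative input only.
const rendered: RenderedTask = {
  Resource: 'arn:aws:states:::elasticmapreduce:createCluster.sync',
  Parameters: { Name: 'demo-cluster', ReleaseLabel: 'emr-6.3.0' },
};

const extended = withStepConcurrency(rendered, 4);
console.log(JSON.stringify(extended.Parameters, null, 2));
```

Because the original parameters are spread last, this approach only adds the key when the base class has not rendered it; if a later CDK release starts emitting `StepConcurrencyLevel` itself, the base value would silently win.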
import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
tasks.EmrSetClusterTerminationProtection
tasks.EmrAddStep
tasks.EmrTerminateCluster
sfn.Choice
sfn.Condition
sfn.Parallel
Constructs that you'll encounter frequently
// Declare the Parallel state first; the branches are attached to it below.
const dataMovementParallel = new sfn.Parallel(
  this,
  'Do some complex things in an EMR Cluster',
  {
    // Discard the branches' combined output so the state input passes through unchanged.
    resultPath: sfn.JsonPath.DISCARD,
  }
);

// Branch 1: download and run a shell script on the cluster.
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
    name: 'modify metadata',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'bash',
      '-c',
      [
        `aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .`,
        'chmod +x modify_meta_database.sh',
        './modify_meta_database.sh',
        'rm modify_meta_database.sh',
      ].join('; '),
    ],
  })
);

// Branch 2: submit a PySpark ETL job to YARN in cluster mode.
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Some ETL', {
    name: 'Execute an ETL',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'spark-submit',
      '--deploy-mode',
      'cluster',
      '--master',
      'yarn',
      '--num-executors',
      '2',
      '--executor-cores',
      '8',
      '--executor-memory',
      '12g',
      '--conf',
      'spark.yarn.submit.waitAppCompletion=true',
      `s3://${this.demoBucketName}/etl/spark-etl.py`,
    ],
  })
);
Example of assigning parallel tasks to an EMR cluster
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';

const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
  stateMachineName: stateMachineName,
  definition: shouldLaunchCluster,
});

// Input passed to every scheduled execution of the state machine.
const stateMachineTarget = new targets.SfnStateMachine(stateMachine, {
  input: events.RuleTargetInput.fromObject({
    LaunchCluster: true,
    TerminateCluster: false,
  }),
});

const stateMachineRule = new events.Rule(this, 'StateMachineRule', {
  // 00:20 UTC, Monday to Friday.
  schedule: events.Schedule.expression('cron(20 0 ? * Mon-Fri *)'),
  ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
  enabled: true,
  description:
    'An event rule to launch an EMR cluster via AWS Step Functions.',
});
stateMachineRule.addTarget(stateMachineTarget);
Example of a production workload
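The schedule string in the rule above uses the six-field EventBridge cron format (minutes, hours, day of month, month, day of week, year), evaluated in UTC. As a rough aid for reading such expressions, here is a small sketch that splits one into named fields. `parseEventBridgeCron` and `EventBridgeCron` are hypothetical helpers written for this note, not part of CDK or the AWS SDK, and the validation covers only the field count and the `?` rule.

```typescript
// Named view of the six EventBridge cron fields.
interface EventBridgeCron {
  minutes: string;
  hours: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
  year: string;
}

// Hypothetical helper: splits 'cron(...)' into its six fields.
function parseEventBridgeCron(expression: string): EventBridgeCron {
  const match = /^cron\((.*)\)$/.exec(expression.trim());
  const body = match?.[1];
  if (body === undefined) {
    throw new Error(`Not a cron(...) expression: ${expression}`);
  }
  const fields = body.trim().split(/\s+/);
  if (fields.length !== 6) {
    throw new Error(`Expected 6 fields, got ${fields.length}`);
  }
  const [minutes, hours, dayOfMonth, month, dayOfWeek, year] =
    fields as [string, string, string, string, string, string];
  // Day-of-month and day-of-week cannot both be specified; one must be '?'.
  if (dayOfMonth !== '?' && dayOfWeek !== '?') {
    throw new Error("One of day-of-month or day-of-week must be '?'");
  }
  return { minutes, hours, dayOfMonth, month, dayOfWeek, year };
}

const schedule = parseEventBridgeCron('cron(20 0 ? * Mon-Fri *)');
console.log(schedule);
```

Read this way, the rule on the slide fires at minute 20 of hour 0 on weekdays, in every month and year.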
Common Errors in
AWS Step Functions
https://docs.aws.amazon.com/step-functions/latest/apireference/CommonErrors.html
src
├── custom-glue-workflows.ts
├── emr-scripts
│   ├── bootstrap
│   │   ├── modify_meta_database.sh
│   │   └── update_ssm.sh
│   ├── hive
│   │   ├── hive_create_database.q
│   │   └── hive_create_table.q
│   └── spark
│       └── remove-auto.py
├── glue-resources.ts
├── glue-scripts
│   ├── app-team
│   │   └── resume_detection.py
│   └── schema
│       └── update_workflow_property.py
AWS Glue
[Diagram. Labels: AWS Glue, AWS Services, On-premises, Big Data, Data Warehouse, SaaS, Cross-cloud, Data Store]
● Inferring schema, detecting data drift, keeping metadata up to date
● Reusable data pipelines, event-triggered workflow
● Visual data preparation tool for data analysis
● Materialized views
AWS Glue 3.0
● Performance-optimized Spark runtime
○ upgrading from Spark 2.4 to Spark 3.1.1
○ upgraded JDBC drivers
● Faster read and write access
● Faster and more efficient partition pruning
● Fine-grained access control
● ACID transactions
● Improved user experience for monitoring, debugging,
and tuning Spark applications
Xue, C. and Zhou, Y., 2021. Building a SIMD Supported Vectorized Native Engine for
Spark SQL. [video] Available at: <https://youtu.be/hwAzodnaqa0>
Shuffled hash join improvement (SPARK-32461)
● Preserve shuffled hash join build side partitioning (SPARK-32330)
● Preserve hash join (BHJ and SHJ) stream side ordering (SPARK-32383)
● Coalesce bucketed tables for shuffled hash join (SPARK-32286)
● Add code-gen for shuffled hash join (SPARK-32421)
● Support full outer join in shuffled hash join (SPARK-32399)
Event-triggered Glue workflow
{
  "Sid": "S3Event1",
  "Effect": "Allow",
  "Principal": {
    "Service": "cloudtrail.amazonaws.com"
  },
  "Action": "s3:GetBucketAcl",
  "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1"
},
{
  "Sid": "S3Event2",
  "Effect": "Allow",
  "Principal": {
    "Service": "cloudtrail.amazonaws.com"
  },
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1/hakunamatata/*",
  "Condition": {
    "StringEquals": {
      "s3:x-amz-acl": "bucket-owner-full-control"
    }
  }
}
S3 Bucket Policy
import * as cdk from '@aws-cdk/core';
import * as cloudtrail from '@aws-cdk/aws-cloudtrail';
import * as s3 from '@aws-cdk/aws-s3';

// Trail that delivers its logs to an existing bucket.
const glueWorkflowTrail = new cloudtrail.Trail(this, 'GlueWorkflowTrail', {
  trailName: 's3-events-trail',
  bucket: s3.Bucket.fromBucketName(
    this,
    'DemoBucket',
    `scott-demo-events-${cdk.Aws.REGION}`
  ),
  s3KeyPrefix: 'event-folder',
});

// Record only write events for objects under the watched prefix.
glueWorkflowTrail.addS3EventSelector(
  [
    {
      bucket: s3.Bucket.fromBucketName(
        this,
        'BucketWhereFileWillBePut',
        // Backticks are required here for the region to be interpolated.
        `scott-target-bucket-${cdk.Aws.REGION}`
      ),
      objectPrefix: 'hakunamatata/',
    },
  ],
  {
    includeManagementEvents: false,
    readWriteType: cloudtrail.ReadWriteType.WRITE_ONLY,
  }
);
CloudTrail configuration
import * as glue from '@aws-cdk/aws-glue';

const glueWorkFlow = new glue.CfnWorkflow(this, 'GlueWorkFlow', {
  name: 'scott-demo-glue-workflow',
  description: 'A Glue workflow',
});

// CfnWorkflow does not expose an ARN attribute, so it is assembled by hand.
const glueWorkFlowArn =
  `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:workflow/${glueWorkFlow.name}`;
Glue workflow
import * as events from '@aws-cdk/aws-events';

const eventRule = new events.Rule(this, 'FileDetectionRule', {
  ruleName: 'event-glue-workflow-trigger',
  description: 'An event rule to trigger the Glue workflow',
  eventPattern: {
    source: ['aws.s3'],
    detailType: ['AWS API Call via CloudTrail'],
    detail: {
      eventSource: ['s3.amazonaws.com'],
      eventName: ['PutObject'],
      requestParameters: {
        // Backticks are required here for the region to be interpolated.
        bucketName: [`scott-target-bucket-${cdk.Aws.REGION}`],
        key: [{ prefix: 'hakunamatata/' }],
      },
    },
  },
});
Amazon EventBridge
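To get a feel for which CloudTrail events the pattern above selects, the following sketch implements a deliberately simplified matcher: exact string values and `{ prefix }` matchers only, with nested objects matched recursively. It is an approximation written for illustration, not EventBridge's actual matching engine; `matches`, `Pattern`, and the sample event are hypothetical, and the bucket name is the one from the bucket policy slide.

```typescript
type Matcher = string | { prefix: string };
interface Pattern {
  [key: string]: Matcher[] | Pattern;
}
type EventNode = { [key: string]: unknown };

// Simplified model: every pattern key must be present in the event and match.
function matches(pattern: Pattern, event: EventNode): boolean {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event[key];
    if (Array.isArray(expected)) {
      // Leaf: the event value must satisfy at least one listed matcher.
      if (typeof actual !== 'string') return false;
      return expected.some((m) =>
        typeof m === 'string' ? m === actual : actual.startsWith(m.prefix)
      );
    }
    // Nested object: recurse into the corresponding part of the event.
    return (
      typeof actual === 'object' &&
      actual !== null &&
      matches(expected, actual as EventNode)
    );
  });
}

const pattern: Pattern = {
  source: ['aws.s3'],
  'detail-type': ['AWS API Call via CloudTrail'],
  detail: {
    eventSource: ['s3.amazonaws.com'],
    eventName: ['PutObject'],
    requestParameters: {
      bucketName: ['scott-target-bucket-ap-northeast-1'],
      key: [{ prefix: 'hakunamatata/' }],
    },
  },
};

// Hypothetical event, trimmed to the fields the pattern inspects.
const makeEvent = (eventName: string, key: string): EventNode => ({
  source: 'aws.s3',
  'detail-type': 'AWS API Call via CloudTrail',
  detail: {
    eventSource: 's3.amazonaws.com',
    eventName,
    requestParameters: { bucketName: 'scott-target-bucket-ap-northeast-1', key },
  },
});

console.log(matches(pattern, makeEvent('PutObject', 'hakunamatata/data.csv')));
console.log(matches(pattern, makeEvent('PutObject', 'elsewhere/data.csv')));
```

Note that the pattern lists only `PutObject`; uploads that arrive through other S3 API calls (for example multipart uploads) would need additional event names if they should also start the workflow.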
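// Drop down to the underlying CfnRule to set the Glue workflow ARN as a raw target.
// eventBridgeExecutionRole is an IAM role defined elsewhere in the stack.
const cfnEventRule = eventRule.node.defaultChild as events.CfnRule;
cfnEventRule.targets = [
  {
    arn: glueWorkFlowArn,
    id: 'CloudTrailTriggersWorkflow',
    roleArn: eventBridgeExecutionRole.roleArn,
  },
];
Set target as the Glue workflow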
import * as cr from '@aws-cdk/custom-resources';
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';

// SDK call that sets the event batching condition on an existing Glue trigger.
const updateTriggerSdkCall: cr.AwsSdkCall = {
  service: 'Glue',
  action: 'updateTrigger',
  parameters: {
    Name: triggerEntity.name,
    TriggerUpdate: {
      Actions: [
        {
          JobName: jobName,
          Timeout: 1, // job timeout, in minutes
        },
      ],
      Description: triggerEntity.description,
      EventBatchingCondition: {
        BatchSize: 8, // number of events
        BatchWindow: 120, // seconds
      },
    },
  },
  // A timestamp-based ID changes on every synthesis, so the call is re-run on each deployment.
  physicalResourceId: cr.PhysicalResourceId.of(Date.now().toString()),
};

new cr.AwsCustomResource(this, id + 'CustomResource', {
  onCreate: updateTriggerSdkCall,
  onUpdate: updateTriggerSdkCall,
  policy: cr.AwsCustomResourcePolicy.fromStatements([
    new iam.PolicyStatement({
      actions: ['glue:UpdateTrigger'],
      resources: [
        `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:trigger/*`,
      ],
    }),
  ]),
  logRetention: logs.RetentionDays.ONE_WEEK,
});
} // end of the enclosing block (its opening is not shown on the slide)
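The `EventBatchingCondition` above (`BatchSize: 8`, `BatchWindow: 120`) can be read as: start the workflow once 8 events have arrived, or 120 seconds after the first event of the batch, whichever comes first. The sketch below simulates that reading on a list of arrival times. It is a toy model under that assumption, not Glue's implementation; `simulateFirings` and `BatchingCondition` are hypothetical names, and the handling of an event that arrives exactly at window expiry is a guess.

```typescript
interface BatchingCondition {
  batchSize: number;   // events needed to fire immediately
  batchWindow: number; // seconds after the first event of a batch
}

// Returns the times (in seconds) at which the trigger would fire.
function simulateFirings(arrivals: number[], cond: BatchingCondition): number[] {
  const firings: number[] = [];
  let windowStart: number | undefined;
  let count = 0;

  for (const t of [...arrivals].sort((a, b) => a - b)) {
    // The open window expired before this event arrived: fire at expiry.
    if (windowStart !== undefined && t >= windowStart + cond.batchWindow) {
      firings.push(windowStart + cond.batchWindow);
      windowStart = undefined;
      count = 0;
    }
    // The first event of a new batch opens the window.
    if (windowStart === undefined) {
      windowStart = t;
    }
    count += 1;
    // Enough events collected: fire right away.
    if (count >= cond.batchSize) {
      firings.push(t);
      windowStart = undefined;
      count = 0;
    }
  }
  // A partially filled batch still fires when its window runs out.
  if (windowStart !== undefined) {
    firings.push(windowStart + cond.batchWindow);
  }
  return firings;
}

const condition: BatchingCondition = { batchSize: 8, batchWindow: 120 };
// Eight quick uploads, then two stragglers.
console.log(simulateFirings([0, 1, 2, 3, 4, 5, 6, 7, 200, 210], condition));
```

With these settings, a burst of eight files starts the workflow as soon as the eighth lands, while a trickle of one or two files waits out the full two-minute window.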
https://databricks.com/product/delta-lake-on-databricks
Ask Me Anything
