Big Data Tools in AWS
- 1. Big Data Tools in AWS
Scott (考特)
Sep. 29th, 2021 (Wed)
AWS User Group
Oh. So. CDK
- 6. Transient or long-running clusters

Long-running and auto scaling
1. Great for lines of business leaders
2. Great for short-running jobs or ad hoc queries
3. Ideal to save costs for multi-tenanted data science and data engineering jobs
Example use cases:
● Notebooks
● Ad-hoc jobs and experimentation
● Streaming

Transient and job scoped
1. Works well for job-scoped pipelines
2. Reduces blast radius
3. Easier to upgrade clusters and restart jobs
Example use cases:
● Large-scale transformation
● ETL to other DWH or Data Lake
● Building ML jobs
Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech
Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>
- 9. Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to
Unify 200 Billion Records 5x Faster than On-Premises
- 10. [Architecture diagram: Airflow pipelines launch Amazon EMR, which prepares source data from Amazon S3 into S3 data marts; the prepared data is then accessed via JDBC and Zeppelin]
Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How
Nielsen built a multi-petabyte data platform using Amazon EMR.
- 12. import * as cdk from '@aws-cdk/core';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
/**
* Specifies the step concurrency level to allow multiple steps to run in parallel
*
* Requires EMR release label 5.28.0 or above.
* Must be in range [1, 256].
*
* @default 1 - no step concurrency allowed
*/
readonly stepConcurrencyLevel?: number;
}
class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
protected readonly stepConcurrencyLevel: number;
constructor(
scope: cdk.Construct,
id: string,
props: ExtendedEmrCreateClusterProps
) {
super(scope, id, props);
this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
}
protected _renderTask(): any {
const originalTask = super._renderTask();
// Inject the extra parameter into the rendered task. The base class's
// parameters are spread last, so they win on any key collision.
return {
...originalTask,
Parameters: {
StepConcurrencyLevel: cdk.numberToCloudFormation(
this.stepConcurrencyLevel
),
...originalTask.Parameters,
},
};
}
}
CDK issues
● #15223
● #15242
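The override in `_renderTask` relies on object-spread ordering: the extension key is written first and the parameters rendered by the base class are spread afterwards, so existing keys always win on collision. A stripped-down, plain-object illustration of that merge (the task shape and values here are made up for demonstration):

```typescript
// Plain-object illustration of the Parameters merge used in _renderTask.
// Keys spread later override keys spread earlier, so anything the base
// class already rendered takes precedence over the injected extension key.
const originalTask = {
  Resource: 'arn:aws:states:::elasticmapreduce:createCluster.sync',
  Parameters: { Name: 'demo-cluster', ReleaseLabel: 'emr-5.33.0' },
};

const merged = {
  ...originalTask,
  Parameters: {
    StepConcurrencyLevel: 10,
    ...originalTask.Parameters,
  },
};

console.log(merged.Parameters);
// { StepConcurrencyLevel: 10, Name: 'demo-cluster', ReleaseLabel: 'emr-5.33.0' }
```

Because the base class's `Parameters` are spread last, the subclass can never accidentally clobber a parameter the framework already rendered.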
- 13. import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
tasks.EmrSetClusterTerminationProtection
tasks.EmrAddStep
tasks.EmrTerminateCluster
sfn.Choice
sfn.Condition
sfn.Parallel
Constructs you’ll encounter frequently
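Put together, these constructs cover the whole transient-cluster lifecycle. A minimal sketch of wiring them up inside a stack (cluster name, release label, instance config, and the `$.LaunchCluster` input flag are illustrative assumptions, not taken from the talk):

```typescript
import * as cdk from '@aws-cdk/core';
import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';

class TransientEmrStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Launch the cluster and wait until it is ready (RUN_JOB = .sync integration).
    const createCluster = new tasks.EmrCreateCluster(this, 'CreateCluster', {
      name: 'transient-demo-cluster', // assumed name
      releaseLabel: 'emr-5.33.0',
      applications: [{ name: 'Spark' }],
      instances: {}, // placeholder instance config
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
    });

    // Protect the cluster while steps run; DISCARD keeps $.ClusterId intact.
    const protect = new tasks.EmrSetClusterTerminationProtection(this, 'Protect', {
      clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
      terminationProtected: true,
      resultPath: sfn.JsonPath.DISCARD,
    });

    const runStep = new tasks.EmrAddStep(this, 'RunStep', {
      clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
      name: 'demo-step',
      jar: 'command-runner.jar',
      args: ['bash', '-c', 'echo hello'],
      actionOnFailure: tasks.ActionOnFailure.CONTINUE,
      resultPath: sfn.JsonPath.DISCARD,
    });

    const unprotect = new tasks.EmrSetClusterTerminationProtection(this, 'Unprotect', {
      clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
      terminationProtected: false,
      resultPath: sfn.JsonPath.DISCARD,
    });

    const terminate = new tasks.EmrTerminateCluster(this, 'Terminate', {
      clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    });

    // Only launch when the input asks for it.
    const definition = new sfn.Choice(this, 'ShouldLaunchCluster')
      .when(
        sfn.Condition.booleanEquals('$.LaunchCluster', true),
        createCluster.next(protect).next(runStep).next(unprotect).next(terminate)
      )
      .otherwise(new sfn.Pass(this, 'SkipLaunch'));

    new sfn.StateMachine(this, 'StateMachine', { definition });
  }
}
```

Discarding the task results on the intermediate states keeps `$.ClusterId` from the create step available to every later state.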
- 14. const dataMovementParallel = new sfn.Parallel(
this,
'Do some complex things in an EMR Cluster',
{
resultPath: sfn.JsonPath.DISCARD,
}
);
dataMovementParallel.branch(
new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
name: 'modify metadata',
clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
actionOnFailure: tasks.ActionOnFailure.CONTINUE,
jar: 'command-runner.jar',
args: [
'bash',
'-c',
`aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .;
chmod +x modify_meta_database.sh;
./modify_meta_database.sh;
rm modify_meta_database.sh;`,
],
})
);
dataMovementParallel.branch(
new tasks.EmrAddStep(this, 'Some ETL', {
name: 'Execute an ETL',
clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
actionOnFailure: tasks.ActionOnFailure.CONTINUE,
jar: 'command-runner.jar',
args: [
'spark-submit',
'--deploy-mode',
'cluster',
'--master',
'yarn',
'--num-executors',
'2',
'--executor-cores',
'8',
'--executor-memory',
'12g',
'--conf',
'spark.yarn.submit.waitAppCompletion=true',
`s3://${this.demoBucketName}/etl/spark-etl.py`,
],
})
);
Example of assigning parallel tasks to an EMR cluster
- 15. import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
stateMachineName: stateMachineName,
definition: shouldLaunchCluster,
});
const stateMachineTarget = new targets.SfnStateMachine(
stateMachine,
{
input: events.RuleTargetInput.fromObject({
LaunchCluster: true,
TerminateCluster: false,
}),
}
);
const stateMachineRule = new events.Rule(
this,
'StateMachineRule',
{
schedule: events.Schedule.expression(`cron(20 0 ? * Mon-Fri *)`),
ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
enabled: true,
description:
'An event rule to launch an EMR cluster via AWS Step Functions.',
}
);
stateMachineRule.addTarget(stateMachineTarget);
- 17. Common Errors in
AWS Step Functions
https://docs.aws.amazon.com/step-functions/latest/apireference/CommonErrors.html
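Many of these errors can be handled directly on the task constructs. A hedged sketch of adding retry and catch behaviour to an EMR step (the error names, durations, and script path are illustrative; `this` is the surrounding stack, as in the earlier snippets):

```typescript
// Retry throttling-style failures with exponential backoff, and route any
// remaining failure to a clean-up branch that terminates the cluster.
const etlStep = new tasks.EmrAddStep(this, 'EtlStep', {
  clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
  name: 'etl',
  jar: 'command-runner.jar',
  args: ['spark-submit', `s3://${this.demoBucketName}/etl/spark-etl.py`],
});

etlStep.addRetry({
  errors: ['ThrottlingException', 'EMR.AmazonElasticMapReduceException'],
  interval: cdk.Duration.seconds(10),
  maxAttempts: 3,
  backoffRate: 2,
});

etlStep.addCatch(
  new tasks.EmrTerminateCluster(this, 'TerminateOnFailure', {
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
  }),
  { errors: ['States.ALL'], resultPath: '$.error' }
);
```

Catching `States.ALL` into a terminate branch keeps a failed step from leaving an orphaned, billable cluster behind.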
- 18. src
├── custom-glue-workflows.ts
├── emr-scripts
│ ├── bootstrap
│ │ ├── modify_meta_database.sh
│ │ └── update_ssm.sh
│ ├── hive
│ │ ├── hive_create_database.q
│ │ └── hive_create_table.q
│ └── spark
│ └── remove-auto.py
├── glue-resources.ts
├── glue-scripts
│ ├── app-team
│ │ └── resume_detection.py
│ └── schema
│ └── update_workflow_property.py
- 21. ● Inferring schema, detecting data drift, keeping metadata up to date
● Reusable data pipelines, event-triggered workflows
● Visual data preparation tool for data analysis
● Materialized views
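The schema-inference side of this ("inferring schema, detecting data drift, keeping metadata up to date") maps to a Glue crawler. A minimal sketch using the L1 construct (the crawler name, database, S3 path, and `crawlerRole` are assumptions; the role would be an IAM role defined elsewhere in the stack):

```typescript
import * as glue from '@aws-cdk/aws-glue';

// A crawler that scans an S3 prefix on a schedule, infers the schema, and
// updates the Data Catalog when the schema drifts.
const crawler = new glue.CfnCrawler(this, 'DemoCrawler', {
  name: 'scott-demo-crawler', // assumed name
  role: crawlerRole.roleArn, // an IAM role defined elsewhere
  databaseName: 'scott_demo_db', // assumed database
  targets: {
    s3Targets: [{ path: 's3://scott-demo-bucket/raw/' }], // assumed path
  },
  schemaChangePolicy: {
    updateBehavior: 'UPDATE_IN_DATABASE',
    deleteBehavior: 'DEPRECATE_IN_DATABASE',
  },
  schedule: { scheduleExpression: 'cron(0 1 * * ? *)' },
});
```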
- 23. ● Performance-optimized Spark runtime
○ upgrading from Spark 2.4 to Spark 3.1.1
○ upgraded JDBC drivers
● Faster read and write access
● Faster and more efficient partition pruning
● Fine-grained access control
● ACID transactions
● Improved user experience for monitoring, debugging,
and tuning Spark applications
- 24. Xue, C. and Zhou, Y., 2021. Building a SIMD Supported Vectorized Native Engine for
Spark SQL. [video] Available at: <https://youtu.be/hwAzodnaqa0>
- 25. Shuffled hash join improvement (SPARK-32461)
● Preserve shuffled hash join build side partitioning (SPARK-32330)
● Preserve hash join (BHJ and SHJ) stream side ordering (SPARK-32383)
● Coalesce bucketed tables for shuffled hash join (SPARK-32286)
● Add code-gen for shuffled hash join (SPARK-32421)
● Support full outer join in shuffled hash join (SPARK-32399)
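To have Spark actually pick shuffled hash join over the default sort-merge join, the relevant session configs can be passed through `spark-submit`, following the same `EmrAddStep` pattern shown earlier. A hedged sketch (script path and step name are illustrative):

```typescript
const shjStep = new tasks.EmrAddStep(this, 'ShuffledHashJoinJob', {
  clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
  name: 'join-heavy ETL',
  jar: 'command-runner.jar',
  args: [
    'spark-submit',
    '--deploy-mode', 'cluster',
    // Prefer shuffled hash join over sort-merge join when applicable.
    '--conf', 'spark.sql.join.preferSortMergeJoin=false',
    // Disable broadcast joins so the planner cannot fall back to BHJ.
    '--conf', 'spark.sql.autoBroadcastJoinThreshold=-1',
    `s3://${this.demoBucketName}/etl/join-job.py`,
  ],
});
```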
- 28. {
"Sid": "S3Event1",
"Effect": "Allow",
"Principal": {
"Service": "cloudtrail.amazonaws.com"
},
"Action": "s3:GetBucketAcl",
"Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1"
},
{
"Sid": "S3Event2",
"Effect": "Allow",
"Principal": {
"Service": "cloudtrail.amazonaws.com"
},
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1/hakunamatata/*",
"Condition": {
"StringEquals": {
"s3:x-amz-acl": "bucket-owner-full-control"
}
}
}
S3 Bucket Policy
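Since the rest of the pipeline is defined in CDK, the same two statements can be attached with `addToResourcePolicy` instead of raw JSON. A sketch, assuming `targetBucket` is a `Bucket` created in the same stack (buckets imported via `fromBucketName` silently ignore `addToResourcePolicy`):

```typescript
import * as iam from '@aws-cdk/aws-iam';

// CDK equivalent of the two bucket-policy statements above.
targetBucket.addToResourcePolicy(
  new iam.PolicyStatement({
    sid: 'S3Event1',
    effect: iam.Effect.ALLOW,
    principals: [new iam.ServicePrincipal('cloudtrail.amazonaws.com')],
    actions: ['s3:GetBucketAcl'],
    resources: [targetBucket.bucketArn],
  })
);
targetBucket.addToResourcePolicy(
  new iam.PolicyStatement({
    sid: 'S3Event2',
    effect: iam.Effect.ALLOW,
    principals: [new iam.ServicePrincipal('cloudtrail.amazonaws.com')],
    actions: ['s3:PutObject'],
    resources: [targetBucket.arnForObjects('hakunamatata/*')],
    conditions: {
      StringEquals: { 's3:x-amz-acl': 'bucket-owner-full-control' },
    },
  })
);
```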
- 29. import * as cloudtrail from '@aws-cdk/aws-cloudtrail';
import * as s3 from '@aws-cdk/aws-s3';
const glueWorkflowTrail = new cloudtrail.Trail(this, 'GlueWorkflowTrail', {
trailName: `s3-events-trail`,
bucket: s3.Bucket.fromBucketName(
this,
'DemoBucket',
`scott-demo-events-${cdk.Aws.REGION}`
),
s3KeyPrefix: 'event-folder',
});
glueWorkflowTrail.addS3EventSelector(
[
{
bucket: s3.Bucket.fromBucketName(
this,
'BucketWhereFileWillBePut',
`scott-target-bucket-${cdk.Aws.REGION}`
),
objectPrefix: 'hakunamatata/',
},
],
{
includeManagementEvents: false,
readWriteType: cloudtrail.ReadWriteType.WRITE_ONLY,
}
);
CloudTrail
configuration
- 30. import * as glue from '@aws-cdk/aws-glue';
const glueWorkFlow = new glue.CfnWorkflow(this, 'GlueWorkFlow', {
name: `scott-demo-glue-workflow`,
description:
'A Glue workflow',
});
const glueWorkFlowArn = `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:workflow/${glueWorkFlow.name}`;
Glue workflow
- 31. import * as events from '@aws-cdk/aws-events';
const eventRule = new events.Rule(this, 'FileDetectionRule', {
ruleName: `event-glue-workflow-trigger`,
description:
'An event rule to trigger the Glue workflow',
eventPattern: {
source: ['aws.s3'],
detailType: ['AWS API Call via CloudTrail'],
detail: {
['eventSource']: ['s3.amazonaws.com'],
['eventName']: ['PutObject'],
['requestParameters']: {
bucketName: [`scott-target-bucket-${cdk.Aws.REGION}`],
key: [{ prefix: 'hakunamatata/' }],
},
},
},
});
Amazon EventBridge
- 32. const cfnEventRule = eventRule.node.defaultChild as events.CfnRule;
cfnEventRule.targets = [
{
arn: glueWorkFlowArn,
id: 'CloudTrailTriggersWorkflow',
roleArn: eventBridgeExecutionRole.roleArn,
},
];
Set target as the Glue workflow
import * as cdk from '@aws-cdk/core';
import * as cr from '@aws-cdk/custom-resources';
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';
const updateTriggerSdkCall: cr.AwsSdkCall = {
service: 'Glue',
action: 'updateTrigger',
parameters: {
Name: triggerEntity.name,
TriggerUpdate: {
Actions: [
{
JobName: jobName,
Timeout: 1,
},
],
Description: triggerEntity.description,
EventBatchingCondition: {
BatchSize: 8,
BatchWindow: 120,
},
},
},
physicalResourceId: cr.PhysicalResourceId.of(Date.now().toString()),
};
new cr.AwsCustomResource(this, id + 'CustomResource', {
onCreate: updateTriggerSdkCall,
onUpdate: updateTriggerSdkCall,
policy: cr.AwsCustomResourcePolicy.fromStatements([
new iam.PolicyStatement({
actions: ['glue:UpdateTrigger'],
resources: [
`arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:trigger/*`,
],
}),
]),
logRetention: logs.RetentionDays.ONE_WEEK,
});}