Big Data Tools in AWS

Scott （考特）
Sep. 29th, 2021 (Wed)
AWS User Group
Oh. So. CDK

Scott (考特)
Shu-Jeng, Hsieh
● Sr. Data Engineer, the 104
● AWS Community Builder

Transient or long-running clusters
Long-running and auto scaling
1. Great for lines of business leaders
2. Great for short-running jobs or ad hoc queries
3. Ideal to save costs for multi-tenanted data science and data engineering jobs
Example use cases:
● Notebooks
● Ad hoc jobs and experimentation
● Streaming
Transient and job scoped
1. Works well for job-scoped pipelines
2. Reduces blast radius
3. Easier to upgrade clusters and restart jobs
Example use cases:
● Large-scale transformation
● ETL to other DWH or data lake
● Building ML jobs
Liem, M., 2020. Amazon EMR Deep Dive and Best Practices - AWS Online Tech
Talks. [video] Available at: <https://www.youtube.com/watch?v=dU40df0Suoo>

Automation
● API requests via the AWS SDK
● API requests via AWS Lambda
● State machine (AWS Step Functions)
● AWS Data Pipeline

Deployment options
Amazon EC2, Amazon EKS, AWS Outposts
Ad hoc workloads, data pipelines, data science notebooks

Richardson, C., Novikova, M. and Zhang, K., 2021. How Tamr Optimized Amazon EMR Workloads to
Unify 200 Billion Records 5x Faster than On-Premises

[Architecture diagram. Labels: Amazon S3 (source data); Amazon EMR (prepare data); Amazon S3 (marts); Launch service; Use data: JDBC access, Zeppelin, Airflow pipelines]
Dubrovsky, O. and Reuveni, Y., 2020. AWS re:Invent 2020: How
Nielsen built a multi-petabyte data platform using Amazon EMR.

import * as cdk from '@aws-cdk/core';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
interface ExtendedEmrCreateClusterProps extends tasks.EmrCreateClusterProps {
  /**
   * Specifies the step concurrency level to allow multiple steps to run in parallel
   *
   * Requires EMR release label 5.28.0 or above.
   * Must be in range [1, 256].
   *
   * @default 1 - no step concurrency allowed
   */
  readonly stepConcurrencyLevel?: number;
}
class ExtendedEmrCreateCluster extends tasks.EmrCreateCluster {
  protected readonly stepConcurrencyLevel: number;
  constructor(
    scope: cdk.Construct,
    id: string,
    props: ExtendedEmrCreateClusterProps
  ) {
    super(scope, id, props);
    this.stepConcurrencyLevel = props.stepConcurrencyLevel ?? 1;
  }
  // Add StepConcurrencyLevel to the task parameters rendered by the base class.
  protected _renderTask(): any {
    const originalObject = super._renderTask();
    const extensionObject = {};
    Object.assign(extensionObject, originalObject, {
      Parameters: {
        StepConcurrencyLevel: cdk.numberToCloudFormation(
          this.stepConcurrencyLevel
        ),
        // Parameters rendered by the base class take precedence.
        ...originalObject.Parameters,
      },
    });
    return extensionObject;
  }
}
CDK issues
● #15223
● #15242
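
The override above only merges one extra key into the parameters rendered by the base class. The following is a minimal sketch of that merge without CDK, assuming the rendered task is a plain object with a Parameters map. The names `RenderedTask` and `withStepConcurrency` and the sample cluster values are hypothetical, and the range check is an addition based on the doc comment; the slide code does not enforce it.

```typescript
// Hypothetical shape of a rendered task; only Parameters matters for this sketch.
interface RenderedTask {
  Resource?: string;
  Parameters: Record<string, unknown>;
}

// Returns a copy of `task` with StepConcurrencyLevel added to its Parameters.
// Keys already present in task.Parameters win, matching the spread order on the slide.
function withStepConcurrency(task: RenderedTask, level: number = 1): RenderedTask {
  if (Math.floor(level) !== level || level < 1 || level > 256) {
    throw new RangeError(`stepConcurrencyLevel must be an integer in [1, 256], got ${level}`);
  }
  return {
    ...task,
    Parameters: {
      StepConcurrencyLevel: level,
      ...task.Parameters,
    },
  };
}

// Illustrative input only; the cluster name is made up.
const rendered: RenderedTask = {
  Resource: 'arn:aws:states:::elasticmapreduce:createCluster.sync',
  Parameters: { Name: 'demo-cluster', ReleaseLabel: 'emr-5.28.0' },
};
const extended = withStepConcurrency(rendered, 4);
console.log(JSON.stringify(extended.Parameters));
```

Because the original parameters are spread last, a future base class that renders its own StepConcurrencyLevel would silently override the extension, which may or may not be the desired behavior.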

import * as sfn from '@aws-cdk/aws-stepfunctions';
import * as tasks from '@aws-cdk/aws-stepfunctions-tasks';
tasks.EmrSetClusterTerminationProtection
tasks.EmrAddStep
tasks.EmrTerminateCluster
sfn.Choice
sfn.Condition
sfn.Parallel
Constructs that you'll encounter frequently

// Assumes the sfn and tasks imports from the previous slide.
// Declare the Parallel state first so that branches can be attached to it.
const dataMovementParallel = new sfn.Parallel(
  this,
  'Do some complex things in an EMR Cluster',
  {
    resultPath: sfn.JsonPath.DISCARD,
  }
);
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Make Traditional Chinese available', {
    name: 'modify metadata',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'bash',
      '-c',
      // Download the helper script, run it, then clean up.
      `aws s3 cp s3://${this.demoBucketName}/modify_meta_database.sh .;
      chmod +x modify_meta_database.sh;
      ./modify_meta_database.sh;
      rm modify_meta_database.sh;`,
    ],
  })
);
dataMovementParallel.branch(
  new tasks.EmrAddStep(this, 'Some ETL', {
    name: 'Execute an ETL',
    clusterId: sfn.JsonPath.stringAt('$.ClusterId'),
    actionOnFailure: tasks.ActionOnFailure.CONTINUE,
    jar: 'command-runner.jar',
    args: [
      'spark-submit',
      '--deploy-mode',
      'cluster',
      '--master',
      'yarn',
      '--num-executors',
      '2',
      '--executor-cores',
      '8',
      '--executor-memory',
      '12g',
      '--conf',
      'spark.yarn.submit.waitAppCompletion=true',
      `s3://${this.demoBucketName}/etl/spark-etl.py`,
    ],
  })
);
Example of assigning parallel tasks to an EMR cluster
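
The spark-submit argument list above is long and easy to get wrong when several steps share the same shape. A minimal sketch of a helper that builds it from typed options follows. `SparkSubmitOptions`, `sparkSubmitArgs`, and the bucket name `demo-bucket` are hypothetical and not part of the talk or of CDK.

```typescript
// Hypothetical option bag for a spark-submit step; names are illustrative only.
interface SparkSubmitOptions {
  script: string;          // S3 URI of the PySpark script
  numExecutors: number;
  executorCores: number;
  executorMemory: string;  // for example '12g'
  conf?: Record<string, string>;
}

// Builds the args array passed to command-runner.jar for a cluster-mode YARN job.
function sparkSubmitArgs(opts: SparkSubmitOptions): string[] {
  const conf = opts.conf ?? {};
  const confArgs: string[] = [];
  for (const key of Object.keys(conf)) {
    confArgs.push('--conf', `${key}=${conf[key]}`);
  }
  return [
    'spark-submit',
    '--deploy-mode', 'cluster',
    '--master', 'yarn',
    '--num-executors', String(opts.numExecutors),
    '--executor-cores', String(opts.executorCores),
    '--executor-memory', opts.executorMemory,
    ...confArgs,
    opts.script,
  ];
}

// Same values as the 'Some ETL' step, with a made-up bucket name.
const etlArgs = sparkSubmitArgs({
  script: 's3://demo-bucket/etl/spark-etl.py',
  numExecutors: 2,
  executorCores: 8,
  executorMemory: '12g',
  conf: { 'spark.yarn.submit.waitAppCompletion': 'true' },
});
console.log(etlArgs.join(' '));
```

The returned array could be passed as the `args` property of `tasks.EmrAddStep`, keeping each step definition down to the values that actually differ.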

import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
// shouldLaunchCluster is the first state of the workflow, defined elsewhere.
const stateMachine = new sfn.StateMachine(this, 'StateMachine', {
  stateMachineName: stateMachineName,
  definition: shouldLaunchCluster,
});
// Input handed to the state machine on every scheduled run.
const stateMachineTarget = new targets.SfnStateMachine(
  stateMachine,
  {
    input: events.RuleTargetInput.fromObject({
      LaunchCluster: true,
      TerminateCluster: false,
    }),
  }
);
const stateMachineRule = new events.Rule(
  this,
  'StateMachineRule',
  {
    schedule: events.Schedule.expression(`cron(20 0 ? * Mon-Fri *)`),
    ruleName: `${process.env.DEPLOYMENT_ENV}-sql-analytics-statemachine-rule`,
    enabled: true,
    description:
      'An event rule to launch an EMR cluster via AWS Step Functions.',
  }
);
stateMachineRule.addTarget(stateMachineTarget);
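
The schedule string above follows the six-field EventBridge cron format (minutes, hours, day of month, month, day of week, year) and is evaluated in UTC, so the rule fires at 00:20 UTC from Monday to Friday. EventBridge does not allow day of month and day of week to be specified together, so one of them must be `?`. The following sketch builds such an expression and checks that restriction; `CronFields` and `eventBridgeCron` are hypothetical helper names, not CDK APIs.

```typescript
// Fields in the order EventBridge expects them inside cron(...).
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
  year: string;
}

// Assembles a cron(...) schedule expression and rejects inputs where
// dayOfMonth and dayOfWeek are both set or both left as '?'.
function eventBridgeCron(f: CronFields): string {
  const dayOfMonthUnset = f.dayOfMonth === '?';
  const dayOfWeekUnset = f.dayOfWeek === '?';
  if (dayOfMonthUnset === dayOfWeekUnset) {
    throw new Error('Exactly one of dayOfMonth and dayOfWeek must be "?"');
  }
  const fields = [f.minute, f.hour, f.dayOfMonth, f.month, f.dayOfWeek, f.year];
  return `cron(${fields.join(' ')})`;
}

// Reproduces the schedule used on the slide.
const weekdayLaunch = eventBridgeCron({
  minute: '20',
  hour: '0',
  dayOfMonth: '?',
  month: '*',
  dayOfWeek: 'Mon-Fri',
  year: '*',
});
console.log(weekdayLaunch);
```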

Example of production workload

Common Errors in AWS Step Functions
https://docs.aws.amazon.com/step-functions/latest/apireference/CommonErrors.html

src
├── custom-glue-workflows.ts
├── emr-scripts
│   ├── bootstrap
│   │   ├── modify_meta_database.sh
│   │   └── update_ssm.sh
│   ├── hive
│   │   ├── hive_create_database.q
│   │   └── hive_create_table.q
│   └── spark
│       └── remove-auto.py
├── glue-resources.ts
├── glue-scripts
│   ├── app-team
│   │   └── resume_detection.py
│   └── schema
│       └── update_workflow_property.py

AWS Glue
[Diagram. Labels: AWS Services, On-premises, Big Data, Data Warehouse, SaaS, Cross-cloud, Data Store]

● Inferring schema, detecting data drift, keeping metadata up to date
● Reusable data pipelines, event-triggered workflow
● Visual data preparation tool for data analysis
● Materialized views

● Performance-optimized Spark runtime
○ Upgrade from Spark 2.4 to Spark 3.1.1
○ Upgraded JDBC drivers
● Faster read and write access
● Faster and more efficient partition pruning
● Fine-grained access control
● ACID transactions
● Improved user experience for monitoring, debugging, and tuning Spark applications

Xue, C. and Zhou, Y., 2021. Building a SIMD Supported Vectorized Native Engine for
Spark SQL. [video] Available at: <https://youtu.be/hwAzodnaqa0>

Shuffled hash join improvement (SPARK-32461)
● Preserve shuffled hash join build side partitioning (SPARK-32330)
● Preserve hash join (BHJ and SHJ) stream side ordering (SPARK-32383)
● Coalesce bucketed tables for shuffled hash join (SPARK-32286)
● Add code-gen for shuffled hash join (SPARK-32421)
● Support full outer join in shuffled hash join (SPARK-32399)

{
  "Sid": "S3Event1",
  "Effect": "Allow",
  "Principal": {
    "Service": "cloudtrail.amazonaws.com"
  },
  "Action": "s3:GetBucketAcl",
  "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1"
},
{
  "Sid": "S3Event2",
  "Effect": "Allow",
  "Principal": {
    "Service": "cloudtrail.amazonaws.com"
  },
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::scott-target-bucket-ap-northeast-1/hakunamatata/*",
  "Condition": {
    "StringEquals": {
      "s3:x-amz-acl": "bucket-owner-full-control"
    }
  }
}
S3 Bucket Policy
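
The two statements above differ only in action, resource, and condition, so they can be generated per bucket. The following is a minimal sketch that produces the same pair as plain objects; `PolicyStatement` here is a local, hypothetical type (not the CDK class of the same name) and `cloudTrailBucketStatements` is an illustrative helper, not something shown in the talk.

```typescript
// Local, simplified type for an S3 bucket policy statement (not the CDK class).
interface PolicyStatement {
  Sid: string;
  Effect: 'Allow' | 'Deny';
  Principal: { Service: string };
  Action: string;
  Resource: string;
  Condition?: Record<string, Record<string, string>>;
}

// Builds the two statements that let CloudTrail read the bucket ACL and
// write objects under `keyPrefix` with the bucket-owner-full-control ACL.
function cloudTrailBucketStatements(bucketName: string, keyPrefix: string): PolicyStatement[] {
  const bucketArn = `arn:aws:s3:::${bucketName}`;
  const principal = { Service: 'cloudtrail.amazonaws.com' };
  return [
    {
      Sid: 'S3Event1',
      Effect: 'Allow',
      Principal: principal,
      Action: 's3:GetBucketAcl',
      Resource: bucketArn,
    },
    {
      Sid: 'S3Event2',
      Effect: 'Allow',
      Principal: principal,
      Action: 's3:PutObject',
      Resource: `${bucketArn}/${keyPrefix}/*`,
      Condition: {
        StringEquals: { 's3:x-amz-acl': 'bucket-owner-full-control' },
      },
    },
  ];
}

const statements = cloudTrailBucketStatements('scott-target-bucket-ap-northeast-1', 'hakunamatata');
console.log(JSON.stringify(statements, null, 2));
```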

import * as cdk from '@aws-cdk/core';
import * as cloudtrail from '@aws-cdk/aws-cloudtrail';
import * as s3 from '@aws-cdk/aws-s3';
// The trail delivers its logs to an existing bucket.
const glueWorkflowTrail = new cloudtrail.Trail(this, 'GlueWorkflowTrail', {
  trailName: `s3-events-trail`,
  bucket: s3.Bucket.fromBucketName(
    this,
    'DemoBucket',
    `scott-demo-events-${cdk.Aws.REGION}`
  ),
  s3KeyPrefix: 'event-folder',
});
// Record only write events for objects under the watched prefix.
glueWorkflowTrail.addS3EventSelector(
  [
    {
      bucket: s3.Bucket.fromBucketName(
        this,
        'BucketWhereFileWillBePut',
        // Backticks are required for the region to be interpolated.
        `scott-target-bucket-${cdk.Aws.REGION}`
      ),
      objectPrefix: 'hakunamatata/',
    },
  ],
  {
    includeManagementEvents: false,
    readWriteType: cloudtrail.ReadWriteType.WRITE_ONLY,
  }
);
CloudTrail configuration

import * as cdk from '@aws-cdk/core';
import * as glue from '@aws-cdk/aws-glue';
const glueWorkFlow = new glue.CfnWorkflow(this, 'GlueWorkFlow', {
  name: `scott-demo-glue-workflow`,
  description: 'A Glue workflow',
});
// Build the workflow ARN by hand for use as an EventBridge target.
const glueWorkFlowArn =
  `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:workflow/${glueWorkFlow.name}`;
Glue workflow

import * as events from '@aws-cdk/aws-events';
const eventRule = new events.Rule(this, 'FileDetectionRule', {
  ruleName: `event-glue-workflow-trigger`,
  description: 'An event rule to trigger the Glue workflow',
  // Match PutObject calls recorded by CloudTrail under the watched prefix.
  eventPattern: {
    source: ['aws.s3'],
    detailType: ['AWS API Call via CloudTrail'],
    detail: {
      eventSource: ['s3.amazonaws.com'],
      eventName: ['PutObject'],
      requestParameters: {
        // Backticks are required for the region to be interpolated.
        bucketName: [`scott-target-bucket-${cdk.Aws.REGION}`],
        key: [{ prefix: 'hakunamatata/' }],
      },
    },
  },
});
Amazon EventBridge
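
To make the pattern above concrete, the following sketch checks a sample event against it using a deliberately simplified matcher that supports only exact string values and the `prefix` operator. It is not EventBridge's implementation, which supports more operators and array-valued fields. `Pattern`, `matches`, and the sample object key are hypothetical; note that the CDK property `detailType` corresponds to the `detail-type` key of the raw event.

```typescript
type Matcher = string | { prefix: string };
interface Pattern {
  [key: string]: Matcher[] | Pattern;
}
type EventLike = { [key: string]: unknown };

// Returns true when every field named in the pattern is present in the event
// and equals one of the listed values (or starts with a listed prefix).
function matches(pattern: Pattern, event: EventLike): boolean {
  for (const key of Object.keys(pattern)) {
    const expected = pattern[key];
    const actual = event[key];
    if (Array.isArray(expected)) {
      if (typeof actual !== 'string') return false;
      const hit = expected.some((m) =>
        typeof m === 'string' ? m === actual : actual.slice(0, m.prefix.length) === m.prefix
      );
      if (!hit) return false;
    } else {
      if (typeof actual !== 'object' || actual === null) return false;
      if (!matches(expected, actual as EventLike)) return false;
    }
  }
  return true;
}

const putObjectPattern: Pattern = {
  source: ['aws.s3'],
  'detail-type': ['AWS API Call via CloudTrail'],
  detail: {
    eventSource: ['s3.amazonaws.com'],
    eventName: ['PutObject'],
    requestParameters: {
      bucketName: ['scott-target-bucket-ap-northeast-1'],
      key: [{ prefix: 'hakunamatata/' }],
    },
  },
};

// Trimmed-down sample event; a real CloudTrail event carries many more fields.
const sampleEvent: EventLike = {
  source: 'aws.s3',
  'detail-type': 'AWS API Call via CloudTrail',
  detail: {
    eventSource: 's3.amazonaws.com',
    eventName: 'PutObject',
    requestParameters: {
      bucketName: 'scott-target-bucket-ap-northeast-1',
      key: 'hakunamatata/2021/09/29/data.csv',
    },
  },
};
console.log(matches(putObjectPattern, sampleEvent));
```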

// Drop down to the L1 CfnRule to attach the workflow as a target.
const cfnEventRule = eventRule.node.defaultChild as events.CfnRule;
cfnEventRule.targets = [
  {
    arn: glueWorkFlowArn,
    id: 'CloudTrailTriggersWorkflow',
    roleArn: eventBridgeExecutionRole.roleArn,
  },
];
Set target as the Glue workflow

import * as cdk from '@aws-cdk/core';
import * as cr from '@aws-cdk/custom-resources';
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';
// SDK call that sets the event batching condition on the trigger.
const updateTriggerSdkCall: cr.AwsSdkCall = {
  service: 'Glue',
  action: 'updateTrigger',
  parameters: {
    Name: triggerEntity.name,
    TriggerUpdate: {
      Actions: [
        {
          JobName: jobName,
          Timeout: 1,
        },
      ],
      Description: triggerEntity.description,
      EventBatchingCondition: {
        BatchSize: 8,
        BatchWindow: 120,
      },
    },
  },
  // A fresh physical ID on every deployment forces the call to run again.
  physicalResourceId: cr.PhysicalResourceId.of(Date.now().toString()),
};
new cr.AwsCustomResource(this, id + 'CustomResource', {
  onCreate: updateTriggerSdkCall,
  onUpdate: updateTriggerSdkCall,
  policy: cr.AwsCustomResourcePolicy.fromStatements([
    new iam.PolicyStatement({
      actions: ['glue:UpdateTrigger'],
      resources: [
        `arn:${cdk.Aws.PARTITION}:glue:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:trigger/*`,
      ],
    }),
  ]),
  logRetention: logs.RetentionDays.ONE_WEEK,
});
} // closes an enclosing block whose opening is not shown on the slide
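
The EventBatchingCondition above tells the Glue trigger to fire after 8 events, or 120 seconds after the first event of a batch, whichever comes first. The following sketch simulates that rule on a list of event arrival times (in seconds) to show when the trigger would fire. It is a simplified model under that reading of the two parameters, not Glue's implementation, and `batchFireTimes` is a hypothetical name.

```typescript
// Simulates an event-batching trigger: it fires when `batchSize` events have
// arrived, or `batchWindow` seconds after the first event of the current batch.
function batchFireTimes(eventTimes: number[], batchSize: number, batchWindow: number): number[] {
  const fireTimes: number[] = [];
  const sorted = [...eventTimes].sort((a, b) => a - b);
  let windowStart: number | undefined;
  let count = 0;
  for (const t of sorted) {
    // The window of the open batch expired before this event arrived.
    if (windowStart !== undefined && t >= windowStart + batchWindow) {
      fireTimes.push(windowStart + batchWindow);
      windowStart = undefined;
      count = 0;
    }
    if (windowStart === undefined) {
      windowStart = t; // this event opens a new batch
    }
    count += 1;
    if (count >= batchSize) {
      fireTimes.push(t); // batch is full, fire immediately
      windowStart = undefined;
      count = 0;
    }
  }
  // A batch left open at the end fires when its window elapses.
  if (windowStart !== undefined) {
    fireTimes.push(windowStart + batchWindow);
  }
  return fireTimes;
}

// Eight files arriving ten seconds apart fill one batch.
const burst = batchFireTimes([0, 10, 20, 30, 40, 50, 60, 70], 8, 120);
// Three sparse files never fill a batch, so the window decides.
const sparse = batchFireTimes([0, 30, 500], 8, 120);
console.log(burst, sparse);
```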

https://databricks.com/product/delta-lake-on-databricks
