November 13, 2014 | Las Vegas, NV
Amandeep Khurana
About me
•Principal Solutions Architect @ Cloudera
•Engineer @ AWS
•Co-author, HBase in Action
Agenda 
•Motivation 
•Deployment paradigms 
•Storage 
•Networking 
•Instances 
•Security 
•High availability, backups, disaster recovery 
•Planning your cluster 
•Available resources
Why you should care
•Parallel trends
–Commoditizing infrastructure
–Commoditizing data
•Worlds converging… but with considerations
–Cost
–Flexibility
–Ease of use
–Operations
–Location
–Performance
–Security
Intersection
The devil…
Primary consideration – Storage (source of truth)

Amazon S3 (predominantly transient clusters):
•Ad-hoc batch workloads
•SLA batch workloads

HDFS (long-running clusters):
•Ad-hoc batch workloads
•SLA batch workloads
•Ad-hoc interactive workloads
•SLA interactive workloads
Deployment models

Transient clusters:
•Primary storage substrate: S3 or remote HDFS
•Backups: S3
•Workloads: Batch (MapReduce, Spark); interactive is an anti-pattern
•Role of cluster: Compute only

Long-running clusters:
•Primary storage substrate: HDFS
•Backups: S3 or second HDFS cluster
•Workloads: Batch (MapReduce, Spark) and interactive (HBase, Solr, Impala)
•Role of cluster: Compute and storage
Storage 
Access pattern, performance
Storage considerations 
Hadoop paradigm: 
Bring compute to storage 
Cloud paradigm: 
Everything as a service
Storage choices in AWS
•Instance store
–Local storage attached to the instance
–Temporary
–Instance dependent (not configurable)
•Amazon Elastic Block Store (EBS) – block-level storage volumes
–External to the instance
–Lifecycle independent of the instance
•Amazon Simple Storage Service (S3) – BLOB store
–External data store
–Simple API – Get, Put, Delete
–Instance-dependent bandwidth
Interacting with S3
•In MapReduce jobs, by using s3a:// URIs
•DistCp
–hadoop distcp <options> hdfs:///foo/bar s3a://mybucket/foo/
•HBase snapshot export
–hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot <options> -Dmapred.task.timeout=15000000 -snapshot <name> -mappers <nmappers> -copy-to <dir>
Interacting with S3 – how it works
•Multiple filesystem implementations in the Hadoop project
–S3 (block based)
–S3N (file based, using JetS3t)
–S3A (file based, using the AWS SDK) – the latest
•Bandwidth to S3 depends on instance type
–<200 MB/s per instance on some of the larger ones
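Since per-instance bandwidth to S3 is capped, aggregate throughput comes from parallelism across instances. A back-of-the-envelope sketch (the 200 MB/s figure comes from the slide above; everything else is idealized and ignores real-world overheads):

```python
def transfer_time_hours(dataset_gb, instances, mb_per_sec_per_instance=200):
    """Idealized time to move a dataset to/from S3 across N instances."""
    aggregate_mb_s = instances * mb_per_sec_per_instance
    seconds = (dataset_gb * 1024) / aggregate_mb_s
    return seconds / 3600

# Moving 10 TB with 10 instances at 200 MB/s each:
print(round(transfer_time_hours(10 * 1024, 10), 2))
```

Doubling the instance count roughly halves the transfer time, which is why the next slide leads with "parallelize".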
Optimizing S3 interaction
•Tune
•Parallelize
•Writing to S3
–Multipart upload for files > 5 GB
–Pick multiple drives for local staging (HADOOP-10610)
–Raise the task timeouts when writing large files
•Reading from S3
–Range reads within map tasks via multiple threads
•Larger objects are better (less load on metadata lookups)
•Randomize file names (so metadata lookups are spread out)
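A minimal sketch of the "randomize file names" tip: prefixing keys with a few hash characters spreads them across S3's keyspace so metadata lookups are not concentrated on one lexicographic range. The function and key names below are illustrative, not from the deck:

```python
import hashlib

def randomized_key(logical_path: str, prefix_len: int = 4) -> str:
    """Prepend a hash-derived prefix so keys spread across S3's keyspace."""
    digest = hashlib.md5(logical_path.encode()).hexdigest()
    return f"{digest[:prefix_len]}/{logical_path}"

# Sequentially named outputs no longer sort into one contiguous range:
for part in ("logs/2014/11/13/part-00000", "logs/2014/11/13/part-00001"):
    print(randomized_key(part))
```

The prefix is deterministic, so readers can recompute it from the logical name rather than having to list the bucket.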
HDFS in AWS
•Runs on ephemeral drives on Amazon EC2 instances
•Persistent only for as long as the instances are alive (no pausing)
•Use S3 for backups
•No EBS
–Goes over the network
–Designed for random I/O
Networking 
Performance, access, and security
Topologies – Deploy in a Virtual Private Cloud (VPC)
[Diagram: two deployment topologies. Left – a Cloudera Enterprise cluster of EC2 instances in a public subnet, reached from the corporate network over VPN or Direct Connect, and from the internet and other AWS services via edge nodes. Right – the same cluster in a private subnet, with traffic to the internet and other AWS services routed through a NAT instance in a public subnet.]
Performance considerations
•Instance <-> instance link
–10G
–10G + SR-IOV (HVM)
–Sub-10G
•Instance <-> S3 (equal to instance-to-public-internet bandwidth)
•Placement groups
–Performance may dip outside of PGs
•Keep clusters within a single Availability Zone
EC2 instances 
Storage, cost, performance, availability, and fault tolerance
Picking the right instance

Transient clusters:
•Primary considerations
–Bandwidth
–CPU
–Memory
•Secondary considerations
–Availability and fault tolerance
–Local storage density
•Typical choices
–C3 family, M3 family, M1 family
–Anti-pattern to use storage-dense instances

Long-running clusters:
•Primary considerations
–Local storage is key
–CPU
–Memory
–Availability and fault tolerance
–Bandwidth
•Typical choices
–hs1.8xlarge, cc2.8xlarge, i2.8xlarge
Amazon Machine Image (AMI)
•2 kinds – PV and HVM
•Pick a dependable base AMI
•Things to look out for
–Kernel patches
–Third-party software and library versions
•Increase the root volume size
Security
Security considerations
•Amazon Virtual Private Cloud (VPC) options
–Private subnet
•All traffic outside of the VPC goes via NAT
–Public subnet
•Network ACLs at the subnet level
•Security groups
•EDH guidelines for Kerberos, Active Directory, and encryption
•S3 provides server-side encryption
High Availability, Backups, Disaster Recovery
HA, Backups, DR
•High availability is available in the Hadoop stack
–Run NameNode HA with 5 JournalNodes
–Run 5 ZooKeepers
–Run multiple HBase masters
•Backups and disaster recovery (based on RPO/RTO requirements)
–Hot backup: active-active clusters
–Warm backup: S3
•Hadoop-level snapshots – HDFS, HBase
–Cold backup: Amazon Glacier
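The rationale behind running 5 JournalNodes or ZooKeepers: both are majority-quorum systems, so an ensemble of n nodes stays available as long as a majority survives, i.e. it tolerates floor((n - 1) / 2) failures. A tiny sketch of that arithmetic:

```python
def tolerated_failures(n: int) -> int:
    """Failures a majority-quorum ensemble of n nodes can survive."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, "nodes ->", tolerated_failures(n), "failures tolerated")
```

A 5-node ensemble rides out 2 simultaneous failures, which matters on EC2 where instances can be lost or retired with little warning.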
Planning your cluster
Capacity, performance, access patterns
•Bad news – no simple answer. You have to think it through.
•Good news – mistakes are cheap. Learn from ours to make them even cheaper.
•Start with workload type (ad-hoc / SLA, batch / interactive)
•What percentage of the day will you use your cluster?
•How much data do you want to store?
•What are the performance requirements?
•How are you ingesting data? What does the workflow look like?
To make life easier
•Just released – Cloudera Director!
•AWS Quickstart
•Available resources
–Reference Architecture (just refreshed)
–Best practices blog
Thank you 
We are hiring!
Opportunities
•Smarter with topology
•Amazon EBS as storage for HDFS
•Deeper S3 integration
•Amazon Kinesis integration
•Workflow management
http://bit.ly/awsevals


(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS re:Invent 2014
