(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS re:Invent 2014
- 2. 2
About me
•Principal Solutions Architect @ Cloudera
•Engineer @ AWS
•Co-author, HBase in Action
- 3. 3
Agenda
•Motivation
•Deployment paradigms
•Storage
•Networking
•Instances
•Security
•High availability, backups, disaster recovery
•Planning your cluster
•Available resources
- 4. 4
•Parallel trends
–Commoditizing infrastructure
–Commoditizing data
•Worlds converging… but with considerations
–Cost
–Flexibility
–Ease of use
–Operations
–Location
–Performance
–Security
Why you should care
- 7. 7
Primary consideration – Storage (source of truth)
Amazon S3 (predominantly transient clusters)
•Ad-hoc batch workloads
•SLA batch workloads
HDFS (long-running clusters)
•Ad-hoc batch workloads
•SLA batch workloads
•Ad-hoc interactive workloads
•SLA interactive workloads
- 8. 9
Deployment models
Transient clusters
•Primary storage substrate: S3 or remote HDFS
•Backups: S3
•Workloads: batch (MapReduce, Spark); interactive is an anti-pattern
•Role of cluster: compute only
Long-running clusters
•Primary storage substrate: HDFS
•Backups: S3 or second HDFS cluster
•Workloads: batch (MapReduce, Spark) and interactive (HBase, Solr, Impala)
•Role of cluster: compute and storage
- 11. 12
•Instance store
–Local storage attached to instance
–Temporary
–Instance dependent (not configurable)
•Amazon Elastic Block Store (EBS) – block-level storage volume
–External to instance
–Lifecycle independent of instance
•Amazon Simple Storage Service (S3) – BLOB store
–External data store
–Simple API: Get, Put, Delete
–Instance dependent bandwidth
Storage choices in AWS
- 12. 13
•In MapReduce jobs, by using the s3a URI
•Distcp
–hadoop distcp <options> hdfs:///foo/bar s3a://mybucket/foo/
•HBase snapshot export
–hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot
<options> -Dmapred.task.timeout=15000000
-snapshot <name> -mappers <nmappers> -copy-to <dir>
Interacting with S3
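The s3a URI shown above can also be handed straight to a MapReduce job. A minimal sketch (the bucket, paths, and examples jar location are hypothetical placeholders):

```
# Run the stock wordcount example reading input directly from S3
# and writing results to HDFS
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  s3a://mybucket/input/logs/ \
  hdfs:///user/hadoop/wordcount-output/
```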
- 13. 14
•Multiple implementations in the Hadoop project
–S3 (block based)
–S3N (file based, using jets3t)
–S3A (file based, using the AWS SDK) – the latest
•Bandwidth to S3 depends on instance type
–<200 MB/s per instance on some of the larger ones
•Process
Interacting with S3 –how it works
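Using S3A requires credentials in the Hadoop configuration. A minimal core-site.xml fragment, assuming the standard hadoop-aws property names (key values are placeholders):

```xml
<!-- core-site.xml: S3A credentials (placeholders) -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```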
- 14. 15
•Tune
•Parallelize
•Writing to S3
–Multi-part upload for > 5 GB files
–Pick multiple drives for local staging (HADOOP-10610)
–Up the task timeouts when writing large files
•Reading from S3
–Range reads within map tasks via multiple threads
•Large objects are better (less load on metadata lookups)
•Randomize file names (metadata lookups are spread out)
Optimizing S3 interaction
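One way to randomize file names as suggested above is to prefix each object key with a short hash of itself, so keys spread across S3's metadata index instead of clustering under one lexicographic prefix. A sketch (bucket name and key layout are hypothetical):

```shell
# Spread S3 key names by prepending a short hash prefix to each key
key="logs/2014/11/14/part-00000"
prefix=$(printf '%s' "$key" | md5sum | cut -c1-4)
echo "s3a://mybucket/${prefix}/${key}"
```

The lookup can be reproduced for reads, since the prefix is derived deterministically from the key.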
- 15. 16
•Ephemeral drives on Amazon EC2 instances
•Data persists only while the instance is alive (lost on stop/terminate; no pausing)
•Use S3 for backups
•No EBS for HDFS data
–Traffic goes over the network
–Designed for random I/O
HDFS in AWS
- 17. 18
Topologies – Deploy in Virtual Private Cloud (VPC)
Cluster in public subnet vs. cluster in private subnet
[Diagram: two topologies. Left – Cloudera Enterprise cluster in a public subnet: EC2 cluster instances plus edge nodes inside an AWS VPC, connected to the corporate network via VPN or AWS Direct Connect, with direct access to the internet and other AWS services. Right – the same cluster in a private subnet: cluster instances and edge nodes reach the internet and other AWS services only through a NAT instance in a public subnet; corporate connectivity again via VPN or Direct Connect.]
- 18. 19
•Instance <-> Instance link
–10G
–10G + SR-IOV (HVM)
–Non-10G
•Instance <-> S3 (equal to instance-to-public-internet bandwidth)
•Placement groups
–Performance may dip outside of PGs
•Clusters within a single Availability Zone
Performance considerations
- 19. 20
EC2 instances
Storage, cost, performance, availability, and fault tolerance
- 20. 21
Picking the right instance
Transient clusters
•Primary considerations:
–Bandwidth
–CPU
–Memory
•Secondary considerations
–Availability and fault tolerance
–Local storage density
•Typical choices
–C3 family, M3 family, M1 family
–Anti-pattern to use storage-dense instances
Long running clusters
•Primary considerations
–Local storage is key
–CPU
–Memory
–Availability and fault tolerance
–Bandwidth
•Typical choices
–hs1.8xlarge, cc2.8xlarge, i2.8xlarge
- 21. 22
Amazon Machine Image (AMI)
•Two kinds – PV and HVM
•Pick a dependable base AMI
•Things to look out for
–Kernel patches
–Third-party software and library versions
•Increase root volume size
- 23. 24
•Amazon Virtual Private Cloud (VPC) options
–Private subnet
•All traffic outside of VPC via NAT
–Public subnet
•Network ACLs at subnet level
•Security groups
•EDH guidelines for Kerberos, Active Directory, and Encryption
•S3 provides server-side encryption
Security considerations
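S3 server-side encryption can be requested from the Hadoop side so objects written via S3A are encrypted at rest. A core-site.xml fragment, assuming the hadoop-aws property name for SSE-S3:

```xml
<!-- core-site.xml: ask S3 to encrypt objects at rest (SSE-S3) -->
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
```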
- 25. 26
•High availability is available in the Hadoop stack
–Run NameNode HA with 5 JournalNodes
–Run 5 ZooKeeper servers
–Run multiple HBase masters
•Backups and disaster recovery (based on RPO/RTO requirements)
–Hot backup: Active-Active clusters
–Warm backup: S3
•Hadoop-level snapshots – HDFS, HBase
–Cold backup: Amazon Glacier
HA, Backups, DR
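The warm-backup path above (HDFS snapshot, then copy to S3) looks roughly like this; directory, snapshot name, and bucket are hypothetical:

```
# Enable snapshots on a directory, then take one
hdfs dfsadmin -allowSnapshot /user/hadoop/data
hdfs dfs -createSnapshot /user/hadoop/data backup-2014-11-14

# The snapshot appears under the .snapshot subdirectory and can be
# shipped to S3 with distcp for a warm backup
hadoop distcp /user/hadoop/data/.snapshot/backup-2014-11-14 \
  s3a://mybucket/backups/data/
```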
- 27. 28
Capacity, performance, access patterns
•Bad news –no simple answer. You have to think through it.
•Good news –mistakes are cheap. Learn from ours to make them even cheaper.
•Start with workload type (ad-hoc / SLA, batch / interactive)
•What percentage of the day will you use your cluster?
•How much data do you want to store?
•What are the performance requirements?
•How are you ingesting data? What does the workflow look like?
- 28. 29
•Just released –Cloudera Director!
•AWS Quickstart
•Available resources
–Reference Architecture (just refreshed)
–Best practices blog
To make life easier
- 30. 31
•Smarter with topology
•Amazon EBS as storage for HDFS
•Deeper S3 integration
•Amazon Kinesis integration
•Workflow management
Opportunities