Big Migrations:
Moving elephant herds
www.datatons.com
Motivation
● Everybody wants to jump into Big Data
● Everybody wants their new setup to be cheap
– Cloud is an excellent option for this
● These environments generally start as a PoC
– They should be re-implemented
– Sometimes they are not
Motivation
● You may need to move your Hadoop cluster
– You want to reduce costs
– You need more performance
– Because of corporate policy
– For legal reasons
● But moving big data volumes is a problem!
– Example: 20 TB at 10 MB/s is about 2 million seconds, i.e. roughly 23 days (even at 100 MB/s it is still ~2 ½ days)
Initial idea
● Set up a second cluster in the new environment
● The new cluster is initially empty
● We need to populate it
Classic UNIX methods
● Well-known file transfer technologies:
– (s)FTP
– Rsync
– NFS + cp
● You need to set up a staging area
● This acts as an intermediate space between Hadoop and the classic UNIX world
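A minimal sketch of this staging approach, assuming a plain Linux staging host reachable over SSH; the paths and hostname below are placeholders:

  # On the old cluster: export the HDFS data to the staging area
  hdfs dfs -get /data/warehouse /staging/warehouse
  # Ship it to the staging area of the new environment (rsync over SSH)
  rsync -avz /staging/warehouse newedge.example.com:/staging/
  # On the new cluster: load the files back into HDFS
  hdfs dfs -put /staging/warehouse /data/warehouse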
Classic UNIX methods
● Disadvantages:
– Needs a big staging area
– Transfer times are slow
– Single nodes act as bottlenecks
– Metadata needs to be copied separately
– Everything must be stopped during the copy to avoid data loss
– Total downtime: several hours or days (don't even try if your data is bigger)
Using Amazon S3
● AWS S3 storage is also an option for staging
● Cheaper than VM disks
● Available almost everywhere
● An access key is needed
– Create a user with only S3 permissions
● Transfer is done using distcp
– (We'll see more about this later)
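A rough sketch of the S3 staging path using distcp and the s3a connector; the bucket, paths and keys are placeholders, and in practice the credentials would normally live in core-site.xml or a credential provider rather than on the command line:

  # From the old cluster: push HDFS data to the S3 bucket
  hadoop distcp -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    /data/warehouse s3a://migration-bucket/warehouse
  # From the new cluster: pull it back down into HDFS
  hadoop distcp -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    s3a://migration-bucket/warehouse /data/warehouse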
Distcp
● Distcp copies data between two Hadoop clusters
● No staging area needed (Hadoop native)
● High throughput
● Metadata needs to be copied separately
● Clusters need to be connected
– Via VPN for hdfs protocol
– NAT can be used when using webhdfs
● Kerberos complicates matters
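A hedged example of both connectivity options; hostnames and ports are placeholders, while -update (skip files already present at the destination) and -pb (preserve block sizes) are optional but usually helpful:

  # Native hdfs:// on both sides: needs full connectivity, e.g. over a VPN
  hadoop distcp -update -pb \
    hdfs://old-nn.example.com:8020/data/warehouse \
    hdfs://new-nn.example.com:8020/data/warehouse
  # Reading the source over webhdfs:// (HTTP) is easier to pass through NAT
  hadoop distcp -update -pb \
    webhdfs://old-nn.example.com:50070/data/warehouse \
    hdfs://new-nn.example.com:8020/data/warehouse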
Remote cluster access
● As a side note, remote filesystems can also be used outside distcp
● For example, as the LOCATION for Hive tables
● While we're at it...
● We can transform data
– For example, convert files to Parquet (see the sketch after this list)
● Is this the right time?
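A hedged Hive example of both ideas (the table, columns and hostname are made up): it creates a Parquet table whose LOCATION lives on the new cluster and fills it from an existing table on the old one, converting the data while it is copied:

  hive -e "
    CREATE EXTERNAL TABLE sales_new (id BIGINT, amount DOUBLE)
    STORED AS PARQUET
    LOCATION 'hdfs://new-nn.example.com:8020/user/hive/warehouse/sales_new';
    INSERT OVERWRITE TABLE sales_new SELECT * FROM sales;
  "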
Extending Hadoop
● Do like the caterpillar!
● We want to step on the new platform while the old one continues working
Requirements
● Install servers in the new platform
– Enough to hold ALL data
– Same OS + config as original platform
– Config management tools are helpful for this
● Set up connectivity
– VPN (private networking) is needed
● Rack-aware configuration: new nodes need to be on a new rack (topology script sketch below)
● System times and time zones should be consistent
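A minimal sketch of a script-based rack mapping; the property name is the standard Hadoop one, but the subnets and rack names are placeholders (management tools such as Cloudera Manager expose the same idea through their own rack settings):

  #!/bin/bash
  # /etc/hadoop/conf/topology.sh, referenced from core-site.xml via
  #   net.topology.script.file.name=/etc/hadoop/conf/topology.sh
  # Hadoop passes IPs/hostnames as arguments and expects one rack path per argument
  for host in "$@"; do
    case "$host" in
      10.0.1.*) echo "/old-site/rack1" ;;   # original platform (placeholder subnet)
      10.0.2.*) echo "/new-site/rack1" ;;   # new platform (placeholder subnet)
      *)        echo "/default-rack" ;;
    esac
  done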
Starting the copy
● New nodes will have a DataNode role
● No computing yet (YARN, Impala, etc.)
● DataNode roles will be stopped at first
● When started:
– If there is only one rack in the original platform, the copy process will begin immediately
– If there is more than one rack in the original, manual intervention will be required
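A few read-only commands that help verify the rack assignment and follow the re-replication once the DataNode roles are started; these are standard HDFS tools, nothing specific to this migration:

  # Which rack each DataNode has been mapped to
  hdfs dfsadmin -printTopology
  # Summary including under-replicated blocks and the number of racks
  hdfs fsck /
  # Per-DataNode capacity and usage, handy for watching the copy's progress
  hdfs dfsadmin -report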
Transfer speed
● Two parameters affect the data transfer speed:
– dfs.datanode.balance.bandwidthPerSec
– dfs.namenode.replication.work.multiplier.per.iteration
● No jobs are launched on the new nodes yet
– Their network and disk bandwidth are available almost exclusively for the copy
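As an illustration only (the values below are examples, not recommendations): the first parameter can be raised at runtime, while the second is set in hdfs-site.xml on the NameNode:

  # dfs.datanode.balance.bandwidthPerSec, in bytes/s (~100 MB/s here), applied live
  hdfs dfsadmin -setBalancerBandwidth 104857600
  # dfs.namenode.replication.work.multiplier.per.iteration goes in hdfs-site.xml, e.g.:
  #   <property>
  #     <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  #     <value>10</value>
  #   </property>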
Moving master roles
● When possible, take advantage of HA:
– Zookeeper (just add two new servers to the ensemble)
– NameNode
– ResourceManager
● Others need to be migrated manually:
– Hive metastore DB needs to be copied
– Having a DNS name for the DB helps
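A hedged sketch of the metastore copy, assuming a MySQL-backed metastore; hostnames, user and database name are placeholders:

  # Dump the metastore DB on the old host and load it into the new one
  mysqldump -h old-db.example.com -u hive -p hive > hive_metastore.sql
  mysql     -h new-db.example.com -u hive -p hive < hive_metastore.sql
  # If hive-site.xml points at a DNS alias (e.g. metastore-db.example.com),
  # repointing that alias is all that is needed afterwards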
Moving data I/O
● Once data is copied (fully or most of it), new computation roles will be deployed:
– NodeManager
– Impalad
● Roles will be stopped at first
● Auxiliary nodes (front-end, app nodes, etc.) need to be deployed in the new platform
● A planned intervention (at a low usage time) needs to take place
During the intervention
● The cluster is stopped
● If necessary, client configuration is redeployed
● Services are started and tested in this order:
– Zookeeper
– HDFS
– YARN (only for the new platform)
– Impala (only for the new platform)
● Auxiliary services in the new platform are tested
● Green light? Change the DNS for the entry points
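A rough smoke-test sequence along those lines; hostnames and the examples jar path depend on the distribution and are placeholders here:

  # Zookeeper: the four-letter "ruok" command should answer "imok"
  echo ruok | nc zk1.example.com 2181
  # HDFS: NameNode out of safe mode, all DataNodes reporting
  hdfs dfsadmin -safemode get
  hdfs dfsadmin -report
  # YARN: NodeManagers registered, then a trivial test job
  yarn node -list
  hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
  # Impala: a trivial query against one of the new impalads
  impala-shell -i new-impalad.example.com:21000 -q "SELECT 1"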
Final picture
[Diagram of the final setup in the new platform]
Conclusions and afterthoughts
● Minimal downtime, similar to non-Hadoop planned works
● Data and service are never at risk
● Hadoop tools are used to solve a Hadoop problem
● No user impact: no change in data or access
● Kerberos is not an issue (same REALM + KDC)
Thank you!
Carlos Izquierdo
cizquierdo@datatons.com
www.datatons.com
