Best practices:
Hadoop eco-system migration
from on-premises to Azure HDInsight
PASS SUMMIT 2018 | Seattle | Nov 7th 2018
• The most trusted and compliant platform
A secure and managed Apache Hadoop and Spark platform for building data lakes in Azure
Best Practices: Hadoop migration to Azure HDInsight
Workload                                               HDInsight cluster type
Batch processing (ETL / ELT)                           Hadoop, Spark
Data warehousing                                       Hadoop, Spark, Interactive Query
IoT / Streaming                                        Kafka, Storm, Spark
NoSQL transactional processing                         HBase
Interactive and faster queries with in-memory caching  Interactive Query
Data science                                           ML Services, Spark
• Clusters can be deleted once the workload has completed successfully
• Deleting a cluster does not delete the storage account or the external metadata associated with the cluster
• Storage does not need to be co-located with compute
• Storage can be in Azure Storage, Azure Data Lake Store, or both
• A Hadoop credential provider path can be used to protect storage keys in
  • Cluster configs
  • DistCp jobs
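
As a rough illustration of the credential provider bullet above, the sketch below assembles the two Hadoop commands involved: one that stores a storage account key in a JCEKS keystore, and one that runs DistCp with `hadoop.security.credential.provider.path` pointing at that keystore. The keystore path, storage account, and source and destination paths are hypothetical placeholders, and the code only builds the argument lists; it does not run them.

```python
from typing import List

# Hypothetical keystore location, for illustration only.
PROVIDER = "jceks://hdfs@namenode.example.com/user/admin/azure.jceks"


def credential_create_cmd(storage_account: str, provider: str) -> List[str]:
    """Build the command that stores a storage key in a JCEKS keystore.

    The key value is left out on purpose so that `hadoop credential`
    prompts for it rather than exposing it in the shell history.
    """
    alias = f"fs.azure.account.key.{storage_account}.blob.core.windows.net"
    return ["hadoop", "credential", "create", alias, "-provider", provider]


def distcp_cmd(src: str, dst: str, provider: str) -> List[str]:
    """Build a DistCp command that resolves the key through the keystore."""
    return [
        "hadoop", "distcp",
        "-D", f"hadoop.security.credential.provider.path={provider}",
        src, dst,
    ]


if __name__ == "__main__":
    # "myaccount" and "data" are assumed names, not values from the slides.
    print(" ".join(credential_create_cmd("myaccount", PROVIDER)))
    print(" ".join(distcp_cmd(
        "/user/etl/input",
        "wasbs://data@myaccount.blob.core.windows.net/input",
        PROVIDER,
    )))
```

Keeping the key in a keystore means that neither the cluster configuration nor the job command line has to contain the key in clear text.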
• Identify the number of worker nodes
• Choose the VM size and type
• Choose the Region
• Choose storage location and size
VM sizes by node type and cluster type

Head
  Hadoop:             D3 v2, D4 v2, D12 v2
  HBase:              D3 v2, D4 v2, D12 v2
  Interactive Query:  D13, D14
  Storm:              A4 v2, A8 v2, A2m v2
  Spark:              D12 v2, D13 v2, D14 v2
  ML Server:          D12 v2, D13 v2, D14 v2
Worker
  Hadoop:             D3 v2, D4 v2, D12 v2
  HBase:              D3 v2, D4 v2, D12 v2
  Interactive Query:  D13, D14
  Storm:              D3 v2, D4 v2, D12 v2
  Spark:              D4 v2, D12 v2, D13 v2, D14 v2
  ML Server:          D4 v2, D12 v2, D13 v2, D14 v2
Zookeeper
  HBase:              A4 v2, A8 v2, A2m v2
  Storm:              A2 v2, A4 v2, A8 v2
Edge
  All cluster types:  D4 v2, D12 v2, D13 v2, D14 v2
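
One way to use the sizing table during planning is to encode it and check a proposed VM size against it. The sketch below does that for the head and worker rows only; the dictionary layout and the `is_listed` helper are illustrative assumptions, and the sizes reflect this 2018 slide rather than current Azure guidance.

```python
from typing import Dict, List, Tuple

# Head and worker rows of the table above, keyed by (cluster type, node type).
LISTED_VM_SIZES: Dict[Tuple[str, str], List[str]] = {
    ("Hadoop", "head"): ["D3 v2", "D4 v2", "D12 v2"],
    ("HBase", "head"): ["D3 v2", "D4 v2", "D12 v2"],
    ("Interactive Query", "head"): ["D13", "D14"],
    ("Storm", "head"): ["A4 v2", "A8 v2", "A2m v2"],
    ("Spark", "head"): ["D12 v2", "D13 v2", "D14 v2"],
    ("ML Server", "head"): ["D12 v2", "D13 v2", "D14 v2"],
    ("Hadoop", "worker"): ["D3 v2", "D4 v2", "D12 v2"],
    ("HBase", "worker"): ["D3 v2", "D4 v2", "D12 v2"],
    ("Interactive Query", "worker"): ["D13", "D14"],
    ("Storm", "worker"): ["D3 v2", "D4 v2", "D12 v2"],
    ("Spark", "worker"): ["D4 v2", "D12 v2", "D13 v2", "D14 v2"],
    ("ML Server", "worker"): ["D4 v2", "D12 v2", "D13 v2", "D14 v2"],
}


def is_listed(cluster_type: str, node_type: str, vm_size: str) -> bool:
    """Return True if vm_size appears in the table for this combination."""
    return vm_size in LISTED_VM_SIZES.get((cluster_type, node_type), [])


if __name__ == "__main__":
    print(is_listed("Spark", "worker", "D4 v2"))
    print(is_listed("Spark", "head", "D4 v2"))
```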
• Secure communication between Azure resources
• Ability to filter and route network traffic
• Securely connect to
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Cosmos DB
  • SQL databases
• Traffic flows over a secured route within the Azure data center
• The HDInsight cluster is joined to the Active Directory domain
• Supports
  • Active Directory-based authentication
  • Multiuser support
  • Role-based access control
  • Auditing
• Provides elasticity to scale the number of worker nodes up and down
• Lets you shrink the cluster after hours or on weekends and expand it during peak business demand
• An edge node is a Linux VM with the same client tools installed and configured as on the head nodes
• An edge node can be used
  • to access the cluster
  • to test client applications
  • to host client applications
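
The after-hours scaling idea above can be sketched as a small scheduling rule that returns the desired worker count for a given time. The node counts and the business-hours window are hypothetical values chosen for illustration; the resize itself would be performed separately through whichever Azure tooling you use.

```python
from datetime import datetime

# Hypothetical sizing values, not taken from the slides.
PEAK_WORKERS = 10
OFF_PEAK_WORKERS = 3
BUSINESS_START_HOUR = 8   # inclusive
BUSINESS_END_HOUR = 18    # exclusive


def target_worker_count(now: datetime) -> int:
    """Return the desired number of worker nodes at the given local time."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    in_business_hours = BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR
    if not is_weekend and in_business_hours:
        return PEAK_WORKERS
    return OFF_PEAK_WORKERS


if __name__ == "__main__":
    print(target_worker_count(datetime(2018, 11, 7, 10, 0)))   # weekday morning
    print(target_worker_count(datetime(2018, 11, 10, 10, 0)))  # Saturday morning
```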
• Main metastores
  • Hive
  • Oozie
  • Ranger
• Azure SQL Database hosts the metastores
• Clusters can be created and deleted without losing metadata
• A single metastore database can be shared across different types of clusters
• Consider using an LLAP cluster for interactive Hive queries
• Consider using Spark jobs in place of Hive jobs
• Consider replacing Impala-based queries with LLAP queries
• Consider replacing MapReduce jobs with Spark jobs
• Consider replacing low-latency Spark batch jobs with Spark Structured Streaming jobs
• Data orchestration: consider using Azure Data Factory (ADF) 2.0
• Consider Ambari for cluster management
• Change data storage from on-premises HDFS to WASB or ADLS
• Consider using Ranger RBAC on Hive tables and auditing
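
Moving storage from HDFS to WASB or ADLS mostly shows up in applications as a change of URI scheme and authority. As a minimal sketch, the helpers below build the three URI shapes used by the Hadoop Azure connectors; the account, container, and path names are hypothetical.

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Azure Blob Storage URI (wasbs uses TLS, wasb does not)."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def abfs_uri(filesystem: str, account: str, path: str, secure: bool = True) -> str:
    """Azure Data Lake Storage Gen2 URI (abfss uses TLS, abfs does not)."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def adl_uri(account: str, path: str) -> str:
    """Azure Data Lake Storage Gen1 URI."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # "data" and "myaccount" are assumed names for illustration.
    print(wasb_uri("data", "myaccount", "/warehouse/sales"))
    print(abfs_uri("data", "myaccount", "/warehouse/sales"))
    print(adl_uri("myaccount", "/warehouse/sales"))
```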
• Transfer data over the network with TLS
  • DistCp
  • Azure Data Factory
  • AzureCp
  • Third-party tools, including WANdisco
  • Kafka MirrorMaker
  • Sqoop
• Shipping data
  • Import / Export service
  • Data Box
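
Whether to transfer over the network or ship disks usually comes down to simple arithmetic: data volume divided by usable bandwidth. The sketch below estimates transfer time in days; the 80% utilization factor and the example figures are assumptions for illustration, not numbers from the talk.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, utilization: float = 0.8) -> float:
    """Estimate days needed to move data_tb terabytes over a network link.

    data_tb is in decimal terabytes, bandwidth_mbps in megabits per second,
    and utilization is the fraction of the link actually available.
    """
    if data_tb < 0 or bandwidth_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("inputs must be positive and utilization within (0, 1]")
    total_bits = data_tb * 1e12 * 8
    usable_bits_per_second = bandwidth_mbps * 1e6 * utilization
    return total_bits / usable_bits_per_second / 86_400


if __name__ == "__main__":
    # Hypothetical case: 100 TB over a 1 Gbps link at 80% utilization.
    print(f"{transfer_days(100, 1000):.1f} days")
```

If the estimate runs to weeks or months, shipping options such as the Import / Export service or Data Box become more attractive.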
• Hive metastore migration using scripts
  • Generate the Hive DDLs
  • Edit the generated DDL to replace HDFS URLs with WASB/ADLS/ABFS URLs
  • Execute the updated DDL on the metastore from the HDI cluster
• Hive metastore migration using DB replication
• Ranger metastore migration
  • Export on-premises Ranger policies to XML files
  • Transform on-premises HDFS-based paths to WASB/ADLS
  • Import the policies into Ranger running on HDI
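
The "edit the generated DDL" step above can be scripted. As a minimal sketch, assuming the DDL was produced with something like Hive's `SHOW CREATE TABLE`, the function below swaps the `hdfs://<authority>` prefix of every location for a new storage root. The table definition, NameNode address, and storage account are hypothetical, and a real migration would also need to review paths that do not map one-to-one.

```python
import re

# Matches "hdfs://" plus an optional authority such as "nn1:8020",
# stopping before the path, a quote, or whitespace.
HDFS_PREFIX = re.compile(r"hdfs://[^/'\"\s]*")


def rewrite_locations(ddl: str, new_root: str) -> str:
    """Replace every hdfs://<authority> prefix in ddl with new_root."""
    root = new_root.rstrip("/")
    # A function replacement avoids backslash interpretation in new_root.
    return HDFS_PREFIX.sub(lambda _match: root, ddl)


if __name__ == "__main__":
    ddl = (
        "CREATE EXTERNAL TABLE sales (id INT) "
        "LOCATION 'hdfs://nn1:8020/warehouse/sales';"
    )
    print(rewrite_locations(ddl, "wasbs://data@myaccount.blob.core.windows.net"))
```

The same prefix substitution could be applied to the HDFS paths in exported Ranger policy files before importing them.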
• Remediate applications
• Perform Tests
• Optimize
https://aka.ms/PASS2018Survey
Take the survey at our survey station or on
your mobile device!
Once completed, come by the reception
desk for your Microsoft prize, and to collect
your raffle ticket!


Editor's Notes

  1. Azure HDInsight is a secure and managed platform for building data lakes on Azure, based on the Apache Hadoop and Spark frameworks. So, what does HDInsight have to offer?

     Reliable open source analytics with an industry-leading SLA
     HDInsight allows you to easily spin up open source cluster types, guaranteed with the industry's best 99.9% SLA and 24/7 support. We guarantee this SLA for the entire big data solution, not just the VM instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and a built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24x7 enterprise support backed by Microsoft and Hortonworks, with 37 combined committers for Hadoop core (more than all other managed cloud providers combined) to support your deployment, and the ability to fix and commit code back to Hadoop.

     Enterprise-grade security and monitoring
     HDInsight protects your data assets and easily extends your on-premises security and governance controls to the cloud. We feature single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities through Azure Active Directory. You can authorize users and groups with fine-grained access control policies over all your enterprise data with Apache Ranger. HDInsight meets HIPAA, PCI, and SOC compliance, ensuring your enterprise data assets are always protected with the highest security and regulatory compliance. To ensure the highest level of business continuity, HDInsight extends capabilities for alerting, monitoring, defining pre-emptive actions, and enhanced workload protection through native integration with Azure Operations Management Suite (OMS).

     Most productive platform for developers and scientists
     HDInsight offers developers tailored experiences through rich productivity suites for Hadoop and Spark, with integrated development environments using Visual Studio, Eclipse, and IntelliJ, supporting Scala, Python, R, Java, and .NET. HDInsight gives data scientists the ability to create narratives that combine code, statistical equations, and visualizations that tell a story about the data, through integration with the two most popular notebooks: Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server mean handling up to 1000x more data at up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.

     Cost-effective cloud scale
     HDInsight has decoupled compute and storage, enabling you to cost-effectively scale workloads up or down, independent of storage. Local storage can still be used for caching and fast I/O. Spark and interactive Hive users can choose SSD memory for interactive performance, while Kafka users can retain all streaming data in premium managed disks. You pay only for the compute and storage you use, and you can choose the Azure VM types that enable the best utilization of resources. A recent study showed HDInsight delivering 63% lower TCO than deploying Hadoop on premises over 5 years.*

     Integration with leading productivity applications
     In the broader ecosystem for Hadoop, there is a thriving market of independent software vendors (ISVs) who provide value-added solutions. Through a unique design where every cluster is extended with edge nodes and script actions, HDInsight lets customers spin up Hadoop and Spark clusters pre-integrated and pre-tuned with any ISV application out of the box. Datameer, Cask, AtScale, and StreamSets are a few such applications that are very popular on the HDInsight platform today.

     Easy for administrators to manage
     With HDInsight, administrators can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. There is no time-consuming installation or setup, and no need to patch the operating system or upgrade Hadoop versions. Azure does it for you. Launch your first cluster in minutes.
  2. So, to bring it all together, here's where Microsoft has invested, across these four areas: identity and access management, information protection, threat protection, and security management. We've put a tremendous amount of investment into these, and the way it shows up is across a pretty broad array of product areas and features.
     • Our Identity and Access Management tools enable you to take an identity-based approach to security and establish truly conditional access policies.
     • Our Information Protection solutions help you apply protection that travels with the information as it moves around, both inside and outside your organization.
     • Our Threat Protection capabilities are built into the platform, so you can strengthen both pre-breach protection, with deep capabilities across e-mail, collaboration services, and endpoints, including hardware-based protection; and post-breach detection, which includes memory- and kernel-based protection and response with automation.
     • Our Security Management tools give you the visibility and, more importantly, the guidance to manage policy centrally.