Tips, Tricks & Best Practices for large scale HDInsight Deployments

[Diagram: an HDInsight cluster connected over the network to Azure BLOB Store / Azure Data Lake Store]

To grow your standard storage accounts past the advertised limits for capacity,
ingress/egress, and request rate, make a request through Azure Support.

https://blogs.msdn.microsoft.com/ashish/2016/09/02/hdinsight-hbase-9-things-you-must-do-to-get-great-hbase-performance/

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-multiple-clusters-data-lake-store

HDInsight orchestration with Azure Data Factory (ADF)
[Diagram: an HDInsight Spark/Hive/MR cluster attached to storage, with numbered steps labeled "1. Create cluster", "2. Submit jobs", and "6. Drop cluster jobs"]

{"ErrorCode":"AzureResourceCreationFailedErrorCode","ErrorDescription":"Internal server error occurred while processing the request. Please retry the request or contact support."}
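
Since the error text itself suggests retrying, a small retry wrapper with exponential backoff is one way to handle transient failures in a provisioning script. This is a minimal sketch and not part of the original deck: the `retry` function and the `create_cluster` placeholder mentioned below are hypothetical names.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, doubling the delay between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry 3 30 create_cluster` would call a user-supplied `create_cluster` function up to three times, waiting 30 and then 60 seconds between attempts. If the error persists, contact support as the message says.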

[Diagram: Source 1, Source 2, Source 3, ... Source n feed an HDInsight Cluster, which produces Result Set 1, Result Set 2, Result Set 3, ... Result Set n]

The HDInsight cluster has been scaled down to very few nodes. The number of nodes is below or close to
the HDFS replication factor.
hdfs dfsadmin -D "fs.default.name=hdfs://mycluster/" -report
hdfs fsck -D "fs.default.name=hdfs://mycluster/" /
Fix the issue by leaving safe mode:
hdfs dfsadmin -D "fs.default.name=hdfs://mycluster/" -safemode leave
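
The commands above can be wrapped so that safe mode is only left when HDFS actually reports it as on. This is a sketch under assumptions: the function names are hypothetical, and it relies on `hdfs dfsadmin -safemode get` printing a line containing "Safe mode is ON" or "Safe mode is OFF".

```shell
# Succeeds when the given `-safemode get` output reports safe mode as ON.
safemode_is_on() {
  printf '%s\n' "$1" | grep -q 'Safe mode is ON'
}

# Hypothetical wrapper: queries the safe mode state and leaves safe mode
# only if it is currently on. Uses the same -D option as the commands above.
leave_safemode_if_needed() {
  local fs_opt='fs.default.name=hdfs://mycluster/'
  local status
  status=$(hdfs dfsadmin -D "$fs_opt" -safemode get) || return 1
  if safemode_is_on "$status"; then
    hdfs dfsadmin -D "$fs_opt" -safemode leave
  else
    echo "Safe mode is already off"
  fi
}
```

Run `hdfs fsck` first, as shown above, so that you know whether blocks are missing or under-replicated before forcing HDFS out of safe mode.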

Capability | Interactive Query | Spark SQL | Presto
Interactive Query Speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early Support
Intelligent Cache Eviction | Yes | No | No
Complex Fact to Fact Joins | Yes | Yes | No
Transactions | Yes | No | No
Query Concurrency | High | Low | Low
Row, Column level security | Yes [Apache Ranger + AAD] | High | Medium
Rich end user Tools | Yes | Yes | Yes
Language Support | SQL, UDF | SQL, Scala, Python | SQL
Data Source Connector Support | Storage Handlers | Data Sources | High number of connectors

• LLAP, Spark, and Presto against 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran on all engines
successfully

• We used a number of different concurrency levels to test concurrency
performance
• 99 queries on 1 TB of data on a cluster with 32 worker nodes, with max concurrency set
to 32
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
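
A concurrency test of this kind can be driven from the shell by running a directory of query files with a fixed number of parallel workers. The sketch below is an illustration, not the harness used for the benchmark: `run_queries`, the `./queries` directory, and `$JDBC_URL` are hypothetical, and it assumes an `xargs` that supports `-0` and `-P`.

```shell
# run_queries CONCURRENCY QUERY_DIR RUNNER [RUNNER_ARGS...]
# Hypothetical helper: runs RUNNER once per *.sql file in QUERY_DIR, with at
# most CONCURRENCY runs in flight; the file path is appended as the last
# argument. Assumes QUERY_DIR contains at least one *.sql file.
run_queries() {
  local concurrency="$1" query_dir="$2"
  shift 2
  printf '%s\0' "$query_dir"/*.sql | xargs -0 -n 1 -P "$concurrency" "$@"
}
```

Test 1 would then correspond to something like `run_queries 1 ./queries beeline -u "$JDBC_URL" -f`, and higher concurrency levels only change the first argument.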

• Use a custom Metastore whenever possible; this helps you separate compute and
metadata
• Start with the S2 tier, which gives you 50 DTU and 250 GB of storage; you can always
scale the database up if you see bottlenecks
• Do not share a Metastore created for one HDInsight cluster version across different
HDInsight cluster versions, because different Hive versions use different Metastore
schemas. Example: Hive 1.2 and Hive 2.1 clusters trying to use the same Metastore.
• Back up your custom Metastore periodically for OOPS recovery and DR needs
• Keep the Metastore and the HDInsight cluster in the same region
• Monitor your Metastore for performance and availability with Azure SQL DB
monitoring tools [Azure Portal, Azure Log Analytics]
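
The rule about not sharing a Metastore across Hive versions can be enforced with a simple guard in a deployment script. This is a sketch under a stated assumption: it treats two clusters as compatible only when their Hive major.minor versions match, which is a simplification introduced here, and the function names are hypothetical.

```shell
# Prints the MAJOR.MINOR part of a Hive version string (1.2.1 -> 1.2).
hive_major_minor() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Hypothetical guard: succeeds only when both Hive versions share the same
# MAJOR.MINOR, i.e. when sharing one Metastore is assumed to be safe.
can_share_metastore() {
  [ "$(hive_major_minor "$1")" = "$(hive_major_minor "$2")" ]
}
```

With the example from the list, `can_share_metastore 1.2.1 2.1.0` fails, so a script could stop before attaching a Hive 2.1 cluster to a Metastore created by a Hive 1.2 cluster.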

Export
for d in `hive -e "show databases"`; do
  echo "create database if not exists $d; use $d;" >> alltables.sql
  for t in `hive --database $d -e "show tables"`; do
    ddl=`hive --database $d -e "show create table $t"`
    echo "$ddl ;" >> alltables.sql
    echo "$ddl" | grep -q "PARTITIONED\s*BY" && echo "MSCK REPAIR TABLE $t ;" >> alltables.sql
  done; done
Import
hive -f alltables.sql

DataLakeProbe
HBaseHealthProbe
HBaseMetricsProbe
HBaseProbe
HdfsProbe
HdinsightZookeeperProbe
……..
EdgenodeSSHWatchdog
GatewayTCPPingWatchdog
SSHTCPPingWatchdog
RStudioWatchdog
CertRolloverWatchdog
JobSubmissionPingWatchdog
OozieWatchdog
DataNodesUpWatchdog
NodeManagersUpWatchdog
ResourceHealthWatchdog
AzureNodeStatusWatchdog
ClusterMALoggingHashWatchdog
ClusterAvailabilityWatchdog
ClusterHealthWatchdog
……..
namenode_ha_health
ams_metrics_collector_process
ams_metrics_collector_autostart
ams_metrics_collector_hbase_master_process
namenode_last_checkpoint
namenode_webui
increase_nn_heap_usage_daily
hive_metastore_process
ambari_server_stale_alerts
ambari_server_agent_heartbeat
metrics_monitor_process_percent
……….

HDInsight Log Analytics Architecture
[Diagram: on the HDInsight nodes (Head, Worker, Zookeeper), the OMS Agent for Linux runs FluentD with an HDInsight plugin, driven by per-workload configs (HBase, Spark, Hive/LLAP, Storm, Kafka), and sends data to the Log Analytics (OMS) Service]
FluentD HDInsight plugin:
1. Plugin for 'in_tail' for all logs; allows a regexp to create a JSON object
2. Filter for WARN and above for each log type, using the `grep` filter plugin
3. Output to the out_oms_api type
4. Exec plugin for metrics
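
To see what the "WARN and above" filter in step 2 keeps, the same idea can be tried against a log file from the shell. This is only an approximation in plain `grep`, not the FluentD plugin configuration itself: the function name is hypothetical, and it assumes log4j-style level names (WARN, ERROR, FATAL) appearing as separate words.

```shell
# Hypothetical helper: reads log lines on stdin and prints only those whose
# level is WARN, ERROR, or FATAL (matched as a whitespace-delimited word).
filter_warn_and_above() {
  grep -E '(^|[[:space:]])(WARN|ERROR|FATAL)([[:space:]]|$)'
}
```

For example, piping a log file through `filter_warn_and_above` gives a quick estimate of how many lines per log type would be forwarded to Log Analytics.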
