Chef
Patterns
at
Bloomberg
Scale
//
CHEF PATTERNS AT
BLOOMBERG SCALE
HADOOP INFRASTRUCTURE TEAM
https://github.com/bloomberg/chef-bach
Freenode: #chef-bach
BLOOMBERG CLUSTERS 2
• APPLICATION SPECIFIC
• Hadoop, Kafka
• ENVIRONMENT SPECIFIC
• Networking, Storage
• BUILT REGULARLY
• DEDICATED “BOOTSTRAP” SERVER
• Virtual Machine
• DEDICATED CHEF-SERVER
WHY A VM? 3
• LIGHTWEIGHT PRE-REQUISITE
• Low memory/Storage Requirements
• RAPID DEPLOYMENT
• Vagrant for Bring-Up
• Vagrant for Re-Configuration
• EASY RELEASE MANAGEMENT
• MULTIPLE VM PER HYPERVISOR
• Multiple Clusters
• EASY RELOCATION
SERVICES OFFERED 4
• REPOSITORIES
• APT
• Ruby Gems
• Static Files (Chef!)
• CHEF SERVER
• KERBEROS KDC
• PXE SERVER
• DHCP/TFTP Server
• Cobbler (https://github.com/bloomberg/cobbler-cookbook)
• Bridged Networking (for test VMs)
• STRONG ISOLATION
BUILDING BOOTSTRAP 5
• CHEF AND VAGRANT
• Generic Image (Jenkins)
• NETWORK CONFIGURATION
• CORRECTING “KNIFE.RB”
• CHEF SERVER RECONFIGURATION
• CLEAN UP (CHEF REST API)
• CONVERT BOOTSTRAP TO BE AN ADMIN CLIENT
• Secrets/Keys
BUILDING BOOTSTRAP 6
• CHEF-SOLO PROVISIONER
# Chef provisioning
bootstrap.vm.provision "chef_solo" do |chef|
  chef.environments_path = [[:vm, ""]]
  chef.environment = env_name
  chef.cookbooks_path = [[:vm, ""]]
  chef.roles_path = [[:vm, ""]]
  chef.add_recipe("bcpc::bootstrap_network")
  chef.log_level = "debug"
  chef.verbose_logging = true
  chef.provisioning_path = "/home/vagrant/chef-bcpc/"
end
• CHEF SERVER RECONFIGURATION
• NGINX, SOLR, RABBITMQ
# Reconfigure chef-server
bootstrap.vm.provision :shell, :inline => "chef-server-ctl reconfigure"
BUILDING BOOTSTRAP 7
• CLEAN UP (REST API)
ruby_block "cleanup-old-environment-databag" do
  block do
    rest = Chef::REST.new(node[:chef_client][:server_url], "admin",
                          "/etc/chef-server/admin.pem")
    rest.delete("/environments/GENERIC")
    rest.delete("/data/configs/GENERIC")
  end
end
ruby_block "cleanup-old-clients" do
  block do
    # Keep the built-in clients; delete every other registered client
    system_clients = ["chef-validator", "chef-webui"]
    rest = Chef::REST.new(node[:chef_client][:server_url], "admin",
                          "/etc/chef-server/admin.pem")
    rest.get_rest("/clients").each do |client|
      if !system_clients.include?(client.first)
        rest.delete("/clients/#{client.first}")
      end
    end
  end
end
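The client clean-up above amounts to a set difference: every registered client except the built-in system clients is deleted. A minimal plain-Ruby sketch of that selection follows, with the REST call replaced by an in-memory hash. The `clients_to_delete` helper, the sample client names, and the example URLs are hypothetical and are not part of chef-bach.

```ruby
# Hypothetical helper: pick the clients that the clean-up would delete.
# `clients` mimics a name => URL hash such as GET /clients might return.
SYSTEM_CLIENTS = ['chef-validator', 'chef-webui'].freeze

def clients_to_delete(clients, keep = SYSTEM_CLIENTS)
  clients.keys.reject { |name| keep.include?(name) }
end

# Made-up sample data for illustration only.
sample = {
  'chef-validator' => 'https://chef.example/clients/chef-validator',
  'chef-webui'     => 'https://chef.example/clients/chef-webui',
  'old-node-1'     => 'https://chef.example/clients/old-node-1'
}

stale = clients_to_delete(sample)
puts stale.inspect
```

Separating the selection from the deletion like this makes the rule easy to check before any destructive call is issued.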
BUILDING BOOTSTRAP 8
• CONVERT TO ADMIN (BOOTSTRAP_CONFIG.RB)
ruby_block "convert-bootstrap-to-admin" do
  block do
    rest = Chef::REST.new(node[:chef_client][:server_url],
                          "admin",
                          "/etc/chef-server/admin.pem")
    rest.put_rest("/clients/#{node[:hostname]}", { :admin => true })
    rest.put_rest("/nodes/#{node[:hostname]}",
                  { :name => node[:hostname],
                    :run_list => ['role[BCPC-Bootstrap]'] })
  end
end
CLUSTER USABILITY 9
• CODE DEPLOYMENT
• APPLICATION COOKBOOKS
• RUBY GEMS
• Zookeeper, WebHDFS
• CLUSTERS ARE NOT SINGLE MACHINE
• Which machine to deploy to?
• Idempotency; Races
DEPLOY TO HDFS 10
• USE CHEF DIRECTORY RESOURCE
• USE CUSTOM PROVIDER
• https://github.com/bloomberg/chef-bach/blob/master/cookbooks/bcpc-hadoop/libraries/hdfsdirectory.rb
directory "/projects/myapp" do
  mode 0755
  owner "foo"
  recursive true
  provider BCPC::HdfsDirectory
end
DEPLOY KAFKA TOPIC 11
• USE LWRP
• Dynamic Topic; Right Zookeeper
• PROVIDER CODE AVAILABLE AT
• https://github.com/mthssdrbrg/kafka-cookbook/pull/49
# Kafka Topic Resource
actions :create, :update
attribute :name, :kind_of => String , :name_attribute => true
attribute :partitions, :kind_of => Integer, :default => 1
attribute :replication, :kind_of => Integer, :default => 1
KERBEROS 12
• KEYTABS
• Per Service / Host
• Up to 10 Keytabs per Host
• WHAT ABOUT MULTI HOMED HOSTS?
• Hadoop imputes _HOST
• PROVIDERS
• WebHDFS uses SPNEGO
• SYSTEM ROLE ACCOUNTS
• TENANT ROLE ACCOUNTS
• AVAILABLE AT
• https://github.com/bloomberg/chef-bach/tree/kerberos
LOGIC INJECTION 13
• COMPLETE CODE CAN BE FOUND AT
• Community cookbook
• https://github.com/mthssdrbrg/kafka-cookbook#controlling-restart-of-kafka-brokers-in-a-cluster
• Wrapper custom recipe
• https://github.com/bloomberg/chef-bach/blob/rolling_restart/cookbooks/kafka-bcpc/recipes/coordinate.rb
Statutory Warning
Code snippets were edited to fit the slides, which may have introduced logical
incoherence, bugs, and unreadable code. Reader discretion is requested.
LOGIC INJECTION 14
• WE USE COMMUNITY COOKBOOKS
• They take care of the standard installing, enabling, and starting of services
• NEED TO ADD LOGIC TO COOKBOOK RECIPES
• Take action on a service only when conditions are satisfied
• Take action on a service based on dependent service state
LOGIC INJECTION 15
VANILLA COMMUNITY COOKBOOK:
template ::File.join(node.kafka.config_dir, 'server.properties') do
  source 'server.properties.erb'
  ...
  helpers(Kafka::Configuration)
  if restart_on_configuration_change?
    notifies :restart, 'service[kafka]', :delayed
  end
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
end
LOGIC INJECTION 16
VANILLA COMMUNITY COOKBOOK:
template ::File.join(node.kafka.config_dir, 'server.properties') do
  source 'server.properties.erb'
  ...
  helpers(Kafka::Configuration)
  if restart_on_configuration_change?
    notifies :restart, 'service[kafka]', :delayed
  end
end
#----- Remove ----#
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
end
#----- Remove ----#
LOGIC INJECTION 17
VANILLA COMMUNITY COOKBOOK 2.0:
template ::File.join(node.kafka.config_dir, 'server.properties') do
  source 'server.properties.erb'
  ...
  helpers(Kafka::Configuration)
  if restart_on_configuration_change?
    notifies :create, 'ruby_block[pre-shim]', :immediately
  end
end
#----- Replace ----#
include_recipe node["kafka"]["start_coordination"]["recipe"]
#----- Replace ----#
LOGIC INJECTION 18
COOKBOOK COORDINATOR RECIPE:
ruby_block 'pre-shim' do
  block do
    # pre-restart no-op
  end
  notifies :restart, 'service[kafka]', :delayed
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
end
LOGIC INJECTION 19
WRAPPER COORDINATOR RECIPE:
ruby_block 'pre-shim' do
  block do
    # pre-restart done here
  end
  notifies :restart, 'service[kafka]', :delayed
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
  notifies :create, 'ruby_block[post-shim]', :immediately
end
ruby_block 'post-shim' do
  block do
    # clean-up done here
  end
end
SERVICE ON DEMAND 20
• COMMON SERVICE WHICH CAN BE REQUESTED
• Copy log files from applications into a centralized location
• Gives users a single location to review logs, which also helps with security
• Service available on all the nodes
• Applications can request the service dynamically
SERVICE ON DEMAND 21
• NODE ATTRIBUTE TO STORE SERVICE REQUESTS
default['bcpc']['hadoop']['copylog'] = {}
• DATA STRUCTURE TO MAKE SERVICE REQUESTS
{
  'app_id' => { 'logfile' => "/path/file_name_of_log_file",
                'docopy' => true # or false
              }, ...
}
SERVICE ON DEMAND 22
• APPLICATION RECIPES MAKE SERVICE REQUESTS
#
# Updating node attributes to copy HBase master log file to HDFS
#
node.default['bcpc']['hadoop']['copylog']['hbase_master'] = {
  'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.log",
  'docopy' => true
}
node.default['bcpc']['hadoop']['copylog']['hbase_master_out'] = {
  'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.out",
  'docopy' => true
}
SERVICE ON DEMAND 23
• RECIPE FOR THE COMMON SERVICE
node['bcpc']['hadoop']['copylog'].each do |id, f|
  if f['docopy']
    # One Flume agent configuration per requested log file
    template "/etc/flume/conf/flume-#{id}.conf" do
      source "flume_flume-conf.erb"
      action :create
      ...
      variables(:agent_name => "#{id}",
                :log_location => "#{f['logfile']}")
      notifies :restart, "service[flume-agent-multi-#{id}]", :delayed
    end
    service "flume-agent-multi-#{id}" do
      supports :status => true, :restart => true, :reload => false
      service_name "flume-agent-multi"
      action :start
      start_command "service flume-agent-multi start #{id}"
      restart_command "service flume-agent-multi restart #{id}"
      status_command "service flume-agent-multi status #{id}"
    end
  end
end
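The request/serve split can be tried outside Chef. The sketch below is a rough plain-Ruby model, assuming an ordinary hash in place of `node['bcpc']['hadoop']['copylog']`. The `flume_agents` helper, the `namenode` entry, and the log paths are hypothetical illustrations.

```ruby
# Hypothetical stand-in for node['bcpc']['hadoop']['copylog'].
copylog = {}

# Two "application recipes" register their requests independently.
copylog['hbase_master'] = { 'logfile' => '/var/log/hbase/hbase-master.log', 'docopy' => true }
copylog['namenode']     = { 'logfile' => '/var/log/hadoop/namenode.log',    'docopy' => false }

# The common service only acts on requests that have docopy set.
def flume_agents(requests)
  requests.select { |_id, req| req['docopy'] }.map do |id, req|
    { 'conf' => "/etc/flume/conf/flume-#{id}.conf", 'log_location' => req['logfile'] }
  end
end

agents = flume_agents(copylog)
agents.each { |a| puts "#{a['conf']} <- #{a['log_location']}" }
```

The application side never touches Flume directly; it only writes a small hash, which keeps the common service free to change how the copy is implemented.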
PLUGGABLE ALERTS 24
• SINGLE SOURCE FOR MONITORED STATS
• Allows users to visualize stats across different parameters
• Did not want the alerting system to duplicate the stats collection
• Need to feed data to the alerting system to generate alerts
PLUGGABLE ALERTS 25
• ATTRIBUTE WHERE USERS CAN DEFINE ALERTS
default["bcpc"]["hadoop"]["graphite"]["queries"] = {
  'hbase_master' => [
    { 'type' => "jmx",
      # Query to pull stats from data source
      'query' => "memory.NonHeapMemoryUsage_committed",
      'key' => "hbasenonheapmem",
      # Define alert criteria
      'trigger_val' => "max(61,0)",
      'trigger_cond' => "=0",
      'trigger_name' => "HBaseMasterAvailability",
      'trigger_dep' => ["NameNodeAvailability"],
      'trigger_desc' => "HBase master seems to be down",
      'severity' => 1
    }, {
      'type' => "jmx",
      'query' => "memory.HeapMemoryUsage_committed",
      'key' => "hbaseheapmem",
      ...
    }, ...], 'namenode' => [...] ...}
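Because alerts are plain data, they can be checked before anything is sent to the alerting system. The sketch below is one possible sanity check, assuming that entries carrying a `trigger_name` are alerts and that every `trigger_dep` should name a defined trigger. Both rules are assumptions for illustration, and the `nnheapmem` entry is made up.

```ruby
# Hypothetical subset of the queries attribute shown above.
queries = {
  'hbase_master' => [
    { 'key' => 'hbasenonheapmem', 'trigger_name' => 'HBaseMasterAvailability',
      'trigger_dep' => ['NameNodeAvailability'], 'severity' => 1 },
    { 'key' => 'hbaseheapmem' } # stat only, no alert criteria
  ],
  'namenode' => [
    { 'key' => 'nnheapmem', 'trigger_name' => 'NameNodeAvailability', 'severity' => 1 }
  ]
}

# Assumption: entries with a trigger_name are alerts, the rest are stats only.
alerts = queries.values.flatten.select { |q| q.key?('trigger_name') }
names  = alerts.map { |a| a['trigger_name'] }

# Dependencies that refer to a trigger nobody defined.
missing = alerts.flat_map { |a| a.fetch('trigger_dep', []) } - names
puts "#{alerts.size} alerts, missing dependencies: #{missing.inspect}"
```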
TEMPLATE PITFALLS 26
• LIBRARY FUNCTION CALLS IN WRAPPER COOKBOOKS
• Community cookbook provider accepts template as an attribute
• Template passed from wrapper makes a library function call
• Wrapper recipe includes the module of library function
TEMPLATE PITFALLS 27
• WRAPPER RECIPE
...
Chef::Resource.send(:include, Bcpc::OSHelper)
...
cobbler_profile "bcpc_host" do
  kickstart "cobbler.bcpc_ubuntu_host.preseed"
  distro "ubuntu-12.04-mini-x86_64"
end
...
• FUNCTION CALL IN TEMPLATE
...
d-i passwd/user-password-crypted password <%="#{get_config(@node, 'cobbler-root-password-salted')}"%>
d-i passwd/user-uid string
...
TEMPLATE PITFALLS 28
• MODIFIED FUNCTION CALL IN TEMPLATE
...
d-i passwd/user-password-crypted password <%="#{Bcpc::OSHelper.get_config(@node, 'cobbler-root-password-salted')}"%>
d-i passwd/user-uid string
...
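The pitfall can be reproduced with plain ERB. In the sketch below, the object that renders the template does not include the helper module, so a bare `get_config` call fails while a module-qualified call works. `OSHelper` and `TemplateContext` are hypothetical stand-ins and do not reflect how Chef or the cobbler cookbook is actually implemented.

```ruby
require 'erb'

# Hypothetical helper module, loosely modelled on the slide's Bcpc::OSHelper.
module OSHelper
  module_function

  def get_config(node, key)
    node.fetch(key)
  end
end

# Stand-in for the context in which a provider renders a template.
# It does NOT include OSHelper, so bare helper calls are not visible here.
class TemplateContext
  def initialize(node)
    @node = node
  end

  def render(source)
    ERB.new(source).result(binding)
  end
end

node = { 'cobbler-root-password-salted' => 'SALTED' }
ctx  = TemplateContext.new(node)

bare      = "<%= get_config(@node, 'cobbler-root-password-salted') %>"
qualified = "<%= OSHelper.get_config(@node, 'cobbler-root-password-salted') %>"

begin
  ctx.render(bare)
  bare_ok = true
rescue NameError
  bare_ok = false # the helper is not mixed into the rendering context
end

rendered = ctx.render(qualified)
puts "bare call works: #{bare_ok}, qualified call renders: #{rendered}"
```

Qualifying the call removes the dependency on which classes happen to have the module mixed in, at the cost of a longer expression in the template.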
DYNAMIC RESOURCES 29
• ANTI-PATTERN?
ruby_block "create namenode directories" do
  block do
    node[:bcpc][:storage][:mounts].each do |d|
      # Build and run a directory resource at converge time
      dir = Chef::Resource::Directory.new("#{mount_root}/#{d}/dfs/nn",
                                          run_context)
      dir.owner "hdfs"
      dir.group "hdfs"
      dir.mode 0755
      dir.recursive true
      dir.run_action :create
      # Fix ownership only when the tree is not already owned by hdfs
      exe = Chef::Resource::Execute.new("fixup nn owner", run_context)
      exe.command "chown -Rf hdfs:hdfs #{mount_root}/#{d}/dfs"
      exe.only_if {
        Etc.getpwuid(File.stat("#{mount_root}/#{d}/dfs/").uid).name != "hdfs"
      }
      exe.run_action :run
    end
  end
end
DYNAMIC RESOURCES 30
• SYSTEM CONFIGURATION
• Lengthy Configuration of a Storage Controller
• Setting Attributes at Converge Time
• Compile Time Actions?
• MUST WRAP IN RUBY_BLOCKS
• Does not Update the Resource Collection
• Lazy evaluation everywhere:
• Guards: not_if{lazy{node[…]}.call.map{…}}
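The need for lazy evaluation comes from the two phases of a Chef run: values read while the recipe is compiled are fixed before any resource has run. The toy model below imitates this with lambdas in plain Ruby. It is a conceptual sketch only, and every name in it is hypothetical.

```ruby
# Toy model of the two phases: "compile" builds a list of resources,
# "converge" runs them in order.
node = { 'disks' => [] }
resources = []

# Resource 1 discovers disks, but only when it runs at converge time.
resources << -> { node['disks'] = %w[sdb sdc]; :discovered }

# Eager: the value is read during compile, before discovery has run.
eager_disks = node['disks'].dup
resources << -> { eager_disks }

# Lazy: the read is wrapped in a lambda and deferred until converge.
lazy_disks = -> { node['disks'] }
resources << -> { lazy_disks.call }

results = resources.map(&:call) # converge phase
puts "eager saw #{results[1].inspect}, lazy saw #{results[2].inspect}"
```

The eager read sees the empty list that existed at compile time, while the deferred read sees the disks set by the first resource.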
SERVICE RESTART 31
• WE USE JMXTRANS TO MONITOR JMX STATS
• Service to be monitored varies with node
• There can be more than one service to be monitored
• Monitored service restart requires JMXtrans to be restarted**
SERVICE RESTART 32
• DATA STRUCTURE IN ROLES TO DEFINE THE SERVICES
"default_attributes" : {
  "jmxtrans": {
    "servers": [
      {
        "type": "datanode",
        "service": "hadoop-hdfs-datanode",
        "service_cmd": "org.apache.hadoop.hdfs.server.datanode.DataNode"
      }, {
        "type": "hbase_rs",
        "service": "hbase-regionserver",
        "service_cmd": "org.apache.hadoop.hbase.regionserver.HRegionServer"
      }
    ]
  } ...
• "service": Dependent Service Name
• "service_cmd": String to uniquely identify the service process
SERVICE RESTART 33
• JMXTRANS SERVICE RESTART LOGIC BUILT DYNAMICALLY
# Store the dependent service name and process ids in local variables
jmx_services = Array.new
jmx_srvc_cmds = Hash.new
node['jmxtrans']['servers'].each do |server|
  jmx_services.push(server['service'])
  jmx_srvc_cmds[server['service']] = server['service_cmd']
end
service "restart jmxtrans on dependent service" do
  service_name "jmxtrans"
  supports :restart => true, :status => true, :reload => true
  action :restart
  # Subscribes from all dependent services
  jmx_services.each do |jmx_dep_service|
    subscribes :restart, "service[#{jmx_dep_service}]", :delayed
  end
  # What if a process is re/started externally?
  only_if { process_require_restart?("jmxtrans", "jmxtrans-all.jar",
                                     jmx_srvc_cmds) }
end
SERVICE RESTART 34
def process_require_restart?(process_name, process_cmd, dep_cmds)
  tgt_process_pid = `pgrep -f #{process_cmd}`
  ...
  # Start time of the service process
  tgt_process_stime = `ps --no-header -o start_time #{tgt_process_pid}`
  ...
  ret = false
  restarted_processes = Array.new
  dep_cmds.each do |dep_process, dep_cmd|
    dep_pids = `pgrep -f #{dep_cmd}`
    if dep_pids != ""
      dep_pids_arr = dep_pids.split("\n")
      dep_pids_arr.each do |dep_pid|
        # Start time of all the service processes on which it depends
        dep_process_stime = `ps --no-header -o start_time #{dep_pid}`
        # Compare the start times
        if DateTime.parse(tgt_process_stime) <
           DateTime.parse(dep_process_stime)
          restarted_processes.push(dep_process)
          ret = true
        end
        ...
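The comparison at the heart of this check can be separated from the `pgrep` and `ps` calls. The sketch below is a pure-Ruby version that takes start times as `Time` objects. The `restarted_dependencies` helper and the sample timestamps are hypothetical.

```ruby
# Hypothetical pure-Ruby version of the comparison: the target needs a
# restart if any process it depends on started after the target did.
def restarted_dependencies(target_start, dep_starts)
  dep_starts.select { |_name, started| started > target_start }.keys
end

# Made-up start times for illustration only.
jmxtrans_start = Time.utc(2015, 4, 1, 10, 0, 0)
deps = {
  'hadoop-hdfs-datanode' => Time.utc(2015, 4, 1, 9, 30, 0), # older than jmxtrans
  'hbase-regionserver'   => Time.utc(2015, 4, 1, 10, 5, 0)  # restarted later
}

restarted = restarted_dependencies(jmxtrans_start, deps)
puts "restart jmxtrans: #{!restarted.empty?} (because of #{restarted.inspect})"
```

Keeping the comparison free of shell calls makes it testable without running processes, and it also covers the case where a dependency was restarted outside of Chef.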
ROLLING RESTART 35
• AUTOMATIC CONVERGENCE
• AVAILABILITY
• High Availability
• Toxic Configuration
• HOW
• Check Masters for Slave Status
• Synchronous Communication
• Locking
ROLLING RESTART 36
• FLAGGING
• Negative Flagging – flag when a service is down
• Positive Flagging – flag when a service is reconfiguring
• Deadlock Avoidance
• CONTENTION
• Poll & Wait
• Fail the Run
• Simply Skip Service Restart and Go On
• Store the Need for Restart
• Breaks Assumptions of Procedural Chef Runs
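The contention options above can be illustrated with a small lock. The sketch below uses an in-memory `LockStore` as a stand-in for a shared store such as ZooKeeper; it is not thread-safe and is not the chef-bach implementation. The class, the `with_restart_lock` helper, and the broker names are hypothetical.

```ruby
# In-memory stand-in for a shared lock store such as ZooKeeper.
class LockStore
  def initialize
    @locks = {}
  end

  # Returns true if `owner` now holds the lock for `service`.
  def acquire(service, owner)
    @locks[service] ||= owner
    @locks[service] == owner
  end

  def release(service, owner)
    @locks.delete(service) if @locks[service] == owner
  end
end

# Poll & wait: retry a few times, then give up so the caller can fail
# the run or skip the restart and record that it is still needed.
def with_restart_lock(store, service, owner, attempts: 3, wait: 0.01)
  attempts.times do
    if store.acquire(service, owner)
      begin
        return yield
      ensure
        store.release(service, owner)
      end
    end
    sleep wait
  end
  :skipped
end

store = LockStore.new
store.acquire('kafka', 'broker-2') # another node is restarting
first = with_restart_lock(store, 'kafka', 'broker-1') { :restarted }
store.release('kafka', 'broker-2')
second = with_restart_lock(store, 'kafka', 'broker-1') { :restarted }
puts [first, second].inspect
```

Returning `:skipped` instead of raising corresponds to the "skip and go on" option; a caller that prefers "fail the run" could raise on that value.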
ROLLING RESTART 37
• SERVICE DEFINITION
hadoop_service "zookeeper-server" do
  dependencies ["template[/etc/zookeeper/conf/zoo.cfg]",
                "template[/usr/lib/zookeeper/bin/zkServer.sh]",
                "template[/etc/default/zookeeper-server]"]
  process_identifier "org.apache.zookeeper ... QuorumPeerMain"
end
ROLLING RESTART 38
• SYNCH STATE STORE
• Zookeeper
• SERVICE RESTART (KAFKA) VALIDATION CHECK
• Based on Jenkins pattern for wait_until_ready!
• Verifies that the service is up to an acceptable level
• Passes or stops the Chef run
• FUTURE DIRECTIONS
• Topology Aware Deployment
• Data Aware Deployment
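A validation check of this kind can be sketched as a polling loop that either returns or raises. The version below is a minimal plain-Ruby illustration and is not the Jenkins cookbook helper itself. The `ServiceNotReady` class, the keyword arguments, and the simulated health check are all assumptions.

```ruby
# Hypothetical readiness gate: poll a check until it passes or give up.
class ServiceNotReady < StandardError; end

def wait_until_ready!(name, attempts: 5, wait: 0.01)
  attempts.times do |i|
    return i + 1 if yield # number of checks it took
    sleep wait
  end
  raise ServiceNotReady, "#{name} did not become ready after #{attempts} checks"
end

# Simulated broker that reports healthy on the third check.
checks = 0
tries = wait_until_ready!('kafka') { (checks += 1) >= 3 }
puts "kafka ready after #{tries} checks"

begin
  wait_until_ready!('kafka', attempts: 2) { false }
  stopped = false
rescue ServiceNotReady
  stopped = true # inside a Chef run, an unhandled exception stops the run
end
```

In a rolling restart, a gate like this would sit between restarting one node and releasing the lock for the next, so an unhealthy service halts the rollout instead of spreading.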
WE ARE HIRING
JOBS.BLOOMBERG.COM:
• Hadoop Infrastructure Engineer
• DevOps Engineer Search Infrastructure
https://github.com/bloomberg/chef-bach
Freenode: #chef-bach
