The new version of Chef brings more options, more features, and an even better knife status command. That leads to the topic at hand: how to alert on stale nodes in Chef using Nagios.
The knife status command displays a brief summary of the nodes on a Chef server:
knife status (options)
With the -H switch, nodes that completed a Chef run within the past hour are hidden, so the output lists only the remaining nodes along with the time since their last successful Chef run, e.g.:
knife status -H
20 hours ago, dev-vm.nclouds.com, ubuntu 10.04, dev-vm.nclouds.com, 10.66.44.126
3 hours ago, i-225f954f, ubuntu 10.04, ec2-67-202-63-102.compute-1.amazonaws.com, 67.202.63.102
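If you ever want to consume this output directly, here is a minimal sketch of parsing one status line in Ruby. It assumes every line has the five comma-separated fields shown above and that the age always reads "<number> minutes ago" or "<number> hours ago"; `parse_status_line` and `SECONDS_PER_UNIT` are hypothetical names, not part of Chef.

```ruby
# Hypothetical parser for `knife status` lines of the shape shown above:
#   "<N> <unit> ago, <node name>, <platform>, <fqdn>, <ip>"
# Assumption: the age is always "<number> minute(s)/hour(s) ago";
# anything else makes the parser return nil.
SECONDS_PER_UNIT = { 'minute' => 60, 'hour' => 3600 }.freeze

def parse_status_line(line)
  age, name, platform, fqdn, ip = line.strip.split(/,\s*/, 5)
  match = /\A(\d+) (minute|hour)s? ago\z/.match(age.to_s)
  return nil unless match

  {
    name: name,
    age_seconds: match[1].to_i * SECONDS_PER_UNIT[match[2]],
    platform: platform,
    fqdn: fqdn,
    ip: ip
  }
end

line = '20 hours ago, dev-vm.nclouds.com, ubuntu 10.04, dev-vm.nclouds.com, 10.66.44.126'
node = parse_status_line(line)
puts "#{node[:name]} last checked in #{node[:age_seconds] / 3600} hours ago"
```

Parsing text like this is brittle, which is one reason to query the Chef server from a script instead of scraping knife output.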
We can build on the same check-in information to alert on stale nodes, using a small Ruby script and some settings in Nagios. Let’s start with the Ruby script:
#!/opt/chef/embedded/bin/ruby
require 'rubygems'
require 'chef/config'
require 'chef/rest'
require 'chef/search/query'

## Define the hours to alert on and the path to the Chef client.rb,
## which the script loads so it can query the Chef server
critical = 12
warning = 1
Chef::Config.from_file(File.expand_path("/etc/chef/client.rb"))

# Nagios plugin exit codes
OK_STATE = 0
WARNING_STATE = 1
CRITICAL_STATE = 2
UNKNOWN_STATE = 3

if warning > critical || warning < 0
  puts "Warning: warning should not be greater than critical or less than zero"
  exit(WARNING_STATE)
end

query = Chef::Search::Query.new
all_nodes = []
cnodes = []
wnodes = []

# Collect every node known to the Chef server
query.search('node', "*:*") do |node|
  all_nodes << node
end

all_nodes.each do |node|
  # Whole hours since the node last reported ohai_time
  hours = (Time.now.to_i - node['ohai_time'].to_i) / 3600
  if hours >= critical
    cnodes << node.name
  elsif hours >= warning
    wnodes << node.name
  end
end

if cnodes.length > 0
  puts "CRITICAL: " + cnodes.join(',') + " did not check in for " + critical.to_s + " hours"
  exit(CRITICAL_STATE)
elsif wnodes.length > 0
  puts "WARNING: " + wnodes.join(',') + " did not check in for " + warning.to_s + " hours"
  exit(WARNING_STATE)
elsif cnodes.empty? && wnodes.empty?
  # The original compared wnodes.join(',') with 0, which is never true
  puts "OK: All nodes are ok!"
  exit(OK_STATE)
else
  puts "UNKNOWN"
  exit(UNKNOWN_STATE)
end
In the script above, any node that has not checked in within the 12-hour critical threshold is reported as CRITICAL, and any node that has not checked in for at least 1 hour is reported as WARNING. Nagios turns these exit states into alerts once the plugin is set up as described below.
Please note that the machine running the script needs to be able to connect to the Chef server using the credentials in /etc/chef/client.rb, the file the script loads.
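The threshold logic can be tried out without a Chef server. The following is a minimal sketch, not the script itself: `classify_stale` is a hypothetical helper, and the node names and timestamps are made-up sample data standing in for the `ohai_time` values the script reads.

```ruby
# Hypothetical helper mirroring the script's threshold logic.
# last_seen maps node name => epoch seconds of the last check-in
# (the role ohai_time plays in the script).
def classify_stale(last_seen, now, warning_hours, critical_hours)
  critical = []
  warning = []
  last_seen.each do |name, seen_at|
    # Integer division gives whole hours, as in the script
    hours = (now.to_i - seen_at.to_i) / 3600
    if hours >= critical_hours
      critical << name
    elsif hours >= warning_hours
      warning << name
    end
  end
  { critical: critical, warning: warning }
end

# Made-up sample data: check-ins 10 minutes, 3 hours and 20 hours ago
now = Time.at(1_000_000)
sample = {
  'fresh-node' => now.to_i - 10 * 60,
  'late-node'  => now.to_i - 3 * 3600,
  'dead-node'  => now.to_i - 20 * 3600
}

result = classify_stale(sample, now, 1, 12)
puts "critical: #{result[:critical].join(',')}"
puts "warning: #{result[:warning].join(',')}"
```

One consequence of this logic is that a node with no recorded check-in time ends up in the critical list, because `nil.to_i` is 0 in Ruby and the computed age becomes very large.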
Install the script in your Nagios plugins directory, for example:
cp check_chef_nodes.rb /usr/lib64/nagios/plugins/check_chef_nodes.rb
Then, in the Nagios configuration, define the command, host and service like this:
define command {
    command_name    check_chef_node_status
    command_line    $USER1$/check_chef_nodes.rb
}

define host {
    use             linux-server
    contact_groups  admins
    address         127.0.0.1
    host_name       localhost
}

define service {
    use                     local-service    ; Name of service template to use
    host_name               localhost
    service_description     Chef Node Health Check
    check_command           check_chef_node_status
    notifications_enabled   0
}
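Once everything is configured, restart Nagios and you should see a service monitor named Chef Node Health Check. Note that notifications_enabled is set to 0 in the service definition, which keeps Nagios from sending notifications for this service; set it to 1 if you want to be notified.
That’s all for today, folks. You now have an alert for stale nodes on your Chef server and can take steps to keep all your nodes up to date.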