
Friday, January 10, 2014

Book review of Icinga Network Monitoring by Packt Publishing

I got the opportunity to be a technical reviewer for Icinga Network Monitoring by Packt Publishing in 2013. I have written several articles about Icinga on this blog.
It was actually my first time technically reviewing a book. I have already expressed my gratitude, but I would like to thank those who gave me this opportunity once again, and I hope that Icinga Network Monitoring will help beginners in particular learn about Icinga.

Icinga Network Monitoring by Viranch Mehta

My rating: 3 of 5 stars


I honestly believe that this book will help beginners who want to learn Icinga; on the other hand, it might not be entirely satisfactory for those who want to go deeper into Icinga.
What I like about this book is that it is organized to help the reader systematically understand Icinga's architecture and usage, because grasping the whole structure of a monitoring tool is hard and takes time.
For example,

* Each chapter is well composed, with an introduction and a summary
* Enough tables, diagrams, and graphs to help the reader grasp the concepts
* Shows how to write and integrate custom plugins
* Warning and critical thresholds are explained with command lines and their interpretation
* An index is included

You can check the Icinga architecture with the diagrams on the official homepage if you want to understand it more deeply.
https://www.icinga.org/about/architec...

I expect that a book covering Icinga 2.x will be published soon.


Finally, let me introduce one of my documents about monitoring solutions; it is not today's theme, but it is close to it. I am currently evaluating cloud monitoring solutions, both OSS and SaaS from the AWS Partner Network, that can automatically register and monitor nodes on AWS. For example, I tried the solutions below:



I summarized my evaluation of Zabbix with HyClops, StackDriver, and CopperEgg on my SlideShare (written in Japanese).



I am also wondering how much Icinga will automate discovering, registering, and monitoring such nodes some day!

That's it for today!

Sunday, October 14, 2012

Monitoring tool - pnp4nagios special template


When we use Icinga or Nagios with pnp4nagios to graph hardware or middleware resources, we often want to see one metric from several services in a single graph, such as the load average of all servers or the HTTP response time of all web servers, to compare them and examine the differences.
We can do this with pnp4nagios by creating our own templates, called special templates.

"special templates (starting with PNP 0.6.5) are used to combine data from arbitrary hosts and services and thus are not connected directly to a host or service.", says about pnp4nagios special template, here.

So what does a special template look like? Here is an example which combines the load average of all hosts into one graph.
sample_load.php
<?php
$this->MACRO['TITLE']   = "LOADAVERAGE";
$this->MACRO['COMMENT'] = "For All Servers";
$services = $this->tplGetServices("","LOADAVERAGE$");
# The Datasource Name for Graph 0
$ds_name[0] = "LOADAVERAGE";
$opt[0]     = "--title \"LOADAVERAGE\"";
$def[0]     = "";
# Iterate through the list of hosts
foreach($services as $key=>$val){
  $data = $this->tplGetData($val['host'],$val['service']);
  #throw new Kohana_exception(print_r($a,TRUE));
  $hostname   = rrd::cut($data['MACRO']['HOSTNAME']);
  $def[0]    .= rrd::def("var$key" , $data['DS'][0]['RRDFILE'], $data['DS'][0]['DS'] );
  $def[0]    .= rrd::line1("var$key", rrd::color($key), $hostname);
  $def[0]    .= rrd::gprint("var$key", array("MAX", "AVERAGE"));
}
?> 
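
To try it, drop the file into the special templates directory and open the special action in a browser (the directory and URL follow the pnp4nagios defaults used later in this post):

# cp sample_load.php /usr/local/pnp4nagios/share/templates.special/

Then take a look at http://<your icinga server>/pnp4nagios/special?tpl=sample_load.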


The sample template is on my GitHub. Please see the official reference for more detailed information about how to define special templates.

Next, I would like to demonstrate step by step how to create your own template for the number of HTTP accesses, including setting up the nagios plugin, a pnp4nagios custom template, and a special template.
If you need to install Icinga or pnp4nagios, please see the past articles, here.

This is the graph which sample_apache_access.php generates.


  • setup nagios plugin (/usr/local/icinga/libexec)
# for modules in LWP::UserAgent Time::HiRes Digest::MD5 ; do cpan -i $modules ; done
# wget  http://blog.spreendigital.de/wp-content/uploads/2009/07/check_apachestatus_auto.tgz -O- | tar zx
# ./check_apachestatus_auto.pl -H 127.0.0.1
APACHE OK - 0.050 sec. response time, Busy/Idle 1/9, open 246/256, ReqPerSec 0.4, BytesPerReq 17, BytesPerSec 5|Idle=9 Busy=1 OpenSlots=246 Slots=256 Starting=0 Reading=0 Sending=1 Keepalive=0 DNS=0 Closing=0 Logging=0 Finishing=0 ReqPerSec=0.350877 BytesPerReq=17 BytesPerSec=5.988304 Accesses=60

  • enable the mod_status module if it is not already enabled (in httpd.conf or an included configuration); a quick curl check follows this list
ExtendedStatus On
<VirtualHost *:80>
  ServerName 127.0.0.1
  <Location /server-status>
    SetHandler server-status
    Order deny,allow
    Allow from 192.168.0.0/24
  </Location>
</VirtualHost>

  • define command and service configuration for Icinga/Nagios
define command{
        command_name    check_apache_performance
        command_line    $USER1$/check_apachestatus_auto.pl -H $HOSTADDRESS$ -t $ARG1$
}
define  service{
        use                    generic-service
        host_name               ha-mgr02, eco-web01, eco-web02
        service_description     Apache:Performance
        check_command           check_apache_performance!60
}
※Make sure that the hosts are defined in hosts.cfg.

  • setup the custom template
    put check_apache_performance.php in /usr/local/pnp4nagios/share/templates/
  • setup the special template
    put sample_apache_access.php in /usr/local/pnp4nagios/share/templates.special/
  • Take a look at http://<your icinga server>/pnp4nagios/special?tpl=sample_apache_access
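
Before the first check runs, it is worth querying mod_status by hand; the ?auto query string returns the machine-readable page that the plugin reads (run it from an address matching the Allow directive above):

# apachectl configtest && apachectl graceful
# curl -s http://127.0.0.1/server-status?auto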

Lastly, I'm going to show you some examples.

Enjoy creating your own templates and saving the time it takes to look through all of the graphs one by one.

Monday, July 16, 2012

Monitoring tool - pnp4nagios custom template

I previously introduced how to set up icinga and icinga-web, and how to set up icinga-web with pnp4nagios, to build a monitoring server with icinga and pnp4nagios. This time I'm going to show you pnp4nagios custom templates, which control the appearance of the RRD graphs.

Why is it necessary to create custom templates?


I believe the reason is that we are sometimes obliged to look at graphs of specific hardware resources or performance data when we analyze or investigate the performance of network devices, servers, or middleware: for example, how much CPU or memory is utilized, how much disk space is left, how much traffic is transferred, and so on.

If you need further information about custom templates for pnp4nagios, please see the official reference.

I'll give you an example of a custom template for traffic graphs, based on the default template ($pnp4nagios_prefix/share/templates.dist/integer.php), using the nagios plugin check_tcptraffic.

  • check_tcptraffic
# for module in \
Carp \
English \
Nagios::Plugin \
Readonly
do cpan -i $module ; done
# wget https://trac.id.ethz.ch/projects/nagios_plugins/downloads/check_tcptraffic-2.2.4.tar.gz
# tar zxf check_tcptraffic-2.2.4.tar.gz
# cd check_tcptraffic-2.2.4
# ./check_tcptraffic -i eth0 -s 100 -w 10 -c 20
TCPTRAFFIC CRITICAL - eth0 182216 bytes/s | TOTAL=182216Byte;10;20 IN=180221Byte;; OUT=1995Byte;; TIME=204852Byte;; 
  • commands.cfg
define command{
        command_name    check_traffic
        command_line    $USER1$/check_tcptraffic -t $ARG1$ -s 1000 -w $ARG2$ -c $ARG3$ -i $ARG4$
        }
  • services.cfg 
define  service{
        use                     generic-service
        host_name               <hostname>
        service_description     TRAFFIC:eth0
        check_command           check_traffic!60!10000000!20000000!eth0
}
  • check_traffic.php (custom template for pnp4nagios)
    ※template_dirs=/usr/local/pnp4nagios/share/templates
<?php
# Datasource name and rrdtool graph options; $hostname, $servicedesc,
# $UNIT and friends are provided to the template by pnp4nagios
$ds_name[1] = "$NAGIOS_AUTH_SERVICEDESC";
$opt[1] = "--vertical-label \"$UNIT[1]\" --title \"$hostname / $servicedesc\" ";
# Define one RRD variable per datasource (TOTAL, IN, OUT from check_tcptraffic)
$def[1]  = rrd::def("var1", $RRDFILE[1], $DS[1], "AVERAGE");
$def[1] .= rrd::def("var2", $RRDFILE[2], $DS[2], "AVERAGE");
$def[1] .= rrd::def("var3", $RRDFILE[3], $DS[3], "AVERAGE");

# Draw horizontal rules at the warning and critical thresholds, if set
if ($WARN[1] != "") {
    $def[1] .= "HRULE:$WARN[1]#FFFF00 ";
}
if ($CRIT[1] != "") {
    $def[1] .= "HRULE:$CRIT[1]#FF0000 ";
}
# Plot each variable with LAST/AVERAGE/MAX values in the legend
$def[1] .= rrd::line1("var1", "#000000", "$NAME[1]") ;
$def[1] .= rrd::gprint("var1", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
$def[1] .= rrd::area("var2", "#00ff00", "$NAME[2]") ;
$def[1] .= rrd::gprint("var2", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
$def[1] .= rrd::line1("var3", "#0000ff", "$NAME[3]") ;
$def[1] .= rrd::gprint("var3", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
?>
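
Note that pnp4nagios selects a template by the check command's name, so the file has to be installed under a name matching check_traffic:

# cp check_traffic.php /usr/local/pnp4nagios/share/templates/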

check_traffic.php generates the graphs below.









Other custom templates are published on my GitHub.
Here is the list of custom templates and sample graphs.

  • check_apache_performance.php




  • check_connections.php


  • check_cpu.php
 
  • check_disk.php 
 
  • check_diskio.php
 
  • check_http.php

  • check_load.php
 
  • check_mem.php
 
  • check_mysql_health.php
 
  • check_nagios_latency_service.php
 
  • check_traffic.php

Sunday, May 20, 2012

HA Monitoring - MySQL replication

It is necessary to build a redundant system with high availability in order to keep services running at all times, or at least to reduce downtime.
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period. Wikipedia High Availability
It is also important to verify that a system with high availability is actually running in the expected condition, such as Master/Slave or Primary/Secondary.
I am going to introduce how to monitor such systems with nagios in the following examples; let's look at MySQL replication first.
  • MySQL Replication
  • PostgreSQL Replication
  • HA Cluster with DRBD & Pacemaker

MySQL Replication

It is important to monitor that the master server's binary log dump thread is running, that the slave server's I/O and SQL threads are running, and the slave lag (seconds behind master). The MySQL official documentation covers the details of the replication implementation, here. I would like to show you how to monitor the status of a slave server (I/O and SQL threads) with a nagios plugin called check_mysql_health, released by ConSol Labs.
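
These items map directly to fields of SHOW SLAVE STATUS, so you can cross-check the plugin by hand on the slave:

# mysql -uroot -p -e "SHOW SLAVE STATUS\G" | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'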

This plug-in, by the way, is very useful because it can check various mysql parameters, such as the number of connections, the query cache hit rate, or the number of slow queries, in addition to the health of mysql replication.
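
For instance, each of the parameters above has its own mode (a sketch; the thresholds here are arbitrary, and the plugin's --help lists all available modes):

# check_mysql_health --hostname localhost --username nagios --password nagios --mode threads-connected --warning 50 --critical 100
# check_mysql_health --hostname localhost --username nagios --password nagios --mode qcache-hitrate --warning 90: --critical 80:
# check_mysql_health --hostname localhost --username nagios --password nagios --mode slow-queries --warning 10 --critical 20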

System Structure


OS                   CentOS-5.8
Kernel               2.6.18-274.el5
DB                   mysql-5.5.24
Scripting Language   perl-5.14.2
Nagios Plugin        check_mysql_health-2.1.5.1
Icinga Core          icinga-1.6.1

Install check_mysql_health

  •  compile & install
# wget http://labs.consol.de/wp-content/uploads/2011/04/check_mysql_health-2.1.5.1.tar.gz
# tar zxf check_mysql_health-2.1.5.1.tar.gz
# cd check_mysql_health-2.1.5.1
# ./configure \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--with-mymodules-dir=/usr/lib64/nagios/plugins
# make
# make install
# cp -p plugins-scripts/check_mysql_health /usr/local/nagios/libexec
  • install cpan modules
# for modules in \
DBI \
DBD::mysql \
Time::HiRes \
IO::File \
File::Copy \
File::Temp \
Data::Dumper \
File::Basename \
Getopt::Long
 do cpan -i $modules
done
  • grant privileges for mysql user
# mysql -uroot -p mysql -e "GRANT SELECT, SUPER,REPLICATION CLIENT ON *.* TO nagios@'localhost' IDENTIFIED BY 'nagios'; FLUSH PRIVILEGES ;" 
# mysql -uroot -p mysql -e "SELECT * FROM user WHERE User = 'nagios'\G;"
*************************** 1. row ***************************
                  Host: localhost
                  User: nagios
              Password: *82802C50A7A5CDFDEA2653A1503FC4B8939C4047
           Select_priv: Y
           Insert_priv: N
           Update_priv: N
           Delete_priv: N
           Create_priv: N
             Drop_priv: N
           Reload_priv: N
         Shutdown_priv: N
          Process_priv: N
             File_priv: N
            Grant_priv: N
       References_priv: N
            Index_priv: N
            Alter_priv: N
          Show_db_priv: N
            Super_priv: Y
 Create_tmp_table_priv: N
      Lock_tables_priv: N
          Execute_priv: N
       Repl_slave_priv: N
      Repl_client_priv: Y
      Create_view_priv: N
        Show_view_priv: N
   Create_routine_priv: N
    Alter_routine_priv: N
      Create_user_priv: N
            Event_priv: N
          Trigger_priv: N
Create_tablespace_priv: N
              ssl_type: 
            ssl_cipher: 
           x509_issuer: 
          x509_subject: 
         max_questions: 0
           max_updates: 0
       max_connections: 0
  max_user_connections: 0
                plugin: 
 authentication_string: NULL
  • fix the qw() parentheses deprecation warnings
    This warning appears with Perl 5.14 and later, where using qw(...) as parentheses in a foreach loop is deprecated. Apply the change below if the warnings appear.
# check_mysql_health --hostname localhost --username root --mode uptime
Use of qw(...) as parentheses is deprecated at check_mysql_health line 1247.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 2596.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 3473.
OK - database is up since 2677 minutes | uptime=160628s
# cp -p check_mysql_health{,.bak}
# vi check_mysql_health
...
# diff -u check_mysql_health.bak check_mysql_health
--- check_mysql_health.bak    2011-07-15 17:46:28.000000000 +0900
+++ check_mysql_health        2011-07-17 14:04:45.000000000 +0900
@@ -1244,7 +1244,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -2593,7 +2593,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -3469,8 +3469,8 @@
   $needs_restart = 1;
   # if the calling script has a path for shared libs and there is no --environment
   # parameter then the called script surely needs the variable too.
-  foreach my $important_env qw(LD_LIBRARY_PATH SHLIB_PATH 
-      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10) {
+  foreach my $important_env (qw(LD_LIBRARY_PATH SHLIB_PATH 
+      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10)) {
     if ($ENV{$important_env} && ! scalar(grep { /^$important_env=/ } 
         keys %{$commandline{environment}})) {
       $commandline{environment}->{$important_env} = $ENV{$important_env};

Verification

I am going to verify the mysql replication status (slave lag, I/O thread, and SQL thread) under the following conditions, assuming mysql replication is already running.
Please see the official documentation for how to set up mysql replication.
  1. Both I/O thread and SQL thread running
  2. I/O thread stopped, SQL thread running
  3. I/O thread running, SQL thread stopped
  • Both I/O thread and SQL thread running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
OK - Slave is 0 seconds behind master | slave_lag=0;5;1
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
# mysql -uroot -p myql -e "STOP SLAVE IO_THREAD;" 
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get slave lag, because io thread is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
CRITICAL - Slave io is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread running, SQL thread stopped
# mysql -uroot -p myql -e "STOP SLAVE SQL_THREAD;"  
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get replication inf
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
CRITICAL - Slave sql is not running
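
To schedule these replication checks from Icinga/Nagios, a minimal command and service pair might look like this (a sketch; paths, host names, and thresholds are illustrative):

define command{
        command_name    check_mysql_slave_lag
        command_line    $USER1$/check_mysql_health --hostname $HOSTADDRESS$ --username nagios --password nagios --mode slave-lag --warning $ARG1$ --critical $ARG2$
}
define service{
        use                     generic-service
        host_name               <slave hostname>
        service_description     MySQL:SlaveLag
        check_command           check_mysql_slave_lag!5!10
}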


Next, let's see how to monitor PostgreSQL streaming replication.

Sunday, May 13, 2012

Monitoring tool - monitor icinga/nagios with monit

As I introduced how to install monit before, I am now explaining how to monitor icinga/nagios with monit. Monit offers several kinds of tests, which are defined here. I adopt PPID testing, which watches the parent process identification number (ppid) of a process for changes, to check the icinga daemon.

The service entry statement configurations are published on my GitHub.

Configuration

  • set up the pidfile in icinga.cfg (nagios.cfg)
    Though the directive is named "lock_file", it actually writes the process id number to the file. The Nagios documentation says, here:
    "This option specifies the location of the lock file that Nagios should create when it runs as a daemon (when started with the -d command line argument). This file contains the process id (PID) number of the running Nagios process."
# grep '^lock_file' icinga.cfg 
lock_file=/var/run/icinga.pid
# /etc/init.d/icinga reload
  • set up the service entry statement for icinga
# cat > /etc/monit.d/icinga.conf << EOF
check process icinga
      with pidfile "/var/run/icinga.pid"
      start program = "/etc/init.d/icinga start"
      stop program = "/etc/init.d/icinga stop"
      if 3 restarts within 3 cycles then alert

EOF
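
The PPID test mentioned at the top of this post is a single additional rule inside the check process block (a sketch; syntax per the monit manual):

      if changed ppid then alert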

Start up

  • begin monitoring
# monit monitor icinga
# monit start icinga
  • see the summary
# monit summary | grep 'icinga'
Process 'icinga'                    Running
  • see the monit log file
# tail -f /var/log/monit/monit.log
[JST May 13 14:35:48] info     : 'icinga' monitor on user request
[JST May 13 14:35:48] info     : monit daemon at 13661 awakened
[JST May 13 14:35:48] info     : Awakened by User defined signal 1
[JST May 13 14:35:48] info     : 'icinga' monitor action done
[JST May 13 14:37:07] error    : monit: invalid argument -- staus  (-h will show valid arguments)
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done

Verification 

  • verify that the icinga daemon is restarted if its process is stopped
# /etc/init.d/icinga status
icinga (pid  31107) is running...
# kill `pgrep icinga`
  • check in the log file that monit restarts icinga
# cat /var/log/monit/monit.log
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done
[JST May 13 14:45:40] error    : 'icinga' process is not running
[JST May 13 14:45:40] info     : 'icinga' trying to restart
[JST May 13 14:45:40] info     : 'icinga' start: /etc/init.d/icinga
  • check icinga is running.
# /etc/init.d/icinga status
icinga (pid  21093) is running...

Configuration examples (ido2db, npcd)

  • setup pidfile of ido2db.cfg (ndo2db)
# grep '^lock_file' ido2db.cfg 
lock_file=/var/run/ido2db.pid
  • setup service entry statement of ido2db
# cat > /etc/monit.d/ido2db.conf << EOF
check process ido2db
      with pidfile "/var/run/ido2db.pid"
      start program = "/etc/init.d/ido2db start"
      stop program = "/etc/init.d/ido2db stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor ido2db
# monit start ido2db
  • setup pidfile of npcd.cfg (pnp4nagios)
# grep '^pid_file' npcd.cfg 
pid_file=/var/run/npcd.pid
  • setup service entry statement of npcd
# cat > /etc/monit.d/npcd.conf << EOF
check process npcd
      with pidfile "/var/run/npcd.pid"
      start program = "/etc/init.d/npcd start"
      stop program = "/etc/init.d/npcd stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor npcd
# monit start npcd


Monitoring tool - install monit

Icinga, nagios, and other monitoring tools can monitor whether a specified daemon or process is running. But while they can watch the icinga or nagios daemon and check that it is running, what happens if the icinga or nagios daemon itself stops?
Monit is capable of monitoring a daemon by checking that a specified process or port is alive, and of restarting the daemon or even stopping it.
"Monit is a free open source utility for managing and monitoring, processes, programs, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations." MONIT Official
I'd like to introduce how to install monit first, and then how to monitor icinga with it.
The configurations are published on my GitHub, here.


Install monit

  •  setup rpmforge repository
# rpm -ivh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
# sed -i 's/enabled = 1/enabled = 0/' /etc/yum.repos.d/rpmforge.repo
  • install monit
# yum -y --enablerepo=rpmforge install monit
  • verify installation
# monit -V
This is Monit version 5.3.2
Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.

Configuration

  •  /etc/monitrc (monit control file)
    Please see the official documentation if you need further information about the monit control file.
    The set alert directive below means that monit sends an alert on any event except the ones listed, from checksum down to timestamp.
# cat > /etc/monitrc << 'EOF'
set daemon 120 with start delay 30
set logfile /var/log/monit/monit.log
## To send e-mail alerts, configure the mail server below
set mailserver localhost
set alert username@domain not {
checksum
content
data
exec
gid
icmp
invalid
fsflags
permission
pid
ppid
size
timestamp
#action
#nonexist
#timeout
}
mail-format {
from: monit@$HOST
subject: Monit Alert -- $SERVICE $EVENT --
message:
Hostname:       $HOST
Service:        $SERVICE
Action:         $ACTION
Date/Time:      $DATE
Info:           $DESCRIPTION
}
set idfile /var/monit/id
set statefile /var/monit/state
set eventqueue
    basedir /var/monit  
    slots 100           
set httpd port 2812 and
    allow localhost 
    allow 192.168.0.0/24
    allow admin:monit      
include /etc/monit.d/*.conf
EOF
  • setup logging 
# mkdir /var/log/monit
# cat > /etc/logrotate.d/monit <<EOF
/var/log/monit/*.log {
  missingok
  notifempty
  rotate 12
  weekly
  compress
  postrotate
    /usr/bin/monit quit  
  endscript
}
EOF 
  • setup include file (service entry statement)
    The following is an example of monitoring ntpd.
# cat > /etc/monit.d/ntpd.conf << EOF
check process ntpd
        with pidfile "/var/run/ntpd.pid"
        start program = "/etc/init.d/ntpd start"
        stop program = "/etc/init.d/ntpd stop"
        if 3 restarts within 3 cycles then alert

EOF
  •  verify syntax
# monit -t
Control file syntax OK

Start up

  • run monit from init
    Monit can be run from an init script, but I want to be certain that a running monit daemon is always present on the system, so I respawn it from init.
# cat >> /etc/inittab <<EOF
mo:2345:respawn:/usr/bin/monit -Ic /etc/monitrc
EOF
  • re-examine /etc/inittab 
# telinit q
# tail -f /var/log/messages
May 13 12:34:35 ha-mgr02 init: Re-reading inittab
  • check monit running
# ps awuxc | grep 'monit'
root      1431  0.0  0.0  57432  1876 ?        Ssl  11:38   0:00 monit 
  • stop monit process and check that init begins monit
# kill `pgrep monit` ; ps cawux | grep 'monit'
root     13661  0.0  0.0  57432  1780 ?        Ssl  13:31   0:00 monit

  • show status and summary
# monit status
Process 'ntpd'
  status                            Running
  monitoring status                 Monitored
  pid                               32307
  parent pid                        1
  uptime                            12d 17h 44m 
  children                          0
  memory kilobytes                  5040
  memory kilobytes total            5040
  memory percent                    0.2%
  memory percent total              0.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Sun, 13 May 2012 12:34:35

System 'system_ha-mgr02.forschooner.net'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.09] [0.20] [0.14]
  cpu                               1.6%us 3.2%sy 0.3%wa
  memory usage                      672540 kB [32.6%]
  swap usage                        120 kB [0.0%]
  data collected                    Sun, 13 May 2012 12:32:35
  • show summary 
# monit summary
The Monit daemon 5.3.2 uptime: 58m 

Process 'sshd'                      Running
Process 'ntpd'                      Running
System 'system_ha-mgr02.forschooner.net' Running

Start up from upstart

As RHEL 6.x and CentOS 6.x adopt upstart, it is necessary to use upstart instead of init on those OSes.
  • setup /etc/init/monit.conf
# monit_bin=$(which monit)
# cat > /etc/init/monit.conf << EOF
# monit respawn
description     "Monit"

start on runlevel [2345]
stop on runlevel [!2345]
 
respawn
exec $monit_bin -Ic /etc/monitrc
EOF 
  • show a list of the known jobs and instances
# initctl list
 rc stop/waiting
 tty (/dev/tty3) start/running, process 1249
 ...
 monit stop/waiting
 serial (hvc0) start/running, process 1239
 rcS-sulogin stop/waiting
  • begin monit
# initctl start monit
 monit start/running, process 6873
  • see the status of the job(monit)
 # initctl status monit
 monit start/running, process 6873
  • stop monit process
# kill `pgrep monit`
  • check that upstart begins monit
# ps cawux | grep monit
 root      7140  0.0  0.1   7004  1840 ?        Ss   21:42   0:00 monit
  • see the log file that monit is respawning
# tail -1 /var/log/messages
 Oct 20 12:42:41 ip-10-171-47-212 init: monit main process ended, respawning

Verification

  • access the monit service manager (http://<IP address>:2812)

  • check that the ntp daemon is started again if it stops
# /etc/init.d/ntpd status
ntpd (pid  32307) is running...
# /etc/init.d/ntpd stop  
Shutting down ntpd:                                        [  OK  ]
  • check in the log file that monit starts ntpd
# cat /var/log/monit/monit.log
[JST May 13 12:52:24] error    : 'ntpd' process is not running
[JST May 13 12:52:24] info     : 'ntpd' trying to restart
[JST May 13 12:52:24] info     : 'ntpd' start: /etc/init.d/ntpd
  • check ntpd is running
# /etc/init.d/ntpd status
ntpd (pid  9475) is running...

Mail sample format

The following are examples of the alert mails that monit sends.
  • notifying that the daemon is stopped
<Subject>
Monit Alert -- ntpd Does not exist --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         restart
Date/Time:      Sun, 13 May 2012 12:52:24
Info:           process is not running 
  • notifying that the daemon starts
<Subject>
Monit Alert -- ntpd Action done --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           start action done 
  • notifying that the daemon is running again
<Subject>
Monit Alert -- ntpd Exists --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           process is running with pid 9475









Friday, May 4, 2012

Key Value Store - monitor cassandra and multinode cluster

Having installed cassandra and created a multinode cluster, I'm going to introduce how to monitor a cassandra node and the multinode cluster with my own nagios plugins.

Monitor cassandra node(check_by_ssh+cassandra-cli)

There are several ways to monitor a cassandra node with Nagios or Icinga, such as JMX or check_jmx. Though they are fairly effective ways to monitor cassandra, they take some time to prepare. Using check_by_ssh and cassandra-cli is simpler than those, and there is no need to install any libraries except for cassandra itself.
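
Note that check_by_ssh relies on passwordless public-key authentication for the nagios user; a minimal setup sketch (the key path matches the command definition below):

# su - nagios
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
$ ssh-copy-id -i ~/.ssh/id_rsa.pub nagios@192.168.213.91
$ ssh -i ~/.ssh/id_rsa nagios@192.168.213.91 'echo ok'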
  • commands.cfg
define command{
        command_name    check_by_ssh
        command_line    $USER1$/check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H $HOSTADDRESS$ -p $ARG1$ -t $ARG2$ -C '$ARG3$'
}

  • services.cfg
define service{
      use                     generic-service
      host_name               cassandra
      service_description     Cassandra Node
      check_command           check_by_ssh!22!60!"/usr/local/apache-cassandra/bin/cassandra-cli -h localhost --jmxport 9160 -f /tmp/cassandra_load.txt"
}
  • set up the statement file
    Set up the statement file on the cassandra node to be monitored.
    "show cluster name;" prints the cluster name.
# cat > /tmp/cassandra_load.txt << EOF
show cluster name;
EOF
  • plugin status when cassandra is running (service status is OK)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Connected to: "Test Cluster" on 192.168.213.91/9160
Test Cluster
  • plugin status when cassandra is stopped (service status is CRITICAL)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Remote command execution failed: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

Monitor multinode cluster(check_cassandra_cluster.sh)

The plugin has been released on Nagios Exchange; please see the details there.
  • overview
    checks whether the number of live nodes in the multinode cluster is less than a specified number.
    the thresholds can be specified with the options -w <warning> and -c <critical>.
    reports the number of live nodes, their status, and performance data.
  • software requirements
    cassandra(using nodetool command)
  • command help
# check_cassandra_cluster.sh -h
Usage: ./check_cassandra_cluster.sh -H <host> -P <port> -w <warning> -c <critical>

 -H <host> IP address or hostname of the cassandra node to connect, localhost by default.
 -P <port> JMX port, 7199 by default.
 -w <warning> alert warning state, if the number of live nodes is less than <warning>.
 -c <critical> alert critical state, if the number of live nodes is less than <critical>.
 -h show command option
 -V show command version 
  •  when service status is OK
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 1 -c 0
OK - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when service status is WARNING
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 2 -c 0
WARNING - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05% 
  •  when status is CRITICAL
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 2
CRITICAL - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when the warning threshold is less than the critical threshold (rejected as invalid)
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 4
-w <warning> 3 must be less than -c <critical> 4.
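
To schedule the cluster check from Icinga/Nagios, a command and service definition might look like the following (a sketch; the port and thresholds are illustrative):

define command{
        command_name    check_cassandra_cluster
        command_line    $USER1$/check_cassandra_cluster.sh -H $HOSTADDRESS$ -P $ARG1$ -w $ARG2$ -c $ARG3$
}
define service{
        use                     generic-service
        host_name               cassandra
        service_description     Cassandra Cluster
        check_command           check_cassandra_cluster!7199!2!1
}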

Monitoring tool - init script for icinga, ido2db(idoutils), and npcd(pnp4nagios)

With icinga, icinga-web, and pnp4nagios installed, it is necessary to set up init scripts to start and stop the daemons. Of course, each of the source distributions includes one, but I prefer the typical format used in RPM packages to the ones in the source, so I modified the init scripts based on the RPM packages.

I am going to introduce each of the init scripts and verify how they work.
They are published on my GitHub.
  • daemon and init script
Icinga (based on the Nagios RPM package)       /etc/init.d/icinga
IDOUtils (based on the NDOUtils RPM package)   /etc/init.d/ido2db
PNP4Nagios (loosely based on the Nagios RPM package)   /etc/init.d/npcd

icinga

  • create the init script based on the nagios RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge nagios
# mkdir work
# cd work
# rpm2cpio ../nagios-3.2.3-3.el5.rf.x86_64.rpm | cpio -id ./etc/rc.d/init.d/nagios
# cp etc/rc.d/init.d/nagios ./icinga
# cp icinga{,_diff}
...
# diff -c icinga icinga_diff > icinga.patch
# patch -p0 < icinga.patch
# cp icinga /etc/init.d/icinga
  • start daemon
# /etc/init.d/icinga start
Starting icinga:                                           [  OK  ]
  • stop daemon
# /etc/init.d/icinga stop
Stopping icinga:                                           [  OK  ]
  • restart daemon
# /etc/init.d/icinga restart
Stopping icinga:                                           [  OK  ]
Starting icinga:                                           [  OK  ]
  • condrestart daemon
# /etc/init.d/icinga condrestart
Stopping icinga:                                           [  OK  ]
Starting icinga:                                           [  OK  ]
  • reload daemon
# /etc/init.d/icinga reload
icinga (pid  17359) is running...
Reloading icinga:                                          [  OK  ]
  • check if daemon is running
# /etc/init.d/icinga status
icinga (pid  17359) is running...
  • difference between nagios (RPM package) and icinga
# diff -u nagios icinga_diff
--- nagios     2012-05-01 23:34:15.000000000 +0900
+++ icinga_diff        2012-05-03 20:52:17.000000000 +0900
@@ -1,36 +1,38 @@
 #!/bin/sh
 # $Id$
-# Nagios      Startup script for the Nagios monitoring daemon
+# Icinga      Startup script for the Nagios monitoring daemon
 #
 # chkconfig:  - 85 15
-# description:        Nagios is a service monitoring system
-# processname: nagios
-# config: /etc/nagios/nagios.cfg
-# pidfile: /var/nagios/nagios.pid
+# description:        Icinga is a service monitoring system
+# processname: icinga
+# config: /usr/local/icinga/etc/icinga.cfg
+# pidfile: /var/run/icinga.pid
 #
 ### BEGIN INIT INFO
-# Provides:           nagios
+# Provides:           icinga
 # Required-Start:     $local_fs $syslog $network
 # Required-Stop:      $local_fs $syslog $network
-# Short-Description:    start and stop Nagios monitoring server
-# Description:                Nagios is is a service monitoring system
+# Short-Description:    start and stop Icinga monitoring server
+# Description:                Icinga is is a service monitoring system
 ### END INIT INFO

 # Source function library.
 . /etc/rc.d/init.d/functions

-prefix="/usr"
-exec_prefix="/usr"
-exec="/usr/bin/nagios"
-prog="nagios"
-config="/etc/nagios/nagios.cfg"
-pidfile="/var/nagios/nagios.pid"
-user="nagios"
+user="icinga"
+prog="icinga"
+prefix="/usr/local/$prog"
+exec_prefix="${prefix}"
+exec="${prefix}/bin/$prog"
+config="${prefix}/etc/$prog.cfg"
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="$piddir/$prog.pid"
+lockfile="${lockdir}/$prog"

+[ -d "$piddir" ] || mkdir -p piddir && chown $prog:$prog $piddir
 [ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

-lockfile=/var/lock/subsys/$prog
-
 start() {
     [ -x $exec ] || exit 5
     [ -f $config ] || exit 6
@@ -47,7 +49,7 @@
     killproc -d 10 $exec
     retval=$?
     echo
-    [ $retval -eq 0 ] && rm -f $lockfile
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
     return $retval
 }

@@ -60,7 +62,7 @@
 reload() {
     echo -n $"Reloading $prog: "
     killproc $exec -HUP
-    RETVAL=$?
+    retval=$?
     echo
 }

@@ -70,8 +72,8 @@

 check_config() {
         $nice runuser -s /bin/bash - $user -c "$corelimit >/dev/null 2>&1 ; $exec -v $config > /dev/null 2>&1"
-        RETVAL=$?
-        if [ $RETVAL -ne 0 ] ; then
+        retval=$?
+        if [ $retval -ne 0 ] ; then
                 echo -n $"Configuration validation failed"
                 failure
                 echo
  • about the pidfile and the lockfile path
    icinga.cfg (like nagios.cfg) defines the lock_file as the pidfile.
    I'm not sure why they are defined that way, but I think they should be separated.
    I defined the paths of the pidfile and the lockfile in the init script and icinga.cfg.
# grep '^lock_file' icinga.cfg
lock_file=/var/run/icinga.pid
# egrep '^(pid|lock)' /etc/init.d/icinga 
piddir="/var/run"
lockdir="/var/lock/subsys"
pidfile="$piddir/$prog.pid"
lockfile="${lockdir}/$prog"

ido2db

  • create the init script for ido2db based on the NDOUtils RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge ndoutils
# mkdir work
# cd work
# rpm2cpio ../ndoutils-1.4-0.beta7.3.el5.rf.x86_64.rpm | cpio -id ./etc/init.d/ndoutils
# cp etc/init.d/ndoutils ./ido2db
# cp ido2db{,_diff}
# vi ido2db_diff
...
# diff -c ido2db ido2db_diff > ido2db.patch
# patch -p0 < ido2db.patch
# cp ido2db /etc/init.d/ido2db
  • start daemon
# /etc/init.d/ido2db start
Starting ido2db:                                           [  OK  ]
  • stop daemon
# /etc/init.d/ido2db stop
Stopping ido2db:                                           [  OK  ]
  • restart daemon
# /etc/init.d/ido2db restart
Stopping ido2db:                                           [  OK  ]
Starting ido2db:                                           [  OK  ]
  • condrestart daemon
# /etc/init.d/ido2db condrestart
Stopping ido2db:                                           [  OK  ]
Starting ido2db:                                           [  OK  ]
  • difference between ndoutils (RPM package) and ido2db
# diff -u ido2db ido2db_diff

@@ -1,37 +1,42 @@
 #!/bin/sh
-# Startup script for ndo-daemon
+# Startup script for ido2db-daemon
 #
 # chkconfig: 2345 95 05
-# description: Nagios Database Objects daemon
+# description: Icinga Database Objects daemon

 # Source function library.
 . /etc/rc.d/init.d/functions

-
-BINARY=ndo2db-3x
-DAEMON=/usr/sbin/$BINARY
-CONFIG=/etc/nagios/ndo2db.cfg
-
-[ -f $DAEMON ] || exit 0
-
-prog="ndo2db"
+prog=ido2db
+user=icinga
+prefix=/usr/local/icinga
+exec=$prefix/bin/$prog
+config=$prefix/etc/ido2db.cfg
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="$piddir/$prog.pid"
+lockfile="${lockdir}/$prog"

 start() {
+    [ -x $exec ] || exit 5
+    [ -f $config ] || exit 6
     echo -n $"Starting $prog: "
-    daemon --user nagios $DAEMON -c $CONFIG
-    RETVAL=$?
+    daemon --user $user $exec -c $config
+    retval=$?
+    [ $retval -eq 0 ] && touch $lockfile
     echo
-    return $RETVAL
+    return $retval
 }

 stop() {
-    if test "x`pidof $BINARY`" != x; then
+    if test "x`pidof $prog`" != x; then
         echo -n $"Stopping $prog: "
-        killproc ndo2db-3x
+        killproc $prog
         echo
     fi
-    RETVAL=$?
-    return $RETVAL
+    retval=$?
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
+    return $retval
 }

 case "$1" in
@@ -44,14 +49,14 @@
             ;;

         status)
-            status $BINARY
+            status $prog
             ;;
         restart)
             stop
             start
             ;;
         condrestart)
-            if test "x`pidof $BINARY`" != x; then
+            if test "x`pidof $prog`" != x; then
                 stop
                 start
             fi
@@ -63,5 +68,5 @@

 esac

-exit $RETVAL
+exit $retval
  • about the pidfile and the lockfile path
    ido2db.cfg (like ndo2db.cfg) defines the lock_file as the pidfile.
    I'm not sure why they are defined that way, but I think they should be separated.
    I defined the paths of the pidfile and the lockfile in the init script and ido2db.cfg.
# grep '^lock_file' ido2db.cfg
lock_file=/var/run/ido2db.pid
# egrep '^(pid|lock)' /etc/init.d/ido2db
piddir="/var/run"
lockdir="/var/lock/subsys"
pidfile="$piddir/$prog.pid"
lockfile="${lockdir}/$prog"


npcd

  • create the init script for npcd based on the nagios RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge nagios
# mkdir work
# cd work
# rpm2cpio ../nagios-3.2.3-3.el5.rf.x86_64.rpm | cpio -id ./etc/rc.d/init.d/nagios
# cp etc/rc.d/init.d/nagios ./npcd
# cp npcd{,_diff}
...
# diff -c npcd npcd_diff > npcd.patch
# patch -p0 < npcd.patch
# cp npcd /etc/init.d/npcd
  • start daemon
# /etc/init.d/npcd start
npcd is stopped
Starting npcd:                                             [  OK  ]
  • stop daemon
# /etc/init.d/npcd stop
npcd (pid  14128) is running...
Stopping npcd:                                             [  OK  ]
  • restart daemon
# /etc/init.d/npcd restart
Starting npcd:                                             [  OK  ]
Starting npcd:                                             [  OK  ]
  • condrestart daemon
# /etc/init.d/npcd condrestart
npcd (pid  14216) is running...
Stopping npcd:                                             [  OK  ]
Starting npcd:                                             [  OK  ]
  • reload daemon
# /etc/init.d/npcd reload
npcd (pid  14233) is running...
Reloading npcd:                                            [  OK  ]
  • check if daemon is running
# /etc/init.d/npcd status
 npcd (pid 14233) is running...
  • difference between nagios (RPM package) and npcd
# diff -u npcd npcd_diff
--- npcd       2012-05-04 10:47:11.000000000 +0900
+++ npcd_diff  2012-05-03 22:45:28.000000000 +0900
@@ -1,41 +1,40 @@
 #!/bin/sh
-# $Id$
-# Nagios      Startup script for the Nagios monitoring daemon
-#
-# chkconfig:  - 85 15
-# description:        Nagios is a service monitoring system
-# processname: nagios
-# config: /etc/nagios/nagios.cfg
-# pidfile: /var/nagios/nagios.pid
 #
 ### BEGIN INIT INFO
-# Provides:           nagios
-# Required-Start:     $local_fs $syslog $network
-# Required-Stop:      $local_fs $syslog $network
-# Short-Description:    start and stop Nagios monitoring server
-# Description:                Nagios is is a service monitoring system
+# Short-Description: pnp4nagios NPCD Daemon Version 0.6.16
+# Description: Nagios Performance Data C Daemon
+# chkconfig: 345 99 01
+# processname: npcd
+# config: /usr/local/pnp4nagios/etc/npcd.cfg
+# pidfile: /var/run/npcd.pid
+# Provides:          npcd
+# Required-Start:
+# Required-Stop:
+# Default-Start:     2 3 4 5
+# Default-Stop:      0 1 6
 ### END INIT INFO

 # Source function library.
 . /etc/rc.d/init.d/functions

-prefix="/usr"
-exec_prefix="/usr"
-exec="/usr/bin/nagios"
-prog="nagios"
-config="/etc/nagios/nagios.cfg"
-pidfile="/var/nagios/nagios.pid"
-user="nagios"
+user="icinga"
+prog="npcd"
+prefix="/usr/local/pnp4nagios"
+exec_prefix="${prefix}"
+exec="${prefix}/bin/$prog"
+config="${prefix}/etc/$prog.cfg"
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="/var/run/$prog.pid"
+lockfile="${lockdir}/$prog"

 [ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

-lockfile=/var/lock/subsys/$prog
-
 start() {
     [ -x $exec ] || exit 5
     [ -f $config ] || exit 6
     echo -n $"Starting $prog: "
-    daemon --user=$user $exec -d $config
+    daemon --user=$user $exec -d -f $config
     retval=$?
     echo
     [ $retval -eq 0 ] && touch $lockfile
@@ -47,7 +46,7 @@
     killproc -d 10 $exec
     retval=$?
     echo
-    [ $retval -eq 0 ] && rm -f $lockfile
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
     return $retval
 }

@@ -60,31 +59,14 @@
 reload() {
     echo -n $"Reloading $prog: "
     killproc $exec -HUP
-    RETVAL=$?
+    retval=$?
     echo
 }

-force_reload() {
-    restart
-}
-
-check_config() {
-        $nice runuser -s /bin/bash - $user -c "$corelimit >/dev/null 2>&1 ; $exec -v $config > /dev/null 2>&1"
-        RETVAL=$?
-        if [ $RETVAL -ne 0 ] ; then
-                echo -n $"Configuration validation failed"
-                failure
-                echo
-                exit 1
-
-        fi
-}
-

 case "$1" in
     start)
         status $prog && exit 0
-      check_config
         $1
         ;;
     stop)
@@ -92,33 +74,21 @@
         $1
         ;;
     restart)
-      check_config
         $1
         ;;
     reload)
         status $prog || exit 7
-      check_config
         $1
         ;;
-    force-reload)
-      check_config
-        force_reload
-        ;;
     status)
         status $prog
         ;;
-    condrestart|try-restart)
+    condrestart)
         status $prog|| exit 0
-      check_config
         restart
         ;;
-    configtest)
-        echo -n  $"Checking config for $prog: "
-        check_config && success
-        echo
-      ;;
     *)
-        echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload|configtest}"
+        echo $"Usage: $0 {start|stop|status|restart|condrestart|reload}"
         exit 2
 esac
 exit $?
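
Once the three init scripts are in place, register them with chkconfig so the daemons start at boot (standard RHEL/CentOS procedure; the runlevels come from the chkconfig header in each script):

# for svc in icinga ido2db npcd ; do chkconfig --add $svc ; chkconfig $svc on ; done
# chkconfig --list | egrep 'icinga|ido2db|npcd'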


I will list the other configurations for icinga, idoutils, and pnp4nagios next time.
