
Sunday, May 20, 2012

HA Monitoring - MySQL replication

It is often necessary to build a redundant, highly available system in order to keep a service running at all times, or at least to reduce downtime.
"High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period." (Wikipedia, High Availability)
It is also important to verify that such a highly available system is actually running in its intended state, such as Master/Slave or Primary/Secondary.
In the following examples I am going to show how to monitor such systems with nagios, starting with MySQL replication.
  • MySQL Replication
  • PostgreSQL Replication
  • HA Cluster with DRBD & Pacemaker

MySQL Replication

It is important to monitor whether the master server's binlog dump thread is running, whether the slave server's I/O and SQL threads are running, and the slave lag (seconds behind master). The official MySQL documentation explains the details of the replication implementation here. I would like to show how to monitor the status of the slave server (I/O and SQL threads) with a nagios plug-in called check_mysql_health, released by ConSol Labs.

This plug-in, by the way, is very useful because it can check various MySQL parameters beyond replication health, such as the number of connections, the query cache hit rate, or the number of slow queries.
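For example, a couple of the other modes look like this. These invocations are my own illustrations: the thresholds are arbitrary values, and they assume the nagios MySQL user created in the grant step below.
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 70 --critical 90 --mode threads-connected
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 0.1 --critical 1 --mode slow-queries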

System Structure


OS CentOS-5.8
Kernel 2.6.18-274.el5
DB mysql-5.5.24
Scripting Language perl-5.14.2
Nagios Plugin check_mysql_health-2.1.5.1
Icinga Core icinga-1.6.1

Install check_mysql_health

  •  compile & install
# wget http://labs.consol.de/wp-content/uploads/2011/04/check_mysql_health-2.1.5.1.tar.gz
# tar zxf check_mysql_health-2.1.5.1.tar.gz
# cd check_mysql_health-2.1.5.1
# ./configure \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--with-mymodules-dir=/usr/lib64/nagios/plugins
# make
# make install
# cp -p plugins-scripts/check_mysql_health /usr/local/nagios/libexec
  • install cpan modules
# for modules in \
DBI \
DBD::mysql \
Time::HiRes \
IO::File \
File::Copy \
File::Temp \
Data::Dumper \
File::Basename \
Getopt::Long
do cpan -i $modules
done
  • grant privileges for mysql user
# mysql -uroot -p mysql -e "GRANT SELECT, SUPER, REPLICATION CLIENT ON *.* TO nagios@'localhost' IDENTIFIED BY 'nagios'; FLUSH PRIVILEGES;"
# mysql -uroot -p mysql -e "SELECT * FROM user WHERE User = 'nagios'\G"
*************************** 1. row ***************************
                  Host: localhost
                  User: nagios
              Password: *82802C50A7A5CDFDEA2653A1503FC4B8939C4047
           Select_priv: Y
           Insert_priv: N
           Update_priv: N
           Delete_priv: N
           Create_priv: N
             Drop_priv: N
           Reload_priv: N
         Shutdown_priv: N
          Process_priv: N
             File_priv: N
            Grant_priv: N
       References_priv: N
            Index_priv: N
            Alter_priv: N
          Show_db_priv: N
            Super_priv: Y
 Create_tmp_table_priv: N
      Lock_tables_priv: N
          Execute_priv: N
       Repl_slave_priv: N
      Repl_client_priv: Y
      Create_view_priv: N
        Show_view_priv: N
   Create_routine_priv: N
    Alter_routine_priv: N
      Create_user_priv: N
            Event_priv: N
          Trigger_priv: N
Create_tablespace_priv: N
              ssl_type: 
            ssl_cipher: 
           x509_issuer: 
          x509_subject: 
         max_questions: 0
           max_updates: 0
       max_connections: 0
  max_user_connections: 0
                plugin: 
 authentication_string: NULL
  • fix the "parentheses deprecated" warning
    Apply the change below only if the "Use of qw(...) as parentheses is deprecated" warning appears (it is emitted by newer Perl versions such as the 5.14 used here).
# check_mysql_health --hostname localhost --username root --mode uptime
Use of qw(...) as parentheses is deprecated at check_mysql_health line 1247.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 2596.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 3473.
OK - database is up since 2677 minutes | uptime=160628s
# cp -p check_mysql_health{,.bak}
# vi check_mysql_health
...
# diff -u check_mysql_health.bak check_mysql_health
--- check_mysql_health.bak    2011-07-15 17:46:28.000000000 +0900
+++ check_mysql_health        2011-07-17 14:04:45.000000000 +0900
@@ -1244,7 +1244,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -2593,7 +2593,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -3469,8 +3469,8 @@
   $needs_restart = 1;
   # if the calling script has a path for shared libs and there is no --environment
   # parameter then the called script surely needs the variable too.
-  foreach my $important_env qw(LD_LIBRARY_PATH SHLIB_PATH 
-      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10) {
+  foreach my $important_env (qw(LD_LIBRARY_PATH SHLIB_PATH 
+      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10)) {
     if ($ENV{$important_env} && ! scalar(grep { /^$important_env=/ } 
         keys %{$commandline{environment}})) {
       $commandline{environment}->{$important_env} = $ENV{$important_env};

Verification

I am going to verify the MySQL replication status (slave lag, I/O thread, and SQL thread) under the following conditions, assuming that MySQL replication is already running; a sample Nagios/Icinga configuration for these checks is sketched after the verification steps below.
Please see the official documentation for how to set up MySQL replication.
  1. Both I/O thread and SQL thread running
  2. I/O thread stopped, SQL thread running
  3. I/O thread running, SQL thread stopped
  • Both I/O thread and SQL thread running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
OK - Slave is 0 seconds behind master | slave_lag=0;5;1
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread stopped, SQL thread running
# mysql -uroot -p mysql -e "STOP SLAVE IO_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get slave lag, because io thread is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
CRITICAL - Slave io is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread running, SQL thread stopped
# mysql -uroot -p mysql -e "STOP SLAVE SQL_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get replication info
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
CRITICAL - Slave sql is not running
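
For reference, here is a minimal sketch of how these three checks could be wired into Nagios/Icinga object configuration. The command and service names, the host_name, and the thresholds are illustrative assumptions of mine, not part of the original setup; the plugin path is whatever $USER1$ points to (the plugin was copied to /usr/local/nagios/libexec above).
  • commands.cfg (illustrative)
define command{
        command_name    check_mysql_health
        command_line    $USER1$/check_mysql_health --hostname $HOSTADDRESS$ --username nagios --password nagios --warning $ARG1$ --critical $ARG2$ --mode $ARG3$
}
  • services.cfg (illustrative)
define service{
        use                     generic-service
        host_name               mysql_slave
        service_description     MySQL:SlaveLag
        check_command           check_mysql_health!5!10!slave-lag
}
define service{
        use                     generic-service
        host_name               mysql_slave
        service_description     MySQL:SlaveIORunning
        check_command           check_mysql_health!5!10!slave-io-running
}
define service{
        use                     generic-service
        host_name               mysql_slave
        service_description     MySQL:SlaveSQLRunning
        check_command           check_mysql_health!5!10!slave-sql-running
}
Note that this assumes the check runs on the monitoring host and connects to the slave directly; since the GRANT above is for nagios@'localhost' only, you would either extend the grant to the monitoring host or run the plugin on the slave via check_by_ssh, as in the check_by_ssh examples elsewhere in these posts.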


Next, let's see how to monitor PostgreSQL streaming replication.

Friday, May 4, 2012

Key Value Store - monitor cassandra and multinode cluster

Having installed cassandra and created a multinode cluster, I am going to introduce how to monitor a cassandra node and the multinode cluster with my own nagios plugin.

Monitor cassandra node(check_by_ssh+cassandra-cli)

There are several ways to monitor a cassandra node with Nagios or Icinga, such as JMX or check_jmx. Although they are fairly effective, they take some time to prepare. Using check_by_ssh together with cassandra-cli is simpler and requires no additional libraries beyond cassandra itself.
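This approach does assume that the nagios user on the monitoring host can log in to the cassandra node over SSH without a password. A minimal sketch of that prerequisite, run as the nagios user (the key path and node address simply mirror the examples below; adapt them to your hosts):
$ ssh-keygen -t rsa -f /home/nagios/.ssh/id_rsa -N ""
$ ssh-copy-id -i /home/nagios/.ssh/id_rsa.pub nagios@192.168.213.91
$ ssh -i /home/nagios/.ssh/id_rsa nagios@192.168.213.91 hostname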
  • commands.cfg
define command{
        command_name    check_by_ssh
        command_line    $USER1$/check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H $HOSTADDRESS$ -p $ARG1$ -t $ARG2$ -C '$ARG3$'
}

  • services.cfg
define service{
      use                     generic-service
      host_name               cassandra
      service_description     Cassandra Node
      check_command           check_by_ssh!22!60!"/usr/local/apache-cassandra/bin/cassandra-cli -h localhost --jmxport 9160 -f /tmp/cassandra_load.txt"
}
  • setup the statement file
    Set up the statement file on the cassandra node to be monitored.
    "show cluster name;" prints the cluster name.
# cat > /tmp/cassandra_load.txt << EOF
show cluster name;
EOF
  • plugin status when cassandra is running(service status is OK)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Connected to: "Test Cluster" on 192.168.213.91/9160
Test Cluster
  • plugin status when cassandra is stopped(service status is CRITICAL)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Remote command execution failed: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

Monitor multinode cluster(check_cassandra_cluster.sh)

The plugin has been released on Nagios Exchange; please see the details there.
  • overview
    checks whether the number of live nodes in the multinode cluster is less than a specified number.
    the thresholds can be specified with the options -w <warning> and -c <critical>.
    reports the number of live nodes, their status, and performance data. A sample Nagios service definition is sketched after the examples below.
  • software requirements
    cassandra(using nodetool command)
  • command help
# check_cassandra_cluster.sh -h
Usage: ./check_cassandra_cluster.sh -H <host> -P <port> -w <warning> -c <critical>

 -H <host> IP address or hostname of the cassandra node to connect, localhost by default.
 -P <port> JMX port, 7199 by default.
 -w <warning> alert warning state, if the number of live nodes is less than <warning>.
 -c <critical> alert critical state, if the number of live nodes is less than <critical>.
 -h show command option
 -V show command version 
  •  when service status is OK
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 1 -c 0
OK - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when service status is WARNING
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 2 -c 0
WARNING - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05% 
  •  when status is CRITICAL
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 2
CRITICAL - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when the warning threshold is less than the critical threshold (an invalid combination)
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 4
-w <warning> 3 must be less than -c <critical> 4.
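
As mentioned in the overview, here is a minimal sketch of wiring this plugin into Nagios. The command and service names and the host_name are illustrative assumptions, the thresholds mirror the OK example above, and the sketch assumes the script has been placed in the nagios plugin directory that $USER1$ points to.
define command{
        command_name    check_cassandra_cluster
        command_line    $USER1$/check_cassandra_cluster.sh -H $HOSTADDRESS$ -P $ARG1$ -w $ARG2$ -c $ARG3$
}
define service{
        use                     generic-service
        host_name               cassandra
        service_description     Cassandra Cluster
        check_command           check_cassandra_cluster!7199!1!0
}
Since the plugin relies on nodetool, it has to run on a host where cassandra is installed; alternatively it can be invoked remotely via check_by_ssh, as in the node check above.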

Key Value Store - create cassandra multinode cluster

Following the earlier post on installing cassandra, I am going to explain how to create a cassandra multinode cluster.

Create Multinode cluster

  • cassandra nodes (diagram: _images/cassandra_node.png)
  • Configuring Multinode Cluster 1st node (kvs01)
As cassandra.yaml is configured for a single node by default, it is necessary to change the configuration to create a multinode cluster.
# cd /usr/local/apache-cassandra/conf/
# vi cassandra.yaml
auto_bootstrap : false
- seeds: "192.168.213.91"
listen_address: 192.168.213.91
rpc_address: 192.168.213.91
  • difference between the original cassandra.yaml and the revised one
# diff -u cassandra.yaml.bak cassandra.yaml
--- cassandra.yaml.bak 2012-02-22 23:21:44.000000000 +0900
+++ cassandra.yaml     2012-05-04 07:51:31.000000000 +0900
@@ -8,6 +8,7 @@
 # The name of the cluster. This is mainly used to prevent machines in
 # one logical cluster from joining another.
 cluster_name: 'Test Cluster'
+auto_bootstrap : false

 # You should always specify InitialToken when setting up a production
 # cluster for the first time, and often when adding capacity later.
@@ -95,7 +96,7 @@
       parameters:
           # seeds is actually a comma-delimited list of addresses.
           # Ex: "<ip1>,<ip2>,<ip3>"
-          - seeds: "127.0.0.1"
+          - seeds: "192.168.213.91"

 # emergency pressure valve: each time heap usage after a full (CMS)
 # garbage collection is above this fraction of the max, Cassandra will
@@ -178,7 +179,7 @@
 # address associated with the hostname (it might not be).
 #
 # Setting this to 0.0.0.0 is always wrong.
-listen_address: localhost
+listen_address: 192.168.213.91

 # Address to broadcast to other Cassandra nodes
 # Leaving this blank will set it to the same value as listen_address
@@ -190,7 +191,7 @@
 #
 # Leaving this blank has the same effect it does for ListenAddress,
 # (i.e. it will be based on the configured hostname of the node).
-rpc_address: localhost
+rpc_address: 192.168.213.91
 # port for Thrift to listen for clients on
 rpc_port: 9160
  • restart daemon
# pgrep -f cassandra | xargs kill -9
# /usr/local/apache-cassandra/bin/cassandra
  • Configuring Multinode Cluster other node (kvs02,kvs03)
listen_address and rpc_address are replaced with each server's own address.
There is no need to set auto_bootstrap explicitly, as it is enabled by default in cassandra 1.x.
# cd /usr/local/apache-cassandra/conf/
# vi cassandra.yaml
- seeds: "192.168.213.91"
listen_address: 192.168.213.92
rpc_address: 192.168.213.92
  • difference between the original cassandra.yaml and the revised one
# diff -u cassandra.yaml.bak cassandra.yaml
--- cassandra.yaml.bak 2012-03-23 04:00:43.000000000 +0900
+++ cassandra.yaml     2012-05-04 08:44:14.000000000 +0900
@@ -8,6 +8,7 @@
 # The name of the cluster. This is mainly used to prevent machines in
 # one logical cluster from joining another.
 cluster_name: 'Test Cluster'
+auto_bootstrap: true

 # You should always specify InitialToken when setting up a production
 # cluster for the first time, and often when adding capacity later.
@@ -95,7 +96,7 @@
       parameters:
           # seeds is actually a comma-delimited list of addresses.
           # Ex: "<ip1>,<ip2>,<ip3>"
-          - seeds: "localhost"
+          - seeds: "192.168.213.91"

 # emergency pressure valve: each time heap usage after a full (CMS)
 # garbage collection is above this fraction of the max, Cassandra will
@@ -178,7 +179,7 @@
 # address associated with the hostname (it might not be).
 #
 # Setting this to 0.0.0.0 is always wrong.
-listen_address: localhost
+listen_address: 192.168.213.92

 # Address to broadcast to other Cassandra nodes
 # Leaving this blank will set it to the same value as listen_address
@@ -190,7 +191,7 @@
 #
 # Leaving this blank has the same effect it does for ListenAddress,
 # (i.e. it will be based on the configured hostname of the node).
-rpc_address: localhost
+rpc_address: 192.168.213.92
 # port for Thrift to listen for clients on
 rpc_port: 9160
  • restart daemon
# pgrep -f cassandra | xargs kill -9
# /usr/local/apache-cassandra/bin/cassandra
  • Verify ring status (a per-node cross-check is sketched after this output)
# nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               100438156989107092060814573762535799562
192.168.213.92  datacenter1 rack1       Up     Normal  53.6 KB         93.47%  89332387546649365392870509741689618961
192.168.213.93  datacenter1 rack1       Up     Normal  49.19 KB        3.26%   94885272267878228726842541752112709261
192.168.213.91  datacenter1 rack1       Up     Normal  55.71 KB        3.26%   100438156989107092060814573762535799562
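To confirm that every node sees the same ring, the same command can be pointed at the other nodes over their JMX port (7199); this cross-check is my own addition, not part of the original setup:
# nodetool -h 192.168.213.92 -p 7199 ring
# nodetool -h 192.168.213.93 -p 7199 ring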
Finally, I'll introduce the monitoring phase in the next post.

Key Value Store - install cassandra

I recently got an opportunity to monitor a Key-Value Store, cassandra. Though I know how to monitor an RDBMS such as MySQL or PostgreSQL, I know little about cassandra. I am going to install cassandra and introduce monitoring it with cassandra-cli. In addition, as I need to monitor a cassandra multinode cluster, I'll show how to create one and how to monitor it with a nagios plugin I wrote.

Installation

  • install java(JDK)
    get the binary file here and transfer it.
# sh jdk-6u31-linux-x64-rpm.bin
  • install cassandra
# wget http://ftp.jaist.ac.jp/pub/apache//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz
# tar -C /usr/local/ -zxf apache-cassandra-1.0.8-bin.tar.gz
# ln -s /usr/local/apache-cassandra-1.0.8 /usr/local/apache-cassandra
  • setup PATH
# vi /etc/profile
...
if [ "$EUID" = "0" ]; then
      pathmunge /usr/local/apache-cassandra/bin       # add this line
...
fi
# . /etc/profile

Verification

  • start the cassandra daemon (it runs in the background by default)
# cassandra
  • connect to cassandra using cassandra-cli
# cassandra-cli -h 127.0.0.1 -p 9160
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.0.8

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown]
  • verify open ports (a quick nodetool check is sketched after this walkthrough)
# netstat -lnpt | grep java
tcp        0      0 127.0.0.1:9160              0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:34742               0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 127.0.0.1:7000              0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:47484               0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:7199                0.0.0.0:*                   LISTEN      2124/java
  • Create keyspace
[default@unknown] create keyspace DEMO;
2bbaee00-7442-11e1-0000-242d50cf1fbc
Waiting for schema agreement...
... schemas agree across the cluster

[default@unknown] use DEMO;
Authenticated to keyspace: DEMO

[default@DEMO] create column family Users;
327382c0-7442-11e1-0000-242d50cf1fbc
Waiting for schema agreement...
... schemas agree across the cluster

[default@DEMO] set Users[utf8('1234')][utf8('name')] = utf8('scott');
Value inserted.
Elapsed time: 33 msec(s).

[default@DEMO] set Users[utf8('1234')][utf8('password')] = utf8('tiger');
Value inserted.
Elapsed time: 4 msec(s).

[default@DEMO] get Users[utf8('1234')];
=> (column=6e616d65, value=scott, timestamp=1332436350273000)
=> (column=70617373776f7264, value=tiger, timestamp=1332436354369000)
Returned 2 results.
Elapsed time: 36 msec(s).

[default@DEMO] assume Users keys as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] assume Users comparator as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] assume Users validator as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] get Users['1234'];
=> (column=name, value=scott, timestamp=1332436350273000)
=> (column=password, value=tiger, timestamp=1332436354369000)
Returned 2 results.
Elapsed time: 2 msec(s).
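
As an extra sanity check (my own addition, mentioned at the "verify open ports" step), nodetool can query the node over the JMX port (7199) seen in the netstat output:
# nodetool -h 127.0.0.1 -p 7199 info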

Let's create a cassandra multinode cluster next time.

Sunday, April 1, 2012

NFSv3/v4 monitoring

I introduced how to set up an NFSv3/v4 server in the previous post, NFSv3/v4 setup & monitoring. Now I'd like to show how to monitor NFSv3/v4 servers with nagios plugins.

NFSv3 Server

  • verify the status with the nagios plugin check_rpc while nfs and portmap are running
# /etc/init.d/nfs status
rpc.mountd (pid 21781) is running...
nfsd (pid 21778 21777 21776 21775 21774 21773 21772 21771) is running...

# /etc/init.d/portmap status
portmap (pid 10609) is running...

# ps awuxc | egrep '(nfs|portmap|idmapd)'
rpc      29800  0.0  0.0   8072   676 ?        Ss   20:42   0:00 portmap
root     29915  0.0  0.0      0     0 ?        S<   20:42   0:00 nfsd4
root     29917  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29918  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29919  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29920  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29921  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29922  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29923  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd
root     29924  0.0  0.0      0     0 ?        S    20:42   0:00 nfsd

# rpcinfo -p
  program vers proto   port
   100000    2   tcp    111  portmapper
   100000    2   udp    111  portmapper
   100024    1   udp    773  status
   100024    1   tcp    776  status
   100003    2   udp   2049  nfs
   100003    3   udp   2049  nfs
   100003    4   udp   2049  nfs
   100021    1   udp  53048  nlockmgr
   100021    3   udp  53048  nlockmgr
   100021    4   udp  53048  nlockmgr
   100003    2   tcp   2049  nfs
   100003    3   tcp   2049  nfs
   100003    4   tcp   2049  nfs
   100021    1   tcp  37837  nlockmgr
   100021    3   tcp  37837  nlockmgr
   100021    4   tcp  37837  nlockmgr
   100005    3   udp    892  mountd
   100005    3   tcp    892  mountd
# check_rpc -H localhost -t -C nfs
OK: RPC program nfs version 2 version 3 version 4 tcp running
  • verify the status while nfs is stopped
# /etc/init.d/nfs stop
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
# check_rpc -H localhost -t -C nfs
CRITICAL: RPC program nfs  tcp is not running
  • verify the status while nfs and portmap are stopped
# /etc/init.d/portmap stop
Stopping portmap:                                          [  OK  ]
# check_rpc -H localhost -t -C nfs
CRITICAL: RPC program nfs  tcp is not running

NFSv4 Server

We are going to use the nagios plugin check_nfs4.0.2.pl to verify the status of the NFSv4 server.

  • verify the status with check_nfs4.0.2.pl while nfs, portmap, and rpcidmapd are running
# /etc/init.d/nfs status
rpc.mountd (pid 21781) is running...
nfsd (pid 21778 21777 21776 21775 21774 21773 21772 21771) is running...

# /etc/init.d/portmap status
portmap (pid 10609) is running...

# /etc/init.d/rpcidmapd status
rpc.idmapd (pid 18346) is running...

# ps awuxc | egrep '(nfs|portmap|idmapd)'
root      8102  0.0  0.0      0     0 ?        S<   Jan02   0:00 nfsiod
rpc      10609  0.0  0.0   8052   580 ?        Ss   11:44   0:00 portmap
root     18346  0.0  0.0  55180  1008 ?        Ss   12:20   0:00 rpc.idmapd
root     21770  0.0  0.0      0     0 ?        S<   12:35   0:00 nfsd4
root     21771  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21772  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21773  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21774  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21775  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21776  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21777  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd
root     21778  0.0  0.0      0     0 ?        S    12:35   0:00 nfsd

# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    870  rquotad
    100011    2   udp    870  rquotad
    100011    1   tcp    873  rquotad
    100011    2   tcp    873  rquotad
    100003    4   tcp   2049  nfs
    100021    1   udp  32769  nlockmgr
    100021    3   udp  32769  nlockmgr
    100021    4   udp  32769  nlockmgr
    100021    1   tcp  32803  nlockmgr
    100021    3   tcp  32803  nlockmgr
    100021    4   tcp  32803  nlockmgr
    100005    3   udp    757  mountd
    100005    3   tcp    760  mountd
# ./check_nfs4.0.2.pl -v
OK: nfsd cpu = 0% ; nfsd threads = 8 ; nfsd used threads <= 10% ; Server badcalls = 19 ; Server badauth = 19 |nfsd_cpu=0% nfsd_used_threads=10% io_read=0% io_write=0%
  • verify the status while rpcidmapd is stopped
# /etc/init.d/rpcidmapd stop
Stopping RPC idmapd:                                       [  OK  ]
# ./check_nfs4.0.2.pl -v
CRITICAL: nfsd cpu = 0% ; nfsd threads = 8 ; nfsd used threads <= 10% ; daemon idmapd is not running ; Server badcalls = 19 ; Server badauth = 19 |nfsd_cpu=0% nfsd_used_threads=10% io_read=0% io_write=0%
  • verify the status while rpcidmapd and nfs are stopped
# /etc/init.d/nfs stop
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
# ./check_nfs4.0.2.pl -v
CRITICAL: nfsd cpu = 0% ; nfsd threads = 0 ; nfsd used threads <= 10% ; daemons idmapd nfsd mountd are not running ; Server badcalls = 19 ; Server badauth = 19 |nfsd_cpu=0% nfsd_used_threads=10% io_read=0% io_write=0%
  • verify the status while rpcidmapd, nfs and portmap are stopped
# /etc/init.d/portmap stop
Stopping portmap:                                          [  OK  ]
# ./check_nfs4.0.2.pl -v
OK: nfsd cpu = 0% ; nfsd threads = 8 ; nfsd used threads <= 10% ; Server badcalls = 19 ; Server badauth = 19 |nfsd_cpu=0% nfsd_used_threads=10% io_read=0% io_write=0%

NFSv4 Client

check_nfs4.0.2.pl can verify the status of the NFSv4 client, too.

  • verify the status while portmap and rpcidmapd are running
# /etc/init.d/portmap status
portmap (pid 30107) is running...

# /etc/init.d/rpcidmapd status
rpc.idmapd (pid 30002) is running...
# ./check_nfs4.0.2.pl -i -v
OK: |
  • verify the status while rpcidmapd is stopped
# /etc/init.d/rpcidmapd stop
Stopping RPC idmapd:                                       [  OK  ]
# ./check_nfs4.0.2.pl -i -v
CRITICAL: daemon idmapd is not running |
  • verify the status while portmap is stopped
# /etc/init.d/portmap stop
Stopping portmap:                                          [  OK  ]
# ./check_nfs4.0.2.pl -i -v
OK: |
  • verify the status while portmap and rpcidmapd are stopped
# /etc/init.d/rpcidmapd stop
Stopping RPC idmapd:                                       [  OK  ]

# /etc/init.d/portmap stop
Stopping portmap:                                          [  OK  ]
# ./check_nfs4.0.2.pl -i -v
CRITICAL: daemon idmapd is not running |


These are examples of the services.cfg entries in the nagios configuration, shown below.
  • common sudoers setting (via visudo) for the nagios user on both the NFS server and client; a more restrictive alternative is sketched after these examples.
nagios          ALL=(ALL)       NOPASSWD: ALL
  • commands.cfg for check_by_ssh
define command{
        command_name    check_by_ssh_pub
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -i /usr/local/nagios/.ssh/id_rsa -l nagios -p $ARG1$ -t $ARG2$ -C $ARG3$
}

  • NFSv3 Server
define service{
        use                     generic-service
        host_name               nfsv3_server
        service_description     NFSv3:Server
        check_command           check_by_ssh_pub!22!60!"/usr/local/nagios/libexec/check_rpc -H localhost -t -C nfs"
}
  • NFSv4 Server
define service{
        use                     generic-service
        host_name               nfsv4_server
        service_description     NFSv4:Server
        check_command           check_by_ssh_pub!22!60!"/usr/bin/sudo /usr/local/nagios/libexec/check_nfs4.0.2.pl -v"
}
  • NFSv4 Client
define service{
        use                     generic-service
        host_name               nfsv4_client
        service_description     NFSv4:Client
        check_command           check_by_ssh_pub!22!60!"/usr/bin/sudo /usr/local/nagios/libexec/check_nfs4.0.2.pl -i -v"
}
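
As mentioned above, granting NOPASSWD: ALL is broader than these checks need. A more restrictive sudoers entry, limited to the plugin itself, could look like the line below (the path matches the check_command lines above; adjust it if the plugin lives elsewhere):
nagios          ALL=(root)      NOPASSWD: /usr/local/nagios/libexec/check_nfs4.0.2.pl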

Sunday, March 25, 2012

NFSv3/v4 setup & monitoring

NFS (Network File System) is quite a common protocol and still in demand, even though cluster file systems have become popular and familiar to most engineers. For example, it is sometimes necessary to use NFS when replacing an old on-premise system composed of an NFS server and clients. I've been using NFSv3, but I will try v4 as well, since it uses the static TCP port 2049 by default, which lets iptables get by with fewer rules than v3 needs.

I'll describe how to set up both NFSv3 and NFSv4, and also how to monitor both of them with nagios.
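
To illustrate the point about iptables: a pure NFSv4 server can get by with a single rule for TCP 2049. A minimal sketch, assuming a default-deny INPUT chain on CentOS 5:
# iptables -A INPUT -p tcp --dport 2049 -j ACCEPT
# service iptables save
For NFSv3 you would additionally open TCP/UDP 111 for the portmapper and pin mountd, statd, and nlockmgr to fixed ports in /etc/sysconfig/nfs so they can be allowed as well.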
