Nagios support

Nagios plugins need to implement more logic than, for example, Zabbix plugins. This includes calculating rates and deltas (and therefore keeping some state about monitored items), deciding whether to raise an alert, and so on.

The Zorka agent natively implements these common pieces (rates and deltas, alert thresholds, etc.) along with the NRPE protocol, Nagios data formatting and Nagios command definitions, and ties them together with a DSL-like fluent API. See below for the list of Nagios commands currently defined.
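To illustrate why a rate needs stored state, here is a minimal sketch that turns two samples of an ever-increasing counter into a per-minute rate. It is not Zorka's implementation; the function name rate_per_min and the sample values are made up for illustration.

```shell
# rate_per_min PREV_VALUE PREV_TIME_S CUR_VALUE CUR_TIME_S
# Prints the counter increase per minute between two samples (integer math).
rate_per_min() {
  delta_v=$(( $3 - $1 ))
  delta_t=$(( $4 - $2 ))
  # Without elapsed time no rate can be computed; print 0 instead of dividing by zero.
  if [ "$delta_t" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( delta_v * 60 / delta_t ))
}

# A counter that grew from 1200 to 1290 over 30 seconds:
rate_per_min 1200 1000 1290 1030   # prints 180
```

The previous value and timestamp have to survive between two invocations, which is the state that a standalone Nagios plugin would otherwise have to keep itself.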

Enabling Nagios support

The following settings control Nagios support (shown with their default values):

  • nagios = no - enables or disables nagios integration;

  • nagios.listen.addr = 127.0.0.1 - listen address for NRPE protocol;

  • nagios.listen.port = 5669 - listen port for NRPE protocol;

  • nagios.server.addr = 127.0.0.1 - nagios server address; agent will accept connections only from this server; multiple addresses separated by comma can be added here;

As shown above, Nagios support is disabled by default and needs to be enabled by adding nagios = yes to zorka.properties. The NRPE interface currently cannot use SSL, but it has security measures similar to those implemented in the Zabbix interface: nagios.server.addr has to be set to the actual IP address of the Nagios server (instead of the default value, 127.0.0.1).
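As a sketch of the steps above, the following snippet appends a Nagios section to zorka.properties. The file location (ZORKA_PROPS) and both IP addresses are placeholders chosen for illustration; substitute the values of your own installation.

```shell
# Hypothetical path; point this at the zorka.properties of your agent.
ZORKA_PROPS="${ZORKA_PROPS:-./zorka.properties}"

# nagios             - turns the NRPE listener on (it is off by default)
# nagios.listen.addr - example: an address the Nagios server can reach
# nagios.listen.port - the default port, repeated here for clarity
# nagios.server.addr - example: the only server allowed to connect
cat >> "$ZORKA_PROPS" <<'EOF'
nagios = yes
nagios.listen.addr = 192.0.2.20
nagios.listen.port = 5669
nagios.server.addr = 192.0.2.10
EOF
```

Leaving nagios.listen.addr at 127.0.0.1 would only make sense if check_nrpe runs on the monitored host itself.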

Collected metrics

There are several groups of predefined commands that provide Nagios with status and performance data for the major components of the monitored application. Nagios commands are called by passing nagios.cmd["NAME"] to check_nrpe. The -n option tells check_nrpe not to use SSL, which the agent's NRPE interface does not support. Some examples:

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["MEM_POOL"]'
/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["GC"]'
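The repeated command line can be wrapped in a small helper. This is only a convenience sketch: the function name zorka_check is made up, and the check_nrpe path is simply the one used in the examples above.

```shell
# Path taken from the examples above; override CHECK_NRPE if yours differs.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# zorka_check NAME [HOST] [PORT]
# Runs nagios.cmd["NAME"] through check_nrpe and passes its exit status on.
zorka_check() {
  "$CHECK_NRPE" -n -H "${2:-127.0.0.1}" -p "${3:-5669}" -c "nagios.cmd[\"$1\"]"
}

# Example: zorka_check MEM_POOL; echo "exit status: $?"
```

By the usual Nagios plugin convention the exit status is 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, so the helper can be used directly in shell conditionals.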

Basic JVM metrics

Several commands collect basic JVM metrics.

Memory pools

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["MEM_POOL"]'

This generates information about memory pools. Sample result:

MEM_POOL OK - CMS Old Gen 45 MB (0%); | CMS Old Gen=45MB;45;7352;0
CMS Perm Gen 29 MB (36%);
Code Cache 4 MB (8%);
Par Eden Space 95 MB (17%);
Par Survivor Space 4 MB (6%); | CMS Perm Gen=29MB;29;82;36
Code Cache=4MB;4;48;8
Par Eden Space=95MB;95;532;17
Par Survivor Space=4MB;4;66;6

All memory pools are detected and reported automatically. Depending on the garbage collector policy, the appropriate Old Generation pool is selected as the primary metric.

The following performance data is collected (in order of appearance in performance data clauses):

  • pool usage (megabytes);

  • total pool size (megabytes);

  • pool utilization (percent);

The following settings can be used to alter MEM_POOL command behavior:

  • nagios.cmd.MEM_POOL.warn = 80 - warning threshold for pool utilization;

  • nagios.cmd.MEM_POOL.alrt = 90 - alert threshold for pool utilization;
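How a pair of thresholds maps a utilization value to a Nagios status can be sketched as follows. This shows the general idea only: whether the agent compares with "greater than" or "greater than or equal", and which pools it evaluates, is not stated here, and the function name classify is hypothetical.

```shell
# classify VALUE WARN ALRT   (all integers)
# Prints the status a value would get under the given thresholds,
# assuming a value equal to a threshold already triggers it.
classify() {
  if [ "$1" -ge "$3" ]; then
    echo CRITICAL
  elif [ "$1" -ge "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Using the default MEM_POOL thresholds (warn = 80, alrt = 90):
classify 36 80 90   # prints OK
classify 85 80 90   # prints WARNING
classify 95 80 90   # prints CRITICAL
```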

Garbage collectors

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["GC"]'

Sample result:

GC OK - ConcurrentMarkSweep 0 %CPU; | ConcurrentMarkSweep=0%;0
ParNew 0 %CPU; | ParNew=0%;0

The following performance data are collected and reported:

  • CPU utilization by particular GC (%CPU);

  • number of GC cycles per minute;

The following settings can be used to alter GC command behavior:

  • nagios.cmd.GC.warn = 10 - warning threshold for GC CPU utilization;

  • nagios.cmd.GC.alrt = 25 - alert threshold for GC CPU utilization;
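This page does not say how the %CPU figure is derived. One plausible reading, shown here purely as an assumption, is the share of wall-clock time spent in a collector between two samples of its cumulative collection time in milliseconds. The function name gc_cpu_pct and the numbers are illustrative.

```shell
# gc_cpu_pct PREV_GC_MS PREV_WALL_MS CUR_GC_MS CUR_WALL_MS
# Prints the percentage of elapsed time spent in GC (integer, rounded down).
gc_cpu_pct() {
  gc=$(( $3 - $1 ))
  wall=$(( $4 - $2 ))
  # No elapsed time means nothing to report; avoid dividing by zero.
  if [ "$wall" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( gc * 100 / wall ))
}

# 1500 ms of collection time accumulated over 60 seconds:
gc_cpu_pct 5000 0 6500 60000   # prints 2
```

With the default thresholds, a collector would have to occupy roughly 10% of elapsed time before a warning is raised.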

Threads

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["THREAD"]'

Sample result:

THREAD OK 49 thr (0 thr/min); | Threads=49;29;0

The following performance data are collected and reported:

  • number of all threads;

  • number of daemon threads;

  • number of threads created (per minute);

The following settings can be used to alter THREAD command behavior:

  • nagios.cmd.THREAD.warn = 500 - warning threshold for the number of threads;

  • nagios.cmd.THREAD.alrt = 700 - alert threshold for the number of threads;

Zorka Stats metrics

All commands for monitoring Zorka stats behave in a very similar way. Each is enabled when the corresponding component (e.g. HTTP, SQL) has been enabled, and reports general stats for that component. The following performance data are collected and reported:

  • number of requests per minute;

  • number of errors per minute;

  • peak service time (since last execution of nagios command);

  • average service time (since last execution of nagios command);

The following settings can be used to alter command behavior:

  • nagios.cmd.XX.err.warn = 500 - warning threshold for the number of errors;

  • nagios.cmd.XX.err.alrt = 700 - alert threshold for the number of errors;

In the settings above, XX is the component name, e.g. HTTP or SQL.
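To make the XX substitution concrete, this sketch prints the property lines for a list of components. The helper name gen_err_thresholds and the threshold values 50 and 100 are arbitrary examples, not recommended settings.

```shell
# gen_err_thresholds WARN ALRT COMPONENT...
# Prints err.warn / err.alrt property lines for every component given.
gen_err_thresholds() {
  warn=$1
  alrt=$2
  shift 2
  for comp in "$@"; do
    printf 'nagios.cmd.%s.err.warn = %s\n' "$comp" "$warn"
    printf 'nagios.cmd.%s.err.alrt = %s\n' "$comp" "$alrt"
  done
}

gen_err_thresholds 50 100 HTTP SQL
# prints:
#   nagios.cmd.HTTP.err.warn = 50
#   nagios.cmd.HTTP.err.alrt = 100
#   nagios.cmd.SQL.err.warn = 50
#   nagios.cmd.SQL.err.alrt = 100
```

The printed lines can then be added to zorka.properties.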

HTTP requests

This metric is enabled if HTTP monitoring is enabled (defined in http.bsh, enabled by default for most application servers). The HTTP metric monitors the generic Zorka stats MBean:

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["HTTP"]'

Sample output:

HTTP OK 0 req/min, 0 err/min, peak(t)=1878 ms, avg(t)=0 ms; | HTTP=0;0;1878;

SQL metrics

This metric is enabled when at least one JDBC monitoring script is loaded (thus enabling SQL monitoring).

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["SQL"]'

LDAP metrics

This metric is enabled when the ldap.bsh script has been loaded.

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["LDAP"]'

EJB metrics

This metric is enabled if EJB monitoring is enabled (defined in ejb.bsh, enabled by default if the monitored application server supports it):

/usr/lib/nagios/plugins/check_nrpe  -n -H 127.0.0.1 -p 5669 -c 'nagios.cmd["EJB"]'
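Since the stats commands share one calling convention, they can be polled in a loop. The sketch below is a hypothetical helper (check_all is not part of Zorka or Nagios); it assumes the usual plugin exit codes and treats the numerically highest one as the worst, which is a simplification.

```shell
# Path taken from the examples above; override CHECK_NRPE if yours differs.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# check_all HOST PORT NAME...
# Runs each command, prints the first line of its output,
# and returns the highest exit status seen.
check_all() {
  host=$1
  port=$2
  shift 2
  worst=0
  for name in "$@"; do
    if out=$("$CHECK_NRPE" -n -H "$host" -p "$port" -c "nagios.cmd[\"$name\"]"); then
      rc=0
    else
      rc=$?
    fi
    printf '%s\n' "$out" | head -n 1
    if [ "$rc" -gt "$worst" ]; then
      worst=$rc
    fi
  done
  return "$worst"
}

# Example: check_all 127.0.0.1 5669 HTTP SQL LDAP EJB
```

Commands whose component is not enabled on a given agent should be left out of the list.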