Sunday, 7 April 2013

Nagios: CRITICAL - Socket timeout after 10 seconds


Socket timeout after 10 seconds:

Like any other monitoring system, Nagios can produce false alarms. This usually happens when Nagios does not get a reply from the monitored host within a predefined timeout. Before marking a service as down, Nagios checks it three times; if all three checks fail, the service is marked down and the administrator gets an alert about its critical status. Depending on the configuration, Nagios may also report to the administrator when only one of those checks fails.
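
The number of checks before a service is marked down is set per service. Below is a minimal sketch of such a service definition, assuming Nagios 3 directive names. The host name web01, the service description, the check_load argument and the generic-service template are hypothetical placeholders, so substitute the values from your own setup:

```cfg
define service {
    use                     generic-service         ; assumed template name
    host_name               web01                   ; hypothetical host
    service_description     Load via NRPE           ; hypothetical description
    check_command           check_nrpe!check_load   ; check_load is a placeholder NRPE command
    max_check_attempts      3   ; consecutive failed checks before the service is marked down
    check_interval          5   ; time units between regular checks (one unit is 60 seconds by default)
    retry_interval          1   ; time units between re-checks after a failure
}
```

With values like these, a single timed-out check only triggers re-checks, and the service is marked down only if the re-checks fail as well.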

If you occasionally get false alarms while the service is actually online, it makes sense to increase the timeout from the default 10 seconds to, say, 20 seconds.

FIX:

Open the Nagios config file where check commands are defined (usually /etc/nagios/commands.cfg), find the block named check_nrpe, and add "-t 20" to the end of its command_line so that it looks like this:

define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 20
}

Then restart Nagios so the change takes effect.

Besides check_nrpe, there are other commands such as check_http and check_smtp. They also support the -t option, so you can modify them in the same way as check_nrpe, depending on which checks are timing out.
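
For illustration, the same change applied to check_http and check_smtp might look like the sketch below. The command lines are assumed to follow the common sample definitions and may differ from yours; keep whatever arguments your own commands.cfg already has and only append -t 20:

```cfg
define command {
    command_name    check_http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$ -t 20
}

define command {
    command_name    check_smtp
    command_line    $USER1$/check_smtp -H $HOSTADDRESS$ $ARG1$ -t 20
}
```

As with check_nrpe, restart Nagios after editing these definitions.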
