
Saturday, June 4, 2016

Update on Watchdogs

This post was first written several years ago to show various ways to protect applications or the kernel from hang-ups. Such protection is needed when you run a server application, a security camera or network-related devices.


Here is a quick and concise summary of the various ways to use the watchdog functionality.
After all the trouble some of us went through to master the watchdog, it basically distilled down to three different methods.

These three methods cannot be combined, because whichever method you use claims the /dev/watchdog device for itself.

The watchdog device is already activated at boot time for all three methods.
I tried Method 1 and Method 2, which are RPi specific, on an RPi Model 3B+ running Stretch, and on an RPi Model 4 running Buster. Both methods work fine on either RPi. Method 3 is very generic, and only needs one adjustment to avoid a bug.

Method 1

The easy shell method is as follows:
With a little script, you can add protection for kernel and user-space program hang-ups.
You start that process by sending a period "." to /dev/watchdog. This kicks off what I would call a keep-alive session. You, or your program, now need to keep sending a "." to /dev/watchdog at least once every 15 seconds. If you don't, the RPi will reboot automatically. You can send the character "V" to the device to cancel this process.

You can use the following command to test this out. Watch out, however: the RPi will reboot in 15 seconds if this is all you do!

sudo sh -c "echo '.' >> /dev/watchdog"

Every time you resend this command within a 15 second window, the watchdog counter will be reset. If you stop doing this or wait for more than 15 seconds, the timer overflows, and the RPi gets rebooted.

Creating and activating the following little script (from user sparky777) will protect the RPi against kernel hang-ups.

#!/bin/bash
echo " Starting user level protection"
while :
   do
      sudo sh -c "echo '.' >> /dev/watchdog"
      sleep 14
   done

When this script gets installed by init.d or systemd at boot time, it most likely runs as root, so there is no need for the "sudo sh -c" trick. You can simply use "echo . >> /dev/watchdog" instead.
I took the easy way and installed it with cron. Just add
@reboot /home/pi/name-of-program
and reboot to install.
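
The same keep-alive loop can also be sketched in Python, which makes it easy to send the "V" on a clean exit. This is only a minimal sketch written for Python 3; the function name keep_alive and its parameters are my own invention for illustration, and whether "V" really stops the watchdog depends on how the driver was built.

```python
import time

KICK_INTERVAL = 14  # seconds; stays inside the 15 second window


def keep_alive(dev, kicks, interval=KICK_INTERVAL, sleep=time.sleep):
    """Write '.' to `dev` `kicks` times, then 'V' to cancel the watchdog.

    `dev` is any writable binary file object, so the loop can be tried
    against an in-memory buffer before pointing it at /dev/watchdog.
    """
    for _ in range(kicks):
        dev.write(b".")   # each write resets the watchdog counter
        dev.flush()
        sleep(interval)
    dev.write(b"V")       # clean exit: ask the driver to stop the watchdog
    dev.flush()
```

On the RPi you would call it as root with the real device, for example keep_alive(open("/dev/watchdog", "wb", buffering=0), kicks=10). If the loop is interrupted by an error, no "V" is written, so the watchdog stays armed and the RPi reboots, which is the behaviour you want from a watchdog.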

When this script runs, there is now protection against kernel-related issues. This can be tested with the so-called fork bomb.
Make sure the script runs.
Simply type the following sequence at a prompt and then hit return to launch the fork-bomb.

: (  ){ : | : &  }; : 

The RPi will reboot in about 15 seconds.


Method 2
The second method provides the same functionality through systemd.

To let systemd use the watchdog, and to turn it on, you need to edit the systemd configuration file.


sudo nano /etc/systemd/system.conf 
and change the following line:
#RuntimeWatchdogSec=
to:
RuntimeWatchdogSec=10s
Fifteen seconds is the maximum the BCM hardware allows.
I also suggest you activate the shutdown period protection by removing the '#' in front of the next line.
ShutdownWatchdogSec=10min

After a reboot, this will activate and reserve the watchdog device for systemd use. You can check the activation with :

dmesg | grep watchdog

It should report something like this on an RPi Model 3B+ with Stretch:
[ 0.784298] bcm2835-wdt 3f100000.watchdog: Broadcom BCM2835 watchdog timer
[ 1.696537] systemd[1]: Hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0
[ 1.696628] systemd[1]: Set hardware watchdog to 10s.
systemd will now kick the hardware watchdog automatically at least every 10/2 = 5 seconds. If systemd or the kernel hangs and the watchdog is not kicked for 10 seconds, the RPi reboots.
This means that there is default protection for kernel-related issues. This can be tested with the so-called fork bomb, see above.
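
To double check the numbers, here is a small Python 3 sketch that reads RuntimeWatchdogSec from the text of a system.conf file, compares it with the 15 second hardware limit, and reports the half-interval kick. The function names are hypothetical, and the parser only understands plain seconds, "Ns" and "Nmin" values, which is an assumption that covers the examples in this post but not every time format systemd accepts.

```python
import configparser

BCM_MAX_TIMEOUT = 15  # seconds; the hardware limit mentioned above


def runtime_watchdog_seconds(conf_text):
    """Return RuntimeWatchdogSec in seconds, or None when it is not set."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the key names case sensitive
    parser.read_string(conf_text)
    raw = parser.get("Manager", "RuntimeWatchdogSec", fallback="").strip()
    if not raw:
        return None
    if raw.endswith("min"):
        return int(raw[:-3]) * 60
    if raw.endswith("s"):
        return int(raw[:-1])
    return int(raw)


def describe(conf_text):
    """Summarise the watchdog setting in one line of text."""
    seconds = runtime_watchdog_seconds(conf_text)
    if seconds is None:
        return "hardware watchdog not used by systemd"
    if seconds > BCM_MAX_TIMEOUT:
        return "%ds exceeds the %ds hardware limit" % (seconds, BCM_MAX_TIMEOUT)
    return "timeout %ds, kicked about every %.1fs" % (seconds, seconds / 2)
```

You could feed it the real file with describe(open("/etc/systemd/system.conf").read()).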

If you want the user-space application protection capability, you have to use the systemd API within your program. This is covered in the section on the systemd API below.

Method 3


The third method is not RPi specific and uses a rather large and sophisticated daemon package (pretty much legacy now) that lets you set many different conditions under which the RPi is rebooted. After installation you can use

man watchdog

for more information, or go here: https://linux.die.net/man/8/watchdog

The package needs to be installed first.

sudo apt-get install watchdog

Because this is a widespread legacy package, I'm not going to cover it in detail here.
To set some of the parameters the watchdog daemon should watch, edit its configuration file:

sudo nano /etc/watchdog.conf

For the fork bomb test I took away the "#" marks from the following lines:
# This is an optional test by pinging my router
ping=192.168.1.1
max-load-1 = 24
min-memory = 1
watchdog-device = /dev/watchdog
watchdog-timeout = 15
Note: The last line is very important and RPi specific. If this line is not added, you get a rather cryptic error (run sudo systemctl status watchdog.service to see it):
cannot set timeout 60 (errno = 22 = 'Invalid argument')
The cause is the default timeout of 60 seconds, which the watchdog timers in most other Linux systems can handle. Because the watchdog timer on the RPi SoC only handles a maximum of 15 seconds, this line must be added, otherwise the package won't work at all.
Unfortunately, this is a bug that the Foundation missed: the 15 second default should have been programmed into the kernel, or added by default to the watchdog.conf file.
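
The reasoning behind that error can be put into a few lines of Python 3: if watchdog.conf has no active watchdog-timeout line, the daemon falls back to 60 seconds, which is more than the 15 seconds the RPi hardware accepts. This is only a sketch under those two assumptions (both numbers are taken from this post), and the function names are made up for the example.

```python
DAEMON_DEFAULT_TIMEOUT = 60  # seconds; the value shown in the error message
BCM_MAX_TIMEOUT = 15         # seconds; the RPi hardware limit


def watchdog_timeout(conf_text):
    """Return the timeout the daemon would request for this watchdog.conf."""
    timeout = DAEMON_DEFAULT_TIMEOUT
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "watchdog-timeout":
            timeout = int(value)
    return timeout


def fits_hardware(conf_text, limit=BCM_MAX_TIMEOUT):
    """True when the requested timeout is within the hardware limit."""
    return watchdog_timeout(conf_text) <= limit
```

With the line still commented out, fits_hardware() returns False, which corresponds to the 'Invalid argument' error above.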


---------------------------------------------------------------------------------------------------------------------------------
Using the systemd API to let a program control the watchdog.

Below I will show how to add extra support for your own (Python) application by using the systemd API and framework.

The idea is to let systemd run a software watchdog that supervises your own application program.

As I showed above, you use the hardware BCM watchdog to reboot the RPi when the kernel becomes unresponsive, or when systemd is no longer operational.

A higher level of control can be added with a software watchdog. Systemd provides one, plus an interface (API) to use it.
The combination of the two provides the supervisor chain (in systemd speak).

There are two steps to implement this method.

1. You need to provide a service configuration file for systemd to instruct it what to do.
2. You need to add a few things to your own application to make it all work in this environment.

In essence, you are going to ask systemd to initiate the watchdog, and your application needs to "ping" it at regular intervals. If the application fails to do that, systemd will take action and can ultimately reboot the RPi.
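
How often the application should ping follows from the WATCHDOG_USEC environment variable that systemd sets for the service (the WatchdogSec value in microseconds). Here is a minimal Python 3 sketch that turns it into a ping interval of half the timeout; the function name kick_interval is hypothetical, and the factor of one half is simply the safety margin used throughout this post.

```python
import os


def kick_interval(environ=os.environ):
    """Return the ping interval in seconds, or None if no watchdog is set.

    WATCHDOG_USEC arrives as a string, so it is converted before use.
    """
    raw = environ.get("WATCHDOG_USEC")
    if not raw:
        return None
    usec = int(raw)
    if usec <= 0:
        return None
    return usec / 1000000.0 / 2  # half the timeout, in seconds
```

Passing a plain dictionary instead of os.environ lets you try it outside systemd, for example kick_interval({"WATCHDOG_USEC": "10000000"}).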

I wrote a systemd service file that will let you test a number of elements.

# This service installs a python test program that allows us to test the
# systemd software watchdog. This watchdog can be used to protect from hangups.
# On top of that, when the service crashes, it is automatically restarted.
# If it crashes too many times, it will be forced to fail, or you can let systemd reboot the RPi
#

[Unit]
Description=Installing Python test script for a systemd s/w watchdog
Requires=basic.target
After=multi-user.target

[Service]
Type=notify
WatchdogSec=10s
ExecStart=/usr/bin/python /home/pi/systemd-test.py
Restart=always

# The number of times the service is restarted within a time period can be set
# If that condition is met, the RPi can be rebooted
#
StartLimitBurst=4
StartLimitInterval=180s
# actions can be none|reboot|reboot-force|reboot-immediate
StartLimitAction=none

# The following are defined in the /etc/systemd/system.conf file and are
# global for all services
#
#DefaultTimeoutStartSec=90s
#DefaultTimeoutStopSec=90s
#
# They can also be set per service here:
# if they are not defined here, they fall back to the system.conf values
TimeoutStartSec=2s
TimeoutStopSec=2s

[Install]
WantedBy=multi-user.target

Details can be found in the systemd.service(5) man page.


I also wrote a Python script that lets you play with this system and experiment to your heart's delight.

#!/usr/bin/python2.7
#-------------------------------------------------------------------------------
# Name:        systemd daemon & watchdog test file
# Purpose:
#
# Author:      paulv
#
# Created:     07-05-2016
# Copyright:   (c) paulv 2016
# Licence:     <your licence>
#-------------------------------------------------------------------------------

import sys
import os
from time import sleep
import signal
import subprocess
import socket

init = True

def sd_notify(unset_environment, s_cmd):

    """
    Notify service manager about start-up completion and to kick the watchdog.

    https://github.com/kirelagin/pysystemd-daemon/blob/master/sddaemon/__init__.py

    This is a reimplementation of systemd's reference sd_notify().
    sd_notify() should be used to notify the systemd manager about the
    completion of the initialization of the application program.
    It is also used to send watchdog ping information.

    """
    global init

    sock = None

    try:
        if not s_cmd:
            sys.stderr.write("error : missing s_cmd\n")
            return(1)

        s_adr = os.environ.get('NOTIFY_SOCKET', None)
        if init : # report this only one time
            sys.stderr.write("Notify socket = " + str(s_adr) + "\n")
            # this will normally return : /run/systemd/notify
            init = False

        if not s_adr:
            sys.stderr.write("error : missing socket\n")
            return(1)

        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        # send the message once; sendto() returns the number of bytes sent,
        # so a return value of 0 means nothing went out
        if sock.sendto(s_cmd, s_adr) == 0:
            sys.stderr.write("error : incorrect sock.sendto  return value\n")
            return(1)
    except Exception as e:
        # report the failure instead of silently ignoring it
        sys.stderr.write("error : sd_notify() failed : " + str(e) + "\n")
        return(1)
    finally:
        # terminate the socket connection
        if sock:
            sock.close()
        if unset_environment:
            if 'NOTIFY_SOCKET' in os.environ:
                del os.environ['NOTIFY_SOCKET']
    return(0) # OK


def sig_handler (signum=None, frame = None):
    """
    This function will catch the most important system signals, but NOT a shutdown!
    During testing, you can use this code to see what termination methods are used or filter
    some out.

    This handler catches the following signals from the OS:
        SIGHUP = (1) SSH Terminal logout
        SIGINT = (2) Ctrl-C
        SIGQUIT = (3) ctrl-\
        IOerror = (5) when terminating the SSH connection (input/output error)
        SIGTERM = (15) Daemon terminate (daemon --stop): is coming from daemon manager
    However, it cannot catch SIGKILL = (9), the kill -9 or the shutdown procedure
    """

    try:
        print "\nSignal handler called with signal : {0}".format(signum)
        if signum == 1 :
            sys.stderr.write("Sighandler: ignoring SIGHUP signal : " + str(signum) + "\n")
            return # ignore SSH logout termination
        sys.stderr.write("terminating : python test script\n")
        sys.exit(1)

    except Exception as e: # IOerror 005 when terminating the SSH connection
        sys.stderr.write("Unexpected Exception in sig_handler() : "+ str(e) + "\n")
        subprocess.call(['logger "Unexpected Exception in sig_handler()"'], shell=True)
        return

def main():

    # setup a catch for the following termination signals: (signal.SIGINT = ctrl-c)
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP, signal.SIGQUIT):
        signal.signal(sig, sig_handler)

    # get the timeout period from the systemd-test.service file
    wd_usec = os.environ.get('WATCHDOG_USEC', None)
    # environment values are strings, so test for "0" rather than the number 0
    if wd_usec is None or wd_usec == "0":
        sys.stderr.write("terminating : incorrect watchdog interval sequence\n")
        sys.exit(1)

    wd_usec = int(wd_usec)
    # use half the time-out value in seconds for the kick-the-dog routine to
    # account for Linux housekeeping chores
    wd_kick = wd_usec / 1000000 / 2
    sys.stderr.write("watchdog kick interval = " + str(wd_kick) + "\n")

    try:
        sys.stderr.write("starting : python daemon watchdog and fail test script started\n")
        # notify systemd that we've started
        retval = sd_notify(0, "READY=1")
        if retval != 0:
            sys.stderr.write("terminating : fatal sd_notify() error for script start\n")
            exit(1)

        # after the init, ping the watchdog and check for errors
        retval = sd_notify(0, "WATCHDOG=1")
        if retval != 0:
            sys.stderr.write("terminating : fatal sd_notify() error for watchdog ping\n")
            exit(1)

        ctr = 0 # setup a counter to initiate a watchdog fail
        while True :
            if ctr > 5 :
                sys.stderr.write("forcing watchdog fail, restarting service\n")
                sleep(20)

            sleep(wd_kick)
            sys.stderr.write("kicking the watchdog : ctr = " + str(ctr) + "\n")
            sd_notify(0, "WATCHDOG=1")
            ctr += 1


    except KeyboardInterrupt:
        print "\nTerminating by Ctrl-C"
        exit(0)


if __name__ == '__main__':
    main()

The comments should give you an idea of what is needed. In a nutshell, the application needs to signal systemd that it has finished the initialization. At regular intervals, the software watchdog is updated. There is a fail condition in the code above that will mimic a hung application.
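
If you want to see what actually goes over the notify socket without involving systemd, you can let a local datagram socket stand in for it. The following is a minimal Python 3 sketch; the names notify and demo are invented for this example, and the temporary socket path is only a stand-in for the real NOTIFY_SOCKET address.

```python
import os
import socket
import tempfile


def notify(message, address):
    """Send one datagram to `address` and return the number of bytes sent."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        return sock.sendto(message.encode("ascii"), address)
    finally:
        sock.close()


def demo():
    """Bind a stand-in notify socket and collect what a service would send."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "notify")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    server.bind(path)
    try:
        notify("READY=1", path)      # start-up is complete
        notify("WATCHDOG=1", path)   # one watchdog ping
        return [server.recv(64).decode("ascii") for _ in range(2)]
    finally:
        server.close()
        os.remove(path)
        os.rmdir(tmpdir)
```

Calling demo() returns the two messages in the order they were sent, which is the same traffic systemd receives from the test script.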

Here is how you install and test this all.
Open an editor:

nano systemd-test.service

Copy and paste the service code above into the editor. Save the file and close the editor. Copy this file into the systemd structure with :

sudo cp systemd-test.service /etc/systemd/system

Open an editor again:

nano systemd-test.py

Copy and paste the Python code above into the editor. Save the file and close the editor. Make the python script executable :

chmod +x systemd-test.py

Run the service script in the systemd environment :

sudo systemctl start systemd-test

Watch what is going on with

tail -f /var/log/syslog

After 4 failures and automatic restarts of the Python script, systemd puts the service in a failed state. You can also let the RPi reboot when this happens; all you need to do is change StartLimitAction=none to StartLimitAction=reboot in the systemd-test.service file.
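
The effect of StartLimitBurst and StartLimitInterval can be pictured with a simplified model: count the starts within a time window, and refuse the next one once the burst is used up. The Python 3 sketch below is my own simplification and not systemd's actual code; the function name and the start times in the usage note are hypothetical.

```python
def allowed_starts(start_times, burst=4, interval=180.0):
    """Return the start times a simple fixed-window rate limit would permit.

    `start_times` are seconds since the first start, in ascending order.
    """
    window_begin = None
    count = 0
    permitted = []
    for now in start_times:
        if window_begin is None or now - window_begin > interval:
            window_begin, count = now, 0  # open a new window
        if count < burst:
            count += 1
            permitted.append(now)
        else:
            break  # limit hit: the service would end up in the failed state
    return permitted
```

With starts at 0, 40, 80, 120 and 160 seconds, allowed_starts([0, 40, 80, 120, 160]) permits the first four and refuses the fifth, because all five fall inside one 180 second window.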

If you would like to test the application within the boot process, run this :

sudo systemctl enable systemd-test

After a reboot, you can watch it all by using the above tail command again.
If you decide to change the Python script, you can do that while the system is running. At the next restart, the new code is automatically loaded and executed. If you want to change parameters in the .service file, you can do that too, but you need to activate and reload those changes. You do that with

sudo systemctl daemon-reload

and then

sudo systemctl restart systemd-test

I had great fun discovering all the possibilities systemd now offers to add better control to my own scripts.

Please chime in if you have improvements or suggestions!

Enjoy!
