Guide

This tutorial will guide you through all important steps of writing a check with the mplugin class library. Read this to get started.

Key concepts

mplugin has a fine-grained class model with clear separation of concerns. This allows plugin writers to focus on one particular tasks at a time while writing plugins. Monitoring plugins need to perform three steps: data acquisition, evaluation, and presentation. Each step has an associated class (Resource, Context, Summary) and information between tasks is passed with structured value objects (Metric, Result).

Classes overview

Here is a diagram with the most important classes and their relationships:

             +----------+                \
             | Resource |                 |
             +----------+                 |
       _____/      |     \_____           | Acquisition
      v            v           v          |
+---------+   +---------+   +---------+   |
| Metric  |...| Metric  |...| Metric  |  <
+---------+   +---------+   +---------+   |
     |             |             |        |
     v             v             v        |
+---------+   +---------+   +---------+   |
| Context |...| Context |...| Context |   | Evaluation
+---------+   +---------+   +---------+   |
     |             |             |        |
     v             v             v        |
+---------+   +---------+   +---------+   |
| Result  |...| Result  |...| Result  |  <
+---------+   +---------+   +---------+   |
           \___    |    ___/              |
               v   v   v                  | Presentation
              +---------+                 |
              | Summary |                 |
              +---------+                /
mplugin.Resource

A model of the thing being monitored. It should usually have the same name as the whole plugin. Generates one or more metrics.

Example: system load

mplugin.Metric

A single measured data point. A metric consists of a name, a value, a unit, and optional minimum and maximum bounds. Most metrics are scalar (the value can be represented as single number).

Example: load1=0.75

mplugin.Context

Additional information to evaluate a metric. A context has usually a warning and critical range which allows to determine if a given metric is OK or not. Contexts also include information on how to present a metric in a human-readable way.

Example: warning=0.5, critical=1.0

mplugin.Result

Product of a metric and a context. A result consists of a state (“ok”, “warning”, “critical”, “unknown”), some explanatory text, and references to the objects that it was generated from.

Example: WARNING - load1 is 0.75

mplugin.Summary

Condenses all results in a single status line. The status line is the plugin’s most important output: it appears in mails, text messages, pager alerts etc.

Example: LOAD WARNING - load1 is 0.75 (greater than 0.5)

The following tutorials which will guide you through the most important features of mplugin.

Hint

Study the source code in the mplugin/examples directory for complete examples.

Minimal setup

Tutorial #1: ‘Hello world’ check

In the first tutorial, we will develop check_world. This check will determine if the world exists. The algorithm is simple: if the world would not exist, the check would not execute.

This minimalistic check consists of a Resource World which models the part of the world that is interesting for the purposes of our check. Resource classes must define a Resource.probe() method which returns a list of metrics. We just return a single Metric object that states that the world exists.

#! /usr/bin/env python

"""Hello world monitoring check."""

import mplugin


class World(mplugin.Resource):
    def probe(self):
        return [mplugin.Metric("world", True, context="null")]


def main():
    check = mplugin.Check(World())
    check.main()


if __name__ == "__main__":
    main()

We don’t have a context to evaluate the returned metric yet, so we resort to the built-in “null” context. The “null” context does nothing with its associated metrics.

We now create a Check object that is fed only with the resource object. We could put context and summary objects into the Check() constructor as well. This will be demonstrated in the next tutorial. There is also no command line processing nor timeout handling nor output control. We call the Check.main() method to evaluate resources, construct text output and exit with the appropriate status code.

Running the plugin creates very simple output:

1$ check_world.py
2WORLD OK

The plugin’s exit status is 0, signalling success to the calling process.

Tutorial #2: check_load

In this tutorial, we will discuss important basic features that are present in nearly every check. These include command line processing, metric evaluation with scalar contexts, status line formatting and logging.

The check_load plugin resembles the one found in the standard monitoring plugins collection. It allows to check the system load average against thresholds.

Data acquisition

First, we will subclass Resource to generate metrics for the 1, 5, and 15 minute load averages.



class Load(mplugin.Resource):
    """Domain model: system load.

    Determines the system load parameters and (optionally) cpu count.
    The `probe` method returns the three standard load average numbers.
    If `percpu` is true, the load average will be normalized.

    This check requires Linux-style /proc files to be present.
    """

    percpu: bool

    def __init__(self, percpu: bool = False) -> None:
        self.percpu = percpu

    @staticmethod
    def cpus() -> int:
        _log.info('counting cpus with "nproc"')
        cpus = int(subprocess.check_output(["nproc"]))
        _log.debug("found %i cpus in total", cpus)
        return cpus

    def probe(self):
        _log.info("reading load from /proc/loadavg")
        with open("/proc/loadavg") as loadavg:
            load = loadavg.readline().split()[0:3]
        _log.debug("raw load is %s", load)
        cpus = self.cpus() if self.percpu else 1
        load = [float(load_value) / cpus for load_value in load]
        for i, period in enumerate([1, 5, 15]):
            yield mplugin.Metric("load%d" % period, load[i], min=0, context="load")


check_load has two modes of operation: the load averages may either be takes as read from the kernel or normalized by cpu. Accordingly, the Load() constructor has a parameter two switch normalization on.

In the Load.probe() method the check reads the load averages from the /proc filesystem and extracts the interesting values. For each value, a Metric object is returned. Each metric has a generated name (“load1”, “load5”, “load15”) and a value. We don’t declare a unit of measure since load averages come without unit. All metrics will share the same context “load” which means that the thresholds for all three values will be the same.

Note

Deriving the number of CPUs from /proc is a little bit messy and deserves an extra method. Resource classes may encapsulate arbitrary complex measurement logic as long they define a Resource.probe() method that returns a list of metrics. In the code example shown above, we sprinkle some logging statements which show effects when the check is called with an increased logging level (discussed below).

Evaluation

The check_load plugin should accept warning and critical ranges and determine if any load value is outside these ranges. Since this kind of logic is pretty standard for most of all Nagios/Icinga plugins, mplugin provides a generalized context class for it. It is the ScalarContext class which accepts a warning and a critical range as well as a template to present metric values in a human-readable way.

When ScalarContext is sufficient, it may be configured during instantiation right in the main function. A first version of the main function looks like this:

def main():
    argp = argparse.ArgumentParser(description=__doc__)
    argp.add_argument('-w', '--warning', metavar='RANGE', default='',
                      help='return warning if load is outside RANGE')
    argp.add_argument('-c', '--critical', metavar='RANGE', default='',
                      help='return critical if load is outside RANGE')
    argp.add_argument('-r', '--percpu', action='store_true', default=False)
    args = argp.parse_args()
    check = mplugin.Check(
        Load(args.percpu),
        mplugin.ScalarContext('load', args.warning, args.critical))
    check.main()

Note that the context name “load” is referenced by all three metrics returned by the Load.probe method.

This version of check_load is already functional:

 1$ ./check_load.py
 2LOAD OK - load1 is 0.11
 3| load15=0.21;;;0 load1=0.11;;;0 load5=0.18;;;0
 4
 5$ ./check_load.py -c 0.1:0.2
 6LOAD CRITICAL - load15 is 0.22 (outside 0.1:0.2)
 7| load15=0.22;;0.1:0.2;0 load1=0.11;;0.1:0.2;0 load5=0.2;;0.1:0.2;0
 8# exit status 2
 9
10$ ./check_load.py -c 0.1:0.2 -r
11LOAD OK - load1 is 0.105
12| load15=0.11;;0.1:0.2;0 load1=0.105;;0.1:0.2;0 load5=0.1;;0.1:0.2;0

In the first invocation (lines 1–3), check_load reports only the first load value which looks bit arbitrary. In the second invocation (lines 5–8), we set a critical threshold. The range specification is parsed automatically according to the Monitoring plugin API and the first metric that lies outside is reported. In the third invocation (lines 10–12), we request normalization and all values fit in the range this time.

Result presentation

Although we now have a running check, the output is not as informative as it could be. The first line of output (status line) is very important since the information presented therein should give the admin a clue what is going on. We want the first line to display:

  • a load overview when there is nothing wrong

  • which load value violates a threshold, if applicable

  • which threshold is being violated, if applicable.

The last two points are already covered by the Result default implementation, but we need to tweak the summary to display a load overview as stated in the first point:



class LoadSummary(mplugin.Summary):
    """Status line conveying load information.

    We specialize the `ok` method to present all three figures in one
    handy tagline. In case of problems, the single-load texts from the
    contexts work well.
    """

    def __init__(self, percpu: bool) -> None:
        self.percpu = percpu

    def ok(self, results: mplugin.Results) -> str:
        qualifier = "per cpu " if self.percpu else ""
        return "loadavg %sis %s" % (
            qualifier,
            ", ".join(str(results[r].metric) for r in ["load1", "load5", "load15"]),
        )


The Summary class has three methods which can be specialized: ok() to return a status line when there are no problems, problem() to return a status line when the overall check status indicates problems, and verbose() to generate additional output. All three methods get a set of Result objects passed in. In our code, the ok method queries uses the original metrics referenced by the result objects to build an overview like “loadavg is 0.19, 0.16, 0.14”.

Check setup

The last step in this tutorial is to put the pieces together:



@mplugin.guarded
def main():
    argp = argparse.ArgumentParser(description=__doc__)
    argp.add_argument(
        "-w",
        "--warning",
        metavar="RANGE",
        default="",
        help="return warning if load is outside RANGE",
    )
    argp.add_argument(
        "-c",
        "--critical",
        metavar="RANGE",
        default="",
        help="return critical if load is outside RANGE",
    )
    argp.add_argument("-r", "--percpu", action="store_true", default=False)
    argp.add_argument(
        "-v",
        "--verbose",
        action="count",
        default=0,
        help="increase output verbosity (use up to 3 times)",
    )
    args = argp.parse_args()
    check = mplugin.Check(
        Load(args.percpu),
        mplugin.ScalarContext("load", args.warning, args.critical),
        LoadSummary(args.percpu),
    )
    check.main(verbose=args.verbose)


if __name__ == "__main__":
    main()

In the main() function we parse the command line parameters using the standard argparse.ArgumentParser class. Watch the Check object creation: its constructor can be fed with a variable number of Resource, Context, and Summary objects. In this tutorial, instances of our specialized Load and LoadSummary classes go in.

We did not specialize a Context class to evaluate the load metrics. Instead, we use the supplied ScalarContext which compares a scalar value against two ranges according to the range syntax defined by the monitoring plugin API. The default ScalarContext implementation covers the majority of evaluation needs. Checks using non-scalar metrics or requiring special logic should subclass Context to fit their needs.

The check’s main() method runs the check, prints the check’s output including summary, log messages and performance data to stdout and exits the plugin with the appropriate exit code.

Note the guarded() decorator in front of the main function. It helps the code part outside Check to behave: in case of uncaught exceptions, it ensures that the exit code is 3 (unknown) and that the exception string is properly formatted. Additionally, logging is set up at an early stage so that even messages logged from constructors are captured and printed at the right place in the output (between status line and performance data).

Tutorial #3: check_users

In the third tutorial, we will learn how to process multiple metrics. Additionally, we will see how to use logging and verbosity levels.

Multiple metrics

A plugin can perform several measurements at once. This is often necessary to perform more complex state evaluations or improve latency. Consider a check that determines both the number of total logged in users and the number of unique logged in users.

A Resource implementation could look like this:

class Users(mplugin.Resource):

    def __init__(self):
        self.users = []
        self.unique_users = set()

    def list_users(self):
        """Return logged in users as list of user names."""
        [...]
        return users

    def probe(self):
        """Return both total and unique user count."""
        self.users = self.list_users()
        self.unique_users = set(self.users)
        return [mplugin.Metric('total', len(self.users), min=0,
                                    context='users'),
                mplugin.Metric('unique', len(self.unique_users), min=0,
                                    context='users')]

The probe() method returns a list containing two metric objects. Alternatively, the probe() method can act as generator and yield metrics:

def probe(self):
    """Return both total and unique user count."""
    self.users = self.list_users()
    self.unique_users = set(self.users)
    yield mplugin.Metric('total', len(self.users), min=0,
                              context='users')
    yield mplugin.Metric('unique', len(self.unique_users), min=0,
                              context='users')]

This may be more comfortable than constructing a list of metrics first and returning them all at once.

To assign a Context to a Metric, pass the context’s name in the metric’s context parameter. Both metrics use the same context “users”. This way, the main function must define only one context that applies the same thresholds to both metrics:

@mplugin.guarded
def main():
    argp = argparse.ArgumentParser()
    [...]
    args = argp.parse_args()
    check = mplugin.Check(
        Users(),
        mplugin.ScalarContext('users', args.warning, args.critical,
                                   fmt_metric='{value} users logged in'))
    check.main()

Multiple contexts

The above example defines only one context for all metrics. This may not be practical. Each metric should get its own context now. By default, a metric is matched by a context of the same name. So we just leave out the context parameters:

def probe(self):
    [...]
    return [mplugin.Metric('total', len(self.users), min=0),
            mplugin.Metric('unique', len(self.unique_users), min=0)]

We then define two contexts (one for each metric) in the main() function:

@mplugin.guarded
def main():
    [...]
    args = argp.parse_args()
    check = mplugin.Check(
        Users(),
        mplugin.ScalarContext('total', args.warning, args.critical,
                                   fmt_metric='{value} users logged in'),
        mplugin.ScalarContext(
            'unique', args.warning_unique, args.critical_unique,
            fmt_metric='{value} unique users logged in'))
    check.main(args.verbose, args.timeout)

Alternatively, we can require every context that fits in metric definitions.

Logging and verbosity levels

mplugin integrates with the logging module from Python’s standard library. If the main function is decorated with guarded() (which is heavily recommended), the logging module gets automatically configured before the execution of the main() function starts. Messages logged to the mplugin logger (or any sublogger) are processed with mplugin’s integrated logging.

Consider the following example check:

import argparse
import mplugin
import logging

_log = logging.getLogger('mplugin')


class Logging(mplugin.Resource):

    def probe(self):
        _log.warning('warning message')
        _log.info('info message')
        _log.debug('debug message')
        return [mplugin.Metric('zero', 0, context='default')]


@mplugin.guarded
def main():
    argp = argparse.ArgumentParser()
    argp.add_argument('-v', '--verbose', action='count', default=0)
    args = argp.parse_args()
    check = mplugin.Check(Logging())
    check.main(args.verbose)

if __name__ == '__main__':
    main()

The verbosity level is set in the check.main() invocation depending on the number of -v flags.

$ check_verbose.py
LOGGING OK - zero is 0 | zero=0
warning message (check_verbose.py:11)
$ check_verbose.py -v
LOGGING OK - zero is 0
warning message (check_verbose.py:11)
| zero=0
$ check_verbose.py -vv
LOGGING OK - zero is 0
warning message (check_verbose.py:11)
info message (check_verbose.py:12)
| zero=0
$ check_verbose.py -vvv
LOGGING OK - zero is 0
warning message (check_verbose.py:11)
info message (check_verbose.py:12)
debug message (check_verbose.py:13)
| zero=0

When called with verbose=0, both the summary and the performance data are printed on one line and the warning message is displayed. Messages logged with warning or error level are always printed. Setting verbose to 1 does not change the logging level but enable multi-line output. Additionally, full tracebacks would be printed in the case of an uncaught exception. Verbosity levels of 2 and 3 enable logging with info or debug levels.

This behaviour conforms to the “Verbose output” suggestions found in the Monitoring plug-in development guidelines.

The initial verbosity level is 1 (multi-line output). This means that tracebacks are printed for uncaught exceptions raised in the initialization phase (before Check.main() is called). This is generally a good thing. To suppress tracebacks during initialization, call guarded() with an optional verbose parameter. Example:

verbosity level

args

output

0

one line, error, warning

1

-v

multi-line, error, warning

2

-vv

info

3

-vvv

debug

@mplugin.guarded(verbose=0)
def main():
   [...]

Note

The initial verbosity level takes effect only until Check.main() is called with a different verbosity level.

It is advisable to sprinkle logging statements in the plugin code, especially into the resource model classes. A logging example for a users check could look like this:

class Users(mplugin.Resource):

    [...]

    def list_users(self):
        """Return list of logged in users."""
        _log.info('querying users with "%s" command', self.who_cmd)
        users = []
        try:
            for line in subprocess.check_output([self.who_cmd]).splitlines():
                _log.debug('who output: %s', line.strip())
                users.append(line.split()[0].decode())
        except OSError:
            raise mplugin.CheckError(
                'cannot determine number of users ({} failed)'.format(
                    self.who_cmd))
        _log.debug('found users: %r', users)
        return users

Interesting items to log are: the command which is invoked to query the information from the system, or the raw result to verify that parsing works correctly.

Plugin Debugging

Debugging plugins can sometimes be complicated since there are so many classes, which are tied together in an implicit way. I have collected some frequent questions about debugging.

An uncaught exception makes the plugin return UNKNOWN. Where is the cause?

When your plugin raises an exception, you may get very little output. Example:

$ check_users.py
USERS UNKNOWN: RuntimeError: error

Set the verbose parameter of main() to some value greater than zero and you will get the full traceback:

$ check_users.py -v
USERS UNKNOWN: RuntimeError: error
Traceback (most recent call last):
  File "mplugin/runtime.py", line 38, in wrapper
    return func(*args, **kwds)
  File "mplugin/examples/check_users.py", line 104, in main
    check.main(args.verbose, args.timeout)
  File "mplugin/check.py", line 110, in main
    runtime.execute(self, verbose, timeout)
  File "mplugin/runtime.py", line 118, in execute
    with_timeout(self.timeout, self.run, check)
  File "mplugin/platform/posix.py", line 19, in with_timeout
    func(*args, **kwargs)
  File "mplugin/runtime.py", line 107, in run
    check()
  File "mplugin/check.py", line 95, in __call__
    self._evaluate_resource(resource)
  File "mplugin/check.py", line 73, in _evaluate_resource
    metrics = resource.probe()
  File "mplugin/examples/check_users.py", line 57, in probe
    self.users = self.list_users()
  File "mplugin/examples/check_users.py", line 34, in list_users
    raise RuntimeError('error')
RuntimeError: error

A Check constructor dies with “cannot add type <…>”

When you see the following exception raised from Check() (or Check.add()):

UNKNOWN: TypeError: ("cannot add type <class '__main__.Users'> to check", <__main__.Users object at 0x7f0c64f73f90>)

chances are high that you are trying to add an object that is not an instance from Resource, Context, Summary, or Results or its subclasses. A common error is to base a resource class on object instead of Resource.

I’m trying to use pdb but I get a timeout after 10s

When using an interactive debugger like pdb on plugins, you may experience that your debugging session is aborted with a timeout after 10 seconds. Just set the timeout parameter in main() to 0 to avoid this.