
3.5.5 Monitoring running services

Understanding Service Monitoring

Monitoring running services means regularly checking that the important software components on your system are active, healthy, and behaving as expected. At the intermediate level you no longer only ask whether a process exists. You also want to know if it started correctly, keeps running, responds to requests, and logs meaningful information when something goes wrong.

Services on modern Linux systems are usually managed by systemd. Monitoring them uses systemd tools together with process and log inspection commands you have already seen in other chapters. In this chapter the focus stays on what is specific to keeping services under observation in day-to-day administration.

Effective service monitoring always answers three questions:

  1. Is the service running?
  2. Is the service healthy and doing useful work?
  3. How quickly will I notice when the answers change?

Checking if a service is running

The first step is to verify whether a service is active from systemd's perspective. You do this with systemctl. This goes beyond just checking for a process ID: systemd tracks the state of the whole unit, its exit codes, its restart attempts, and more.

You can query the state of a single service with:

systemctl status sshd.service

The output shows the unit name, whether it is loaded and active, the last start time, and the main process ID. It also includes recent log lines from the service. For many services the unit name ends in .service and matches the name of the daemon, for example sshd, nginx, or postgresql.

For a quick summary of many services you can use:

systemctl --type=service --state=running

This prints all units of type service that are currently in the running state. You can adjust the state filter to include failed or exited units as needed.
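
Because --state accepts a comma-separated list, you can combine several states in one call. For example, to review services that have exited or failed:

systemctl --type=service --state=failed,exited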

Remember that systemd considers some short-lived services healthy even when they are not running all the time. For example, a one-shot job that applies configuration and exits may show as active (exited), and that is normal. For continuously running services, what you want to see is active (running), and what you want to avoid is the failed state.

Detecting failed or crashed services

Monitoring also means spotting services that have stopped unexpectedly. systemd keeps track of failed units and can show them:

systemctl --failed

This lists any service or other unit that ended in a failed state. It is common to run this manually when you suspect problems, but you can also script it or integrate it into regular checks.
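
A minimal sketch of such a check, using a log file path that is only an example, could look like this:

#!/bin/sh
# Count failed units; --plain and --no-legend strip formatting and headers.
FAILED=$(systemctl --failed --plain --no-legend | wc -l)
if [ "$FAILED" -gt 0 ]; then
    echo "$(date): $FAILED failed unit(s) detected" >> /var/log/service-watch.log
fi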

For a specific service that you know is in trouble, use:

systemctl status nginx.service

This provides more detail about why it failed. Look for lines that mention Result= or show an exit code. You will frequently see hints like code=exited, status=1/FAILURE that indicate a startup error.

When services crash repeatedly, systemd can place them into a start-limit-hit condition. In this state systemd refuses to start the service again for a period of time. You will see this in the status output, and it is a clear sign that further investigation into logs and configuration is needed.

If a service appears in systemctl --failed, do not just restart it repeatedly. Always examine its status and logs to understand the cause of the failure before relying on it again.
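
Once the underlying cause has been fixed, the failed state and the start limit counter can be cleared so the service can be started again. A typical sequence, shown here for nginx as an example, is:

systemctl reset-failed nginx.service
systemctl start nginx.service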

Watching services over time

One-time checks are useful, but monitoring becomes much stronger when you can watch services over time. Even without a full monitoring system you can keep an eye on services with periodic checks and log streaming.

A simple way to spot state changes is to run:

systemctl list-units --type=service

You can run this regularly, for example after configuration changes or package upgrades. The output will show services that are no longer active or that have transitioned to failed.
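
If you want to keep such a check visible while you work, you can wrap it in watch, which reruns the command at a fixed interval, for example every ten seconds:

watch -n 10 'systemctl list-units --type=service --state=failed'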

For continuous observation of the logs from a given service, journalctl is very helpful:

journalctl -u sshd.service -f

The -f option follows new log entries as they appear, like tail -f for traditional log files. When you restart a service or when it fails, you will see the messages in real time. This is very useful when you are trying to detect unstable behavior that only appears under certain conditions.

You can also look at the history to see when a service was last restarted:

journalctl -u sshd.service --since "1 day ago"

Recurrent restarts over a day or week can reveal hidden instability even if the service happens to be running at the moment you check.
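
A rough way to quantify this is to count how often systemd reported a start for the unit in a given period. The exact wording of the start message varies between services, so treat this as a sketch rather than a precise measurement:

journalctl -u sshd.service --since "1 week ago" | grep -c "Started"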

Monitoring resource use per service

Checking that a service runs is not enough if it consumes too much CPU, memory, or other resources. systemd tracks resource usage per unit, which lets you correlate behavior with specific services.

For a quick overview of services and their resource use, run:

systemd-cgtop

This shows control groups created by systemd and lists CPU and memory usage per group. Each long running service appears as its own entry. You can observe which service spikes during heavy load.

You can also inspect a single service with:

systemctl status apache2.service

Near the bottom of the output you may see lines mentioning CPU time and tasks. While this is not a complete performance monitoring tool, it gives you a first impression of how active a service has been.

For more detailed per-process monitoring you can combine information from ps, top, or htop with the process ID that systemctl status shows as Main PID. This allows you to follow the specific processes that belong to a service during troubleshooting.
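
A small sketch of this idea, assuming the unit has a main process, reads the PID from systemd and hands it to ps:

# Read the main PID of the unit and show its CPU, memory, and runtime.
MAINPID=$(systemctl show -p MainPID --value sshd.service)
ps -p "$MAINPID" -o pid,pcpu,pmem,etime,cmd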

If a service consistently uses unexpectedly high CPU or memory, consider it a monitoring alert even if the service is still running. Excessive resource use is an early sign of misconfiguration, bugs, or external problems like denial of service attacks.

Using restart policies as part of monitoring

systemd can automatically try to restart services when they exit. While this is configured in the unit file and belongs to service management, it has a direct impact on monitoring.

If a service has a restart policy such as Restart=on-failure, a crash will not always be visible as a long outage. Instead, you might see frequent restarts. From a monitoring perspective, frequent restarts are just as important as a complete stop.
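
You can check which restart policy a unit currently uses without opening its unit file, for example:

systemctl show -p Restart --value yourservice.service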

You can see the restart counter and recent failures with:

systemctl status yourservice.service

Look for the number of times the main process has been started and for messages about restart attempts. If a service is flapping, which means it starts and stops repeatedly, systemd records this pattern and may eventually refuse further restarts.

Monitoring scripts or external tools should check the status output for these patterns instead of only checking active (running).
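
On systemd versions that expose the NRestarts property, a script can read the restart counter directly. The threshold and log file below are only illustrative:

# Warn when systemd has restarted the unit several times.
RESTARTS=$(systemctl show -p NRestarts --value yourservice.service)
if [ "$RESTARTS" -gt 3 ]; then
    echo "$(date): yourservice.service restarted $RESTARTS times" >> /var/log/service-watch.log
fi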

Building simple service checks

Even without a full monitoring suite you can set up simple automatic checks that verify that critical services are running. For example, a basic script could test the status and send a notification or create a log entry when something is wrong.

An example concept for checking a single service is:

#!/bin/sh
# Unit to watch; adjust to the service you care about.
SERVICE="sshd.service"

# is-active exits with status 0 only when the unit is active;
# --quiet suppresses the state name on standard output.
if ! systemctl is-active --quiet "$SERVICE"; then
    echo "$(date): $SERVICE is not active" >> /var/log/service-watch.log
fi

The command systemctl is-active returns success for active services and failure for inactive or failed states. You can schedule such checks using timed mechanisms covered elsewhere, and you can extend them to test actual functionality, such as contacting a web server on its port instead of just checking that its unit is active.

Service monitoring that only checks for running processes is incomplete. Good checks also verify that the service responds correctly, for example by making a simple HTTP request to a web server or a connection test to a database.
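
As a sketch, a functional check for a local web server could combine the unit state with a real request. The URL, timeout, and log file here are placeholders:

#!/bin/sh
# First confirm the unit is active, then verify it actually answers HTTP requests.
if ! systemctl is-active --quiet nginx.service; then
    echo "$(date): nginx.service is not active" >> /var/log/service-watch.log
elif ! curl --silent --fail --max-time 5 http://localhost/ > /dev/null; then
    echo "$(date): nginx.service is active but not responding" >> /var/log/service-watch.log
fi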

Integrating logs into service monitoring

Logs are a central part of service monitoring. Whenever a service changes state, starts, stops, or encounters errors, it usually writes log messages. systemd collects these in the journal, which makes it easy to correlate events with service status.

To find all messages about a specific service, use:

journalctl -u nginx.service

You can narrow the time window to focus on recent issues:

journalctl -u nginx.service --since "10 minutes ago"

This is especially useful when you receive a report that a service misbehaved at a certain time. By matching log entries to reported symptoms and to unit state transitions, you build a complete picture of what occurred.

Over time, you will recognize recurring messages that precede failures, such as warnings about configuration, warnings about resource exhaustion, or connection errors. These patterns can later be turned into alert conditions in dedicated monitoring tools.
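
The journal's priority filter is useful for hunting down such messages. For example, to see only warnings and more severe entries for a unit over the last week:

journalctl -u nginx.service -p warning --since "1 week ago"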

Service monitoring in the bigger picture

Monitoring running services on a single machine is the foundation of larger monitoring systems. On one host you learn how to ask whether a given service is active, whether it logs errors, and how it uses resources. At scale, tools like Prometheus, Nagios, and other monitoring platforms automate those same questions across many systems.

On your local machine, you work directly with systemctl, journalctl, and process inspection tools. In a larger environment those tools usually run in the background, and their results are collected, visualized, and used to trigger alerts. Understanding manual service monitoring at the systemd level prepares you to configure and interpret those higher level tools correctly.

In everyday administration you will often move between these levels. You might receive a central alert that a service is unhealthy, then log in to the specific machine and use the commands from this chapter to diagnose the situation.
