7.3.4 Profiling tools

Understanding Profiling in Linux Performance Tuning

Profiling tools help you see what your system and applications are actually doing, rather than guessing. In performance tuning, they answer questions such as which function uses the most CPU time, which process creates I/O bottlenecks, or why an application feels slow despite low average load.

Profiling is different from simple monitoring. Monitoring shows you high level metrics such as CPU usage or disk throughput. Profiling dives into detail inside a process or subsystem so that you can locate the exact hot spots that deserve optimization.

Profiling must always be guided by a clear question: “What exactly am I trying to understand or improve?” Running a profiler without a goal usually produces a lot of data and very little insight.

In this chapter the focus stays on practical use of common Linux profiling tools, not on general performance theory or on low level internals that are discussed elsewhere.

System Wide Profiling with `perf`

perf is part of the Linux kernel tools and provides detailed low overhead profiling of CPU, hardware events, and kernel activity. It can measure cache misses, branch mispredictions, context switches, and function level CPU usage, both in user space and in the kernel.

Before using perf, you usually install it through your distribution’s package manager. The exact package name can differ, but it commonly includes the word perf or linux-tools.
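
For example, Debian and Ubuntu based systems usually ship it in the linux-tools packages, while Fedora and RHEL provide a package simply called perf. The package names below are typical examples and may differ on your distribution:

sudo apt install linux-tools-common linux-tools-$(uname -r)
sudo dnf install perf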

To record CPU profiling for a specific program, you can run:

perf record -g ./your_program

This starts the program under the profiler and collects samples. The -g option tells perf to record call graphs, so you can see who called what.

After the run finishes, perf creates a file named perf.data in the current directory. You analyze it with:

perf report

perf report opens a text interface. You see a list of functions with percentages that show how much sampled CPU time each symbol consumed. If you enabled call graphs, you can view the call stacks for each hot function. This helps you decide whether to optimize a specific function, change an algorithm, or move expensive work off a hot path.
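
If you want a plain text report that you can save or share instead of the interactive interface, perf report can print to standard output:

perf report --stdio > report.txt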

You can also profile a process that is already running. For example, if a server with PID 1234 is consuming a lot of CPU, you can use:

perf record -g -p 1234 -- sleep 10

This attaches to the process for 10 seconds and records samples of its activity.

perf is also useful for kernel related performance issues because it can show time spent inside system calls and kernel functions. When you see that a user space function looks slow, but the samples mostly land in kernel symbols, you know the bottleneck might be I/O, locking, or some kernel feature rather than application logic.

perf often requires elevated privileges or the adjustment of kernel settings such as kernel.perf_event_paranoid. If you cannot see user space or kernel symbols, check your system security settings, install debug symbol packages for your binaries, and verify that symbol tables are not stripped.
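
A typical check and adjustment looks like the following; the appropriate value depends on your security policy, and lowering it system wide should be a deliberate decision:

sysctl kernel.perf_event_paranoid
sudo sysctl -w kernel.perf_event_paranoid=1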

Sampling vs Tracing and the Role of Profilers

Many profiling tools rely on sampling. They periodically interrupt execution and record where the CPU is and which stack is active. Over time this approximates where the program spends most of its time. Sampling has low overhead, so it is suitable for production systems.

Tracing tools collect every event of a specific type, such as every file open, every context switch, or every specific function entry and exit. Tracing provides precise data, but it can create high overhead or very large traces.

Profiling tools sometimes combine these approaches. They might use sampling for CPU usage and targeted tracing for particular functions. When you choose between tools, you usually trade completeness against overhead and simplicity. For most tuning tasks you start with lightweight sampling and move to tracing only if needed.
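
With perf you control the sampling overhead directly through the sampling frequency. A moderate rate such as 99 samples per second is a common starting point for longer captures; the PID and duration below are only examples:

perf record -F 99 -g -p 1234 -- sleep 30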

Using `perf` for Hardware and Kernel Events

Beyond basic CPU profiling, perf can record hardware events that indicate specific bottlenecks. For example, to estimate cache efficiency for a process you can run:

perf stat -e cache-misses,cache-references -p 1234 -- sleep 5

perf stat counts these events for the duration of the measurement rather than generating detailed call graphs. This is useful when you only need a few key metrics such as instructions per cycle or branch mispredictions.
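
Run without an explicit event list, perf stat prints a default set of counters for a whole program run, including instructions per cycle, which is often enough for a first impression:

perf stat ./your_program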

You can inspect system calls of a process with:

perf trace -p 1234

This produces a stream of system call information that looks similar to classic tracing tools but uses the perf infrastructure. It is handy when you want to see whether an application is making too many small I/O requests or is blocked in certain calls.
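
Instead of the full event stream, you can ask perf trace for a summary that counts system calls and their accumulated time, which is often easier to read at a first glance:

perf trace -s ./your_program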

`strace` and `ltrace` for System Calls and Libraries

strace and ltrace help you see how a program interacts with the operating system and with shared libraries.

strace traces system calls. It is very practical when a program seems slow because it waits on I/O, sleep calls, or blocking operations. To run a program under strace you can use:

strace -tt -T ./your_program

The -tt option prints timestamps with microsecond precision and -T adds the time spent in each system call. This shows which calls block for a long time, for example a read from a slow disk or a connect to a remote host with high latency.

You can also attach to a running process:

strace -p 1234 -tt -T

This is useful when a long running service appears hung or slow, because you see which system call it is currently executing.
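
When the full stream is too noisy, the -c option produces a summary table of system call counts and accumulated time instead of printing every call:

strace -c ./your_program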

ltrace focuses on calls to shared library functions instead of system calls. It shows which functions a process calls from dynamic libraries such as libc or other shared objects. In performance work you may use it to see frequent calls to unexpected library functions, such as repeated string formatting or unnecessary conversions.
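
ltrace supports a similar summary mode, which is a quick way to spot library functions that are called far more often than expected:

ltrace -c ./your_program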

Tracing every system call or library function with strace or ltrace has significant overhead. Use these tools for short focused investigations and avoid leaving them attached to busy production services for long durations.

Call Graph Profiling with `gprof`

gprof is a classic profiler that uses instrumentation inserted by the compiler. The program must be compiled with profiling support, usually by adding -pg to your compile and link commands, for example:

gcc -pg -O2 main.c -o myprog

After you run the instrumented program, it writes a gmon.out file. You then generate a report with:

gprof ./myprog gmon.out > profile.txt

The report contains a flat profile that lists functions and their total self time, plus a call graph that shows how much time flows through each function and its callers.

Because gprof relies on instrumentation, it introduces some overhead and may alter timing slightly. It is most useful in controlled test runs where you want detailed per function statistics instead of sampling approximations. It does not require kernel features, so it is often easier to set up on systems where perf is restricted.

Heap and Memory Profiling with Valgrind and `massif`

Valgrind is a framework that runs programs on a synthetic CPU and can perform many analyses, including memory profiling. It is much slower than native execution, but gives very precise information.

The massif tool inside Valgrind profiles heap usage over time. It shows how much dynamic memory a program uses and which allocations contribute most. To run a program under massif, use:

valgrind --tool=massif ./your_program

This creates an output file such as massif.out.12345. You view it with:

ms_print massif.out.12345

The report includes a graph of heap size over the lifetime of the program and lists the call stacks that allocate the most memory. This is very useful when performance issues come from excessive memory allocations, poor caching strategies, or memory growth that leads to swapping.

Valgrind can also detect memory errors, but that role belongs to debugging rather than pure performance tuning. For tuning, you mostly care about how memory allocation patterns affect speed and resource usage.

Valgrind slows down programs by a large factor, sometimes by 10 times or more. Use it on test workloads, not on production services, and keep runs focused on the code paths you truly need to analyze.

Heap Profiling with `heaptrack`

heaptrack is a dedicated heap profiler that records every allocation and deallocation together with stack traces, and then aggregates this data. It typically has less overhead than Valgrind and can be more suitable for heavier workloads.

To profile a program, you can run:

heaptrack ./your_program

When it finishes, heaptrack writes a .gz data file. You analyze it with the graphical heaptrack_gui or through the command line summary tool. The results show which code paths allocate the most memory, where memory stays allocated for a long time, and where you might introduce pooling or reuse to reduce overhead.
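
A minimal command line analysis might look like the following; the exact file name includes the program name and PID, and the compression suffix can differ between heaptrack versions:

heaptrack_print heaptrack.your_program.12345.gz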

Unlike simple memory snapshots, heap profiling reveals allocation dynamics. This often exposes patterns such as repeated small allocations in tight loops or growth that depends on input size in a non obvious way.

Low Overhead Sampling with `perf` and Flame Graphs

Many modern performance investigations use flame graphs to visualize hot call stacks. Flame graphs are not a tool by themselves but a way to display profiling data collected from tools such as perf.

In a flame graph, each box represents a function, and the horizontal width of the box represents the amount of time attributed to that function. The vertical axis shows the call stack. The widest parts reveal where most time is spent.

To prepare data for a flame graph you usually record stack samples in a plain text format, for example with perf script. Tooling outside the scope of this chapter then converts that data into an interactive HTML flame graph.
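
After a perf record run such as the one shown earlier, perf script dumps the collected stack samples as plain text, which is the input that flame graph tooling consumes:

perf script > stacks.txt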

The important point for this course is that many Linux profiling tools, particularly perf, can produce the raw call stack samples needed for such visualizations. Once you have a good capture, a flame graph often makes it much easier to explain performance problems to others because the hot spots are visually obvious.

Sampling Profilers for Applications: `gperftools` and Others

Some applications integrate sampling profilers directly into their code through libraries such as gperftools (Google Performance Tools). These profilers perform periodic sampling of the instruction pointer and build call graphs that you can analyze later.

To use gperftools, you compile and link your program with the profiler library and enable it at run time, often through an environment variable. The program produces a profile file that you then inspect with provided visualization tools.
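
As a rough sketch for a C program, this typically means linking against the profiler library, selecting an output file through the CPUPROFILE environment variable, and inspecting the result with pprof (installed as google-pprof on some distributions). The file names here are only placeholders:

gcc -O2 main.c -o myprog -lprofiler
CPUPROFILE=/tmp/myprog.prof ./myprog
pprof --text ./myprog /tmp/myprog.prof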

This approach is suitable for complex applications that need repeatable and controlled profiling runs, for example servers that must be profiled in a staging environment with realistic traffic. Because the profiler is in process, you can sometimes select which threads or subsystems to profile and which to ignore.

The specifics of building with these libraries depend on the language and build system you use, so they are beyond the focus of this introductory course. What matters here is that Linux offers both external profilers and internal library based profilers, and for serious tuning work you might need both.

I/O and Block Level Profiling with `iostat`, `blktrace`, and Friends

Profiling is not limited to CPU and memory. When performance problems relate to the storage subsystem, you use tools that focus on I/O behavior.

For quick analysis, iostat shows per device usage, request sizes, and wait times. It can tell you whether disks are saturated or idle. This counts as monitoring, but it is often the first gateway into deeper profiling.
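
A common invocation prints extended per device statistics once per second, which makes saturation and wait times easy to watch while a workload runs:

iostat -x 1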

For more detailed disk level profiling, blktrace can record every I/O event at the block layer. It writes traces that you then analyze with tools such as blkparse or higher level visualizers. You can see request queues, merges, and the exact pattern of reads and writes.
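
A focused capture might trace a single device for a limited time and then turn the binary trace into readable output; the device name and duration below are examples only:

sudo blktrace -d /dev/sda -w 30 -o sda_trace
blkparse -i sda_trace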

At the file system level, tools that trace system calls, such as perf trace or strace, show which processes generate the I/O. Combined with block level traces, this helps connect high level operations to low level consequences.

Block level tracing can create a lot of data very quickly. Restrict the traced devices and time period as much as possible, and never leave blktrace running on a busy production server without a clear plan for handling the data.

Application Specific Profilers

Many runtimes and ecosystems on Linux provide their own profilers that integrate with debuggers, IDEs, or performance dashboards. For example, Java has profilers that understand the JVM and garbage collector behavior, such as Java Flight Recorder. Modern languages like Go, Rust, and Python also have their own tooling to profile CPU usage, memory allocation, and goroutines or threads.

In the context of Linux performance tuning, these application specific profilers complement system tools. System wide tools reveal which process misbehaves and whether the bottleneck is CPU, memory, I/O, or networking. Language specific profilers then help you dig into the internals of that process with full awareness of the runtime.

When you choose tools, consider both layers. For instance, if perf shows that a single Java process uses most CPU time, you might switch to a Java profiler to see which methods and which allocation patterns are responsible.

Choosing the Right Profiling Tool

Choosing a profiler is mostly about the question you need to answer and the cost you can afford in terms of overhead and complexity.

If you only know that the machine is slow, you usually first look at general monitoring, then use perf stat or perf top for a quick sense of CPU activity. Once you know which process or workload is responsible, you move to more detailed profiling.
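
perf top gives a live, continuously updated view of the hottest functions across the system, optionally with call graphs:

perf top -g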

If CPU usage is high and you need function level detail, perf record with call graphs is a strong default choice. If the problem seems to involve blocking or waits, strace or perf trace can reveal where the program spends time in system calls.

When memory growth or high allocation rates are suspected, use Valgrind massif or heaptrack in a controlled environment to examine heap usage patterns.

At the I/O and storage layers, start with aggregate tools such as iostat, then if needed use blktrace for precise request tracing.

In application environments with rich runtime support, you may rely on language specific profilers once system tools have identified the process and the broad nature of the bottleneck.

Always profile with a workload that is representative of real use, and avoid drawing conclusions from tiny artificial tests. A profiler can only reveal behavior that actually occurs during the profiling run.

Integrating Profiling into Performance Tuning

Profiling is not a one time activity. In serious tuning work, you alternate between measurement, hypothesis, change, and verification. Profiling tools provide the measurement and verification steps. You use them to locate hot spots, implement targeted changes, and then run the profiler again to confirm that you reduced the right kind of work.

By learning how to apply perf, tracing tools, and memory and I/O profilers in a focused and methodical way, you gain the practical skill needed to turn performance problems from vague complaints into specific, solvable issues.
