Overview
Profiling tools help you understand where time and resources are actually spent in your system and applications. In contrast to the broader monitoring focus of the parent chapter, here the goal is fine‑grained measurement for diagnosis and optimization.
This chapter focuses on:
- Types of profiling (CPU, memory, I/O, latency, perf events)
- Key Linux profiling tools and how to use them
- Typical workflows: from “system is slow” to “this function / syscall / query is the culprit”
- Practical tips to avoid misleading results
You won’t see full performance theory here, just the tools and how to apply them.
Types of Profiling
Before picking a tool, clarify what you’re profiling:
- CPU profiling
- Where CPU time is spent (functions, code paths, syscalls)
- Sampled vs instrumentation (sampling is less intrusive, instrumentation more precise)
- Memory profiling
- Leaks, excessive allocation, fragmentation
- Per‑allocation backtraces, heap snapshots
- I/O profiling
- Disk latency, throughput, queue depth
- Which processes and files are responsible
- Lock / contention profiling
- Mutexes, futexes, kernel locks
- Where threads wait
- Latency / tracing
- Time between events (e.g., function entry/exit, syscalls, network recv/send)
- Helps with “slow sometimes” problems
Many tools cover several of these categories via different subcommands or options.
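As a concrete illustration of the sampling-vs-instrumentation trade-off, here is a toy instrumentation-style profiler in Python (the `busy` workload and the hook are invented for this example): the hook fires on every function call, so the counts are exact, but the hook itself adds overhead to each call — whereas a sampler like `perf` only interrupts the program at a fixed frequency.

```python
import sys
from collections import Counter

def make_call_counter():
    """Instrumentation-style profiling: hook every function call.

    Precise (every call is observed) but intrusive -- the hook adds
    overhead to each call, unlike a sampling profiler that only
    interrupts the program ~99 times per second.
    """
    counts = Counter()

    def hook(frame, event, arg):
        if event == "call":
            counts[frame.f_code.co_name] += 1

    return counts, hook

def busy(n):
    return sum(i * i for i in range(n))

counts, hook = make_call_counter()
sys.setprofile(hook)        # enable instrumentation
for _ in range(5):
    busy(100)
sys.setprofile(None)        # disable it again

print(counts["busy"])       # every one of the 5 calls was seen
```

A sampling profiler would instead record whichever function happened to be on-CPU at each tick, giving a statistical (but much cheaper) picture.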
`perf`: General‑Purpose Kernel and CPU Profiler
perf is a standard profiling and tracing tool integrated with the Linux kernel. It works with hardware performance counters (cycles, cache misses, branches, etc.) and kernel tracepoints.
Installation and Setup
On common distros:
- Debian: `sudo apt install linux-perf`
- Ubuntu: `sudo apt install linux-tools-common linux-tools-$(uname -r)`
- Fedora: `sudo dnf install perf`
- Arch: `sudo pacman -S perf`
You often need debug symbols for meaningful function names:
- Debian/Ubuntu: install `libc6-dbg` and `linux-image-$(uname -r)-dbgsym` (or the relevant `-dbg` / `-dbgsym` packages)
- Fedora/RHEL: `sudo dnf debuginfo-install kernel glibc`
- Arch: enable debug packages in `/etc/pacman.conf` and install `glibc-debug` / kernel debug packages if needed
CPU Sampling (`perf record` / `perf report`)
Profile a command:

```
sudo perf record -g -- ./your_program --arg1 --arg2
```

- `-g`: capture call graphs (backtraces)
- By default, samples the `cycles` event at a regular frequency

Profile an already running process (this samples the target PID for 30 seconds):

```
sudo perf record -g -p <PID> -- sleep 30
```

View the report:

```
perf report
```

- Shows functions sorted by percentage of samples (i.e., where CPU time is spent)
- Navigate with the arrow keys, expand call stacks, filter with `/`
Key options:
- `-F freq` – sampling frequency (e.g. `-F 99`)
- `-e event` – hardware/software events, e.g. `-e cycles`, `-e instructions`, `-e cache-misses`
- `--call-graph dwarf` – more accurate call graphs for some binaries (needs DWARF debug info)
System‑Wide Sampling
To capture everything on the system:
```
sudo perf record -g -a -- sleep 10
sudo perf report
```

- `-a` – profile all CPUs
- Filter afterward (by PID, comm, DSO) inside `perf report`
Flame Graphs from `perf`
perf integrates well with Flame Graphs, which make “hot paths” visually obvious.
Generate folded stacks:

```
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script > out.perf
```

Then use Brendan Gregg’s FlameGraph scripts (clone the repo):

```
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > perf.svg
```
Open `perf.svg` in a browser: the width of a frame corresponds to its share of samples (CPU time), while stack height shows call depth.
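The "folded" intermediate format produced by `stackcollapse-perf.pl` is simple enough to sketch: each sample's stack becomes one semicolon-joined line, root first, followed by a count. A loose Python sketch (the sample stacks here are invented, and real perf output needs more parsing):

```python
from collections import Counter

def collapse(samples):
    """Fold stack samples into flame-graph "folded" lines.

    `samples` is a list of stacks, each leaf-first (the order in
    which perf script prints frames); the output lines are
    "root;...;leaf count", the input format flamegraph.pl expects.
    """
    folded = Counter()
    for stack in samples:
        folded[";".join(reversed(stack))] += 1   # root-first path
    return sorted(f"{path} {n}" for path, n in folded.items())

samples = [
    ["do_read", "handle_request", "main"],   # leaf first, like perf script
    ["do_read", "handle_request", "main"],
    ["parse", "handle_request", "main"],
]
for line in collapse(samples):
    print(line)
# main;handle_request;do_read 2
# main;handle_request;parse 1
```

In the rendered SVG, each folded line becomes one tower of frames whose width is proportional to its count.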
perf Top‑Like View (`perf top`)
For live CPU usage by symbol:
```
sudo perf top
```

- Similar to `top`, but at the function level
- Use `p` to filter by PID, `d` for details
Other `perf` Subcommands
- `perf stat` – summary statistics for a command:

  ```
  perf stat -e cycles,instructions,cache-misses ./your_program
  ```

- `perf sched` – scheduler analysis (task switching, wakeup latencies)
- `perf record -e sched:sched_switch` – capture context switches
`ftrace` and `trace-cmd`: Kernel Function Tracing
ftrace is a low‑level kernel tracing framework; trace-cmd is a wrapper tool that makes it easier to use.
Basic `trace-cmd` Usage
Install:
- Debian/Ubuntu: `sudo apt install trace-cmd`
- Fedora: `sudo dnf install trace-cmd`
- Arch: `sudo pacman -S trace-cmd`
Trace sched and irq events for 10 seconds:
```
sudo trace-cmd record -e sched -e irq -- sleep 10
sudo trace-cmd report
```

Useful for:
- Scheduler issues (high wakeup latency)
- Interrupt storms
- Determining what the kernel is doing during “mysterious” stalls
Function Tracing with `ftrace` (sysfs Interface)
ftrace lives under `/sys/kernel/debug/tracing` (ensure debugfs is mounted):

```
sudo mount -t debugfs none /sys/kernel/debug
cd /sys/kernel/debug/tracing
```
Example: trace `vfs_read` only:

```
echo function | sudo tee current_tracer
echo vfs_read | sudo tee set_ftrace_filter
echo 1 | sudo tee tracing_on
sleep 5
echo 0 | sudo tee tracing_on
sudo cat trace
```

Use this carefully; it can generate large trace logs on busy systems.
eBPF and BCC / bpftrace
eBPF (extended BPF) allows powerful, low‑overhead dynamic tracing and profiling inside the kernel. You can attach probes to:
- Kernel functions
- User‑space functions (uprobes)
- Tracepoints
- Syscall entry/exit
- Network events
BCC Tools: Ready‑Made Profilers
Install BCC (name varies by distro):
- Debian/Ubuntu: `sudo apt install bpfcc-tools`
- Fedora: `sudo dnf install bcc-tools`
- Arch: `sudo pacman -S bcc`
Useful BCC scripts include:
- `profile` – CPU stack sampling (like `perf`, but via eBPF)
- `offcputime` – where threads are blocked (off-CPU)
- `runqlat` – run-queue latency
- `biolatency` – block I/O latency
- `tcpretrans` – TCP retransmits
- `execsnoop`, `opensnoop`, `filetop` – high-level activity tracing
Example: CPU profile (system-wide, 49 Hz, 10 seconds):

```
sudo profile -F 49 -d 10
```

Example: off-CPU time (which stacks account for blocked time):

```
sudo offcputime -d 10
```

(On Debian/Ubuntu, the BCC tools carry a `-bpfcc` suffix, e.g. `profile-bpfcc`.)

`bpftrace`: One‑Liners for Tracing
bpftrace offers an awk‑like scripting language:
Install:
- Debian/Ubuntu: `sudo apt install bpftrace`
- Fedora: `sudo dnf install bpftrace`
- Arch: `sudo pacman -S bpftrace`
Example: measure time spent in a user‑space function (uprobes):
```
sudo bpftrace -e '
uprobe:/usr/bin/myapp:myfunc {
  @start[tid] = nsecs;
}
uretprobe:/usr/bin/myapp:myfunc /@start[tid]/ {
  @time = hist((nsecs - @start[tid]) / 1000000);
  delete(@start[tid]);
}'
```
This builds a latency histogram (in ms) of myfunc.
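bpftrace's `hist()` aggregates values into power-of-two buckets in-kernel before printing its ASCII histogram. The Python sketch below mimics that bucketing so you can see what each printed row represents (real bpftrace also has a separate bucket for values below 1, which this sketch folds into the first bin):

```python
def log2_hist(values):
    """Bucket values into power-of-two bins, like bpftrace's hist().

    Returns {(low, high): count} where each bucket covers the
    half-open range [2^k, 2^(k+1)).
    """
    buckets = {}
    for v in values:
        k = max(v, 1).bit_length() - 1        # floor(log2(v)) for v >= 1
        low, high = 1 << k, 1 << (k + 1)
        buckets[(low, high)] = buckets.get((low, high), 0) + 1
    return buckets

latencies_ms = [1, 2, 3, 5, 9, 17, 130]       # invented sample latencies
h = log2_hist(latencies_ms)
print(h)
# {(1, 2): 1, (2, 4): 2, (4, 8): 1, (8, 16): 1, (16, 32): 1, (128, 256): 1}
```

Power-of-two buckets keep the in-kernel aggregation cheap and bound the output size regardless of how many events fire.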
eBPF tools are excellent for “black box” investigation when you can’t modify or rebuild code.
Memory Profiling Tools
`valgrind` / `massif` / `callgrind`
valgrind instruments programs for detailed memory behavior at the cost of heavy slowdown.
Install:
- Debian/Ubuntu: `sudo apt install valgrind`
- Fedora: `sudo dnf install valgrind`
- Arch: `sudo pacman -S valgrind`
Leak Checking (`memcheck`)
```
valgrind --leak-check=full --show-leak-kinds=all ./your_program
```

Provides detailed leak backtraces; useful in development, less so in production.
Heap Profiling (`massif`)
```
valgrind --tool=massif ./your_program
ms_print massif.out.<pid> | less
```

Shows heap usage over time and where peak usage occurs.
CPU Simulation (`callgrind`)
For function‑level CPU cost (without relying on hardware counters):
```
valgrind --tool=callgrind ./your_program
callgrind_annotate callgrind.out.<pid> | less
```
This is slower than perf but sometimes easier to interpret in development environments.
`heaptrack`
heaptrack records all allocations and provides GUI and CLI analysis.
Install (names vary):
- Debian/Ubuntu: `sudo apt install heaptrack`
- Fedora: `sudo dnf install heaptrack`
- Arch: `sudo pacman -S heaptrack`
Usage:
```
heaptrack ./your_program
heaptrack_print heaptrack.your_program.<pid>.zst | less
```
Or open the output file with heaptrack_gui for interactive analysis.
`jemalloc` / `tcmalloc` Profiling
Some alternative allocators (jemalloc, tcmalloc) expose built‑in profiling:
- Enable profiling via environment variables (e.g. `MALLOC_CONF=prof:true` for jemalloc)
- Use their profiling tools (`jeprof`, `pprof`) to visualize allocation hot-spots
These are advanced but extremely powerful in long‑running servers.
I/O and Block‑Level Profiling
`iostat`, `pidstat`, `iotop`
While not “profilers” in the strict sense, these tools are essential for correlating I/O with processes.
- `iostat -x 1` – extended disk stats (utilization, `await`; note that `svctm` is deprecated in recent sysstat releases)
- `pidstat -d 1` – per-process I/O stats
- `iotop` – top-like display for I/O
Install examples:
- Debian/Ubuntu: `sudo apt install sysstat iotop`
- Fedora: `sudo dnf install sysstat iotop`
Use these to validate: “Is the disk actually saturated?” before going deeper.
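Under the hood, `iostat` derives these numbers from deltas of the cumulative counters in `/proc/diskstats`. A simplified sketch of the arithmetic, with invented sample numbers (the field names are descriptive, not the raw kernel column names):

```python
def io_await_util(prev, curr, interval_s):
    """Derive iostat-style 'await' and '%util' from two samples of
    per-device I/O counters.

    Each sample is a dict of cumulative counters:
      ios      - completed reads + writes
      ticks_ms - total time spent on completed I/Os (ms)
      busy_ms  - time the device had I/O in flight (ms)
    """
    d_ios = curr["ios"] - prev["ios"]
    d_ticks = curr["ticks_ms"] - prev["ticks_ms"]
    d_busy = curr["busy_ms"] - prev["busy_ms"]
    await_ms = d_ticks / d_ios if d_ios else 0.0          # avg latency
    util_pct = 100.0 * d_busy / (interval_s * 1000.0)     # % of wall time busy
    return await_ms, util_pct

prev = {"ios": 1000, "ticks_ms": 4000, "busy_ms": 200}
curr = {"ios": 1100, "ticks_ms": 4800, "busy_ms": 700}
await_ms, util = io_await_util(prev, curr, 1.0)
print(await_ms, util)   # 8.0 ms average latency, 50% utilized
```

Knowing the derivation helps when the numbers look odd: e.g. `%util` near 100% on an SSD with deep queues does not necessarily mean the device is saturated.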
Block I/O Latency via BCC/eBPF
`biolatency` (BCC):

```
sudo biolatency 1
```

Prints histograms of block I/O latency (add `-D` for per-device histograms).
biosnoop shows individual requests with process names and latency.
These help differentiate:
- Storage hardware limits
- Filesystem behavior
- Application patterns (e.g., tiny random writes)
Application‑Level Profilers
Different language ecosystems have their own profilers; here we cover only how they fit into a Linux tuning workflow.
C/C++ with `gprof` and Perf‑Aware Compilers
`gprof` (Legacy but sometimes useful)
- Compile with `-pg`:

  ```
  gcc -pg -O2 -o myprog myprog.c
  ```

- Run `./myprog` (this writes `gmon.out` in the current directory)
- Analyze:

  ```
  gprof ./myprog gmon.out | less
  ```
gprof gives call graph and per‑function statistics but is less accurate than sampling tools in optimized builds.
Compiler Support for Perf
Modern compilers emit DWARF and frame information; for better perf results:
- Compile with `-fno-omit-frame-pointer` for easier stack unwinding
- Keep symbols: don't strip binaries with `-s`; consider `-g` for debug builds
Python Profiling
Built‑ins:
- `cProfile` – CPU profiling at the level of Python functions
- `profile` – similar, but pure Python (and slower)
Run via command:
```
python3 -m cProfile -o stats.out your_script.py
```

View stats:

```
python3 -m pstats stats.out
```
For more advanced use (sampling, remote): look at tools like py-spy, scalene, yappi.
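`cProfile` can also be driven programmatically, which is handy for profiling just one code path inside a larger program. A small self-contained example (the `hot` / `main` functions are placeholders for your own code):

```python
import cProfile
import pstats
import io

def hot(n):
    return sum(i * i for i in range(n))

def main():
    total = 0
    for _ in range(50):
        total += hot(1000)
    return total

# Profile main() and capture the stats in memory instead of a stats.out file
pr = cProfile.Profile()
pr.enable()
main()
pr.disable()

buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf).sort_stats("cumulative")
stats.print_stats()          # full listing, sorted by cumulative time
report = buf.getvalue()
print(report.splitlines()[0])
```

Sorting by `"cumulative"` surfaces the entry points responsible for the time; sort by `"tottime"` instead to find the functions doing the work themselves.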
Java, JVM Languages
Use JVM tools:
- `jcmd`, `jstack`, `jmap`, `jstat`
- `async-profiler` – low-overhead, perf-based; supports CPU, allocation, wall-clock, and lock profiling
Example (async-profiler):
```
./profiler.sh -d 30 -e cpu -f profile.svg <PID>
```
Open profile.svg for a flame graph.
Network and Latency Profiling
`perf` + Network Tracepoints
You can attach perf to networking tracepoints; e.g.:
```
sudo perf record -e net:net_dev_xmit -a -- sleep 10
sudo perf script
```

For more detailed packet-level analysis, use dedicated tracing (via eBPF tools, or tools in the Network Services / DevOps parts of the course).
eBPF Network Tools
BCC provides:
- `tcpretrans` – retransmission analysis
- `tcplife` – TCP connection lifetimes
- `tcpaccept` – passive (accepted) connections
- `tcpconnect` – connect attempts
Example:

```
sudo tcpretrans
```

Helps correlate packet loss, RTT, or congestion with observed application slowness.
GUI and Integrated Profiling Tools
While much of Linux performance work is CLI‑driven, certain use cases benefit from GUI tooling.
`sysprof` (GNOME)
Useful for GNOME apps and general system tracing:
- Can attach to processes or entire system
- Visual timelines for CPU, I/O, syscalls, and GNOME internals
Launch via:

```
sysprof
```

KDE / Qt: `kcachegrind`, `hotspot`

- `kcachegrind` visualizes `callgrind` output
- `hotspot` visualizes `perf` data interactively
Install:
- Debian/Ubuntu: `sudo apt install kcachegrind hotspot`
- Fedora: `sudo dnf install kcachegrind hotspot`
Use them to explore data from perf record or valgrind --tool=callgrind.
Typical Profiling Workflows
This section ties multiple tools together into practical sequences.
Workflow 1: High CPU Usage
- Confirm with `top` / `htop`.
- Take a system-wide CPU profile:

  ```
  sudo perf record -F 99 -a -g -- sleep 30
  sudo perf report
  ```

- If results are unclear, generate a flame graph as described earlier.
- If the workload is in Python/Java/etc., switch to a language-specific profiler to zoom in.
Workflow 2: System “Feels Slow” but CPU Not Maxed
- Check I/O:
  - `iostat -x 1` for disk
  - `pidstat -d 1` / `iotop` for per-process I/O
- If disk latency seems high:
  - `sudo biolatency 1`
  - `sudo biosnoop`
- If CPU is mostly idle and I/O is fine, check for scheduling / lock problems:
  - `sudo offcputime -d 10`
  - `sudo runqlat 1`
- If still unclear, capture a broader eBPF or ftrace trace around the problematic period.
Workflow 3: Memory Growth / Out‑of‑Memory
- Track memory per process (`smem`, `ps`, `top` / `htop`).
- For reproducible dev workloads, use `valgrind --leak-check=full` or `heaptrack`.
- For production:
  - Use allocator-specific profiles (jemalloc / tcmalloc) if available.
  - Use language-specific tools (e.g. Java heap dumps, Python `tracemalloc`).
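For Python services, `tracemalloc` from the standard library can attribute memory growth to source lines without any external tooling. A minimal sketch (the 1 MB list allocation simulates a leak):

```python
import tracemalloc

tracemalloc.start()

snap1 = tracemalloc.take_snapshot()
leaky = [bytes(1000) for _ in range(1000)]   # simulate ~1 MB of growth
snap2 = tracemalloc.take_snapshot()

# Diff the snapshots, grouped by source line, biggest growth first
top = snap2.compare_to(snap1, "lineno")
growth = sum(stat.size_diff for stat in top)
print(growth > 500_000)                      # the allocation above dominates
for stat in top[:3]:
    print(stat)
```

Taking periodic snapshots in a long-running process and diffing them is often enough to locate steady memory growth.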
Practical Tips and Pitfalls
- Profiling changes behavior: sampling is safer in production; instrumentation (`valgrind`, `-pg`) can alter timings and expose or hide races.
- Warm-up effects: include JIT warm-up, cache warming, and steady-state behavior in your measurement window.
- Reproducibility: profile under representative load; use same inputs and environment.
- Frequency vs overhead: higher sampling frequency gives more detail but more overhead and noise.
- Symbols matter: without debug symbols and frame pointers, profiles can be hard to interpret.
- One tool rarely answers everything: combine coarse (top/iostat) with fine‑grained (perf/eBPF/valgrind) tools.
Profiling is iterative: measure, hypothesize, change, re‑measure. The tools covered here are your primary instruments for that cycle on Linux.