Overview
Profiling tools help you understand where time and resources are actually spent in your system and applications. In contrast to the broader monitoring focus of the parent chapter, here the goal is fine‑grained measurement for diagnosis and optimization.
This chapter focuses on:
- Types of profiling (CPU, memory, I/O, latency, perf events)
- Key Linux profiling tools and how to use them
- Typical workflows: from “system is slow” to “this function / syscall / query is the culprit”
- Practical tips to avoid misleading results
You won’t see full performance theory here, just the tools and how to apply them.
Types of Profiling
Before picking a tool, clarify what you’re profiling:
- CPU profiling
- Where CPU time is spent (functions, code paths, syscalls)
- Sampled vs instrumentation (sampling is less intrusive, instrumentation more precise)
- Memory profiling
- Leaks, excessive allocation, fragmentation
- Per‑allocation backtraces, heap snapshots
- I/O profiling
- Disk latency, throughput, queue depth
- Which processes and files are responsible
- Lock / contention profiling
- Mutexes, futexes, kernel locks
- Where threads wait
- Latency / tracing
- Time between events (e.g., function entry/exit, syscalls, network recv/send)
- Helps with “slow sometimes” problems
Many tools cover several of these categories via different subcommands or options.
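As a concrete illustration of the sampling-vs-instrumentation trade-off, here is a toy instrumentation-style profiler in Python (the `busy` workload and the hook are invented for this example): the hook fires on every function call, so the counts are exact, but the hook itself adds overhead to each call — whereas a sampler like `perf` only interrupts the program at a fixed frequency.

```python
import sys
from collections import Counter

def make_call_counter():
    """Instrumentation-style profiling: hook every function call.

    Precise (every call is observed) but intrusive -- the hook adds
    overhead to each call, unlike a sampling profiler that only
    interrupts the program ~99 times per second.
    """
    counts = Counter()

    def hook(frame, event, arg):
        if event == "call":
            counts[frame.f_code.co_name] += 1

    return counts, hook

def busy(n):
    return sum(i * i for i in range(n))

counts, hook = make_call_counter()
sys.setprofile(hook)        # enable instrumentation
for _ in range(5):
    busy(100)
sys.setprofile(None)        # disable it again

print(counts["busy"])       # every one of the 5 calls was seen
```

A sampling profiler would instead record whichever function happened to be on-CPU at each tick, giving a statistical (but much cheaper) picture.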
`perf`: General‑Purpose Kernel and CPU Profiler
perf is a standard profiling and tracing tool integrated with the Linux kernel. It works with hardware performance counters (cycles, cache misses, branches, etc.) and kernel tracepoints.
Installation and Setup
On common distros:
- Debian: `sudo apt install linux-perf`
- Ubuntu: `sudo apt install linux-tools-common linux-tools-$(uname -r)`
- Fedora: `sudo dnf install perf`
- Arch: `sudo pacman -S perf`
You often need debug symbols for meaningful function names:
- Debian/Ubuntu: install `libc6-dbg` and `linux-image-$(uname -r)-dbgsym` (or the relevant `-dbg` / `-dbgsym` packages)
- Fedora/RHEL: `sudo dnf debuginfo-install kernel glibc`
- Arch: enable debug packages in `/etc/pacman.conf` and install `glibc-debug` / kernel debug packages if needed
CPU Sampling (`perf record` / `perf report`)
Profile a command:

```
sudo perf record -g -- ./your_program --arg1 --arg2
```

- `-g`: capture call graphs (backtraces)
- By default, samples the `cycles` event at a regular frequency

Profile an already running process (this samples the target PID for 30 seconds):

```
sudo perf record -g -p <PID> -- sleep 30
```

View the report:

```
perf report
```

- Shows functions sorted by percentage of samples (i.e., where CPU time is spent)
- Navigate with the arrow keys, expand call stacks, filter with `/`
Key options:
- `-F freq` – sampling frequency (e.g. `-F 99`)
- `-e event` – hardware/software events, e.g. `-e cycles`, `-e instructions`, `-e cache-misses`
- `--call-graph dwarf` – more accurate call graphs for some binaries (needs DWARF debug info)
System‑Wide Sampling
To capture everything on the system:
```
sudo perf record -g -a -- sleep 10
sudo perf report
```

- `-a` – profile all CPUs
- Filter afterward (by PID, comm, DSO) inside `perf report`
Flame Graphs from `perf`
perf integrates well with Flame Graphs, which make “hot paths” visually obvious.
Generate folded stacks:

```
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script > out.perf
```

Then use Brendan Gregg’s FlameGraph scripts (clone the repo):

```
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > perf.svg
```
Open `perf.svg` in a browser: the width of a frame corresponds to its share of samples (CPU time), while stack height shows call depth.
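The "folded" intermediate format produced by `stackcollapse-perf.pl` is simple enough to sketch: each sample's stack becomes one semicolon-joined line, root first, followed by a count. A loose Python sketch (the sample stacks here are invented, and real perf output needs more parsing):

```python
from collections import Counter

def collapse(samples):
    """Fold stack samples into flame-graph "folded" lines.

    `samples` is a list of stacks, each leaf-first (the order in
    which perf script prints frames); the output lines are
    "root;...;leaf count", the input format flamegraph.pl expects.
    """
    folded = Counter()
    for stack in samples:
        folded[";".join(reversed(stack))] += 1   # root-first path
    return sorted(f"{path} {n}" for path, n in folded.items())

samples = [
    ["do_read", "handle_request", "main"],   # leaf first, like perf script
    ["do_read", "handle_request", "main"],
    ["parse", "handle_request", "main"],
]
for line in collapse(samples):
    print(line)
# main;handle_request;do_read 2
# main;handle_request;parse 1
```

In the rendered SVG, each folded line becomes one tower of frames whose width is proportional to its count.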
perf Top‑Like View (`perf top`)
For live CPU usage by symbol:
```
sudo perf top
```

- Similar to `top`, but at the function level
- Use `p` to filter by PID, `d` for details
Other `perf` Subcommands
- `perf stat` – summary statistics for a command:

  ```
  perf stat -e cycles,instructions,cache-misses ./your_program
  ```

- `perf sched` – scheduler analysis (task switching, wakeup latencies)
- `perf record -e sched:sched_switch` – capture context switches
`ftrace` and `trace-cmd`: Kernel Function Tracing
ftrace is a low‑level kernel tracing framework; trace-cmd is a wrapper tool that makes it easier to use.
Basic `trace-cmd` Usage
Install:
- Debian/Ubuntu: `sudo apt install trace-cmd`
- Fedora: `sudo dnf install trace-cmd`
- Arch: `sudo pacman -S trace-cmd`
Trace sched and irq events for 10 seconds:
```
sudo trace-cmd record -e sched -e irq -- sleep 10
sudo trace-cmd report
```

Useful for:
- Scheduler issues (high wakeup latency)
- Interrupt storms
- Determining what the kernel is doing during “mysterious” stalls
Function Tracing with `ftrace` (sysfs Interface)
ftrace lives under `/sys/kernel/debug/tracing` (ensure debugfs is mounted):

```
sudo mount -t debugfs none /sys/kernel/debug
cd /sys/kernel/debug/tracing
```
Example: trace `vfs_read` only:

```
echo function | sudo tee current_tracer
echo vfs_read | sudo tee set_ftrace_filter
echo 1 | sudo tee tracing_on
sleep 5
echo 0 | sudo tee tracing_on
sudo cat trace
```

Use this carefully; it can generate large trace logs on busy systems.
eBPF and BCC / bpftrace
eBPF (extended BPF) allows powerful, low‑overhead dynamic tracing and profiling inside the kernel. You can attach probes to:
- Kernel functions
- User‑space functions (uprobes)
- Tracepoints
- Syscall entry/exit
- Network events
BCC Tools: Ready‑Made Profilers
Install BCC (name varies by distro):
- Debian/Ubuntu: `sudo apt install bpfcc-tools`
- Fedora: `sudo dnf install bcc-tools`
- Arch: `sudo pacman -S bcc`
Useful BCC scripts include:
- `profile` – CPU stack sampling (like `perf`, but via eBPF)
- `offcputime` – where threads are blocked (off-CPU)
- `runqlat` – run-queue latency
- `biolatency` – block I/O latency
- `tcpretrans` – TCP retransmits
- `execsnoop`, `opensnoop`, `filetop` – high-level activity tracing
Example: CPU profile (system-wide, 49 Hz, 10 seconds):

```
sudo profile -F 49 -d 10
```

Example: off-CPU time (which stacks account for blocked time):

```
sudo offcputime -d 10
```

(On Debian/Ubuntu, the BCC tools carry a `-bpfcc` suffix, e.g. `profile-bpfcc`.)

`bpftrace`: One‑Liners for Tracing
bpftrace offers an awk‑like scripting language:
Install:
- Debian/Ubuntu: `sudo apt install bpftrace`
- Fedora: `sudo dnf install bpftrace`
- Arch: `sudo pacman -S bpftrace`
Example: measure time spent in a user‑space function (uprobes):
```
sudo bpftrace -e '
uprobe:/usr/bin/myapp:myfunc {
  @start[tid] = nsecs;
}
uretprobe:/usr/bin/myapp:myfunc /@start[tid]/ {
  @time = hist((nsecs - @start[tid]) / 1000000);
  delete(@start[tid]);
}'
```
This builds a latency histogram (in ms) of myfunc.
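bpftrace's `hist()` aggregates values into power-of-two buckets in-kernel before printing its ASCII histogram. The Python sketch below mimics that bucketing so you can see what each printed row represents (real bpftrace also has a separate bucket for values below 1, which this sketch folds into the first bin):

```python
def log2_hist(values):
    """Bucket values into power-of-two bins, like bpftrace's hist().

    Returns {(low, high): count} where each bucket covers the
    half-open range [2^k, 2^(k+1)).
    """
    buckets = {}
    for v in values:
        k = max(v, 1).bit_length() - 1        # floor(log2(v)) for v >= 1
        low, high = 1 << k, 1 << (k + 1)
        buckets[(low, high)] = buckets.get((low, high), 0) + 1
    return buckets

latencies_ms = [1, 2, 3, 5, 9, 17, 130]       # invented sample latencies
h = log2_hist(latencies_ms)
print(h)
# {(1, 2): 1, (2, 4): 2, (4, 8): 1, (8, 16): 1, (16, 32): 1, (128, 256): 1}
```

Power-of-two buckets keep the in-kernel aggregation cheap and bound the output size regardless of how many events fire.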
eBPF tools are excellent for “black box” investigation when you can’t modify or rebuild code.
Memory Profiling Tools
`valgrind` / `massif` / `callgrind`
valgrind instruments programs for detailed memory behavior at the cost of heavy slowdown.
Install:
- Debian/Ubuntu: `sudo apt install valgrind`
- Fedora: `sudo dnf install valgrind`
- Arch: `sudo pacman -S valgrind`
Leak Checking (`memcheck`)
```
valgrind --leak-check=full --show-leak-kinds=all ./your_program
```

Provides detailed leak backtraces; useful in development, less so in production.
Heap Profiling (`massif`)
```
valgrind --tool=massif ./your_program
ms_print massif.out.<pid> | less
```

Shows heap usage over time and where peak usage occurs.
CPU Simulation (`callgrind`)
For function‑level CPU cost (without relying on hardware counters):
```
valgrind --tool=callgrind ./your_program
callgrind_annotate callgrind.out.<pid> | less
```
This is slower than perf but sometimes easier to interpret in development environments.
`heaptrack`
heaptrack records all allocations and provides GUI and CLI analysis.
Install (names vary):
- Debian/Ubuntu: `sudo apt install heaptrack`
- Fedora: `sudo dnf install heaptrack`
- Arch: `sudo pacman -S heaptrack`
Usage:
```
heaptrack ./your_program
heaptrack_print heaptrack.your_program.<pid>.zst | less
```
Or open the output file with heaptrack_gui for interactive analysis.
`jemalloc` / `tcmalloc` Profiling
Some alternative allocators (jemalloc, tcmalloc) expose built‑in profiling:
- Enable profiling via environment variables (e.g. `MALLOC_CONF=prof:true` for jemalloc)
- Use their profiling tools (`jeprof`, `pprof`) to visualize allocation hot-spots
These are advanced but extremely powerful in long‑running servers.
I/O and Block‑Level Profiling
`iostat`, `pidstat`, `iotop`
While not “profilers” in the strict sense, these tools are essential for correlating I/O with processes.
- `iostat -x 1` – extended disk stats (utilization, `await`; note that `svctm` is deprecated in recent sysstat releases)
- `pidstat -d 1` – per-process I/O stats
- `iotop` – top-like display for I/O
Install examples:
- Debian/Ubuntu: `sudo apt install sysstat iotop`
- Fedora: `sudo dnf install sysstat iotop`
Use these to validate: “Is the disk actually saturated?” before going deeper.
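Under the hood, `iostat` derives these numbers from deltas of the cumulative counters in `/proc/diskstats`. A simplified sketch of the arithmetic, with invented sample numbers (the field names are descriptive, not the raw kernel column names):

```python
def io_await_util(prev, curr, interval_s):
    """Derive iostat-style 'await' and '%util' from two samples of
    per-device I/O counters.

    Each sample is a dict of cumulative counters:
      ios      - completed reads + writes
      ticks_ms - total time spent on completed I/Os (ms)
      busy_ms  - time the device had I/O in flight (ms)
    """
    d_ios = curr["ios"] - prev["ios"]
    d_ticks = curr["ticks_ms"] - prev["ticks_ms"]
    d_busy = curr["busy_ms"] - prev["busy_ms"]
    await_ms = d_ticks / d_ios if d_ios else 0.0          # avg latency
    util_pct = 100.0 * d_busy / (interval_s * 1000.0)     # % of wall time busy
    return await_ms, util_pct

prev = {"ios": 1000, "ticks_ms": 4000, "busy_ms": 200}
curr = {"ios": 1100, "ticks_ms": 4800, "busy_ms": 700}
await_ms, util = io_await_util(prev, curr, 1.0)
print(await_ms, util)   # 8.0 ms average latency, 50% utilized
```

Knowing the derivation helps when the numbers look odd: e.g. `%util` near 100% on an SSD with deep queues does not necessarily mean the device is saturated.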
Block I/O Latency via BCC/eBPF
`biolatency` (BCC):

```
sudo biolatency 1
```

Prints histograms of block I/O latency (add `-D` for per-device histograms).
biosnoop shows individual requests with process names and latency.
These help differentiate:
- Storage hardware limits
- Filesystem behavior
- Application patterns (e.g., tiny random writes)
Application‑Level Profilers
Different language ecosystems have their own profilers; here we cover only how they fit into a Linux tuning workflow.
C/C++ with `gprof` and Perf‑Aware Compilers
`gprof` (Legacy but sometimes useful)
- Compile with `-pg`:

  ```
  gcc -pg -O2 -o myprog myprog.c
  ```

- Run `./myprog` (this writes `gmon.out` in the current directory)
- Analyze:

  ```
  gprof ./myprog gmon.out | less
  ```
gprof gives call graph and per‑function statistics but is less accurate than sampling tools in optimized builds.
Compiler Support for Perf
Modern compilers emit DWARF and frame information; for better perf results:
- Compile with `-fno-omit-frame-pointer` for easier stack unwinding
- Keep symbols: don't strip binaries with `-s`; consider `-g` for debug builds
Python Profiling
Built‑ins:
- `cProfile` – CPU profiling at the level of Python functions
- `profile` – similar, but pure Python (and slower)
Run via command:
```
python3 -m cProfile -o stats.out your_script.py
```

View stats:

```
python3 -m pstats stats.out
```
For more advanced use (sampling, remote): look at tools like py-spy, scalene, yappi.
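`cProfile` can also be driven programmatically, which is handy for profiling just one code path inside a larger program. A small self-contained example (the `hot` / `main` functions are placeholders for your own code):

```python
import cProfile
import pstats
import io

def hot(n):
    return sum(i * i for i in range(n))

def main():
    total = 0
    for _ in range(50):
        total += hot(1000)
    return total

# Profile main() and capture the stats in memory instead of a stats.out file
pr = cProfile.Profile()
pr.enable()
main()
pr.disable()

buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf).sort_stats("cumulative")
stats.print_stats()          # full listing, sorted by cumulative time
report = buf.getvalue()
print(report.splitlines()[0])
```

Sorting by `"cumulative"` surfaces the entry points responsible for the time; sort by `"tottime"` instead to find the functions doing the work themselves.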
Java, JVM Languages
Use JVM tools:
- `jcmd`, `jstack`, `jmap`, `jstat`
- `async-profiler` – low-overhead, perf-based; supports CPU, allocation, wall-clock, and lock profiling
Example (async-profiler):
```
./profiler.sh -d 30 -e cpu -f profile.svg <PID>
```
Open profile.svg for a flame graph.
Network and Latency Profiling
`perf` + Network Tracepoints
You can attach perf to networking tracepoints; e.g.:
```
sudo perf record -e net:net_dev_xmit -a -- sleep 10
sudo perf script
```

For more detailed packet-level analysis, use dedicated tracing (via eBPF tools, or tools in the Network Services / DevOps parts of the course).
eBPF Network Tools
BCC provides:
- `tcpretrans` – retransmission analysis
- `tcplife` – TCP connection lifetimes
- `tcpaccept` – passive (accepted) connections
- `tcpconnect` – connect attempts
Example:

```
sudo tcpretrans
```

Helps correlate packet loss, RTT, or congestion with observed application slowness.
GUI and Integrated Profiling Tools
While much of Linux performance work is CLI‑driven, certain use cases benefit from GUI tooling.
`sysprof` (GNOME)
Useful for GNOME apps and general system tracing:
- Can attach to processes or entire system
- Visual timelines for CPU, I/O, syscalls, and GNOME internals
Launch via:

```
sysprof
```

KDE / Qt: `kcachegrind`, `hotspot`

- `kcachegrind` visualizes `callgrind` output
- `hotspot` visualizes `perf` data interactively
Install:
- Debian/Ubuntu: `sudo apt install kcachegrind hotspot`
- Fedora: `sudo dnf install kcachegrind hotspot`
Use them to explore data from perf record or valgrind --tool=callgrind.
Typical Profiling Workflows
This section ties multiple tools together into practical sequences.
Workflow 1: High CPU Usage
- Confirm with `top` / `htop`.
- Take a system-wide CPU profile:

  ```
  sudo perf record -F 99 -a -g -- sleep 30
  sudo perf report
  ```

- If results are unclear, generate a flame graph as described earlier.
- If the workload is in Python/Java/etc., switch to a language-specific profiler to zoom in.
Workflow 2: System “Feels Slow” but CPU Not Maxed
- Check I/O:
  - `iostat -x 1` for disk
  - `pidstat -d 1` / `iotop` for per-process I/O
- If disk latency seems high:
  - `sudo biolatency 1`
  - `sudo biosnoop`
- If CPU is mostly idle and I/O is fine, check for scheduling / lock problems:
  - `sudo offcputime -d 10`
  - `sudo runqlat 1`
- If still unclear, capture a broader eBPF or ftrace trace around the problematic period.
Workflow 3: Memory Growth / Out‑of‑Memory
- Track memory per process (`smem`, `ps`, `top` / `htop`).
- For reproducible dev workloads, use `valgrind --leak-check=full` or `heaptrack`.
- For production:
  - Use allocator-specific profiles (jemalloc / tcmalloc) if available.
  - Use language-specific tools (e.g. Java heap dumps, Python `tracemalloc`).
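For Python services, `tracemalloc` from the standard library can attribute memory growth to source lines without any external tooling. A minimal sketch (the 1 MB list allocation simulates a leak):

```python
import tracemalloc

tracemalloc.start()

snap1 = tracemalloc.take_snapshot()
leaky = [bytes(1000) for _ in range(1000)]   # simulate ~1 MB of growth
snap2 = tracemalloc.take_snapshot()

# Diff the snapshots, grouped by source line, biggest growth first
top = snap2.compare_to(snap1, "lineno")
growth = sum(stat.size_diff for stat in top)
print(growth > 500_000)                      # the allocation above dominates
for stat in top[:3]:
    print(stat)
```

Taking periodic snapshots in a long-running process and diffing them is often enough to locate steady memory growth.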
Practical Tips and Pitfalls
- Profiling changes behavior: sampling is safer in production; instrumentation (`valgrind`, `-pg`) can alter timings and expose or hide races.
- Warm-up effects: include JIT warm-up, cache warming, and steady-state behavior in your measurement window.
- Reproducibility: profile under representative load; use same inputs and environment.
- Frequency vs overhead: higher sampling frequency gives more detail but more overhead and noise.
- Symbols matter: without debug symbols and frame pointers, profiles can be hard to interpret.
- One tool rarely answers everything: combine coarse (top/iostat) with fine‑grained (perf/eBPF/valgrind) tools.
Profiling is iterative: measure, hypothesize, change, re‑measure. The tools covered here are your primary instruments for that cycle on Linux.