Process management is one of the most frequently used Linux admin skills. Every command you run, every service that starts, every script that executes — all of these create processes. Understanding how the Linux kernel manages processes, how to monitor them, and how to control them is fundamental to RHCA-level administration.
What Is a Process?
A process is an instance of a program in execution. When you run ls, the kernel creates a process, allocates memory, loads the binary, and runs it. When it finishes, the kernel destroys the process and frees resources.
Every Linux process has these attributes:
- PID (Process ID): Unique numeric identifier. PID 1 is always the init process (systemd on current RHEL releases).
- PPID (Parent PID): The PID of the process that created this one. Every userspace process descends from PID 1, forming a tree rooted there.
- UID/GID: The user and group the process runs as (determines what files it can access).
- Nice value: Scheduling priority hint (-20 to +19).
- State: Current execution state (running, sleeping, stopped, zombie).
- Memory: Virtual address space, physical RAM pages, swap usage.
- Open file descriptors: Files, network sockets, pipes the process has open.
- Cgroup: Control group for resource limiting (systemd assigns each service its own cgroup).
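Most of the attributes listed above can be read straight from the kernel. The following is a minimal sketch, assuming a Linux /proc filesystem; proc_attrs is a hypothetical helper name, not a standard command.

```shell
# Print selected attributes of a process from /proc/<pid>/status.
# proc_attrs is a hypothetical helper; the field names come from the kernel.
proc_attrs() {
    grep -E '^(Name|State|Pid|PPid|Uid|Gid):' "/proc/$1/status"
}

# Inspect the current shell ($$ expands to its PID).
proc_attrs "$$"
```

The Uid and Gid lines each carry four values: real, effective, saved, and filesystem IDs.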
Process States — Deep Dive
| State | Symbol | What It Means | Common Causes |
|---|---|---|---|
| Running | R | Actively executing on a CPU core, or in the run queue waiting for a CPU | Normal operation |
| Sleeping (interruptible) | S | Waiting for an event — network I/O, disk I/O, user input, timer. Can be woken by signals. | Normal for most daemons between requests |
| Sleeping (uninterruptible) | D | Waiting for I/O that cannot be interrupted (direct disk access, kernel calls). Signals, including SIGKILL, take effect only after the wait completes. | Disk I/O, NFS hangs, kernel bugs |
| Stopped | T | Process execution suspended — stopped by signal (SIGSTOP) or debugger | Ctrl+Z, gdb debugging, SIGTSTP |
| Zombie | Z | Process finished executing but exit status not yet read by parent. Entry remains in process table. | Parent that never calls wait() (programming bug or hung parent) |
| Traced | t | Stopped under debugger tracing | strace, gdb |
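
The symbols in the table are what ps prints as the first character of its STAT column. Further characters are modifiers. As a rough sketch of decoding such a value (describe_state is a hypothetical helper, and only the states from the table are covered):

```shell
# Map the first character of a ps STAT value to a state name.
# Extra characters (s, l, +, <, N ...) are modifiers and are ignored here.
describe_state() {
    case "$1" in
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep" ;;
        D*) echo "uninterruptible sleep" ;;
        T*) echo "stopped" ;;
        t*) echo "traced" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

describe_state "Ss"   # session leader in interruptible sleep
describe_state "D+"   # foreground process blocked on I/O
```

Matching on the first character only is the point of the sketch: a filter that compares the whole STAT value against a single letter misses entries such as Ss or D+.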
Processes in D State (Uninterruptible Sleep)
A large number of processes in D state is a serious problem indicator. Signals sent to these processes, SIGKILL included, stay pending until the blocking kernel operation returns, so kill -9 appears to do nothing. Common causes:
- NFS mount hanging (server unreachable, network partition)
- Failing disk with pending I/O that never completes
- Kernel bug or driver issue
- Overloaded storage system
# Find processes in D state (STAT may carry modifiers such as D+ or Ds):
# ps aux | awk '$8 ~ /^D/'
# ps -eo pid,state,cmd | awk '$2 == "D"'
# Check what device/file they are blocked on:
# cat /proc/PID/wchan # what kernel function they wait in
# strace -p PID # trace system calls (if attachable)
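Where ps is not available, the same search can be done by reading /proc directly. This is a sketch under the assumption of a Linux /proc filesystem; list_by_state is a hypothetical helper name.

```shell
# List PID, state and name of every process whose state letter equals $1,
# reading /proc directly (works even where ps is not installed).
list_by_state() {
    for f in /proc/[0-9]*/status; do
        # A process may exit between the glob and the read; ignore that.
        awk -v want="$1" '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            END       { if (state == want) print pid, state, name }
        ' "$f" 2>/dev/null || true
    done
}

list_by_state D
```

On a healthy system the call usually prints nothing. Passing S or R instead of D lists sleeping or runnable processes in the same format.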
ps — Process Status Command
# BSD-style syntax (no dash):
# ps aux
# a = show processes of all users
# u = user-oriented format (USER, %CPU, %MEM, RSS)
# x = include processes without controlling terminal
# UNIX-style syntax (with dash):
# ps -ef
# -e = every process
# -f = full format (UID, PID, PPID, C, STIME, TTY, TIME, CMD)
# Output columns explained (ps aux):
# USER = owner of the process
# PID = process ID
# %CPU = CPU time used divided by the time the process has been running (a lifetime average, not a live reading)
# %MEM = percentage of physical RAM used
# VSZ = virtual memory size (KB) - all memory the process CAN access
# RSS = resident set size (KB) - actual physical RAM currently used
# TTY = controlling terminal (? = none, tty1 = console, pts/0 = pseudo-terminal such as an SSH session or terminal window)
# STAT = state (S, D, R, Z, T + modifiers: s=session leader, l=multi-threaded)
# START = when the process started
# TIME = cumulative CPU time consumed
# COMMAND = command line
# Useful ps filters:
# ps -ef | grep httpd # specific process
# ps -ef --forest # tree view (parent-child relationships)
# ps -eo pid,ppid,user,cmd --forest # custom columns with tree
# ps aux --sort=-%cpu | head -10 # top 10 CPU consumers
# ps aux --sort=-%mem | head -10 # top 10 memory consumers
# ps -u apache # all processes by user apache
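Because ps aux output is column-oriented, it combines well with awk. The sketch below totals the RSS column per user; rss_by_user is a hypothetical helper name, and the sample lines are made up so the example runs without ps installed.

```shell
# Sum RSS (field 6 of `ps aux`, in KB) per user; the header line is skipped.
# rss_by_user is a hypothetical helper name.
rss_by_user() {
    awk 'NR > 1 { rss[$1] += $6 }
         END    { for (u in rss) printf "%s %d\n", u, rss[u] }' | sort
}

# Typical use: ps aux | rss_by_user
# Self-contained demonstration with made-up sample lines:
rss_by_user <<'EOF'
USER  PID %CPU %MEM   VSZ  RSS TTY STAT START TIME COMMAND
root    1  0.0  0.1 10000 1500 ?   Ss   10:00 0:01 /sbin/init
apache 42  0.2  0.3 20000 2000 ?   S    10:01 0:00 httpd
apache 43  0.1  0.3 20000 1000 ?   S    10:01 0:00 httpd
EOF
```

For the sample input this prints apache 3000 followed by root 1500. Shared pages are counted once per process, so the per-user total can overstate real memory use.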
top — Interactive Real-Time Monitor
# Launch:
# top
# top -b -n 1 > /tmp/snapshot.txt # batch mode, save to file
# top -b -n 3 -d 1 > /tmp/3sec.txt # 3 iterations, 1 second apart
# KEY HEADER METRICS:
# top - 14:32:01 up 5 days, 3:21, 2 users, load average: 0.52, 0.38, 0.31
#
# Load average: 3 numbers = last 1, 5, 15 minutes
# Rule of thumb: a sustained load > nCPUs means tasks are queuing (for CPU, or for uninterruptible I/O)
# Check nCPUs: nproc OR grep processor /proc/cpuinfo | wc -l
# Tasks line:
# Tasks: 185 total, 1 running, 184 sleeping, 0 stopped, 0 zombie
# Zombie count should normally be 0; a brief non-zero value is harmless, a persistent one means a parent is not reaping its children
# CPU line (%Cpu(s)):
# us = user space (your applications)
# sy = kernel/system space (system calls, interrupts)
# ni = nice processes (low-priority user tasks)
# id = idle
# wa = I/O wait — HIGH VALUE = disk or network bottleneck
# hi = hardware interrupts
# si = software interrupts
# st = steal time (CPU taken from this VM by the hypervisor)
# Memory lines:
# Mem: 16G total, 4G free, 8G used, 4G buff/cache
# Swap: 4G total, 0 used, 4G free
# NOTE: memory listed under buff/cache is not lost; most of it is reclaimable
# Linux uses otherwise idle RAM as page cache to speed up file access
# The kernel drops clean cache pages quickly when an application needs memory
# INTERACTIVE COMMANDS inside top:
# [Space] = refresh immediately
# k = kill process (prompts for PID and signal)
# r = renice (change priority of running process)
# u = filter by username
# 1 = toggle per-CPU display
# M = sort by memory (%MEM)
# P = sort by CPU (%CPU, default)
# T = sort by time (cumulative)
# N = sort by PID
# i = toggle idle processes
# c = toggle command/full path
# f/F = add/remove display fields
# q = quit
# h = help
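The buff/cache note above is easier to act on with a single number. The kernel publishes an estimate of reclaimable-plus-free memory as MemAvailable in /proc/meminfo. This is a minimal sketch, assuming that field is present; mem_available_pct is a hypothetical helper name.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# $1 is an optional meminfo-format file (default: /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

mem_available_pct
```

The optional file argument exists only so the function can be tried against a saved copy of meminfo from another host.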
Signals — Communicating with Processes
Signals are software interrupts sent to processes. When the kernel delivers a signal, the process either runs its registered handler or, if it has none, takes the signal's default action (terminate for most signals).
# List all signals:
# kill -l
# man 7 signal
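# IMPORTANT SIGNALS (numbers shown are the x86/ARM values; some, such as 17-20, differ on other architectures, so prefer names in scripts):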
# SIGHUP (1) = Hangup. Reload configuration without restart. Many daemons
# (nginx, apache, syslogd) re-read config files on SIGHUP.
# SIGINT (2) = Interrupt. Sent by Ctrl+C. CAN be caught; default action is to terminate.
# SIGQUIT (3) = Quit. Sent by Ctrl+\. Like SIGINT but also generates a core dump.
# SIGKILL (9) = Kill. CANNOT be caught, blocked, or ignored. Immediate
# termination. No cleanup. Use as last resort.
# SIGTERM (15) = Terminate. Default kill signal. CAN be caught.
# Allows graceful shutdown (save state, close connections).
# SIGSTOP (19) = Stop. CANNOT be caught. Suspends process.
# SIGCONT (18) = Continue. Resumes a stopped process.
# SIGTSTP (20) = Terminal stop. Same as Ctrl+Z. CAN be caught.
# Send signals:
# kill PID # sends SIGTERM (15) by default
# kill -15 PID # same
# kill -9 PID # SIGKILL (force)
# kill -1 PID # SIGHUP (reload config)
# kill -TERM PID # named signal
# Kill by name (sends SIGTERM to all matching processes):
# killall httpd # exact process-name match
# pkill httpd # regex match on the process name (also hits e.g. httpd-worker); pkill -x httpd for an exact match
# pkill -9 -u raju # kill all processes owned by raju
# Find PID by name:
# pidof httpd # returns PID(s)
# pgrep httpd # returns PID(s)
# pgrep -a httpd # with command line
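The difference between a catchable signal and SIGKILL is easiest to see from the receiving side. The sketch below shows a shell worker that traps SIGTERM and cleans up before exiting. The worker and sigterm_demo names and the temporary log file are illustrative assumptions only.

```shell
# A worker that traps SIGTERM, records a cleanup message, and exits 0.
# The worker name and the log-file argument are illustrative only.
worker() {
    log=$1
    trap 'echo "caught SIGTERM, cleaning up" > "$log"; exit 0' TERM
    while :; do sleep 1; done
}

# Start the worker, send SIGTERM, and print what the handler wrote.
sigterm_demo() {
    out=$(mktemp)
    worker "$out" &
    pid=$!
    sleep 1                  # give the worker time to install its trap
    kill -TERM "$pid"
    wait "$pid" || true      # returns once the handler has run and exited
    cat "$out"
    rm -f "$out"
}

sigterm_demo
```

Replacing kill -TERM with kill -KILL would leave the log file empty, because SIGKILL never reaches the handler.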
Process Priority — nice and renice
The Linux CFS (Completely Fair Scheduler) allocates CPU time based on a process's nice value and its scheduling policy. Nice values are a hint to the scheduler — they do not provide real-time guarantees.
# Nice value range: -20 (highest priority) to +19 (lowest priority)
# Default nice value: 0
# Only root can set negative nice values
# Normal users can only increase nice value (reduce priority)
# Start a process with adjusted priority:
# nice -n 10 /scripts/backup.sh # lower priority (background work)
# nice -n -20 /scripts/critical.sh # highest priority (root only)
# nice # no arguments: prints the current nice value (0 by default)
# Change priority of running process:
# renice -n 5 -p 1234 # set PID 1234 to nice=5
# renice +5 1234 # same: util-linux renice treats the value as absolute unless POSIXLY_CORRECT is set
# renice -5 1234 # set nice to -5 (higher priority, root only)
# renice +15 -u raju # set ALL of raju's processes to nice=15
# View nice values in top: NI column
# View in ps: ps -eo pid,ni,cmd
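The value given to nice -n is an increment added to the caller's own nice value, which a quick experiment can confirm. This is a small sketch; child_niceness is a hypothetical helper name.

```shell
# Print the nice value a command gets when started with increment $1.
# `nice` with no command prints the nice value it is itself running at.
child_niceness() {
    nice -n "$1" nice
}

echo "current shell:    $(nice)"
echo "child with -n 10: $(child_niceness 10)"
```

From a shell running at nice 0, the second line reports 10. From a shell already at nice 5 it reports 15, and the result is capped at 19.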
Background Jobs and Job Control
# Run a command in the background:
# /scripts/long_backup.sh &
# [1] 12345 # job number and PID
# Suspend foreground process:
# [Ctrl+Z] # sends SIGTSTP to current process
# [1]+ Stopped /scripts/backup.sh
# List jobs in current shell:
# jobs # brief list
# jobs -l # with PIDs
# Resume job in background:
# bg %1 # send job 1 to background
# bg # resume most recent
# Resume job in foreground:
# fg %1 # bring job 1 to foreground
# fg # bring most recent
# Kill a job by number:
# kill %1
# Disown a job (remove from shell job table — survives shell exit):
# disown %1
# disown -a # disown all
# Run process immune to terminal hangup:
# nohup /scripts/backup.sh & # continues after logout
# nohup /scripts/backup.sh > /var/log/backup.log 2>&1 &
# Modern alternative (systemd-run):
# systemd-run --unit=backup /scripts/backup.sh
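Job numbers such as %1 belong to interactive shells. In scripts, where job control is normally off, the same pattern can be expressed with $! and wait. A minimal sketch follows; run_and_collect is a hypothetical helper name.

```shell
# Run a command in the background, then collect its exit status.
# $! holds the PID of the most recent background command.
run_and_collect() {
    "$@" &
    pid=$!
    rc=0
    wait "$pid" || rc=$?     # wait returns the child's exit status
    echo "pid $pid exited with status $rc"
}

run_and_collect sh -c 'sleep 1; exit 3'
```

Real scripts would do other work between starting the command and calling wait. Collecting the status this way also reaps the child, so it never lingers as a zombie.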
lsof — List Open Files
lsof shows every file, socket, device, and pipe that is currently open by any process. Extremely useful for:
- Finding what process is using a port
- Finding what process is preventing a filesystem unmount
- Debugging "disk full" issues (deleted files held open)
# All open files:
# lsof | head -50
# Files opened by specific process:
# lsof -p 1234
# Files opened by specific user:
# lsof -u raju
# What is using port 80:
# lsof -i :80
# lsof -i tcp:80
# What is using a specific file:
# lsof /var/log/messages
# What is using a mount point (before unmounting):
# lsof /mnt/data
# Find processes with deleted files still held open:
# lsof | grep "(deleted)"
# lsof +L1 # same idea without grep: open files whose link count is below 1
# This shows files that have been deleted but are still open
# They consume disk space until the process closes them or exits
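The "deleted but still open" situation can be reproduced in a few lines, which helps when explaining why df and du disagree. The sketch below assumes a Linux /proc filesystem and uses readlink instead of lsof so that it runs on minimal systems; held_open_demo is a hypothetical name.

```shell
# Show that an unlinked file stays allocated while a descriptor is open.
held_open_demo() {
    f=$(mktemp)
    exec 3>"$f"                 # open the file on descriptor 3
    rm -f "$f"                  # unlink it; the inode survives
    readlink /proc/self/fd/3    # readlink inherits fd 3; target ends in " (deleted)"
    exec 3>&-                   # closing the descriptor frees the space
}

held_open_demo
```

The printed path carries the same " (deleted)" suffix that lsof displays. For a real service the equivalent of the final step is restarting it, or making it reopen its log files.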
Zombie Processes — Causes and Resolution
A zombie is a process that has finished execution but whose entry remains in the process table because the parent has not called wait() to collect its exit status. Zombies:
- Consume a process table slot (limited resource)
- Cannot be killed with kill -9 (already dead)
- Are automatically cleaned up when the parent dies (reparented to init/systemd which collects them)
# Find zombies (STAT may carry modifiers such as Z+ or Zs):
# ps aux | awk '$8 ~ /^Z/'
# ps -ef | grep '[d]efunct'
# Count zombies:
# ps aux | awk '$8 ~ /^Z/' | wc -l
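# Resolution:
# 1. Terminate the parent process; its zombies are reparented to init, which reaps them
# kill PPID # SIGTERM first; kill -9 PPID only if the parent does not exit
# 2. If parent is critical (can't be killed):
# Send SIGCHLD to parent; this helps only if the parent has a SIGCHLD handler that calls wait():
# kill -s CHLD PPID # by name; the number (17 on x86) varies by architecture
# 3. Last resort: reboot (all zombies are cleared)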
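To practise the commands above it helps to have a zombie on demand. The following sketch produces one for a few seconds, assuming a Linux /proc filesystem and a default SIGCHLD disposition. The zombie_demo name and the temporary PID file are illustrative only.

```shell
# Create a short-lived zombie and print its state letter.
# The inner shell starts a child, then replaces itself with `sleep`,
# which never calls wait(), so the exited child cannot be reaped.
zombie_demo() {
    pidfile=$(mktemp)
    sh -c 'sleep 1 & echo $! > "$0"; exec sleep 5' "$pidfile" &
    parent=$!
    sleep 2                                 # the child has exited by now
    zpid=$(cat "$pidfile")
    awk '/^State:/ { print $2 }' "/proc/$zpid/status"
    kill "$parent" 2>/dev/null || true      # parent dies, init reaps the zombie
    wait "$parent" 2>/dev/null || true
    rm -f "$pidfile"
}

zombie_demo
```

The function prints the state letter read from /proc, then removes the zombie the same way as resolution step 1: by terminating the parent.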
System Resource Monitoring
# Load average interpretation:
# Load average is the average number of processes that are runnable (R) or in
# uninterruptible sleep (D), over 1, 5, and 15 minute windows
# Rule of thumb:
# Load = nCPUs → 100% utilization, no waiting
# Load < nCPUs → fine
# Load > nCPUs → processes are waiting for CPU (overloaded)
# Check CPU count:
# nproc # logical CPUs
# lscpu # detailed CPU info
# grep "^processor" /proc/cpuinfo | wc -l
# Detailed CPU statistics:
# mpstat 1 5 # per-CPU stats (requires sysstat)
# sar -u 1 5 # CPU utilization over time
# Check if system is CPU or I/O bound:
# top: %wa (I/O wait) staying above roughly 20% → likely I/O bound
# top: %us + %sy staying above roughly 80% → likely CPU bound
# Memory details:
# /proc/meminfo # raw kernel memory info
# free -h # clean summary
# Process memory breakdown:
# grep -E "VmSize|VmRSS|VmSwap" /proc/PID/status
# VmSize = virtual memory (may be huge for JVM, databases — normal)
# VmRSS = physical RAM actually in use
# VmSwap = how much is swapped out
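The load-versus-CPU-count rule of thumb above can be checked from a script. Load averages are fractional and the shell only does integer arithmetic, so this sketch delegates the comparison to awk. classify_load is a hypothetical helper name, and the single threshold is a simplification of the rule.

```shell
# Compare a load average with the CPU count using awk for the
# floating-point comparison (the shell only does integer arithmetic).
classify_load() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) { print "overloaded" } else { print "ok" }
    }'
}

# The first field of /proc/loadavg is the 1-minute load average.
read -r load1 _rest < /proc/loadavg
echo "load $load1 on $(nproc) CPUs: $(classify_load "$load1" "$(nproc)")"
```

For alerting, the 5 or 15 minute figures (the second and third fields of /proc/loadavg) are usually a better input than the 1-minute value, which reacts to short bursts.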
Managing Services as Processes
# Each systemd service runs in its own cgroup:
# systemctl status httpd # shows PID(s) and resource usage
# systemd-cgls # full cgroup tree
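# Resource limits per service (drop-in file; systemctl edit httpd creates one for you):
# vim /etc/systemd/system/httpd.service.d/limits.conf
# Unit files do not allow trailing comments, so each comment sits on its own line
[Service]
# max open files
LimitNOFILE=65535
# max processes
LimitNPROC=512
# max memory (MemoryMax= on cgroup v2; older systemd releases use MemoryLimit=)
MemoryMax=2G
# max CPU percentage
CPUQuota=50%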
# Reload service configuration:
# systemctl daemon-reload
# systemctl restart httpd