Linux Process Management: ps, kill, top, nice

Process management is one of the most frequently used Linux admin skills. Every command you run, every service that starts, every script that executes — all of these create processes. Understanding how the Linux kernel manages processes, how to monitor them, and how to control them is fundamental to RHCA-level administration.

What Is a Process?

A process is an instance of a program in execution. When you run ls, the kernel creates a process, allocates memory, loads the binary, and runs it. When it finishes, the kernel destroys the process and frees resources.

Every Linux process has these attributes:

PID (Process ID): Unique numeric identifier. PID 1 is always init or systemd.
PPID (Parent PID): The PID of the process that created this one. Every user-space process has a parent, forming a tree rooted at PID 1 (kernel threads descend from kthreadd, PID 2).
UID/GID: The user and group the process runs as (determines what files it can access).
Nice value: Scheduling priority hint (-20 to +19).
State: Current execution state (running, sleeping, stopped, zombie).
Memory: Virtual address space, physical RAM pages, swap usage.
Open file descriptors: Files, network sockets, pipes the process has open.
Cgroup: Control group for resource limiting (systemd assigns each service its own cgroup).
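
Most of these attributes can be read from /proc/PID/status. The following is a minimal sketch, not a standard tool: proc_summary is a hypothetical helper name, and it assumes the tab-separated "Key:<TAB>value" layout that the kernel uses for that file.

```shell
# proc_summary: condense a /proc/PID/status file (read from stdin) to one line.
# Hypothetical helper for illustration only.
proc_summary() {
    awk -F'\t' '
        $1 == "Name:"  { name = $2 }
        $1 == "State:" { state = $2 }
        $1 == "Pid:"   { pid = $2 }
        $1 == "PPid:"  { ppid = $2 }
        $1 == "Uid:"   { uid = $2 }   # first of four values is the real UID
        END { printf "pid=%s ppid=%s uid=%s state=%s name=%s\n", pid, ppid, uid, state, name }'
}
```

To inspect the current shell, run: proc_summary < /proc/$$/status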

Process States — Deep Dive

State	Symbol	What It Means	Common Causes
Running	R	Actively executing on a CPU core, or in the run queue waiting for a CPU	Normal operation
Sleeping (interruptible)	S	Waiting for an event — network I/O, disk I/O, user input, timer. Can be woken by signals.	Normal for most daemons between requests
Sleeping (uninterruptible)	D	Waiting inside the kernel for I/O that cannot be interrupted. Signals, including SIGKILL, are not acted on until the wait completes.	Disk I/O, NFS hangs, kernel bugs
Stopped	T	Process execution suspended — stopped by signal (SIGSTOP) or debugger	Ctrl+Z, gdb debugging, SIGTSTP
Zombie	Z	Process finished executing but exit status not yet read by parent. Entry remains in process table.	Parent that never calls wait() (programming bug or hung parent)
Traced	t	Stopped under debugger tracing	strace, gdb
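
A quick way to see how the states in the table are distributed on a system is to tally the state column of ps. This is a rough sketch; count_states is a hypothetical helper name, and it only looks at the first character of each input line, so modifiers such as "+" or "s" are ignored.

```shell
# count_states: tally process state letters read from stdin, one per line.
# Hypothetical helper for illustration only.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Typical use: ps -eo state= | count_states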

Processes in D State (Uninterruptible Sleep)

A large number of processes in D state is a serious problem indicator. These processes do not respond to signals, not even SIGKILL, until the blocking kernel operation completes. Common causes:

NFS mount hanging (server unreachable, network partition)
Failing disk with pending I/O that never completes
Kernel bug or driver issue
Overloaded storage system

# Find processes in D state (STAT may carry modifiers such as D+ or Ds):
# ps aux | awk '$8 ~ /^D/'
# ps -eo pid,state,cmd | awk '$2 == "D"'

# Check what device/file they are blocked on:
# cat /proc/PID/wchan                # what kernel function they wait in
# strace -p PID                      # trace system calls (if attachable)

ps — Process Status Command

# BSD-style syntax (no dash):
# ps aux
# a = show processes of all users
# u = user-oriented format (USER, %CPU, %MEM, VSZ, RSS)
# x = include processes without controlling terminal

# UNIX-style syntax (with dash):
# ps -ef
# -e = every process
# -f = full format (UID, PID, PPID, C, STIME, TTY, TIME, CMD)

# Output columns explained (ps aux):
# USER   = owner of the process
# PID    = process ID
# %CPU   = CPU time used divided by the time the process has been running (a lifetime average, not a current reading)
# %MEM   = percentage of physical RAM used
# VSZ    = virtual memory size (KB) - all memory the process CAN access
# RSS    = resident set size (KB) - actual physical RAM currently used
# TTY    = controlling terminal (? = none, tty1 = console, pts/0 = SSH)
# STAT   = state (S, D, R, Z, T + modifiers: s=session leader, l=multi-threaded)
# START  = when the process started
# TIME   = cumulative CPU time consumed
# COMMAND = command line

# Useful ps filters:
# ps -ef | grep httpd                # specific process (the grep itself also matches)
# ps -ef --forest                    # tree view (parent-child relationships)
# ps -eo pid,ppid,user,cmd --forest  # custom columns with tree
# ps aux --sort=-%cpu | head -11     # top 10 CPU consumers (plus header line)
# ps aux --sort=-%mem | head -11     # top 10 memory consumers (plus header line)
# ps -u apache                       # all processes by user apache
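
Because RSS is reported in KB, large values are hard to read at a glance. The sketch below converts them to MiB. rss_mib is a hypothetical helper name, and it assumes three whitespace-separated input columns (PID, RSS, command name), so a command name containing spaces would be cut at the first word.

```shell
# rss_mib: read "PID RSS_KiB COMMAND" lines on stdin, print RSS in MiB.
# Hypothetical helper for illustration only.
rss_mib() {
    awk '{ printf "%s %.1f %s\n", $1, $2 / 1024, $3 }'
}
```

Typical use: ps -eo pid=,rss=,comm= --sort=-rss | head -n 5 | rss_mib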

top — Interactive Real-Time Monitor

# Launch:
# top
# top -b -n 1 > /tmp/snapshot.txt   # batch mode, save to file
# top -b -n 3 -d 1 > /tmp/3sec.txt  # 3 iterations, 1 second apart

# KEY HEADER METRICS:
# top - 14:32:01 up 5 days, 3:21, 2 users, load average: 0.52, 0.38, 0.31
#
# Load average: 3 numbers = last 1, 5, 15 minutes
# Rule of thumb: if load stays above nCPUs, the system is likely overloaded
# Check nCPUs: nproc OR grep processor /proc/cpuinfo | wc -l

# Tasks line:
# Tasks: 185 total, 1 running, 184 sleeping, 0 stopped, 0 zombie
# Zombie count should normally be 0 (brief zombies are harmless; a count that persists means a parent is not reaping its children)

# CPU line (%Cpu(s)):
# us = user space (your applications)
# sy = kernel/system space (system calls, interrupts)
# ni = nice processes (low-priority user tasks)
# id = idle
# wa = I/O wait; a high value points to a disk or storage bottleneck (including NFS)
# hi = hardware interrupts
# si = software interrupts
# st = steal time (CPU taken by the hypervisor on virtual machines)

# Memory lines:
# Mem:  16G total, 4G free, 8G used, 4G buff/cache
# Swap: 4G total, 0 used, 4G free

# NOTE: memory counted under buff/cache is mostly still available to applications
# Linux uses otherwise idle RAM as page cache to speed up file access
# The kernel reclaims clean cache pages quickly when an application needs memory
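
To put a number on "actually available", the kernel exposes a MemAvailable estimate in /proc/meminfo on reasonably recent kernels. The sketch below turns it into a percentage of total RAM. mem_available_pct is a hypothetical helper name, and it assumes the input contains both MemTotal: and MemAvailable: lines.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# Hypothetical helper for illustration only.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.0f\n", avail * 100 / total }'
}
```

Typical use: mem_available_pct < /proc/meminfo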

# INTERACTIVE COMMANDS inside top:
# [Space]  = refresh immediately
# k        = kill process (prompts for PID and signal)
# r        = renice (change priority of running process)
# u        = filter by username
# 1        = toggle per-CPU display
# M        = sort by memory (%MEM)
# P        = sort by CPU (%CPU, default)
# T        = sort by time (cumulative)
# N        = sort by PID
# i        = toggle idle processes
# c        = toggle command/full path
# f/F      = add/remove display fields
# q        = quit
# h        = help

Signals — Communicating with Processes

Signals are software interrupts sent to processes. When the kernel delivers a signal, the process either handles it (if it has registered a handler) or the default action is taken (for most signals, termination).

# List all signals:
# kill -l
# man 7 signal

# IMPORTANT SIGNALS:
# SIGHUP  (1)  = Hangup. Reload configuration without restart. Many daemons
#                (nginx, apache, syslogd) re-read config files on SIGHUP.
# SIGINT  (2)  = Interrupt. Same as Ctrl+C. Requests orderly shutdown.
# SIGQUIT (3)  = Quit. Like SIGINT but generates a core dump.
# SIGKILL (9)  = Kill. CANNOT be caught, blocked, or ignored. Immediate
#                termination. No cleanup. Use as last resort.
# SIGTERM (15) = Terminate. Default kill signal. CAN be caught.
#                Allows graceful shutdown (save state, close connections).
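# SIGCONT (18) = Continue. Resumes a stopped process.
# SIGSTOP (19) = Stop. CANNOT be caught, blocked, or ignored. Suspends process.
# SIGTSTP (20) = Terminal stop. Same as Ctrl+Z. CAN be caught.
# Some signal numbers, including 17 to 20, vary by architecture; the values
# shown are for x86 and ARM. Signal names are portable.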

# Send signals:
# kill PID                           # sends SIGTERM (15) by default
# kill -15 PID                       # same
# kill -9 PID                        # SIGKILL (force)
# kill -1 PID                        # SIGHUP (reload config)
# kill -TERM PID                     # named signal

# Kill by name (sends to all matching processes):
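# killall httpd                      # kill all processes named exactly httpd
# pkill httpd                        # kill all processes whose name matches the pattern httpd
# pkill -x httpd                     # exact name match only (closest to killall)
# pkill -9 -u raju                   # kill all processes owned by raju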

# Find PID by name:
# pidof httpd                        # returns PID(s)
# pgrep httpd                        # returns PID(s)
# pgrep -a httpd                     # with command line
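
Since SIGKILL is a last resort, a common pattern is to send SIGTERM, give the process a grace period, and escalate only if it is still running. The following is a minimal sketch of that pattern. graceful_kill is a hypothetical function name, the 10 second default is an arbitrary assumption, and "local" assumes a shell such as bash or dash.

```shell
# graceful_kill PID [TIMEOUT_SECONDS]
# Send SIGTERM, poll once per second, send SIGKILL if the timeout expires.
# Hypothetical helper for illustration only.
graceful_kill() {
    local pid=$1
    local timeout=${2:-10}

    # Fails if the process does not exist or we lack permission.
    kill -TERM "$pid" 2>/dev/null || return 1

    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal; it only checks that the PID can be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still present after the grace period: escalate.
    kill -0 "$pid" 2>/dev/null || return 0
    kill -KILL "$pid" 2>/dev/null
}
```

Typical use: graceful_kill 1234 30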

Process Priority — nice and renice

The Linux CFS (Completely Fair Scheduler) allocates CPU time based on a process's nice value and its scheduling policy. Nice values are a hint to the scheduler — they do not provide real-time guarantees.

# Nice value range: -20 (highest priority) to +19 (lowest priority)
# Default nice value: 0
# Only root (or a process with the CAP_SYS_NICE capability) can set negative nice values
# Normal users can only increase nice value (reduce priority)

# Start a process with adjusted priority:
# nice -n 10 /scripts/backup.sh     # lower priority (background work)
# nice -n -20 /scripts/critical.sh  # highest priority (root only)
# nice                              # with no arguments, prints the current nice value (0 by default)
# nice command                      # runs command with the nice value raised by 10

# Change priority of running process:
# renice -n 5 -p 1234               # set PID 1234 to nice=5
# renice +5 1234                    # same; util-linux renice sets an absolute value by default
# renice -5 1234                    # set nice=-5 (more priority, root only)
# renice +15 -u raju                # change ALL of raju's processes

# View nice values in top: NI column
# View in ps: ps -eo pid,ni,cmd
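
The effect of nice -n can be checked without any long-running job, because nice with no arguments prints the niceness it is running with. This is a small sketch; the variable names base and child are only for illustration.

```shell
# Niceness of the current shell.
base=$(nice)

# Start a command with the niceness raised by 5; here the command is
# nice itself, which reports the value it inherited.
child=$(nice -n 5 nice)

echo "shell niceness: $base, child niceness: $child"
```

The child value is the shell's value plus 5, capped at 19, which shows that nice -n applies an adjustment relative to the caller.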

Background Jobs and Job Control

# Run a command in the background:
# /scripts/long_backup.sh &
# [1] 12345                         # job number and PID

# Suspend foreground process:
# [Ctrl+Z]                          # sends SIGTSTP to current process
# [1]+ Stopped  /scripts/backup.sh

# List jobs in current shell:
# jobs                              # brief list
# jobs -l                           # with PIDs

# Resume job in background:
# bg %1                             # send job 1 to background
# bg                                # resume most recent

# Resume job in foreground:
# fg %1                             # bring job 1 to foreground
# fg                                # bring most recent

# Kill a job by number:
# kill %1

# Disown a job (remove from shell job table — survives shell exit):
# disown %1
# disown -a                         # disown all

# Run process immune to terminal hangup:
# nohup /scripts/backup.sh &        # continues after logout
# nohup /scripts/backup.sh > /var/log/backup.log 2>&1 &

# Modern alternative (systemd-run):
# systemd-run --unit=backup /scripts/backup.sh

lsof — List Open Files

lsof lists the files, sockets, devices, and pipes currently open by processes on the system. It is especially useful for:

Finding what process is using a port
Finding what process is preventing a filesystem unmount
Debugging "disk full" issues (deleted files held open)

# All open files:
# lsof | head -50

# Files opened by specific process:
# lsof -p 1234

# Files opened by specific user:
# lsof -u raju

# What is using port 80:
# lsof -i :80
# lsof -i tcp:80

# What is using a specific file:
# lsof /var/log/messages

# What is using a mount point (before unmounting):
# lsof /mnt/data

# Find processes with deleted files still held open:
# lsof | grep "(deleted)"
# This shows files that have been deleted but are still open
# They consume disk space until the process closes them or exits
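
The deleted-but-open situation is easy to reproduce safely. The sketch below assumes a Linux /proc filesystem and uses file descriptor 3 as an arbitrary choice: it opens a scratch file, deletes it, and shows that the descriptor still points at the now nameless data.

```shell
tmp=$(mktemp)                 # scratch file
exec 3>"$tmp"                 # keep it open on file descriptor 3
echo "still here" >&3
rm "$tmp"                     # removes the name; the data stays until fd 3 closes

# readlink inherits fd 3, so /proc/self/fd/3 resolves to the deleted file.
target=$(readlink /proc/self/fd/3)
echo "$target"                # path followed by " (deleted)"

exec 3>&-                     # closing the descriptor releases the disk space
```

This is the same condition lsof reports with "(deleted)": the space is returned only when the last open descriptor goes away.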

Zombie Processes — Causes and Resolution

A zombie is a process that has finished execution but whose entry remains in the process table because the parent has not called wait() to collect its exit status. Zombies:

Consume a process table slot (limited resource)
Cannot be killed with kill -9 (already dead)
Are automatically cleaned up when the parent dies (reparented to init/systemd which collects them)

# Find zombies (STAT may carry modifiers such as Z+ or Zs):
# ps aux | awk '$8 ~ /^Z/'
# ps -ef | grep defunct

# Count zombies:
# ps aux | awk '$8 ~ /^Z/' | wc -l

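# Resolution:
# 1. Terminate the parent process; its zombies are reparented to init, which reaps them
# kill PPID                          # SIGTERM first; kill -9 PPID only if that fails

# 2. If parent is critical (can't be killed):
# Send SIGCHLD to parent to prompt it to call wait() (works only if the parent handles SIGCHLD):
# kill -s CHLD PPID                  # SIGCHLD is 17 on x86/ARM; the name is portable

# 3. Last resort: reboot (all zombies are cleared)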
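
Because every resolution step targets the parent rather than the zombie, it helps to group zombies by PPID first. The following is a rough sketch; zombies_by_parent is a hypothetical helper name, and it assumes input columns in the order PID, PPID, STAT, command.

```shell
# zombies_by_parent: read "PID PPID STAT COMMAND" lines on stdin and print
# "PPID count" for every parent that has at least one zombie child.
# Hypothetical helper for illustration only.
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print p, n[p] }' | sort -n
}
```

Typical use: ps -eo pid=,ppid=,stat=,comm= | zombies_by_parent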

System Resource Monitoring

# Load average interpretation:
# Load average is the average number of processes that are runnable or in
# uninterruptible sleep (D state), measured over 1, 5, and 15 minute windows

# Rule of thumb:
# Load = nCPUs → 100% utilization, no waiting
# Load < nCPUs → fine
# Load > nCPUs → processes are waiting for CPU (overloaded)

# Check CPU count:
# nproc                              # logical CPUs
# lscpu                              # detailed CPU info
# grep "^processor" /proc/cpuinfo | wc -l
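
The rule of thumb above amounts to dividing the load average by the CPU count. A minimal sketch follows; load_status is a hypothetical helper name, and the "overloaded" label simply means the ratio is above 1, which is a heuristic rather than a diagnosis.

```shell
# load_status LOAD NCPUS
# Print the load-per-CPU ratio and a verdict based on the rule of thumb.
# Hypothetical helper for illustration only.
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        ratio = load / cpus
        verdict = (ratio > 1) ? "overloaded" : "ok"
        printf "%.2f %s\n", ratio, verdict
    }'
}
```

Typical use: load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"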

# Detailed CPU statistics:
# mpstat 1 5                         # per-CPU stats (requires sysstat)
# sar -u 1 5                         # CPU utilization over time

# Check if system is CPU or I/O bound:
# top: if %wa (I/O wait) > 20% → I/O bound
# top: if %us + %sy > 80% → CPU bound

# Memory details:
# cat /proc/meminfo                  # raw kernel memory info
# free -h                            # clean summary

# Process memory breakdown:
# grep -E "VmSize|VmRSS|VmSwap" /proc/PID/status
# VmSize = virtual memory (may be huge for JVM, databases — normal)
# VmRSS  = physical RAM actually in use
# VmSwap = how much is swapped out

Managing Services as Processes

# Each systemd service runs in its own cgroup:
# systemctl status httpd             # shows PID(s) and resource usage
# systemd-cgls                       # full cgroup tree

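# Resource limits per service:
# vim /etc/systemd/system/httpd.service.d/limits.conf
# (unit files do not support trailing comments, so each comment sits on its own line)
[Service]
# max open files
LimitNOFILE=65535
# max processes
LimitNPROC=512
# max memory (MemoryMax= on cgroup v2 systems; MemoryLimit= is the legacy cgroup v1 name)
MemoryMax=2G
# max CPU percentage
CPUQuota=50%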

# Reload service configuration:
# systemctl daemon-reload
# systemctl restart httpd