Lower number = higher clocks. For compute, expect P2 or P0. P8 during heavy work means idle, bottleneck, or throttling.
nvidia-smi -l 1 — live dashboard
Why run this?
To watch utilization, memory, temperature, and P‑state change live. Confirms if the GPU is actually busy and whether it’s throttling.
Flags
-l 1: loop/refresh every 1 second (use -lms for millisecond granularity).
Command (healthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00 Driver 555.32.00 CUDA 12.4 |
|-------------------------------+----------------------+----------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA H100 On | 00000000:65:00.0 Off | Off |
| 60% 75C P2 300W / 500W | 20480MiB / 80000MiB | 98% Default |
+-----------------------------------------------------------------------------+
| 5467 C python train_llm.py 20432MiB |
+-----------------------------------------------------------------------------+
High util, P2, temps below 80°C → normal training.
Command (unhealthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00 Driver 555.32.00 CUDA 12.4 |
|-------------------------------+----------------------+----------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA H100 On | 00000000:65:00.0 Off | Off |
| 90% 95C P8 150W / 500W | 78800MiB / 80000MiB | 12% Default |
+-----------------------------------------------------------------------------+
Low util + hot + P8 → thermal/power throttling or data starvation.
How to fix
- Lower temps: clean filters, increase airflow/fan curve, reduce ambient temp.
- Avoid VRAM spill: enable AMP, reduce batch size, gradient checkpointing.
- Speed up input pipeline: increase num_workers, enable pin_memory, pre-tokenize.
Sources / Test cases
- NVIDIA SMI manual: docs.nvidia.com/deploy/nvidia-smi
- NVTOP (real-time monitor): github.com/Syllo/nvtop
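If you want the same readings from inside a script (for example, to log them next to training-step timings), nvidia-smi's query mode is scriptable. A minimal sketch, assuming nvidia-smi is on PATH; the field list mirrors the dashboard columns above:

import subprocess, time

FIELDS = "utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,pstate"

def gpu_snapshot():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    # one line per GPU, e.g. "98, 20480, 80000, 75, 310.21, P2"
    return [line.split(", ") for line in out.stdout.strip().splitlines()]

while True:                       # or call gpu_snapshot() once per training step
    print(gpu_snapshot())
    time.sleep(1)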
nvidia-smi -q -d TEMPERATURE,PERFORMANCE — throttle reasons
Why run this?
To see if the GPU is slowing itself due to heat or power and which limiter is active.
Flags
-q: query detailed fields; -d: restrict to specific domains (TEMPERATURE, PERFORMANCE).
Command (healthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
Temperature
GPU Current Temp : 73 C
GPU Slowdown Temp : 90 C
Performance State : P2
Clocks Throttle Reasons
Power Limit : Not Active
Thermal Slowdown : Not Active
HW Slowdown : Not Active
Command (unhealthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
Temperature
GPU Current Temp : 91 C
GPU Slowdown Temp : 90 C
Performance State : P8
Clocks Throttle Reasons
Power Limit : Active
Thermal Slowdown : Active
HW Slowdown : Active
How to fix
- Reduce thermal load (airflow, fans, datacenter inlet temp); check paste/contacts.
- Verify provider power policies; request higher cap if allowed.
Sources / Test cases
- NVIDIA SMI manual fields & throttle reasons: docs.nvidia.com/deploy/nvidia-smi
- NVIDIA forum notes on thermal limits (T.Limit): forums.developer.nvidia.com
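The same throttle reasons are exposed programmatically through NVML as a bitmask. A minimal sketch, assuming the nvidia-ml-py/pynvml package is installed (constant names can vary slightly between pynvml versions):

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)   # bitmask of active limiters

checks = {
    "SW power cap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW slowdown":         pynvml.nvmlClocksThrottleReasonHwSlowdown,
}
print(f"GPU temp: {temp} C")
for name, bit in checks.items():
    print(f"{name}: {'Active' if reasons & bit else 'Not Active'}")

pynvml.nvmlShutdown()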
nvidia-smi -q -d POWER — power caps & draw
Why run this?
To confirm if a power limiter is holding clocks down (common in dense racks or provider-imposed caps).
Flags
-q: detailed query; -d POWER: power-only fields (draw, limit, throttle reason).
Command (healthy)
nvidia-smi -q -d POWER
Power Readings
Power Draw : 312.43 W
Power Limit : 500.00 W
Performance State : P2
Clocks Throttle Reasons
Power Limit : Not Active
Command (unhealthy)
nvidia-smi -q -d POWER
Power Readings
Power Draw : 148.12 W
Power Limit : 500.00 W
Performance State : P8
Clocks Throttle Reasons
Power Limit : Active
How to fix
- Request/raise the power limit within spec (sudo nvidia-smi -pl <watts>) if permitted.
- Reduce spikes with AMP and stable batch times; avoid CPU stalls that drop clocks.
Sources / Test cases
- nvidia-smi power fields & -pl: docs.nvidia.com/deploy/nvidia-smi
- DCGM active health/power monitoring for servers: docs.nvidia.com/datacenter/dcgm
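Power draw and the enforced cap are also available through NVML; a minimal sketch using the same bindings (NVML reports power in milliwatts):

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

draw_w  = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0           # mW → W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0   # currently enforced cap

print(f"Power draw : {draw_w:.1f} W")
print(f"Power limit: {limit_w:.1f} W")
if draw_w < 0.5 * limit_w:
    print("Well under the cap → if clocks are low, look at thermals or data starvation")

pynvml.nvmlShutdown()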
nvidia-smi topo -m — PCIe/NVLink topology
Why run this?
To detect slow host↔GPU links (e.g., Gen3x4) that starve the GPU when transferring batches or model shards.
Flags
topo -m: prints matrix/topology with link types (PCIe gen/width, NVLink status).
Command (healthy)
nvidia-smi topo -m
        GPU0   CPU Affinity   NUMA Affinity
GPU0     X     0-63           0
PCIe link: Gen4 x16
NVLink: 0 links active (single GPU)
Command (unhealthy)
nvidia-smi topo -m
        GPU0   CPU Affinity   NUMA Affinity
GPU0     X     0-63           0
PCIe link: Gen3 x4
NVLink: 0 links active
How to fix
- Select instances with Gen4 x16 (or NVLink for multi-GPU). On-prem: make sure the GPU sits in an x16 Gen4/5 slot; update the BIOS.
- Co-locate CPU affinity/NUMA with the GPU for better host↔device bandwidth.
Sources / Test cases
- nvidia-smi topology/NVLink refs: docs.nvidia.com/deploy/nvidia-smi (nvlink), explainer: Exxact NVLink guide
- NVIDIA Fabric Manager topology (H100/H200): Fabric Manager User Guide (PDF)
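To verify the link end to end, you can time a pinned host→device copy (nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current also reports the negotiated link directly). A minimal PyTorch sketch; as a rough guide, Gen4 x16 should land around 20+ GiB/s and Gen3 x4 closer to ~3 GiB/s:

import torch

assert torch.cuda.is_available()
x = torch.empty(1024**3, dtype=torch.uint8, pin_memory=True)   # 1 GiB pinned host buffer
d = torch.empty_like(x, device="cuda")

# warm-up, then time the host→device copy with CUDA events
d.copy_(x, non_blocking=True); torch.cuda.synchronize()
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
d.copy_(x, non_blocking=True)
end.record()
torch.cuda.synchronize()
gib_per_s = 1.0 / (start.elapsed_time(end) / 1000.0)           # 1 GiB / elapsed seconds
print(f"H2D bandwidth: {gib_per_s:.1f} GiB/s")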
nvtop — live process/util monitor
Why run this?
Instant feedback on GPU util vs CPU util to spot data-loader starvation (GPU idle while CPU 90–100%).
Flags
- No flags needed; install via package manager and run.
Command (healthy)
sudo apt-get update && sudo apt-get install -y nvtop
nvtop
GPU0 [█████████████████████████████████████████████] 97%
Mem : 20.0/80.0 GiB Temp: 75°C Power: 310W
CPU: 18% MEM: 22.3 GiB / 128.0 GiB
Command (unhealthy)
nvtop
GPU0 [███ ] 12%
Mem : 78.8/80.0 GiB Temp: 90°C Power: 150W (throttled)
CPU: 98% MEM: 118.7 GiB / 128.0 GiB
How to fix
- Increase num_workers, enable pin_memory, and prefetch to keep the GPU fed.
- Pre-tokenize/serialize datasets to avoid CPU tokenization in the training loop.
Sources / Test cases
- NVTOP repo: github.com/Syllo/nvtop
- Alternative monitor: github.com/XuehaiPan/nvitop
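To confirm the loader (and not the model) is what keeps GPU utilization low, time how long each step waits for the next batch; a minimal sketch, assuming loader, model, optimizer, and device are defined as in the sections below:

import time
import torch

wait_s, compute_s = 0.0, 0.0
t0 = time.perf_counter()
for xb, yb in loader:
    t1 = time.perf_counter()
    wait_s += t1 - t0                 # time spent waiting on the DataLoader

    loss = model(xb.to(device), yb.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()          # count the GPU work as compute time

    t0 = time.perf_counter()
    compute_s += t0 - t1

print(f"data wait: {wait_s:.1f}s   compute: {compute_s:.1f}s")
# wait >> compute → the input pipeline is the bottleneck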
A) DataLoader throughput
Why
GPU underutilized & CPU pegged → data loader is the bottleneck. Multiple workers and pinned memory reduce H2D copy latency.
Healthy High-throughput
loader = DataLoader(dataset,
                    batch_size=64, shuffle=True,
                    num_workers=8,            # parallel CPU workers
                    pin_memory=True,          # page-locked host RAM for fast copies
                    persistent_workers=True,  # keep workers alive
                    prefetch_factor=4)
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
Unhealthy Starved GPU
loader = DataLoader(dataset,
                    batch_size=64, shuffle=True,
                    num_workers=0,     # main thread only
                    pin_memory=False)  # pageable → slower copies
for xb, yb in loader:
    xb = xb.to(device)  # blocking copies
    yb = yb.to(device)
Sources / Test cases
- PyTorch DataLoader docs: docs.pytorch.org/data
- Community guidance on pin_memory/num_workers: discuss.pytorch.org/813
B) Mixed Precision (AMP)
Why
Cuts memory footprint (~½ for activations) and speeds up kernels on Tensor Cores; reduces VRAM pressure that can cause spills and throttling.
Healthy AMP on
scaler = torch.cuda.amp.GradScaler()
for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(xb, yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Unhealthy FP32-only
for xb, yb in loader:
    xb = xb.to(device); yb = yb.to(device)
    optimizer.zero_grad()
    # No autocast/GradScaler → higher VRAM & slower
    loss = model(xb, yb)
    loss.backward()
    optimizer.step()
Sources / Test cases
- PyTorch AMP guide: docs.pytorch.org/amp
- AMP recipe with measured speedups: pytorch.org/amp_recipe
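On Ampere/Hopper GPUs (such as the H100 above), bf16 autocast is a common alternative that needs no GradScaler because bf16 keeps fp32's exponent range; a minimal sketch, assuming model returns the loss as in the examples above:

import torch

for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(xb, yb)
    loss.backward()                   # no scaling needed: bf16 does not underflow like fp16
    optimizer.step()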
C) Gradient Checkpointing
Why
Trades extra compute for lower memory; prevents VRAM spill that kills utilization.
Healthy Fit bigger models
from torch.utils.checkpoint import checkpoint

def fwd(x): return block(x)
out = checkpoint(fwd, x, use_reentrant=False)  # recompute activations during backward
Unhealthy OOM & spill
out = block(x) # CUDA OOM → host spill → 1–10% GPU util
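For Hugging Face transformers models the same technique is a one-liner via PreTrainedModel.gradient_checkpointing_enable(); a minimal sketch ("gpt2" is just a placeholder checkpoint id):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder checkpoint id
model.gradient_checkpointing_enable()                  # recompute activations in backward
model.config.use_cache = False                         # KV cache is incompatible with checkpointing during training
model.to(device)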
D) Stable tensor shapes
Why
Stable shapes improve kernel fusion & caching; erratic shapes introduce overhead and allocator churn.
Healthy Pad to fixed L
batch = pad_to_length(batch, L=4096)
mask = build_mask(L=4096)
out = model(batch, mask)
Unhealthy Varying shapes
out = model(batch, mask) # different lengths each step → slower
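One concrete way to get stable shapes is to pad/truncate every batch to the same length at tokenization time; a minimal sketch with a Hugging Face fast tokenizer (the checkpoint name is a placeholder, and max_length=4096 mirrors L=4096 above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-checkpoint")           # placeholder checkpoint id
tok.pad_token = tok.pad_token or tok.eos_token                   # GPT-style tokenizers have no pad token by default

enc = tok(texts,                                                 # texts: list[str] batch
          padding="max_length", truncation=True, max_length=4096,
          return_tensors="pt")
out = model(input_ids=enc["input_ids"].to(device),
            attention_mask=enc["attention_mask"].to(device))     # fixed (batch, 4096) shapes every step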
E) Tokenizer throughput
Why
Tokenization is CPU-bound; doing it in the training loop starves the GPU.
Healthy Fast tokenizer + pretokenize
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # model_name: checkpoint id or path
# Pre-tokenize to files; training reads token IDs directly.
Unhealthy Slow regex tokenizer inline
# Python regex tokenizer inside the training loop
# GPU waits while CPU chews text.
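A common way to pre-tokenize offline is datasets.map with multiple processes, writing token IDs to disk so the training loop never touches raw text; a minimal sketch, assuming the Hugging Face datasets library and a "text" column (checkpoint and file names are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-checkpoint")          # placeholder checkpoint id
ds = load_dataset("json", data_files="corpus.jsonl")            # placeholder corpus file

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=4096)

tokenized = ds.map(tokenize, batched=True, num_proc=8,          # parallel CPU tokenization, done once
                   remove_columns=["text"])
tokenized.save_to_disk("tokenized_corpus")                      # training reads token IDs directly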
Profiling / Proof tools
- NVIDIA Nsight Systems (nsys profile python train.py): Nsight Systems User Guide & Get Started
- PyTorch Profiler (torch.profiler) + TensorBoard traces.
- Hugging Face fast tokenizers: github.com/huggingface/tokenizers
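A minimal torch.profiler sketch for proving where a step goes (the schedule numbers and the ./tb_logs directory are arbitrary; open the resulting trace in TensorBoard):

import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
             on_trace_ready=tensorboard_trace_handler("./tb_logs"),
             record_shapes=True) as prof:
    for step, (xb, yb) in enumerate(loader):
        loss = model(xb.to(device), yb.to(device))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        prof.step()                   # advance the wait/warmup/active schedule
        if step >= 5:
            break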