Lower number = higher clocks. For compute, expect P2 or P0. P8 during heavy work means idle, bottleneck, or throttling.
nvidia-smi -l 1 — live dashboard
Why run this?
To watch utilization, memory, temperature, and P‑state change live. Confirms if the GPU is actually busy and whether it’s throttling.
Flags
-l 1: loop/refresh every 1 second (use -lms for millisecond granularity).
Command (healthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00 Driver 555.32.00 CUDA 12.4 |
|-------------------------------+----------------------+----------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA H100 On | 00000000:65:00.0 Off | Off |
| 60% 75C P2 300W / 500W | 20480MiB / 80000MiB | 98% Default |
+-----------------------------------------------------------------------------+
| 5467 C python train_llm.py 20432MiB |
+-----------------------------------------------------------------------------+
High util, P2, temps below 80°C → normal training.
Command (unhealthy)
nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 555.32.00 Driver 555.32.00 CUDA 12.4 |
|-------------------------------+----------------------+----------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA H100 On | 00000000:65:00.0 Off | Off |
| 90% 95C P8 150W / 500W | 78800MiB / 80000MiB | 12% Default |
+-----------------------------------------------------------------------------+
Low util + hot + P8 → thermal/power throttling or data starvation.
How to fix
- Lower temps: clean filters, increase airflow/fan curve, reduce ambient temp.
- Avoid VRAM spill: enable AMP, reduce batch size, gradient checkpointing.
- Speed up input pipeline: increase num_workers, enable pin_memory, pre-tokenize.
Sources / Test cases
- NVIDIA SMI manual: docs.nvidia.com/deploy/nvidia-smi
- NVTOP (real-time monitor): github.com/Syllo/nvtop
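If you want the same readings from inside a script (for example, to log them next to training-step timings), nvidia-smi's query mode is scriptable. A minimal sketch, assuming nvidia-smi is on PATH; the field list mirrors the dashboard columns above:

import subprocess, time

FIELDS = "utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,pstate"

def gpu_snapshot():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    # one line per GPU, e.g. "98, 20480, 80000, 75, 310.21, P2"
    return [line.split(", ") for line in out.stdout.strip().splitlines()]

while True:                       # or call gpu_snapshot() once per training step
    print(gpu_snapshot())
    time.sleep(1)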
nvidia-smi -q -d TEMPERATURE,PERFORMANCE — throttle reasons
Why run this?
To see if the GPU is slowing itself due to heat or power and which limiter is active.
Flags
-q: query detailed fields; -d: restrict to specific domains (TEMPERATURE, PERFORMANCE).
Command (healthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
Temperature
GPU Current Temp : 73 C
GPU Slowdown Temp : 90 C
Performance State : P2
Clocks Throttle Reasons
Power Limit : Not Active
Thermal Slowdown : Not Active
HW Slowdown : Not Active
Command (unhealthy)
nvidia-smi -q -d TEMPERATURE,PERFORMANCE
GPU 00000000:65:00.0
Temperature
GPU Current Temp : 91 C
GPU Slowdown Temp : 90 C
Performance State : P8
Clocks Throttle Reasons
Power Limit : Active
Thermal Slowdown : Active
HW Slowdown : Active
How to fix
- Reduce thermal load (airflow, fans, datacenter inlet temp); check paste/contacts.
- Verify provider power policies; request higher cap if allowed.
Sources / Test cases
- NVIDIA SMI manual fields & throttle reasons: docs.nvidia.com/deploy/nvidia-smi
- NVIDIA forum notes on thermal limits (T.Limit): forums.developer.nvidia.com
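The same throttle reasons are exposed programmatically through NVML as a bitmask. A minimal sketch, assuming the nvidia-ml-py/pynvml package is installed (constant names can vary slightly between pynvml versions):

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)   # bitmask of active limiters

checks = {
    "SW power cap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW slowdown":         pynvml.nvmlClocksThrottleReasonHwSlowdown,
}
print(f"GPU temp: {temp} C")
for name, bit in checks.items():
    print(f"{name}: {'Active' if reasons & bit else 'Not Active'}")

pynvml.nvmlShutdown()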
nvidia-smi -q -d POWER — power caps & draw
Why run this?
To confirm if a power limiter is holding clocks down (common in dense racks or provider-imposed caps).
Flags
-q: detailed query; -d POWER: power-only fields (draw, limit, throttle reason).
Command (healthy)
nvidia-smi -q -d POWER
Power Readings
Power Draw : 312.43 W
Power Limit : 500.00 W
Performance State : P2
Clocks Throttle Reasons
Power Limit : Not Active
Command (unhealthy)
nvidia-smi -q -d POWER
Power Readings
Power Draw : 148.12 W
Power Limit : 500.00 W
Performance State : P8
Clocks Throttle Reasons
Power Limit : Active
How to fix
- Request/raise the power limit within spec (sudo nvidia-smi -pl <watts>) if permitted.
- Reduce spikes with AMP and stable batch times; avoid CPU stalls that drop clocks.
Sources / Test cases
- nvidia-smi power fields & -pl: docs.nvidia.com/deploy/nvidia-smi
- DCGM active health/power monitoring for servers: docs.nvidia.com/datacenter/dcgm
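Power draw and the enforced cap are also available through NVML; a minimal sketch using the same bindings (NVML reports power in milliwatts):

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

draw_w  = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0           # mW → W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0   # currently enforced cap

print(f"Power draw : {draw_w:.1f} W")
print(f"Power limit: {limit_w:.1f} W")
if draw_w < 0.5 * limit_w:
    print("Well under the cap → if clocks are low, look at thermals or data starvation")

pynvml.nvmlShutdown()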
nvidia-smi topo -m — PCIe/NVLink topology
Why run this?
To detect slow host↔GPU links (e.g., Gen3x4) that starve the GPU when transferring batches or model shards.
Flags
topo -m: prints matrix/topology with link types (PCIe gen/width, NVLink status).
Command (healthy)
nvidia-smi topo -m
        GPU0   CPU Affinity   NUMA Affinity
GPU0     X     0-63           0
PCIe link: Gen4 x16
NVLink: 0 links active (single GPU)
Command (unhealthy)
nvidia-smi topo -m
        GPU0   CPU Affinity   NUMA Affinity
GPU0     X     0-63           0
PCIe link: Gen3 x4
NVLink: 0 links active
How to fix
- Select instances with Gen4 x16 (or NVLink for multi-GPU). On-prem: make sure the GPU sits in an x16 Gen4/5 slot; update the BIOS.
- Co-locate CPU affinity/NUMA with the GPU for better host↔device bandwidth.
Sources / Test cases
- nvidia-smi topology/NVLink refs: docs.nvidia.com/deploy/nvidia-smi (nvlink), explainer: Exxact NVLink guide
- NVIDIA Fabric Manager topology (H100/H200): Fabric Manager User Guide (PDF)
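To verify the link end to end, you can time a pinned host→device copy (nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current also reports the negotiated link directly). A minimal PyTorch sketch; as a rough guide, Gen4 x16 should land around 20+ GiB/s and Gen3 x4 closer to ~3 GiB/s:

import torch

assert torch.cuda.is_available()
x = torch.empty(1024**3, dtype=torch.uint8, pin_memory=True)   # 1 GiB pinned host buffer
d = torch.empty_like(x, device="cuda")

# warm-up, then time the host→device copy with CUDA events
d.copy_(x, non_blocking=True); torch.cuda.synchronize()
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
start.record()
d.copy_(x, non_blocking=True)
end.record()
torch.cuda.synchronize()
gib_per_s = 1.0 / (start.elapsed_time(end) / 1000.0)           # 1 GiB / elapsed seconds
print(f"H2D bandwidth: {gib_per_s:.1f} GiB/s")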
nvtop — live process/util monitor
Why run this?
Instant feedback on GPU util vs CPU util to spot data-loader starvation (GPU idle while CPU 90–100%).
Flags
- No flags needed; install via package manager and run.
Command (healthy)
sudo apt-get update && sudo apt-get install -y nvtop
nvtop
GPU0 [█████████████████████████████████████████████] 97%
Mem : 20.0/80.0 GiB Temp: 75°C Power: 310W
CPU: 18% MEM: 22.3 GiB / 128.0 GiB
Command (unhealthy)
nvtop
GPU0 [███ ] 12%
Mem : 78.8/80.0 GiB Temp: 90°C Power: 150W (throttled)
CPU: 98% MEM: 118.7 GiB / 128.0 GiB
How to fix
- Increase num_workers, enable pin_memory, and prefetch to keep the GPU fed.
- Pre-tokenize/serialize datasets to avoid CPU tokenization in the training loop.
Sources / Test cases
- NVTOP repo: github.com/Syllo/nvtop
- Alternative monitor: github.com/XuehaiPan/nvitop
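To confirm the loader (and not the model) is what keeps GPU utilization low, time how long each step waits for the next batch; a minimal sketch, assuming loader, model, optimizer, and device are defined as in the sections below:

import time
import torch

wait_s, compute_s = 0.0, 0.0
t0 = time.perf_counter()
for xb, yb in loader:
    t1 = time.perf_counter()
    wait_s += t1 - t0                 # time spent waiting on the DataLoader

    loss = model(xb.to(device), yb.to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()          # count the GPU work as compute time

    t0 = time.perf_counter()
    compute_s += t0 - t1

print(f"data wait: {wait_s:.1f}s   compute: {compute_s:.1f}s")
# wait >> compute → the input pipeline is the bottleneck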
A) DataLoader throughput
Why
GPU underutilized & CPU pegged → data loader is the bottleneck. Multiple workers and pinned memory reduce H2D copy latency.
Healthy High-throughput
loader = DataLoader(dataset,
                    batch_size=64, shuffle=True,
                    num_workers=8,            # parallel CPU workers
                    pin_memory=True,          # page-locked host RAM for fast copies
                    persistent_workers=True,  # keep workers alive
                    prefetch_factor=4)
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
Unhealthy Starved GPU
loader = DataLoader(dataset,
                    batch_size=64, shuffle=True,
                    num_workers=0,     # main thread only
                    pin_memory=False)  # pageable → slower copies
for xb, yb in loader:
    xb = xb.to(device)  # blocking copies
    yb = yb.to(device)
Sources / Test cases
- PyTorch DataLoader docs: docs.pytorch.org/data
- Community guidance on pin_memory/num_workers: discuss.pytorch.org/813
B) Mixed Precision (AMP)
Why
Cuts memory footprint (~½ for activations) and speeds up kernels on Tensor Cores; reduces VRAM pressure that can cause spills and throttling.
Healthy AMP on
scaler = torch.cuda.amp.GradScaler()
for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(xb, yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Unhealthy FP32-only
for xb, yb in loader:
    xb = xb.to(device); yb = yb.to(device)
    optimizer.zero_grad()
    # No autocast/GradScaler → higher VRAM & slower
    loss = model(xb, yb)
    loss.backward()
    optimizer.step()
Sources / Test cases
- PyTorch AMP guide: docs.pytorch.org/amp
- AMP recipe with measured speedups: pytorch.org/amp_recipe
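On Ampere/Hopper GPUs (such as the H100 above), bf16 autocast is a common alternative that needs no GradScaler because bf16 keeps fp32's exponent range; a minimal sketch, assuming model returns the loss as in the examples above:

import torch

for xb, yb in loader:
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(xb, yb)
    loss.backward()                   # no scaling needed: bf16 does not underflow like fp16
    optimizer.step()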
C) Gradient Checkpointing
Why
Trades extra compute for lower memory; prevents VRAM spill that kills utilization.
Healthy Fit bigger models
from torch.utils.checkpoint import checkpoint

def fwd(x): return block(x)
out = checkpoint(fwd, x, use_reentrant=False)  # recompute activations during backward
Unhealthy OOM & spill
out = block(x) # CUDA OOM → host spill → 1–10% GPU util
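For Hugging Face transformers models the same technique is a one-liner via PreTrainedModel.gradient_checkpointing_enable(); a minimal sketch ("gpt2" is just a placeholder checkpoint id):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder checkpoint id
model.gradient_checkpointing_enable()                  # recompute activations in backward
model.config.use_cache = False                         # KV cache is incompatible with checkpointing during training
model.to(device)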
D) Stable tensor shapes
Why
Stable shapes improve kernel fusion & caching; erratic shapes introduce overhead and allocator churn.
Healthy Pad to fixed L
batch = pad_to_length(batch, L=4096)
mask = build_mask(L=4096)
out = model(batch, mask)
Unhealthy Varying shapes
out = model(batch, mask) # different lengths each step → slower
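One concrete way to get stable shapes is to pad/truncate every batch to the same length at tokenization time; a minimal sketch with a Hugging Face fast tokenizer (the checkpoint name is a placeholder, and max_length=4096 mirrors L=4096 above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-checkpoint")           # placeholder checkpoint id
tok.pad_token = tok.pad_token or tok.eos_token                   # GPT-style tokenizers have no pad token by default

enc = tok(texts,                                                 # texts: list[str] batch
          padding="max_length", truncation=True, max_length=4096,
          return_tensors="pt")
out = model(input_ids=enc["input_ids"].to(device),
            attention_mask=enc["attention_mask"].to(device))     # fixed (batch, 4096) shapes every step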
E) Tokenizer throughput
Why
Tokenization is CPU-bound; doing it in the training loop starves the GPU.
Healthy Fast tokenizer + pretokenize
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # model_name: checkpoint id or path
# Pre-tokenize to files; training reads token IDs directly.
Unhealthy Slow regex tokenizer inline
# Python regex tokenizer inside the training loop
# GPU waits while CPU chews text.
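A common way to pre-tokenize offline is datasets.map with multiple processes, writing token IDs to disk so the training loop never touches raw text; a minimal sketch, assuming the Hugging Face datasets library and a "text" column (checkpoint and file names are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-checkpoint")          # placeholder checkpoint id
ds = load_dataset("json", data_files="corpus.jsonl")            # placeholder corpus file

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=4096)

tokenized = ds.map(tokenize, batched=True, num_proc=8,          # parallel CPU tokenization, done once
                   remove_columns=["text"])
tokenized.save_to_disk("tokenized_corpus")                      # training reads token IDs directly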
Profiling / Proof tools
- NVIDIA Nsight Systems (nsys profile python train.py): Nsight Systems User Guide & Get Started
- PyTorch Profiler (torch.profiler) + TensorBoard traces.
- Hugging Face fast tokenizers: github.com/huggingface/tokenizers
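A minimal torch.profiler sketch for proving where a step goes (the schedule numbers and the ./tb_logs directory are arbitrary; open the resulting trace in TensorBoard):

import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
             on_trace_ready=tensorboard_trace_handler("./tb_logs"),
             record_shapes=True) as prof:
    for step, (xb, yb) in enumerate(loader):
        loss = model(xb.to(device), yb.to(device))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        prof.step()                   # advance the wait/warmup/active schedule
        if step >= 5:
            break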