GPU Management ============== TomoGUI uses one or more GPUs for TomoCuPy reconstruction and for AI Reco inference. Single-file mode ---------------- On the Main tab, the **GPU** dropdown picks one CUDA device index. TomoCuPy and AI Reco are invoked with ``CUDA_VISIBLE_DEVICES=``. Batch mode (unified queue) -------------------------- The *Number of GPUs* field on the Advanced Config tab controls parallelism for every batch operation — *Batch Try*, *Batch Full*, and each phase of *Batch AI Reco*. All three share the same dispatcher (``_run_batch_with_queue``): - *N* GPU slots are available. - When a slot is free and the queue is non-empty, a subprocess is spawned for one file, with ``CUDA_VISIBLE_DEVICES`` pinned to that slot's GPU. - When the subprocess exits, the slot is returned to the pool and the next file dispatches immediately. This means a single stuck file only blocks one GPU slot, not a whole chunk of files; the other GPUs keep draining the queue. .. figure:: /_static/screenshots/advanced_config_tab.png :alt: Advanced Config tab :align: center Per-phase subprocesses ~~~~~~~~~~~~~~~~~~~~~~ - **Try / Full** — ``tomocupy --file-name --rotation-axis …`` - **Inference** — ``python -m tomogui._infer_worker `` Both are launched as ``QProcess`` objects, with stdout streamed to the log and parsed (for inference, ``[infer-worker] OK => `` updates the row's COR cell live). Monitoring ---------- During a run you will see: - the progress bar advancing as files complete - per-row status updates (``Queued`` → ``Running on GPU N`` → ``Inferred`` / ``Done`` / ``Failed`` / ``Uploaded``) - streaming log lines such as ``[infer-worker] OK /data/.../sample_0042.h5 => 1024.3`` Use ``nvidia-smi`` on the reconstruction host to confirm all requested GPUs are busy. If only one GPU shows activity, check: - Advanced Config *Number of GPUs* is > 1 - ``nvidia-smi`` lists every GPU (driver issue otherwise) - no ``CUDA_VISIBLE_DEVICES`` is set in the parent shell before launching TomoGUI — it caps what child processes can see. Remote GPUs ----------- When *Remote host* is set in Advanced Config, the batch queue SSHes to the remote host and runs subprocesses there. The GPU count refers to GPUs on the **remote** host. See :doc:`ssh_setup`.