# GPU Support
GaussMLE.jl includes GPU acceleration through CUDA.jl and KernelAbstractions.jl. A contention-aware scheduling system enables reliable GPU usage in multi-process environments such as shared GPU servers running parallel analysis scripts.
## Quick Start
```julia
using GaussMLE

# Auto-detect backend (default: uses GPU if available, falls back to CPU)
fitter = GaussMLEConfig()

# Force CPU
fitter = GaussMLEConfig(backend = :cpu)

# Force GPU (errors if unavailable after timeout)
fitter = GaussMLEConfig(backend = :gpu)

# Auto with custom timeout
fitter = GaussMLEConfig(backend = :auto, auto_timeout = 300.0)
```

## Backend Selection
The `backend` parameter controls compute device selection:

| Backend | Behavior | On timeout |
|---|---|---|
| `:auto` (default) | Try GPU, fall back to CPU | Warning + CPU fallback |
| `:cpu` | Always CPU, no GPU interaction | N/A |
| `:gpu` | Require GPU | Error |
## Timeout Configuration
| Field | Default | Description |
|---|---|---|
| `auto_timeout` | `300.0` | Seconds to wait for a GPU in `:auto` mode |
| `gpu_timeout` | `Inf` | Seconds to wait in `:gpu` mode |
| `on_wait` | `nothing` | Progress callback `(elapsed, available, required) -> nothing` |
```julia
# Auto mode: wait up to 5 min (default), then fall back to CPU
fitter = GaussMLEConfig(backend = :auto, auto_timeout = 300.0)

# Explicit GPU: wait up to 60 s, error if still unavailable
fitter = GaussMLEConfig(backend = :gpu, gpu_timeout = 60.0)

# Progress callback for monitoring waits
fitter = GaussMLEConfig(
    backend = :auto,
    auto_timeout = 60.0,
    on_wait = (elapsed, available, required) ->
        @info "Waiting for GPU" elapsed=round(elapsed, digits=1) available=Base.format_bytes(available)
)
```

## GPU Scheduling System
### The Problem
When multiple Julia processes share a GPU server, naive GPU device selection causes failures:
- **Context creation OOM**: `CUDA.device!(i)` creates a CUDA context (~300 MB to 6 GB), which can OOM if another process is using the GPU
- **Memory pool retention**: CUDA.jl's memory pool holds freed arrays, making NVML report memory as "used" even when no GPU arrays exist (see the sketch below)
- **TOCTOU races**: free memory looks sufficient during the check, but another process grabs it before context creation completes
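The pool-retention behavior is easy to observe directly with stock CUDA.jl calls (a minimal sketch; the array size is arbitrary):

```julia
using CUDA

a = CUDA.rand(Float32, 250_000_000)   # allocate ~1 GB on the GPU
a = nothing
GC.gc()                               # array is freed -- into CUDA.jl's pool

# At this point nvidia-smi / NVML still report the ~1 GB as "used":
# the memory sits in the pool, not back with the driver.

CUDA.reclaim()                        # hand pooled memory back to the driver
```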
### Solution: Unified Retry Loop
GaussMLE uses a single retry loop in `_run_mle_kernel!` that handles all GPU failure modes uniformly:
```
while !gpu_succeeded && before deadline:
    1. Poll NVML for an available GPU (no CUDA context needed)
    2. Check contention: skip if other_procs > 0 AND (low memory OR compute > 90%)
    3. Try:
         CUDA.device!(dev)       # Create context on selected GPU
         <process all batches>   # Run fitting kernel
         CUDA.reclaim()          # Return pooled memory to OS
         gpu_succeeded = true
    4. Catch:
         release CUDA context    # Free reserved memory
         loop back to step 1     # Re-poll NVML for next available GPU
```

After the loop: `:auto` falls back to CPU, `:gpu` errors.
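The same shape can be sketched in plain Julia. Here `select_gpu_via_nvml` and `process_batches!` are hypothetical stand-ins for GaussMLE internals, and `CUDA.device_reset!` only approximates the internal context release; `CUDA.device!` and `CUDA.reclaim` are used as-is from CUDA.jl:

```julia
using CUDA

# Hypothetical helpers, standing in for GaussMLE internals:
#   select_gpu_via_nvml() -> device index, or `nothing` if all GPUs are contended
#   process_batches!(results, data) -> runs the fitting kernel over all batches
function fit_with_retry!(results, data; deadline = time() + 300.0)
    while time() < deadline
        dev = select_gpu_via_nvml()          # steps 1-2: poll NVML, skip contended GPUs
        if dev === nothing
            sleep(0.5 + 0.1 * rand())        # jittered backoff before the next poll
            continue
        end
        try
            CUDA.device!(dev)                # step 3: create a context on the selected GPU
            process_batches!(results, data)  # run the fitting kernel over all batches
            CUDA.reclaim()                   # return pooled memory to the OS
            return true                      # GPU path succeeded
        catch
            CUDA.device_reset!()             # step 4: drop the context, free reserved memory
        end                                  # then loop: re-poll NVML for the next GPU
    end
    return false                             # caller falls back to CPU (:auto) or errors (:gpu)
end
```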
### Three Failure Modes
The loop handles three distinct GPU failure scenarios:

1. **No free memory** - the NVML poll waits with jittered backoff (0.5 s intervals). All GPUs are scanned each tick; whichever frees up first is used.
2. **TOCTOU context creation race** - NVML shows sufficient free memory, but `CUDA.device!()` fails because another process grabbed the GPU between the check and context creation. The context is released and NVML is re-polled.
3. **Runtime OOM** - GPU allocation or kernel execution fails mid-batch. The context is released and all batches are restarted on a re-acquired GPU (partial results from a failed CUDA context are unreliable).
### NVML Contention Detection
Pre-flight GPU checks use NVIDIA Management Library (NVML) queries instead of CUDA API calls. NVML does not create CUDA contexts, so it is safe to call from multiple processes simultaneously.
A GPU is considered contended when:

- other processes are present (`compute_processes > 0`, excluding our own PID), AND
- resources are threatened: free memory < 1.5x required, OR compute utilization > 90%

If other processes are present but memory is ample and utilization is low, the GPU is still selected (the other processes are likely idle).
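The rule reduces to a small predicate. A minimal sketch over plain numbers (the function name and signature are illustrative, not GaussMLE's internal API; in the package these values come from NVML queries):

```julia
# Illustrative contention predicate over raw NVML-style readings.
# free_bytes / required_bytes are in bytes, utilization_pct in percent.
function is_contended(n_other_procs, free_bytes, required_bytes, utilization_pct)
    n_other_procs == 0 && return false          # no other processes: GPU is ours
    low_memory   = free_bytes < 1.5 * required_bytes
    busy_compute = utilization_pct > 90
    return low_memory || busy_compute           # others present AND resources threatened
end

is_contended(0, 2 * 10^9, 10^9, 95)  # false: no other processes
is_contended(2, 2 * 10^9, 10^9, 10)  # false: others present, but idle with ample memory
is_contended(2, 10^9, 10^9, 10)      # true: free memory < 1.5x required
```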
### Context Lifecycle
**On failure**: `_release_gpu_context()` calls `cuDevicePrimaryCtxRelease` to destroy the CUDA context and free reserved memory. `CUDA.reclaim()` is NOT called after context release because it requires an active context and would either hang or silently re-create the released context.

**On success**: `CUDA.reclaim()` is called after all GPU batches complete to return pooled memory to the OS. Without this, NVML reports the memory as used even though no GPU arrays exist, blocking other processes waiting for GPU availability.
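In public CUDA.jl terms the two paths look roughly as follows (a sketch; `CUDA.device_reset!()` resets the primary context and is only a public approximation of the internal release call):

```julia
using CUDA

# Success path: after the last batch, hand pooled memory back to the driver
# so NVML (and other processes) see it as free again.
CUDA.reclaim()

# Failure path: drop the device context entirely. Do NOT call CUDA.reclaim()
# after this point, since it needs an active context.
CUDA.device_reset!()
```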
### Multi-GPU Support
On systems with multiple GPUs, each poll iteration scans all devices via NVML. The first GPU with sufficient free memory and low contention is selected, which naturally load-balances across GPUs.
## Batch Processing
For datasets larger than GPU memory, GaussMLE automatically batches the data:
```julia
# Configure batch size (default: 10,000 ROIs per batch)
fitter = GaussMLEConfig(
    backend = :gpu,
    batch_size = 5000  # Process 5000 ROIs at a time
)

# Large dataset - automatically batched
smld, info = fit(large_data, fitter)
println("Processed in $(info.n_batches) batches of $(info.batch_size)")
```

Tune batch size based on:
- Available GPU memory (larger batches = more memory)
- ROI size (larger ROIs need smaller batches)
- Number of parameters (more params = more memory per ROI)
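As a starting point, a rough batch size can be derived from free GPU memory. This is a sketch: `suggested_batch_size` is not a GaussMLE function, and the bytes-per-ROI figure is an illustrative guess you would calibrate for your ROI size and model:

```julia
using GaussMLE, CUDA

# Back-of-envelope batch sizing. `bytes_per_roi` assumes 16x16 Float32 ROIs
# with ~8x working-set overhead -- an illustrative guess, not a GaussMLE constant.
function suggested_batch_size(; bytes_per_roi = 16 * 16 * sizeof(Float32) * 8,
                              headroom = 0.5)
    free_bytes = CUDA.available_memory()   # free memory on the active device
    max(1_000, floor(Int, headroom * free_bytes / bytes_per_roi))
end

fitter = GaussMLEConfig(backend = :gpu, batch_size = suggested_batch_size())
```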
## Performance

### Benchmarking Your Hardware
Run `Pkg.test("GaussMLE")` locally to benchmark your hardware. The test suite includes a comprehensive performance benchmark that tests all model/camera/device combinations and reports fits/second along with MLE optimality (std/CRLB ratio).
GPU throughput scales with batch size and varies significantly by GPU model. Typical speedups range from 100-200x over single-threaded CPU for large batches.
### When to Use GPU
GPU is beneficial for:
- Large datasets (>10,000 fits)
- Batch processing of multiple files
- Real-time analysis requirements
CPU may be better for:
- Small datasets (<1,000 fits)
- Systems without capable GPUs
- Debugging and development
### Performance Tips
- **Use Float32** - native GPU type, sufficient precision for localization
- **Fixed sigma** - `GaussianXYNB` is ~20% faster than variable-sigma models
- **Warm-up** - the first GPU call includes kernel compilation overhead
- **Appropriate batch size** - too small incurs transfer overhead, too large may OOM
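For benchmarking, the warm-up point matters most: fit a small throwaway batch first so compilation is not attributed to the real run (a sketch; `small_rois` and `rois` are placeholders for your data):

```julia
using GaussMLE

fitter = GaussMLEConfig(backend = :gpu)

# Warm-up: the first call compiles the GPU kernel.
fit(small_rois, fitter)

# Timed run on the full dataset, now free of compilation overhead.
smld, info = @time fit(rois, fitter)
```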
### Result Consistency
GPU and CPU results are numerically consistent (differences at machine precision ~1e-6 microns). The same unified kernel runs on both devices via KernelAbstractions.jl.
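A quick way to check this on your own data (a sketch; the `x` field name is an assumption about the SMLD layout, so check `BasicSMLD` for the actual fields):

```julia
using GaussMLE

smld_cpu, _ = fit(rois, GaussMLEConfig(backend = :cpu))
smld_gpu, _ = fit(rois, GaussMLEConfig(backend = :gpu))

# Fitted positions should agree to roughly Float32 machine precision.
@assert all(isapprox.(smld_cpu.x, smld_gpu.x; atol = 1e-5))
```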
## Troubleshooting

### CUDA Not Available
```julia
using CUDA
CUDA.functional()  # Should return true
```

If `false`: check NVIDIA driver installation, GPU hardware, and CUDA.jl configuration.
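For more detail than the boolean, CUDA.jl's built-in diagnostic report is often the fastest way to spot a driver or toolkit mismatch:

```julia
using CUDA

CUDA.versioninfo()  # driver, runtime, and toolkit versions plus visible devices
```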
### Out of Memory
```julia
# Reduce batch size
fitter = GaussMLEConfig(backend = :gpu, batch_size = 2000)

# Or use auto mode for automatic CPU fallback
fitter = GaussMLEConfig(backend = :auto, auto_timeout = 5.0)
```

### Multi-Process Contention
If parallel scripts deadlock waiting for GPU:

- Ensure all scripts use `backend = :auto` (not `:gpu`) for graceful fallback
- Set reasonable `auto_timeout` values (the default of 5 minutes is usually sufficient)
- Check `nvidia-smi` for zombie processes holding GPU memory
## Architecture

### Unified Kernel Design
GaussMLE uses KernelAbstractions.jl for portable GPU/CPU code:
```julia
@kernel function unified_gaussian_mle_kernel!(...)
    # Same code runs on CPU and GPU
end
```
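GaussMLE's kernel body is not reproduced here, but the portability pattern KernelAbstractions.jl provides looks like the following standalone sketch (not GaussMLE code):

```julia
using KernelAbstractions

# A trivial portable kernel: one work-item per element, same source on CPU and GPU.
@kernel function square!(out, @Const(a))
    i = @index(Global)
    out[i] = a[i]^2
end

a   = rand(Float32, 1024)   # swap in a CuArray to target the GPU instead
out = similar(a)

backend = KernelAbstractions.get_backend(a)     # CPU() here; CUDABackend() for CuArrays
square!(backend)(out, a; ndrange = length(a))   # launch with one work-item per element
KernelAbstractions.synchronize(backend)
```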
### Data Flow

1. **Input**: ROI data on CPU (`Float32`)
2. **Transfer**: copy batch to GPU memory
3. **Compute**: run the fitting kernel
4. **Transfer**: copy results back to CPU
5. **Reclaim**: return GPU memory pool to OS
6. **Repeat**: next batch (steps 2-5)
7. **Output**: `BasicSMLD` with fitted parameters
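The per-batch loop (steps 2-5) can be sketched with plain CUDA.jl array operations. Here `run_kernel!` is a hypothetical stand-in for the unified fitting kernel, and the (H, W, N) ROI stack and (P, N) results layout are assumptions for illustration:

```julia
using CUDA

# `rois` is assumed to be an (H, W, N) array of ROIs; `results` a (P, N) matrix.
function fit_batched!(results, rois; batch_size = 10_000)
    n = size(rois, 3)
    for lo in 1:batch_size:n
        hi      = min(lo + batch_size - 1, n)
        d_batch = CuArray(rois[:, :, lo:hi])               # 2. transfer batch to GPU
        d_res   = CUDA.zeros(Float32, size(results, 1), hi - lo + 1)
        run_kernel!(d_res, d_batch)                        # 3. compute on device
        results[:, lo:hi] .= Array(d_res)                  # 4. transfer results back
    end
    CUDA.reclaim()                                         # 5. return pooled memory to OS
    return results
end
```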