Ultralytics YOLO
Description
Summary
Ultralytics v8.4.87 delivers a cleaner, safer GPU device-selection system, plus stability and performance fixes for training, inference, tracking, exports, and dataset checks.
Key Changes
Clean-sheet CUDA device selection
- Added `parse_device()` to normalize device inputs such as `cuda:0`, `0,1`, lists/tuples, `torch.device`, and `-1` idle-GPU auto-selection.
- `select_device()` no longer mutates `CUDA_VISIBLE_DEVICES`, making device selection predictable across repeated calls and long-running Python processes.
- Explicit single-GPU requests now use `torch.cuda.set_device()` instead of environment-variable remapping.
- Trainer, DDP setup, validation, autobatch, and distributed barriers now consistently use resolved CUDA device indices.
- Added documentation for `ultralytics.utils.torch_utils.parse_device`.
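The kind of normalization described above can be sketched in plain Python. This is an illustration only, not the Ultralytics implementation: `normalize_device_spec` is a hypothetical name, it does not reproduce the signature or behavior of `parse_device()`, and it deliberately leaves out `torch.device` inputs and the `-1` idle-GPU auto-selection.

```python
def normalize_device_spec(device):
    """Return a list of CUDA indices, or [] for CPU.

    Hypothetical helper for illustration; not the Ultralytics API.
    """
    if isinstance(device, (list, tuple)):
        parts = [str(d) for d in device]
    else:
        text = str(device).strip().lower()
        if text == "cpu":
            return []
        parts = text.split(",")

    indices = []
    for part in parts:
        part = part.strip().lower()
        if part.startswith("cuda:"):
            part = part[len("cuda:"):]  # "cuda:0" -> "0"
        if not part.isdigit():
            # Negative values such as -1 (auto-selection) are not modeled here.
            raise ValueError(f"unsupported device spec: {device!r}")
        index = int(part)
        if index not in indices:  # drop duplicates, keep the requested order
            indices.append(index)

    if not indices:
        raise ValueError(f"empty device spec: {device!r}")
    return indices
```

Resolving every accepted form to one list of indices up front is what lets downstream code (trainer, validation, distributed setup) agree on the same device without consulting environment variables.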
Stronger GPU training tests
- Added a cold-process nonzero-GPU training test to better match real CLI and Ultralytics Platform training behavior.
- Verifies that training on GPUs like `device=1` or higher works correctly from a fresh process without relying on previous CUDA initialization.
Fixed DataLoader worker cleanup at training shutdown
- Added a `close()` method to `InfiniteDataLoader`.
- Training now explicitly shuts down persistent train and validation workers before Python exits.
- Helps prevent end-of-run `DataLoader worker ... killed by signal: Terminated` errors after results are already saved.
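The shutdown pattern described above can be illustrated with a standard-library analogue. The class below is a hypothetical stand-in built on a thread pool, not the actual `InfiniteDataLoader` and not PyTorch's `DataLoader`; it only shows the idea of keeping workers alive across epochs and releasing them explicitly once training finishes.

```python
from concurrent.futures import ThreadPoolExecutor


class InfiniteLoaderSketch:
    """Hypothetical loader that reuses the same workers on every epoch."""

    def __init__(self, items, num_workers=2):
        self.items = list(items)
        self._pool = ThreadPoolExecutor(max_workers=num_workers)  # persistent workers

    @staticmethod
    def _load(item):
        return item * 2  # placeholder for decoding or augmenting a sample

    def __iter__(self):
        # One pass over the data; the pool is not recreated between passes.
        if self._pool is None:
            raise RuntimeError("loader is closed")
        yield from self._pool.map(self._load, self.items)

    def close(self):
        """Shut down the workers explicitly; safe to call more than once."""
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None

    @property
    def closed(self):
        return self._pool is None


def train(loader, epochs=2):
    """Run a toy training loop and always release the workers afterwards."""
    try:
        return [sum(loader) for _ in range(epochs)]
    finally:
        loader.close()  # do not leave cleanup to interpreter exit
```

Calling `close()` in a `finally` block means the workers are stopped in a known order, before interpreter teardown begins, which is the situation the release note describes as producing spurious termination errors.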
Improved inference warmup for standard NMS
- `AutoBackend.warmup()` now preloads `torchvision` for non-end-to-end models.
- This helps later non-max suppression calls use faster `torchvision` NMS when appropriate, reducing first-inference latency after warmup.
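A rough sketch of the preloading idea, assuming the cost being avoided is a lazy import on the first real call. `preload_modules` is a hypothetical helper, not part of `AutoBackend`; in the scenario above the module name passed in would be `torchvision`.

```python
import importlib


def preload_modules(names):
    """Import each named module if it is available.

    Returns the names that were imported successfully. Hypothetical helper
    for illustration; not the Ultralytics warmup code.
    """
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # optional dependency missing: callers keep their fallback path
        loaded.append(name)
    return loaded
```

Once a module has been imported, later `import` statements for it resolve from `sys.modules`, so the import cost is paid during warmup instead of during the first timed inference.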
Corrected dataset file-speed reporting
- Fixed an inverted condition in `check_file_speeds()`.
- Slow storage, such as network-mounted datasets, should now trigger the intended warning instead of being incorrectly reported as "Fast image access".
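To make the direction of the comparison concrete, here is a minimal sketch of a read-speed check. The function names, the message wording, and the 50 MB/s default threshold are assumptions for illustration and are not taken from `check_file_speeds()`.

```python
import time


def measure_read_speed(path):
    """Return the observed read throughput for one file in MB/s."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        size = len(f.read())
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return size / (1 << 20) / elapsed


def speed_message(read_mb_per_s, min_mb_per_s=50.0):
    """Classify a measured throughput against a hypothetical threshold."""
    # Slow storage must take the warning branch; the inverted form of this
    # comparison (`>`) would warn on fast disks and praise slow ones.
    if read_mb_per_s < min_mb_per_s:
        return f"WARNING: slow image access ({read_mb_per_s:.1f} MB/s)"
    return f"Fast image access ({read_mb_per_s:.1f} MB/s)"
```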
Tracking ReID device alignment
- Trackers now pass the predictor device into ReID encoders.
- ReID models are initialized and run on the same device as prediction where applicable, improving consistency for tracking workflows.
Export reliability improvements
- TensorFlow SavedModel export now distinguishes CUDA vs non-CUDA export paths more carefully.
- CPU exports hide TensorFlow GPUs where possible to avoid unnecessary GPU memory use.
- ONNX Runtime and Paddle dependency checks now better handle interchangeable CPU/GPU package variants to avoid unnecessary or conflicting installs.
- Paddle export now uses the actual export device to decide whether GPU Paddle is needed.
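The dependency-check idea for interchangeable package variants can be sketched with the standard library. `first_installed` is a hypothetical helper and does not mirror the Ultralytics requirement checker; it shows one way to treat, for example, `onnxruntime` and `onnxruntime-gpu` as alternatives so that an installed variant is accepted instead of triggering a second install.

```python
from importlib import metadata


def first_installed(candidates, get_version=metadata.version):
    """Return (name, version) of the first installed candidate, or None.

    `get_version` is injectable so the lookup can be replaced in tests.
    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            return name, get_version(name)
        except metadata.PackageNotFoundError:
            continue  # this variant is absent; try the next interchangeable one
    return None
```

A caller would install a package only when `first_installed(["onnxruntime", "onnxruntime-gpu"])` returns `None`, and could choose which variant to install based on the export device.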
Purpose & Impact
More reliable GPU behavior
- Users should see fewer surprises when training, validating, predicting, or exporting repeatedly in the same Python session.
- This is especially important for notebooks, services, CI, distributed training, and production systems where changing `CUDA_VISIBLE_DEVICES` mid-process can cause hard-to-debug issues.
Better support for nonzero GPU training
- Training on GPUs beyond `CUDA:0` is now more robust, including cold-start CLI usage common in production and Ultralytics Platform environments.
Cleaner shutdowns after training
- Persistent DataLoader workers are now cleaned up explicitly, reducing noisy shutdown crashes and improving confidence that completed runs exit cleanly.
Lower latency after warmup
- Standard detection workflows can benefit from smoother post-warmup inference performance by ensuring faster NMS paths are ready when needed.
More accurate dataset diagnostics
- Users with slow disks or network storage will receive correct warnings, helping them identify dataset I/O bottlenecks that can slow training.
More consistent tracking and export workflows
- ReID tracking components now better follow the selected prediction device.
- Export paths are less likely to allocate unwanted GPU memory or install conflicting runtime packages.
What's Changed
- Add cold-process nonzero-device GPU train test by @glenn-jocher in https://github.com/ultralytics/ultralytics/pull/25019
- Fix inverted read-speed condition in dataset file speed check by @ahmet-f-gumustas in https://github.com/ultralytics/ultralytics/pull/25025
- Fix leaked dataloader workers at end of training (atexit `killed by signal: Terminated` crash) by @Bovey0809 in https://github.com/ultralytics/ultralytics/pull/25024
- Preload torchvision during warmup for non-end2end NMS path by @Y-T-G in https://github.com/ultralytics/ultralytics/pull/25023
- Clean-sheet device selection: stop mutating `CUDA_VISIBLE_DEVICES` by @glenn-jocher in https://github.com/ultralytics/ultralytics/pull/25021
Full Changelog: https://github.com/ultralytics/ultralytics/compare/v8.4.86...v8.4.87
Files (3.2 MB)
| Name | Size |
|---|---|
| ultralytics/ultralytics-v8.4.87.zip (md5:0b335361b376c53e3b7235cb7af03f7e) | 3.2 MB |
Additional details
Related works
- Is supplement to
- Software: https://github.com/ultralytics/ultralytics/tree/v8.4.87 (URL)
Software
- Repository URL: https://github.com/ultralytics/ultralytics