Ultralytics YOLO

Jocher, Glenn; Qiu, Jing; Chaurasia, Ayush

doi:10.5281/zenodo.21171689

Published July 3, 2026 | Version v8.4.87

Software Open

1. Ultralytics

🌟 Summary

🚀 Ultralytics v8.4.87 delivers a cleaner, safer GPU device-selection system, plus stability and performance fixes for training, inference, tracking, exports, and dataset checks.

📊 Key Changes

Clean-sheet CUDA device selection 🧭
- Added parse_device() to normalize device inputs such as "cuda:0", "0,1", lists/tuples, torch.device objects, and -1 for idle-GPU auto-selection.
- select_device() no longer mutates CUDA_VISIBLE_DEVICES, making device selection predictable across repeated calls and long-running Python processes.
- Explicit single-GPU requests now use torch.cuda.set_device() instead of environment-variable remapping.
- Trainer, DDP setup, validation, autobatch, and distributed barriers now consistently use resolved CUDA device indices.
- Added documentation for ultralytics.utils.torch_utils.parse_device.
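
The input forms listed above could be normalized roughly as follows. This is a minimal sketch, not the library's parse_device(): the helper name normalize_device and its return convention (a list of CUDA indices, empty for CPU) are assumptions made for illustration, and -1 idle-GPU auto-selection is deliberately left out.

```python
def normalize_device(device):
    """Hypothetical helper: map a device spec to a list of CUDA indices ([] means CPU)."""
    if device is None:
        return []
    if isinstance(device, int):
        indices = [device]
    elif isinstance(device, (list, tuple)):
        indices = [int(d) for d in device]
    else:
        # str() also covers torch.device objects, which print as "cuda:1" or "cpu".
        text = str(device).lower().replace(" ", "").strip("()[]")
        if text in ("", "cpu"):
            return []
        if text == "cuda":
            return [0]  # assumption: a bare "cuda" means the first GPU
        indices = [int(part) for part in text.replace("cuda:", "").split(",") if part]
    if any(i < 0 for i in indices):
        # Idle-GPU auto-selection (-1) is outside the scope of this sketch.
        raise ValueError(f"negative device index not handled here: {device!r}")
    return indices


print(normalize_device("cuda:0"))  # single GPU given as a string
print(normalize_device("0,1"))     # comma-separated multi-GPU string
print(normalize_device((0, 1)))    # tuple of indices
```

With indices resolved this way, a single-GPU request can be applied with torch.cuda.set_device(index) instead of rewriting CUDA_VISIBLE_DEVICES, which matches the behavior these notes describe.
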
Stronger GPU training tests 🧪
- Added a cold-process nonzero-GPU training test to better match real CLI and Ultralytics Platform training behavior.
- Verifies that training on a nonzero GPU, such as device=1, works from a fresh process without relying on earlier CUDA initialization.
Fixed DataLoader worker cleanup at training shutdown 🧹
- Added a close() method to InfiniteDataLoader.
- Training now explicitly shuts down persistent train and validation workers before Python exits.
- Helps prevent end-of-run "DataLoader worker ... killed by signal: Terminated" errors after results are already saved.
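
The explicit-shutdown pattern can be illustrated without PyTorch. The sketch below is an assumption-laden stand-in, not InfiniteDataLoader itself: PersistentLoader and train are hypothetical names, and a thread pool plays the role of persistent DataLoader workers.

```python
from concurrent.futures import ThreadPoolExecutor


class PersistentLoader:
    """Hypothetical stand-in for a loader whose workers live across epochs."""

    def __init__(self, items, workers=2):
        self.items = list(items)
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.closed = False

    def __iter__(self):
        # The same pool is reused every epoch, like persistent workers.
        return iter(self.pool.map(lambda x: x * 2, self.items))

    def close(self):
        """Shut the workers down explicitly; safe to call more than once."""
        if not self.closed:
            self.pool.shutdown(wait=True)
            self.closed = True


def train(loader, epochs=2):
    """Consume the loader for a few epochs, then always release its workers."""
    total = 0
    try:
        for _ in range(epochs):
            total += sum(loader)
    finally:
        loader.close()  # do not leave cleanup to interpreter exit
    return total


loader = PersistentLoader([1, 2, 3])
print(train(loader), loader.closed)
```

Putting close() in a finally block means workers are released whether training finishes normally or raises, so nothing is left for interpreter shutdown to terminate.
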
Improved inference warmup for standard NMS ⚡
- AutoBackend.warmup() now preloads torchvision for non-end-to-end models.
- This helps later non-max suppression calls use faster torchvision NMS when appropriate, reducing first-inference latency after warmup.
Corrected dataset file-speed reporting 💾
- Fixed an inverted condition in check_file_speeds().
- Slow storage, such as network-mounted datasets, should now trigger the intended warning instead of being incorrectly reported as "Fast image access ✅".
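
The corrected direction of the comparison can be shown with a small sketch. It is not the code of check_file_speeds(): the function names and the 50 MB/s threshold are illustrative assumptions.

```python
import time
from pathlib import Path

SLOW_MB_PER_S = 50.0  # illustrative threshold, not the library's value


def read_speed_mb_per_s(paths):
    """Read each file once and return the average throughput in MB/s."""
    start = time.perf_counter()
    total_bytes = sum(len(Path(p).read_bytes()) for p in paths)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total_bytes / 1e6 / elapsed


def speed_message(mb_per_s, slow_below=SLOW_MB_PER_S):
    """Warn when throughput is below the threshold.

    An inverted comparison here would label slow storage as fast.
    """
    if mb_per_s < slow_below:
        return f"WARNING: slow image access ({mb_per_s:.1f} MB/s)"
    return f"Fast image access ({mb_per_s:.1f} MB/s)"


print(speed_message(5.0))
print(speed_message(500.0))
```
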
Tracking ReID device alignment 🎯
- Trackers now pass the predictor device into ReID encoders.
- ReID models are initialized and run on the same device as prediction where applicable, improving consistency for tracking workflows.
Export reliability improvements 📦
- TensorFlow SavedModel export now distinguishes CUDA vs non-CUDA export paths more carefully.
- CPU exports hide TensorFlow GPUs where possible to avoid unnecessary GPU memory use.
- ONNX Runtime and Paddle dependency checks now better handle interchangeable CPU/GPU package variants to avoid unnecessary or conflicting installs.
- Paddle export now uses the actual export device to decide whether GPU Paddle is needed.
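
One way to treat interchangeable package variants as satisfying the same requirement is to accept whichever distribution is already installed. The sketch below uses only the standard library. The helper names first_installed and needs_install are hypothetical, and this is not the dependency check the library performs.

```python
from importlib import metadata

# CPU and GPU builds of ONNX Runtime are published as separate distributions.
ORT_VARIANTS = ("onnxruntime", "onnxruntime-gpu")


def first_installed(candidates):
    """Return (name, version) of the first installed distribution, or None."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def needs_install(candidates):
    """True only when none of the interchangeable variants is present."""
    return first_installed(candidates) is None


print(first_installed(ORT_VARIANTS))
```

Checking the whole group before installing avoids pulling in a second variant alongside one that already satisfies the requirement.
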

🎯 Purpose & Impact

More reliable GPU behavior 🚀
- Device selection should now behave the same on every call when training, validating, predicting, or exporting repeatedly in the same Python session.
- This is especially important for notebooks, services, CI, distributed training, and production systems where changing CUDA_VISIBLE_DEVICES mid-process can cause hard-to-debug issues.
Better support for nonzero GPU training 🖥️
- Training on GPUs other than cuda:0 is now more robust, including the cold-start CLI usage common in production and Ultralytics Platform environments.
Cleaner shutdowns after training ✅
- Persistent DataLoader workers are now cleaned up explicitly, reducing noisy shutdown crashes and improving confidence that completed runs exit cleanly.
Lower latency after warmup ⚡
- Standard detection workflows should see lower first-inference latency after warmup, because the faster torchvision NMS path is already loaded when it is first needed.
More accurate dataset diagnostics 📊
- Users with slow disks or network storage will receive correct warnings, helping them identify dataset I/O bottlenecks that can slow training.
More consistent tracking and export workflows 🔄
- ReID tracking components now better follow the selected prediction device.
- Export paths are less likely to allocate unwanted GPU memory or install conflicting runtime packages.

What's Changed

Add cold-process nonzero-device GPU train test by @glenn-jocher in https://github.com/ultralytics/ultralytics/pull/25019
Fix inverted read-speed condition in dataset file speed check by @ahmet-f-gumustas in https://github.com/ultralytics/ultralytics/pull/25025
Fix leaked dataloader workers at end of training (atexit killed by signal: Terminated crash) by @Bovey0809 in https://github.com/ultralytics/ultralytics/pull/25024
Preload torchvision during warmup for non-end2end NMS path by @Y-T-G in https://github.com/ultralytics/ultralytics/pull/25023
Clean-sheet device selection: stop mutating CUDA_VISIBLE_DEVICES by @glenn-jocher in https://github.com/ultralytics/ultralytics/pull/25021

Full Changelog: https://github.com/ultralytics/ultralytics/compare/v8.4.86...v8.4.87

Notes

If you use this software, please cite it using the metadata from this file.

Files

Files (3.2 MB)

Name	Size
ultralytics/ultralytics-v8.4.87.zip (md5:0b335361b376c53e3b7235cb7af03f7e)	3.2 MB

Additional details

Is supplement to: Software: https://github.com/ultralytics/ultralytics/tree/v8.4.87 (URL)

Repository URL: https://github.com/ultralytics/ultralytics

	All versions	This version
Views	25,928	3
Downloads	3,848	0
Data volume	8.8 GB	0 Bytes
