Enhancing GPU-HBM Data Transfer Efficiency Using Markov Chains and Neural Network-Driven Predictive Caching with Quantization and Pruning
Authors/Creators
- Research Assistant, Department of Computer Science, Georgia Institute of Technology, Atlanta, Georgia, USA.
Description
Abstract: Background: High-bandwidth memory (HBM) systems face persistent data-transfer bottlenecks, particularly when CPUs cannot supply data to GPUs at a sufficient rate. This limitation reduces overall computational efficiency and highlights the need for improved cache-management strategies. Methods: Markov chains represented transitions between frequently accessed memory blocks, enabling predictive sequencing of data needs. A neural network was then applied to model and optimise these Markov transitions, improving cache-prefetching accuracy and refining data-movement techniques. Results & Conclusions: The combined use of Markov-based memory modelling, NN optimisation, and supplementary data-transfer techniques demonstrates strong potential to mitigate CPU–GPU bandwidth limitations. Together, these methods offer more efficient cache utilisation and reduced bottlenecks in high-demand computational environments.
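The record contains no code, but the core idea in the abstract can be illustrated with a minimal sketch of first-order Markov-chain prefetch prediction. The class and variable names below are hypothetical, the trace is synthetic, and the paper's actual model (including the NN refinement, quantization, and pruning stages) is not reproduced here; this only shows how block-to-block transition counts can drive prefetch candidates.

```python
# Minimal illustrative sketch: a first-order Markov-chain prefetcher
# over block-granularity memory accesses. Hypothetical names throughout;
# not the paper's implementation.
from collections import defaultdict


class MarkovPrefetcher:
    """Predicts likely next memory blocks from observed block transitions."""

    def __init__(self):
        # transition_counts[a][b] = number of times block b followed block a
        self.transition_counts = defaultdict(lambda: defaultdict(int))
        self.prev_block = None

    def observe(self, block: int) -> None:
        """Record one memory access and update the transition counts."""
        if self.prev_block is not None:
            self.transition_counts[self.prev_block][block] += 1
        self.prev_block = block

    def predict(self, block: int, k: int = 2) -> list[int]:
        """Return up to k most frequent successor blocks as prefetch candidates."""
        successors = self.transition_counts.get(block, {})
        return sorted(successors, key=successors.get, reverse=True)[:k]


# Example: learn from a short synthetic access trace, then predict.
trace = [0, 4, 8, 0, 4, 8, 0, 4, 12]
prefetcher = MarkovPrefetcher()
for blk in trace:
    prefetcher.observe(blk)
print(prefetcher.predict(4))  # [8, 12] -- block 8 most often follows block 4
```

In the approach the abstract describes, a neural network would then refine these raw transition statistics to improve prefetch accuracy, with quantization and pruning (per the title) presumably keeping that model small enough for low-latency inference alongside the cache controller.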
Files
- F370015060126.pdf (902.7 kB, md5:074f2cd18d7ed160ff176ded7ced0f2b)
Additional details
Identifiers
- DOI: 10.35940/ijsce.F3700.15060126
- EISSN: 2231-2307
Dates
- Accepted: 2026-01-15
- Manuscript received on 19 November 2025; first revised manuscript received on 29 November 2025; second revised manuscript received on 08 December 2025; manuscript accepted on 15 January 2026; manuscript published on 30 January 2026.
References
- Joseph, D., & Grunwald, D. (2002, August 06). Prefetching using Markov predictors. IEEE Transactions on Computers. DOI: https://doi.org/10.1109/12.75265
- Jog, A., Kayiran, O., Mishra, A. K., Kandemir, M. T., Mutlu, O., Iyer, R., & Das, C. R. (2013, June 23). Orchestrated scheduling and prefetching for GPGPUs. Association for Computing Machinery. DOI: https://doi.org/10.1145/2485922.2485951
- Bauer, M., Cook, H., & Khailany, B. (2011, November 12). CudaDMA: Optimizing GPU memory bandwidth via warp specialization. Association for Computing Machinery. DOI: https://doi.org/10.1145/2063384.2063400
- Liang, T., Glossner, J., Wang, L., Shi, S., & Zhang, X. (2021, January 24). Pruning and quantization for deep neural network acceleration: A survey. arXiv. DOI: https://doi.org/10.48550/arXiv.2101.09671
- Shi, Z., Huang, X., Jain, A., & Lin, C. (2019, October 12). Applying deep learning to the cache replacement problem. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. DOI: https://doi.org/10.1145/3352460.3358319
- Chopra, B. (2024, May 7). Enhancing machine learning performance: The role of GPU-based AI computer architectures. Journal of Knowledge Learning and Science Technology, 3(3), 20-32. DOI: https://doi.org/10.60087/jklst.vol3.n3.p20-32
- Hou, J., Tao, T., Lu, H., & Nayak, A. (2023, June 22). Intelligent caching with graph neural network-based deep reinforcement learning on SDN-based ICN. Future Internet, 15(8), 251. DOI: https://doi.org/10.3390/fi15080251
- Bakhoda, A., Yuan, G. L., Fung, W. W. L., Wong, H., & Aamodt, T. M. (2009, April 1). Analyzing CUDA workloads using a detailed GPU simulator. 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: https://doi.org/10.1109/ISPASS.2009.4919648
- Liu, A., & Tucker, A. (1988). Applied Combinatorics. DOI: https://doi.org/10.1137/1030075
- Mittal, S. (2015, January 16). A survey of techniques for managing and leveraging caches in GPUs. Journal of Circuits, Systems and Computers, 23(08), 1430002. DOI: https://doi.org/10.1142/s0218126614300025