Building blocks for network-accelerated distributed file systems

Di Girolamo, Salvatore; De Sensi, Daniele; Taranov, Konstantin; Malesevic, Milos; Besta, Maciej; Schneider, Timo; Kistler, Severin; Hoefler, Torsten

doi:10.5555/3571885.3571898

Published November 18, 2022 | Version v1

Conference paper Open

1. ETH Zürich

High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives.

Files

Name	Size
3571885.3571898_Building-blocks.pdf md5:3df4e5199e40c43a231e92fbe459340e	1.4 MB

Additional details

Funding

European Commission: RED-SEA – Network Solution for Exascale Architectures (grant 955776)
European Commission: DEEP-SEA – DEEP – Software for Exascale Architectures (grant 955606)


