Introduction
Current high-performance computing (HPC) systems increasingly exploit heterogeneous computing nodes to improve latency and energy efficiency for specific computation tasks [1, 2]. The communication patterns driven by modern workloads exhibit temporal bursts and spatial non-uniformity [3, 4], yet today’s interconnection networks based on electronic switches and optical fibers are inherently rigid: they cannot change the network topology or link bandwidth to cope with significant variations in traffic patterns. It is therefore desirable to design a bandwidth-reconfigurable interconnection network that can adapt its connectivity to varying traffic demands [5-7].
Recent advances in silicon photonic (SiPh) integrated reconfigurable wavelength routing and space switching allow the connectivity to be redefined on demand in both the spectral and spatial domains. Indeed, wavelength-and-space selective switching fabrics that can reconfigure the bandwidth between selected pairs of input and output ports have been demonstrated [8, 9]. Recently, we proposed and demonstrated a SiPh bandwidth-reconfigurable all-to-all interconnection switch, ‘Flexible Low-Latency Interconnect Optical Network Switch (Flex-LIONS),’ enabled by combining all-to-all interconnection through an arrayed waveguide grating router (AWGR) with multi-wavelength selective switches [10]. While Flex-LIONS offers superior scalability and energy consumption compared with other proposed architectures (see [10] for more details), specific reconfiguration policies and algorithms at the network and application layers are still needed to take advantage of this physical-layer reconfiguration capability. In particular, we are interested in exploiting emerging AI techniques to address the challenges related to reconfiguration policies.
Reconfigurable Architecture with Machine-learning-based Cognitive Control Plane
Figure 1 shows the architecture that we are currently investigating. We call this architecture Hyper-Flex-LIONS: it leverages Flex-LIONS to enhance a Dragonfly-like topology with unique optical reconfiguration capabilities within a group and between groups, driven by an observe-analyze-act cycle that exploits deep learning techniques. Optical reconfiguration is achieved using the SiPh Flex-LIONS technology discussed below.
Figure 2 illustrates the working principle of Flex-LIONS. The SiPh Flex-LIONS has an N-port AWGR and b microring resonator (MRR) add-drop filters at each AWGR input/output port. For uniform traffic, all MRR add-drop filters can be set off-resonance so that each input port provides N wavelength division multiplexing (WDM) signals that interconnect it with all N output ports, according to the all-to-all wavelength routing property of the AWGR [12]. For other traffic patterns, the MRR filters can be tuned into resonance to select specific wavelength channels to be switched by the multi-wavelength switch (for the SiPh chip shown in Figure 2, the multi-wavelength switch is implemented as an MRR crossbar [10]). This effectively creates a different topology and increases by a factor of b the bandwidth between the port pairs connected through the multi-wavelength switch.
Figure 3 depicts the network control and management (NC&M) framework of Hyper-Flex-LIONS. Each group is equipped with a group manager that manages the data plane operations within the group, using a software-defined networking (SDN) paradigm. In addition, Hyper-Flex-LIONS employs an inter-group manager at a higher level of the hierarchy, which is responsible for managing inter-group reconfiguration. The group and inter-group managers apply advanced machine learning (ML) technologies (i.e., deep reinforcement learning) at different time scales to achieve knowledge-based cognitive networking, forming a hierarchical observe-analyze-act paradigm similar to the interworking of brain and reflexes.
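The all-to-all wavelength routing property of the AWGR can be illustrated with a short sketch. It assumes the commonly used cyclic convention in which a signal entering input port i on wavelength index k exits output port (i + k) mod N; the port-to-wavelength mapping of an actual device may differ, and the function names are hypothetical.

```python
from __future__ import annotations


def awgr_output_port(input_port: int, wavelength: int, n_ports: int) -> int:
    """Output port reached from input_port on a wavelength index.

    Assumed cyclic routing rule; a real AWGR may use a different mapping.
    """
    return (input_port + wavelength) % n_ports


def connectivity(n_ports: int) -> list[list[int]]:
    """Count the wavelengths linking each (input, output) port pair
    when every input launches all n_ports wavelengths and all MRR
    add-drop filters are off-resonance."""
    links = [[0] * n_ports for _ in range(n_ports)]
    for i in range(n_ports):
        for k in range(n_ports):
            links[i][awgr_output_port(i, k, n_ports)] += 1
    return links


if __name__ == "__main__":
    # Every entry is 1: one wavelength per port pair, i.e. all-to-all.
    for row in connectivity(4):
        print(row)
```

Under this assumption every port pair is served by exactly one wavelength, which is the uniform-traffic configuration described above. The sketch does not model the MRR filters or the multi-wavelength switch.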
Preliminary Results
We used the OMNeT++ simulator and TensorFlow to simulate the deep reinforcement learning (DRL)-based reconfigurable Flex-LIONS architecture. We assumed 16 Top-of-Rack (ToR) switches interconnected with one 16-port Flex-LIONS. The DRL algorithm can choose among four possible topologies. We used time-varying traffic consisting of four patterns that appear periodically: adversarial, neighbor exchange, inter-group all-to-all, and intra-group all-to-all (a group is composed of four racks). For training, the four traffic patterns and the four network topologies are all included among the input features of the deep neural networks (DNNs). The DNN models consist of two convolutional layers and five fully connected layers, and each layer contains 128 neurons.
Figure 4 (Left) shows how the reward value converges during training, indicating that the DRL agent learns to keep the end-to-end network delay low. The convergence behavior also depends on the learning rate. We compared our DRL-based reconfigurable architecture with different fixed networks in terms of average end-to-end delay [see Figure 4 (Right)]. The proposed DRL-based reconfigurable architecture achieves the lowest average network latency at all packet injection rates.
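The idea of learning which topology to select for each traffic pattern can be sketched with a toy example. This is not the DNN model described above: it swaps in a tabular, one-step value update with epsilon-greedy exploration, uses the current traffic pattern as the only state, and replaces the OMNeT++ simulation with a made-up delay table in which topology t is assumed to suit pattern t best. All names and numbers are hypothetical.

```python
import random

PATTERNS = [
    "adversarial",
    "neighbor exchange",
    "inter-group all-to-all",
    "intra-group all-to-all",
]
N_TOPOLOGIES = 4

# Hypothetical delays (arbitrary units), NOT simulation results:
# topology t is assumed to give the lowest delay for pattern t.
DELAY = [
    [1.0 if t == p else 4.0 for t in range(N_TOPOLOGIES)]
    for p in range(len(PATTERNS))
]


def train(steps=2000, lr=0.5, epsilon=0.2, seed=0):
    """Learn a value table q[pattern][topology] with reward = -delay."""
    rng = random.Random(seed)
    q = [[0.0] * N_TOPOLOGIES for _ in PATTERNS]
    for step in range(steps):
        p = step % len(PATTERNS)  # the patterns appear periodically
        if rng.random() < epsilon:
            t = rng.randrange(N_TOPOLOGIES)  # explore a random topology
        else:
            t = max(range(N_TOPOLOGIES), key=lambda a: q[p][a])  # exploit
        reward = -DELAY[p][t]
        q[p][t] += lr * (reward - q[p][t])  # move estimate toward reward
    return q


def greedy_policy(q):
    """Topology index with the highest learned value for each pattern."""
    return [max(range(N_TOPOLOGIES), key=lambda a: row[a]) for row in q]


if __name__ == "__main__":
    for name, topo in zip(PATTERNS, greedy_policy(train())):
        print(f"{name}: topology {topo}")
```

The reward is defined as the negative delay, so maximizing reward corresponds to minimizing end-to-end delay. A DNN-based agent generalizes this lookup table to richer inputs such as measured traffic features.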
Ongoing testbed work
Figure 5 shows our in-progress efforts to evaluate the proposed architecture and reconfiguration algorithms on a real testbed that combines research-grade photonic interconnect prototypes with commercial top-of-rack switches, servers, and open-source software for the network control and management plane and for application management.
Current Research Opportunities
We are currently seeking Master's students, PhD students, and postdoctoral researchers with a variety of skills who are interested in working on architectures, algorithms, and hands-on testbed work to demonstrate innovative ideas and solutions in the context of the above research topic. Interested candidates should send their resumes to sbyoo@ucdavis.edu or rproietti@ucdavis.edu.
REFERENCES
[1] Mittal, S., Vetter, J. S.: ‘A survey of CPU-GPU heterogeneous computing techniques’, ACM Computing Surveys (CSUR) 47.4 (2015): 69.
[2] Schulte, M. J., Ignatowski, M., Loh, G. H., et al.: ‘Achieving exascale capabilities through heterogeneous computing’, IEEE Micro 35.4 (2015): 26-36
[3] Roy, A., Zeng, H., Bagga, J., et al.: ‘Inside the social network’s (datacenter) network’, ACM SIGCOMM Computer Communication Review. Vol. 45. No. 4. ACM, 2015.
[4] Zhang, Q., Liu, V., Zeng, H., et al.: ‘High-resolution measurement of data center microbursts’. Proceedings of the 2017 Internet Measurement Conference. ACM, 2017
[5] Cao, Z., Proietti, R., Clements, M., et al.: ‘Experimental demonstration of flexible bandwidth optical data center core network with all-to-all interconnectivity’, Journal of Lightwave Technology 33.8 (2015): 1578-1585
[6] Proietti, R., Liu, G., Xiao, X., et al.: ‘FlexLION: A Reconfigurable All-to-All Optical Interconnect Fabric with Bandwidth Steering’. 2019 Conference on Lasers and Electro-Optics (CLEO). IEEE, 2019
[7] S. Salman, C. Streiffer, H. Chen, T. Benson, and A. Kadav, “DeepConf: Automating data center network topologies management with machine learning,” in Proc. of NetAI, (2018), pp. 8–14.
[8] Seok, T. J., Luo, J., Huang, Z., et al.: ‘MEMS-Actuated 8× 8 Silicon Photonic Wavelength-Selective Switches with 8 Wavelength Channels’. 2018 Conference on Lasers and Electro-Optics (CLEO). IEEE, 2018.
[9] Khope, A. S. P., Saeidi, M., Yu, R., et al.: ‘Multi-wavelength selective crossbar switch’, Optics Express 27.4 (2019): 5203-5216
[10] Xiao, X., Proietti, R., Liu, G., Lu, H., Zhang, Y., Yoo, S.J.B., “Experimental Demonstration of SiPh Flex-LIONS for Bandwidth-Reconfigurable Optical Interconnects”, ECOC, 2019
[11] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch, “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments,” arXiv:1706.02275
[12] Proietti, R., Cao, Z., Nitta, C. J., et al.: ‘A scalable, low-latency, high-throughput, optical interconnect architecture based on arrayed waveguide grating routers’, Journal of Lightwave Technology 33.4 (2015): 911-920
[13] Guojun Yuan, Roberto Proietti, Xiaoli Liu, Alberto Castro, Dawei Zang, Ninghui Sun, CheYu Liu, Zheng Cao, and S. J. Ben Yoo, “ARON: Application-Driven Reconfigurable Optical Networking for HPC Data Centers”, European Conference on Optical Communications (ECOC), 2016