Reliable and Distributed Network Monitoring via In-band Network Telemetry
Traditional network monitoring solutions usually lack scalability because of their centralized nature: a single controller collects heartbeats from all network components. As a solution, the In-Band Network Telemetry (INT) framework has recently been proposed to collect network telemetry information in a more autonomous and distributed manner by employing programmable switches. However, INT introduces two further challenges: (i) finding suitable INT paths that optimize control overhead and information freshness, and (ii) ensuring reliable delivery of control information over multi-hop INT paths. In this work, we propose a monitoring scheme, reliable Graph Partitioned INT (GPINT), which extends our previous work and integrates shared queue ring (SQR) as a reliability feature. SQR protects against failures in network telemetry collection caused by network congestion and link degradation, which may otherwise lead to a loss of network visibility. We implement our proposal in P4, a recent data plane programming language, and compare it with the traditional Simple Network Management Protocol (SNMP) and with another state-of-the-art study that employs Euler's method for INT path generation. Our analysis first shows the importance of a data recovery mechanism against packet losses under different network conditions. Our emulation results then indicate that GPINT with the reliability extension performs much better than its competitor in terms of telemetry collection latency and overhead, even under heavy packet loss.