How to Design for Software Robustness and Resiliency in High-Performance Routing Environments

Digital communications networks are essential, mission-critical infrastructure of growing importance to society at large and to a wide variety of key industries. IP routers supporting an ever-growing number of IP-based applications are at the heart of today’s advanced communications networks. But these high-performance networks are under constant risk of attack. As we evolve to 5G, society is growing more and more dependent on this routing infrastructure, and the impact of failures on business and end users will be increasingly grave. It is therefore no surprise that these networks are under scrutiny from the highest levels of government; their resilience and robustness are now becoming issues of national security.

These qualities are founded in the design approach, design architecture and testing methodologies of a router’s operating system. These fundamental principles cannot be an afterthought and must be built from the start, as they determine how communications systems measure up on reliability, high availability and security.

In developing router operating systems, these principles have been fundamental. For instance, one of the first principles of developing routing infrastructure is to maintain a single development stream, avoiding custom streams or forks. This strengthens the ability to test the system for software quality. It’s important to make heavy investments in test automation with a 1:1 ratio of software developers to test engineers. They work side by side to write automated test code for every new line of code in the operating system, which is then continuously regression tested using tens of thousands of servers around the clock. With a single OS stream, all the test cycles focus on the same software image. This strategy produces code that is so robust, it virtually eliminates the occurrence of major bugs in the field.

Part of what makes it possible to have a single OS stream across all platforms is the modular design with separation of control and forwarding planes. Now, companies use a hardware abstraction layer that allows components of the OS, such as the control protocols, to be developed independently of the hardware. The result is that a single router OS image will run on any hardware or chipset across the entire routing portfolio.
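The role of a hardware abstraction layer can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the `ForwardingHardware` interface, the two pretend chipsets and the `RoutingProtocol` class do not come from any real router OS. The lookup is a simple exact match rather than a real longest-prefix match.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ForwardingHardware(ABC):
    """Hypothetical hardware abstraction layer (HAL) interface."""

    @abstractmethod
    def install_route(self, prefix: str, next_hop: str) -> None:
        """Program one route into the forwarding plane."""

    @abstractmethod
    def lookup(self, prefix: str) -> Optional[str]:
        """Return the next hop for an exact prefix, or None."""


class ChipsetA(ForwardingHardware):
    """Pretend chipset that stores routes in a dictionary."""

    def __init__(self) -> None:
        self._table = {}

    def install_route(self, prefix: str, next_hop: str) -> None:
        self._table[prefix] = next_hop

    def lookup(self, prefix: str) -> Optional[str]:
        return self._table.get(prefix)


class ChipsetB(ForwardingHardware):
    """Pretend chipset that stores routes as a list of entries."""

    def __init__(self) -> None:
        self._entries = []

    def install_route(self, prefix: str, next_hop: str) -> None:
        # Replace any existing entry for the same prefix.
        self._entries = [e for e in self._entries if e[0] != prefix]
        self._entries.append((prefix, next_hop))

    def lookup(self, prefix: str) -> Optional[str]:
        for entry_prefix, next_hop in self._entries:
            if entry_prefix == prefix:
                return next_hop
        return None


class RoutingProtocol:
    """Control-plane code that only knows the HAL interface."""

    def __init__(self, hardware: ForwardingHardware) -> None:
        self.hardware = hardware

    def learn_route(self, prefix: str, next_hop: str) -> None:
        self.hardware.install_route(prefix, next_hop)
```

Because `RoutingProtocol` depends only on the interface, the same control-plane code runs unchanged against either chipset, which is the property that lets one software image span a whole portfolio.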

In addition to being hardware independent, router operating systems also use distributed processes for added reliability, scalability and performance. Code is designed to be symmetric multi-processing (SMP) safe so that software can scale out and run multiple threads in parallel, taking advantage of modern multi-core and multi-threaded processors.

Real-time scheduling of processes also helps efficiently manage the allocation of system resources based on priority and process state. This ensures that time-critical processes always have the CPU cycles needed, which prevents network meltdowns from misconfigurations, buggy code in other routers or malicious attacks.
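As a rough illustration of priority-based scheduling, the Python sketch below always picks the most urgent ready process first. It is a toy model under stated assumptions: the `PriorityScheduler` class, the numeric priorities and the process names are invented here, and a real-time scheduler would also handle preemption, process state and time slices.

```python
import heapq
from itertools import count
from typing import Optional


class PriorityScheduler:
    """Toy strict-priority scheduler: a lower number is more urgent."""

    def __init__(self) -> None:
        self._ready = []
        # Sequence counter keeps first-in, first-out order within a priority.
        self._seq = count()

    def make_ready(self, priority: int, name: str) -> None:
        """Mark a process as ready to run at the given priority."""
        heapq.heappush(self._ready, (priority, next(self._seq), name))

    def next_process(self) -> Optional[str]:
        """Return the most urgent ready process, or None if idle."""
        if not self._ready:
            return None
        _, _, name = heapq.heappop(self._ready)
        return name
```

If a burst of low-priority work (say, hypothetical `cli-0`, `cli-1`, `cli-2` tasks at priority 10) is queued ahead of a protocol keepalive at priority 0, the keepalive is still dispatched first.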

Redundancy is essential as the need for constant availability of mission-critical IP applications grows. With an advanced routing OS, all state is replicated across redundant control processors. This is an inherent architectural feature that can’t be added later. It enables companies to do innovative things, such as in-service software upgrades (ISSU) and non-stop routing and services, which ensure minimal to no impact on network and service availability.
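The idea of replicating state across redundant control processors can be sketched in a few lines of Python. This is a simplified model, not a description of any vendor's implementation: `ControlProcessor`, `RedundantPair` and the processor names are hypothetical, and the synchronous copy stands in for whatever replication mechanism a real OS uses.

```python
class ControlProcessor:
    """Hypothetical control processor holding routing state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.routes = {}

    def apply(self, prefix: str, next_hop: str) -> None:
        self.routes[prefix] = next_hop


class RedundantPair:
    """Active/standby pair that mirrors every state change."""

    def __init__(self, active: ControlProcessor, standby: ControlProcessor) -> None:
        self.active = active
        self.standby = standby

    def update(self, prefix: str, next_hop: str) -> None:
        # Every change is applied to both processors, so the standby
        # always holds the same state as the active one.
        self.active.apply(prefix, next_hop)
        self.standby.apply(prefix, next_hop)

    def failover(self) -> None:
        # The standby takes over with no state to relearn.
        self.active, self.standby = self.standby, self.active
```

Because the standby already holds a full copy of the state, a switchover does not require routes to be relearned, which is what makes upgrades and failures survivable without service impact in this model.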

High availability also depends on security. First, a routing OS should have a robust set of features to protect and secure the router from attacks. This set of features includes securing access to the router, out-of-band management to prevent unauthorized administrative access, and hardware QoS to prioritize traffic to the control processors, which helps prevent DoS attacks aimed at the control plane.

The OS can also make the router part of the network defense by streaming telemetry data for comprehensive stats and counters, flow analysis, traffic mirroring and traffic filtering. When properly architected, routers should have the ability to filter out 90 percent of the nuisance traffic associated with today’s volumetric DDoS attacks, reducing the reliance on much more expensive mitigation hardware.

To do this filtering, the OS should be designed so that access control lists can be applied at the flow level on a per-service or per-interface basis with zero performance penalties. Operating systems can now leverage the capabilities of network processors with terabit-level forwarding capacity and enhanced packet inspection down to the packet payload level. This enables the OS to isolate and discard malicious traffic flows tied to volumetric DDoS attacks at the first stage, on the router line card at the network edge, before they do damage to the network or targeted applications and services.
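The logic of a per-interface access control list can be sketched in Python as a first-match rule evaluation. This only models the decision logic in software; it says nothing about how a network processor achieves it at line rate. The `AclRule` fields, the `evaluate` function, the interface name and the example rule are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AclRule:
    """Hypothetical ACL entry matched against one packet's headers."""
    source: str              # source prefix in CIDR notation
    protocol: str            # "udp", "tcp" or "any"
    dst_port: Optional[int]  # None matches any destination port
    action: str              # "permit" or "drop"


def evaluate(rules: List[AclRule], src_ip: str, protocol: str,
             dst_port: int, default: str = "permit") -> str:
    """Return the action of the first matching rule, else the default."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        if addr not in ipaddress.ip_network(rule.source):
            continue
        if rule.protocol not in ("any", protocol):
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        return rule.action
    return default


# ACLs are kept per interface; "edge-1" is a made-up interface name.
acls = {
    "edge-1": [AclRule("198.51.100.0/24", "udp", 123, "drop")],
}
```

With this rule set, `evaluate(acls["edge-1"], "198.51.100.7", "udp", 123)` returns `"drop"`, while traffic from other sources or to other ports falls through to the default action.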

Requirements for high-performance routing infrastructure, such as resilience, robustness and security, must be built into the routing OS at an architectural level. As we evolve to 5G and routed networks take on national-level importance as essential infrastructure for our society, an OS’s design architecture, design approach and testing methodologies will come to be seen as a benchmark.

As CTO for Nokia’s IP and Optical Networks Business, Steve is responsible for determining which new projects and emerging markets we should invest in while helping to develop road maps so Nokia can embrace the opportunities that are a best fit. A self-proclaimed cynic when it comes to technology trends, Steve likes to cut through industry hype and get down to what’s best for business. 

