Scaling the OpenDaylight Open Source SDN Controller

In this blog, I’ll address some questions frequently asked by our customers and other OpenDaylight users:

  • Is the Lumina ODL distribution suitable for large networks?
  • Can your Lumina SDN controller (open source SDN controller) support thousands of devices?
  • What is the scale model in OpenDaylight?

Before answering these questions, we first need to review the context and the options available to meet the need.

 

Use Case Consideration

Whenever I get a question about scalability, I reply with another question: What is the use case? For example, scaling NETCONF devices, where yang models live in memory and configuration transactions do not happen that often, is not the same as scaling OPENFLOW devices, where the CPU is stressed by frequent FLOW_MOD transactions and periodic statistics collection. The scale problem is highly use case dependent. Below, I’ll discuss two common use cases: an SDN application to configure devices via NETCONF, and an SDN application to deploy L2 multicast services on an OPENFLOW network. While these are common and relevant use cases, they’re obviously not the only ones. If you have a question about a use case not highlighted below, please use our “Contact us” page and we’ll address it.

 

Vertical Scale

Increasing the system resources (e.g. CPU, Java heap, storage), tuning the JVM (e.g. the Java garbage collector) and optimizing the OpenDaylight plugins and applications involved in the use case are the usual steps to achieve scale. Vertical scale can be the solution for some use cases and some scale requirements, and I recommend you spend time on it before trying anything else. Keep in mind that there are high-scale scenarios where vertical scale is not sufficient: the software and the system supporting it have their own limits, and undesired behavior appears when these limits are close to being crossed (long GC pauses, disk I/O saturation, etc.).

 

OpenDaylight Cluster

The OpenDaylight Akka cluster allows three or more instances of OpenDaylight to work in “tandem” and actively handle device connections. However, the overhead associated with data and state replication makes this solution better suited to High Availability than to scale. In this context, it is worth mentioning that the original Akka design allows for data partitioning and granular “sharding” as a way to scale the amount of data to be replicated. Unfortunately, the OpenDaylight shard implementation requires a yang root-level node, which puts some restrictions on the shard definition. Still, if your application or plugin yang models allow for granular sharding, this approach may be an option.

 

Horizontal Scale

Finally, we get to a solution that can potentially scale to a very large number of devices. The solution relies on two ideas: 1) break up the OpenDaylight monolith by separating the application business logic from the device plugins so that they can run in separate instances (microservices); 2) horizontally scale (replicate) the device plugin to meet the scale requirements. In the following sections we will see some examples of how to achieve this in different use case scenarios.

 

Netconf Application Case Study

The first case study is a NETCONF application currently deployed as a single controller instance at one of our customers. The application is used to configure NETCONF devices that do not support standard yang models. Here is the architecture diagram:

Netconf Application Case Study

The NETCONF application has the following components:

  • A Network Creation (NC) application exposing a standard device yang model (e.g. openconfig) via RESTCONF.
  • A Translator service to perform model schema translation: standard <-> vendor specific model.
  • A Device Credential service to store device connection information.
  • A REST library to send RESTCONF requests to the devices.
  • The OpenDaylight NETCONF plugin.
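To make the Translator service more concrete, here is a minimal Python sketch of a standard <-> vendor-specific translation, assuming the models can be reduced to flat key/value attributes. The attribute names (ifName, adminUp, and so on) and the function names are invented for illustration; they do not come from any real vendor model or from the application described here.

```python
# Hypothetical mapping between a standard (OpenConfig-like) interface model
# and one vendor-specific model. All attribute names are illustrative only.
STANDARD_TO_VENDOR = {
    "name": "ifName",
    "description": "ifAlias",
    "enabled": "adminUp",
    "mtu": "maxFrameSize",
}
VENDOR_TO_STANDARD = {vendor: std for std, vendor in STANDARD_TO_VENDOR.items()}


def translate(config: dict, mapping: dict) -> dict:
    """Rename the keys of a flat config dict according to `mapping`.

    Unknown keys raise an error so that unsupported attributes are not
    silently dropped on the way to (or from) the device.
    """
    unknown = sorted(set(config) - set(mapping))
    if unknown:
        raise KeyError(f"no translation for: {unknown}")
    return {mapping[key]: value for key, value in config.items()}


def to_vendor(standard_config: dict) -> dict:
    """Standard model -> vendor-specific model."""
    return translate(standard_config, STANDARD_TO_VENDOR)


def to_standard(vendor_config: dict) -> dict:
    """Vendor-specific model -> standard model."""
    return translate(vendor_config, VENDOR_TO_STANDARD)
```

A real translator would work on nested yang data and would need one mapping per vendor and model, but the round trip (standard to vendor and back) is the property worth preserving.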

The same application after applying the horizontal scale concepts would look like this:

Netconf Application Case Study 2

The steps to scale the NETCONF application to support a large number of devices are:

  1. Split the controller into the following microservices:
    • Network Creator: includes the NC-APP as well as the translator service.
    • Device Connection: includes the Device Credential service and the new Device Triager service (sets up and maintains the NETCONF connections).
    • Agent Workload Manager: This is a new service to distribute device connections among NETCONF agents. Note that the device vendor and model can be used to distribute connections and therefore to save memory in the agents.
    • Netconf Agent: This is the ODL NETCONF plugin.
  2. Use REST for sync and Message Bus for async inter-module communication.
  3. Horizontally scale the NETCONF agents. Note that the current OpenDaylight NETCONF plugin does not keep any device configuration state apart from the local connection data.
  4. Horizontally scale the NC service. Note that this service is driven by RESTCONF RPCs and does not currently maintain any network state, so it is easy to replicate.
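As an illustration of the Agent Workload Manager idea, the following Python sketch assigns each device to a NETCONF agent, preferring an agent that already serves the same vendor and model so that fewer agents need to hold the same yang schemas in memory. The Agent class, the capacity limit and the assign function are assumptions made for this example, not part of OpenDaylight or of the deployed application.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One NETCONF agent instance (hypothetical bookkeeping record)."""
    name: str
    capacity: int                                 # assumed max device connections
    devices: list = field(default_factory=list)
    schemas: set = field(default_factory=set)     # (vendor, model) pairs in use

    def has_room(self) -> bool:
        return len(self.devices) < self.capacity


def assign(device_id: str, vendor: str, model: str, agents: list) -> Agent:
    """Pick an agent for a device and record the assignment."""
    schema = (vendor, model)
    candidates = [agent for agent in agents if agent.has_room()]
    if not candidates:
        raise RuntimeError("all agents are full; add another NETCONF agent")
    # Prefer agents that already serve this vendor/model (False sorts before
    # True), then the least loaded one.
    chosen = min(
        candidates,
        key=lambda agent: (schema not in agent.schemas, len(agent.devices)),
    )
    chosen.devices.append(device_id)
    chosen.schemas.add(schema)
    return chosen
```

Because this service holds the global device-to-agent distribution, its state would have to be persisted and protected in a real deployment; the sketch keeps it in memory only.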

Both the Device Connection service and the Agent Workload Manager maintain global device connection and distribution state, so for these services we recommend implementing a basic HA mechanism rather than horizontal scale (service replication).



OPENFLOW Application Case Study

The second case study is an OPENFLOW application currently deployed in a 3-node controller cluster at one of our customers. The application is used to deploy L2 multicast services on an OPENFLOW network. Here is the architecture diagram:

OpenFlow Application Case Study 1

The OPENFLOW application has the following components:

  • Northbound Check Service to validate REST requests before they hit the ODL datastore.
  • Point to Multipoint transport service to configure the root and leaf ports.
  • Point to Multipoint tree path service to program the distribution tree (root -> leaves).
  • Point to Point transport service to configure the edge node ports.
  • Point to Point path service to program the path between 2 nodes.
  • MPLS-SR service to provide MPLS Segment Routing OPENFLOW programming.
  • Topology Manager to maintain the desired and actual topology.
  • OPENFLOW Plugin Facade as a proxy between applications and the OPENFLOW plugin. This component is also responsible for reconciling the controller programming state with the device programming state.
  • The ODL OPENFLOW plugin.
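The reconciliation performed by the OPENFLOW Plugin Facade can be pictured as a diff between the flows the controller intends to have on a device and the flows the device actually reports. The Python sketch below assumes flows are plain dictionaries keyed by a flow id; that representation and the reconcile function are illustrative assumptions, not the facade's real data model.

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Compare desired and actual flows, both keyed by flow id.

    Returns the flow ids to add, remove and update so that the device
    state converges on the controller state.
    """
    desired_ids, actual_ids = set(desired), set(actual)
    return {
        # Programmed by the controller but missing on the device.
        "add": sorted(desired_ids - actual_ids),
        # Present on the device but no longer wanted by the controller.
        "remove": sorted(actual_ids - desired_ids),
        # Present on both sides with different content.
        "update": sorted(
            fid for fid in desired_ids & actual_ids
            if desired[fid] != actual[fid]
        ),
    }
```

For example, with desired flows f1 and f2 and actual flows f2 (with a different output port) and f3, the result asks to add f1, update f2 and remove f3.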

 

The same application after applying the horizontal scale concepts would look like this:

OpenFlow Application Case Study 2

The steps to scale the OpenFlow application to support a large number of devices are:

  1. Split the controller into the following microservices:
    • Transport Service: includes the NB check service along with all the transport applications and the OPENFLOW plugin facade.
    • Segment Routing: includes the MPLS-SR service and the Topology Manager.
    • OpenFlow Agent: This is the ODL OpenFlow plugin.
  2. Use JSON-RPC for sync and Message Bus for async inter-module communication.
  3. Horizontally scale the OPENFLOW Agents. Note that the current OpenDaylight OPENFLOW plugin can program devices using RPC requests instead of datastore requests, so it does not need to maintain any device configuration state.
  4. Use multiple instances of the Transport Service and Segment Routing. This is useful for network federation or network slicing (e.g. a set of transport services, managed by the same organization, can use a subset of the physical topology).
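The synchronous JSON-RPC exchange between a service and an OpenFlow Agent could look like the Python sketch below. Only the JSON-RPC 2.0 envelope (the jsonrpc, method, params, id, result and error fields, and the -32601 "Method not found" code) comes from the JSON-RPC specification; the add_flow method, its parameters and the dispatcher are hypothetical and do not reflect the OpenDaylight JSON-RPC interface.

```python
import itertools
import json

_request_ids = itertools.count(1)


def make_request(method: str, params: dict) -> str:
    """Build a JSON-RPC 2.0 request as it would be sent to an agent."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    })


def handle(raw: str, methods: dict) -> str:
    """Agent side: dispatch one request to a registered method."""
    request = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": request.get("id")}
    handler = methods.get(request.get("method"))
    if handler is None:
        reply["error"] = {"code": -32601, "message": "Method not found"}
    else:
        reply["result"] = handler(**request.get("params", {}))
    return json.dumps(reply)


def add_flow(node: str, flow_id: str, priority: int) -> dict:
    """Hypothetical agent method; a real agent would send a FLOW_MOD here."""
    return {"node": node, "flow_id": flow_id, "status": "accepted"}
```

Since the agent only executes the request and returns a result, it keeps no configuration state between calls, which is what makes it easy to replicate.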

Both the Transport Service and the Segment Routing service maintain configuration and network state (L2 service, path, SR topology), so for these services we recommend implementing a basic HA mechanism rather than horizontal scale (service replication).

Also note that in this solution, the devices connect to the controller (not the other way around), so an external Load Balancer is required to provide the device distribution and failover mechanism for the OPENFLOW connections.
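One possible way for such a load balancer to distribute switches is rendezvous (highest random weight) hashing, sketched below in Python. This is a technique chosen for the example, not necessarily what a given load balancer product implements, and the use of the switch source address as the key and the agent names are assumptions.

```python
import hashlib


def _score(switch_key: str, agent: str) -> int:
    """Deterministic pseudo-random weight for a (switch, agent) pair."""
    digest = hashlib.sha256(f"{switch_key}|{agent}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_agent(switch_key: str, healthy_agents: list) -> str:
    """Return the healthy agent with the highest weight for this switch.

    switch_key is whatever the balancer can see when the connection
    arrives, for example the switch source IP address (an assumption).
    """
    if not healthy_agents:
        raise RuntimeError("no healthy OpenFlow agent available")
    return max(healthy_agents, key=lambda agent: _score(switch_key, agent))
```

The useful property for failover is that when one agent is removed from the healthy list, only the switches that were connected to that agent are moved; all other switches keep their current agent.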

 

Design Considerations

Here are some considerations when deploying applications in a microservice fashion:

  • Define your microservices wisely so the components requiring horizontal scale do not handle network state, and those handling network state are very well scoped and relieved of unnecessary tasks.
  • In some cases you will have to develop new applications to deal with the horizontal scale, for example the Agent Workload Manager described earlier.
  • Select and implement a mechanism for microservices communication. In OpenDaylight, REST and JSON-RPC are available out-of-the-box, but some use cases may require more performance and async communication (e.g. a Kafka or RabbitMQ bus) not available in OpenDaylight.
  • Select and implement a shared datastore. This could be the OpenDaylight datastore, but nothing prevents you from using an off-the-shelf datastore.
  • Finally it is advisable to use a framework to deploy and maintain the microservice deployment (e.g. Kubernetes). The framework should provide centralized logging and troubleshooting features.
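To show the difference between the synchronous and asynchronous communication styles mentioned above, here is a small in-process publish/subscribe sketch in Python. It is only a stand-in for a real message bus such as Kafka or RabbitMQ and uses none of their APIs; the InMemoryBus class and the topic name in the usage note are invented for the example.

```python
import queue
import threading


class InMemoryBus:
    """Toy message bus: each subscriber gets its own queue per topic."""

    def __init__(self):
        self._subscribers = {}          # topic -> list of queues
        self._lock = threading.Lock()

    def subscribe(self, topic: str) -> queue.Queue:
        """Register a new subscriber and return the queue it should read."""
        inbox = queue.Queue()
        with self._lock:
            self._subscribers.setdefault(topic, []).append(inbox)
        return inbox

    def publish(self, topic: str, message: dict) -> int:
        """Deliver a message to all current subscribers without waiting
        for them to process it. Returns the number of deliveries."""
        with self._lock:
            inboxes = list(self._subscribers.get(topic, []))
        for inbox in inboxes:
            inbox.put(message)
        return len(inboxes)
```

A service would call subscribe("device-events") once and read from the returned queue in its own thread, while an agent calls publish("device-events", {...}) and continues immediately. With REST or JSON-RPC the caller would instead block until the reply arrives.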

 

Conclusion

There are multiple approaches to the network scale problem. My recommendation is the following: if the existing OpenDaylight controller (SDN controller) does not meet your scale requirements, try “vertical scale” first, since it is a simpler and more efficient solution than “horizontal scale”. In those cases where the network size cannot be handled by “vertical scale” alone, use the “horizontal scale” strategy, but be aware that this option requires some development and integration effort that cannot be 100% reused across use cases. This ultimately explains why the “horizontal scale” solution cannot be upstreamed to OpenDaylight; instead, downstream and partner companies like Lumina Networks have to work out the right solution and microservice deployment to meet the end user requirements.

Learn more about how Lumina plans to scale OpenDaylight from our latest presentation at the Linux Networking Foundation Technical Event - DDF. 

To better understand how we harden open source solutions for transforming networks, please visit our website or contact us to discuss your project.

Additional Resources: