OpenDaylight Nuts & Bolts: Common Mistakes Around ODL/LSC Systems

With our hands deep into OpenDaylight, we support the community in many ways - project leadership, upstreaming code, bug fixes, and general Q&A. It’s this last piece we’ve decided to make publicly available on our blog. We regularly field questions from the community asking for our expert advice, input, or guidance on specific topics. In support of the community innovation process, we want to make sure everyone has access to this information and to open a real dialog on these topics.

This week, we’ve collected a list of common errors we’ve seen made around OpenDaylight and our Lumina SDN Controller. It’s a checklist of sorts that will prove handy if you’re running ODL at scale in a production environment. Although these are the most common systems-administration issues we’ve seen, you may have some to add, so please review and comment, or contact us to discuss further!

The first three on our list are all covered here and may be handled by the underlying platform team already, but should be verified prior to an install:

1. VM/container sizing (i.e., allocate enough resources for both the workload AND G1GC)
2. Avoid oversubscription on compute, RAM, networking, and storage I/O
3. Java sizing (heap and GC thread counts; see #4 and #5 below)

4. Switch away from the default Java Garbage Collection mechanism and tune its replacement:
The default garbage collector in Java 8 should not be used for heaps larger than 6GB, and regularly causes failures above 8GB. See the links above for configuration options for G1GC.
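A minimal sketch, assuming G1GC is enabled via the EXTRA_JAVA_OPTS hook in controller/etc/setenv (exact values depend on your heap size and latency needs):
# Enable G1GC with a pause-time target; tune to your workload
export EXTRA_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled"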

6. Too much heap, not enough cores/threads for GC:
If there are not enough cores, then Java Garbage Collection will create longer pauses than apps can handle. You can always allocate MORE threads to GC, but see #1/2/3, and don't add more RAM unless there are enough cores/threads to collect it.
→ Note: the default heap size (JAVA_MAX_MEM=2GB) should never be used. If it's set to 2GB (or an equivalent MB/KB number), increase it.
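For example, in controller/etc/setenv (the 8-core sizing here is an assumption; scale GC threads with the cores you actually have, per #1/2/3):
export JAVA_MAX_MEM=8G
export EXTRA_JAVA_OPTS="$EXTRA_JAVA_OPTS -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2"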

6. Incorrect user/permissions:
All files in the ODL/LSC hierarchy must be owned by (and readable/writable to) the ODL/LSC-controlling user, not root or end-user UIDs. You can use chown -R to recursively change all of these if needed.
→ Be careful with scripts, installers, and files copied as other users or by automation tools. The "find" tool can be useful here.
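For instance, assuming the controlling user is named odl and a hypothetical install path:
find /opt/lumina/lsc/controller ! -user odl -ls    # list anything with the wrong owner
chown -R odl:odl /opt/lumina/lsc/controller        # then fix ownership recursively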

7. Name resolution must all work consistently:
This includes /etc/hosts, /etc/resolv.conf, DNS lookups, and any systems calling into or called from the controller (for example, when Java does a reverse lookup of a client connecting via HTTPS, that client's address must be DNS-resolvable in reasonable time).
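A quick sanity check, using a hypothetical controller hostname and a documentation client IP:
getent hosts lsc.example.com    # honors /etc/hosts and resolv.conf ordering
dig +short lsc.example.com      # DNS only
dig +short -x 192.0.2.10        # reverse lookup, as Java does for connecting HTTPS clients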

8. Attempting to use dual-stack features when dual-stack isn't working correctly on the host itself:
If the host does not have IPv6 features enabled, then in many configurations VMs and their services cannot either.
→ Also apply DNS checks here.
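To confirm the host itself has working IPv6 first (the peer hostname is a placeholder):
sysctl net.ipv6.conf.all.disable_ipv6    # 0 means IPv6 is enabled on the host
ip -6 addr show                          # look for a routable, non-link-local address
ping -6 -c 3 peer.example.com            # proves both AAAA resolution and v6 reachability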

9. TLS Java options, certificates, and strict checking:
Karaf has options to allow TLS v1, v1.1, and v1.2 explicitly at run-time. These can be selectively disabled, as some are considered deprecated. Assume that strict certificate checking is enabled by default unless you disable it (i.e., self-signed certificates will be rejected).
→ DNS must work in both directions.
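You can probe which protocol versions are actually accepted from the outside (8443 and the restconf path are examples; adjust to your deployment):
openssl s_client -connect localhost:8443 -tls1 </dev/null    # should be refused if TLSv1 is disabled
curl -v --tlsv1.2 --user admin:admin https://localhost:8443/restconf/modules    # add -k only if you deliberately disabled strict checking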

10. Listening ports, netstat, & tcpdump:
Be sure that all intended listening ports are indeed listening (ex: 8443, 8181, 22, and any tied to specific apps/features). Be sure that you can see packets coming into these ports, and being replied to, via tcpdump. Look for ICMP unreachable or ICMP prohibited messages via tcpdump.
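For example:
ss -tlnp | grep -E ':(8443|8181|22)\b'    # confirm the sockets are actually in LISTEN state
tcpdump -ni any port 8181                 # watch for SYNs, replies, and ICMP errors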

11. Problems with iptables or static routes:
Are there any forwarding rules?
Are there any NAT rules?
Are rules applied to the wrong interface or loopback?
→ If more than one routable interface is in use, are routes to specific destinations correct and passing traffic?
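These questions can usually be answered with a few commands:
iptables -L -n -v           # filter rules, with per-rule packet counters
iptables -t nat -L -n -v    # NAT rules
ip route get 192.0.2.10     # shows the interface and next-hop chosen for a destination (example IP)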

12. Insufficient file handles/descriptors/other limits for the ODL/LSC user in Linux:
The ulimit tool can show what a particular user's limits are; the defaults for a non-root user are far too low.
→ Be careful with syntax in /etc/security/limits.conf (or equivalent), as users can be immediately locked out.
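For example, assuming the controller user is named odl (the name is illustrative):
su - odl -c 'ulimit -n'    # show the current open-file limit for that user
# In /etc/security/limits.conf -- double-check this syntax before logging out:
odl  soft  nofile  65536
odl  hard  nofile  65536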

13. Disk space: Always verify that monitoring is enabled for all storage volumes.
This is highly subjective based on how much data and logging your use-case consumes, but the rule of thumb is to alarm when free space drops below 20%. Note that it's always advisable to have your logs write to a separate partition from applications or swap. Set karaf.log rotation higher than the default (10x10MB files), but keep it something you can reasonably store and share with support.
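Rotation is set in etc/org.ops4j.pax.logging.cfg; the exact property names depend on your release (older log4j 1.x syntax shown here, with illustrative values):
# e.g., 20 files of 50MB instead of the 10x10MB default
log4j.appender.out.maxFileSize=50MB
log4j.appender.out.maxBackupIndex=20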

14. Leaving DEBUG or TRACE logging enabled:
View the log4j config to ensure nothing is left at TRACE, and only approved items are left at DEBUG. ONAP CCSDK is an example that commonly ends up misconfigured at TRACE when configs are committed to the repo.
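A quick audit from the controller directory, plus the live view from within the karaf client (see #22):
grep -iE 'TRACE|DEBUG' etc/org.ops4j.pax.logging.cfg    # anything listed here should be deliberate
log:list                                                # inside the karaf client: shows effective levels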

15. Insufficient storage I/O:
This is also highly subjective based on use case, but 250MB/s sustained is generally acceptable for a low-volume use-case, and 500MB/s+ for high-volume. While the system is NOT in production, you can use the "dd" command to write randomized multi-GB files to determine throughput.
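A rough sketch (paths are examples; never run this against a production volume):
# Stage a random 4GB file first so /dev/urandom speed doesn't skew the measurement,
# then write it out with the page cache bypassed to see sustained throughput
dd if=/dev/urandom of=/tmp/rand.bin bs=1M count=4096
dd if=/tmp/rand.bin of=/data/testfile bs=1M oflag=direct status=progress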

16. Test SSH and netconf settings from the host/VM level prior to troubleshooting within Karaf:
Use the following command to ensure that the network element is configured correctly and ssh is working, as well as user auth: ssh -vvv -p <port> -s <user>@<host> netconf
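For example, against a device on the standard NETCONF port 830 (user and address are placeholders); a successful session prints the device's <hello> message and capability list:
ssh -vvv -p 830 -s admin@192.0.2.20 netconf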

17. Lack of adequate Java profiling tools (commercial or freeware):
Be sure that you are running your preferred Java profiling agent within Karaf; these are usually loaded via "EXTRA_JAVA_OPTS" in controller/etc/setenv or an included file. This should be the LAST thing you add and test, since some of these tools may try to install dependencies that conflict with an otherwise-working ODL/LSC installation.
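A minimal sketch, assuming a hypothetical agent path (check your profiler's documentation for its actual flags):
# Appended to controller/etc/setenv only AFTER everything else is verified working
export EXTRA_JAVA_OPTS="$EXTRA_JAVA_OPTS -javaagent:/opt/profiler/agent.jar"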

18. "Cattle not pets" metaphor, even at the installation level:
While clustered VM & container instances of ODL/LSC are still very much "pets," each individual installation should be able to be deleted and replaced without much manual intervention. That is, ensure that your installation scripts and customization tools can be re-run whenever there is any doubt that the integrity of an install has been compromised by manual changes or misconfigured tooling.

19. Understanding partial cleans vs full cleans:
[NOTE: All the paths here are RELATIVE to the main "controller" directory]
[NOTE2: Only do these while the controller is stopped.]
→ When installing, replacing, or upgrading feature bundles, note that there is a cache of files that must be deleted prior to restarting.

A *partial* cleanup only removes this cache.
rm -rf .../controller/data/[gctkp]*
rm -rf .../controller/cache/*

A *netconf schema* cleanup only removes the cached yang models. Use this when vendors claim to have fixed one of their yang models, but remounting it yields no results (or odd results). Chances are they edited the file but failed to update the required revision-date string & namespace.
rm -rf .../controller/cache/*

A *complete* cleanup essentially removes all forms of cache and local data, leaving you with a "clean yet still configured/tuned" controller. This is typically most useful in clusters, where the single cluster member can be repaired and re-sync its data among its peers.
rm -rf .../controller/data/*
rm -rf .../controller/cache/*
rm -rf .../controller/journal/*
rm -rf .../controller/snapshots/*

20. You can manually validate using verbose curl to detect problems with restconf, tls, etc.:
Just see if restconf is up (http, not https) -- handy for load balancer health checking:
curl --user admin:admin -H "Content-Type: application/json" -H "Accept: application/json" -X GET "http://localhost:8181/restconf/modules"

Manual check of cluster member status:
curl -s --user admin:admin -H "Content-Type: application/json" -H "Accept: application/json" "http://localhost:8181/jolokia/read/akka:type=Cluster/MemberStatus" | python -mjson.tool

Look at the current list of netconf mountpoints (warning: this can become large):
curl -s --user admin:admin -H "Content-Type: application/json" -H "Accept: application/json" -X GET "http://localhost:8181/restconf/config/network-topology:network-topology/topology/topology-netconf" | python -mjson.tool

21. clustered-netconf vs netconf-topo bundles:
As documented in the upstream ODL documentation, the odl-netconf-topology and odl-clustered-netconf-topology features are mutually exclusive. If you detect both, remove both and only install the one you need, as shown below.
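From the karaf client:
feature:list -i | grep netconf-topology    # both showing up means you have the conflict
feature:uninstall odl-netconf-topology odl-clustered-netconf-topology
feature:install odl-netconf-topology       # or the clustered variant -- exactly one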

22. karaf client is a localhost ssh session for checking lots of things: The karaf console script at ...controller/bin/client creates a localhost session to karaf, where you can use "?" and tab-completion to view and run a wide assortment of troubleshooting commands.
→ This is the fastest way to decrease or increase log levels as needed, and immediately be able to see results.
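For example, inside the client session:
log:get                                   # show the current root log level
log:set DEBUG org.opendaylight.netconf    # raise one package while reproducing an issue
log:set INFO org.opendaylight.netconf     # and drop it back down when done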

23. Looking for bundles that are not "Active":
In the karaf client, use "feature:list" and "bundle:list" to look for anything that is not "Active", and look for modules/dependencies that may be missing. Each use-case deployment will have a specific total number of these once everything is installed and working correctly -- use this number as your standard for measuring feature/bundle overall status. (Ex: Normally expect 323 active; use wc -l to count.)
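From the host, a quick way to measure against that baseline (the count and path are examples):
./bin/client 'bundle:list' | grep Active | wc -l    # count active bundles
./bin/client 'bundle:list' | grep -v Active         # then inspect anything that isn't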

24. Don't treat INFO messages as WARN or ERROR:
During controller startup in particular, there are many messages that may seem like errors, but are not (especially after a cache cleanup). If it says INFO, it does not need to be treated as an ERROR or WARN.
→ Of particular note: Ignore anything with "returning the original object instead."

25. Cluster-specific tuning: See our ONAP examples and the companion post, OpenDaylight Nuts & Bolts: Tuning 3node Clusters.