NRE Labs v1.3.0 - Kata Containers, Cilium, cRPD!

Just in time for the holidays - I’m pleased to announce the next release of….well….pretty much the whole NRE Labs stack. This is the culmination of the work I’ve been doing for what feels like all year to enable some new capabilities, so this release is a bit infrastructure-heavy, but what it enables in terms of possible future content is exciting.

New NRE Labs Curriculum

The most important thing to announce is a new version of the NRE Labs curriculum, which includes new content, new endpoint images, and some long-awaited fixes to lessons that didn’t always work quite right. The full changelog, as always, is available here, but here are some highlights.

Ansible Lesson Fixed!

If you’ve had a chance to check out the Ansible lesson, you may have run into a known issue where Ansible wasn’t able to connect to the vQFX device. If this happened to you, this was undoubtedly quite frustrating, since a large portion of the lesson includes steps that require the vQFX, and were thus inaccessible to you.

It turns out that this was a bug triggered by using a mixture of the new ansible_user-style configuration options with the legacy ansible_ssh_pass option. Changing the latter to ansible_password fixed this.

This one took a while to fix, not because it was a particularly difficult issue to fix, but because I was waiting for a pull request in upstream Ansible to be merged, which I believed would solve the underlying problem. Since as of today this PR was not merged, I started looking for workarounds, and stumbled across the solution above, and it works great! Apologies to those that had a less-than-ideal experience with this thus far, but it should work a lot better now.

Re-Vamped JSNAPy Lesson

The lesson on using JSNAPy for network unit testing is among the oldest lessons in the curriculum. In fact, it was the very first - built primarily as a proof-of-concept for the NRE Labs platform back in 2018 before we even officially launched. It was built with two parts: one where the tests failed, and one where the tests passed. While somewhat useful for at least getting some experience with JSNAPy, its original purpose was to illustrate the underlying mechanisms that make NRE Labs work, and as a result didn’t do a really great job at actually teaching the underlying concepts of JSNAPy itself. Since that time, it hasn’t really seen any major updates to improve on this, until now.

The new JSNAPy lesson has been totally re-vamped to not only focus on the fundamentals of how JSNAPy works, but also with space for future releases to add chapters to this lesson seamlessly. So, this version includes the first of these chapters, with more planned in subsequent releases very soon.

This lesson also features a new image endpoint, which is Juniper’s containerized routing protocol daemon, or “cRPD”. This isn’t a full-blown network operating system (like most virtual Junos images) but rather a totally disaggregated routing and management stack, packaged inside a container. In this release, we’ve coupled this image with some of the underlying infrastructure improvements, such as using Kata containers to seamlessly run these containers inside a very lightweight Linux VM.

We'll cover the details of how this works in the following
section covering infrastructure improvements.

The result of this is that we get an extremely fast, lightweight network element that acts exactly like a traditional Junos device (you can run Junos commands on it, send API requests to it, it can route traffic, connect in a topology, etc), but has all of the benefits of running as a simple container. What’s also exciting about this is that unlike the other Junos endpoints we’ve used thus far, cRPD enjoys regular feature updates and includes the latest automation features like gNMI support.

I’m excited about the simplicity and speed this brings to the JSNAPy lesson, but there is still a lot more potential in not only the cRPD image specifically, but in the new infrastructure that makes it possible. Being able to run simple containers within the safety and security of a virtual machine makes a lot of concepts that were previously impossible (doing anything with the kernel) now feasible. I’m excited for future content to take advantage of this feature. See the section on Antidote improvements for more details on how this works.

Infrastructure Upgrades

This release also brings a few “behind the scenes” improvements into production. Normally I don’t talk too much about this stuff, but given the impact on the curriculum, and the substantial nature of these improvements, I figure it’s worth at least a short summary.

Kata Containers

As mentioned in the previous section, we’re now able to run containers in lightweight virtual machines using Kata Containers. This is an alternative container runtime that executes containers inside a lightweight virtual machine, rather than natively on the host operating system. However, the operational model for these containers is no different - they still look and feel exactly like containers. So, we get the developer experience of containers with the safety and security of a virtual machine (gets away from the “shared kernel” paradigm of traditional containers). Again, this is cool for NRE Labs because it provides a much simpler platform for content that requires access to a dedicated kernel.

To be able to run a Kubernetes pod using an alternative runtime like Kata, the Kubernetes cluster must be deployed using an implementation of the Container Runtime Interface (CRI). This is because historically, Kubernetes defaulted to using Docker, which did not allow us to pick and choose which runtime we wanted to use - you always used runc. Unfortunately this isn’t a simple configuration change - the cluster needs to be built to support this from the ground up, so starting earlier in the year, I got to work on a new, separate cluster that supported CRI.

It turns out that this change was timely, as this month, Kubernetes announced it would be dropping support for native Docker as a runtime, and would instead only be supporting those that support CRI. So, to continue to use modern versions of kubernetes, this change is actually non-optional.

So, to support the ability to run lesson endpoints in lightweight Kata virtual machines, the new cluster that powers NRE Labs has been rebuilt from the ground up to run containerd with the CRI plugin. By default, this will still run containers using runc, but unlike the legacy installation, we can specify in a created pod that we wish to use the Kata runtime. As I’ll cover in the next section, this allows us to create “flavors” of endpoint images in Antidote that execute in different runtimes based on requirements.

The takeaway from this is that we can now optionally execute lesson endpoints within a secure virtual machine, and in fact this is the default in the new version of Antidote. Not only is this more secure, but it also gives each lesson endpoint its own kernel, which makes certain lesson concepts easier to explore, and it doesn’t sacrifice a lot of performance to get these benefits - everything starts extremely fast.

Cilium

The previous CNI that was being used within the Kubernetes cluster was showing its age. There were a number of issues that were causing some network connectivity problems during periods of heavy usage, as well as during maintenance events, and it didn’t seem like these problems were being addressed. It had been some time since I last evaluated options for a network provider within Kubernetes, so I spent some time looking at the current state of the industry here.

I had been tracking the Cilium project for a few months prior, and have been pretty impressed with not only their work, but also with the entire eBPF community, and the focus on simpler, safer, and more performant architectures that are made possible with BPF. I decided to give Cilium a go, and haven’t looked back. It works great. Note that we’re still using Multus to add additional network interfaces to lesson endpoints that need them, and these continue to use the linux bridge for this simple connectivity - but Cilium is now in charge of eth0 for each lesson endpoint provisioned behind the scenes in NRE Labs.

Kubernetes 1.18

The version of Kubernetes in use on the cluster also warranted an upgrade. Previously we were using Kubernetes 1.14, and a lot of the API resources we were still relying on weren’t yet stable. Kubernetes 1.18 proved to be a good candidate to take advantage of the security and maturity enhancements that have been taking place since 1.14. It also was necessary to support the version of containerd that met our requirements for running Kata Containers, as described above.

New Antidote Platform

Finally, we have a new version of the underlying “Antidote” platform which powers NRE Labs. There were a number of platform improvements that were necessary in order to facilitate the availability of alternative runtimes like Kata Containers, and there were a number of user-experience issues that needed to be fixed.

As always, the full changelog is here, but let’s go over some highlights:

When a user navigates to the final stage of a lesson, the “next” button was greyed out, but was also colored blue, as opposed to a white background. This led to some confusion, especially on lessons that only have a single stage, as it seemed like there was more content to see, but you just couldn’t get to it. This version of NRE Labs fixes this problem, making the button truly greyed out so it’s more obvious when you’ve reached the end of the lesson as it currently stands.
To help give learners a better “next step” when things go wrong on NRE Labs (lesson fails to start, etc), most error messages on the site will include a button that takes the learner to the NRE Labs Forums, and pre-fills a new post with all of the detail necessary for NRE Labs admins to troubleshoot the problem and get to a resolution. As more content comes to the site, this should make it easier for the project to address issues more quickly, and keep content quality on the site high.
Antidote now supports the concept of “Image Flavors”, which controls the permissions that a given endpoint image is granted, as well as the container runtime (i.e. Kata Containers) used to execute it. Each image in the NRE Labs curriculum has been updated to specify this field (either “legacy”, “untrusted”, and “trusted”). This is the key feature needed to enable NRE Labs lesson endpoints to take advantage of the infrastructure improvements mentioned earlier such as Kata Containers.
Lesson metadata now has a new field for specifying the subnet on a given Connection definition. While this isn’t strictly enforced, (endpoints that have the ability to change their IP addresses are still able to do this), it’s nice to be able to specify the subnet you want to automatically be provisioned, because it’s one less thing you have to configure yourself using other methods. Currently, the subnet is as granular as you can get - you cannot currently specify which endpoint gets which IP address statically - but this is a step in that direction.