Highlights of NRE Labs v0.2.0
This is a big release with a lot of changes, and many of the changes across the various repositories are tightly coordinated. We’ll provide the full Changelog at the end of this post, but before that, we want to go over some of the highlights for this release, especially those that required changes across multiple repositories:
New Lesson - Introduction to StackStorm!
We’re happy to introduce a great new lesson. This one explores a great open source project for doing event-driven automation, called StackStorm
You can check out the new lesson right now, right here! This is an exciting, new lesson with five brand new labs ready to go, covering each of the aspects of event-driven automation with StackStorm:
- Introduction
- Actions
- Workflows
- Sensors and Triggers
- Rules
We were originally going to add even more labs, but we have some additional changes planned for the overall structure of NRE Labs, and thought it might be best to save those for an upcoming release. In the meantime, pull the “trigger” on this new lesson! :)
Streamlined Authentication
Previously, we used a variety of different credentials for lesson endpoints like the vQFX and utility image. A few of the images had been standardized, especially those that were created explicitly for this project, but a few had different credentials. As a result, we had to provide a credential set in the lesson definition for each endpoint, so that syringe
and antidote-web
knew how to log in to the device.
However, this added quite a bit of additional complexity, as these credentials needed to be embedded in the lesson definition, as well as passed into the metadata for the lesson endpoints in Kubernetes. It was a bit of a mess.
We threw all that mess out, and decided to enforce a single credential set across the board. Any lesson endpoint that uses SSH and is represented as a terminal in the browser, must support these credentials.
This simplifies the lesson definitions significantly, and also reduces the logic on the back-end for deciding which credentials to use based on the lesson endpoint.
The Progress Bar Actually Does Something Now!
We’re really excited about this one, as we think it addresses arguably one of the most painful parts of NRE Labs to date. Previously, the “progress bar” brought up when loading a new lesson - or moving between labs within a lesson - didn’t do much of anything except sit there and churn. It didn’t matter what was happening on the back-end, you had to stare at that thing at 100% the entire time. Kind of insulting - we know. We are sorry.
No more. Now, progress for the lesson’s setup and configuration is being provided via the Syringe API. The API now indicates - from the moment you initially request a new lesson - how many endpoints have passed a basic reachability test (meaning they’ve initially started), and how many haven’t. Once those are all online, it will also indicate whether or not anything needs to be configured. Of course, it will also indicate when everything is ready to go.
This means we can also improve the front-end to consume all this information and present a much, much, MUCH more useful progress indicator:
We really hope this makes things easier to wait for, especially on the initial lesson startup.
Syringe Stability Improvements
While at first glance, this will appear to be a laundry list of minor improvements, these add a tremendous amount of stability to the back-end component for NRE Labs: “Syringe”. These will have a big positive impact on the end user experience in NRE Labs.
- Simplify and improve safety of in-memory state model - Syringe doesn’t (currently) have a database. Everything Syringe does is tracked using in-memory structs and maps. Unfortunately we failed to clean up a few concurrency no-nos that we committed early on in Syringe’s development while we were still deciding how to handle state. This PR fixes that, and prevents related crashes.
- Introduce garbage collection whitelist functionality - this is a cool new feature that might not be useful to everyone, but could be useful for those doing demos with NRE Labs, or developing lesson content. This is a handy API and CLI feature for adding a session ID to a “whitelist”, which prevents any lesson with that session ID from being cleaned up due to garbage-collection. In effect, these lessons would last until Syringe was restarted.
- Fixed bug with bridge naming and reachability timeout - NRE Labs currently uses linux bridges to tie endpoints together in “virtual” networks. Each bridge needs a unique name, which is formed using a combination of lesson ID, session ID, and the two endpoint names participating in the network. Unfortunately, this proved way too long for the 16 character maximum for bridge names, and the only uniqueness contained in the first 16 characters was down to the lesson/session combination. This meant that all network segments in a lesson ended up being on the same bridge anyways. While we didn’t notice this so far because all lessons use unicast traffic (we’re still working on getting protocols like OSPF and LLDP working), this is not ideal. So, we reduced the length of characters needed for uniqueness, and put a segment ID and lesson ID right at the beginning of the bridge name. The remaining characters are populated with the session ID.
- Added timeout logic to reachability test - Syringe uses connectivity checks to verify whether or not an endpoint is online. When the endpoint is a network device, or utility (like a linux container), Syringe actually tries to authenticate over SSH, as a way of really making sure it’s online. Sometimes, this test would seem to hang, ignoring the timeout provided to the connection client. This actually held up the provisioning of the entire lesson, since Syringe will wait until all goroutines executing connection tests return before moving forward. This PR moves the timeout logic outside the SSH client entirely, and simply kills all goroutines using a
select
statement. This ends up being much more reliable. - Fix fundamentally broken networkpolicy - The previous networkpolicy being applied was too restrictive. It was applied after initial lesson provisioning, which worked, but would block internet access for any pods performing reconfiguration between labs in a lesson. This meant you could load a lesson for the first time, but couldn’t go to any other labs (configuration would eventually fail). This PR allows internet access for configuration pods.
Serve Lab Guide Directly from API
Each lab guide is written in markdown, and stored within the lesson directory. In past versions, Syringe would require that a lesson’s syringe.yaml
file specified a public URL in each stage’s declaration, and would simply pass this to antidote-web
. The javascript there would download the contents at that URL and run it through our markdown-to-html converter in the browser.
This presented problems when moving between versions, or when working with markdown files stored on someone’s branch (since Github embeds the branch name in the URL for files). So, as of v0.2.0, Syringe now parses the Markdown file itself, and provides the contents directly via the lesson definition API. Antidote-web no longer needs to download the contents from an external source, and now simply sends the contents straight into the converter.
This keeps things much simpler operationally, and prevents the web front-end from making an extra, unnecessary request.
Working towards LLDP capabilities
We’ve been working towards getting LLDP working in NRE Labs for a while. There are a few lessons we want to write that really require LLDP in order to work, so this is something we’ve wanted to tackle since the beginning.
This has ended up not being so straightforward. There are a few things we knew we had to do, like setting the right group_fwd_mask
on all linux bridges. However, it ended up being a bit more complicated than that. Rather than spoil a great future blog post, suffice it to say we still don’t support this. However, we made some good progress towards this in v0.2.0:
- Cleaned up and enabled LLDP within the
vqfx
image #147 - Add LLDP to all lesson configurations and install new
antibridge
plugin via bootstrap playbooks #150
Full Changelog
Those are the highlights - stay tuned for the full change log. As always, we hope you enjoy this version of NRE Labs, and encourage you to submit an issue to the Antidote
project for any requests or bug reports.
Antidote
Curriculum
- Adding Lesson-15 Event-Driven Network Automation with StackStorm #126
- Clarify internet access restrictions in REST API lesson and docs #148
- Updated YAML lesson to be less of a Juniper commercial #161
- Document howto check spin up progress and find the UI URL #166
Other
- Simplified authentication by using consistent credentials, statically #143
- Remove lab guide from lesson definitions #146
- Cleaned up and enabled LLDP within the
vqfx
image #147 - Enable LLDP end-to-end #150
- Added better docs for Syringe environment variables #160
- Moved to a custom githelper image for making lesson repo available in pods with correct permissions #162
Syringe
- Simplified authentication by using consistent credentials, statically #40
- Serve lab guide directly from lesson definition API #41
- Simplify and improve safety of in-memory state model #42
- Introduce garbage collection whitelist functionality #45
- Fixed bug with bridge naming and reachability timeout #51
- Add more detail around the status of a livelesson’s startup progress #52
- Add check to lesson import to ensure lesson IDs are unique #53
- Use new githelper image instead of configmap script #55
- Fix fundamentally broken networkpolicy #58
- Added timeout logic to reachability test #59
Antidote-Web
- Simplified authentication by using consistent credentials, statically #23
- Render lab guide directly from Syringe API #24
- Add notice that mobile isn’t yet supported #25
- Add more detail to the progress modal based on Syringe API changes #27
- Fixed a few navigational bugs and broken link to Github issues on the error modal #28