By Benoît Canet and Don Marti
A new type of OSv workload
The MIKELANGELO project aims to bring High Performance Computing (HPC) to the cloud. HPC traditionally involves bleeding edge technologies, including lots of CPU cores, Infiniband interconnects between nodes, MPI libraries for message passing, and, surprise—NFS, a very old timer of the UNIX universe.
In an HPC context this networked filesystem is used to get the data inside the compute node before doing the raw computation, and then to extract the data from the compute node.
Some OSv NFS requirements
For HPC NFS is a must, so we worked to make it happen. We had some key requirements:
- The NFS driver must go reasonably fast
- The implementation of the NFS driver must be done very quickly to meet the schedule of the rest of the MIKELANGELO project
- There is no FUSE (Filesystem in User Space) implementation in OSv
- OSv is a C++ unikernel, so the implementation must make full usage of its power
- The implementation must use the OSv VFS (Virtual File System) layer, and so be transparent for the application
The first possibility that we can exclude right away is doing an NFS implementation from scratch. This subproject is simply too short on time.
The second possibility is to leverage an implementation from an existing mainstream kernel and simply port it to OSv. The pro would be code reuse, but this comes with a lot of cons.
- Some implementation licenses do not match well with the unikernel concept where everything can be considered a derived work of the core kernel
- Every operating system has its own flavor of VFS. The NFS subproject would be at risk of writing wrappers around another operating system’s VFS idiosyncrasies
- Most mainstream kernel memory allocators are very specific and complex, which would leads to more insane wrappers.
The third possibility would be to use some userspace NFS implementation, as their code is usually straightforward POSIX and they provide a nice API designed to be embedded easily in another application. But wait! Didn’t we just say the implementation must be in the VFS, right in the middle of the OSv kernel? There is no FUSE on OSv.
Enter the Unikernel
Traditional UNIX-like operating system implementations are split in two:
- Kernel space: a kernel doing the low level plumbing everyone else will use
- User space: a bunch of applications using the facilities provided by the kernel in order to accomplish some tasks for the user
One twist of this split is that kernel space and user space memory addresses are totally separated by using the MMU (Memory Management Unit) hardware of the processor. It also usually implies two totally different sets of programing APIs, one for kernel space and one for user space, and needless to say a lot of memory copies each time some data must cross the frontier from kernel space to userspace.
A unikernel such as OSv is different. There is only one big address space and only one set of programing APIs. Therefore you can use POSIX and Linux userspace APIs right in an OSv driver. So no API wrappers to write and no memory copies.
Another straightforward consequence of this is that standard memory management functions including malloc(), posix_memalign(), free() and friends, will just work inside an OSv driver. There are no separate kernel-level functions for managing memory, so no memory allocator wrappers needed.
libnfs, by Ronnie Sahlberg, is a user space NFS implementation for Linux, designed to be embedded easilly in an application.
It’s already used in successful programs like Fabrice Bellard’s QEMU, and the author is an established open source developer who will not disappear in a snap.
Last by not last, the libnfs license is LGPL. So far so good.
The implementation phase
The implementation phase went fast for a networked filesystem. Reviewing went smoothly thanks to Nadav Har’El’s help, and the final post on the osv-devel mailing list was the following:
Some extra days were spent to fix the occasional bugs and polish the result and now the MIKELANGELO HPC developers have a working NFS client.
Some code highlights
Almost 1:1 mapping
Given the unikernel nature of OSv, an useful system call like truncate(), used to adjust the size of a file, boils down to
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
OSv allowed us to implement this syscall with a very thin shim without involving any additional memory allocation wrapper.
C++ empowers you to do powerful things in kernel code
One of the known limitation of libnfs is that it’s not thread-safe. See this mailing list posting on multithreading and preformance. OSv is threaded—so heavily threaded that there is no concept of a process in OSv, just threads. Clearly this is a problem, but OSv is written in modern C++, which provides us with modern tools.
This single line allows us to work around the libnfs single threaded limitation.
Here the code makes an associative map between the mount point (the place in the filesystem hierarchy where the remote filesystem appears) and the libnfs
The one twist to notice here is
thread_local: this single C++ keyword automatically makes a separate instance of this map per thread. The consequence is that every thread/mount point pair can have its own separate
mount_context. Although an individual
mount_context is not thread-safe, that is no longer an issue.
As we have seen here, the OSv unikernel is different in a lot of good ways, and allows you to write kernel code fast.
- Standard POSIX functions just work in the kernel.
- C++, which is not used in other kernels, comes with blessings.
Scylla will keep improving OSv with the various MIKELANGELO partners, and we should see exciting new hot technologies like vRDMA on OSv in the not so distant future.
The MIKELANGELO research project is a three-year research project sponsored by the European Commission’s Horizon 2020 program. The goal of MIKELANGELO is to make the cloud more useful for a wider range of applications, and in particular make it easier and faster to run high-performance computing (HPC) and I/O-intensive applications in the cloud. For project updates, visit the MIKELANGELO site, or subscribe to this blog’s RSS feed.