What Is the Most Popular OSv Virtual Appliance?

By Tzach Livyatan

(Spoiler: It’s Apache Tomcat.)

Capstan is a tool for rapidly building and running applications on OSv. As with Docker, Capstan users can download and run images from a public repository. We chose to implement our public Capstan repository using Amazon S3.

Amazon S3 gives us the flexibility and security we need, but by default it’s missing a critical feature: download statistics. These statistics are very interesting to us for evaluating which Capstan virtual appliances are the most popular. Fortunately, there is an easy way to gather the stats we need.

After a short survey of tools, we chose s3stat.

s3stat is a cloud-based service that follows an S3 bucket and visualizes download statistics by file, country, browser, day, and more. The price makes sense, and it is super easy to enable.
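
s3stat works by analyzing the standard Amazon S3 server access logs, so the only setup on our side is turning on access logging for the bucket (the log bucket also needs to grant write access to the S3 log delivery group). Here is a rough sketch using the AWS CLI; the bucket names are placeholders, not our real repository:

# Enable S3 server access logging so s3stat has data to analyze
# (capstan-repo and capstan-repo-logs are placeholder bucket names)
aws s3api put-bucket-logging --bucket capstan-repo \
  --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"capstan-repo-logs","TargetPrefix":"logs/"}}'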

s3stat chart

So what are the results? (Drum roll….)

s3stat files

Omitting downloads of the Capstan index.yaml files, which Capstan fetches for every repository search, the most popular images are the base images for OSv and OSv + Java. That makes sense, because these two images are used by anyone who wants to build a local OSv application, whether native or Java.

Virtual appliances come right after, with (drums again….) Tomcat, Cassandra, and Memcached on the podium (Tomcat takes the gold). These are all very early results, but we will keep using s3stat to follow Capstan image downloads.

s3stat map

Hypervisors Are Dead, Long Live the Hypervisor (Part 3)

By Dor Laor and Avi Kivity

The new school: functionality, isolation and simplicity

(This is part 3 of a 3-part series. Part 1, Part 2)

Containers make administration simple, and VMs give you portability, isolation, and administration advantages. The concept of putting containers inside VMs gives you the isolation you need, but there are now two layers of configuration and overhead instead of one.

What if there were one technology that could give us the simplicity and reduced overhead of containers and the security, tools, and hardware support of hypervisors? That’s where OSv comes in. OSv, the “Operating System Designed for the Cloud,” takes an approach different from either containerization or virtualizing an existing bare-metal OS. OSv is a single address space OS, designed to run as a guest only, with one application per VM.

Best of both worlds?

Glauber Costa, in a speech at Linuxcon called “The failure of Operating Systems, and how we can fix it”, pointed out that the existence of hypervisors is evidence that Operating Systems alone cannot meet some of the demands of real workloads. Through OSv, we have the opportunity to work together with the hypervisor to create a solution superior to what can be done with the OS alone: combining the resource efficiency of containers with the processor-aided advantages of hardware virtualization.

In the eight years since the release of Intel VMX, the silicon has kept getting better and better at moving the costs of virtualization into hardware. Enterprise customers have been demanding lower virtualization overhead for as long as hypervisors have been a thing, and the best minds of the CPU industry are working on it. With nested page tables and other features coming “for free” on the processor, virtualization overhead is being squeezed closer and closer to parity with bare metal.

typical cloud stack with duplication

Typical cloud stacks have duplicate functionality at the hypervisor, guest OS, and application levels.

While many players are trying to carve out a simple OS containerization system at the guest OS level, they are ignoring the stable, simple, secure, hardware-supported interface we already have: the hypervisor-guest interface. Nothing says that this well-tested, industry-standard interface can only be used to run a large, complete OS designed for bare metal. (In fact, research projects such as “Erlang on Xen” and MirageOS have explored using the hypervisor to run something less than a full OS for quite a while.)

OSv is designed to perform

OSv transparently loads an application into its kernel space. There is no userspace whatsoever, which removes the need for user-to-kernel context switches. In addition, the kernel trusts the application, since it relies on the underlying hypervisor for isolation from other applications in other VMs. This opens up a way for the application to use any kernel API – from making scheduling decisions to performing zero-copy operations on data, and even harnessing the raw power of the hardware page tables for the benefit of the application or its framework.

To date (June 2014), OSv provides 4x better performance for Memcached, a 40% gain with Apache Tomcat, and a 20% gain with Cassandra and SPECjbb. These results are based on our alpha versions, and are likely to improve as we complete the optimizations remaining on our roadmap.

OSv example workloads

OSv runs many key cloud workloads with low overhead and high performance.

OSv’s image is your app plus our kernel. That can mean an image size of 10MB! That’s 100-400x smaller than a traditional OS image and resembles a container’s footprint. OSv boot time is under a second, which is also close to container startup time.

OSv management: some questions for devops

How many configuration files does your OS have? OSv has zero.

How many times have you had to perform string manipulation on UNIX-like config files? OSv is built for automation and uses a RESTful API instead.

How hard is it to upgrade your OS, and how can you revert it? OSv is stateless.
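
To give a feel for that RESTful, automation-first approach, here is a rough sketch of querying a running OSv instance from the command line; the hostname, port, and endpoint path are illustrative assumptions rather than a definitive API reference:

# Ask a running OSv guest for its version over HTTP instead of poking at config files
# (my-osv-vm, port 8000, and /os/version are assumptions for illustration)
curl http://my-osv-vm:8000/os/version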

With a hypervisor below, you get features such as live migration, strong SLAs, and superior security for free, while still enjoying OSv’s added value.

Capstan – or what we have learned from Docker

We love Docker for development. The neat public image repository and the dead-simple single-command execution won our hearts. We wanted the same for VMs, so we created the Capstan project. Capstan has a public image repository, and by executing ‘capstan run cloudius/osv-cassandra’ a virtual machine image is either downloaded and run on your laptop (Mac OS X, Microsoft Windows, or Linux) or executed on your cloud of choice. Capstan also lets you build images that combine your app with a base OSv image; it takes about three seconds. On Capstan’s roadmap, we plan to support the Docker file format, run Java apps directly without a config file, and form a simple PaaS that lets developers load their favorite app directly into a running VM.
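
As a quick illustration of that build-and-run workflow, here is roughly what it looks like from a directory containing a Capstanfile; the image name is a placeholder, and exact options may differ between Capstan versions:

# Build an OSv image from the Capstanfile in the current directory, then boot it locally
# ("my-app" is a placeholder image name)
capstan build my-app
capstan run my-app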

Pick a cloud, any cloud

The business case for cloud computing has never been better for the customer. While Amazon continues to upgrade the available instances and offer faster VMs at lower prices, Google is coming on strong as well. Microsoft, HP, IBM, and others are all competing for cloud business. The cloud VM is the new generic PC. Because we can create standard VMs that will run on anyone’s cloud, or on a private or hybrid cloud, we can develop with the confidence that we’ll be able to deploy to whatever infrastructure makes business sense–or move, or split deployment.

Lastly, we’d like to point out that we are not against containers. Container technology is awesome when used in the right scenario. Just as there are cases for public transportation versus private cars, the same applies to devops: containers and OSv each excel in different domains. Here is a simple flow chart that can guide your choice:

Guest OS selection flowchart

Using OSv on ubiquitous, secure, full-featured hypervisors is the way to keep performance up, costs down, and options open. We had to completely reinvent the guest OS to do it—but now that we have it, OSv is available to build on. Please join the osv-dev mailing list for technical info, or follow @CloudiusSystems on Twitter for the latest news.

( Part 1, Part 2 )

Hypervisors Are Dead, Long Live the Hypervisor (Part 2)

By Dor Laor and Avi Kivity

Linux containers

(This is part 2 of a 3-part series. Part 1 was published yesterday.)

Containers, which create isolated compartments at the operating system level instead of adding a separate hypervisor level, trace their history not to mainframe days, but to Unix systems.

FreeBSD introduced “jails” in 2000. There’s a good description of them in Jails: Confining the omnipotent root by Poul-Henning Kamp and Robert N. M. Watson. Solaris got its Zones in 2005. Both systems allowed for an isolated “root” user and root filesystem.

The containers we know today, Linux Containers, or LXC, are not a single monolithic system, but more of a concept, based on a combination of several different isolation mechanisms built into Linux at the kernel level. Linux Containers 1.0 was released earlier this year, but many of the underlying systems have been under development in Linux independently. Containers are not an all-or-nothing design decision, and it’s possible for different systems to work with them in different ways. LXC can use all of the following Linux features (a minimal namespace example follows the list):

  • Kernel namespaces (ipc, uts, mount, pid, network and user)

  • AppArmor and SELinux profiles

  • Seccomp policies

  • Chroots (using pivot_root)

  • Kernel capabilities

  • Control groups (cgroups)
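
To make the namespace mechanism concrete, here is a minimal sketch using the unshare utility from util-linux. It starts a shell in fresh PID and mount namespaces, so processes inside see only their own process tree; this is the raw building block that LXC, Docker, and systemd-nspawn wrap with friendlier tooling.

# Start a shell in new PID and mount namespaces (requires root);
# --mount-proc remounts /proc so "ps" inside shows only namespaced processes
sudo unshare --pid --mount --fork --mount-proc /bin/bash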

Although the combination can be complex, there are tools that make containers simple to use. For several years, userspace tools such as LXC and libvirt have allowed users to manage containers. However, containers didn’t really get picked up by the masses until the creation of Docker. Docker and systemd-nspawn can start containers with minimal configuration, or from the command line. The Docker developers deserve much credit for adding two powerful concepts on top of the underlying container complexity: (a) a public image repository, offering immediate search and download of containers pre-loaded with dependencies, and (b) dead-simple execution, a one-liner command for running a container.
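
That one-liner really is a one-liner. For example (the image name and port mapping are purely illustrative):

# Pull the image from the public repository if needed, then run it in the background,
# mapping container port 8080 to the same port on the host
docker run -d -p 8080:8080 tomcat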

Docker diagram

Docker gives container users a simple build process and a public repository system.

Container advantages

When deployed on a physical machine, containers eliminate the need to run two operating systems, one on top of the other (as in traditional virtualization). I/O system calls are almost native, and the footprint is minimal. However, this comes at a cost, as we will detail below. The rule of thumb is that if you do not need multi-tenancy and you’re willing to do without a bunch of software-defined features, containers on bare metal are perfect for you!

In production, Google uses containers extensively, starting more than two billion per week. Each container includes an application, built together with its dependencies, and containerization helps the company manage diverse applications across many servers.

Containers are an excellent fit for development and test. It becomes possible to test some fairly complex setups, such as a version control system with hooks, or an SMTP server with spam filters, by running services in a container. Because a container can use namespaces to get a full set of port numbers, it’s easy to run multiple complex tests at a time. The systemd project even uses containers for testing its software, which manages an entire Linux system. Containers are highly useful for testing because of their fast startup time–you’re just invoking an isolated set of processes on an existing kernel, not booting an entire guest OS.

If you run multiple applications that depend on different versions of a dependency, deploying each application within its own container lets you avoid dependency conflicts. Containers in theory decouple the application from the operating system. We use the term ‘in theory’ because lots of care and thought should be given to maintaining your stack. For example, will your container combo be supported by the host OS vendor? Is your container up-to-date, and does it include fixes for bugs such as Heartbleed? Is your host fully updated, and does its kernel API provide the capabilities your application requires?

We highly recommend the use of containers whenever your environment is homogeneous:

  • No multitenancy

  • Application is always written with clustering in mind

  • Load balancing is achieved by killing running apps and re-spinning them elsewhere (as opposed to live migration)

  • No need to run different kernel versions

  • No underlying hypervisor (otherwise, you’re just adding a layer)

When the above apply, you will enjoy near bare-metal performance, a small footprint and fast boot time.

Container disadvantages: Security

It’s clear that a public cloud needs strong isolation separating tenant systems. All that an attacker needs is an email address and a credit card number to set up a hostile VM on the same hardware as yours. But strong isolation is also needed in private clouds behind the corporate firewall. Corporate IT won’t be keen to run sandbox42.interns.engr.example.com and payroll.example.com within the same security domain.

Hypervisors have a relatively simple security model. The interface between the guest and the hypervisor is well defined, based on real or virtual hardware specifications. Five decades of hypervisor development have helped to form a stable and mature interface. Large portions of the interface’s security are enforced by the physical hardware.

Containers, on the other hand, are implemented purely in software. All containers and their host share the same kernel. Nearly the entire Linux kernel had to undergo changes in order to implement isolation for resources such as memory management, network stack, I/O, the scheduler, and user namespaces. The Linux community is investing a lot of effort to improve and expand container support. However, rapid development makes it harder to stabilize and harden the container interfaces.

Container disadvantages: Software-defined data center

Hypervisors are the basis for the new virtualized data center. They allow us to perfectly abstract the hardware and play nicely with networking and storage. Today there isn’t a switch or a storage system without VM integration or VM-specific features.

Can a virtualized data center be based on containers in place of hypervisors? At almost all companies, no. There will always be security issues with mounting SAN devices and filesystems from containers in different security domains. Yes, containers are a good fit for plenty of tasks, but they are restricted when it comes to sensitive areas such as your data center building blocks: the storage and the network.

No one operating system, even Linux, will run 100% of the applications in the data center. There will always be diversity at the data center, and the existence of different operating systems will force the enterprise to keep the abstraction at the VM level.

Container disadvantages: Management

The long history of hypervisors means that the industry has developed a huge collection of tools for real-world administration needs.

CPU monitoring in VMware

Blogger Robert Moran shows a screenshot of CPU monitoring in VMware’s vSphere.

The underlying functionality for hypervisor management is also richer. All of the common hypervisors support “live migration” of guests from one host to another.

Hypervisors have become an essential tool in the community of practice around server administration. Corporate IT is in the process of virtualizing its diverse collection of servers, running modern and vintage Linux distributions plus legacy operating systems, and hypervisor vendors including VMware and Microsoft are enabling it.

Container disadvantages: Complexity

While containers take advantage of the power built into Linux, they share Linux’s complexity and diversity. For example, each Linux distribution standardizes on a different kernel version, and some use AppArmor while others use SELinux. Because containers are implemented using multiple isolation features at the OS level, the “containerization” features can vary by kernel version and platform.

The anatomy of a multi-tenant exploit

Let’s assume a cloud vendor, whether SaaS, IaaS, or PaaS, implements a service within a container. How would an attacker exploit it? The first stage would be to gain control of the application within the container. Many applications have flaws, and the attacker would need to exploit an existing unpatched CVE in order to gain access to the container. IaaS makes it even simpler, as the attacker already has a “root” shell inside a neighboring container.

The next stage would be to penetrate the kernel. Unfortunately, the kernel’s attack surface contains hundreds of system calls, and other vulnerabilities exist in the form of packets and file metadata that can jeopardize the kernel. Many attackers have access to zero-day exploits, unpublished local kernel vulnerabilities. (A typical “workflow” is to watch upstream kernel development for security-sensitive fixes, and figure out how to exploit them on the older kernels in production use.)

Once the hacker gains control of the kernel, it’s game over. All the other tenants’ data is exposed.

The list of exploitable bugs is always changing, and there will probably be more available by the time you read this. A few recent examples:

  • “An information leak was discovered in the Linux kernel’s SIOCWANDEV ioctl call. A local user with the CAP_NET_ADMIN capability could exploit this flaw to obtain potentially sensitive information from kernel memory.” (CVE-2014-1444) Some container configurations have CAP_NET_ADMIN, while others don’t. Because it’s possible to set up containers in more or less restricted ways, individual sites need to check if they’re vulnerable. (Many Linux capabilities are equivalent to root because they can be used to obtain root access.)

  • “An information leak was discovered in the wanxl ioctl function in Linux. A local user could exploit this flaw to obtain potentially sensitive information from kernel memory.” (CVE-2014-1445)

  • “An unprivileged local user with access to a CIFS share could use this flaw to crash the system or leak kernel memory. Privilege escalation cannot be ruled out (since memory corruption is involved), but is unlikely.” (CVE-2014-0069)

Each individual vulnerability is usually fixed quickly, but there’s a constant flow of new ones for attackers to use. Linux filesystem developer Ted Ts’o wrote,

Something which is baked in my world view of containers (which I suspect is not shared by other people who are interested in using containers) is that given that the kernel is shared, trying to use containers to provide better security isolation between mutually suspicious users is hopeless. That is, it’s pretty much impossible to prevent a user from finding one or more zero day local privilege escalation bugs that will allow a user to break root. And at that point, they will be able to penetrate the kernel, and from there, break security of other processes.

So if you want that kind of security isolation, you shouldn’t be using containers in the first place. You should be using KVM or Xen, and then only after spending a huge amount of effort fuzz testing the KVM/Xen paravirtualization interfaces.

Kernel developer Greg Kroah-Hartman wrote,

Containers are not necessarily a “security” boundary, there are many “leaks” across it, and you should use it only as a way to logically partition off users/processes in a way that makes it easier to manage and maintain complex systems. The container model is quite powerful and tools like docker and systemd-nspawn provide a way to run multiple “images” at once in a very nice way.

Containers are powerful tools for Linux administrators, but for true multi-tenant cloud installations, we need stricter isolation between tenants.

Containerization is not “free”. For instance, the Linux Memory Controller can slow down the kernel by as much as 15%, just by being enabled, with no users. The Memory Controller itself is complicated, and the cgroups infrastructure on which it depends is also complex. The surface of change is just way too big, and the resulting implementation is necessarily too complex. George Dunlap said it best,

With containers you’re starting with everything open and then going around trying to close all the holes; if you miss even a single one, bam, you lose. With VMs, you start with almost everything closed, and selectively open things up; that makes a big difference.

This is part 2 of a 3-part series. Please subscribe to our feed or follow @CloudiusSystems to get a notification when part 3 is available.

Hypervisors Are Dead, Long Live the Hypervisor (Part 1)

By Dor Laor and Avi Kivity

The hypervisor is the basic building block of cloud computing; hypervisors drive the software-defined data center revolution, and two-thirds of all new servers are virtualized today. Hypervisors for commodity hardware have been the key enabler for the software revolution we have been experiencing.

However, for the past 8 years a parallel technology has been growing, namely, containers. Recently containers have been getting a fairly large amount of traction with the development of the Docker project. When run on bare metal, containers perform better than hypervisors and have a lower footprint.

The goals of these technologies have a lot in common. These three blog entries will try to answer the question:

Will containers kill the hypervisor?

The series will provide in-depth explanations about the underlying technology and the pros and cons of each solution.

Intro: ancient hypervisor lore

What is virtualization anyway?

Virtualization diagram

A hypervisor is a software component (potentially assisted by hardware) that allows us to run multiple operating systems on the same physical machine. The overlaid OS is called the guest OS or, simply, a Virtual Machine (VM). The guest OS may not even be aware it is running on virtual hardware.

The interface between the guest and the host is the hardware specification. It covers the CPU itself and any other hardware devices, from BIOS to NICs, SCSI adapters, GPUs and memory.

IBM System/360

IBM, together with MIT and the University of Michigan, pioneered hypervisor technology on System/360 and System/370 mainframes, beginning in the 1960s.

IBM was the first company to produce hypervisors. The IBM System/360 Model 67 was the first to ship with virtual memory hardware supporting virtualization. The next system in the series, System/370, was the “private cloud” of its day. Administrators could set up virtual machines for running different OS versions, and even public-cloud-like “time sharing” by multiple customers.

Virtualization for x86

Virtualization didn’t make it to commodity systems until the release of VMware Workstation in 1999. In the early 2000s, hypervisors were based on pure software and were mostly useful for development and testing. VMware initially used a technique called dynamic translation to intercept privileged operations by the guest operating system. When the guest accessed “hardware”, VMware rewrote the instructions on the fly to protect itself from the guest and isolate guests from each other.

VMware logo

Later on, the open source Xen hypervisor project coined the term paravirtualization (PV). PV guests, which have to be specially modified to run on a PV host, do not execute privileged instructions themselves but ask the hypervisor to do it on their behalf.

Xen logo

Eventually, Intel, AMD and ARM implemented support for virtualization extensions. A special guest execution mode allows running guest code directly on the CPU, getting near 100% of bare-metal throughput for CPU-intensive workloads. In parallel, memory management and the I/O path received attention as well, with technologies such as nested paging (virtual memory), virtual interrupt controllers, single-root I/O virtualization (SR-IOV) and other optimizations.
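
On Linux you can quickly check whether a host advertises these extensions; a non-zero count below means the CPU reports Intel VT-x (vmx) or AMD-V (svm):

# Count CPU flags advertising hardware virtualization support
egrep -c '(vmx|svm)' /proc/cpuinfo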

Hardware support for hypervisors

Hypervisor enablement continues to be a priority for hardware manufacturers. Glauber Costa wrote, “the silicon keeps getting better and better at taking complexity away from software and hiding somewhere else.”

According to a paper from Red Hat Software,

Both Intel and AMD continue to add new features to hardware to improve performance for virtualization. In so doing, they offload more features from the hypervisor into “the silicon” to provide improved performance and a more robust platform….These features allow virtual machines to achieve the same I/O performance as bare metal systems.

Old-school hypervisors

Hypervisors are one of the main pillars of the IT market (try making your way through downtown San Francisco during VMworld) and solve an important piece of the problem. Today the hypervisor layer is commoditized; users can choose any hypervisor they wish when they deploy OpenStack or similar solutions.

Hypervisors are a mature technology with a rich set of tools and features, ranging from live migration and CPU hotplug to software-defined networking and other newly coined terms that describe the virtualization of the data center.

However, in order to virtualize your workload, you must deploy a full-fledged guest operating system onto every VM instance. This new layer is a burden in terms of management and performance overhead. We’ll look at one of the other approaches to compartmentalization next time: containers.

This is part 1 of a 3-part series. Part 2 is now available. Please subscribe to our feed or follow @CloudiusSystems to get a notification of future posts.

Photo credit, IBM 360: Dave Mills for Wikimedia Commons

Nadav Har’El Presenting OSv at USENIX June 19th

By Don Marti

If you’re attending the USENIX Annual Technical Conference, be sure not to miss “OSv—Optimizing the Operating System for Virtual Machines”. Nadav Har’El will be presenting tomorrow at 10:40 AM.

Here’s a quick preview of some of the performance results that Nadav will show:

table from upcoming paper

It’s also your opportunity to ask Nadav some hard questions.

If you’re not at USENIX this year, you can still get a copy of the paper.

Thanks to the USENIX open access policy, the paper is scheduled to go Open Access on the day of the event. To get an alert from us when it’s up, please follow @CloudiusSystems on Twitter.

Using Capstan With Local OSv Images

By Don Marti

If you’re building OSv from source, you can use the capstan push command to temporarily use a local build in place of a base image from a network repository. This is handy when you’re trying your application with a patched version of OSv. Just run capstan push after the OSv build to push your newly built image into your local Capstan repository.

For example, if your Capstanfile uses the cloudius/osv-base base image:

make
capstan push cloudius/osv-base build/release/usr.img

When you’re ready to go back to using the image from the network, you can run

capstan pull cloudius/osv-base

to replace the image in your local repository with the image from the network repository.

For more tips and updates, please follow @CloudiusSystems on Twitter.

OSv Paper Coming to USENIX in June

By Don Marti

We’re going to the USENIX Annual Technical Conference in Philadelphia!

2014 Federated Conferences Week

Our paper, “OSv—Optimizing the Operating System for Virtual Machines” has been accepted by one of our favorite IT events. We appreciate all the excellent comments and questions from our peer reviewers.

This year, ATC will be part of a Federated Conferences Week that includes HotCloud, HotStorage, two days of sysadmin training, and more, so there should be something for everyone.

The paper will be available under Open Access terms starting on the date of the event, but we all hope you can come see us live and in person.

Here’s the abstract:

Virtual machines in the cloud typically run existing general-purpose operating systems such as Linux. We notice that the cloud’s hypervisor already provides some features, such as isolation and hardware abstraction, which are duplicated by traditional operating systems, and that this duplication comes at a cost.

We present the design and implementation of OSv, a new guest operating system designed specifically for running a single application on a virtual machine in the cloud. It addresses the duplication issues by using a low-overhead library-OS-like design. It runs existing applications written for Linux, as well as new applications written for OSv. We demonstrate that OSv is able to efficiently run a variety of existing applications. We demonstrate its sub-second boot time, small OS image and how it makes more memory available to the application. For unmodified network-intensive applications, we demonstrate up to 25% increase in throughput and 47% decrease in latency. By using non-POSIX network APIs, we can further improve performance and demonstrate a 290% increase in Memcached throughput.

For more event updates, please follow @CloudiusSystems on Twitter.

Interview: OSv on 64-bit ARM Systems

Q&A with Paul Mundt, Jani Kokkonen, and Claudio Fontana

Paul Mundt is CTO of OS & Virtualization at Huawei, while Jani and Claudio are both Virtualization Architects on Huawei’s virtualization team. All are based in Munich, which is the headquarters for Huawei’s European Research Center. The company also has a team of OSv developers in Hangzhou, China, who are focused on adaptation of OSv to Huawei’s x86-based enterprise servers.

Q: ARM processors are everywhere. What are the important differences between the Aarch64 hardware that you’re targeting with the OSv port and the garden-variety ARM processors that we have in our phones, toasters, and Raspberry Pis?

Other than the relatively obvious architectural differences in going from a 32-bit to a 64-bit architecture (more general purpose registers, address space, etc), there are quite a number of fundamental changes in v8 that considerably clean up the architecture in contrast to earlier versions.

One of the more substantial changes is the new exception and privilege model, with 4 exception levels now taking the place of v7’s assortment of processor modes. The new privilege levels are much more in line with conventional CPU privilege rings (eg, x86), even though for whatever reason the numbering has been inverted – now with EL3 being the most privileged, and EL0 being the least.

Of specific relevance to the OSv port, through its heavy use of C++11/C1x atomic operations and memory model, are the improvements to the CPU’s own memory and concurrency model. In contrast to x86, v7 and earlier adopt a weak memory model for better energy efficiency, but have always been terrible at sequentially consistent (SC) atomics as a result. In v8, the weak memory model has been retained, but special attention has also been paid to improving the deficiencies in SC atomics, resulting in the addition of load-acquire/store-release instruction pairs that work across the entire spectrum of general purpose and exclusive loads/stores. This places the architecture in direct alignment with the emerging standardization occurring in C++11/C1x, and has simplified much of the porting work in this area.

Beyond this (while not strictly v8-specific) there are also a new range of virtualization extensions to the interrupt controller that we can take advantage of, but unfortunately this part of the IP is not yet finalized and remains under NDA.

As our semiconductor company (HiSilicon) produces its own Aarch64 CPUs, we have also made a number of our own changes to the microarchitecture to better fit our workloads, especially in the areas of the cache and virtual memory system architecture, virtualization extensions, interconnects, and so on.

Q: What class of applications is your team planning to run on OSv?

We see many different potential applications for OSv within Huawei. While OSv is primarily touted as a lightweight cloud OS, the area that is more interesting for my team is its potential as a lightweight kernel for running individual applications directly on the hypervisor, as well as its ability to be used as an I/O or compute node kernel in the dataplane through virtio.

Tight coupling of the JVM to the hypervisor is also an area that we are interested in, particularly as we look to new directions in heterogeneous computing emerging through OpenJDK Sumatra, Aparapi, and the on-going work by the HSA Foundation in which we are also engaged.

Over the next year or so we also expect to see the JVM support maturing, to the point where it should also become possible to run some of the heavier weight big data stacks, but there is a long way to go first.

Q: When you’re considering using OSv as a lightweight kernel for running applications directly on the hypervisor, are you considering using it without a local filesystem? (I understand OSv can boot in about 1/10th the time without ZFS.)

ZFS is indeed quite heavyweight for our needs, and up until this stage in the porting effort we have largely been able to avoid it, but this will obviously not be the long-term case as we look to a solution we can bring to our customers.

In addition to the boot time issues you have mentioned, the interactivity problems between the ZFS adaptive replacement cache (ARC) and the page cache with large mmap()s are an area of concern for some of our potential users, so this is something that we are also closely monitoring, particularly as we think about other ways we might utilize OSv for other applications in the future.

That being said, at the moment we basically see a few different directions to go on the file system side (and by extension, the VFS layer) for our more immediate applications:

1) Simple in-memory file systems with substantially reduced functionality that we can use for scenarios like dataplane applications or I/O nodes where we need no persistent storage. In these cases as we don’t even need to support file I/O, we will likely be carrying out some customization and optimizations in this area. This is obviously in contrast to the compute node and control plane side, which we primarily intend to run under Linux in parallel for now.

2) Adaptation for phase change and other non-volatile memories. OSv has a much lighter weight stack with no real legacy at the moment, so fits the role of testbed quite well in terms of experimenting with the best way to tie these technologies in for different deployment scenarios, particularly under a layer of virtualization. In the long run we would naturally expect the results of this work to transfer to the Linux kernel, too.

3) Global and distributed filesystems – initially across OSv instances, and then across to Linux. This also has implications for the underlying transport mechanisms, particularly as we look to things like lightweight paravirtualized RDMA and inter-VM communication.

Q: Which hypervisor or hypervisors are you using?

While Huawei is actively engaged across many different hypervisors, as my department (in which most of us have a Linux kernel development background) is quite focused on working close to the metal and on performance related issues, KVM is our primary focus.

We have previously done a fair bit of work with guest OS real-time, inter-VM communications, and I/O virtualization enhancements on ARM, so continuing with KVM also makes the most sense for us and our customers.

As one of the main focuses for my OS team is in heterogeneous computing, we also aim to leverage and contribute to much of the work surrounding accelerator, domain processor, and heterogeneous system architecture virtualization under KVM, although much of this is tied up in various European Union framework programmes (eg, FP7-ICT, H2020) at the moment. OSv will also continue to play an important role in these areas as we move forward.

Q: Anything else that you would like to add?

Only that now is an exciting time to be in OSv development. The OS has a lot of potential and is still very much in its infancy, which also makes it an excellent target for trying out new technical directions. I would also encourage people who are not necessarily cloud-focused to look at the broader potential for the system, as there’s certainly a lot of interesting development to get involved in.

About

Paul Mundt

Paul Mundt is the CTO of OS & Virtualization at Huawei’s European Research Center in Munich, Germany, where he manages the Euler department (including OS & Virtualization R&D, as well as new CPU development and support). Paul first joined Huawei at the beginning of 2013 as the Chief Architect of Huawei’s Server OS division, responsible for overall architecture and strategy. Prior to that, as the Linux Kernel Architect at Renesas in Tokyo, Paul was responsible for leading the Linux group within Renesas for seven years, establishing both the initial strategy and vision while taking the group from zero in-house support or upstream engagement to supporting hundreds of different CPUs across the entire MCU/MPU spectrum and becoming a consistent top-10 contributor to the Linux kernel, which carries on to this day. He has more than 15 years of Linux kernel development experience, across a diverse range of domains (HPC, embedded, enterprise, carrier grade), and has spent most of that time as the maintainer of various subsystems (primarily in the areas of core kernel infrastructure, CPU architectures, memory management, and file systems). He previously organized and chaired the Memory Management Working Group within the CE Linux Forum, where he advocated the convergence of Enterprise and Embedded technologies, resulting in the creation of Asymmetric NUMA, as well as early transparent superpage/large TLB adoption. He is a voting member of the OASIS Virtual I/O Device (VIRTIO) Technical Committee and the HSA Foundation.

Jani Kokkonen

Jani Kokkonen received his master’s degree in 2000 from the Technical University of Helsinki, Finland. He went on to pursue a research and development job at Nokia Networks, where the work concentrated on research into different transport technologies on various radio access networks. This was followed by research and development activities on virtualization technologies for 3GPP radio and core network elements, which also included evaluation of hardware extensions for virtualization support on various embedded multicore chips. He has been a virtualization architect in the Huawei ERC Euler Department since September 2011. His work at Huawei has concentrated on research and development on QEMU/KVM on ARM and Intel platforms, varying from CPU to network technologies, with his most recent effort focusing on ARM64 memory management, where he is responsible for the MMU backend in the Aarch64 OSv port, as well as leading the OSv team. He is a member of the OASIS Virtual I/O Device (VIRTIO) Technical Committee and the Multicore Association.

Claudio Fontana

Claudio Fontana received his Laurea in Computer Science in 2005 at the University of Trento, Italy, with a thesis on frameworks for the evaluation of taxonomy matching algorithms. He went on to pursue a software engineering opportunity in Zurich, where he worked on medium-scale (hundreds of hosts) distributed systems. This was followed by a software engineering position in Amsterdam, working on messaging, routing, firewalls and billing systems. He has been with Huawei since December 2011. He is currently working in the virtualization area (Linux/KVM/QEMU). He is part of the early enablement effort for the ARM 64bit architecture (ARMv8 AArch64), and has been a maintainer of and contributor to Free and Open Source projects, lately involving mostly QEMU binary translation (as QEMU Aarch64 TCG maintainer) and the OSv Operating System (as Aarch64 maintainer). He also spent some time as a member of Linaro’s Virtualization team, where he focused on early Aarch64 enablement.

For more information

To keep up with the progress of OSv on ARM (and x86_64 too), join the osv-dev mailing list or follow @CloudiusSystems on Twitter.

Bridged Networking With Capstan

By Don Marti

New versions of Capstan are making it simpler to run OSv virtual machines in a production configuration, by adding more control of network options. A useful new feature, which helps deal with the details of bringing up networking, is the -n option.

By default, Capstan starts up KVM/QEMU with user networking:

 -netdev user,id=un0,net=192.168.122.0/24,host=192.168.122.1

(That’s from ps ax | grep qemu, which you can run to see the qemu-system-x86_64 command that Capstan is executing for you.)

But there are many more networking options for QEMU/KVM. The basic user networking, which does not require root access to start up, is good for development and simple tasks. But for production use, where you need to get your VM on a network where it’s available from other VMs or from the outside, you’ll need bridged networking. (See your Linux distribution or hypervisor documentation for the details of creating a virtual or public bridge device.)
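
As a minimal sketch using bridge-utils (assuming eth0 is the host’s physical interface; your distribution’s tooling may differ, and you will typically also need to move the host’s IP configuration from eth0 to br0):

# Create a bridge and attach the physical NIC to it
# (run as root; this can briefly interrupt networking on eth0)
brctl addbr br0
brctl addif br0 eth0
ip link set br0 up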

If you invoke capstan run with the -n bridge option, you’ll get QEMU running with:

-netdev bridge,id=hn0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper

If you have a specific bridge device to connect to, you can use the -b option with the name of the bridge device. The default is virbr0, but you can also set up a public bridge, usually br0, that’s bridged to a physical network interface on the host.
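
Putting the two options together, a typical invocation against a public bridge looks something like this (br0 assumes the bridge set up as described above):

# Boot the image with bridged networking attached to br0
capstan run -n bridge -b br0 cloudius/osv-base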

Other hypervisors

Don’t feel left out if you have a different hypervisor. Capstan also handles bridged networking on VirtualBox, with the rest of the supported hypervisors coming soon. The fact that the syntax is the same is going to be a big time-saver for those of us who have to do testing and demos on multiple systems—no more dealing with arcane commands that are different from system to system.

For more on Capstan and networking, please join the osv-dev mailing list on Google Groups. You can get updates by subscribing to this blog’s feed, or following @CloudiusSystems on Twitter.

New OSv Meetup Group

By Don Marti

We held the first meeting of the OSv Meetup group in San Francisco this week, and got 14 participants from the Apache, Big Data, and OSv communities, as well as a few meetup.com users interested in cloud computing who just came along serendipitously.

attendees

Thanks to our hosts at OhmData who made their groovy South of Market office space available, and thanks to our attendees for coming in to try out OSv. Looking forward to seeing the results of your initial experiments.

(For the users of VirtualBox on Mac OS who ran into the “assertion failed” problem, we’re discussing that on the osv-dev mailing list now, so watch the list for an update.)

To get advance notice of future events—both the free-form hands-on sessions like this one and an upcoming tech talk series—please join the Meetup group or follow @CloudiusSystems on Twitter.