
Creating a Driver for Self-Ballooning a Virtio Balloon in Linux (Part 1: Ballooning as a Concept)

Introduction

This is a series of articles on constructing a Linux driver that adds self-ballooning support to virtio balloons. Such a driver exists in Xen, and was considered for virtio, but the effort seems to have been abandoned: the mailing list shows no balloon-related activity in recent years, and the last relevant mails never reach a definitive decision on the matter. Ballooning is, in general, kind of a messy subject: it is evident from the linked discussion that finding an optimal design for such a device is a difficult and error-prone process. Nevertheless, the idea is interesting, and implementing this extra functionality for virtio is a nice way of showing virtualization drivers - and Linux device management in general - in action.

Paravirtualized Drivers

An important issue with virtual machines is that resource management becomes more complex, due to multiple levels of management mechanisms that are unable to synchronize with each other. The root of this problem is the so-called semantic gap: a virtual machine is just another process to the host, its inner workings completely opaque. At the same time, the guest's memory manager (provided it runs a conventional OS) is designed for running on bare metal; as a result, the system contains baked-in policies that make total sense in the context for which they were devised, but which prove problematic in a virtualized environment.

A prime example of the effects of this gap is the handling of unused guest memory by both the guest and the host: the former thinks it is running natively, and therefore treats every single page of physical memory that goes unused as a waste. The solution? Pool it into the page cache and use it to avoid disk I/O. While that logic holds for OSes running on bare metal, it falls flat on its face in our case, because the host cannot discern between pages holding irreplaceable data and pages that could very well be freed at any moment, should memory demand in another part of the guest spike. As a result, when the host is strapped for memory, it starts swapping out guest pages indiscriminately, including those that make up the cache. Back in the guest, according to its page tables the cached pages are still in memory; it therefore keeps using them for caching, even though each access now means a host-side page fault and a consequent swap-in. In short, the semantic gap results in the guest's working set expanding to encompass all the memory it was given at startup, even if that means tanking both the host's performance and its own in the process.

The kind of behavior that would be desirable here involves the guest actively coordinating with the host, offering or requesting resources according to both entities' needs. That is possible through a family of mechanisms developed exactly to bridge, at least partially, the semantic gap: paravirtualized devices. These devices' goal is not to faithfully copy the behavior of existing hardware; they are entities emulated by the host that take advantage of the fact that they are not running natively to make both host-side emulation and guest-side drivers simpler. Such devices alleviate, for example, the need to pretend that the guest has a real disk: without paravirtualized devices, we would have to trap and emulate the guest's disk accesses, since the driver in use would be designed for actual hardware. That process involves both configuring hardware on the guest's side and emulating the hardware's expected responses on the host's side, just to stick to the emulated model's specification. With these special devices, we instead have an abstract interface through which we can communicate requests to the host, without pretending we are running on bare metal. Similarly, the host is under no obligation to implement extraneous hardware features, as it would be with legacy drivers.

In our case, what we need is a device through which the guest can signal to the host that it may take back chunks of currently unused memory to cover its own needs, and through which the host can later return that spare memory to each of the guests when needed. This functionality is covered by so-called balloon devices, which "inflate" inside the guest, hollowing out its physical memory map so the freed pages can be given back to the host, and which "deflate" when the host must once again back those voids in the physical address space with real memory. The big picture here is that by using a balloon, the guest owns exactly enough memory to satisfy its current working set. When memory pressure changes, the balloon is expected to change size, coordinating with the host to retrieve enough resources to avoid the performance penalties related to swapping. The end result is that it becomes easier to overcommit a server's memory, i.e. to promise the guests a theoretical maximum that exceeds what is physically available, since a guest now does not expand its memory to encompass the whole physical address space it has been given unless it is absolutely necessary.

A Demonstration of Ballooning

To demonstrate the behavior of the balloon we use libvirt, a virtual machine management library that supports various backends, including Xen and KVM. More specifically, we use its shell utility, virsh, from which we spawned our VM; we could equally have used virt-manager, a GUI with the exact same capabilities. We will inflate the balloon, and see what the end result is in the guest.

Here's the output of top(1) from inside a guest with 1 GB of guest physical memory, before requesting the balloon to inflate from the host:

[Image: top(1) output in the guest before inflation, showing 1 GB of total memory]

And here's the output after the request (done with virsh setmem <domain> 512M):

[Image: top(1) output in the guest after inflation, with total memory reduced to roughly 512 MB]

It is interesting that, due to the aforementioned semantic gap, the host's view of this whole sequence of events is much blurrier: the inflation and deflation of the balloon happen within the guest's memory map, so they are not directly visible. The memory mapped into the hypervisor process (the VIRT (virtual size) entry) does not change value; it always corresponds to the theoretical maximum guest physical memory. That is expected; the metric that changes is the RES (resident size) entry, which denotes how much of a process's address space is actually backed by physical memory at any given moment.

Turning our attention back to the guest, notice how the memory consumed by the balloon stops being accounted for at all - it is not in use; it is nonexistent. This is because the balloon goes beyond merely using memory: it reaches into the memory management subsystem and removes the pages it holds from it completely. We can confirm that by tracing the origin of the data we see from userspace.

The memory info shown by top(1) (and also free(1)) comes from reading /proc/meminfo, a file in the proc pseudo-filesystem, whose files are really channels through which the kernel communicates information to userspace. The kernel documentation gives an excellent rundown of the various uses of this FS - and there are a lot of them. The specific entry in the directory is created by the following code in the kernel, where we can see the variables that correspond to the values our tools read. In this case, the total memory reported by top, presented as MemTotal in /proc/meminfo, is taken from a field of a struct populated by a call to si_meminfo(). This is defined here, where we see that our value is taken from the variable totalram_pages, a kernel-wide symbol. The proper way to modify it seems to be through a call to adjust_managed_page_count(), since modifying it directly would mean ignoring the spinlock that protects it. Indeed, searching for this function in the kernel code confirms that it is used by the virtio balloon driver we are exercising.

In contrast, memory that is in the system, even if in use by the kernel, is fully accounted for. Let's test that using a simple kernel module, called kmem, that allocates 1 GB of memory in-kernel when inserted and frees it when removed. The code is given below:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

/* 1024 chunks of 1 MiB each, for 1 GiB total. */
static void *addresses[1024];

static int __init init_func(void)
{
    int i, j;

    for (i = 0; i < 1024; i++) {
        addresses[i] = kmalloc(1024 * 1024, GFP_KERNEL);
        /* On failure, free what we have allocated so far. */
        if (!addresses[i]) {
            pr_err("kmalloc failed\n");
            for (j = 0; j < i; j++)
                kfree(addresses[j]);
            return -ENOMEM;
        }
    }

    return 0;
}

static void __exit exit_func(void)
{
    int i;

    for (i = 0; i < 1024; i++)
        kfree(addresses[i]);
}

module_init(init_func);
module_exit(exit_func);

MODULE_LICENSE("GPL");

The results are the following:

[Image: top(1) output with the kmem module inserted; the 1 GB remains accounted for in total memory, now shown as used]

It is evident from the above that reserving memory and giving it to the balloon are not the same: the latter follows a very different codepath from the one normally taken when allocating. Ballooning does not just hold pages; it completely wipes parts of the guest's physical address space off the map.

Next Time

Up to now, we have discussed what paravirtualization is, and shown a userspace view of ballooning. In the next post, we will dive into the Xen balloon driver, along with its implemented self-ballooning functionality.

© Emil T. Built using Pelican. Theme by Giulio Fidente on github.