Surrounded By Bugs

Forgetting about the problem of memory

There's a pattern that emerged in software some time ago which bothers me: in a nutshell, it's become acceptable to assume that memory is unlimited. More precisely, it is the notion that it is acceptable for a program to crash if memory is exhausted.

It's easy to guess some reasons why this is the case: for one thing, memory is much more plentiful than it used to be. The first desktop computer I ever owned had 32KB of memory (I later owned a graphing calculator with the same processor and the same amount of memory as that computer). My current desktop PC, on the other hand, literally has more than a million times that amount of memory.

Given such huge amounts of memory to play with, it's no wonder that programs are written with the assumption that they will always have memory available. After all, you couldn't possibly ever chew up a whole 32 gigabytes of RAM with just the handful of programs that you'd typically run on a desktop, could you? Surely it's enough that we can forget about the problem of ever running out of memory. (Many of us have found out the unfortunate truth the hard way; in this age where simple GUI apps are sometimes bundled together with a whole web browser - and in which web browsers will happily let a web page eat up more and more memory - it certainly is possible to hit that limit).

But, for some time, various languages with garbage-collecting runtimes haven't even exposed memory allocation failure to the application (this is not universally the case, but it's common enough). This means that a program written in such a language, if it can't allocate memory at some point, will generally crash - hopefully with a suitable error message, but by no means with any certainty of a clean shutdown.

This principle has been extended to various libraries, even for languages like C, where checking for allocation failure is (at least in principle) straightforward. GLib (one of the foundation libraries underlying the GTK GUI toolkit) is one example: the g_malloc function that it provides will terminate the calling process if the requested memory can't be allocated (a g_try_malloc function also exists, but it's clear that the g_malloc approach to "handling" failure is considered acceptable, and any library or program built on GLib should typically be considered prone to unscheduled termination in the face of an allocation failure).
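
To make the difference concrete, here's a minimal C sketch (just an illustration of my own, not code from GLib itself; the allocation size is arbitrary). The point is simply that g_malloc terminates the process on failure, while g_try_malloc returns NULL and leaves the decision to the caller:

    /* Minimal sketch contrasting g_malloc and g_try_malloc.
     * Build (assuming GLib is installed) with something like:
     *   cc sketch.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        gsize size = 64 * 1024;   /* arbitrary size, purely for illustration */

        /* g_malloc never returns NULL: if the allocation can't be
         * satisfied, the whole process is terminated. */
        gpointer a = g_malloc(size);
        g_free(a);

        /* g_try_malloc returns NULL on failure, so the caller gets a
         * chance to scale back or otherwise recover. */
        gpointer b = g_try_malloc(size);
        if (b == NULL) {
            fprintf(stderr, "allocation failed; recovering instead of crashing\n");
            return 1;
        }
        g_free(b);
        return 0;
    }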

Apart from the increased availability of memory, I assume that the other reason for ignoring the possibility of allocation failure is just because it is easier. Proper error handling has traditionally been tedious, and memory allocation operations tend to be pervasive; handling allocation failure can mean having to incorporate error paths, and propagate errors, through parts of a program that could otherwise be much simpler. As software gets larger, and more complex, being able to ignore this particular type of failure becomes more attractive.
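
As a rough sketch of what that tedium looks like in C (the record type and make_record function here are made up purely for illustration): every allocation site needs its own check, any partially-constructed state has to be unwound, and the failure has to be reported back to the caller, which in turn needs its own error path.

    #include <stdlib.h>
    #include <string.h>

    struct record {
        char *name;
        char *value;
    };

    /* Returns NULL on allocation failure; every caller must check for
     * that and propagate the error in turn. */
    struct record *make_record(const char *name, const char *value)
    {
        struct record *r = malloc(sizeof *r);
        if (r == NULL) return NULL;

        r->name = strdup(name);
        if (r->name == NULL) {
            free(r);
            return NULL;
        }

        r->value = strdup(value);
        if (r->value == NULL) {
            free(r->name);
            free(r);
            return NULL;
        }

        return r;
    }

Multiply that by every allocation in a large program, and the attraction of simply letting the process die becomes easy to understand.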

The various "replacement for C" languages that have been springing up often have "easier error handling" as a feature - although they don't always extend this to allocation failure; the Rust standard library, for example, generally takes the "panic on allocation failure" approach (I believe there has been work to offer failure-returning functions as an alternative, but even with Rust's error-handling paradigms it is no doubt going to introduce some complexity into an application to make use of these; also, it might not be clear if Rust libraries will handle allocation failures without a panic, meaning that a developer needs to be very careful if they really want to create an application which can gracefully handle such failure).

Even beyond handling allocation failure in applications, the operating system might not expect (or even allow) applications to handle memory allocation failure. Linux, as it's typically configured, has overcommit enabled, meaning that it will allow memory allocations to "succeed" when only address space has actually been allocated in the application; the real memory allocation occurs when the application then uses this address space by storing data into it. Since at that point there is no real way for the application to handle allocation failure, applications will be killed off by the kernel when such failure occurs (via the "OOM killer"). Overcommit can be disabled, theoretically, but to my dismay I have discovered recently that this doesn't play well with cgroups (Linux's resource control feature for process groups): an application in a cgroup that attempts to allocate more than the hard limit for the cgroup will generally be terminated, rather than have the allocation fail, regardless of the overcommit setting.
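
The effect is easy to demonstrate with a small C program (be warned that running it will deliberately exhaust memory). On a typical Linux configuration with overcommit enabled, the individual malloc calls will generally keep "succeeding" well past the point of real memory exhaustion; instead of a failure the program could handle, it is liable to be killed outright once it actually writes to the memory.

    /* Demonstration: keep allocating 1 GiB chunks and then writing to them.
     * With overcommit enabled, each malloc normally "succeeds" even once
     * the total far exceeds available memory; real pages are only committed
     * when the memory is written, and at that point the process is liable
     * to be killed by the OOM killer rather than seeing an error it could
     * handle. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t chunk = (size_t)1024 * 1024 * 1024;   /* 1 GiB */

        for (int i = 0; ; i++) {
            char *p = malloc(chunk);
            if (p == NULL) {
                /* With strict memory accounting, failure appears here,
                 * where the program still has a chance to respond gracefully. */
                printf("allocation %d failed cleanly\n", i);
                return 0;
            }
            memset(p, 1, chunk);   /* force the pages to actually be committed */
            printf("allocated and touched %d GiB so far\n", i + 1);
        }
    }

With strict accounting, by contrast, malloc would eventually return NULL and the program could respond gracefully - which is exactly the branch that overcommit (and, as described above, cgroup limits) tends to make unreachable.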

If the kernel doesn't properly honour allocation requests, and will kill applications without warning when memory becomes exhausted, there's certainly an argument to be made that there's not much point for an application to try to be resilient to allocation failure.

But is this really how it should be?

I'm concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want - but they are prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated? (Linux's OOM killer uses heuristics to try and terminate the "right" process, but of course that doesn't necessarily correspond to what the user or system administrator would want).

I've discussed desktops, but the problem is just as real on servers, perhaps more so; wouldn't it be better if critical processes were able to detect and respond to memory scarcity rather than being killed off arbitrarily? Isn't scaling back, at the application level, better than total failure, at least in some cases?

Linux could be fixed so that the OOM killer was not needed on properly configured systems, even with cgroups; in any case, there are other operating systems that reportedly have better behaviour. That would still leave the applications which don't handle allocation failure, of course; fixing that would take (as well as a lot of work) a change in developer mindset. The thing is, while the odd application crash due to memory exhaustion probably doesn't bother some, it certainly bothers me. Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure? Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn't it be better if this didn't happen?

I'd like to say no, but the current consensus would seem to be against me.


Addendum:

I tried really hard in the above to be clear how minimal a claim I was making, but there are comments that I've seen and discussions I've been embroiled in which make it clear this was not understood by at least some readers. To sum up in what is hopefully an unambiguous fashion:

The one paragraph in particular that I think could possibly have caused confusion is this one:

I'm concerned, personally, about this notion that processes can just be killed off by the system. It rings false. We have these amazing machines at our disposal, with fantastic ability to precisely process data in whatever way and for whatever purpose we want - but they are prone to sudden failures that cannot really be predicted or fully controlled, and which mean the system as a whole is fundamentally less reliable. Is it really ok that any process on the system might just be terminated?

Probably, there should have been emphasis on the "any" (in "any process on the system") to make it clear what I was really saying here, and perhaps the "system as a whole is fundamentally less reliable" is unnecessary fluff.

There's also a question in the concluding paragraph:

Do we really trust that applications will reliably save necessary state at all times prior to crashing due to a malloc failure?

This was a misstep and very much not the question I wanted to ask; I can see how it's misleading. The right question was the one that follows it:

Are we really ok with important system processes occasionally dying, with system functionality accordingly affected? Wouldn't it be better if this didn't happen?

Despite those slips, I think if you read the whole article carefully the key thrust should be apparent.

For anyone looking for a case where an application really does need to be able to handle allocation failures, I recently stumbled across one really good example:

To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even if running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly in my view is not using a language that can put me in a hole. This is not theoretical, that is real scenario that we have to deal with.

The other example is service managers; I am the primary author of one (Dinit), and this is largely what got me thinking about this issue in the first place. A service manager has a system-level role, and if one dies unexpectedly it potentially leaves the whole system in an awkward state (and it's not in general possible to recover just by restarting the service manager). In the worst case, a program running as PID 1 on Linux which terminates will cause the kernel to panic. (The OOM killer will not target PID 1, but a PID 1 process should still be able to handle regular allocation failure gracefully). However, I'm aware of some service manager projects written using languages that will not allow handling allocation failure, and it concerns me.