It’s a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will result in shorter runtimes. The growth of big data has made time-saving techniques such as performing operations in-memory more attractive than ever for programmers. New research, though, challenges the notion that in-memory operations are always faster than disk-access approaches and reinforces the need for developers to better understand system-level software.
These findings were recently presented by researchers from the University of Calgary and the University of British Columbia in a paper titled “When In-Memory Computing is Slower than Heavy Disk Usage.” Using a simple example, they tested the assumption that working in-memory is necessarily faster than doing lots of disk writes. Specifically, they compared the efficiency of alternative ways to create a 1MB string and write it to disk. An in-memory version concatenated strings of fixed sizes (first 1 byte, then 10, then 1,000, then 1,000,000 bytes) in-memory, then wrote the result to disk in a single write. The disk-only approach wrote the strings directly to disk (e.g., 1,000,000 writes of 1-byte strings, 100,000 writes of 10-byte strings, etc.).
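As a rough illustration of the two strategies described above, here is a minimal Python sketch. It is not the researchers' code: the function names (`write_in_memory`, `write_disk_only`, `timed`), the filler character, the concatenation order, and the use of a temporary file are all assumptions made for this example.

```python
import os
import tempfile
import time

TOTAL_BYTES = 1_000_000  # target size, matching the article's 1MB example


def write_in_memory(path, chunk_size, total=TOTAL_BYTES):
    """Build the whole string in memory, then issue a single write."""
    chunk = "x" * chunk_size
    result = ""
    for _ in range(total // chunk_size):
        # Assumption: prepend each chunk; strings are immutable, so this
        # creates a new string on every iteration.
        result = chunk + result
    with open(path, "w") as f:
        f.write(result)  # the single write


def write_disk_only(path, chunk_size, total=TOTAL_BYTES):
    """Write every chunk straight to the file, one write call per chunk."""
    chunk = "x" * chunk_size
    with open(path, "w") as f:
        for _ in range(total // chunk_size):
            f.write(chunk)  # buffered by the runtime and the OS


def timed(func, chunk_size, total=TOTAL_BYTES):
    """Run one strategy against a temporary file.

    Returns (elapsed_seconds, file_size_in_bytes).
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        func(path, chunk_size, total)
        elapsed = time.perf_counter() - start
        size = os.path.getsize(path)
    finally:
        os.remove(path)
    return elapsed, size
```

Calling `timed(write_in_memory, 10)` and `timed(write_disk_only, 10)` would compare the two strategies for 10-byte strings. Any timings obtained this way depend on the interpreter, operating system, and hardware, and the 1-byte in-memory case can take a very long time to finish.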
Java and Python versions of the code were written and then run on Windows and Linux systems for comparison. The total time of all writes for the disk-only version was then compared to the total time of the in-memory operations plus the single disk write for the in-memory approach.
The results consistently showed that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly. For example, using Java to concatenate 1,000,000 1-byte strings in-memory and then doing a single write to disk was 9,000 times slower than simply doing 1,000,000 disk writes. This applied to both Windows and Linux systems.
The in-memory approach was faster when the code was written in Python instead of Java, but was still hundreds of times slower than the write-to-disk-only approach when doing many concatenations. As expected, as the number of string concatenations decreased, the in-memory approach got closer and closer to the time required by the disk-only approach.
The explanation offered by the authors is that these higher-level languages are doing a lot of work behind the scenes to handle the concatenation, such as creating new objects and copying the strings in order to accommodate the extra bytes of data. “The above explanation applies to any data structure that has to be stored contiguously and increases in size, or is immutable,” they wrote. Conversely, the disk-access approach was faster because the operating systems handled the writes efficiently via buffering and only actually wrote to disk when necessary.
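To get a feel for why repeated copying adds up, consider a simplified cost model, sketched here in Python. It assumes, purely for illustration, that every concatenation allocates a new string and copies the old contents plus the new chunk, and that a buffered write copies each chunk only once. Real runtimes and operating systems are more complicated, and the function names are hypothetical.

```python
def bytes_copied_naive(num_chunks, chunk_size):
    """Bytes copied if each concatenation rebuilds the entire string."""
    total_copied = 0
    current_length = 0
    for _ in range(num_chunks):
        current_length += chunk_size    # the string grows by one chunk
        total_copied += current_length  # and is copied in full (assumed)
    return total_copied


def bytes_copied_buffered(num_chunks, chunk_size):
    """Bytes copied if each chunk is copied once into a write buffer."""
    return num_chunks * chunk_size
```

Under this model, building a 1,000,000-byte string from 1-byte pieces copies 500,000,500,000 bytes, while the buffered path copies 1,000,000. With a single 1,000,000-byte piece, both functions return 1,000,000, which is consistent with the gap closing as the number of concatenations drops.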
To improve the speed of working in-memory, one needs to know how the language and operating system handle the operations. For example, the authors were able to make the in-memory Python version of the code much faster on Linux by changing the concatenation order, specifically by adding the new string to the end (rather than the beginning) of the target string. This change, however, didn’t help the performance of the in-memory approach on Windows. Also, this improvement was undone if the concatenation string was declared as a global variable. For the Java code, a similar change in concatenation order made no difference at all, though using a mutable data type did improve in-memory performance dramatically.
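The variations mentioned above can be sketched in Python as three ways of building the same string: prepending, appending, and collecting the pieces in a mutable buffer. This is an illustrative sketch rather than the authors' code. The function names are hypothetical, and `io.StringIO` is used here only as a Python stand-in for "a mutable data type" (the article mentions that idea in the context of Java).

```python
import io


def build_by_prepending(chunk, count):
    """Add each new chunk to the beginning of the target string."""
    result = ""
    for _ in range(count):
        result = chunk + result
    return result


def build_by_appending(chunk, count):
    """Add each new chunk to the end of the target string."""
    result = ""
    for _ in range(count):
        result += chunk
    return result


def build_with_buffer(chunk, count):
    """Collect chunks in a mutable in-memory buffer, then read it once."""
    buffer = io.StringIO()
    for _ in range(count):
        buffer.write(chunk)
    return buffer.getvalue()
```

All three return the same string when the chunks are identical; only the amount of work done behind the scenes differs. How much faster one variant runs than another depends on the interpreter and platform, so it is worth measuring rather than assuming.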
While this is a very specific example, the implication of this study is that developers shouldn’t automatically assume that working in-memory is always faster than disk access. To really know whether an operation would be faster if done in-memory, developers need a solid understanding of how the underlying language and operating system handle things. As the authors wrote, “This justifies our emphasis on 1) re-examining system-level algorithms with in-memory operations in mind, and 2) better training to make developers familiar with system-level software intricacies. Doing so would help in-memory computing better deliver on its potentials and promises.”