15

Using the program below, I'm trying to test how fast I can write to disk using std::ofstream.

I achieve around 300 MiB/s when writing a 1 GiB file.

However, a simple file copy using the cp command is at least twice as fast.

Is my program hitting the hardware limit or can it be made faster?

#include <chrono>
#include <iostream>
#include <fstream>

char payload[1000 * 1000]; // 1 MB

void test(int MB)
{
    // Configure buffer
    char buffer[32 * 1000];
    std::ofstream of("test.file");
    of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != MB; ++i)
    {
        of.write(payload, sizeof(payload));
    }

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}

int main()
{
    for (auto i = 1; i <= 10; ++i)
    {
        test(i * 100);
    }
}

Output:

Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s

Update

I changed from std::ofstream to fwrite:

#include <chrono>
#include <cstdio>
#include <iostream>

char payload[1024 * 1024]; // 1 MiB

void test(int number_of_megabytes)
{
    FILE* file = fopen("test.file", "w");

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != number_of_megabytes; ++i)
    {
       fwrite(payload, 1, sizeof(payload), file );
    }
    fclose(file); // TODO: RAII

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns/1e9)/100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}

int main()
{
    test(256);
    test(512);
    test(1024);
    test(1024);
}

This improves the speed to 668 MiB/s for a 1 GiB file:

Size=256MiB   Duration=0.4s   Speed=2524.66MiB/s
Size=512MiB   Duration=0.79s  Speed=1262.41MiB/s
Size=1024MiB  Duration=1.5s   Speed=664.521MiB/s
Size=1024MiB  Duration=1.5s   Speed=668.85MiB/s

That is just as fast as dd:

time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576

real    0m1.539s
user    0m0.001s
sys 0m0.344s
  • Are you testing a release build of your program with optimizations? Have you tried increasing the buffer size? Commented Mar 4, 2017 at 8:35
  • Shouldn't it be double megabytes_per_ns = MB / elapsed_ns; ?
    – zett42
    Commented Mar 12, 2017 at 11:36
  • Also you should open the stream in binary mode to fairly compare it to other methods of writing. Use std::ofstream of("test.file", std::ios::binary). I get very close performance between ofstream and fwrite then (differences are in the range of measurement errors). Compiler VC++2017.
    – zett42
    Commented Mar 12, 2017 at 12:27
  • Test results for payload=1024 MiB on my machine, averaged over 50 runs. The ofstream opened in binary mode. fwrite() 704.159 MiB/s, ofstream::write() 646.046 MiB/s. Compiler VC++2017.
    – zett42
    Commented Mar 12, 2017 at 13:36

5 Answers

15
+50

First, you're not really measuring the disk writing speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk writing speed, the data should be flushed to disk before calculating the time. Without flushing there could be a difference depending on the file size and the available memory.
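
For example (a minimal sketch, assuming a POSIX system; the helper name timed_write and the lack of error handling are illustrative only), the stdio buffer can be flushed and the file fsync()ed before the end time is taken:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <unistd.h>  // fsync(), fileno() (POSIX)

// Sketch: stop the clock only after the data has left both the stdio
// buffer and the OS page cache.
double timed_write(const char* path, const char* chunk, std::size_t chunk_size, int chunks)
{
    FILE* file = std::fopen(path, "wb");
    auto start = std::chrono::steady_clock::now();

    for (int i = 0; i != chunks; ++i)
        std::fwrite(chunk, 1, chunk_size, file);

    std::fflush(file);    // stdio buffer -> OS
    fsync(fileno(file));  // OS page cache -> disk
    std::fclose(file);

    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}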

There seems to be something wrong in the calculations too. You're not using the value of MB.

Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do the same for payload (it looks like you changed that from 1024 to 1000 in the edit where you added the calculations).

Do not use streams to write a (binary) buffer of data to disk, but instead write directly to the file, using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.


To copy a file: open the source file read-only (and, if the OS supports it, with a sequential/forward-only hint), then loop with fread() and fwrite():

while fread() from source to buffer
  fwrite() buffer to destination file

This should give you a speed comparable to the speed of an OS file copy (you might want to test some different buffer sizes).
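
A minimal sketch of such a copy loop (the 1 MiB buffer size is an arbitrary choice and error handling is kept to a minimum):

#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch: plain fread()/fwrite() copy through a user-space buffer.
bool copy_file(const char* src_path, const char* dst_path)
{
    FILE* src = std::fopen(src_path, "rb");
    FILE* dst = std::fopen(dst_path, "wb");
    if (!src || !dst)
    {
        if (src) std::fclose(src);
        if (dst) std::fclose(dst);
        return false;
    }

    std::vector<char> buffer(1024 * 1024);  // 1 MiB; worth experimenting with
    std::size_t n;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), src)) > 0)
        std::fwrite(buffer.data(), 1, n, dst);

    std::fclose(dst);
    std::fclose(src);
    return true;
}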

This might be slightly faster using memory mapping:

open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest

For large files, smaller mapped views should be used.
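
A rough sketch of the single-view variant, assuming POSIX mmap(); the function name mmap_copy is illustrative and most error checks (MAP_FAILED, empty files) are omitted:

#include <cstring>      // std::memcpy
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close, ftruncate

// Sketch: map source and destination in one view each and memcpy() across.
// For large files, map and copy in smaller views instead.
bool mmap_copy(const char* src_path, const char* dst_path)
{
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0)
    {
        if (src >= 0) close(src);
        if (dst >= 0) close(dst);
        return false;
    }

    struct stat st;
    fstat(src, &st);
    ftruncate(dst, st.st_size);  // destination must already have the final size

    void* s = mmap(nullptr, st.st_size, PROT_READ,  MAP_PRIVATE, src, 0);
    void* d = mmap(nullptr, st.st_size, PROT_WRITE, MAP_SHARED,  dst, 0);

    std::memcpy(d, s, st.st_size);  // the actual copy

    munmap(s, st.st_size);
    munmap(d, st.st_size);
    close(src);
    close(dst);
    return true;
}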

  • 1. Streams are slow. 2. cp uses the syscalls read(2) or mmap(2) directly.
  • The linked question is 5 years old, implementations may have improved in the meantime. The code sample links are dead.
    – zett42
    Commented Mar 12, 2017 at 12:37
4

I'd wager that it's something clever inside either cp or the filesystem. If it's inside cp, it might be that the file you are copying contains a lot of zeros, and cp detects this and writes a sparse version of your file. The man page for cp says "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." This could mean a few things, but one of them is that cp may create a sparse version of your file, which requires less disk write time.

If it's within your filesystem then it might be Deduplication.

As a long-shot third possibility, it might also be something within your OS or your disk firmware that translates the read and write into some specialized instruction that doesn't require as much synchronization as your program does (lower bus use means less latency).

4

You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.

You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).

The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.

You're not using any of the operating system's hints that you're doing sequential reading.

To fix this:

  • Change your code to have more than one outstanding read request at all times.
  • Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
  • Use the posix_fadvise() flags to help the OS optimize its disk scheduling and page cache (see the sketch below).
  • Consider using mmap to cut down on overhead.
  • Use a larger buffer size per read request to cut down on overhead.

This answer has more information: https://stackoverflow.com/a/3756466/344638
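
For reference, a minimal sketch of the posix_fadvise() hint (POSIX only; the helper name open_sequential is just an illustration):

#include <fcntl.h>   // open(), posix_fadvise()
#include <unistd.h>  // close()

// Sketch: tell the kernel we will read this file sequentially,
// so it can read ahead more aggressively.
int open_sequential(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // len 0 = "to end of file"
    return fd;
}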

0

The problem is that you specify too small a buffer for your fstream:

char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

Your app runs in user mode. To write to disk, ofstream calls the system write function, which executes in kernel mode. write then transfers the data to the system cache, then to the HDD cache, and finally it is written to the disk.

The buffer size affects the number of system calls (one call for every 32*1000 bytes). During a system call the OS must switch the execution context from user mode to kernel mode and then back. Context switching is overhead; in Linux it is equivalent to roughly 2500-3500 simple CPU instructions. Because of that, your app spends most of its CPU time on context switching.

In your second app you use

FILE* file = fopen("test.file", "w");

FILE* uses a bigger buffer by default, which is why it produces more efficient I/O. You can try to specify a small buffer with setvbuf(); in that case you should see the same performance degradation.
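
A minimal sketch of that experiment (the loop bounds are illustrative):

#include <cstdio>

int main()
{
    static char small_buffer[32 * 1000];
    static char payload[1000 * 1000];  // same 1 MB payload as before

    FILE* file = std::fopen("test.file", "w");
    // setvbuf() must be called after fopen() but before the first read/write.
    std::setvbuf(file, small_buffer, _IOFBF, sizeof(small_buffer));

    for (int i = 0; i != 1000; ++i)  // ~1 GB total
        std::fwrite(payload, 1, sizeof(payload), file);

    std::fclose(file);
}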

Please note that in your case the bottleneck is not HDD performance; it is context switching.

