
I want to Get-Content of a large (1 GB - 10 GB) .txt file (which has only 1 line!) and split it into multiple files with multiple lines, but whenever I try, I end up with a System.OutOfMemoryException.

Of course I did search for a solution, but all the solutions I found read the file line by line, which is kinda hard to do when the file only has 1 line.

Although PowerShell takes up to 4 GB of RAM when loading a 1 GB file, the issue is not connected to my RAM as I have a total of 16 GB and even with a game running in the background the peak usage is at around 60%.

I'm using Windows 10 with PowerShell 5.1 (64 bit) and my MaxMemoryPerShellMB is set to the default value of 2147483647.


This is the script I wrote and am using; it works fine with a file size of e.g. 100 MB:

$source = "C:\Users\Env:USERNAME\Desktop\Test\"
$input = "test_1GB.txt"
$temp_dir = "_temp"

# 104'857'600 bytes (or characters) are exactly 100 MB, so a 1 GB file yields exactly
# 10 temporary files, which all have the same size, number of lines and line length.

$out_size = 104857600

# A line length of somewhere around 18'000 characters seems to be the sweet spot, however
# the line length needs to be divisible by 4 and at best fit exactly n times into the
# temporary file, so I use 16'384 bytes (or characters), which is exactly 16 KB.

$line_length = 16384



$file = (gc $input)
$in_size = (gc $input | measure -character | select -expand characters)
if (!(test-path $source$temp_dir)) {ni -type directory -path "$source$temp_dir" >$null 2>&1}

$n = 1
$i = 0

if ($out_size -eq $in_size) {
    $file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\_temp_0001.txt" -encoding ascii
} else {
    while ($i -le ($in_size - $out_size)) {
        $new_file = $file.substring($i,$out_size)
        if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
        $temp_name = "_temp_$count.txt"
        $i += $out_size
        $n += 1
        $new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
    }
    if ($i -ne $in_size) {
        $new_file = $file.substring($i,($in_size-$i))
        if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
        $temp_name = "_temp_$count.txt"
        $new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
    }
}

If there is an easier solution which doesn't use Get-Content, I'd gladly take it, too. It really doesn't matter too much how I achieve the result, as long as it is possible on every up-to-date Windows machine without extra software. If this, however, should not be possible, I would also consider other solutions.

  • Could you clarify your split criteria a bit more? Equal size? Every X bytes? On specific characters? etc..
    – Bob
    Commented Feb 14, 2018 at 23:42
  • @Bob The file contains binary strings of 4 digits each. So it should be every n bytes, where n can be divided by 4. Commented Feb 14, 2018 at 23:54
  • Are you running this on a 32bit machine by any chance? Commented Feb 14, 2018 at 23:56
  • @JamieHanrahan If we're purely talking about the OOM error, one problem is that PowerShell strings take up more memory than one might expect. At a minimum, you're looking at 2 bytes (UTF-16) per byte (ASCII/ANSI-variant) of the file, because strings are internally all UTF-16. I'll expand my answer a bit.
    – Bob
    Commented Feb 15, 2018 at 3:18
  • Also, since Get-Content will attempt to read each line into its own string, a single line must not exceed the max length of a string. It happens that the max length of a single .NET string is (2^31 - 1) chars, leading to ~4 GB memory used (2x because UTF-16). But it turns out there's a 2 GB single object limit, so it's also possible that there's simply two copies of this string going around.
    – Bob
    Commented Feb 15, 2018 at 3:21

1 Answer


Reading large files into memory simply to split them, while easy, will never be the most efficient method, and you will run into memory limits somewhere.

This is even more apparent here because Get-Content works on strings — and, as you mention in the comments, you are dealing with binary files.

.NET (and, therefore, PowerShell) stores all strings in memory as UTF-16 code units. This means each code unit takes up 2 bytes in memory.

It happens that a single .NET string can only store (2^31 - 1) code units, since the length of a string is tracked by an Int32 (even on 64-bit versions). Multiply that by 2, and a single .NET string can (theoretically) use about 4 GB.
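
As a rough back-of-the-envelope check (assuming 2 bytes per code unit and ignoring object overhead):

# Rough upper bound on the data a single .NET string can hold.
# Assumption: 2 bytes per UTF-16 code unit; object/header overhead ignored.
$maxChars = [int]::MaxValue        # 2,147,483,647 code units (a string's length is an Int32)
$maxBytes = 2 * [long]$maxChars    # roughly 4 GB of UTF-16 data
'{0:N0} chars -> ~{1:N2} GB' -f $maxChars, ($maxBytes / 1GB)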

Get-Content will store every line in its own string. If you have a single line with > 2 billion characters... that's likely why you're getting that error despite having "enough" memory.

Alternatively, it could be because there's a limit of 2 GB for any given object unless larger sizes are explicitly enabled (are they for PowerShell?). Your 4 GB OOM could also be because there's two copies/buffers kept around as Get-Content tries to find a line break to split on.

The solution, of course, is to work with bytes and not characters (strings).


If you want to avoid third-party programs, the best way to do this is to drop down to the .NET methods. This is most easily done with a full language like C# (which can be embedded into PowerShell), but it is possible to do purely in PS.
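
Purely to illustrate the embedding mechanism (and jumping ahead a little to the stream-based idea described below), a minimal, untested sketch of a C# helper compiled on the fly with Add-Type could look like this; the SplitHelper/CopyChunk names are made up for this example:

# Sketch only: embed a small C# helper in PowerShell via Add-Type.
# The type and method names (SplitHelper, CopyChunk) are placeholders, not an existing API.
Add-Type -TypeDefinition @"
public static class SplitHelper
{
    // Copies up to bytesToCopy bytes from source to destination using the supplied buffer.
    public static long CopyChunk(System.IO.Stream source, System.IO.Stream destination, long bytesToCopy, byte[] buffer)
    {
        long copied = 0;
        while (copied < bytesToCopy)
        {
            int read = source.Read(buffer, 0, (int)System.Math.Min(buffer.Length, bytesToCopy - copied));
            if (read == 0) { break; }   // end of the input reached
            destination.Write(buffer, 0, read);
            copied += read;
        }
        return copied;
    }
}
"@
# Hypothetical usage: [SplitHelper]::CopyChunk($inStream, $outStream, 100MB, (New-Object byte[] 4096))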

The idea is you want to work with byte arrays, not text streams. There are two ways to do this:

  • Use [System.IO.File]::ReadAllBytes and [System.IO.File]::WriteAllBytes. This is pretty easy, and better than strings (no conversion, no 2x memory usage), but will still run into issues with very large files - say you wanted to process 100 GB files? (A rough sketch of this approach follows this list.)

  • Use file streams and read/write in smaller chunks. This requires a fair bit more maths since you need to keep track of your position, but you avoid reading the entire file into memory in one go. This will likely be the fastest approach: allocating very large objects will probably outweigh the overhead of multiple reads.
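
Here is that rough sketch of the ReadAllBytes approach (assuming the whole file fits in memory; the file name in.bin and the output naming are placeholders):

# Sketch of the ReadAllBytes / WriteAllBytes approach.
# Only viable while the entire file fits comfortably in memory; "in.bin" is a placeholder name.
[Environment]::CurrentDirectory = Get-Location
$data      = [System.IO.File]::ReadAllBytes("in.bin")
$splitSize = 100MB
$part      = 0
for ($offset = 0; $offset -lt $data.Length; $offset += $splitSize) {
    $part++
    $chunkLength = [Math]::Min($splitSize, $data.Length - $offset)
    $chunk = New-Object byte[] $chunkLength
    [Array]::Copy($data, $offset, $chunk, 0, $chunkLength)
    [System.IO.File]::WriteAllBytes("in.bin_$part", $chunk)
}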

With the stream-based option, you read chunks of a reasonable size (these days, 4 kB at a time is a sensible minimum) and copy them to the output file one chunk at a time, rather than reading the entire file into memory and splitting it. You may wish to tune the size upwards, e.g. 8 kB, 16 kB, 32 kB, etc., if you need to squeeze every last drop of performance out - but you'd need to benchmark to find the optimum size, as some larger sizes are slower.
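
If you do want to benchmark, a crude sketch might look like the following (the OS file cache will skew repeated runs over the same file, so treat the numbers as indicative only; test_1GB.txt is the file from the question):

# Crude benchmark sketch: time a plain read pass over the input at different buffer sizes.
[Environment]::CurrentDirectory = Get-Location
foreach ($size in 4KB, 8KB, 16KB, 32KB, 64KB) {
    $buffer = New-Object byte[] $size
    $stream = [System.IO.File]::OpenRead("test_1GB.txt")
    $elapsed = Measure-Command {
        while ($stream.Read($buffer, 0, $buffer.Length)) { }   # read to the end of the file, discarding the data
    }
    $stream.Dispose()
    "{0,7:N0}-byte buffer: {1:N2} s" -f $size, $elapsed.TotalSeconds
}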

An example script follows. For reusability it should be converted into a cmdlet or at least a PS function, but this is enough to serve as a working example.

$fileName = "foo"
$splitSize = 100MB

# need to sync .NET CurrentDirectory with PowerShell CurrentDirectory
# https://stackoverflow.com/questions/18862716/current-directory-from-a-dll-invoked-from-powershell-wrong
[Environment]::CurrentDirectory = Get-Location
# 4k is a fairly typical and 'safe' chunk size
# partial chunks are handled below
$bytes = New-Object byte[] 4096

$inFile = [System.IO.File]::OpenRead($fileName)

# track which output file we're up to
$fileCount = 0

# better to use functions but a flag is easier in a simple script
$finished = $false

while (!$finished) {
    $fileCount++
    $bytesToRead = $splitSize

    # Just like File::OpenWrite except CreateNew instead to prevent overwriting existing files
    $outFile = New-Object System.IO.FileStream "${fileName}_$fileCount",CreateNew,Write,None

    while ($bytesToRead) {
        # read up to 4k at a time, but no more than the remaining bytes in this split
        $bytesRead = $inFile.Read($bytes, 0, [Math]::Min($bytes.Length, $bytesToRead))

        # 0 bytes read means we've reached the end of the input file
        if (!$bytesRead) {
            $finished = $true
            break
        }

        $bytesToRead -= $bytesRead

        $outFile.Write($bytes, 0, $bytesRead)
    }

    # dispose closes the stream and releases locks
    $outFile.Dispose()
}

$inFile.Dispose()
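
If you would rather have the part counter before the file extension (e.g. foo_1.txt instead of foo.txt_1), one possible tweak (untested; $base and $ext are introduced here purely for illustration) is to build the output name from the base name and extension and use it in place of ${fileName}_$fileCount above:

# Optional naming tweak (sketch): put the part counter before the extension instead of after it.
$base = [System.IO.Path]::GetFileNameWithoutExtension($fileName)
$ext  = [System.IO.Path]::GetExtension($fileName)    # includes the leading dot, e.g. ".txt"
$outFile = New-Object System.IO.FileStream "${base}_$fileCount$ext",CreateNew,Write,None
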
  • Thank you! I had to adjust the output a bit because the $fileCount was added behind the extension; I ended up with _temp$fileCount.txt. Also, thanks for all the comments, but I don't even understand half of it... will definitely have to look into that, started PowerShell / C# not even a week ago... ._. Commented Feb 15, 2018 at 2:04
  • @FatalBulletHit If you're interested, here's a quick'n'dirty (untested, vague) adaptation to C# so you can see what the difference is: gist.github.com/BobVul/72c3c3947bcb4982931ff5ff394474c4. Basically, while PowerShell is definitely powerful and very extensible (by giving you access to the entirety of .NET), some more-complex things are just a little bit cleaner in C#.
    – Bob
    Commented Feb 15, 2018 at 2:48

