I want to Get-Content a large (1 GB - 10 GB) .txt file (which has only 1 line!) and split it into multiple files with multiple lines, but whenever I try, I end up with a System.OutOfMemoryException.
Of course I searched for a solution, but all the solutions I found read the file line by line, which is kinda hard to do when the file only has 1 line.
Although PowerShell takes up to 4 GB of RAM when loading a 1 GB file, the issue is not connected to my RAM, as I have a total of 16 GB and even with a game running in the background the peak usage is at around 60%. I'm using Windows 10 with PowerShell 5.1 (64-bit), and my MaxMemoryPerShellMB is set to the default value of 2147483647.
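(In case someone wants to verify that value on their own machine: it can be read from the WSMan: drive, and as far as I know it only applies to remote sessions, not a local console, so it shouldn't be the limiting factor here anyway.)

# Reads the WinRM shell quota; the WSMan: drive requires the WinRM service to be running.
Get-Item WSMan:\localhost\Shell\MaxMemoryPerShellMB | Select-Object Name, Value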
This is the script I wrote and am using; it works fine with a file size of e.g. 100 MB:
$source = "C:\Users\Env:USERNAME\Desktop\Test\"
$input = "test_1GB.txt"
$temp_dir = "_temp"
# 104'857'600 bytes (or characters) are exactly 100 MB, so a 1 GB file has exactly
# 10 temporary files, which have all the same size, and amount of lines and line lenghts.
$out_size = 104857600
# A line length of somewhere around 18'000 characters seems to be the sweet spot, however
# the line length needs to be dividable by 4 and at best fit exactly n times into the
# temporary file, so I use 16'384 bytes (or characters) which is exactly 16 KB.
$line_length = 16384
$file = (gc $input)
$in_size = (gc $input | measure -character | select -expand characters)
if (!(test-path $source$temp_dir)) {ni -type directory -path "$source$temp_dir" >$null 2>&1}
$n = 1
$i = 0
if ($out_size -eq $in_size) {
$file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\_temp_0001.txt" -encoding ascii
} else {
while ($i -le ($in_size - $out_size)) {
$new_file = $file.substring($i,$out_size)
if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
$temp_name = "_temp_$count.txt"
$i += $out_size
$n += 1
$new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
}
if ($i -ne $in_size) {
$new_file = $file.substring($i,($in_size-$i))
if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
$temp_name = "_temp_$count.txt"
$new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
}
}
If there is an easier solution which doesn't use Get-Content, I'll gladly take it, too. It really doesn't matter too much how I achieve the result, as long as it is possible on every up-to-date Windows machine with no extra software. If this, however, should not be possible, I would also consider other solutions.
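To make the "no Get-Content" idea concrete, here is a minimal sketch of the direction I have in mind (untested at the 10 GB scale; $in_path and $out_dir are placeholder names, not part of my script above): read the file in fixed-size character chunks through a System.IO.StreamReader, so the whole line never has to fit into one string.

# Sketch: split a single-line ASCII file by reading fixed-size character chunks,
# so at most one chunk (not the whole file) is in memory at a time.
$in_path     = "C:\Users\$Env:USERNAME\Desktop\Test\test_1GB.txt"
$out_dir     = "C:\Users\$Env:USERNAME\Desktop\Test\_temp"
$chunk_size  = 104857600   # 100 MB per output file; must be a multiple of $line_length
$line_length = 16384       # insert a CRLF after every 16 KB

if (!(Test-Path $out_dir)) { New-Item -Type Directory -Path $out_dir > $null }

$reader = [System.IO.StreamReader]::new($in_path, [System.Text.Encoding]::ASCII)
$buffer = [char[]]::new($chunk_size)   # reused for every chunk, so memory stays flat
$n = 1
try {
    # ReadBlock fills the buffer with up to $chunk_size characters and returns
    # how many it actually read (less than $chunk_size only on the last chunk).
    while (($read = $reader.ReadBlock($buffer, 0, $chunk_size)) -gt 0) {
        $chunk = [string]::new($buffer, 0, $read) -replace ".{$line_length}", "$&`r`n"
        $name  = "_temp_{0:d4}.txt" -f $n
        [System.IO.File]::WriteAllText((Join-Path $out_dir $name), $chunk, [System.Text.Encoding]::ASCII)
        $n += 1
    }
} finally {
    $reader.Dispose()
}

Since the buffer is reused, peak usage should stay around a single chunk plus its CRLF-expanded copy, instead of several whole-file copies.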
The only hard requirement is that the resulting lines are n bytes long, where n can be divided by 4. As for why the exception happens in the first place: Get-Content will attempt to read each line into its own string, so a single line must not exceed the maximum length of a string. It happens that the maximum length of a single .NET string is (2^31 - 1) characters, leading to ~4 GB of memory used (2x, because .NET strings are UTF-16). But it turns out there is also a 2 GB single-object limit, so it's equally possible that there are simply two copies of this string going around.
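That back-of-the-envelope number can be checked directly in the console, since PowerShell understands the 1GB suffix natively:

# (2^31 - 1) chars at 2 bytes per UTF-16 char, expressed in GB:
([long][int]::MaxValue * 2) / 1GB    # prints roughly 4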