
I am writing a function in C# to split a large delimited file into smaller delimited files. I am writing this function because a 2.7 GB file was taking hours to ETL and was causing a bottleneck in the ETL batch. I assign a maximum number of lines per file and start a new file once that limit is reached. If the file has a header, I write the header to each file.

Initially, I tried using StreamReader and StreamWriter as recommended in many C# posts, as they are supposed to handle large files by reading one line at a time and not storing everything in memory. Additionally, I explored using File.ReadLines and File.WriteAllLines, as they are also recommended as memory-efficient options.

Both of these approaches work on smaller files; however, when processing the 2.7 GB file, which has 1,238,951 rows, both methods failed with out-of-memory exceptions. Surprisingly, they managed to process 1,238,950 rows before throwing the exception, leaving me just one line short of completion. My PC has 32 GB of RAM and is nowhere near pegged out on memory. What's going on here? Is there something I am missing?

internal void SplitFileWithStream(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        string? header = null;
        using (var sr = new StreamReader(localFilePath))
        {
            while (!sr.EndOfStream)
            {
                var lineNumber = 0;
                var newFileName = $"{fileNameWithoutExtension}__split_{(++fileSuffix)}{extension}";
                var newFilePath = Path.Combine(targetDirectory, newFileName);
                if (File.Exists(newFilePath))
                    File.Delete(newFilePath);
                using (var sw = new StreamWriter(newFilePath))
                {
                    if (!sr.EndOfStream && hasHeader)
                    {
                        if (header == null)
                        {
                            header = sr.ReadLine();
                            maxLinesPerFile++; //add one to max for header
                        }
                        sw.WriteLine(header);
                        lineNumber++;
                    }
                    while (!sr.EndOfStream && lineNumber < maxLinesPerFile)
                    {
                        sw.WriteLine(sr.ReadLine());
                        lineNumber++;
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at MyFileUtilities.FileSplitter.SplitFileWithStream(String localFilePath, Boolean hasHeader) in C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs:line 341
internal void SplitFileWithReadLines(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var fileName = Path.GetFileName(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        long position = 0;
        string? header = null;
        int i = 0;
        int skip = 0;
        string workingFilePath = localFilePath;
        while (true)
        {
            int take = maxLinesPerFile;
            if (hasHeader && header == null)
            {
                header = File.ReadLines(localFilePath).Take(1).First();
                skip = 1;
            }
            var linesToSplice = File.ReadLines(workingFilePath).Skip(skip).Take(take);
            if (!linesToSplice.Any())
                return;
            linesToSplice.Append(Environment.NewLine); // added just to see if it would write the actual last line; it did not work (Append returns a new sequence and does not modify linesToSplice)
            var newFileName = $"{fileNameWithoutExtension}__split_{(++i)}{extension}";
            var newFilePath = Path.Combine(targetDirectory, newFileName);
            File.WriteAllLines(newFilePath, linesToSplice);
            skip += take;
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at System.IO.ReadLinesIterator.MoveNext()
   at System.Linq.Enumerable.EnumerablePartition`1.MoveNext()
   at System.IO.File.InternalWriteAllLines(TextWriter writer, IEnumerable`1 contents)
   at System.IO.File.WriteAllLines(String path, IEnumerable`1 contents)
   at MyFileUtilities.FileSplitter.SplitFileWithReadLines(String localFilePath, Boolean hasHeader) in C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs\FileSplitter.cs:line 402

I realize there are Linux commands and other tools that can do this, but if it's possible, I'd like a C# solution so I can add it to a C# utility library I have.

Edit: I tried File.ReadLines(workingFilePath).Skip(skip).Take(take).ToList(); and got another OutOfMemoryException.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at System.IO.ReadLinesIterator.MoveNext()
   at System.Linq.Enumerable.EnumerablePartition`1.ToList()
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at MyFileUtilities.FileSplitter.SplitFileWithReadLines(String localFilePath, Boolean hasHeader) in ...
  • How big is the last line?
    Commented Jul 31, 2023 at 16:31
  • The longest line in the file is 965 characters and the average is about 918 characters.
    – zBomb
    Commented Jul 31, 2023 at 16:37
  • File.ReadLines(workingFilePath).Skip(skip).Take(take) - in general, avoid doing that; re-reading the file from the start on every iteration is effectively O(n^2).
    – Guru Stron
    Commented Jul 31, 2023 at 16:47
  • Can you upload the file somewhere, or is it private data? In general I don't see anything wrong with the solution (especially the first one). Maybe the last line of the file is corrupted, or we are missing something outside the provided code. Have you checked the memory consumption; is there anything suspicious? Have you tried running it on another machine? 2.7 GB is not that much data; it should easily fit in memory on a 64-bit process. In theory, OOM does not mean there is not enough memory, only that it is fragmented (the LOH can be a factor here too).
    – Guru Stron
    Commented Jul 31, 2023 at 16:59
  • It could indeed be a fragmentation problem: since the average line is 918 characters and the default buffer size is 1024, ReadLine often has to create a StringBuilder to concatenate the halves. See whether increasing the buffer size on the StreamReader helps, e.g. new StreamReader(filepath, Encoding.UTF8, true, 128 * 1024).
    Commented Jul 31, 2023 at 19:48
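The buffer-size suggestion in the last comment can be sketched as follows. This is a guess at a mitigation, not a confirmed fix; the demo file stands in for the 2.7 GB file, and 128 KB is an arbitrary larger buffer choice.

```csharp
using System;
using System.IO;
using System.Text;

// Demo input (stand-in for the large file in the question).
var localFilePath = Path.Combine(Path.GetTempPath(), "buffer_demo.txt");
File.WriteAllLines(localFilePath, new[] { "row1", "row2", "row3" });

// The default StreamReader buffer is 1 KB; with ~918-character lines, almost
// every line straddles a buffer boundary, forcing a StringBuilder round trip.
using var sr = new StreamReader(
    localFilePath,
    Encoding.UTF8,
    detectEncodingFromByteOrderMarks: true,
    bufferSize: 128 * 1024); // larger buffer; 128 KB is an arbitrary choice

int count = 0;
while (sr.ReadLine() is { } line)
    count++;

Console.WriteLine(count); // prints 3
```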

2 Answers


File.ReadLines returns a lazily evaluated IEnumerable<string>.

Try

File.ReadLines(workingFilePath).Skip(skip).Take(take).ToList();

to bring the memory pressure down. From the call stack you posted, it seems evident that the WriteAllLines call is the first statement that starts enumerating the IEnumerable, ultimately looping through everything, which is exactly what you wanted to avoid in the first place.
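A minimal sketch of the deferred execution being described (the file name is illustrative):

```csharp
using System;
using System.IO;
using System.Linq;

File.WriteAllLines("demo.txt", new[] { "header", "a", "b", "c" });

// Nothing is read from disk here; File.ReadLines is lazy.
var query = File.ReadLines("demo.txt").Skip(1).Take(2);

// Enumeration (and therefore the actual file I/O) happens only now:
var lines = query.ToList();
Console.WriteLine(string.Join(",", lines)); // prints "a,b"
```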

  • This Skip().Take() / skip += take pattern is an example of Shlemiel the Painter's algorithm, horrible for large inputs.
    – Ben Voigt
    Commented Jul 31, 2023 at 16:41
  • I tried that and got another OutOfMemoryException. I edited the original post with the details.
    – zBomb
    Commented Jul 31, 2023 at 16:48

I have concluded that the last line of the file was corrupt. I was unable to open the file in most text editors, and the few that could open it became erratic or unresponsive. However, I downloaded a trial version of UltraEdit, which was very fast and smooth. UltraEdit opened the file easily and revealed what I believe are the hidden/corrupt characters at the end of the last line. They are rendered as little squares that other text editors would not display: 1,779,974,601 little squares, to be exact! I recommend UltraEdit for effectively searching out corrupt characters in a file.
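Given that diagnosis, a single "line" of roughly 1.7 billion characters, any ReadLine-based approach will try to materialize the whole line in one string and fail. A byte-oriented splitter that copies data between newline positions, and never builds a line string at all, sidesteps the problem entirely. The following is only a sketch, not the poster's code: the method name and the 128 KB buffer are illustrative, header duplication is omitted for brevity, and it assumes '\n' line endings (with '\r\n' the '\r' simply travels along with its line).

```csharp
using System;
using System.IO;

static void SplitByBytes(string path, int maxLinesPerFile)
{
    var dir = Path.GetDirectoryName(path) ?? ".";
    var stem = Path.GetFileNameWithoutExtension(path);
    var ext = Path.GetExtension(path);

    var buffer = new byte[128 * 1024];
    int suffix = 0, linesInPart = 0;
    FileStream? part = null;

    using var input = File.OpenRead(path);
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        int start = 0;
        for (int i = 0; i < read; i++)
        {
            if (buffer[i] != (byte)'\n') continue;
            part ??= File.Create(Path.Combine(dir, $"{stem}__split_{++suffix}{ext}"));
            part.Write(buffer, start, i - start + 1); // up to and including '\n'
            start = i + 1;
            if (++linesInPart == maxLinesPerFile)
            {
                part.Dispose(); // part is full; next newline opens a new file
                part = null;
                linesInPart = 0;
            }
        }
        if (start < read) // tail of a line that continues into the next buffer
        {
            part ??= File.Create(Path.Combine(dir, $"{stem}__split_{++suffix}{ext}"));
            part.Write(buffer, start, read - start);
        }
    }
    part?.Dispose(); // closes the last part (covers a missing final newline)
}
```

Because no string is ever built, a corrupt multi-gigabyte final "line" is just streamed byte-for-byte into the last part instead of exhausting memory.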

(Screenshot: as seen in UltraEdit)
