Extract Excel data using Interop.Excel from C#

Question

Code Objective
I need to create a System.Data.DataSet object from an Excel workbook. Each DataTable within the DataSet must correspond to a nonempty worksheet in the workbook. The top row of each sheet will be used as column names for each DataTable and the columns will be populated as the string representation of the worksheet data. The string part here is very important--everything in the final DataSet must be visually identical to what is in the Excel file with no assumptions about data types.

The Issue
My code is horrendously slow. The workbook I'm using to test my code has one worksheet which uses the Excel columns A through CO and uses rows 1 through 11361, so I don't expect it to be too blazing fast, but I finally stopped it after 20 minutes of letting it run. That's really slow, even for a big workbook.

My Goal Here
I would love your help in determining the cause of the slowness, or perhaps if there's an altogether more efficient way of going about this. My instinct was to post this to Stack Overflow, but I figured this would be a better place because the code actually works and does what I want it to do, it just does it very slowly.

The Code
I used Dylan Morley's code here for creating an Excel "wrapper" which aids with orphaned Excel processes and the release of COM objects.

namespace Excel.Helpers
{
    internal class ExcelWrapper : IDisposable
    {
        private class Window
        {
            [DllImport("user32.dll", SetLastError = true)]
            static extern IntPtr FindWindow(string lpClassName, string lpWindowName);

            [DllImport("user32.dll")]
            private static extern IntPtr GetWindowThreadProcessId(IntPtr hWnd,
                out IntPtr ProcessID);

            public static IntPtr GetWindowThreadProcessId(IntPtr hWnd)
            {
                IntPtr processID;
                IntPtr returnResult = GetWindowThreadProcessId(hWnd, out processID);
                return processID;
            }

            public static IntPtr FindExcel(string caption)
            {
                IntPtr hWnd = FindWindow("XLMAIN", caption);
                return hWnd;
            }
        }

        private Application excel;
        private IntPtr windowHandle;
        private IntPtr processID;
        private const string ExcelWindowCaption = "MyUniqueExcelCaption";

        public ExcelWrapper()
        {
            excel = CreateExcelApplication();
            windowHandle = Window.FindExcel(ExcelWindowCaption);
            processID = Window.GetWindowThreadProcessId(windowHandle);
        }

        private Application CreateExcelApplication()
        {
            Application excel = new Application();
            excel.Caption = ExcelWindowCaption;
            excel.Visible = false;
            excel.DisplayAlerts = false;
            excel.AlertBeforeOverwriting = false;
            excel.AskToUpdateLinks = false;
            return excel;
        }

        public Application Excel
        {
            get { return this.excel; }
        }

        public int ProcessID
        {
            get { return this.processID.ToInt32(); }
        }

        public int WindowHandle
        {
            get { return this.windowHandle.ToInt32(); }
        }

        public void Dispose()
        {
            if (excel != null)
            {
                excel.Workbooks.Close();
                excel.Quit();
                Marshal.ReleaseComObject(excel);
                excel = null;
                GC.Collect();
                GC.WaitForPendingFinalizers();
                GC.Collect();
                GC.WaitForPendingFinalizers();
                try
                {
                    Process process = Process.GetProcessById(this.ProcessID);

                    if (process != null)
                    {
                        process.Kill();
                    }
                }
                catch
                {
                    throw;
                }
            }
        }
    }
}

Now I make a class that processes the Excel files using Microsoft.Office.Interop.Excel operations.

namespace FileProcessing
{
    class FileToProcess
    {
        public string File { get; set; }

        public DataSet GetExcelData()
        {
            using (var wrapper = new Excel.Helpers.ExcelWrapper())
            {
                Application oXL = wrapper.Excel;

                Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);

                try
                {
                    DataSet excelData = new DataSet();

                    foreach (Worksheet oSheet in oWB.Worksheets)
                    {
                        int nColumns = oSheet.UsedRange.Columns.Count;
                        int nRows = oSheet.UsedRange.Rows.Count;

                        if (nColumns > 0)
                        {
                            System.Data.DataTable sheetData = excelData.Tables.Add(oSheet.Name);

                            Range rHeaders = (Range)oSheet.UsedRange.Rows[1];

                            for (int j = 1; j <= nColumns; j++)
                            {
                                string columnName = rHeaders.Columns[j].Text.ToString();

                                if (string.IsNullOrEmpty(columnName))
                                    continue;
                                else if (sheetData.Columns.Contains(columnName))
                                {
                                    int i = 1;
                                    string c;

                                    do
                                    {
                                        c = columnName + i.ToString();
                                        i += 1;
                                    }
                                    while (sheetData.Columns.Contains(c));

                                    sheetData.Columns.Add(c, typeof(string));
                                }
                                else
                                    sheetData.Columns.Add(columnName, typeof(string));
                            }

                            for (int i = 2; i <= nRows; i++)
                            {
                                DataRow sheetRow = sheetData.NewRow();

                                for (int j = 1; j <= nColumns; j++)
                                {
                                    string columnName = rHeaders.Columns[j].Text.ToString();

                                    Range oRange = (Range)oSheet.Cells[i, j];

                                    if (!string.IsNullOrEmpty(columnName))
                                        sheetRow[columnName] = oRange.Text.ToString();
                                }

                                sheetData.Rows.Add(sheetRow);
                            }
                        }
                    }

                    return excelData;
                }
                catch (Exception)
                {
                    throw;
                }
                finally
                {
                    oWB.Close();
                    oWB = null;
                    oXL.Quit();
                    oXL = null;
                    wrapper.Dispose();
                }
            }
        }
    }

Finally I try to use the class. This is just a testing example where I record the amount of time the code takes to read the file and construct the dataset.

    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch sw = Stopwatch.StartNew();

            FileToProcess testXLFile = new FileToProcess();
            testXLFile.File = @"C:\BigWorkbook.xls";

            DataSet ds = testXLFile.GetExcelData();

            sw.Stop();

            TimeSpan ts = sw.Elapsed;

            Console.WriteLine("Elapsed Time: {0}ms", ts.Milliseconds);

            Console.ReadKey();
        }
    }
}

Closing Remarks
My thought here is that reading each cell one at a time is taking forever on huge worksheets. I don't know how else to ensure that each value is properly captured as a string though. I don't want to ever assume that a particular column or cell has a particular data type to avoid conversion issues and preserve data integrity between the Excel file and the DataSet. I saw user3488442's question here but I'm not convinced that I could modify it to suit my needs without running into conversion issues. Conversion between dates (which import as DateTime objects in C#) and strings doesn't always preserve the display format in the worksheet. I also saw this question but it uses a VBA macro (and I don't speak German). Any input would be much appreciated.

As a short info, I am the OP of the second question you mention. FYI what I use is an Excel Macro (VBA). That would require you to recode your whole program, as vba != vb (!) != c#.... The Macro is executed inside the Excel application itself, and thus has some overhead less, that comes from using COM / interop. You might consider switching to a macro after refactoring your current approach — Vogel612, Commented May 7, 2014 at 17:05
@Vogel612: Thanks for your input. Unfortunately I can't switch to a macro because I will need to apply this approach to a bunch of files. My actual program (which was edited here to include only Excel) includes handling for files other than Excel workbooks as well. — Alex A., Commented May 7, 2014 at 17:10
What happens if you change your Range oRange = (Range)oSheet.Cells[i, j]; call to the outside loop, selecting multiple rows and all columns and loop through that? (A guess that if that is slow, it may have a fixed overhead) — major-mann, Commented May 7, 2014 at 17:46

Mathieu Guindon · Accepted Answer · 2014-05-07 17:46:32Z

The workbook I'm using to test my code has one worksheet which uses the Excel columns A through CO and uses rows 1 through 11361, so I don't expect it to be too blazing fast [...]

Cell CO11361, in R1C1 notation, is R11361C93. You're reading the value of 1,056,573 cells, through COM interop... just to return a DataSet.

Excel Interop wasn't designed for this.

Since you're returning a DataSet, it looks like a safe assumption to make, that the workbook data has the shape of a data table.

I have this code that I've been using on some Excel 97-2003 files, I'm sure it can be tweaked to work with newer versions - basically you write an implementation for this interface:

/// <summary>
/// An interface exposing functionality required to convert a Microsoft Excel workbook into a <see cref="DataTable"/> object using an <see cref="OleDbConnection"/>.
/// </summary>
public interface IExcelOleDbReader
{
    /// <summary>
    /// Converts specified Microsoft Excel workbook into a <see cref="DataTable"/> object.
    /// </summary>
    /// <param name="fileName">The name of the file to import.</param>
    /// <returns></returns>
    DataTable ImportExcelFile(string fileName);
}

Here's the Excel 8.0 implementation I have:

public class Excel8OleDbReader : IExcelOleDbReader
{
    /// <summary>
    /// Converts specified Microsoft Excel 97-2003 (8.0) workbook into a <see cref="DataTable"/> object.
    /// </summary>
    /// <param name="fileName">Full path and file name of Microsoft Excel 97-2003 (8.0) workbook to connect to.</param>
    /// <returns>Returns a <see cref="DataTable"/> with the contents of the active worksheet.</returns>
    public DataTable ImportExcelFile(string fileName)
    {
        if (string.IsNullOrEmpty(fileName)) return null;
        if (!File.Exists(fileName)) return null;

        var result = new DataTable();
        using (var connection = GetWorkbookConnection(fileName))
        {
            connection.Open();
            var adapter = new OleDbDataAdapter("select * from A1:BA60000", connection);
            adapter.Fill(result);
            connection.Close();
        }

        return result;
    }

    private OleDbConnection GetWorkbookConnection(string fileName)
    {
        var connectionString = @"Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" + fileName + @";Extended Properties=""Excel 8.0;HDR=YES;""";
        return (fileName != string.Empty) 
                ? new OleDbConnection(connectionString)
                : null;
    }
}

OLEDB should be much faster than COM interop, but it comes with its own problems - if you have two data types in a single column, you might need to do some gymnastics with your worksheet data to get it to work.

You might want to search/ask Stack Overflow about how this might work when you have more than just the active worksheet to read from.

Thank you very much for your answer. My original thought had been to go with OLEDB and use IMEX=1 in my connection string to account for mixed data types, but I had an issue with dates importing as DateTime objects and I wasn't able to capture the exact display format used in the worksheet. It's critical that I a.) don't edit the original workbooks in any way, and b.) capture the exact display from the workbook as a string. — Alex A., Commented May 7, 2014 at 17:53
Handling multiple worksheets in a workbook is no problem. You loop through the entries in the OLEDB schema and process each worksheet individually by name. — Alex A., Commented May 7, 2014 at 17:55
VBA is "Excel-native", C# is managed code "talking" to COM, "talking" to Excel, there's LOTS of overhead with Excel interop, so yes, looping through 1M cells in VBA is much faster than doing the exact same thing in C# ;) — Mathieu Guindon, Commented May 7, 2014 at 17:58
I figured out how to do it successfully in OLEDB, hence the acceptance of the answer that recommended OLEDB. I really appreciate all of your fantastic input, Mat (and Mug). — Alex A., Commented May 8, 2014 at 18:48
I basically adapted my approach from one I use for Access files, so as a disclaimer my OLEDB method is pretty dissimilar from yours. Mine constructs the DataSet from the 1,056,573 cell workbook in 432ms. I'd say that's an improvement on 20 minutes. :) — Alex A., Commented May 8, 2014 at 18:52

Community · Accepted Answer · 2020-06-10 13:24:26Z

My other answer proposed a radically different approach. This is more of a code review.

Naming

The naming isn't following the typical C# naming conventions:

    Application oXL = wrapper.Excel;

    Workbook oWB = oXL.Workbooks.Open(this.File, 0, true);

Would be more readable like this:

     Application app = wrapper.Excel; 
     Workbook workbook = app.Workbooks.Open(this.File, 0, true);

Or, with type inference / implicit typing:

     var app = wrapper.Excel; 
     var workbook = app.Workbooks.Open(this.File, 0, true);

Avoid arbitrary prefixes:

    int nColumns = oSheet.UsedRange.Columns.Count;
    int nRows = oSheet.UsedRange.Rows.Count;

n means what? number? it's redundant. Just call a cat, a cat:

    var columns = worksheet.UsedRange.Columns.Count;
    var rows = worksheet.UsedRange.Rows.Count;

Same here, the o prefix isn't buying you anything, and only makes the pronounciation of that variable's name sound funny:

    int oRange = (Range)oSheet.Cells[i, j];

Use meaningful names:

    public string File { get; set; }

File, in .net, usually refers to the System.IO.File class. Given that it's a string, I'd suggest the name FileName for that property instead.

Readability

This:

i += 1;

Would more explicitly read as an increment if written like this:

i++;

Your main loop (inside the try block) is doing 3 things:

Create a new DataTable for each worksheet.
Create a new DataColumn for each column on each worksheet.
Create a new DataRow for each row on each worksheet, and populate each column.

I'd extract 3 methods out of the main loop, to make it easier to parse:

var result = new DataSet();
try
{
    foreach (var worksheet in workbook.Worksheets)
    {
        DataTable worksheetData;
        If (ReadWorksheetData(worksheet, out worksheetData))
        {
            result.Tables.Add(worksheetData);
        }
    }
}
finally
{
    workbook.Close();
    app.Quit();
}

I dropped the redundant catch clause, and removed the call to wrapper.Dispose(), since the wrapper instance is created in a using scope, which already ensures its proper disposal - you're basically calling Dispose twice on it. Also nulling references that you're not going to reuse, doesn't buy you anything really, so I removed them too.

The ReadWorksheetData method would look something like this:

private bool ReadWorksheetData(Worksheet worksheet, out DataTable worksheetData)
{
    var result = false;

    var columns = worksheet.UsedRange.Columns.Count;        
    if (columns > 0)
    {
        var data = new DataTable(worksheet.Name);
        worksheetData = data;
        AddWorksheetColumns(worksheet, worksheetData, columns);
        AddWorksheetRows(worksheet, worksheetData); // temporal coupling here...

        result = true;
    }
    else
    {
        worksheetData = null;
    }

    return result;
}

This isn't perfect, AddWorksheetColumns must run before AddWorksheetRows and that creates temporal coupling between the two methods, but you get the idea - you break the main loop into multiple, smaller methods that do as little as possible.

Would parallelization help?

As a possible performance helper, the main loop could then be parallelized:

    Parallel.ForEach(workbook.Worksheets, worksheet =>
        {
            DataTable worksheetData;
            If (ReadWorksheetData(worksheet, out worksheetData))
            {
                result.Tables.Add(worksheetData);
            }
        });

So you'd possibly have multiple worksheets being processed at the same time, depending on number of available CPU cores and some other factors. The overall workload remains the same though.

Also it's useful to note that until I added the wrapper.Dispose() I was getting orphaned Excel processes on exceptions. I don't when I include it though. — Alex A., Commented May 7, 2014 at 19:48
That's an interesting observation... and mug says thanks! :) — Mathieu Guindon, Commented May 7, 2014 at 19:50

Stack Exchange Network

Extract Excel data using Interop.Excel from C#

2 Answers 2

Naming

Readability

Would parallelization help?

Not the answer you're looking for? Browse other questions tagged
c#
performance
excel
or ask your own question.

Linked

Hot Network Questions

Extract Excel data using Interop.Excel from C#

2 Answers 2

Naming

Readability

Would parallelization help?

Not the answer you're looking for? Browse other questions tagged c#performanceexcel or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
c#
performance
excel
or ask your own question.