9
\$\begingroup\$

Given a folder typically containing ~250,000 small (20-100kb) files in HTML format (they open as single-sheet workbooks in Excel) and a list of ~ 1 million filenames, I need to open all the files that match the list and Do Stuff.

Overview of Main code Loop: "Test 1 Start" to "Test 1 End"
Once we get to the main file loop, I have a 1-D array arrCompanyNumbers with approx. 1 Million 8-digit company numbers (extracted from a 2-d array arrFilteredAddresses where the company number is only one of about 12 columns).

There is a folder with a couple hundred thousand files in it named like this.

"Prod224_0005_00040751_20131231"
"Prod224_0005_00040789_20130930"
.......

That number in the middle is the company number. The number on the end is the file date.

The macro loops through the folder using strFilename = dir. For each file, it extracts the company number and tries to match it against the array. If it does, it calls Check_File(). If it doesn't it goes to the next file and repeats.

Overview of Check_File()

Each file opens as a one-sheet workbook with a corporate accounts filing in it. I want to find the Cash, Assets and Profits in the last year. I have 3 collections of phrases that correspond to those values e.g. "Cash at Bank:".

The macro searches the first 200 rows of the first 2 columns for those phrases. Then searches 10 cells across the row and returns the first number it finds.

Once it has Cash, Assets and Profits (or failed to find them), it filters them against set criteria. If they pass, it copies the results to a second worksheet in the main workbook (which is eventually just one long list of companies, file dates and cash/assets/profits) and closes the file.

Optimisation Parameters:

I've already optimised it as far as I think I can e.g. Running speed tests on using vlookup instead of iterative searching. Even after all that, it will typically run for 6-24 hours to filter an entire month of data, and comes dangerously close to running out of memory.

I would like it to run an order of magnitude faster and with a noticeable reduction in memory usage. Any suggestions on how to achieve that would be much appreciated.

Runtime Tests:

Code between "Test 1 Start" and "Test 1 End" consumes approximately 60% of the runtime.

The Check_File() sub is responsible for the other 40%

Main Sub:

Sub Check_Companies_House_Files()

Application.ScreenUpdating = False
Application.EnableEvents = False
Application.Calculation = xlCalculationManual

'/================================================================================================================
'/ Author: Zak Michael Armstrong
'/ Email:  -
'/ Date: 07 August 2015
'/
'/ Summary:
'/ Companies House release all the electronic corporate account filings they receive every month, which we download onto our server
'/ This spreadsheet takes a list of companies whose registered headquarters are in certain postcodes (The Filtered Addreses)
'/ And searches the accounts for these companies' filings (using their company number)
'/ It then opens these files (provided in XML or HTML format), and performs a manual search for Cash, Assets and Profits
'/ If the level of cash, assets or profits meet set filtering requirements, it copies the spreadsheet into the main book and/or
'/ adds the distilled information to a list of account data.
'/
'/ Speed test results: With the full filtered list (approx. 1M companies), macro can check 5 Companies House Files / second
'/=================================================================================================================

Dim wbAccountsData              As Workbook         '/ The main Workbook, containing a filtered list of companies to search for and the eventual output data

Dim wsFilteredAddresses         As Worksheet        '/ Contains the list of filtered companies with supplementary data
Dim wsAccountsData              As Worksheet        '/ Will contain a list of companies and the cash, assets, profits reported on their filing

Dim arrFilteredAddresses()      As Variant          '/ An array to hold all the data in the filtered addresses spreadsheet
Dim arrCompanyNumbers()         As Variant          '/ An array to hold all the company numbers (1 column of the filtered addresses data)

Dim strRightString              As String           '/ Used to get the company number out of the filenames for comparison with the array
Dim strLeftString               As String           '/

Dim strCompanyNumber            As String           '/ Unique identifying number with Companies House
Dim strCompanyName              As String           '/ Company's Registered Name
Dim strPostcode                 As String           '/ Postcode of their registered address

Dim strFileName                 As String           '/ Filename of the company accounts
Dim strFolderPath               As String           '/ folder the accounts are stored in
Dim strDay                      As String           '/ the day of the filedate
Dim strMonth                    As String           '/ the month of the filedate
Dim strYear                     As String           '/ the year of the filedate
Dim strFileDate                 As String           '/ the full filedate

Dim lngFinalRow                 As Long             '/ used for determining size of arrays
Dim lngFinalColumn              As Long             '/

Dim lngCounter                  As Long             '/ used for general counting
Dim lngCounter2                 As Long             '/

Dim lngYear                     As Long             '/ Designates the year to be scanning
Dim lngMonth                    As Long             '/ Designates the month to be scanning (each folder contains one month)

Dim varHolder1                  As Variant          '/ General variable holder

Dim I                           As Long             '/ General purpose numbers
Dim J                           As Long             '/
Dim K                           As Long             '/
Dim L                           As Long             '/
Dim M                           As Long             '/
Dim N                           As Long             '/

Dim lngFolderLength             As Long             '/ Counts the number of files in a folder to be scanned

Dim lngTriggerPoint             As Long             '/ Used to trigger debug.print operations at set progress intervals


'/=================================================================================================================
'/ Initial Setup
'/=================================================================================================================

 Debug.Print "Start: " & Now

'/ Remove any residual data
Application.DisplayAlerts = False
L = Worksheets.Count
    Do While L > 2
        Sheets(L).Delete
        L = L - 1
    Loop
Application.DisplayAlerts = True

Set wbAccountsData = ActiveWorkbook

Set wsFilteredAddresses = Sheets("Filtered Addresses")
Set wsAccountsData = Sheets("Accounts Data")

    wsFilteredAddresses.Activate

'/ Create arrays
lngFinalRow = Cells(1048576, 1).End(xlUp).Row
If lngFinalRow = 1 Then lngFinalRow = 1048576

lngFinalColumn = Cells(1, 10000).End(xlToLeft).Column

    Debug.Print "Start array prep: " & Now

    ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)

        '/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
        For L = 1 To lngFinalRow
            For M = 1 To lngFinalColumn
                arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
            Next M
        Next L

    ReDim arrCompanyNumbers(1 To lngFinalRow)

        For L = 1 To lngFinalRow
            arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
        Next L

'/ Currently have data from March 2014 to June 2015
'/ Currently starts at the most recent and counts backward
lngYear = 2015
lngMonth = 6

'/=================================================================================================================
'/ Begin Main loop
'/=================================================================================================================
Do While lngMonth >= 1 '/ approx. 1M files, should (hopefully) finish over a weekend

    lngTriggerPoint = 5000

    '/=============================================================
    '/ Begin Month Loop
    '/=============================================================

            Debug.Print lngYear & " - " & MonthName(lngMonth) & " - " & "Start file checks: " & Now

                strFolderPath = "S:\Investments\Data\Companies House\Monthly Companies House Downloads\Accounts_Monthly_Data-" & MonthName(lngMonth) & lngYear & "\"
                strFileName = Dir(strFolderPath)

            lngFolderLength = 0
                Do While strFileName <> ""
                    lngFolderLength = lngFolderLength + 1
                    strFileName = Dir
                Loop

                    '/===============================
                    '/ Test 1 start (not including call check_file)
                    '/===============================    
                    strFileName = Dir(strFolderPath)
                    lngCounter = 0


                    Do While strFileName <> ""
                        lngCounter = lngCounter + 1

                        strRightString = Right(strFileName, 22)
                        strLeftString = Left(strRightString, 8)
                        strCompanyNumber = strLeftString
                            K = 1

                            '/=============================================================
                            '/ Search arrCompanyNumbers for the current file's company
                            '/=============================================================
                                Do While K <= UBound(arrCompanyNumbers)
                                        If strCompanyNumber = arrCompanyNumbers(K) _
                                            Then
                                                If lngCounter > lngTriggerPoint _
                                                    Then
                                                        Debug.Print (lngCounter & " - " & lngFolderLength & " - " & Now & " - " & MonthName(lngMonth) & " - " & lngYear)
                                                        lngTriggerPoint = lngTriggerPoint + 5000
                                                End If

                                                strCompanyName = arrFilteredAddresses(K, 1)
                                                strPostcode = arrFilteredAddresses(K, 10)
                                                strDay = Left(Right(strFileName, 7), 2)
                                                strMonth = Left(Right(strFileName, 9), 2)
                                                strYear = Left(Right(strFileName, 13), 4)
                                                strFileDate = strDay & "." & strMonth & "." & strYear

                                               '/wsFilteredAddresses.Activate       '/ originally introduced to save time by deleting companies from the list as they were found
                                               '/wsFilteredAddresses.Rows(K).Delete '/ taken out as not huge time saving, and means only most recent filing is found

                                                '/ The subroutine opens the file in question and tries to filter the company against set financial values
                                                Call Check_file(strCompanyNumber, strCompanyName, strPostcode, strFileName, strFolderPath, strFileDate, _
                                                wbAccountsData, wsAccountsData)
                                                DoEvents

                                                K = UBound(arrCompanyNumbers) + 1
                                        End If
                                    K = K + 1
                                Loop

                        strFileName = Dir
                    Loop
                    '/===================================
                    '/Test 1 End
                    '/===================================

    '/=============================================================
    '/ End Month Loop
    '/=============================================================

    Debug.Print lngYear & " - " & MonthName(lngMonth) & " - " & "Finish: " & Now
    Debug.Print "Files: " & lngCounter
    Debug.Print ""

    lngMonth = lngMonth - 1

        If lngMonth <= 0 _
            Then
                lngMonth = lngMonth + 12
                lngYear = lngYear - 1
        End If

    If lngYear = 2014 And lngMonth = 3 Then lngYear = 2000
Loop

'/=================================================================================================================
'/ End Main loop
'/=================================================================================================================

Debug.Print "Macro Finish: " & Now

wsAccountsData.Activate

Application.Calculation = xlCalculationAutomatic
Application.EnableEvents = True
Application.ScreenUpdating = True

End Sub

Check_File()

Private Sub Check_file(ByVal strCompanyNumber As String, ByVal strCompanyName As String, ByVal strPostcode As String, ByVal strFileName As String, ByVal strFolderPath As String, ByVal strFileDate As String, _
                        ByRef wbAccountsData As Workbook, ByRef wsAccountsData As Worksheet)

'/================================================================================================================
'/ Author: Zak Michael Armstrong
'/ Email:  -
'/
'/ Summary:
'/ Opens the file, searches the first and second columns for phrases that correspond to cash, assets and profit
'/ Searches across the row from those terms until it finds a number
'/ It assumes that number is the value it's after
'/ If the data meet set filtering requirements, copies the data and/or the worksheet into the main AccountsData workbook
'/=================================================================================================================

Dim wbTempFile                      As Workbook         '/ This workbook, containing the accounts filing
Dim wsTempFile                      As Worksheet        '/ This worksheet (always just 1), containing the accounts filing

Dim arrFirstColumn()                As Variant          '/ The first column of data for text search
Dim arrSecondColumn()               As Variant          '/ The second column of data for text search

Dim varCash                         As Variant          '/ pre-formatted cash value
Dim curCash                         As Currency         '/ currency-formatted cash value

Dim varAssets                       As Variant          '/ pre-formatted assets value
Dim curAssets                       As Currency         '/ currency-formatted assets value

Dim varProfits                      As Variant          '/ pre-formatted profits value
Dim curProfits                      As Currency         '/ currency-formatted profits value

Dim colCashPhrases                  As Collection       '/ contains all the phrases I've found that correspond to comapnies' current cash
Dim colAssetPhrases                 As Collection       '/ contains all the phrases I've found that correspond to comapnies' current assets
Dim colProfitPhrases                As Collection       '/ contains all the phrases I've found that correspond to comapnies' current profits

Dim lngCurrentRow                   As Long             '/ General indicators
Dim lngCurrentColumn                As Long             '/
Dim lngFinalRow                     As Long             '/
Dim lngFinalColumn                  As Long             '/

Dim strPhraseHolder                 As String           '/ will hold a string from a collection for text matching

Dim varHolder1                      As Variant          '/ General variable holders
Dim varHolder2                      As Variant          '/
Dim varHolder3                      As Variant          '/

Dim bCashFound                      As Boolean          '/ Checks to see if the program found the values
Dim bAssetsFound                    As Boolean          '/
Dim bProfitsFound                   As Boolean          '/

Dim bCashFilter                     As Boolean          '/ Is the value going to be used for filtering
Dim bAssetsFilter                   As Boolean          '/
Dim bProfitsFilter                  As Boolean          '/

Dim curCashFilterValue              As Currency         '/ the values to set the filter at
Dim curAssetsFilterValue            As Currency         '/
Dim curProfitsFilterValue           As Currency         '/

Dim strCashFilterDirection          As String           '/ whether to filter >= or <=
Dim strAssetsFilterDirection        As String           '/
Dim strProfitsFilterDirection       As String           '/

Dim bPassedCashFilter               As Boolean          '/ Handling the (up to) 3 filters separately so these are to check that
Dim bPassedAssetsFilter             As Boolean          '/ each filter case has been handled correctly
Dim bPassedProfitsFilter            As Boolean          '/

Dim I                               As Long             '/ General counters
Dim J                               As Long             '/
Dim K                               As Long             '/
Dim L                               As Long             '/
Dim M                               As Long             '/
Dim N                               As Long             '/

'/=================================================================================================================
'/ Initialise variables, set filter parameters
'/=================================================================================================================

Workbooks.Open (strFolderPath & strFileName)
Set wbTempFile = Workbooks(strFileName)
Set wsTempFile = wbTempFile.Sheets(1)

bCashFound = False
bAssetsFound = False
bProfitsFound = False

'/ Column 1 data
lngFinalRow = Cells(1048576, 1).End(xlUp).Row

ReDim Preserve arrFirstColumn(1 To lngFinalRow)
For I = 1 To lngFinalRow
    arrFirstColumn(I) = UCase(Left(Cells(I, 1).Text, 40)) '/ Left(40) is in case of extremely long cell text
Next I

'/ Column 2 data
lngFinalRow = Cells(1048576, 2).End(xlUp).Row

ReDim Preserve arrSecondColumn(1 To lngFinalRow)
For I = 1 To lngFinalRow
    arrSecondColumn(I) = UCase(Left(Cells(I, 2).Text, 40)) '/ Left(40) is in case of extremely long cell text
Next I

' Fill Collections
Set colCashPhrases = New Collection
colCashPhrases.Add ("Cash at bank and in hand")
colCashPhrases.Add ("Cash at bank")
colCashPhrases.Add ("Cash in hand")
colCashPhrases.Add ("Cash at bank and in hand:")
colCashPhrases.Add ("Cash at bank:")
colCashPhrases.Add ("Cash in hand:")

Set colAssetPhrases = New Collection
colAssetPhrases.Add ("Net Current Assets")
colAssetPhrases.Add ("Total net assets (liabilities)")
colAssetPhrases.Add ("Net Current Assets (liabilities)")
colAssetPhrases.Add ("Total Assets Less current liabilities")
colAssetPhrases.Add ("Net Current assets/(liabilities)")
colAssetPhrases.Add ("Net Current Assets:")
colAssetPhrases.Add ("Total net assets (liabilities):")
colAssetPhrases.Add ("Net Current Assets (liabilities):")
colAssetPhrases.Add ("Total Assets Less current liabilities:")
colAssetPhrases.Add ("Net Current assets/(liabilities):")

Set colProfitPhrases = New Collection
colProfitPhrases.Add ("Profit and loss account")
colProfitPhrases.Add ("Profit and loss account:")

bCashFilter = False
bAssetsFilter = False
bProfitsFilter = True

curCashFilterValue = 0
curAssetsFilterValue = 0
curProfitsFilterValue = 250000

strCashFilterDirection = ">="
strAssetsFilterDirection = ">="
strProfitsFilterDirection = ">="

'/=================================================================================================================
'/ Search File for Cash, Assets and Profits
'/=================================================================================================================
On Error Resume Next

'/ Search for Cash Value
I = 1
    Do While I <= colCashPhrases.Count
        strPhraseHolder = UCase(colCashPhrases(I))
        varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
            If IsError(varHolder1) _
                Then
                varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
                varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
            End If

        '/ varholder1 holds the index, varholder2 holds the text value (if found)
        If CStr(varHolder2) = strPhraseHolder _
            Then
                lngCurrentRow = varHolder1
                lngCurrentColumn = 1
                lngFinalColumn = Cells(lngCurrentRow, 10000).End(xlToLeft).Column

                    Do While lngCurrentColumn <= lngFinalColumn
                        lngCurrentColumn = lngCurrentColumn + 1
                            varCash = Cells(lngCurrentRow, lngCurrentColumn).Value
                                If IsNumeric(varCash) And CLng(varCash) <> 0 _
                                    Then
                                        lngCurrentColumn = lngFinalColumn + 1
                                        curCash = CCur(varCash)
                                        bCashFound = True
                                End If
                    Loop
        End If

        If bCashFound = False Then I = I + 1
        If bCashFound = True Then I = colCashPhrases.Count + 1
    Loop

'/ Search for Assets value
I = 1
    Do While I <= colAssetPhrases.Count
        strPhraseHolder = UCase(colAssetPhrases(I))
        varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
            If IsError(varHolder1) _
                Then
                varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
                varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
            End If

        '/ varholder1 holds the index, varholder2 holds the text value (if found)
        If CStr(varHolder2) = strPhraseHolder _
            Then
                lngCurrentRow = varHolder1
                lngCurrentColumn = 1
                lngFinalColumn = Cells(lngCurrentRow, 10000).End(xlToLeft).Column

                    Do While lngCurrentColumn <= lngFinalColumn
                        lngCurrentColumn = lngCurrentColumn + 1
                            varAssets = Cells(lngCurrentRow, lngCurrentColumn).Value
                                If IsNumeric(varAssets) And CLng(varAssets) <> 0 _
                                    Then
                                        lngCurrentColumn = lngFinalColumn + 1
                                        curAssets = CCur(varAssets)
                                        bAssetsFound = True
                                End If
                    Loop
        End If

        If bAssetsFound = False Then I = I + 1
        If bAssetsFound = True Then I = colAssetPhrases.Count + 1
    Loop

'/ Search for profits value
I = 1
    Do While I <= colProfitPhrases.Count
        strPhraseHolder = UCase(colProfitPhrases(I))
        varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
            If IsError(varHolder1) _
                Then
                varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
                varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
            End If

        '/ varholder1 holds the index, varholder2 holds the text value (if found)
        If CStr(varHolder2) = strPhraseHolder _
            Then
                lngCurrentRow = varHolder1
                lngCurrentColumn = 1
                lngFinalColumn = Cells(lngCurrentRow, 10000).End(xlToLeft).Column

                    Do While lngCurrentColumn <= lngFinalColumn
                        lngCurrentColumn = lngCurrentColumn + 1
                            varProfits = Cells(lngCurrentRow, lngCurrentColumn).Value
                                If IsNumeric(varProfits) And CLng(varProfits) <> 0 _
                                    Then
                                        lngCurrentColumn = lngFinalColumn + 1
                                        curProfits = CCur(varProfits)
                                        bProfitsFound = True
                                End If
                    Loop
        End If

        If bProfitsFound = False Then I = I + 1
        If bProfitsFound = True Then I = colProfitPhrases.Count + 1
    Loop

    On Error GoTo 0

'/=================================================================================================================
'/ Determine filter outcome
'/=================================================================================================================

bPassedCashFilter = False
bPassedAssetsFilter = False
bPassedProfitsFilter = False

'/ Filter Cash

If bCashFilter = True _
    Then
        Select Case strCashFilterDirection

            Case Is = ">="
                If curCash >= curCashFilterValue Then bPassedCashFilter = True Else bPassedCashFilter = False

            Case Is = "<="
                If curCash <= curCashFilterValue Then bPassedCashFilter = True Else bPassedCashFilter = False

            Case Else
                MsgBox ("Macro encountered an unexpected error whilst filtering file financial data")
                Stop
        End Select
    Else
        bPassedCashFilter = True
End If

'/ Filter Assets

If bAssetsFilter = True _
    Then
        Select Case strAssetsFilterDirection

            Case Is = ">="
                If curAssets >= curAssetsFilterValue Then bPassedAssetsFilter = True Else bPassedAssetsFilter = False

            Case Is = "<="
                If curAssets <= curAssetsFilterValue Then bPassedAssetsFilter = True Else bPassedAssetsFilter = False

            Case Else
                MsgBox ("Macro encountered an unexpected error whilst filtering file financial data")
                Stop
        End Select
    Else
        bPassedAssetsFilter = True
End If

'/ Filter Profits

If bProfitsFilter = True _
    Then
        Select Case strProfitsFilterDirection

            Case Is = ">="
                If curProfits >= curProfitsFilterValue Then bPassedProfitsFilter = True Else bPassedProfitsFilter = False

            Case Is = "<="
                If curProfits <= curProfitsFilterValue Then bPassedProfitsFilter = True Else bPassedProfitsFilter = False

            Case Else
                MsgBox ("Macro encountered an unexpected error whilst filtering file financial data")
                Stop
        End Select
    Else
        bPassedProfitsFilter = True
End If

'/ The filter might return true against a default value of 0 if real number not found, so fail if real number not found
If bCashFound = False And bCashFilter = True Then bPassedCashFilter = False
If bAssetsFound = False And bAssetsFilter = True Then bPassedAssetsFilter = False
If bProfitsFound = False And bProfitsFilter = True Then bPassedProfitsFilter = False

'/ if passed all 3 conditions, then print and/or copy to main workbook
If bPassedCashFilter = True And bPassedAssetsFilter = True And bPassedProfitsFilter = True _
    Then
        wbAccountsData.Activate
        wsAccountsData.Activate
        lngFinalRow = Cells(1048576, 2).End(xlUp).Row
        lngCurrentRow = lngFinalRow + 1

        Cells(lngCurrentRow, 2) = strCompanyNumber
        Cells(lngCurrentRow, 3) = strCompanyName
        Cells(lngCurrentRow, 4) = strPostcode
        Cells(lngCurrentRow, 5) = curCash
        Cells(lngCurrentRow, 6) = curAssets
        Cells(lngCurrentRow, 7) = curProfits
        Cells(lngCurrentRow, 8) = strFileDate

''        '/ copies worksheet to main workbook
''
''        wbTempFile.Activate
''        wsTempFile.Copy After:=wbAccountsData.Worksheets(wbAccountsData.Worksheets.Count)
''        wbAccountsData.Activate
''        ActiveSheet.Name = strCompanyNumber & " - " & strFileDate

End If

wbAccountsData.Activate
wbTempFile.Close

End Sub
\$\endgroup\$
4
  • \$\begingroup\$ Can you some text to your post to explain which bit(s) of the code are taking the most time? Do you also only open those HTML files whose names exist in the ~1 million long list? \$\endgroup\$ Commented Aug 20, 2015 at 20:17
  • \$\begingroup\$ done and yes. Approximately 25% of the files should be in the list, on average. \$\endgroup\$
    – Kaz
    Commented Aug 21, 2015 at 8:46
  • \$\begingroup\$ You are taking all of a used range in a sheet (or is it a column?) and storing it in an array that's going to be a 2 dimensional array, right? Then you are entering a directory where there are a bunch of folders with documents in them? Or are all the documents in a single directory? Then you go through the array to find the files? Or do you go through the files and match to the array? \$\endgroup\$ Commented Aug 21, 2015 at 11:29
  • \$\begingroup\$ I hope the edits make it clearer. \$\endgroup\$
    – Kaz
    Commented Aug 21, 2015 at 12:06

2 Answers 2

8
\$\begingroup\$

Here are a couple of things to look into:

  1. In the Check_Companies_House_Files, near the start you have this block of code:

    ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
    
    '/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
    For L = 1 To lngFinalRow
        For M = 1 To lngFinalColumn
            arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
        Next M
    Next L
    
    ReDim arrCompanyNumbers(1 To lngFinalRow)
    
    For L = 1 To lngFinalRow
        arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
    Next L
    

    Why have two separate loops? You should be able to merge them.

    ReDim arrFilteredAddresses(1 To lngFinalRow, 1 To lngFinalColumn)
    ReDim arrCompanyNumbers(1 To lngFinalRow)
    
    '/ Done iteratively because excel throws an "Out of memory" if I try to pass the whole range to the array in one go. Approx. 2 minutes for 1Million length list
    For L = 1 To lngFinalRow
        For M = 1 To lngFinalColumn
            arrFilteredAddresses(L, M) = wsFilteredAddresses.Cells(L, M).Text
        Next M
        arrCompanyNumbers(L) = Right("00000000" & arrFilteredAddresses(L, 2), 8) '/ company numbers in the filenames are always 8 digits long, with 0's filling up any extra digits
    Next L
    

    When you build the arrCompanyNumbers array, you do that text formatting 1 million times. Why not add a column to your Excel sheet with the value already formatted using the TEXT function? You then just read that column into the array.

  2. Your code is looping through your array arrCompanyNumbers looking for a matching entry. You might to look into using a Scripting.Dictionary instead because the Exists method is possibly much quicker than a 1 million long array. Presumably your company number values are all unique?

  3. If your files are HTML, Excel probably spends a lot of time parsing the HTML code when opening the file. You could try using the Scripting.FileSystemObject and the Scripting.TextStream.ReadAll method. This will load the file into a string variable and you could then use the InStr function to search for your text entries. If the HTML is complicated/fussy it might be too tricky to find the value that goes with your heading.

  4. You do this kind of thing a couple of times:

        varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
        varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
        If IsError(varHolder1) _
            Then
            varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
            varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
        End If
    

    I've got two issues with this. First, why bother getting the value for varHolder2 before you've checked if varHolder1 is an error? You should check varHolder1 first and then decide what to do. (By the way, why do you put the Then onto a new line and indent it? I would leave it on the same line as the If unless the line is very long.)

        varHolder1 = Application.Match(strPhraseHolder, arrFirstColumn, 0)
        If Not IsError(varHolder1) Then
            varHolder2 = UCase(Application.Index(arrFirstColumn, varHolder1))
        Else
            varHolder1 = Application.Match(strPhraseHolder, arrSecondColumn, 0)
            varHolder2 = UCase(Application.Index(arrSecondColumn, varHolder1))
        End If
    

    The second thing, is why use the Index function at all? The Match function has told you the row number/index number of the element. Why not use:

    varHolder2 = UCase(arrSecondColumn(varHolder1))
    
  5. A minor niggle is you hardcoding the max number of rows in your code, e.g. lngFinalRow = Cells(1048576, 1).End(xlUp).Row. This is fragile and might break if you upgrade Excel. Better to use lngFinalRow = Cells(wsAccountsData.Rows.Count, 1).End(xlUp).Row

\$\endgroup\$
1
  • \$\begingroup\$ Thank you, they all sound like useful suggestions. I'll try them out and see what effect they have. the if _ then thing is just a personal habit. I like the visual structure. \$\endgroup\$
    – Kaz
    Commented Aug 22, 2015 at 0:09
3
\$\begingroup\$

So, 3 months later, I come back to this project. This time around, I implemented some serious Benchmarking and experimentation.

File Matching:

I tried a number of things. I got a good solution going by sorting the company numbers and filenames in ascending order so I would only have to iterate through each once (occasionally backtracking for overshoots).

This yielded: 70 minutes to match 80,000 filenames against 1,000,000 company numbers. This formed my basis for benchmarking.

Then, Jackpot: I put all my company numbers into a collection, where each company number was also its own key. I then used the following code to check if a file existed:

    var = 0
    On Error Resume Next
    var = colCompanyNumbers.Item(strCompanyNumber)
    On Error GoTo 0
    If var <> 0 Then bMatchFound = True

The new time: <1 second. I've never used them, but I imagine using Scripting.Dictionary and exists() (as suggested in ChipsLetten's answer) would yield results of a similar magnitude.

I now understand this is because of the nature of arrays (contiguous memory blocks) and hashed linked lists (collections, dictionaries etc.).

Whilst iterating through values is faster in an array, a hashed linked list means checking for the existence of an element takes a negligible amount of time, regardless of the size of the list, because you don't have to iterate through every value in order to do it.

\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.