4
\$\begingroup\$

For the first time I've worked in Excel VBA to find rows in my dataset that contain the same adress as another entry in a cluster. These entries have to be merged and the row then is deleted. I've come up with the following, which works (As far as I can tell from the testing I did on small samples of the set):

Sub Merge_Orders()

Application.ScreenUpdating = False

Application.DisplayStatusBar = False


Dim lastrow As Long
lastrow = Cells(Rows.Count, "A").End(xlUp).Row
Dim y As Long
Dim x As Long
Dim j As Long
Dim k As Long

For i = 2 To lastrow //for each row, starting below header row
j = 1

y = (Cells(i, 9)) //this is the clusternumber
Do While y = (Cells(i + j, 9)) //while the next row is in the same cluster
 x = (Cells(i, 12)) //this is the adresscode
 k = 1
    Do While x = (Cells(i + k, 12)) //while the current adresscode is the same as the next, iterating until another adresscode is hit
     Cells(i, 16) = Val(Cells(i, 16)) + Val(Cells(i + k, 16)) //update cell value
     Cells(i, 18) = Cells(i, 18) + Cells(i + k, 18)  //update cell value
     Cells(i, 19) = Cells(i, 19) + Cells(i + k, 19)  //update cell value
     If Cells(i, 20) > Cells(i + k, 20) Then
     Cells(i, 20) = Cells(i + k, 20)  //update cell value
    End If
    If Cells(i, 21) > Cells(i + k, 21) Then
     Cells(i, 21) = Cells(i + k, 21)  //update cell value
    End If
    Cells(i, 22) = Cells(i, 22) + Cells(i + k, 22)  //update cell value
    Cells(i, 23) = Cells(i, 23) + Cells(i + k, 23)  //update cell value

    Rows(i + 1).EntireRow.Delete //Delete the row from which data was pulled
    k = k + 1
    Loop
 j = j + 1
Loop

Next i

Application.ScreenUpdating = True
Application.DisplayStatusBar = True
End Sub

The problem I'm facing is time. Testing this on a small sample of ~50 rows took over 5 minutes. My entries total over 100K rows. I would expect it has to run for close to 10 days. It's been running for over a day with no end in sight.

Is there a way to optimize this so I don't have to wait until I'm grey?

Note: I've added //comments just for CR/SO, they are not in VBA.

\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

Correctness

First of all, I have some concerns whether your code actually does what you want it to do. Let me explain why I think this.

  1. In your inner Do While Loop you delete the next row, but still increment k. This means that you are skipping more and more rows. I think your intention was to always look at the next row, i.e. at k = 1.
  2. In the inner loop you do not check whether you are still in the same cluster. So, you will end up aggregating over different clusters if the addresses match.
  3. The inner loop does not depend on the middle loop.
  4. Since you delete rows, your outer loop will run past the last remaining row.

Performance

Next, let me give some performance tips that do not change your code radically:

  1. Accessing a worksheet in Excel VBA is expensive. You should load as much as possible into the memory at once. To achieve this you can load an entire range into a two dimentional Variant array by assigning the Value property of the range to a variant variable, or better the Value2 property. (See this blog post for some reason to use Value2.) You can write the array back to a range by assigning to the Value property of the range. If you go down this road, you will have to employ some collection to save the rows to delete for later. (When you remove the rows, you should do it from the bottom to the top as this guarantees that the row numbers of the remaining rows are preserved.)

  2. Your middle loop is redundant. This is the case because the loop body does not depend on the index variable j. I think what you actually wanted to do is determine whether x = (Cells(i + 1, 12)) And y = (Cells(i + 1, 9)) and only do one loop. (I already set k and j to 1 in accordance with my first comment from above.)

  3. You can save some redundant itterations by decrementing lastrow by 1 whenever you delete a row.

Style

Now, let me come to some general remarks regarding the programming style.

  1. It is a very good practice to give variables meaningful names. This helps the next reader of your code a lot trying to understand what the code is doing. In your case, it would make most comments unnecessary. E.g. you could use currentCluster instead of y and currentAddress instead of x. Moreover, If you load the columns into variant arrays, you can call the array for row 9 clusterIds and that for column 12 adresses.
  2. You should avoid to use the active sheet implicitely. You are referring to it with your calls to Cells. This can lead to very hard to detect errors. Instead you could explicitly use it by writing ActiveSheet.Cells or enclosing everything in a With ActiveSheet block and then using .Cells. Personally, I think the first alternative is better since the period can be overlooked or forgotten easily.
  3. You should also avoid to use default properties, like the Value property of the range object returned by Cells. Using default properties does not only hurt the readability, it can also lead to obscure errors. Instead, you should use the Value property directly, or better Value2 when you are reading the data, especially when querying cells formatted as currency. These get crippled by Value. (Again, see this blog post.)
  4. Always declare your variables. Otherwise, they are implicitly Variant. In your code, y and i are not declared. To ensure that you always declare variables, you can add Option Explicit. Then you get a compiler error for undeclared variables. This has the added advantage that you cannot inadvertedly introduce a new variable by a typo. (Such bugs are very hard to find.)
  5. Get into the habit not to use underscores in method names unless there is a syntactic reason for it. This would be the case when writing an event handler or implementing an interface. The problem with underscores surfaces when using classes in VBA. You actually cannot implement a method of an interface if it contains an underscore. (This is a bug of the VBE.)

Alternative Approaches

Finally, I would like to present a completely different approach to your problem: to use the SQL support for Excel via ADODB. How to use it, can be found in this Microsoft article.

Using it the output you seem to want could be produced with a SQL statement similar to the following.

SELECT Cluster, Address, SUM(Value1), SUM(Value2), SUM(Value3)
FROM [ActiveSheetName$]
GROUP BY Cluster, Address

Here, ActiveSheetName should be the name of the active sheet, which you can get via ActiveSheet.Name and ValueX should be the names in the header row of the columns you want to sum for matching clusters and addresses.

An alterntive approach would be not to use Excel at all and instead use a real database, e.g. Access for starters. They are made to handle requests like yours efficiently and also provide ways to enforce data integrity.

\$\endgroup\$
3
  • \$\begingroup\$ +1 for handing off to ADODB, far quicker and less reinventing the wheel than the vba solution. \$\endgroup\$ Commented Jan 19, 2017 at 21:37
  • \$\begingroup\$ Hi, thanks for the detailed response. You are right about k skipping more and more rows. Unfortunately it can happen that even though the next row doesn't have a matching adress code, the one after that does. So checking only at value(i+1) doesn't hold. Suggested elsewhere was that I check (i+j), which I'm trying our right now. I'm using all your other comments and writing them into VBA(no underscores, value2, logical names). I have no experience with using ADODB and am on a bit of a clock, so I would like to try to avoid that. Thanks for the suggestion, will do it if this doesn't work. \$\endgroup\$
    – Rob
    Commented Jan 20, 2017 at 9:53
  • \$\begingroup\$ Yes, you can check both conditions for i+j in the inner loop. (You really have to check both in the inner loop to avoid leaving the cluster.) When you do this, you also have to delete row i+j instead of row i+1. As an alternative to the inner loop you can also use an if else block only checking the address. The if block would contain the loop body and the else block would increment j. \$\endgroup\$
    – M.Doerner
    Commented Jan 20, 2017 at 12:02

Not the answer you're looking for? Browse other questions tagged or ask your own question.