1

I'm doing something wrong with summing dynamic arrays data across rows and columns in Excel365.

I have an initial input (not in a table):

    A      B       C       D       E       F       G       H       I
1   data  value1  value2  value3  value4  value5  value6  value7  value8
2   a         83      39      84      56      81      16      83      66
3   b         74      74      42       0      98      14      86      18
4   c         53      50      71      79      72       1      57      54
5   d          4      67       7      57      62       7       6      73
6   b         99      67      47      65      73      92      73      59
7   c         30       0      84      30      49      89      70      48
8   d         95      42      73       1      80       9      52      63
9   b         49      45      78      68      89      54      28       5
10  b         36      68      99      37      72      49      90      66
11  d         15      80      34      47      89      96      87      83
12  a         73       4      17      59      71      86       4      93
13  b         78      21      15      94      92      33      17      50
14  a          9      35      57      47      30      12      62      89
15  a          7      66      77      51      69      75      92      49
16  a         19      89      43       0      67      81      62      90
17  a        100      63      17      83      52      26      48       4

Theres's nothing special about the particular values. In my production spreadsheet, it's the number of MAC addresses on given interfaces, essentially arbitrary data that for this example I generated using the RANDARRAY function then copied->pasted as values.

I then convert this to a series of spilled arrays.

B20: "=B1:I1"
A21: "=A2:A17"
B21: "=B2:I17"

This results the following (I've used "ASCII art" to show the resulting... I think they're called "spilled dynamic arrays"?)

    A       B       C       D       E       F       G       H       I
.         --------------------------------------------------------------
20 data  |value1  value2  value3  value4  value5  value6  value7  value8|
.         --------------------------------------------------------------
.   -     --------------------------------------------------------------
21| a |  |    83      39      84      56      81      16      83      66|
22| b |  |    74      74      42       0      98      14      86      18|
23| c |  |    53      50      71      79      72       1      57      54|
24| d |  |     4      67       7      57      62       7       6      73|
25| b |  |    99      67      47      65      73      92      73      59|
26| c |  |    30       0      84      30      49      89      70      48|
27| d |  |    95      42      73       1      80       9      52      63|
28| b |  |    49      45      78      68      89      54      28       5|
29| b |  |    36      68      99      37      72      49      90      66|
30| d |  |    15      80      34      47      89      96      87      83|
31| a |  |    73       4      17      59      71      86       4      93|
32| b |  |    78      21      15      94      92      33      17      50|
33| a |  |     9      35      57      47      30      12      62      89|
34| a |  |     7      66      77      51      69      75      92      49|
35| a |  |    19      89      43       0      67      81      62      90|
36| a |  |   100      63      17      83      52      26      48       4|
.   -     --------------------------------------------------------------

In my production spreadsheet I did some data mangling across multiple sheets, but I recreated this sample and got the same issue even with these simple expressions

So my issue - I applied new functions:

B40: "=B20#"
A41: "=UNIQUE(A21#)"
B41: "=SUMIFS(B21:B36,A21#,A41#)"

Which works fine:

<code>
.   A       B       C       D       E       F       G       H       I
.         --------------------------------------------------------------
40 data  |value1  value2  value3  value4  value5  value6  value7  value8|
.         --------------------------------------------------------------
.   -     ------
41| a |  |   291|
42| b |  |   336|
43| c |  |    83|
44| d |  |   114|
.   -     ------

But if I change B41 to either "=SUMIFS(B21#,A21#,A41#)" or "=SUMIFS(B21#,A21#,A41#,B20#,A40#)" I get a result of #VALUE! on all spilled array cell tables (though, the spilled array sizes are as expected)

The result I'm ultimately looking for is the working 2-D sumifs results in a single formula at cell B41, with a spilled-array to fill all rows/collumns {a} - {d} and values 1-8:

    A       B       C       D       E       F       G       H       I
.         --------------------------------------------------------------
40 data  |value1  value2  value3  value4  value5  value6  value7  value8|
.         --------------------------------------------------------------
.   -     --------------------------------------------------------------
41| a |  |   291     296     295     296     370     296     351     391|
42| b |  |   336     275     281     264     424     242     294     198|
43| c |  |    83      50     155     109     121      90     127     102|
44| d |  |   114     189     114     105     231     112     145     219|
.   -     --------------------------------------------------------------

I realize I could simply do a small number of sumifs and make things easier on my sample data set, but my production data set has over 26million datapoints per sheet and multiple sheets, and it's taking hours to process. I'm finding dynamic spilled arrays to be far more time-efficient, even if for no other reason than less RAM aka less swap file usage and probably also better L2 cache performance. Would be nice to have a workbook that doesn't take 2hr to close.

5
  • Whats wrong with doing: "=SUMIFS(B$21:B$36,$A$21#,$A$41#)" and copying over? Commented Mar 28, 2022 at 16:50
  • I think I understand your problem. If you follow Scott's formula above, the same compare operation will be repeated for each column. Instead you want to compare once and return multiple sumifs at once in a 2D array. Unfortunately SUMIFS dont sum on 2D arrays... but maybe matrix multiplication can help... Commented Mar 28, 2022 at 19:47
  • I'll be honest, I had a need to optimize , my instinct said this is the right way to go rather than a long series of "=sumifs(..." then copy/paste/paste.... With 15,000 of columns and 30,000 rows created from (but never linked to) a CSV input set that started at 26milion data points, seems like I need to do everything I can think of to optimize the spreadsheet. See mobus's point 14 in the article he references below. Shoot - first thing I had to do was figure out how to convert 26million 1-D entries into a 2-D format so that I could open it in a spreadsheet at all. Commented Mar 29, 2022 at 2:02
  • Out of interest, how did you fold the 26mil 1D entries into a 2D table? Did you run some sort of script to replace 29 out of 30 line feeds with commas in the CSV? Did you consider using a data model in Excel Power Query / "Transform & Get"? See here chandoo.org/wp/more-than-million-rows-in-excel Commented Mar 30, 2022 at 6:40
  • python; it's how I got my data in the first place, so adding a 1D -> 2D script was easy. Short version: the original CSV had the row and column in each of the 26million lines. It was fairly straightforward to parse that into a dict of dicts (outer dict had each row label as a key so it was a lot of keys but thats ok), then flatten it into a list of dicts (by adding a "row":<row_label> to each inner dict) that I could then re-write to a CSV file. I then had sort the data both row- and column-wise in Excel once I imported it, but that was straightforward. Commented Mar 30, 2022 at 14:16

2 Answers 2

1

Solution: Multiple SUMIFs in a single array formula

This is the equivalent of a 2D SUMIF giving an array answer

=MMULT(--(TRANSPOSE(A21#)=A41#),B21#)

How it works: (this explanation ended up way longer than I hoped!)

MMULT does matrix multiplication and is one of very few ways in Excel you can change the length (and width) of an array in a single array formula. The fact that you wanted to reduce a large 2D table to a smaller 2D table was already the hint for me that MMULTI might be the way to go.

Understanding how exactly matrix multiplication works is beyond this post but you can read more here. It is important to understand how the dimensions change. If C = A.B, and matrix A is m x n (rows x columns), then B must have n rows to match A's n columns. If B is n x p, then C will have output dimensions of m x n . n x p = m x p.

If A is a horisontal/row array 1 x n and B is a vertical/column array n x 1, the answer of A.B will be 1 x 1 in dimension i.e. a single scalar value. This is essentially the same as the dot product or SUMPRODUCT in Excel (a1 * b1 + a2 * b2 + ... + an * bn), with the only difference that SUMPRODUCT expects the arrays to be either both horisontal or both vertical. If you now stack B in 2D, i.e. n x p, then MMULT(A, B) is effectively doing p SUMPRODUCTs one next to the other with output dimensions 1 x p.

From SUMIF to MMULT

Any SUMIF can be presented as a SUMPRODUCT. The formula =SUMIF(A1:A10,"yes",B1:B10) is equivalent to =SUMPRODUCT(--(A1:A10="yes"), B1:B10) where the -- simply serves to convert boolean TRUE/FALSE to numerical 1/0. --(A1:A10="yes") becomes a mask array of 0's and 1's, selecting which entry in B gets added together. Likewise, SUMIF/SUMPRODUCT can be done with MMULT just being mindful of the array orientation =MMULT(--(TRANSPOSE(A1:A10)="yes"), B1:B10).

Unfortunately neither SUMIF or SUMPRODUCT extend to 2D i.e. =SUMIF(A1:A10,"yes",B1:F10) does not work and can't provide a selective sum total array in say B11:F11. MMULT to the rescue:

=MMULT(--(TRANSPOSE(A1:A10)="yes"), B1:F10)

which is in terms of dimensions 1 x 10 . 10 x 5 = 1 x 5. Each column of B1:F10 is matrix multiplied/"sum product'ed" by the same mask array.

Finally, in the last step of mind bending maths, imagine that you want to summarise the above "yes" results and the "no" results below it. You could do, in two rows:

=MMULT(--(TRANSPOSE(A1:A10)="yes"), B1:F10)
=MMULT(--(TRANSPOSE(A1:A10)="no"), B1:F10)

but why not

=MMULT(--(TRANSPOSE(A1:A10)={"yes"; "no"}), B1:F10)

or likened to your specific situation

=MMULT(--(TRANSPOSE(A1:A10)=UNIQUE(A1:A10)), B1:F10)

which is in terms of dimensions 2 x 10 . 10 x 5 = 2 x 5.

To understand how Excel handles comparing matrixes of different sizes, it may help to test =TRANSPOSE(A1:A10)={"yes"; "no"} separately. Take the "; no" away. Test. Replace with ", no" and test. Note ";" is row separator and "," is column separator (at least for my localisation settings).

From SUMIFS to MMULT

To implement an array output SUMIFS with multiple criteria

=MMULT(TRANSPOSE(BITAND((A1:A10={"yes","no"}),(G1:G10="active"))),B1:F10)

Final note about efficiency SUMIF vs SUMPRODUCT/MMULT

As shown, these functions can give equivalent answers for the 1D case. It is important to note that SUMIF is probably more efficient (uses less memory and have no multiplication instructions). If the criteria is FALSE, SUMIF does nothing, whilst SUMPRODUCT still multiplies with zero and adds the result to the total. Now comparing multiple SUMIF's, each in its own cell, vs MMULT isn't so easy. While MMULTI would need m x n memory to store the mask and m x n x p additional multiplications, each SUMIF will repeat the criteria comparison with m x n x (p-1) more cell reads and (slow text?) comparisons. A speed comparison will be nice to have, but my money is on MMULT.

9
  • I'm not sure if this would necessarily be faster or less memory intensive than Scott's suggestion. Commented Mar 28, 2022 at 19:57
  • Ps. with such a massive workbook, you want to avoid using any volatile functions. Also try saving your file in binary format (.xlsb) which should definitely help with loading and saving times. Check here for my top tips on how to speed up your workbook: superuser.com/questions/1643437/… Commented Mar 28, 2022 at 20:25
  • Thanks. Fascinating read. Your point 14 is why I'm going down this path at all, I have the same gut instinct (see also python list comprehension vs manual for loops). I like the LET function, I wasn't aware of it but I knew that "one of" my issues was recalculating the same thing over and over on all 26million entries. And also I have conditional formatting, I could turn that off for now though I'll need it in the final. So ultimately binary format + conditional format + let + figuring out this array thing should hopefully get me a long way. Commented Mar 29, 2022 at 1:43
  • I just tried it... that's a pretty slick function. I've no idea how it works, there are follow-on actions I'll need to take after I get this licked to filter additional data so if I figure this out I'll mark as solved. I also found another post (mrexcel.com/board/threads/…) that showed use of a lambda function, but it uses static ranges rather than dynamic spilled arrays so I like this better. Commented Mar 29, 2022 at 1:54
  • 1
    I was able to do some very coarse profiing... ie no stopwatch but just observing Excel's performance. 1) MMULT did not complete on the full data set - ran out of resources. SUMIFS was mandatory, unfortunately. 2) However, I was able to test on a greatly reduced dataset (basically I just deleted everything after line 10000), MMULT worked there. It was much much faster, though took several GB more RAM. 3) MMULT is a lot easer to maintain, not only is it an easier equation but also once built it "just works" for any change in scale of the source dataset. Commented Mar 30, 2022 at 15:02
1

Abstract

There's no way I could improve @Mobus answer, I won't even try.

But, I thought this additional effort would be of interest to anyone else stumbling on this thread. Not sure if it's worth understanding my actual production data, but screw it page space is cheap. Scroll WAY down if you just want to see my final results, including screenshots.

Introduction

I'm trying to develop IPv6 first-hop security features for my organization using Catalyst 3850 and 9300 switches running IOS XE 17.6. IOS XE uses what's known as "Device Tracking," but the key thing is that the switch must have less than 1000 IPs tracked or the switch becomes de-stabilized (it's not documented anywhere except in Cisco TAC case 634280103). So, if I assume IPv6 link-scope (FE80::), IPv6 global-scope (2000::), and IPv4 addresses all consume a device tracking entry for the same MAC address, I therefor have to track no more then 333 MAC addresses with the Device Tracking feature.

In my organization, what does that limit me to?

Hypothesis: Either

  • "client-facing interfaces but no interfaces on the network module," or
  • "client VLANs but not management VLANs."

Method

Graph it. All interfaces and VLANs (all security features are at least interface-based, some have a VLAN-based alternative method) on all switches across a statistically significant period of time. That would mean I needed a 3D graph, device_interface x timestamp, with #MACs on the z-axis. I needed data though, and I needed Excel's graphing capabilities. (Yeah, I know Matlab would be better but I'm already teaching myself enough just to get this far). Since I'm looking to see where various options break, I need to see the high-end extremes, so I can also graph a max threshold plane and see data points exceeding my max threshold.

I was able write some Python scripts to integrate a few disparate management servers to aid this quest. That generated two CSV files: one that recorded number of MAC address per VLAN per switch, and another that recorded MAC addresses per interface per switch - both at 15min intervals, for a month. Any entries that were zero were dumped, but that still resulted in 26million lines in the interfaces CSV file (VLANs was smaller - only 8million).

Back to Python, and I converted to two new CSV files with header row (VLAN was effectively the same): device_interface,device,model,interface,time1,time2,time3,time4,time5.....

That made around 31600 Interface lines and around 13000 VLAN lines, both with around 2600 timestamps. Cool! I can now open these in Excel.

Two interesting nuggets: (1) I recorded the timestamp of the start time of data gathering an entire battery, and also the timestamp of data gathering an individual switch; (2) some data-gathering batteries were incomplete, though ultimately enough data was gathered to statistically make this second issue disappear ("replace with zero and the missed data is immaterial"). That made the switch_interfaceXtime(n) intersection look like: x&excel-formatted-timestamp&MACs (x = "C" for clean and "D" for dirty). Example: C&44612.59375&27

Ok, cool, including all the gaps in the data, that's something like 100million data points that I now have loaded spread across two worksheets. This is where I ran into problems, trying to grapple with that much data.

My original solution (apologies, I no longer have it) I think would have worked, but I stopped that calculation after it had consumed 60GB of virtual memory, and I calculated the time to completion and figured out to see that it would have taken 3 years. I was also getting frustrated that anything I did required a LONG time to apply (minutes, typically). Back to the drawing board, which yielded the question here that @Mobus answered fantastically. Turns out my original solution had lots of issues, see my comments under @Mobus answer but the big part I used the C/D to apply Conditional Formatting and I was using the OFFSET function, both of which are volatile, and I think that made my solution O(n^2) or 100million X 100million calculations, and probably far larger (O(n^3)? O(n^4)?).

Unfortunately, @Mobus's solution doesn't work at the scale I'm working on. It worked fine on the 13,000-line VLAN sheet but Excel didn't even bother to try the 31,000-line Interfaces worksheet (some dialog box about not enough resources to complete the operation).

I tried to re-do some mmult functions and sumproducts and so forth to see if I could get spilled results, and therefor stuff everything into simply the top-left cell of the output range, but I ran into a few issues:

  • my first attempt was essentially a re-hash of what you suggested (I didn't realize that until later), and it immediately refused to calculate due to resource exhaustion.
  • my second attempt was to see if I could do either with dot-products or cross-products. I wanted to see if I could succeed for a single cell and then scale to a spilled array. I think I may have accomplished, but attempting it consumed so much memory it froze the system. At one point I saw it had 60GB commit on my 16GB machine, so it got at least that high.

In researching the sumproduct formula, I learned excel's array multiplication mechanism (this is used in @Mobus mmult solution, but I didn't understand the general underlying principle until I did this work). That finally worked, though my solution only calculated a single cell, I needed to copy/paste to all many-million cells. The copy/paste was easy, at least I only need to do it once (right...???).

Discussion

It took either 5min or 20min to copy/paste each time to the whole worksheet, even when copying only a single row. I didn't try to figure out the "why" behind the difference, though I assume it had something to do with page-swapping.

Calculation (post-copy) took the better part of a half day, from various observations I estimate 9-10 hours per worksheet, with my 4x CPU cores 90% loaded on both hyperthreaded pipelines running at 90% max frequency, nearly all in what Linux would call "user space." Final product was 18.4GB when freshly opened.

I'm doing the graphing aspect as a follow-on activity; I don't think it's necessarily related to the question that spawned this thread.

File open with all 9x worksheets now takes 45min, save and close each now take about 10-15min; that's WAY down from the hours it was taking before. Using my first approach: minimum of 2hr to open, save, or close, with only 4x worksheet; I never even bothered trying to do all 9x worksheets using my first approach.

The formula could possibly be improved for readability, but I didn't want to spend hours and hours to verify. Maybe something I can improve on this with sequence() to create a grid of the correct size, and then... not sure after that.

Anyway, the nice thing about these formulas are that they're 100% non-volatile. I use unique, sort, and filter in a few places but those I think are also non-volatile, so further R&D doesn't suffer.

Results

Memory Usage

Above: Screenshot showing the memory usage of the file when freshly opened.

Interface and VLAN Raw Data

"...(Raw)" tabs: No formulas; this is strictly raw data.

Interfaces and VLANs with MACs Extracted

"...(Streamlined)" tabs: Formulas on the VLAN worksheet are "remarkably similar."

A1: ='Interface Data (Raw)'!A1
A2: =LET(x,'Interface Data (Raw)'!A:A,FILTER(x,(x>0)*(x<>A1)))
B1: ='Interface Data (Raw)'!B1
B2: =LET(x,'Interface Data (Raw)'!B:B,FILTER(x,(x>0)*(x<>B1)))
C1: ='Interface Data (Raw)'!D1
C2: =LET(x,'Interface Data (Raw)'!D:D,FILTER(x,(x>0)*(x<>C1)))
D1: =LET(x,'Interface Data (Raw)'!1:1,FILTER(x,LEFT(x,1)="4"))
D2: =LET(x,'Interface Data (Raw)'!E2:E31634,IF(x="",0,RIGHT(x,LEN(x)-FIND("&",x,5))))

D2 results in a spilled array, 1 columns with an equal number of rows as column A, B, and C. D2 was copy/pasted all the way though to cell CTO2

Interfaces and VLANs Summarized Per-Device

A1: same as above
A2: same as above
A2: =UNIQUE('Interface Data (Streamlined)'!B2#)
B2: =SUM((--('Interface Data (Streamlined)'!$B$2#=$A2))*'Interface Data (Streamlined)'!D$2#)

B2 resulted in a single cell, that had to be copy/pasted all the way to cell CTM1783

Interfaces Summarized with No Uplink, and No Module

You get the idea with the boring cells.

B2 (upper): =SUM((--('Interface Data (Streamlined)'!$B$2#=$A2))*'Interface Data (Streamlined)'!D$2#*(NOT(RIGHT('Interface Data (Streamlined)'!C$2#,4)="/1/1")))
B2 (lower): =SUM((--('Interface Data (Streamlined)'!$B$2#=$A2))*'Interface Data (Streamlined)'!D$2#*(NOT(LEFT(RIGHT('Interface Data (Streamlined)'!$C$2#,4),3)<>"/1/")))

B2 also resulted in a single cell, that had to be copy/pasted all the way to cell CTM1783

VLANs Summarized with No Management

B3, B4, and C3 are nothing interesting that wasn't shown above.

A4: =--ISERROR(MATCH('VLAN Data (Streamlined)'!C2#,E2#,))
G1: =SEQUENCE(,1000,2000)
E2: =E1:ALR1
C4: =SUM(--('VLAN Data (Streamlined)'!$B$2#=$B4)*$A$4#*'VLAN Data (Streamlined)'!D$2#)

E2 was necessary because I wasn't sure how else to get the array {1,8,2000,2001...2999}, and I forgot I could have simply started typing "2000" and "2001" then dragged the rest of the way. Oh well.

C4 also resulted in a single cell, that had to be copy/pasted all the way to cell CTN1785. Interestingly, pre-calculating into A4 didn't appreciably speed the overall calculation.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .