3
\$\begingroup\$

INTRO

I am using an Excel worksheet as my "database." No headers so [F1] is assigned to column one by default. I filled the entire column, all 1,048,576 cells with RANDBETWEEN(1,20). I then hard set these values by copy/pasting as value.

It is very slow. Much slower than using Excel functions would be to generate same data. I realize that I could use 1,048,576 as a constant denominator but I wanted to practice with SQL query language, and keep model more dynamic.

EDIT: it should not say % in results not many times. I am getting the % of times each var occurred in my data set. I am basically seeing what the distribution is of randbetween(1-20) outputs are over X calls. In this instance x is 1,048,576.

My SQL Query

"SELECT Round(SUM(IIF([F1]=" & searchKey & ",1,0))*100.0/SUM(IIF([F1]<> Null,1,0)),10) From [Sheet1$];"

Is returning the total number of records in column F1 that are equal to searchKey and dividing that number by the total number of records in column F1.

CODE

Option Explicit
Private Declare PtrSafe Function timeGetTime Lib "winmm.dll" () As Long

Sub SqlQueryOnWorkSheet()
Dim started                                     As Long
Dim cn                                          As ADODB.Connection
Dim filePath                                    As String
Dim counter                                     As Long
Dim outCome                                     As Double
Dim ended                                       As Long

    started = timeGetTime

    filePath = "Z:\Test\Test1.xlsx"
      
    Set cn = EstablishConnection(filePath)
    
    If cn.State <> 1 Then GoTo CleanFail:

    For counter = 1 To 20
        outCome = FindCount(cn, counter)
        PrintOutcome counter, outCome
    Next counter

    cn.Close
    Set cn = Nothing
    
    ended = timeGetTime
    Debug.Print "QUERIES RAN IN " & (ended - started) / 1000 & " SECONDS"
    Exit Sub
CleanFail:
    Debug.Print "CONNECTION COULD NOT BE MADE"
End Sub

Function EstablishConnection(ByVal filePath As String) As ADODB.Connection
    Set EstablishConnection = New ADODB.Connection
    EstablishConnection.Open _
        "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source='" & filePath & "';" & _
        "Extended Properties=""Excel 12.0 Macro;HDR=No;IMEX=1;"";"
End Function

Function FindCount(ByRef cn As ADODB.Connection, ByVal searchKey As Long) As Double
Dim strSql                                      As String
Dim rs                                          As ADODB.Recordset
On Error GoTo CleanFail:
    Set rs = New ADODB.Recordset
    strSql = "SELECT Round(SUM(IIF([F1]=" & searchKey & ",1,0))*100.0/SUM(IIF([F1]<> Null,1,0)),10) From [Sheet1$];"
    rs.Open strSql, cn
    FindCount = rs.GetString
    rs.Close
    Set rs = Nothing
    Exit Function
CleanFail:
    Debug.Print "QUERY FAILED"
End Function

Sub PrintOutcome(ByVal counter As Long, ByVal outCome As Double)
    Debug.Print "Variable " & counter & " Occured " & outCome & " Many Times"
End Sub

Results

EDIT: THIS SHOULD BE % NOT MANY -- FIXED

Variable 1 Occured 4.9837112427 % Of Time

Variable 2 Occured 5.0171852112 % Of Time

Variable 3 Occured 4.9752235413 % Of Time

Variable 4 Occured 4.9716949463 % Of Time

Variable 5 Occured 5.0051689148 % Of Time

Variable 6 Occured 4.9989700317 % Of Time

Variable 7 Occured 4.9901008606 % Of Time

Variable 8 Occured 5.0283432007 % Of Time

Variable 9 Occured 5.0018310547 % Of Time

Variable 10 Occured 5.0164222717 % Of Time

Variable 11 Occured 4.9933433533 % Of Time

Variable 12 Occured 5.0059318542 % Of Time

Variable 13 Occured 5.0333976746 % Of Time

Variable 14 Occured 4.9952507019 % Of Time

Variable 15 Occured 5.0163269043 % Of Time

Variable 16 Occured 4.9654006958 % Of Time

Variable 17 Occured 4.9822807312 % Of Time

Variable 18 Occured 5.0310134888 % Of Time

Variable 19 Occured 5.0113677979 % Of Time

Variable 20 Occured 4.9770355225 % Of Time

QUERIES RAN IN 38.754 SECONDS
\$\endgroup\$
7
  • 1
    \$\begingroup\$ Guessing that since RANDBETWEEN is volatile, it's updating every single one of the 1,048,576 results every time the cursor moves for a read... \$\endgroup\$
    – Comintern
    Commented Aug 23, 2018 at 22:13
  • \$\begingroup\$ I don't see RANDBETWEEN in the code anywhere, so are you really testing it or are you performing a similar query to compare results? Please adjust your title to reflect what your code does. \$\endgroup\$ Commented Aug 23, 2018 at 23:00
  • \$\begingroup\$ @Raystafarian OK tried to give it a better title. Also adjusted name of primary subroutine....bad habits die hard!!! :-p \$\endgroup\$ Commented Aug 24, 2018 at 1:22
  • \$\begingroup\$ @Comintern hopefully more informative title more accurately represents what I am doing \$\endgroup\$ Commented Aug 24, 2018 at 1:23
  • \$\begingroup\$ My point is that if you have any cells with RANDBETWEEN in them, you should expect the performance to be poor. RANDBETWEEN is volatile function. That means every time you calculate the value of a cell with RANDBETWEEN in it, all of them recalculate. That means as the ADO cursor moves from cell to cell, you end up recalculating 1,048,576 cells. If you aggregate them in the query, you get a trillion cell calculations. That's why it's slow. \$\endgroup\$
    – Comintern
    Commented Aug 24, 2018 at 1:44

1 Answer 1

3
\$\begingroup\$

The reason for the poor performance is that you are taking a non database approach to compiling the data. A query's speed is dependent on its ability to group, filter and index records.

SELECT 
Round(SUM(IIF([F1]=" & searchKey & ",1,0))
*100.0
/SUM(IIF([F1]<>Null,1,0)),10)
From [Sheet1$];

The IIF() function is much slower that using a WHERE clause to filter the data. Use WHERE ([F1] Is Not Null) AND ([F1]=" & searchKey & ") so you are only retrieving the relevant records.

With the records properly filtered, you can use Count(*) instead of Sum() and IIF([F1]<>Null,1,0).

It would be better to group the counts and returning all the records at once.

Although it doesn't matter for 20 records, writing to the Immediate Window line by line is slow.

Before writing the code below, I copied the data into an Access table and ran the Query Wizard - Count Duplicates.

PrintOccurences

Sub PrintOccurences(ByRef cn As ADODB.Connection)
        Const strSql As String = "SELECT First([Sheet1$].[F1]) AS [F1 Field], Round((Count([Sheet1$].[F1])/ (SELECT Count(*) FROM [Sheet1$] WHERE [Sheet1$].[F1])) * 100,10) AS NumberOfDups" & vbNewLine & _
      "FROM [Sheet1$]" & vbNewLine & _
      "GROUP BY [Sheet1$].[F1]" & vbNewLine & _
      "HAVING (((First([Sheet1$].[F1])) Is Not Null));"

    Dim rs As ADODB.Recordset
    Dim sb As Object
    On Error GoTo CleanFail:
    Set rs = New ADODB.Recordset
    Set sb = CreateObject("System.Text.StringBuilder")
    rs.Open strSql, cn

    Do While Not rs.EOF
        sb.AppendFormat_4 "Variable {0} Occured {1} Many Times" & vbNewLine, Array(rs("F1 Field").Value, rs("NumberOfDups").Value)
        rs.MoveNext
    Loop

    Debug.Print sb.ToString
    rs.Close
    Set rs = Nothing
    Exit Sub
CleanFail:
    Debug.Print "QUERY FAILED"
End Sub

Results

Variable 1 Occurred 5.2052052052 Many Times
Variable 2 Occurred 3.9039039039 Many Times
Variable 3 Occurred 4.8048048048 Many Times
Variable 4 Occurred 5.2052052052 Many Times
Variable 5 Occurred 4.4044044044 Many Times
Variable 6 Occurred 5.3053053053 Many Times
Variable 7 Occurred 5.005005005 Many Times
Variable 8 Occurred 5.2052052052 Many Times
Variable 9 Occurred 5.5055055055 Many Times
Variable 10 Occurred 4.8048048048 Many Times
Variable 11 Occurred 5.4054054054 Many Times
Variable 12 Occurred 4.9049049049 Many Times
Variable 13 Occurred 4.6046046046 Many Times
Variable 14 Occurred 4.1041041041 Many Times
Variable 15 Occurred 6.6066066066 Many Times
Variable 16 Occurred 5.005005005 Many Times
Variable 17 Occurred 5.5055055055 Many Times
Variable 18 Occurred 4.4044044044 Many Times
Variable 19 Occurred 5.005005005 Many Times
Variable 20 Occurred 5.1051051051 Many Times
\$\endgroup\$
9
  • \$\begingroup\$ I am finding the % not how many times lol. Idk why I left it like that, just tired. I am finding what % of time a each number showed up. I explain in paragraph preceding code but I guess the many times threw you off (and rightfully so) \$\endgroup\$ Commented Aug 24, 2018 at 13:48
  • \$\begingroup\$ Your initial comments are much appreciated. It seems that I am using this as an expensive array rather than an ADO. \$\endgroup\$ Commented Aug 24, 2018 at 13:50
  • \$\begingroup\$ My SQL Query is returning the total number of records in column F1 that are equal to searchKey and dividing that number by the total number of records in column F1. I tried to use a Where clause but had hard time with it \$\endgroup\$ Commented Aug 24, 2018 at 13:57
  • \$\begingroup\$ @learnAsWeGo I modified my code to return the percentage. I will clean up my post later. \$\endgroup\$
    – TinMan
    Commented Aug 24, 2018 at 14:50
  • \$\begingroup\$ Basically if i want this to be fast I am going to have to do the indexing / sorting / filtering myself then run query? \$\endgroup\$ Commented Sep 25, 2018 at 3:15

Not the answer you're looking for? Browse other questions tagged or ask your own question.