
Oftentimes I find myself wanting the total number of rows returned by a query even though I may only display 50 or so of them per page. Rather than issuing a separate query for the count, I have been doing this:

SELECT first_name, 
       last_name, 
       (SELECT count(1) FROM  sandbox.PEOPLE WHERE trunc(birthday) = trunc(sysdate) ) as totalRows
FROM  sandbox.PEOPLE 
WHERE trunc(birthday) = trunc(sysdate);

It has been recommended to me to do this:

SELECT first_name, 
       last_name,
       count(*) over () totalRows
FROM  sandbox.PEOPLE
WHERE trunc(birthday) = trunc(sysdate);

I am mainly asking which is better in terms of performance and, if performance is a wash, whether this really improves the readability of the SQL. It is certainly cleaner and easier to write.
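
For context, the pagination ends up wrapping this in something like the following (a sketch; the ORDER BY column and the 50-row page size are just illustrative):

SELECT first_name, last_name, totalRows
  FROM (SELECT first_name,
               last_name,
               count(*) over () totalRows,
               row_number() over (ORDER BY last_name) rn
          FROM sandbox.PEOPLE
         WHERE trunc(birthday) = trunc(sysdate))
 WHERE rn BETWEEN 1 AND 50;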

  • Do you have a specific question? – Jeremy, Jan 27, 2011 at 20:56
  • Really don't think this belongs here. Probably better on SO. – Jan 27, 2011 at 20:58
  • @Mark: Maybe Programmers.SE? There's nothing wrong with the code. I assume @Scott wants to know which is better? – Jeremy, Jan 27, 2011 at 21:12
  • @Jeremy: programmers makes sense to me. – Jan 27, 2011 at 21:25
  • The second query looks better; performance-wise I expect it's a wash. How are you doing your pagination? – Ron, Jan 28, 2011 at 23:09

1 Answer


The latter query will be much more efficient because it only requires hitting the table once. You can run a quick test yourself to confirm this.

I'll create a simple two-column table with 1 million rows, where the second column takes one of 10 distinct values:

SQL> create table t (
  2    col1 number,
  3    col2 number
  4  );

Table created.

SQL> insert into t
  2    select level, mod(level,10)
  3      from dual
  4   connect by level <= 1000000;

1000000 rows created.
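
(Note that I haven't gathered optimizer statistics on T, which is why the plans below carry a dynamic sampling note. If you want to repeat the test with statistics in place, something like the following would do it; the plans are simple full scans either way, so the consistent-gets comparison shouldn't change materially.)

exec dbms_stats.gather_table_stats(user, 'T')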

Now I'll run two different queries that each retrieve 10% of the data. I've set SQL*Plus not to display the data itself, only the query plan and the basic execution statistics. With the first query, note that the plan shows Oracle has to access the table twice and then do a sort and aggregate. The query does ~10,000 consistent gets, which is a measure of the amount of logical I/O being done. (Consistent gets are independent of whether the data is cached, so they are a much more stable measure: run the same query many times and the figure will fluctuate very little.)

SQL> set autotrace traceonly;
SQL> select col1
  2        ,(select count(*) from t where col2=3)
  3    from t
  4   where col2 = 3;

100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3335345748

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 85706 |  2176K|   525   (3)| 00:00:07 |
|   1 |  SORT AGGREGATE    |      |     1 |    13 |            |          |
|*  2 |   TABLE ACCESS FULL| T    | 85706 |  1088K|   525   (3)| 00:00:07 |
|*  3 |  TABLE ACCESS FULL | T    | 85706 |  2176K|   525   (3)| 00:00:07 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("COL2"=3)
   3 - filter("COL2"=3)

Note
-----
   - dynamic sampling used for this statement (level=2)


Statistics
----------------------------------------------------------
         32  recursive calls
          1  db block gets
      10465  consistent gets
          0  physical reads
        176  redo size
    2219528  bytes sent via SQL*Net to client
      73850  bytes received via SQL*Net from client
       6668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     100000  rows processed

On the other hand, with the analytic function approach, the query plan shows that we only have to hit the table once, and we only do ~1,900 consistent gets, less than 20% of the logical I/O the first query had to do.

SQL> select col1,
  2         count(*) over ()
  3    from t
  4   where col2 = 3;

100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2291049666

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 85706 |  2176K|   525   (3)| 00:00:07 |
|   1 |  WINDOW BUFFER     |      | 85706 |  2176K|   525   (3)| 00:00:07 |
|*  2 |   TABLE ACCESS FULL| T    | 85706 |  2176K|   525   (3)| 00:00:07 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("COL2"=3)

Note
-----
   - dynamic sampling used for this statement (level=2)


Statistics
----------------------------------------------------------
          4  recursive calls
          0  db block gets
       1892  consistent gets
          0  physical reads
          0  redo size
    2219510  bytes sent via SQL*Net to client
      73850  bytes received via SQL*Net from client
       6668  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     100000  rows processed

Now, to be fair, you probably won't cut 80% of your consistent gets by moving to the analytic function approach with your particular query, because it is likely that far fewer than 10% of the rows in your PEOPLE table have today's birthday. The fewer rows you return, the smaller the performance difference will be.
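
If you want to see that effect, you can repeat the experiment with a more selective predicate, for example restricting the result to roughly 1,000 rows instead of 100,000 (a sketch; the exact figures will depend on your system):

select col1,
       (select count(*) from t where col2 = 3 and col1 <= 10000)
  from t
 where col2 = 3
   and col1 <= 10000;

select col1,
       count(*) over ()
  from t
 where col2 = 3
   and col1 <= 10000;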

Since this is Code Review: beyond performance, the analytic function approach is much easier to maintain over time because it doesn't violate the DRY principle. With the scalar subquery, you have to remember to make every change to the main query's WHERE clause in the inline subquery as well.
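
To illustrate with a hypothetical change: suppose the requirement later becomes "exclude inactive people" (active_flag is an invented column here, not part of the original schema). With the scalar subquery the new predicate must be added in two places and kept in sync; with the analytic function there is only one place to touch:

-- scalar subquery: the new condition must be duplicated
SELECT first_name,
       last_name,
       (SELECT count(1)
          FROM sandbox.PEOPLE
         WHERE trunc(birthday) = trunc(sysdate)
           AND active_flag = 'Y') as totalRows
  FROM sandbox.PEOPLE
 WHERE trunc(birthday) = trunc(sysdate)
   AND active_flag = 'Y';

-- analytic function: one WHERE clause to change
SELECT first_name,
       last_name,
       count(*) over () totalRows
  FROM sandbox.PEOPLE
 WHERE trunc(birthday) = trunc(sysdate)
   AND active_flag = 'Y';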

  • If the total data set is very large and an estimate of the total number of rows is acceptable, then a subquery with a sample clause might be handy: "(select count(*) * 100 from t sample(1) where col2=3)" – May 1, 2013 at 21:53
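
Spelled out against the question's table, that suggestion would look something like this (a sketch; SAMPLE(1) reads an approximately 1% sample of the rows, so the count is scaled back up by 100 and is only an estimate):

SELECT first_name,
       last_name,
       (SELECT count(*) * 100
          FROM sandbox.PEOPLE sample (1)
         WHERE trunc(birthday) = trunc(sysdate)) as estimatedTotal
  FROM sandbox.PEOPLE
 WHERE trunc(birthday) = trunc(sysdate);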
