30

The ANSI SQL standard defines (chapter 6.5, set function specification) the following behaviour for aggregate functions on empty result sets:

COUNT(...) = 0
AVG(...) = NULL
MIN(...) = NULL
MAX(...) = NULL
SUM(...) = NULL

Returning NULL for AVG, MIN and MAX makes perfect sense, since the average, minimum and maximum of an empty set is undefined.

The last one, however, bothers me: Mathematically, the SUM of an empty set is well-defined: 0. Using 0, the neutral element of addition, as the base case makes everything consistent:

SUM({})        = 0    = 0
SUM({5})       = 5    = 0 + 5
SUM({5, 3})    = 8    = 0 + 5 + 3
SUM({5, NULL}) = NULL = 0 + 5 + NULL

Defining SUM({}) as null basically makes "no rows" a special case that does not fit in with the others:

SUM({})     = NULL  = NULL
SUM({5})    = 5    != NULL + 5 (= NULL)
SUM({5, 3}) = 8    != NULL + 5 + 3 (= NULL)

Is there some obvious advantage of the choice that was made (SUM being NULL) that I have missed?

2

3 Answers 3

20

I'm afraid that the reason is simply that the rules were set in an adhoc fashion (like quite many other "features" of the ISO SQL standard) at a time when SQL aggregations and their connection with mathematics were less understood than they are now (*).

It's just one of the extremely many inconsistencies in the SQL language. They make the language harder to teach, harder to learn, harder to understand, harder to use, harder to whatever you want, but that's just the way things are. The rules cannot be changed "cold" and "just like that", for obvious reasons of backward compatibility (If the ISO committee publishes a final version of the standard, and vendors then set out to implement that standard, then those vendors will not appreciate it very much if in a subsequent version, the rules are changed such that existing (compliant) implementations of the former version of the standard "automatically fail to comply" the new version ...)

(*) It is now better understood that aggregations over an empty set behave more consistently if they systematically return the identity value (= what you call the 'neutral element') of the underlying binary operator at hand. That underlying binary operator for COUNT and SUM is addition, and its identity value is zero. For MIN and MAX, that identity value is the highest and lowest value of the type at hand, respectively, if the concerned types are finite. Cases like averaging, harmonic means, medians, etc. are extremely intricate and exotic in this respect, though.

11
  • I think null makes sense over an empty set with min and max. You might say an identity value there really is unknown, but the sum of no values is 0 for the same reason that n * 0 is always 0. But min and max are different. I don't think the result is properly defined running across no records. Commented Oct 5, 2012 at 12:14
  • Also avg() over a null set makes sense as a null because 0/0 is not properly defined in this context. Commented Oct 5, 2012 at 12:16
  • 5
    MIN and MAX are not so different. Take an underlying binary operator LOWESTOF(x,y) and HIGHESTOF(x,y) respectively. These binary operators do have an identity value. Because in both cases (if the involved type is finite), there exists indeed some value z such that forall x : LOWESTOF(z,x)=x and forall y : HIGHESTOF (y,z)=y. (The identity value is not the same for both cases, but it does exist for both cases.) I agree that the results look extremely counterintuitive at first glance, but there is no denying the mathematical reality. Commented Oct 5, 2012 at 12:53
  • @Erwin: I agree on all your points, except that the identity of some operations, like HIGHEST() many not be an element of the datatype, like for Reals where the identity would be the -Infinity (and +Infinity for LOWEST()) Commented Oct 5, 2012 at 13:17
  • 1
    @SQL kiwi. Are you forgetting about static type checking ? If expressions like SUM() are handled by the static type checker as if they always return an integer, then obviously it should be impossible for the SUM() invocation to sometimes return something that is not an integer (e.g. an empty relation). Commented Oct 5, 2012 at 22:14
3

In a pragmatic sense the existing result of NULL is useful. Consider the following table and statements:

C1 C2
-- --
 1  3 
 2 -1 
 3 -2 

SELECT SUM(C2) FROM T1 WHERE C1 > 9;

SELECT SUM(C2) FROM T1 WHERE C1 < 9;

The first statement returns NULL and the second returns zero. If an empty set returned zero for SUM we would need another means to distinguish a true sum of zero from an empty set, perhaps using count. If we indeed want zero for the empty set then a simple COALESCE will furnish that requirement.

SELECT COALESCE(SUM(C2),0) FROM T1 WHERE C1 > 9;
6
  • 1
    as a result., SUM(union of set1 and set2) <> SUM(set1) + SUM(set2), because any number + NULL = NULL. Does it make sense to you?
    – A-K
    Commented Oct 5, 2012 at 14:15
  • 2
    @Leigh: Using COALESCE() like this will not distinguish the (0) sum of an empty set from the (NULL) sum (say the table had a (10, NULL) row. Commented Oct 5, 2012 at 14:26
  • Besides, we still cannot distinguish SUM(empty set) from SUM(set of one or more NULLs). Do we need to distinguish at all?
    – A-K
    Commented Oct 5, 2012 at 15:14
  • @AlexKuznetsov - We can distinguish a sum of an empty set from a sum of a set that contains one or more nulls as long as at least one row contains a value. You are correct that if the set contains only NULLs then we can't distinguish the NULL set from this set of all NULL values. My point wasn't that it is useful in every case, merely that it can be useful. If I SUM a column and get back zero I know without having to check that there is at least one not NULL row being used to show me the result. Commented Oct 5, 2012 at 15:30
  • @ypercude - You are absolutely correct. My point was that the current behavior of SUM does distinguish an empty set from a set that contains values (even if some are null). It is simpler to use COALESCE when the distinction is not required than to use something like DECODE(count(c2),0,NULL,sum(c2)) when it is. Commented Oct 5, 2012 at 17:07
-1

The main difference I can see is with regard to the datatype. COUNT has a well defined returntype: A whole number. All the others depend on the type of the column/expression they are looking at. Their return type must be compatible with all members of the set (think float, currency, decimal, bcd, timespan, ...). Since there is no set you cannot imply a return type, thus NULL is your best option.

Note: In most cases you could imply a return type from the column type you are looking at, but you can do SUMs not only on columns but on all kinds of things. Implying a return type might get very difficult if not impossible under certain circumstances, especially when you think about possible expansions of the standard (dynamic types come to mind).

6
  • 5
    Why can we not imply a return type in a SUM(column) expression? Don't we have empty tables - and there all the columns have defined types? Why should it be any different for an empty result set? Commented Oct 5, 2012 at 12:04
  • 5
    You get it wrong where you say "since there is NO SET". There is a set. The set of all possible values of the declared type of the involved columns or expression. That declared type exists even if the table you're looking at is empty. Even empty tables still have a heading. And that declared type is exactly your "implied return type". Commented Oct 5, 2012 at 13:09
  • Did the both of you actually read my note? Yes, it would work for column-based SUMs as of now. But as soon as you encounter a variable datatype-column (not in SQL Server - yet), you are out of luck.
    – TToni
    Commented Oct 5, 2012 at 14:12
  • 2
    How will you define the sum in that case? What will the result of 24 + 56.07 + '2012-10-05' + 'Red' be? I mean there is no pint in worrying how SUM() will behave when we have a problem defining addition. Commented Oct 5, 2012 at 14:28
  • 1
    @TToni : "especially when you think about possible expansions of the standard" is not the context that the OP was referring to. the OP was very clearly referring to the current version of the standard, which does not include any sort of notion of "dynamic types" or some such. (Oh, and I only commented, but didn't downvote. Apart from that tiny slip I took issue with, nothing in your answer was wrong enough to warrant a downvote. IMO.) Commented Oct 5, 2012 at 22:08

Not the answer you're looking for? Browse other questions tagged or ask your own question.