
The Question

What is the percentage of customers who increase their closing balance by more than 5%?


Source Code

The database and all details can be found here.


ERD

[ERD image omitted]


My Proposed Solution

Select TOP 1
        Format(Cast(Count(Distinct Y.customer_id) AS Decimal) / 
               (Select Count(Distinct customer_id) From customer_transactions), 'P1') As PercentageOfCutomer

From(
    Select 
            X.customer_id
            ,X.RowNum
            ,CAST(BalanceAccumSum - LAG(BalanceAccumSum, 1, 0) 
             OVER(Partition By X.customer_id Order By X.RowNum ASC) AS DECIMAL) / NULLIF(LAG(BalanceAccumSum, 1, 0) 
             OVER(Partition By X.customer_id Order By X.RowNum ASC),0)  As ClosingBalance
        
    From 
     (
        Select 
                A.customer_id
                ,A.RowNum
                ,Sum(A.signedTxnAmount) Over(Partition By A.customer_id Order By A.RowNum)  As BalanceAccumSum
        From (
                Select  
                        customer_id
                        ,CASE
                            When txn_type = 'deposit' Then txn_amount
                            When txn_type in ('purchase', 'withdrawal') Then txn_amount * -1
                            Else txn_amount
                         END As signedTxnAmount
                        ,ROW_NUMBER() OVER(Partition By customer_id Order By txn_date) AS RowNum
                From customer_transactions) A
            ) X
        ) Y
Where Y.ClosingBalance > 0.05
Group By Y.customer_id, Y.RowNum

Output

[Query output screenshot omitted]


My Query

Assuming my understanding of the question is correct, that is: get the percentage of customers whose closing balance is more than 5% higher than the previous balance (the balance before the latest transaction).

Is my solution overcomplicated with these subqueries, or is it OK? Am I missing anything to make it better?

  • What is a "node" in this context?
    – J_H
    Commented Jul 12, 2023 at 5:53

1 Answer


The Good

Using subqueries instead of CTEs (which would impose an optimization boundary; note, though, that since PostgreSQL 12 a simple, side-effect-free CTE referenced only once can be inlined).

The Bad

You form a window in three different subqueries; it should only be needed in one.

The branch When txn_type = 'deposit' Then txn_amount is redundant: the Else branch already returns txn_amount unchanged.

This query does not call for formatting. In real life, essentially all queries are consumed by applications and not by people, so the output should be a single floating-point fraction from 0 to 1.

You should not need top 1 with a properly-written count. You should not count distinct.

Never label subqueries A, X etc.; always give them meaningful names.

txn_amount * -1 is just -txn_amount.
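
Taken together with the redundant deposit branch above, the whole CASE collapses to the form the suggested queries below use:

case
    when txn_type in ('purchase', 'withdrawal') then -txn_amount
    else txn_amount
end as signed_amount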

More broadly, your approach needs to be re-thought. Instead of going through all of the trouble to calculate the actual close-balance change (which is not needed), all you need is a boolean indicating whether that change was high enough. This interpretation has significant benefits, including that you don't need to care about divide-by-zero for the first-row edge case within a window since you don't need to divide at all, only compare.
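
Concretely, let curr be the signed amount of a customer's last transaction and cumsum the running balance after it, so the balance before that transaction is cumsum - curr. Assuming a positive prior balance (the abs() in the queries below guards the sign), the ratio test rearranges into a divide-free comparison:

curr / (cumsum - curr) > 0.05
curr > 0.05 * (cumsum - curr)
1.05 * curr > 0.05 * cumsum
curr > cumsum * 0.05 / 1.05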

Your results are incorrect. At least one bug: if there are only deposits and no withdrawals or purchases, the fraction should be 1.0 but your query does not produce any rows.

After a bunch of work, I believe all three methods below to be equivalent and correct. But you'll find that they all return slightly different results! This is explained by the fact that the test data are ambiguous: there is no clear "last transaction" since many transactions overlap and only have dates set, not times. This is the fault of the course, which should have used unique timestamps instead of overlapping dates.

Critically, this calls for unit tests (which are certainly possible, especially in PostgreSQL). If you do not unit test with small, well-defined, repeatable cases, there is no way to know whether your output percentage is correct. As a matter of practice and diligence, any non-trivial queries like this should get unit tests. For instance, with these data:

INSERT INTO customer_transactions
  (customer_id, txn_date, txn_type, txn_amount)
VALUES
('429', '2020-01-01', 'deposit', '500'),
('429', '2020-01-02', 'deposit', '500'),
('429', '2020-01-03', 'deposit', '49'),
('622', '2020-01-01', 'deposit', '1'),
('622', '2020-01-02', 'deposit', '1'),
('622', '2020-01-03', 'deposit', '-1'),
('266', '2020-01-01', 'deposit', '1'),
('266', '2020-01-02', 'deposit', '1'),
('266', '2020-01-03', 'deposit', '-1');

your query should produce 0, but replacing the 49 with a 51 should produce 0.333. Without running more such tests it will be difficult to figure out why your query produces 0.2% and mine 41.6%.
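
For a repeatable harness, one sketch in plain PL/pgSQL (pgTAP is another option) inlines the windowless query from the end of this answer and raises on a mismatch, assuming the fixture rows above are the only rows in the table:

do $$
declare
    frac real;
begin
    select
        count(*) filter (
            where last_transactions.txn_amount > 0.05/1.05*abs(accounts.balance)
            and last_transactions.txn_type = 'deposit'
        ) / cast(count(*) as real)
    into frac
    from (
        select customer_id,
            sum(
                case
                    when txn_type in ('purchase', 'withdrawal') then -txn_amount
                    else txn_amount
                end
            ) as balance
        from customer_transactions
        group by customer_id
    ) accounts
    left join (
        select distinct on (customer_id) customer_id, txn_type, txn_amount
        from customer_transactions
        order by customer_id, txn_date desc
    ) last_transactions
        on last_transactions.customer_id = accounts.customer_id;

    -- expected fraction for the fixture as written; 1/3 with 51 instead of 49
    if frac <> 0.0 then
        raise exception 'expected 0, got %', frac;
    end if;
end
$$;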

Format() is not a SQL-standard function. This makes me guess that you're using T-SQL, which is not as good an idea as using the course's PostgreSQL.

The course should not have taught you to use a string for a transaction type. You should be using an enum instead.
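
In PostgreSQL that could look like the following (the type name here is illustrative):

create type transaction_type as enum ('deposit', 'purchase', 'withdrawal');

A CHECK constraint, as in the definition below, is a portable middle ground.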

The course has a poor table definition. It should be closer to something like:

CREATE TABLE customer_transactions (
  customer_id INTEGER not null,
  txn_date DATE not null,
  txn_type VARCHAR(10) not null check(txn_type in ('deposit', 'purchase', 'withdrawal')),
  txn_amount INTEGER check (txn_amount >= 0)
);

It's also missing a foreign key.
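
Hypothetically, assuming a parent table of customers keyed on customer_id (the names are illustrative; the actual ERD may differ):

alter table customer_transactions
    add foreign key (customer_id) references customers (customer_id);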

The Ugly

Standard practice for SQL keywords is to CAPITALIZE them. You use a random mix of CAPITALS and TitleCase. I personally don't like either and prefer lowercase, but you should be consistent. Also keep in mind that unquoted identifiers like BalanceAccumSum are case-folded in many systems (PostgreSQL folds them to lower case), so balance_accum_sum stands a better chance of staying legible.

Moving commas to be a prefix on the following line is deeply bizarre, and I find it illegible. Any small benefit of some commas lining up in the same column is thoroughly outweighed by the loss of legibility.

Your indentation ranges from 4 (pretty reasonable) to 8 (not reasonable).

Suggested Query

Again, please test this with small, well-understood cases, comparing your own results to mine (also see fiddle).

set search_path = data_bank;

select (
    count(*) filter (where big_gain)
) / cast(count(*) as real) as customers_with_gain_frac
from (
    select distinct on (customer_id) customer_id, big_gain
    from (
        select customer_id,
            -- curr/(cumsum - curr) > 0.05
            -- curr > cumsum*0.05/1.05
            signed_amount > 0.05/1.05*abs(
                sum(signed_amount) over cust_window
            ) as big_gain,
            row_number() over cust_window as idx
        from (
            select
                customer_id, txn_date,
                case
                    when txn_type in ('purchase', 'withdrawal') then -txn_amount
                    else txn_amount
                end as signed_amount
            from customer_transactions
        ) transactions_with_sign
        window cust_window as (partition by customer_id order by txn_date)
    ) transactions_with_change
    order by customer_id, idx desc
) customers_with_gains;
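
Two pieces of this are worth flagging for portability: the filter clause is standard SQL that T-SQL lacks, and distinct on is a PostgreSQL extension. As a sketch, T-SQL would emulate the former with a CASE inside the aggregate, and the latter with row_number() plus a filter on 1:

-- T-SQL-style emulation of count(*) filter (where big_gain):
count(case when big_gain = 1 then 1 end)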

For a quite different take, this is also possible with no windowing:

with transactions_with_sign as (
    select customer_id, txn_date,
        case
            when txn_type in ('purchase', 'withdrawal') then -txn_amount
            else txn_amount
        end as signed_amount
    from customer_transactions
)
select
    count(*) filter (
        where last_transactions.signed_amount > 0.05/1.05*abs(accounts.balance)
    ) / cast(count(*) as real) as customers_with_gain_frac
from
    (
        select distinct on (customer_id) customer_id, signed_amount 
        from transactions_with_sign
        order by customer_id, txn_date desc
    ) last_transactions
    join (
        select customer_id, sum(signed_amount) as balance
        from transactions_with_sign
        group by customer_id
    ) accounts on accounts.customer_id = last_transactions.customer_id;

Or, even, with no CTE:


select
    count(*) filter (
        where last_transactions.txn_amount > 0.05/1.05*abs(accounts.balance)
        and last_transactions.txn_type = 'deposit'
    ) / cast(count(*) as real) as customers_with_gain_frac
from
    (
        select customer_id,
            sum(
                case
                    when txn_type in ('purchase', 'withdrawal') then -txn_amount
                    else txn_amount
                end
            ) as balance
        from customer_transactions
        group by customer_id
    ) accounts
    left join (
        select distinct on (customer_id) customer_id, txn_type, txn_amount 
        from customer_transactions
        order by customer_id, txn_date desc
    ) last_transactions
    on last_transactions.customer_id = accounts.customer_id;
  • Thanks for your answer and guidelines. I still want a T-SQL answer, and it's my bad I tagged my question only as SQL. In your answer, there are a couple of keywords that I'm not familiar with; maybe they are PostgreSQL, or I have not encountered them before. Regarding the complexity (maybe I mean by this - the size and the usage of 2 window functions) of your answer, it is not so far away from mine, but I appreciate the optimisation. I don't know if I can repeat it in T-SQL, but I will try. Thanks!
    – Mike
    Commented Jul 12, 2023 at 11:13
  • PostgreSQL is standards-compliant (and T-SQL kind of isn't). If there are keywords I've used here that are non-portable, please let me know and I'll try to suggest alternatives. But again, as a learning platform, T-SQL teaches you bad habits. If you have no choice (due to a workplace or something), then fine.
    – Reinderien
    Commented Jul 12, 2023 at 11:21
  • Gotchya. I learned MySQL in college (although I don't remember it). Recently I wanted to move from the basic/intermediate level to an advanced level in SQL, so I had the "101 SQL questions" course in Udemy, where the tutor used just T-SQL. It's good to know about the PostgreSQL. Funny enough, the potential position I applied for is Oracle SQL. Thanks again!
    – Mike
    Commented Jul 12, 2023 at 12:58
  • \$\begingroup\$ "Good: Using subqueries instead of CTEs (which would impose an optimization boundary)." @Reinderien, fascinating, thank you for that remark. Focusing on postgres, I'd like to highlight this blog post. It's from a decade ago, 2014, but it may have aged well. I'd be happy to read a more modern take on it. FWIW I do agree with the sentiment that query hints are trouble, on balance they do more harm than good. I really like the PG philosophy of "fix the planner" rather than papering over a problem. \$\endgroup\$
    – J_H
    Commented Sep 29, 2023 at 22:33
  • Yes, that's about the gist of it. The good news is that many CTEs are convertible.
    – Reinderien
    Commented Sep 29, 2023 at 22:56
