6

I think I've avoided the XY problem here, as I'm laying out my solutions to the real underlying problem (summarizing multiple tables in a dynamic fashion) and I'm only asking about the one final part where I got stuck. Therefore, there is a fair bit of background first. I've provided a minimal example dataset, and working code for summarizing the data in the fashion I describe.


Consider a setup as follows:

create temp table tbl1 (id int primary key, category text, passing boolean);

insert into tbl1 values
(1, 'A', 't'),
(2, 'A', 't'),
(3, 'A', 't'),
(4, 'A', 'f'),
(5, 'B', 't'),
(6, 'B', 'f'),
(7, 'C', 't'),
(8, 'C', 't'),
(9, 'C', 'f'),
(10, 'C', 'f'),
(11, 'C', 'f'),
(12, 'C', 'f'),
(13, 'B', 't'),
(14, 'B', 'f'),
(15, 'B', 't'),
(16, 'B', 'f'),
(17, 'B', 't'),
(18, 'B', 'f'),
(19, 'B', 't'),
(20, 'B', 'f');

Then I can produce the following summary:

postgres=> select category, passing, count(*) from tbl1 group by category, passing order by category, passing;
 category | passing | count
----------+---------+-------
 A        | f       |     1
 A        | t       |     3
 B        | f       |     5
 B        | t       |     5
 C        | f       |     4
 C        | t       |     2
(6 rows)

However, I have multiple such tables (all using the same categories A, B, C) that I want to summarize, so the final result I want to have displayed needs to be just a single row to summarize one table, like so:

 Table Name | Overall passing rate | A passing rate | B passing rate | C passing rate
------------+----------------------+----------------+----------------+----------------
 tbl1       | 50% (10/20)          | 75% (3/4)      | 50% (5/10)     | 33% (2/6)

I also need to be able to filter sometimes, such as only returning information about categories A and B and ignoring C, like so:

 Table Name | Overall passing rate | A passing rate | B passing rate
------------+----------------------+----------------+----------------
 tbl1       | 57% (8/14)           | 75% (3/4)      | 50% (5/10)

I can produce the first output shown above with a query using the count(*) filter (where...) syntax in a somewhat kludgy CTE, like so:

with tallies as (
select
count(*) filter (where category in ('A', 'B', 'C') and passing) as abc_pass,
count(*) filter (where category in ('A', 'B', 'C')) as abc_all,
count(*) filter (where category = 'A' and passing) as a_pass,
count(*) filter (where category = 'A') as a_all,
count(*) filter (where category = 'B' and passing) as b_pass,
count(*) filter (where category = 'B') as b_all,
count(*) filter (where category = 'C' and passing) as c_pass,
count(*) filter (where category = 'C') as c_all
from tbl1
)
select 'tbl1' as "Table Name",
format('%s%% (%s/%s)', 100*abc_pass/abc_all, abc_pass, abc_all) as "Overall passing rate",
format('%s%% (%s/%s)', 100*a_pass/a_all, a_pass, a_all) as "A passing rate",
format('%s%% (%s/%s)', 100*b_pass/b_all, b_pass, b_all) as "B passing rate",
format('%s%% (%s/%s)', 100*c_pass/c_all, c_pass, c_all) as "C passing rate"
from tallies;

And I can modify this to omit category C without much difficulty, to produce the second example output above. (Not showing that here as it would be mostly repetitive.)

The trouble is, with so many tables to summarize (actually views, not tables, but that doesn't matter) and with the requirement that I be able to easily summarize any group of the tables ad hoc, and include or omit categories at will (e.g. "summarize tbl1, tbl2 and tbl3, but only categories B and C", or "summarize just category B for all tables") the above SQL isn't flexible enough.

I can accomplish the "summarize any group of tables ad hoc" requirement with a plpgsql function that accepts an arbitrary number of arguments of type name, and feed it the names of all the tables I want to summarize, like so:

create function summarize_tables(variadic tbls name[])
returns table ("Table Name" text, "Overall pass rate" text, "A passing rate" text, "B passing rate" text, "C passing rate" text)
language plpgsql
as $funcdef$
declare
  tbl name;
begin
  foreach tbl in array tbls
  loop
    return query execute
      format(
        $query$
          with tallies as (
            select
              count(*) filter (where category in ('A', 'B', 'C') and passing) as abc_pass,
              count(*) filter (where category in ('A', 'B', 'C')) as abc_all,
              count(*) filter (where category = 'A' and passing) as a_pass,
              count(*) filter (where category = 'A') as a_all,
              count(*) filter (where category = 'B' and passing) as b_pass,
              count(*) filter (where category = 'B') as b_all,
              count(*) filter (where category = 'C' and passing) as c_pass,
              count(*) filter (where category = 'C') as c_all
            from %I
          )
          select
            %L as "Table Name",
            format('%%s%%%% (%%s/%%s)', 100*abc_pass/abc_all, abc_pass, abc_all) as "Overall passing rate",
            format('%%s%%%% (%%s/%%s)', 100*a_pass/a_all, a_pass, a_all) as "A passing rate",
            format('%%s%%%% (%%s/%%s)', 100*b_pass/b_all, b_pass, b_all) as "B passing rate",
            format('%%s%%%% (%%s/%%s)', 100*c_pass/c_all, c_pass, c_all) as "C passing rate"
          from tallies;
        $query$,
        tbl,
        tbl
      );
  end loop;
  return;
end
$funcdef$
;

This can be called with select * from summarize_tables('tbl1'); to summarize the example dataset above, or select * from summarize_tables('tbl1', 'tbl2'); to summarize additional tables.

However, this doesn't accomplish the second requirement at all - that I be able to compute different result columns to include or exclude A, B or C arbitrarily.

I thought perhaps there would be a way to do this with a function signature that would look like so:

create function summarize_tables(categories text[], variadic tbls name[])

And then call it like so:

select * from summarize_tables('{A,B}', 'tbl1', 'tbl2');

But I can't figure out how I could make use of the "categories" array from within my SQL. Is this even possible, to summarize the results in a filtered fashion like this according to the categories passed in?


On a related note, I found https://stackoverflow.com/a/11751557/5419599 so I'm aware that if I want truly dynamic columns to be returned, I'll have to use returns setof record and I'll have to specify the full names and types of the columns to be returned every time I call the function. I would be interested in a workaround for that if there is one.

Possibly the combination of these two factors means that I should just accept that I'll have to have a separate function for each combination of categories A, B, and C that I want to summarize - seven functions total.

But in that case, woe betide me if a category D and category E are added later!

That combinatorial possibility makes me think it could be worth it to have to specify the return column names and types each time I call the function, as a price to pay for only needing to have a single function. In other words, change the returns table (...) in the function definition to returns setof record and then change the call from select * from summarize_tables(...); to:

select * from summarize_tables('{A,C,D}', ...)
as x ("Table Name" text, "Overall pass rate" text, "A passing rate" text, "C passing rate" text, "D passing rate" text)
;

However, this trade-off won't even be possible unless there is a way to make the filtering more dynamic than it is in the current CTE - i.e. a way to make use of a categories text[] parameter passed in. That's what my question is.

(Any pointers regarding the above design would also be welcome, though.)

For purposes of this question, I have omitted handling for null "passing" values, which would be handled by changing "where passing" to "where passing is true" - and have omitted case switches to avoid divide by zero errors in case some particular table doesn't contain a particular category.

3 Answers 3

2

I worked out how to do it using unnest(...) with ordinality and returning an array as part of the results.

Here is the function definition:

create function summarize_tables(categories text[], variadic tbls name[])
returns table (tbl name, overall text, by_category text[])
language plpgsql
as $funcdef$
  begin
    foreach tbl in array tbls
    loop
      return query execute format(
        $query$
          with tallies as (
            select
              category,
              count(*) filter (where passing) as passcount,
              count(*) as allcount from %I group by category
          ),
          categories_passed as (
            select * from unnest(%L::text[]) with ordinality as x(category, rn)
          )
          select
            %1$L::name as tbl,
            format('%%s%%%% (%%s/%%s)', (sum(passcount)*100/sum(allcount))::int, sum(passcount), sum(allcount)) as overall,
            array_agg(format('%%s%%%% (%%s/%%s)', passcount*100/allcount, passcount, allcount) order by rn) as by_category
          from categories_passed natural left join tallies;
        $query$,
        tbl,
        categories
      );
    end loop;
    return;
  end
$funcdef$
;

The raw results (with select *) look like:

postgres=> select * from summarize_tables('{A,B,C}', 'tbl1');
 tbl  |   overall   |              by_category
------+-------------+----------------------------------------
 tbl1 | 50% (10/20) | {"75% (3/4)","50% (5/10)","33% (2/6)"}
(1 row)

postgres=> select * from summarize_tables('{A,B}', 'tbl1');
 tbl  |  overall   |        by_category
------+------------+----------------------------
 tbl1 | 57% (8/14) | {"75% (3/4)","50% (5/10)"}
(1 row)

Note that the results that are broken down by category are not sorted alphabetically, but kept in the same order passed in:

postgres=> select * from summarize_tables('{B,A}', 'tbl1');
 tbl  |  overall   |        by_category
------+------------+----------------------------
 tbl1 | 57% (8/14) | {"50% (5/10)","75% (3/4)"}
(1 row)

To get the exact results shown in the question requires column names to be given and the results to be pulled out of the array:

select
  tbl as "Table Name",
  overall as "Overall passing rate",
  by_category[1] as "A passing rate",
  by_category[2] as "B passing rate",
  by_category[3] as "C passing rate"
from summarize_tables('{A,B,C}', 'tbl1');

 Table Name | Overall passing rate | A passing rate | B passing rate | C passing rate
------------+----------------------+----------------+----------------+----------------
 tbl1       | 50% (10/20)          | 75% (3/4)      | 50% (5/10)     | 33% (2/6)
(1 row)

select
  tbl as "Table Name",
  overall as "Overall passing rate",
  by_category[1] as "A passing rate",
  by_category[2] as "B passing rate"
from summarize_tables('{A,B}', 'tbl1');

 Table Name | Overall passing rate | A passing rate | B passing rate
------------+----------------------+----------------+----------------
 tbl1       | 57% (8/14)           | 75% (3/4)      | 50% (5/10)
(1 row)

Not called for in the question, but if it's wanted for the caller of the function to have access to the raw numbers (e.g. for sorting or numeric conditions), then nested arrays can be returned instead, and the formatting can be done by the caller:

create function summarize_tables(categories text[], variadic tbls name[])
returns table (tbl name, overall numeric[], by_category numeric[][])
language plpgsql
as $funcdef$
  begin
    foreach tbl in array tbls
    loop
      return query execute format(
        $query$
          with tallies as (
            select
              category,
              count(*) filter (where passing)::numeric as passcount,
              count(*)::numeric as allcount from %I group by category
          ),
          categories_passed as (
            select * from unnest(%L::text[]) with ordinality as x(category, rn)
          )
          select
            %1$L::name as tbl,
            array[sum(passcount), sum(allcount)] as overall,
            array_agg(array[passcount, allcount] order by rn) as by_category
          from categories_passed natural left join tallies;
        $query$,
        tbl,
        categories
      );
    end loop;
    return;
  end
$funcdef$
;

The SQL to pull out the values from the nested arrays isn't pretty, but it works:

with x as (
  select
    tbl,
    overall as o,
    by_category[1:1] as a,
    by_category[2:2] as b,
    by_category[3:3] as c
  from summarize_tables('{A,B,C}', 'tbl1')
)
select
  tbl,
  format('%s%% (%s/%s)', (100*x.o[1]/x.o[2])::int, x.o[1], x.o[2]) as "Overall passing rate",
  format('%s%% (%s/%s)', (100*x.a[1][1]/x.a[1][2])::int, x.a[1][1], x.a[1][2]) as "A passing rate",
  format('%s%% (%s/%s)', (100*x.b[1][1]/x.b[1][2])::int, x.b[1][1], x.b[1][2]) as "B passing rate",
  format('%s%% (%s/%s)', (100*x.c[1][1]/x.c[1][2])::int, x.c[1][1], x.c[1][2]) as "C passing rate"
from x
;

Related reading about how Postgres arrays work: https://stackoverflow.com/a/34408253/5419599

2
  • altrenatively a struct type like json or hstore could be used instead of the array.
    – Jasen
    Commented Jan 9, 2020 at 21:32
  • @Jasen do you mean for the addendum version that returns numbers rather than formatted text, or do you mean to pass in a json rather than '{A,B,C}'?
    – Wildcard
    Commented Jan 9, 2020 at 21:39
0

You described the problem well. I cannot give a complete solution, but I'll give you a few ideas.

As you have found out, a function where the number of result columns is variable cannot be done the way you want; if you define it as RETURNS SETOF record, you'll have to specify the actual result columns in every query to the function. This is because the columns have to be known at query parse time.

You will have to write a function that composes a query string based on the categories and tbls parameters; this is known as dynamic SQL. Whether you write that function in PL/pgSQL or some other language on the client side does not matter - use whatever is the best language for the task for you.

Then, in a second step, run the resulting query against the database.

0

Great post,

I have had to deal with the problem with a table that can grow and shrink the number of columns.

I did the same thing you did and that is dynamically construct the SQL to create the new table. the problem becomes how to return the table in a simple query from a function. Flat out one can not do that. so the Solution is very simple pass the destination table name to the function then select * from the destination Name like so.

Select build_dynamic_table( MyRANDOM_Table_Name, rest_of_args);

--table build and data is put in

Select * from Myrandom_Table_Name  
   [any join conditions]
   [where filter_conditions];  

The result can be a real table that can be accessed from other users or just Temporary table that is thrown away.

On tables that are constructed in this fashion I almost always make them none logged or temps

When i'm constructing tables that are accessed by other others, I first create a staging table, once that is done, "drop" the old table then issue

begin ;
drop Shared_table_name 
create table like Myrandom_table_Name;
commit;  --done this way to limit errors the clients can see
Select * from Myrandom_table_Name into Shared_Table_Name;

even if the columns grow or shrink this works.

The big draw back is this table can never be updated, and one cannot issue Select column_list or a where clause or etc one is limited to Select * from, as other users no long know the structure of the table. To work around this problem one has to build dynamic queries on the client side by going to pg_catalog to figure out the structure the new table. I also do that...

During the drop and moving the data into the shared_table_name this does cause issues when users try to query the table either Access Deny due to being locked, or object not found or no data is returned.

Not the answer you're looking for? Browse other questions tagged or ask your own question.